February 24th, 2011

the universe is expanding

the idea that google and the wayback machine — and all those other search engines that pride themselves on being current with the internet — have to crawl endlessly, and more and more often, to pick up on all the many, many changes in the architecture and content of the web is really fascinating.

we all know, of course, that the internet is growing at a frenetic pace; people are uploading and sharing millions of videos, billions of photos, and trillions of bytes of text as

  • more and more people gain access to the web (thus making a good case for higher and more frequent use: if your grandmother suddenly got internet access, wouldn’t you want to send her more photos of yourself?)
  • it becomes *easier* to create, share, and access content (both in terms of hardware: smartphones, laptops, the ipad, even cheap digital cameras and ripped CDs and USB everything… and software: instagram, facebook, picasa, flickr, twitter, tumblr…)
  • the residence paradigm for content shifts from the personal storage device (floppy disk, hard drive, external drive) to the cloud (dropbox, google docs/picasa, mobileme, plus the abovenamed services…)

to cycle back for a sec, take, also, a non-search-engine example as food for thought: the library of congress has committed to archiving all public tweets. but the vast growth in the number of tweets from 2007 to today means that they’d have to keep looking, keep chasing, keep downloading, faster and faster and faster…

of course, not everyone is publicly harried by the rapid pace of change: google’s web crawlers, at least, operate on a somewhat secret, presumably steady schedule; their domination of the search market means that web-content-makers will wait patiently until the next time the spiders visit — and they’re presumably doing a good enough job that websearchers out there still come back to google for their search work.
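that "steady schedule" idea can be sketched as a tiny re-crawl scheduler. to be clear, this is purely my own illustration (not google's actual algorithm, which is secret): the intuition is just that pages known to change often get revisited sooner.

```python
# a toy re-crawl scheduler: a priority queue of (next-visit-time, url),
# where a page's re-visit interval reflects how often it tends to change.
# all names and numbers here are made up for illustration.
import heapq

def schedule(pages, num_visits):
    """pages: {url: change_interval_in_hours}; returns the order of crawls."""
    heap = [(interval, url) for url, interval in pages.items()]
    heapq.heapify(heap)
    order = []
    for _ in range(num_visits):
        visit_time, url = heapq.heappop(heap)
        order.append(url)
        # re-queue the page, pushed out by its own change interval
        heapq.heappush(heap, (visit_time + pages[url], url))
    return order
```

with a fast-changing news page (interval 1) and a slow about page (interval 3), the news page dominates the crawl order — which is the whole point: the crawler's attention goes where the change is.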

but as the amount of *new* searchable content out there, and even the *types* of searchable content, grow exponentially (just think about all the ephemeral/immediate-access-required genres — news stories, local businesses, deals, images, flight itineraries, tweets, books/articles — that google has added to its search results over the years…), google has to keep up somehow, no? of course, there is some outdated stuff out there; just search for google maps images of places before and after natural disasters… but more often than not, google, or its competitors, are on top of things. and, as i just noted, they’re even adding new kinds of things (see: digitization).

to move from “how much” to “what”: machines don’t think about what to archive; they want it all. there’s no difference, in a sense, between the hashtag #jan25 (demarcating tweets about the recent egyptian revolution) and the hashtag #fail (which might commemorate any number of ironic goofups, big and small, in the lives of regular people). everyone [who wants to be found and recorded] is found and recorded, and this makes the internet a very public, very democratic space.

but then there’s the rather more complicated question of indexing, of making *sense* of all that data, of making it accessible: first, either parsing the new stuff into known categories, or rearranging everything in new ways — and then guaranteeing that the content is quickly and equitably reachable. (think back to analog: what’s the point of hoarding boxes of papers if no-one knows what’s in any given box, and if finding something in a given box means having to look through every box?)
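the "boxes of papers" problem has a classic fix: an inverted index, which maps each word to the documents containing it, so that finding something never means rummaging through every box. a toy sketch (the documents and names here are made up):

```python
# a toy inverted index: for each word, record the set of documents that
# contain it. lookup then touches only the relevant "boxes", not all of them.
from collections import defaultdict

docs = {
    "tweet1": "the universe is expanding",
    "tweet2": "the internet is expanding faster",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(word):
    """return the sorted list of documents containing the word."""
    return sorted(index.get(word.lower(), set()))
```

so `search("expanding")` finds both tweets without scanning either one — real engines layer ranking, freshness, and scale on top, but this is the core trick that makes "quickly and equitably reachable" possible at all.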

and that raises the question: who’s organizing it all? who thinks up, and designs, and pays for, storage, and curation, and assurance of access in perpetuity? into whose hands should we entrust the responsibility for our collective information legacy? sure, it’s usually corporations that can throw good money at these sorts of problems. one google engineer i talked to recently commented that the google search engine backend has had to be entirely rebuilt twice during his time at the company, because it was really hard to keep using an outdated product (or is google search more a service?) in entirely new circumstances. but the monopoly that google’s own innovative nature has allowed it to gain is also unnerving to advocates of disinterested parties being the gatekeepers to the world’s information. google is, after all — in the words of another employee i talked to recently — a very large and very profitable advertising company.

sometimes the density and complexity of the information/data/knowledge soup we’re swimming in (not to mention the size of the soup bowl) feels very overwhelming.

and that’s why this blog.