Amazon appears to be readying a new cloud search service. This could be many different things: full-scale programmatic access to a crawlable web index, an indexing service, or a hosted search platform for specific sites and verticals.
I personally hope it is some combination of all of the above. In particular, I’d like to see Amazon follow the model of Common Crawl and expand on it. That is, keep a running archive of the web, and allow Amazon Web Services customers to access the data to add additional value on top of it.
We’ll provide more updates as they become available. Watch this post for changes. [Update: no launch today for cloud search (1/18/2012), as Amazon launched DynamoDB instead.]
So, although Amazon didn’t launch a cloud search service today, we hope they do soon. Here is a short list of what we’d like to see. I’m confident that a release from Amazon would not contain all of the features below, but I’ll list them anyway, as I think these are all elements that should be commoditized by a major platform company. There is a lot of interesting work still to be done on the web, and it’s a shame to see so much effort spent re-inventing the wheel every time a new application for search technology becomes necessary.
Programmatic Access to a Web Scale Crawl
Amazon already has all the pieces in place to support programmatic access to an archived copy of the indexable web: EC2, S3, Elastic MapReduce. All they need to do (!) is crawl the web and make it available for everyone to use. As mentioned above, Common Crawl has started a similar project, but for the time being they don’t appear to have nearly enough coverage of the web.
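To make the idea concrete, here is a minimal sketch of the kind of MapReduce job a customer might run over such a crawl with Elastic MapReduce. The page data is hard-coded for illustration; in practice the records would live in S3 and the map/reduce steps would run across a cluster.

```python
from collections import Counter

# Hypothetical sample of crawled pages keyed by URL. In a real job these
# records would be read from a shared crawl archive in S3.
pages = {
    "http://example.com/a": "cloud search amazon cloud",
    "http://example.com/b": "amazon dynamodb launch",
}

def map_terms(url, text):
    """Map step: emit a (term, 1) pair for every token on the page."""
    return [(term, 1) for term in text.split()]

def reduce_counts(pairs):
    """Reduce step: sum the counts for each term across all pages."""
    totals = Counter()
    for term, count in pairs:
        totals[term] += count
    return dict(totals)

pairs = [p for url, text in pages.items() for p in map_terms(url, text)]
term_counts = reduce_counts(pairs)
```

The point is that the interesting part, what you compute over the crawl, is a few dozen lines; it is the crawl itself that is expensive, and that is what a platform provider could commoditize.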
SEOmoz (with Open Site Explorer) and Majestic SEO have done a great job of building link graph databases, but they have also invested millions of dollars in their crawling infrastructure. The real value of their platforms comes from what they add on top of the crawled data, so imagine what companies who wish to build similar products could do if they didn’t have to worry about the crawl itself.
Real Time Indexing for Site Search and Verticals
Both in the open source community and in commercial software, indexing technologies (the ones that don’t have to deal with crawling) have probably achieved the most success. A recent startup, IndexTank (acquired by LinkedIn), built on top of the popular open source Lucene indexing technology to add real time support and easy-to-use, on-demand indexing. It was a great service, and LinkedIn did the community as a whole an even greater service by open sourcing the IndexTank code after the acquisition.
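The core property that made IndexTank appealing, documents becoming searchable the instant they are added, with no batch rebuild, can be sketched with a toy in-memory inverted index. This is an illustration of the idea, not IndexTank’s or Lucene’s actual API.

```python
from collections import defaultdict

class RealtimeIndex:
    """Toy inverted index: documents are searchable immediately after
    add(), with no separate commit or rebuild step."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.docs = {}                    # doc id -> stored fields

    def add(self, doc_id, fields):
        self.docs[doc_id] = fields
        for term in fields["text"].lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: intersect the posting sets of every query term.
        sets = [self.postings[t] for t in query.lower().split()]
        if not sets:
            return []
        return sorted(set.intersection(*sets))

index = RealtimeIndex()
index.add("doc1", {"text": "Amazon launches DynamoDB"})
index.add("doc2", {"text": "Amazon cloud search rumors"})
hits = index.search("amazon cloud")
```

A hosted version of this, with durability, scoring, and scaling handled by the provider, is exactly the kind of service Amazon is well positioned to offer.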
Amazon could provide a similar service on top of the AWS infrastructure, with the added advantage of making the indexing technology immediately available to existing AWS customers.
The type of research and analytics we do at Ginzametrics with our own indexing tools goes far beyond what traditional indexing technologies, such as Lucene, support out of the box. We create many different fields on the documents we index and have had to write our own aggregation layer to make these indexes useful in a reporting context.
It would be great to see a toolkit emerge that allowed developers to create highly specialized indexes and report on them in an ad hoc fashion. I haven’t seen any technologies that do this at all (which is why we had to build our own). We may consider releasing something as open source in the future, but Amazon could frankly do a much better job.
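As a rough illustration of what I mean by an aggregation layer over custom-field indexes, here is a minimal group-by report over documents with arbitrary fields. The field names (`keyword`, `engine`, `rank`) are invented for the example and are not our actual schema.

```python
from collections import defaultdict

# Hypothetical indexed documents with custom fields (names invented).
docs = [
    {"keyword": "cloud search", "engine": "google", "rank": 3},
    {"keyword": "cloud search", "engine": "bing",   "rank": 7},
    {"keyword": "dynamodb",     "engine": "google", "rank": 1},
]

def aggregate(docs, group_by, metric):
    """Ad hoc report: average of `metric` for each value of `group_by`."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[group_by]].append(doc[metric])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

avg_rank_by_engine = aggregate(docs, "engine", "rank")
```

The toolkit I’m imagining would let you run this kind of group-by directly against the index, over millions of documents, without exporting the data into a separate reporting system first.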
An extension to the customized indexing above would be for Amazon to release pre-built indexing engines for the various social networks: Facebook, Twitter, Google Plus, and more. As with web crawling, a lot of effort is being spent by developers re-writing essentially the same code: connect to some network, index the graph, report on basic measurements (tweets, likes, fans, etc.). This is another layer that should be a commodity service offered by a provider such as Amazon.
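The boilerplate in question is roughly the following: pull records from a network and roll up basic engagement measurements per network. The data here is hard-coded and the field names are invented; a real connector would call each network’s API and handle paging, rate limits, and authentication, which is exactly the undifferentiated work a commodity service would absorb.

```python
# Hypothetical posts pulled from two networks (fields are invented).
posts = [
    {"network": "twitter",  "id": "t1", "likes": 10, "shares": 4},
    {"network": "twitter",  "id": "t2", "likes": 3,  "shares": 1},
    {"network": "facebook", "id": "f1", "likes": 25, "shares": 9},
]

def engagement_report(posts):
    """Roll up per-network totals of the basic measurements."""
    report = {}
    for post in posts:
        totals = report.setdefault(post["network"],
                                   {"likes": 0, "shares": 0})
        totals["likes"] += post["likes"]
        totals["shares"] += post["shares"]
    return report

report = engagement_report(posts)
```

Every team building social analytics writes some version of this; a pre-built, hosted index per network would let them start at the reporting layer instead.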