Forum OpenACS Q&A: offtopic: Google now indexing images

In a show of characteristic coolness, google is now indexing graphics on the web. From playing with it, I'm guessing they're using a mix of filenames, OCR, and that madcap linking relevance that google does so well. And for simple queries, it's shockingly effective.
For example, let's find, say, a member of OpenACS. Like, Ben

Really, it's quite demoralizing to compare that to our own difficulties of full-text searching of Postgres & Oracle. But nonetheless cool.

I should go down to Mountain View tomorrow and see if worship is permitted at their offices.

Posted by Luke Francl on

As long as we're off topic, I'll take this opportunity to link to my own image search engine demo, which unlike Google, performs image query by content!

Clearly, it's not as useful as Google's (it doesn't even run an RDBMS!), but I think it has potential. See the page for more details on how it works (i.e., who I stole my ideas from) and what its limitations are.

Posted by Rafael Calvo on
Excelent Luke!

how well does it scale?

Posted by Luke Francl on
It should scale pretty well. The researchers who invented this were
able to search 20,000 images in under .5 seconds in 1995, so clearly
we should be able to do better than that in 2001.

The processing of an image is O(n^2),
because it has to look at each pixel. However, since all images are
rescaled to 128x128 pixels, that's a constant time operation (Hmm...I
really need to look up how fast resizing an image is. It uses an
affine tranform. Anyone know off-hand?). I also need to calculate how
fast a wavelet decomposition is. I think it's O(some_large_multiple * n).

Seaching is similar, as during processing, the signs of the 60
absolute largest wavelet coeffients are extracted and the "name" (160
bit SHA hash for unique identification) of
the associated image are put into one of 6 "search arrays" -- an array of
arrays, with an array for each possible (x,y) location of the
coeffients. There's a search array for each sign of coeffient
for each of the 3 color channels (hence 6). So a search looks through
3*60 of the cells of the 6 different 128 * 128 arrays.

To compile a list of matches, the query image is processed, and for
each of those top 60 * 3 coeffients, you grab the list of images that
match at each coeffients x,y position, color channel, and sign. Then,
update the score of each of the images in the list for that
coeffient. After you're done scoring all the matches, you have to
sort it, which is hopefully O(n *log n), and grab the top m largest
scores. In the original paper, they used a heap-select algorithm to
get the top scores in linear time, but I couldn't find any good
pseudo-code for that algorithm. For the other selection algorithms I
looked at, I calculated that until you have a great number of scores,
it's cheaper to just sort the list. I could be wrong though. Sorting
the list was easy (already implemented) whereas implementing a select
algorithm required more work for me. :)

Obviously right now, it's not very scalable. I tried to get the thing
working in Tomcat so the "database" would stay in memory (it uses
about 2K of memory per image), but the stupid thing didn't work, so I
wrote a little wrapper script that just calls the Java program from
the command lines -- which not only means that it has to read the
"database" in from memory everytime it starts, and write it out when
it finishes, but that only one person can use it at a time, so that
this process isn't screwed up. Not very optimal.

Anyway, right now I'm trying to get a job based on my little toy. But
if that doesn't pan out, or if whoever I hire lets me work on side
projects, I'd love to integrate this (in a significantly improved
form!) into the photodb module (as you've no doubt noticed, I took
most of my images from :)

Thanks for asking! I really need to add a section on this to the "how
it works" page.