The Future of Search

6 11 2008

Two weeks ago, the TDM team and I organized a discussion panel hosting the world finalists of The Star Challenge, an international search engine competition focused on multimedia search. Each team had one representative on the panel, most of whom are researchers in the field of media search.

They are:
Prof Wu Xihong (Team SHRC)
Prof Shinichi Sato (NII)
Neo Shi Yong (NUS) – Winner
Laurent Besacier (LIG)
Prof Thomas Huang (UIUC)

I’m going to try a new approach to writing about this event by summarizing what I found interesting about the discussion in point form. Hopefully the material here makes sense.

Visual search:
(includes both video and image)
– The main problem in visual search is that, unlike words, there is no grammar to connect things together. This remains the biggest challenge facing image search until we find that “grammar”.

– Someone suggested using a bottom-up approach, like decoding DNA, where you break an image down into its bits of 1s and 0s to understand the content. Prof Huang said this is harder than decoding DNA, as it is not easy to make sense of the 1s and 0s once the image has been broken down. Sometimes these bits mean nothing and are not unique representations of things.

– Current technologies store a huge database of certain images. Example given: a researcher stored and sorted thousands of pictures of celebrities in a database, gathered from Yahoo! News. Now if you put in a picture of a celebrity, the system does a comparison and returns results based on this database. The same approach can be applied to video search as well, given enough data.
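The comparison step above boils down to a nearest-neighbour lookup. Here is a toy sketch of the idea, not the panel’s actual system: the hand-picked three-number “feature vectors” and celebrity names are stand-ins for the descriptors a real face-recognition model would extract from each photo.

```python
import math

# Hypothetical toy "feature vectors" standing in for real image descriptors.
# A real system would extract these from photos with a recognition model.
database = {
    "celebrity_a": [0.9, 0.1, 0.3],
    "celebrity_b": [0.2, 0.8, 0.5],
    "celebrity_c": [0.4, 0.4, 0.9],
}

def cosine_similarity(u, v):
    """Measure how similar two feature vectors are (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def best_match(query):
    """Return the database entry whose feature vector is closest to the query."""
    return max(database, key=lambda name: cosine_similarity(database[name], query))

print(best_match([0.85, 0.15, 0.25]))  # closest to celebrity_a
```

The hard part in practice is not this lookup but extracting feature vectors that put pictures of the same person close together, which is exactly where the missing “grammar” hurts.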

– Visual search on YouTube and Flickr relies on human tagging. Much of the time, people have to provide the data, and systems build relationships among images via the text around them. Example: a computer knows that a picture of a smashed car is related to an accident based on the words on the page describing the accident. A more efficient system would let computers form an understanding of images and videos without relying on human tagging or the text around them.

– Example from the audience: a son wants to find all videos of David Beckham playing soccer. It is a simple request but a complex task, as the system has to understand what “playing soccer” means before producing the correct results.

– Another challenge is finding the context these data are in. When looking for details about an image of a dress, the search would be better if there were information on what occasion the dress is for (e.g. a formal dinner, a pool party, a summer vacation).

Audio search:
– Audio recognition technologies are not very accurate yet. Even if we don’t know a language, humans can still tell what language it is from the pattern of the sounds. Example: we don’t know Korean but can tell when a person is speaking Korean if we’ve heard some bits of Korean before. Computers are not yet able to do this; they need to be fed lots of data about a language before they can recognize it.
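The “pattern of the sounds” idea can be illustrated with a toy model. This is only a sketch of the general principle, not a real speech system: it compares character trigrams of text, whereas a real language identifier would work on acoustic features or phoneme sequences and would need far more training data than the two sample sentences assumed here.

```python
from collections import Counter

def trigrams(text):
    """Count overlapping three-character chunks, a crude stand-in for sound patterns."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Toy "training" samples; a real system would be fed hours of speech per language.
samples = {
    "english": "the quick brown fox jumps over the lazy dog the end",
    "german": "der schnelle braune fuchs springt ueber den faulen hund",
}
profiles = {lang: trigrams(text) for lang, text in samples.items()}

def guess_language(utterance):
    """Pick the language whose trigram profile overlaps the utterance the most."""
    grams = trigrams(utterance)
    def overlap(lang):
        return sum(min(count, profiles[lang][g]) for g, count in grams.items())
    return max(profiles, key=overlap)

print(guess_language("the lazy dog"))  # matches the english profile
```

This mirrors the point the panel made: the model has no understanding of either language, it has simply seen enough of each to recognize its characteristic patterns.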

– There is a lot of demand for this technology in American hospitals, where patients speaking many different native languages come in. Doctors would be able to understand what the patients are saying and have replies translated back into the patients’ native languages.

– Example of another problem: recording the sounds of animals or insects and wanting to find out which creatures made those sounds and why. A computer is currently unable to tell the difference between a person screaming in joy and a person screaming in fear.

I found the panel’s closing message really interesting. After all the talk about the technicalities and algorithms needed to make media search more intuitive and accurate for humans, the panel concluded that it ultimately comes down to marketing and user experience to determine how successful and popular a search engine will be. In other words, a good product isn’t really a good product unless the user is able to see and experience its goodness. 😀

Resources you might want to check out:
Jiin Joo’s thoughts on the panel
The Star Challenge




4 responses

7 11 2008

Audio search seems to be more feasible since we have linguistics (and computational linguistics). I bet the challenge was to get computers to recognise the variations in pronunciation of phonemes and then do a simple search matching phonemes to words/languages.

Wait, can we ourselves tell the difference between a scream of joy or fear without other cues?

7 11 2008

Hi Jian!

I personally think that, looking at current technologies, audio search will be applicable much sooner than video search, which makes its feasibility higher. Video search is actually equally feasible if we ignore the fact that it might take ten more years before it is perfected. An example of its applicability: if there’s a patient exhibiting unusual behaviour due to a rare disease, it might be possible to find details about it by feeding the footage into a niche search engine focused on the medical field. A doctor might be able to learn more about the disease and the precautionary measures he needs to take before an expert from another country gets back to him with advice.

Haha, to tell you the truth, humans are pretty accurate when it comes to telling the difference between various expressions without visual cues, based on what we’ve learned through experience. There are times when we mistake what we hear, but these are pretty rare. The challenge is to have computers understand these media materials the way people do.

7 11 2008

I think it will be challenging to create a machine intelligent enough to discern the wheat from the chaff, since humans do a lot of sorting in our minds based on context and meaning. When it comes to pictures, it is very difficult for a machine to know whether a scene of a man falling off a cliff represents tragedy or comedy, for instance. For audio search, I agree that it is probably simpler, but how often do we look for audio files on the internet vis-à-vis visual files?

And then there is the complication of having both audio and visual elements together (as in videos), or even worse, together with texts translated into a different language!

8 02 2009
TinEye: Reverse Image Search « Bits & Bytes make a Bitbot

[…] my friend told me about a new image search engine, I was skeptical. Following the discussion at the Future of Search forum last year, it was well understood that there is still the issue of defining the grammar that […]
