Visual Search Engine Sees Past Words
Multimedia libraries such as the World Wide Web will someday benefit from the first search engine that comprehends images and video without the aid of text or metadata.
"There are many pictures and videos needed to be annotated for retrieval. However, the annotations given by different persons for the same picture are different," said researcher Ying Dai at Japan's Iwate Prefectural University.
Non-textual image search tools such as TinEye currently rely on visual similarity to locate identical copies of a picture strewn across the Internet in different locations and formats. Meanwhile, conventional search engines miss out on millions of images due to limited and idiosyncratic metadata. The first search engine designed to discover images related by meaning, or "semantic message," could eventually perform novel tasks such as associating images the way a human does and identifying videos based on a single shot.
"This software can annotate a picture by some key words, and associates these key words with other unseen words automatically," Dai said.
A unique framework that fuses visual and conceptual content allows the software to promote images with high semantic association even when their visual resemblance is low: for example, a picture of palm trees against a blue sky and one of a tree trunk flanking a dense, blurred forest. At the same time, statistical relationships determine how much to demote content that looks similar yet shares little meaning.
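The fusion described above can be pictured with a small sketch. This is not the researchers' algorithm, which the article does not specify. It assumes each candidate image already has a visual similarity score and a semantic association score between 0 and 1, and it blends them with a hypothetical `weight` parameter. The function names and the sample scores are invented for illustration.

```python
def fused_score(visual_sim, semantic_assoc, weight=0.5):
    """Blend two scores in [0, 1]; `weight` is the share given to meaning.

    Assumption: a simple weighted average stands in for the statistical
    relationships mentioned in the article.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * semantic_assoc + (1.0 - weight) * visual_sim


def rank_images(candidates, weight=0.5):
    """Sort (name, visual_sim, semantic_assoc) tuples by fused score, best first."""
    return sorted(
        candidates,
        key=lambda c: fused_score(c[1], c[2], weight),
        reverse=True,
    )


# Hypothetical scores relative to a query image of palm trees against a blue sky.
candidates = [
    ("tree_trunk_forest", 0.30, 0.90),    # looks different, means something similar
    ("blue_beach_umbrella", 0.80, 0.10),  # looks similar, shares little meaning
]
ranked = rank_images(candidates, weight=0.7)
```

With meaning weighted more heavily than appearance, the forest image ranks above the look-alike umbrella, which is the behavior the article attributes to the framework.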
Because these concepts appear as visible index categories on the search engine interface, a user searching for videos about an ancient city could succeed by following associated words such as "hard," "building" and "street" to find the target selection, even if the words "ancient" and "city" have not been learned by the machine. Like a poet, the software uses its wordless knowledge of a concept to grasp at "unseen" words, thus powering an increasingly narrow search.
At the same time, the software's potential to learn concepts outside its lexicon means it can begin applying these words to other images and searches, allowing "many, many categories" to be queried, Dai said.
"In the future, we will use the tools such as WordNet to construct the relations among the learned words and the left words," Dai said. WordNet [http://wordnet.princeton.edu/wordnet/] is a lexical database of English used by search engine and database developers.
Web page data usually already offers information on "when" and "where," which makes descriptors of "who," "what" and "how" (impressions such as "calm," "active," "hard") of special interest. In a current prototype, key words such as "landscape," "flowers" and "vegetables" compete within the domain "natural" according to how closely their associated values match a new image.
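One way to picture that competition is sketched below. The article does not say how the prototype scores key words, so this assumes each key word in a domain already has a non-negative match value for the new image, and it converts those values into shares of the domain. The function name and the numbers are hypothetical.

```python
def compete_within_domain(raw_scores):
    """Turn per-keyword match values into shares that sum to 1 within a domain.

    Assumption: plain normalization stands in for whatever statistical
    model the prototype actually uses.
    """
    total = sum(raw_scores.values())
    if total <= 0:
        return {word: 0.0 for word in raw_scores}
    return {word: value / total for word, value in raw_scores.items()}


# Hypothetical match values for one new image within the domain "natural".
natural = {"landscape": 0.6, "flowers": 0.3, "vegetables": 0.1}
shares = compete_within_domain(natural)
winner = max(shares, key=shares.get)  # key word with the largest share
```

In this sketch the key word with the largest share would be the one attached to the image, while the remaining shares indicate how strongly the other key words in the domain also apply.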
Researchers tested the software's ability to accurately retrieve content using 400 random images from Sozaijiten, a popular data set of pictures published in Japan, and 1,400 random keyframes from Video Traxx HD 1, a royalty-free video clip cache. Experiments confirmed that combining visual similarity and semantic relevance produced better retrieval results than using either factor alone.
"According to my knowledge, it is certain there are no methods similar with ours in the literature," Dai said.





