Monday, June 23, 2008

A little knowledge

Asking the right question of a search engine can bring rewarding results, but humans still have an inside track on some important queries
It has been more than a decade since Larry Page and Sergey Brin founded Google. Since that time, the Web search engine has become the company’s most widely known product, and the company itself has become a major American corporation. Google has been used by millions of people worldwide to find information relating to subjects from semiconductors to soybeans.

Google indexes billions of Web pages using keywords and operators that link keywords so that users can search for information. Given a little knowledge of what one is looking for, the search engine can be a powerful tool. If you know that you need to perform an FFT on an FPGA, for example, simply typing “FPGA FFT” into the search engine will return many relevant results. Although a more formal structure such as “field programmable gate array fast Fourier transform” may be more meaningful, such an expression will not return as relevant a list of results. In such cases, a little knowledge about the subject and acronyms may be the fastest way to find the information you need.
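
As a rough illustration of why the acronym query can beat the spelled-out one, here is a minimal sketch of a keyword index in Python. It is a toy, not a description of how Google actually works: the sample pages, the `build_index` and `search` names, and the rule that every keyword must match are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase keyword to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return ids of pages containing every keyword in the query (implicit AND)."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical sample pages, invented for this sketch.
PAGES = {
    "page1": "implementing an fft on an fpga",
    "page2": "fpga design basics",
    "page3": "the fast fourier transform explained",
}
INDEX = build_index(PAGES)

print(search(INDEX, "FPGA FFT"))
print(search(INDEX, "field programmable gate array fast Fourier transform"))
```

In this toy index the short query finds the one page that uses both acronyms, while the spelled-out query finds nothing, because no single page contains all of its words.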

Google’s ability to crawl the billions of pages now available relies on programmers encoding the data on Web sites in hypertext markup language (HTML). Because this task is time-consuming, Web sites such as Wikipedia (www.wikipedia.org) have developed their own markup language. This so-called “wiki markup” offers a simplified alternative to HTML and is used to describe pages within wiki Web sites. Because it is easy to use, this markup language has attracted great interest from researchers and students.

Perhaps Wikipedia’s most interesting feature is that the details described on each page are intimately linked to other Wikipedia pages. Typing a keyword such as “color” into the site brings you a page on the perceptual property of color, along with links to other Wikipedia pages such as color theory, the spectrum of light, and electromagnetic radiation. Despite offering such a wonderful resource, Wikipedia is still limited.

What is required is a more semantic representation of knowledge such as the Resource Description Framework (RDF) from the World Wide Web Consortium (www.w3.org). RDF can be used to represent both the knowledge and the facts contained in a document or resource, reducing the potential for misunderstanding its subject. This fact has not been overlooked by Powerset (www.powerset.com), a company that has introduced a search tool that uses semantic language representation to present the user with a more natural way of quizzing the Wikipedia database.
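
To make the idea of a semantic representation concrete, the following Python sketch stores facts as subject, predicate, object triples, which is the basic unit of RDF. It is only an illustration under stated assumptions: real RDF identifies resources with URIs and standard vocabularies, whereas the plain-string names and the `match` helper here are hypothetical.

```python
# Each fact is a (subject, predicate, object) triple, the basic unit of RDF.
# The plain-string names below are hypothetical, not real RDF vocabulary terms.
TRIPLES = [
    ("FFT", "is_a", "algorithm"),
    ("FFT", "abbreviates", "fast Fourier transform"),
    ("FPGA", "is_a", "integrated circuit"),
    ("FPGA", "abbreviates", "field programmable gate array"),
    ("FFT", "can_be_implemented_on", "FPGA"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Which facts describe what an FFT can be implemented on?
print(match(TRIPLES, subject="FFT", predicate="can_be_implemented_on"))

# What does each acronym stand for?
print(match(TRIPLES, predicate="abbreviates"))
```

Because each fact names its relationship explicitly, a program can answer a question about how two things are related instead of merely finding pages in which both words appear.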

Rather than typing selected words such as “FFT” or “field programmable gate array” into the tool, users can enter more natural expressions such as “How do I implement an FFT in an FPGA?” and receive results that are more pertinent, although limited to the Wikipedia site. Currently, two of Powerset’s competitors in the race toward semantic-based search, Hakia (www.hakia.com) and Twine (www.twine.com), offer search tools that are not limited to Wikipedia’s site.

Despite these innovations, mining information in an intelligent fashion requires the information database to be well constructed. A visit to the “Ask the Experts” section of the Automated Imaging Association (www.machinevisiononline.org) Web site, for example, revealed that one reader had enquired about the difference between a machine-vision camera and a conventional photographic camera. It is a simple question, but when it was put to Powerset, the tool did not return a result that specifically answered it, merely another list of Web sites more relevant than those returned by Google. Luckily, the reader’s question had been answered very accurately by a human being!

While those envious of Google’s success are trying to better the company’s search engine with knowledge-based data-representation tools, relatively few information databases currently use either the RDF format or RDF query languages such as SPARQL. And, even when such standards are fully adopted, it will be many years before any computer-based program can answer your questions directly. Until then, a little knowledge may not be as dangerous as you think.