If one wished to find a book in a library before the digital age, one would seek out a librarian and request the book either by title or by author. The librarian would then go through the stacks and retrieve the desired text. If one knew that the library organized its books by ISBN, one might be savvy enough to ask for a specific ISBN, sparing the librarian from opening the big reference book and searching through it before meandering through the shelves.
Fast forward to the digital age. For many years, this is how traditional database-style searching worked. You'd key in “Great Expectations” or “Charles Dickens” and voilà, the book (or list of books) would be at your fingertips. The problem, however, is that the world is not a utopia and information is often messy. What if you typed “great expectations” with lowercase letters, or “the great expectations,” or, even worse, introduced spelling mistakes, such as “great expectaions”? Database-style searching is fantastic for highly structured data (e.g., searching by ISBN), but when it comes to semantically unclear queries, let alone semantically unclear data, it is not only inefficient, it is simply not a good solution.
Full-text search is a field that uses some database-style searching techniques, but deals with messy text in the real, unwieldy world. In fact, full-text search is not only efficient at dealing with lowercase searches and spelling mistakes – it allows a user to find a keyword within a large corpus (i.e., “Charles Dickens had a bad hair day” would return results for Charles Dickens, plus any results that include the subsequent words) and, most importantly, it can perform this task speedily over millions of documents. One is empowered to find the metaphorical needle in the haystack, without inconveniencing the librarian in the slightest.
At the heart of a full-text search engine is something called an inverted index. At its simplest, this is nothing more than a term-document incidence matrix. It sounds like complicated stuff, but in reality the underlying principle can be understood by an infant. Consider the opening words of Charles Dickens' A Tale of Two Cities: “It was the best of times, it was the worst of times.” An inverted index of these words would look like the following:
| Term  | Tale of Two Cities |
|-------|:------------------:|
| it    | 1 |
| was   | 1 |
| the   | 1 |
| best  | 1 |
| of    | 1 |
| times | 1 |
| worst | 1 |
If we added a sentence from Oscar Wilde's The Picture of Dorian Gray, “the ugly and the stupid have the best of it in this world” to our index, our matrix might look as follows:
| Term   | Tale of Two Cities | The Picture of Dorian Gray |
|--------|:------------------:|:--------------------------:|
| it     | 1 | 1 |
| was    | 1 | 0 |
| the    | 1 | 1 |
| best   | 1 | 1 |
| of     | 1 | 1 |
| times  | 1 | 0 |
| worst  | 1 | 0 |
| ugly   | 0 | 1 |
| and    | 0 | 1 |
| stupid | 0 | 1 |
| have   | 0 | 1 |
| in     | 0 | 1 |
| this   | 0 | 1 |
| world  | 0 | 1 |
There are several problems with this, however. Firstly, if we have millions of books to search through, we will need to assign document IDs to the book titles (otherwise we will have the same problem we started with, only with titles). Secondly, this matrix does not account for word repetition, otherwise known as term frequency. A better inverted index might look like this:

| Term   | Doc 1 | Doc 2 |
|--------|:-----:|:-----:|
| it     | 2 | 1 |
| was    | 2 | 0 |
| the    | 2 | 3 |
| best   | 1 | 1 |
| of     | 2 | 1 |
| times  | 2 | 0 |
| worst  | 1 | 0 |
| ugly   | 0 | 1 |
| and    | 0 | 1 |
| stupid | 0 | 1 |
| have   | 0 | 1 |
| in     | 0 | 1 |
| this   | 0 | 1 |
| world  | 0 | 1 |

(Doc 1 = A Tale of Two Cities, Doc 2 = The Picture of Dorian Gray.)
Ah, much better! It is now immediately apparent that words such as “the”, “it” and “of” have a high frequency and are present in both documents.
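The index described above can be sketched in a few lines of Python. This is a minimal illustration, not a production data structure: the tokenization (lowercasing and stripping punctuation) is deliberately naive, and the two sample documents are the sentences quoted earlier.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to {doc_id: term_frequency}.

    documents: {doc_id: raw text}. Tokenization here is naive:
    lowercase, split on whitespace, strip surrounding punctuation.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            term = token.strip(".,!?\"'")
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

docs = {
    1: "It was the best of times, it was the worst of times.",
    2: "the ugly and the stupid have the best of it in this world",
}
index = build_inverted_index(docs)
# index["the"] -> {1: 2, 2: 3}: "the" appears twice in Doc 1, three times in Doc 2
```

Looking up a term is now a single dictionary access rather than a scan over every document, which is precisely why the inverted index scales to millions of texts.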
There are hundreds of different ways one can decide to build an inverted index, using additional fields such as skip pointers, tuning it for performance or for optimal disk-space usage – but they are essentially all variations on a theme. The crux of the matter remains unchanged: the inverted index is to full-text search what electromagnetic suspension is to bullet trains. It underpins the system while allowing for optimal speed and performance.
The nice thing about having an inverted index is that one can now tackle the following problems more easily:
1) Spelling mistakes can be matched against an index of terms as opposed to an unsorted corpus of millions of texts.
2) Potentially unnecessary stop-words such as “a”, “the”, “this”, “or” can be easily discarded. Alternatively, as in the case of the famous band “The Who”, they can just as easily be utilized and prioritized.
3) One can now look for keywords within books' titles, bodies, captions and footnotes.
4) Words in the index can be stemmed, so that if a user searches for “cats” documents containing the singular “cat” will also be part of the result set.
5) Words in the index can be normalized, so that acronyms such as “C.A.T.” are dealt with differently from the noun “cat”, abbreviations such as “'tis” can be mapped to the expanded equivalent “it is”, and spelling alternatives such as “color” and “colour” can be programmed to match the same term in the inverted index.
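The stop-word removal and stemming steps above can be sketched as a small "analysis" pipeline that runs over text before it reaches the index. The stemmer here is a toy that merely strips a plural "s" for illustration; real engines use algorithms such as the Porter stemmer, and the stop-word list is an assumed sample, not a standard one.

```python
# Assumed sample stop words; real engines ship configurable lists.
STOP_WORDS = {"a", "an", "the", "this", "that", "or", "and", "on"}

def naive_stem(term):
    # Toy stemmer: strips a plural "s" from longer words.
    # A real engine would use e.g. the Porter stemming algorithm.
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def analyze(text):
    """Lowercase, strip punctuation, drop stop words, stem each term."""
    tokens = (t.strip(".,!?\"'") for t in text.lower().split())
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]

terms = analyze("The cats sat by this mat.")  # -> ["cat", "sat", "by", "mat"]
```

Because the same `analyze` step is applied to both documents at indexing time and queries at search time, a query for “cats” and a document containing “cat” meet at the same index term.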
Full-text search algorithms make use of this database-style inverted index and apply ranking functions, such as the Okapi BM25 algorithm developed at London's City University in the '80s and '90s, to not only return the list of documents containing the specified search query, but also to weigh the results and tell the user which one she is probably looking for.
What's more, full-text search engines are clever, and make use of Artificial Intelligence techniques to learn from a user's clicks, inferring which result should be given a greater weight. In fact, the more a full-text search engine is used, the better its result sets become over time.
The reasons to use a full-text search engine are not limited to this.
Spelling suggestions, result highlighting, faceted navigation and, in recent years, geolocation searches are just a few of the other reasons why database-style searching is not only simplistic, it's obsolete. Full-text search is the de facto method of finding text within huge data sets. Yet like all human endeavors, studying and analyzing one field inevitably opens a Pandora's box. New problems are revealed, unforeseen situations surface and unprecedented corner cases must be dealt with. A great many of these have been studied, tackled and solved through years of research. Many others remain unsolved. For instance, as computers become more performant and people's tempers become shorter, many are beginning to ask the computer questions in their natural language, expecting the first result to be the answer they seek. Full-text search is being used to find images, and bleeding-edge research is even beginning to map facial recognition algorithms to text (and vice versa).
The options seem endless and the bar keeps rising in the field of full-text search. One thing is certain: full-text search has made commonplace what librarians once believed to be impossible.