For the past few months, Techapilla has been working on integrating the SOLR search engine into an installation of the Fez/Fedora software. This has provided an interesting insight into how search engines determine relevance, so the so-called “Google-gate” scandal raised a Techapillan brow.

Most people would assume that search relevancy is determined by how many times a particular term occurs in a document. In some simple cases this may indeed be so. But as Techapilla has learned, search relevancy algorithms are incredibly complex beasts. It is not a matter of 2 + 2 = 4, but more like 2 + 2 - 5 + 3 - 4/5 + 2*2 - 6 + 3(2 + 3 - 4) - 0.1 = 2.1. And that’s a simple example, without factoring in any of the “if this, then that; otherwise something else” rules.

Words which occur frequently in a document may equate to greater relevance, but not necessarily. Algorithms may give more weight to words which occur less frequently across the whole collection, under the supposition that the more commonly a word occurs, the likelier it is to be trivial (e.g. “because”). Relevancy may be boosted for words which occur first in a phrase, or discounted for short words (shorter words may be deemed to be more trivial).
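To make the rare-word idea concrete, here is a toy sketch of the textbook TF-IDF weighting: a term's count in a document is multiplied by a factor that grows as the term appears in fewer documents. This is not what SOLR or Google actually runs; the function names `idf` and `tf_idf_scores`, the whitespace tokenizing and the "+ 1" smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf(term, tokenized_docs):
    """Inverse document frequency: rarer terms get a larger weight.

    Hypothetical helper; the '+ 1.0' smoothing is an illustrative choice.
    """
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    if df == 0:
        return 0.0
    return math.log(len(tokenized_docs) / df) + 1.0


def tf_idf_scores(query, docs):
    """Return one toy relevance score per document for the query."""
    tokenized = [doc.lower().split() for doc in docs]
    query_terms = query.lower().split()
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sum (term frequency * inverse document frequency) over query terms.
        score = sum(counts[t] * idf(t, tokenized) for t in query_terms)
        scores.append(score)
    return scores


docs = [
    "climate data because because because",
    "because the weather changed",
    "because of rain",
]
tokenized = [d.split() for d in docs]
# "because" is in every document, "climate" in only one,
# so a single "climate" counts for more than a single "because".
print(idf("climate", tokenized) > idf("because", tokenized))
print(tf_idf_scores("climate", docs))
```

Even in this tiny form, the score is already a product of two competing factors rather than a simple count, which is the point of the paragraph above.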

Algorithms may factor in the length of a document (the shorter a document, the more relevant a term in that document may be considered). They may try to compensate for perceived typing errors, which may help to explain why “climate guatemala” keeps appearing in autosuggests. They may give boosts to word stems rather than a whole word (e.g. the “climate” of “climategate” may be ranked higher). And of course, what the configurers of one search engine may deem worthy of a relevance boost, configurers of another may consider just the opposite. This balancing and counter-balancing involved in search algorithms has been called the “yin and yang of search”.
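The document-length factor can be sketched in a few lines. The version below divides a term's count by the square root of the document's length in words, a common textbook normalization; whether any particular engine uses exactly this formula is an assumption here, and `length_normalized_score` is a made-up name for illustration.

```python
import math


def length_normalized_score(term, doc_text):
    """Toy score: term count damped by document length.

    Assumes whitespace tokenizing and a 1/sqrt(length) norm;
    real engines may normalize quite differently.
    """
    tokens = doc_text.lower().split()
    if not tokens:
        return 0.0
    term_count = tokens.count(term.lower())
    return term_count / math.sqrt(len(tokens))


short_doc = "climate report"
long_doc = "climate " + "filler " * 98

# One mention in a two-word document outranks
# one mention in a ninety-nine-word document.
print(length_normalized_score("climate", short_doc))
print(length_normalized_score("climate", long_doc))
```

Combine a factor like this with rarity weighting, stemming and typo tolerance, each tuned differently by each engine's configurers, and the yin and yang becomes easy to picture.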

Someone at Google manipulating the autosuggestions? Pleeeeaaaase. Unlikely. If someone there really wanted to stop people from reading about Google-gate, they’d have better success if they suppressed search results.
