Google Alerts are a popular way of keeping up with favoured topics on the Web. However, they are not comprehensive, so I’m also in the habit of running daily searches on specific topics and narrowing the results by date. Unfortunately, it is very easy for website owners to manipulate things so that their pages reappear in search results day after day, even when the user restricts results to pages changed in the past 24 hours and the pages haven’t actually changed. It is possible to bypass this problem, to some extent, by using browser plugins or Google’s block feature, although the latter doesn’t always work.

Based on my experience, while most of the sites whose pages are “updated” every 24 hours appear to do this as a deliberate strategy to pull in more visitors, some seem to be legitimate sites that are just badly designed.

This past week or so, my daily searches have become unusable. From a fairly consistent set of 30 or so results every day, I’m now getting 150 or so results using the same search parameters. All due to one very legitimate site. I won’t mention the name of the site or the strategy it is using, as I don’t want to provide information that may be used to trick users (although I’ll tell you if you email me).

Anyway, I’ve blocked this site from my searches; shame, as it often contains very relevant information.

In the Younger Techapillan’s cancer journey this past while, Techapilla has been constantly amazed at the depths to which some people will sink to attract traffic to their websites. The latest scam to trigger the Techapillan ire is the existence of “content mills”, which provide free content that can be copied and pasted into websites without acknowledgement of the content mill. Frequently, such content carries the byline of the fraudulent website owner.

As a web-savvy academic librarian, Techapilla is well aware of the existence of content mills and similar sites, such as those that sell essays for assignments. This time, having noticed that Google Alerts was throwing up the same content for “osteosarcoma”, with exactly the same misspelt word, on an almost daily basis, Techapilla was inspired to do some further investigation.

Techapilla is not going to promote content mills or fraudulent websites, so no links provided. But here are the Techapillan findings –

  • 2970 Google hits for the misspelt phrase (“when doctors access osteoarthritis and osteoporosis”). This gets whittled to 43 when similar results are omitted. Examination of these 43 results reveals that all articles obviously come from a single source
  • 335 Google hits for the corrected phrase (“when doctors * osteoarthritis and osteoporosis”). This gets whittled to 63 when similar results are omitted. Once again, examination reveals that all 63 articles have a common source. Note that this search should have yielded more than 2970 results, since it is a broader search than the first – presumably the difference is due to Google using a different algorithm for wildcard searches. This search also highlighted slight wording differences among the articles, either from editing or from running through a translator. Techapilla strongly suspects the original article was written in a foreign language and run through a translator, given the awkward language of most of the articles
  • 4 obvious content mill sites in the first 40 Google hits for the misspelt phrase
  • The majority of sites which had used the farmed article were ostensibly health sites
  • Quite a lot of “This site may harm your computer” links in the hits
  • Searches on slight rewordings of the misspelt phrase yielded additional hits, including the same article that had been cleaned up a bit more or rejigged to “fit” another disease (e.g. osteomyelitis).
  • Visiting a sampling of the sites revealed that while a few obviously tried to be legitimate health sites (shame about their lack of medical knowledge), most sites were fraudulent, with links on the site all leading to commercial sites (“affordable weddings”, “hot winter vacations”)
  • Techapilla is not a medical professional, but has learnt enough about osteosarcoma in the past year or two to confidently state that the article/s examined as part of this task are complete and utter junk

It would be an interesting exercise to trace back the original article, and to run one of the offspring through Turnitin. An exercise for another day.
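Short of Turnitin, a rough way to check whether two articles share a source is to compare their overlapping word sequences ("shingles") and measure how much the two sets have in common (Jaccard similarity). The sketch below is only an illustration of that idea. The function names and the sample sentences are made up for the example, and real plagiarism checkers are far more sophisticated.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences in the text (lower-cased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Share of k-word shingles common to both texts: 0.0 (nothing shared) to 1.0 (identical)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Invented sample sentences; only the opening phrase echoes the searches above.
original = "when doctors access osteoarthritis and osteoporosis they look at bone density"
reworded = "when doctors assess osteoarthritis and osteoporosis they look at bone density"
unrelated = "affordable weddings and hot winter vacations for the whole family"

print(jaccard(original, reworded))   # one changed word still leaves a high overlap
print(jaccard(original, unrelated))  # no shared three-word sequences
```

A single swapped word only disturbs the few shingles that contain it, which is why lightly edited copies of a farmed article still score well above unrelated text.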

In the meantime, some guidelines to help Techapillan readers evaluate the quality of information resources.

For the past few months, Techapilla has been working on integrating the Solr search engine into an installation of the Fez/Fedora software. This has provided an interesting insight into how search engines determine relevance. So the so-called “Google-gate” scandal caused the raising of a Techapillan brow.

Most people would assume that search relevancy was determined by how many times a particular term occurred in a document. Possibly in some simple cases this is indeed so. But as Techapilla has learned, search relevancy algorithms are incredibly complex beasts. It is not a matter of 2 + 2 = 4, but more like 2 + 2 – 5 + 3 – 4/5 + 2*2 – 6 + 3(2 + 3 – 4) – 0.1 = 2.1. And that’s a simple example, without factoring in any of the “if this, then that; otherwise something elses”.

Words which occur frequently in a document may equate to greater relevance. But not necessarily. Algorithms may give more weight to words which occur less frequently, under the supposition that the more commonly a word occurs, the likelier it is to be trivial (e.g. “because”). Relevancy may be boosted for words which occur first in a phrase, or discounted for word length (shorter words may be deemed to be more trivial).
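The tension between "frequent in this document" and "rare overall" is often illustrated with a TF-IDF style score. The sketch below is a minimal version under stated assumptions: the function name, the smoothing formula and the three-sentence corpus are all invented for illustration, and it is not a description of how Google or Solr actually scores results.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """Score `term` in `doc` (a list of words) against `corpus` (a list of such lists)."""
    # Term frequency: how large a share of this document the term makes up.
    tf = Counter(doc)[term] / len(doc)
    # Inverse document frequency (smoothed): terms found in fewer documents score higher.
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log((1 + len(corpus)) / (1 + docs_with_term)) + 1
    return tf * idf


# Invented mini-corpus: "because" appears in every document, "guatemala" in only one.
corpus = [
    "the climate of guatemala is warm because of the tropics".split(),
    "the bone tumour was found because of a scan".split(),
    "the search engine ranks pages because of many signals".split(),
]
doc = corpus[0]

common = tf_idf("because", doc, corpus)
rare = tf_idf("guatemala", doc, corpus)
print(common, rare)  # both words occur once in doc, but the rarer one scores higher
```

Both words occur exactly once in the first document, so the difference in score comes entirely from how common each word is across the whole collection.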

Algorithms may factor in the length of a document (the shorter a document, the more relevant a term in that document may be considered). They may try to compensate for perceived typing errors, which may help to explain why “climate guatemala” keeps appearing in autosuggests. They may give boosts to word stems rather than a whole word (e.g. the “climate” of “climategate” may be ranked higher). And of course, what the configurers of one search engine may deem worthy of a relevance boost, configurers of another may consider just the opposite. This balancing and counter-balancing involved in search algorithms has been called the “yin and yang of search”.
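Two of these adjustments are easy to sketch. The example below assumes a length penalty of one over the square root of the word count (one plausible choice, not necessarily what any given engine uses) and treats "stemming" as a crude prefix match. The function names and sample documents are hypothetical.

```python
import math


def length_normalised_score(term, doc):
    """Occurrences of `term` in `doc` (a list of words), damped by document length."""
    if not doc:
        return 0.0
    # Assumed normalisation: divide by sqrt(word count), so shorter documents score higher.
    return doc.count(term) / math.sqrt(len(doc))


def stem_matches(stem, doc):
    """Count words beginning with `stem`, a crude stand-in for real stemming."""
    return sum(1 for word in doc if word.startswith(stem))


short_doc = "climate data leaked".split()
long_doc = ("climate " + "filler " * 15).split()

# One occurrence each, but it counts for more in the three-word document.
print(length_normalised_score("climate", short_doc))
print(length_normalised_score("climate", long_doc))

# A query on the stem picks up both "climate" and "climategate".
print(stem_matches("climate", ["climategate", "climate", "weather"]))
```

Each tweak is simple on its own. The complexity comes from stacking many of them, each with its own weight, which is where one engine's boost becomes another's penalty.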

Someone at Google manipulating the autosuggestions? Pleeeeaaaase. Unlikely. If someone there really wanted to stop people from reading about Google-gate, they’d have better success if they suppressed search results.