By Adrian Midgley 09/01/2002 HTML 01/01/2003
Google is the third of the great search engines to come out of Stanford University.
First was Yahoo (said to stand for Yet Another Hierarchical Officious Oracle), which offers a search box, and in fact now uses Google for it, but is strictly a human-compiled catalogue. This has good and bad points. I think the Web has grown past the size that Yahoo's approach can handle.
Second, IIRC, was Excite, whose "more like this" feature struck me as amazing. My fantasy was to run a search engine with a similar feature over the practice's collection of medical records: when a patient suddenly and unexpectedly died, or became oddly ill, one would ask it for the two patients most like that index patient, and then look to see whether they were about to hit a calamity.
Google's basic cleverness, apart from its simplicity and the fact that it runs on a large array of commodity computers using a sensible operating system (Linux, on 1800 PCs at the last estimate I saw), lies in its page ranking and relevance technology.
The problem they solved was how to decide which links _not_ to follow when they spider the Web. The relevance ranking also means that the page you want is amazingly often on the first page of results (or, in the case of Adrian Midgley, on all of several pages, but enough of ego surfing).
Their original papers are around on the Web, of course, and can be found using their tool.
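The link-based ranking idea in those papers can be illustrated with a toy calculation. What follows is only a simplified sketch of the published PageRank iteration, not Google's production system: the four-page `toy_web` graph and its page names are invented for illustration, and the damping factor of 0.85 is the value suggested in the original paper. A spider could, under the same assumptions, use scores like these to decide which links are worth following first.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a rank for every page in a small link graph.

    links maps each page name to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank shared equally among all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share each round.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: nothing links to "orphan".
toy_web = {
    "home": ["papers", "practice"],
    "papers": ["home"],
    "practice": ["home", "papers"],
    "orphan": ["home"],
}

ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page:10s} {ranks[page]:.3f}")
```

In this toy graph the page that most others link to comes out on top, and the page nobody links to comes out last, which is the behaviour the ranking is meant to capture.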
In the medical domain, it is possible to spider the whole of the medical web, provided it is connected, or provided we know the URL of a node within each cluster that only has connections within itself. There are believed to be many volumes of the Web that have no functioning external connections. NHS Net is in some sense one of these...
... and provided the site-owners permit it. The BMJ doesn't, so one has to go to multisearch techniques...
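Site owners usually express that permission, or its absence, in a robots.txt file, which a well-behaved spider checks before fetching anything. A minimal sketch using Python's standard `urllib.robotparser` follows. The robots.txt content, the host `journal.example.org` and the agent name `example-spider` are all made up for illustration; they are not the rules of the BMJ or of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a site owner might publish it.
robots_txt = """\
User-agent: *
Disallow: /archive/
"""

# Parse the rules from text, so the example needs no network access.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_fetch(url, agent="example-spider"):
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


allowed = may_fetch("http://journal.example.org/index.html")
blocked = may_fetch("http://journal.example.org/archive/2002/issue1.html")
print(allowed, blocked)
```

A spider that finds itself shut out in this way has to fall back on the site's own search facility, which is where multisearch comes in.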
Some distance down the coast from San Francisco is San Diego. Certainly as far as Santa Monica the drive is utterly wonderful, although we did it in the opposite direction.
The university there (San Diego State University) took note of the number of websites on its campus, and decided there was a need for a tool that would index them. The Open Source (GNU-licensed) system named ht://Dig (Hypertext digger, and Diego, both go into that name) was the result. It is a search engine for a small internet, and as such is eminently suitable for single sites, and for the NHS Net.
Bandolier uses it, as do many other academic organisations. Having taken to using it in the Practice, I built a search engine whose compass is the NHS Net plus an assortment of WWW sites outside it, and have had it running for some time. A little while later, the NHS Logistics Authority constructed an NHS Net search engine which, I am pleased to see, uses the same technology.
Like most working networking technology it is of course based on Unix, or in this case Linux...
Meanwhile in France, the core of the search engine - the inverted index generator - became available as Mifluz. This is a reasonable potential successor to the enormously useful English (but Windows-only) program Idealist, which gives you a similar quick find for random records on one computer or a network of them. I love Idealist, but wish to leave Windows behind now, so I am working on the use of Mifluz.
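For anyone who has not met one, an inverted index simply maps each word to the set of records that contain it, so a lookup does not have to read every record. The sketch below shows the idea in plain Python; it is not Mifluz's interface, and the function names and the three sample records are invented for illustration.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    # Start from the first word's documents, then narrow with each further word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical records, standing in for notes held on a practice system.
records = {
    "rec1": "Asthma review, inhaler technique checked",
    "rec2": "Hypertension review, blood pressure recorded",
    "rec3": "Asthma exacerbation, oral steroids started",
}

index = build_inverted_index(records)
hits = search(index, "asthma review")
print(sorted(hits))
```

The index is built once and each query is then a handful of set intersections, which is why this kind of quick find stays quick as the collection grows.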
Copernic, IIRC, is a multisearch system. SUMSearch is a useful project and is found where you would expect, but has lately become dependent on Internet Explorer.
I simulate this with a Python script, running locally (I may yet place it on my Internet server), which submits a single query term to a list of search engines.
I chose to include the BMJ's search engine, since I am often interested in that content, and others including OMNI and HON, and Google, since it is, as said above, very good for medical material that is not sequestered in one of the bubbles that are turning the medical web into a foam rather than a free information space.
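The core of such a script can be quite small. The following is a minimal sketch of the general approach rather than the script itself: the engine names and URL templates are placeholders on `example.org`, not the real query addresses of the BMJ, OMNI, HON or Google, which would need to be looked up and substituted.

```python
from urllib.parse import quote_plus

# Hypothetical engines: each template has a {query} slot for the encoded term.
ENGINES = {
    "engine-a": "http://search-a.example.org/search?q={query}",
    "engine-b": "http://search-b.example.org/find?terms={query}",
    "engine-c": "http://search-c.example.org/cgi-bin/query?text={query}",
}


def build_search_urls(term, engines=ENGINES):
    """Return one ready-to-open search URL per engine for a single query term."""
    # Encode spaces and punctuation so the term is safe inside a URL.
    encoded = quote_plus(term)
    return {name: template.format(query=encoded)
            for name, template in engines.items()}


urls = build_search_urls("atrial fibrillation")
for name, url in urls.items():
    print(name, url)
```

Run locally, the resulting URLs could be opened with the standard `webbrowser` module or fetched and merged into one results page; the sketch stops at building them so that it needs no network access.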
Two new search engines that have been built since Google (whose name plays on "googol", the term for a 1 followed by 100 zeroes, or "lots") are SurfWax, which functions on generic browsers but has some added features which use the Borg's technology, and Teoma (www.teoma.com), which came from a project at Rutgers University in New Jersey in 1998.