Wednesday, July 7, 1999 Published at 18:05 GMT 19:05 UK
Web engines could do better
How to find your target in 800 million pages
The performance of search engines used to find material on the Web is deteriorating, new research has suggested.
A study conducted by Dr Steve Lawrence and Dr C. Lee Giles shows that no engine now indexes more than about 16% of the publicly indexable Web.
The estimated combined coverage of the 11 engines used in the study is 335 million pages, or 42% of the estimated total number of pages.
This performance is "substantially" worse than when the researchers last did a survey in December 1997.
The researchers also accuse the engines of a bias towards US Websites and sites that have more links to them - in other words, the more 'popular' sites. The engines are also more likely to index commercial sites than educational sites, they say.
Lawrence and Giles believe this bias has damaging and divisive consequences.
"Search engine indexing and ranking may have economic, social, political, and scientific effects," they say.
"For example, indexing and ranking of online stores can substantially affect economic viability; delayed indexing of scientific research can lead to the duplication of work or slower progress; and delayed or biased indexing may affect social or political decisions."
Lawrence and Giles, of NEC Research in Princeton, New Jersey, publish their research in the latest edition of the science journal Nature.
They estimate there are now around 800 million pages on the Web, encompassing about 15 terabytes of data, of which about 6 terabytes is textual content once HTML tags, comments, and extra whitespace are removed. The Web also contains about 180 million images, amounting to about 3 terabytes.
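As a rough check on how the headline figures fit together, the short Python sketch below recomputes the coverage shares from the numbers quoted in this article. The variable and function names are illustrative, and the inputs are the rounded estimates reported here, not the study's raw data.

```python
# Rounded estimates quoted in the article (not the study's raw data).
TOTAL_PAGES = 800_000_000        # estimated publicly indexable pages
COMBINED_COVERAGE = 335_000_000  # pages covered by the 11 engines together
BEST_ENGINE_SHARE = 0.16         # approximate ceiling for any single engine


def share(part: int, whole: int) -> float:
    """Return part as a fraction of whole."""
    return part / whole


# Combined coverage of all engines as a fraction of the estimated Web.
combined_share = share(COMBINED_COVERAGE, TOTAL_PAGES)

# Approximate size of the largest single index implied by the 16% figure.
best_engine_pages = BEST_ENGINE_SHARE * TOTAL_PAGES

print(f"Combined coverage: {combined_share:.0%}")
print(f"Largest single index: about {best_engine_pages / 1e6:.0f} million pages")
```

On these rounded inputs, the combined share comes out at roughly 42%, and a 16% ceiling corresponds to roughly 128 million pages for any one engine.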
About 83% of sites contain commercial content and 6% contain scientific/educational content. Only 1.5% of sites contain pornographic content.
The researchers say that greater attention should be paid to the accessibility of information on the Web, in order to minimise unequal access to information and maximise the benefits of the Web for society.
Because the overlap between the engines remains relatively low, Lawrence and Giles recommend using metasearch engines such as MetaCrawler, which combine the results of multiple searches.
"The estimated combined coverage of the engines used in the study is 335 million pages, or 42% of the estimated total number of pages. Hence a substantial improvement in web coverage can be obtained using metasearch engines such as MetaCrawler, which combine the results of multiple searches," they write.
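The idea of combining results from several engines can be sketched in a few lines of Python. This is a minimal illustration only: the engine names and URLs are hypothetical, and the round-robin merge shown here is an assumed strategy, not a description of how MetaCrawler or any real metasearch engine works.

```python
from typing import Dict, List


def merge_results(results_by_engine: Dict[str, List[str]]) -> List[str]:
    """Interleave ranked URL lists round-robin, dropping duplicate URLs."""
    merged: List[str] = []
    seen = set()
    longest = max((len(urls) for urls in results_by_engine.values()), default=0)
    for rank in range(longest):
        # Take the result at this rank from each engine in turn.
        for urls in results_by_engine.values():
            if rank < len(urls) and urls[rank] not in seen:
                seen.add(urls[rank])
                merged.append(urls[rank])
    return merged


# Hypothetical engines and URLs, for illustration only.
sample = {
    "engine_a": ["http://example.com/1", "http://example.com/2"],
    "engine_b": ["http://example.com/2", "http://example.com/3"],
    "engine_c": ["http://example.com/4"],
}
combined = merge_results(sample)
print(len(combined), "unique pages from", len(sample), "engines")
```

Because the engines overlap only partly, the merged list holds more distinct pages than any single engine returns, which is the effect the researchers describe.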
The study found the Northern Light search engine to have the greatest coverage.