Page last updated at 10:07 GMT, Monday, 18 August 2008 11:07 UK

Archives aided by anti-spam tools

Obscured text systems are a widely used anti-spam tool.

Crumbling texts and books are being digitised thanks to anti-spam tools.

To thwart spammers, many websites force visitors to transcribe obscured words or characters before granting them access.

Now, instead of random words, many sites are using text taken from old books and documents that have been scanned by character reading software.

The words supplied are those the software cannot read but humans can, helping to complete the conversion of old texts to digital form.

Site seeing

The obscured text systems are called Captchas (Completely Automated Public Turing test to tell Computers and Humans Apart) and are widely used by websites to stop scammers and spammers exploiting them to send out junk mail or harvest addresses.

It is estimated that Captcha schemes are used about 100 million times every day.

Created by Luis von Ahn at Carnegie Mellon University in Pittsburgh, the Recaptcha project scoops up words that optical character reading software has marked as unreadable by computers.

In some documents, where ink has faded and paper has yellowed, the character reading software can flag up to 20% of words as indecipherable.

The hard-to-read words are then farmed out to the many thousands of sites that have signed up to be Recaptcha partners.

Each unreadable word is supplied to sites along with a control word, whose correct reading is already known, to check that the person answering is human.

The responses to the obscured text are added to a database, and particularly mangled text is put before several people to ensure it is read accurately.
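The pairing of a control word with an unreadable word, and the pooling of several people's answers, can be sketched in a few lines of Python. This is an illustrative sketch only, not the Recaptcha team's code: the class name WordPool, the votes_needed threshold and the simple "most common answer wins" rule are all assumptions made for the example, as the article does not say how agreement is decided.

```python
from collections import Counter, defaultdict


class WordPool:
    """Hypothetical store of human answers for words the scanner could not read."""

    def __init__(self, votes_needed=3):
        # Assumed rule: a word counts as read once this many people agree on it.
        self.votes_needed = votes_needed
        self.votes = defaultdict(Counter)  # unknown word id -> tally of answers
        self.resolved = {}                 # unknown word id -> agreed reading

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Record one response; return True if the responder passes as human."""
        # The control word has a known reading, so it decides pass or fail.
        if control_answer.strip().lower() != control_truth.strip().lower():
            return False  # failed the control word: discard the other answer

        if unknown_id not in self.resolved:
            answer = unknown_answer.strip().lower()
            self.votes[unknown_id][answer] += 1
            top_answer, count = self.votes[unknown_id].most_common(1)[0]
            if count >= self.votes_needed:
                self.resolved[unknown_id] = top_answer
        return True


pool = WordPool(votes_needed=2)
pool.submit("morning", "morning", "word-1", "overlooks")
pool.submit("mourning", "morning", "word-1", "junk")      # fails control, ignored
pool.submit("Morning", "morning", "word-1", "Overlooks")
print(pool.resolved)
```

In this sketch an answer to the unreadable word is only counted when the same person gets the control word right, which is why a failed control check leaves the tally untouched.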

Reporting in the journal Science, the Recaptcha team says the scheme is about 99.1% accurate - as good as professional transcribers and beyond the limit demanded by archivists.

About 40,000 sites have signed up to use words supplied by Recaptcha and it now collects about four million responses every day.

In the last year it has helped resolve more than 440 million words and has just helped to complete the conversion of the entire archive of the New York Times from 1908 into digital form.

