BBC NEWS / TECHNOLOGY
10:07 GMT, Monday, 18 August 2008 11:07 UK

Archives aided by anti-spam tools

Screengrab of Recaptcha page, Recaptcha

Crumbling texts and books are being digitised thanks to anti-spam tools.

To thwart spammers, many websites force visitors to transcribe obscured words or characters before they get access.

Now, instead of random words, many sites are taking text from old books and documents that have been scanned by character reading software.

The words supplied are those the software cannot read but humans can, helping to complete the conversion of old texts to digital form.

Site seeing

The obscured text systems are called Captchas (Completely Automated Public Turing test to tell Computers and Humans Apart). Websites use them widely to stop scammers and spammers exploiting the sites to send out junk mail or harvest addresses.

It is estimated that Captcha schemes are used about 100 million times every day.

Created by Luis von Ahn at Carnegie Mellon University in Pittsburgh, the Recaptcha project scoops up words that optical character reading software has marked as unreadable by computers.

In some documents, where ink has faded and paper has yellowed, the character reading software can flag up to 20% of words as indecipherable.

The hard-to-read words are then farmed out to the many thousands of sites that have signed up to be Recaptcha partners.

Words are supplied to sites along with a control word that aims to ensure the person answering is human.

The responses to the obscured text are added to a database, and particularly mangled text is put before several people to ensure it is read accurately.
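The pairing of a known control word with an unread word, and the use of several readers for one word, can be sketched in a few lines of Python. This is only an illustrative sketch, not Recaptcha's actual code: the names `UnknownWord`, `submit` and `resolved`, and the thresholds of three votes and 75% agreement, are assumptions made for the example and do not come from the project.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UnknownWord:
    """A scanned word the character reading software could not read (hypothetical type)."""
    word_id: str
    votes: Counter = field(default_factory=Counter)


def normalise(text: str) -> str:
    """Compare answers without regard to case or surrounding spaces."""
    return text.strip().lower()


def submit(control_answer: str, control_response: str,
           unknown: UnknownWord, unknown_response: str) -> bool:
    """Check the control word; only if it is right, count the reading of the unknown word."""
    if normalise(control_response) != normalise(control_answer):
        return False  # treated as not human: the reading is discarded
    unknown.votes[normalise(unknown_response)] += 1
    return True


def resolved(unknown: UnknownWord, min_votes: int = 3, min_share: float = 0.75):
    """Return the agreed reading, or None if too few people agree (thresholds are assumed)."""
    total = sum(unknown.votes.values())
    if total < min_votes:
        return None
    answer, count = unknown.votes.most_common(1)[0]
    return answer if count / total >= min_share else None


word = UnknownWord("scan-0001")
submit("upon", "upon", word, "morning")
submit("upon", "Upon", word, "Morning")
submit("upon", "open", word, "mourning")  # fails the control word, so it is ignored
submit("upon", "upon", word, "morning")
print(resolved(word))
```

In this sketch a reading is stored only when the control word is answered correctly, and a word counts as read only once enough people agree on it, which mirrors the two safeguards described above.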

Reporting in the journal Science, the Recaptcha team says the scheme is about 99.1% accurate, which is as good as professional transcribers and beyond the limit demanded by archivists.

About 40,000 sites have signed up to use words supplied by Recaptcha and it now collects about four million responses every day.

In the last year it has helped resolve more than 440 million words, and it has just helped to complete the conversion of the entire archive of the New York Times from 1908 into digital form.
