Page last updated at 11:04 GMT, Thursday, 10 December 2009

Rare words 'author's fingerprint'

A simple analysis of the words in a book may indicate who wrote it

Analyses of classic authors' works provide a way to "linguistically fingerprint" them, researchers say.

The relationship between the number of words an author uses only once and the length of a work forms an identifier for them, they argue.

Analyses of works by Herman Melville, Thomas Hardy, and DH Lawrence showed these "unique word" charts are specific to each author.

The work is published in the New Journal of Physics.

The researchers also suggest that each author draws their works from a hypothetical "meta book". The concept can be described as a framework for the way an author uses language, from which all their works are ultimately derived.

In 1935, the Harvard University linguist George Kingsley Zipf demonstrated a mathematical relationship between the frequency of a word in a text and its rank in the list of an author's most used words.

Under this rule, the second most frequent word in a book occurs roughly half as often as the first, the third most frequent roughly one-third as often, and so on.

The rule laid the groundwork for many mathematical analyses of words, in which Zipf's law seemed to be a universal property of English - and by extension, of language itself.
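As a rough illustration of the rule, the short Python sketch below ranks the words of a text by frequency and gives the count that an idealised Zipf relationship would predict at each rank. The function names and the simple word-splitting pattern are assumptions made for this example, not anything taken from Zipf's work or from the new study.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (word, count) pairs, most frequent first."""
    # Assumption: a "word" is a run of letters or apostrophes, lower-cased.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def zipf_prediction(top_count, rank):
    """Count an idealised Zipf law predicts for the word at a given rank."""
    # The word at rank r is expected to occur about 1/r as often as the top word.
    return top_count / rank
```

Comparing the counts from `rank_frequency` with `zipf_prediction(top_count, rank)` for ranks 2, 3, 4 and so on shows how closely a given text follows the rule; real texts only approximate it.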

Building on that idea, researchers at Umeå University in Sweden have found that language use isn't as universal as Zipf's law might suggest.

They have used a related approach that comes up with a unique identifier for each author.

Hardy measure

Clearly, a longer written work has more unique words - words that appear just once in the text.

However, even the best writer's vocabulary will at some point run out of words that have not yet been used.

Thomas Hardy's curve looked less word-rich than Herman Melville's

The researchers gathered together the complete works of Hardy, Melville, and Lawrence, and measured how the number of unique words depends on the length of the text - counting the new unique words as a particular author's works get longer and longer.

They used sections from books of varying lengths, randomly pulled from novels, alongside shorter works and short stories.
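The measurement described above can be sketched in a few lines of Python. This is a minimal illustration of the idea as the article describes it (words that appear exactly once, counted as the text grows), not the researchers' own code; the function name `unique_word_curve` and the `step` parameter are assumptions made for the example.

```python
from collections import Counter


def unique_word_curve(words, step=1000):
    """Return (text_length, words_used_exactly_once) points.

    `words` is the text as a list of words in reading order; a point is
    recorded every `step` words.
    """
    counts = Counter()
    used_once = 0
    curve = []
    for n, word in enumerate(words, start=1):
        counts[word] += 1
        if counts[word] == 1:
            used_once += 1      # first appearance: the word is unique so far
        elif counts[word] == 2:
            used_once -= 1      # second appearance: no longer unique
        if n % step == 0:
            curve.append((n, used_once))
    return curve
```

Plotting the second value against the first for texts by different authors would give one "unique word" curve per author, in the sense the article uses the term.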

They found that the authors had distinctly different "unique word" curves.

The team suggests that a work by an unknown author could therefore be compared to prior works, with the curve acting as a linguistic "fingerprint".
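The article does not say how such a comparison would be carried out. One simple possibility, shown here purely as an assumption, is to measure the mean squared gap between two curves at the text lengths they share and pick the known author with the smallest gap. The names `curve_distance` and `closest_author` are hypothetical, and each curve is taken to be a list of (text length, unique word count) pairs.

```python
def curve_distance(curve_a, curve_b):
    """Mean squared difference between two curves at shared text lengths."""
    b_points = dict(curve_b)
    shared = [(count, b_points[n]) for n, count in curve_a if n in b_points]
    if not shared:
        raise ValueError("the curves have no text lengths in common")
    return sum((ya - yb) ** 2 for ya, yb in shared) / len(shared)


def closest_author(unknown_curve, known_curves):
    """Return the name whose curve lies nearest to the unknown one."""
    # known_curves maps an author name to that author's curve.
    return min(known_curves,
               key=lambda name: curve_distance(unknown_curve, known_curves[name]))
```

A single distance of this kind is exactly the sort of one-number measure that critics of the approach question, so it is best read as an illustration of the idea rather than a workable identification method.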

Source material

The meta book concept proposed by the researchers is not simply the list of all the words an author knows, but also the "distribution" of those words in whatever that author produces, whether drafting an e-mail or writing War and Peace.

"It doesn't matter if I pull out 10,000 words from a book of 100,000 or from a book of 200,000, I get the same behaviour; you always simply pull a piece out of your very, very big 'meta book', which is just a representation of your style," said Sebastian Bernhardsson, who led the work.

"That story you're writing right now is a piece of that big book and that is what you're pulling out," he told BBC News.

Not everyone is convinced, however. Computational linguist Rob Malouf, from San Diego State University in the US, says that the work is the latest in a long search for a single measure to identify authors.

"Modern authorship identification techniques use a collection of factors to describe a writer's style," he told BBC News.

"I'd certainly want to see [the new approach] applied to more than three authors before I'd be convinced. It's highly unlikely that any single number is going to work; it would be like trying to identify people by their weight."

For its part, the team will continue the analyses with different English authors, and with authors in different languages. As their collection of fingerprints grows, Mr Bernhardsson said, they will try to identify the authors of anonymous works.

But not every result is a happy one, he added.

"It's a fun and interesting exercise, but I've plotted my own thesis in this sense and it was kind of discouraging comparing to some more famous authors."
