By Jonathan Fildes
Science and technology reporter, BBC News
Over time Professor Roy's son learns how to say the word 'ball' (footage: MIT Media Lab)
"Can you think of a more complicated question to ask?" says Deb Roy, as he explains the genesis of his work.
In 2005, the artificial intelligence researcher at the Massachusetts Institute of Technology (MIT) Media Lab set out to understand how children learn to talk.
"We wanted to understand how minds work and how they develop and how the interplay of innate and environmental influence makes us who we are and how we learn to communicate."
It was a big task and after years of research, scientists around the world had only begun to scratch the surface of it.
But now, Professor Roy is beginning to get some answers, thanks to an unconventional approach, an accommodating family and a house wired with technology.
And the research may even have spin-offs for everything from robotics to video analysis.
The question of how infants learn to speak is hotly debated. At its simplest level the argument comes down to "nature versus nurture".
On one side, scientists argue that children have an innate hard-wired ability to learn language, while on the other side, researchers argue that language is learned through interactions with the people and environment around them.
Between the two extremes is a spectrum of opinion.
Professor Roy wandered into this debate as someone originally more interested in robots than children.
"I was initially inspired by how children learn language as a new way of building machines," he says.
But looking through the raft of prior research on the effect of environment on language, he noticed a common problem; previous studies only offered snapshots of a child's development.
"Every parent knows that a child can change a lot in a week or a month," he told BBC News.
"If you're interested in the process of development then it is important to have a continuous view."
It is a problem recognised by other linguists as well.
"Current samples that the field works with - typically an hour of recorded speech a week - are one to two orders of magnitude too small for our scientific purposes," Professor Steven Pinker of Harvard University told BBC News.
So, Professor Roy, who by then had a child on the way, set about solving the conundrum. His solution: wire up his house with 11 cameras, 14 microphones and terabytes of storage and record every waking moment of his soon-to-arrive son.
It was christened the Human Speechome project and immediately drew comparisons with its genetic counterpart.
"Just as the Human Genome Project illuminates the innate genetic code that shapes us, the Speechome Project is an important first step toward creating a map of how the environment shapes human development and learning," said Frank Moss, the director of MIT's Media Lab at the time.
Professor Pinker, who is also an adviser to the project, said: "In developmental psychology there has long been a trade-off between gathering lots of data from a small number of children, or a small amount of data from a much larger number of children.
"Roy is simply pushing this trade-off to an extreme - a truly massive amount of data from a single child."
Now, a quarter of a million hours of recordings later, Professor Roy is beginning to tease apart the masses of data and look for answers.
To extract meaningful patterns from the 200GB (gigabytes) of data that flowed daily onto the racks of hard drives in the basement, the team created a series of software tools.
The first, ominously called Total Recall, allows a researcher to quickly scan through any part of the data. All 25 recordings from the microphones and cameras are shown as separate channels.
HUMAN SPEECHOME PROJECT
11x 1 megapixel fisheye-lens cameras, switched on by motion sensors
14x omnidirectional microphones recording CD quality sound
1,000m (3,000ft) of wiring connecting recorders to servers in the basement
Recorded from 8am-10pm every day for 3 years
PDAs in each room can be used to control recording
'Oops' button wipes last few minutes of recording
Sound is represented as a spectrogram, while the video is processed to show only movement, creating a ribbon of colour. It looks like the flow of traffic at night and represents the accumulated motions of life in the Roy household.
While useful for getting a sense of when and where action may have taken place, the team needed another set of tools to delve deeper into the data.
"The first task we set for ourselves was to transcribe everything my son heard or said from nine to 24 months," he says.
He estimates that there are between 10 and 12 million words of speech to transcribe.
"For anyone that has transcribed speech, they will know that is a laborious and slow process," he says, with a degree of understatement.
Initially his team tried to use off-the-shelf speech recognition software, but soon realised it was not up to the job of extracting words from often-noisy environments.
"We realised that the state of the art is not even close to good enough," he told the BBC.
Automatic systems could have error rates of up to 90%, he said.
At the other extreme, Professor Roy also experimented with human transcribers, but that also came with its own problems.
"It would take an average of 10 hours to find and transcribe one hour of speech," he told the BBC.
HUMAN SPEECHOME IN NUMBERS
90,000 hours of video recorded
140,000 hours of audio recordings
Approx 200GB of data collected every day
150 TB of raw data collected over course of project
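The figures in the box above can be cross-checked with some back-of-envelope arithmetic. The sketch below is purely illustrative, using only the numbers quoted in this article; it shows how the daily collection rate relates to the project total, and that the video and audio hours together come to roughly a quarter of a million.

```python
# Back-of-envelope check of the Speechome figures quoted in the article.
GB_PER_DAY = 200        # approx. data collected every day
TOTAL_TB = 150          # raw data collected over the course of the project
VIDEO_HOURS = 90_000
AUDIO_HOURS = 140_000

# Using decimal units (1 TB = 1,000 GB), the total implies roughly
# 750 full days of recording at the quoted daily rate.
total_gb = TOTAL_TB * 1000
implied_recording_days = total_gb / GB_PER_DAY
print(f"Implied full recording days: {implied_recording_days:.0f}")  # 750

# Combined video and audio recordings: about a quarter of a million hours.
total_hours = VIDEO_HOURS + AUDIO_HOURS
print(f"Total recorded hours: {total_hours:,}")  # 230,000
```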
When you are trying to analyse 16 months of audio from 14 microphones, those kinds of ratios don't seem attractive.
Instead, the researchers created a piece of software called Blitzscribe, which finds speech in the recordings and breaks it down into easily transcribed sound bites.
"We have automated components assisting human annotators," he said.
"The net result is that we have reduced 10 hours down to two hours."
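Blitzscribe's internals are not described in the article, but a common first step for finding speech in long recordings is simple energy-based segmentation. The toy sketch below is an assumption for illustration, not the project's actual algorithm: it splits a signal into short frames and keeps the spans whose energy rises above a threshold, yielding the kind of bite-sized segments a human annotator could then transcribe.

```python
import numpy as np

def find_speech_segments(signal, rate, frame_ms=30, threshold=0.01):
    """Toy energy-based detector: return (start_s, end_s) spans whose
    frame energy exceeds the threshold. Real systems are far more robust
    to household noise than this sketch."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)       # mean power per frame
    active = energy > threshold

    segments, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                          # segment opens
        elif not on and start is not None:
            segments.append((start * frame_len / rate, i * frame_len / rate))
            start = None                       # segment closes
    if start is not None:
        segments.append((start * frame_len / rate, n_frames * frame_len / rate))
    return segments

# Synthetic example: 1s silence, 1s tone (standing in for speech), 1s silence.
rate = 8000
t = np.linspace(0, 1, rate, endpoint=False)
clip = np.concatenate([np.zeros(rate),
                       0.5 * np.sin(2 * np.pi * 440 * t),
                       np.zeros(rate)])
print(find_speech_segments(clip, rate))  # one segment, roughly (1.0, 2.0)
```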
The analysis also takes into account how a word was said - called prosody - and who said it.
To date, the team have already transcribed more than four million words.
"It's already a more complete transcript of everyday life at home than any recording ever made."
A similar human-computer system, called TrackMarks, has also been developed to analyse the video, giving information such as where people are in relation to one another and the orientation of their heads.
Software visualises how caregivers interact with the child over time
Although the data sets are still incomplete, Professor Roy says they are already beginning to see interesting results.
For example, his team has been able to begin to tease apart a process he calls "word births", the time when a baby first begins to use a word.
By analysing the length, and hence complexity, of sentences spoken by caregivers to his son, he believes that he has shown that adults subconsciously simplify sentences until the child understands the word.
Once it has been understood, the adults then build up the complexity of the sentences containing the word.
"We essentially meet him at this point of the birth of the word and gently pull him into language," he says.
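The "word birth" analysis described above can be pictured with a toy calculation: given timestamped caregiver utterances containing a word, and the age at which the child first produced it, compare the mean utterance length before and after that point. The data and the birth age below are invented for illustration; this is not the project's code or data.

```python
from statistics import mean

# Hypothetical caregiver utterances containing the word "ball",
# tagged with the child's age in months when each was spoken.
utterances = [
    (10, "look at the ball"),
    (11, "ball"),
    (12, "the ball"),
    (13, "ball"),                      # caregivers simplify pre-birth
    (14, "throw the ball to me"),      # invented word birth at 14 months
    (15, "can you roll the red ball over here"),
]
word_birth_month = 14

def mean_length(rows):
    """Mean utterance length in words."""
    return mean(len(utterance.split()) for _, utterance in rows)

before = [r for r in utterances if r[0] < word_birth_month]
after = [r for r in utterances if r[0] >= word_birth_month]

# Sentences shorten before the word is understood, then grow again after.
print(mean_length(before), mean_length(after))
```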
The Speechome Recorder can be fitted in any home
Professor Roy stresses it is an initial result and has not been validated by the scientific community. However, he says, it shows the kind of questions that can be answered with the data and tools he now has.
But winning over the rest of the scientific community might be his most difficult job.
It remains to be seen whether other scientists will accept his conclusions as they are based on the analysis of just one child and, as Professor Roy admits, are unlikely to be reproduced because of time and cost.
In part to address this criticism, he has developed a stand-alone device - called the Speechome Recorder - that can be easily put into homes without 1,000m (3,000ft) of wiring in the walls or the need to convert the basement into a data centre.
The devices look like floor lamps and contain an overhead microphone and camera, with another lens at eye level for children.
The base of the device holds a touch-screen display and enough storage to hold several months of recordings.
Their first deployment will be in six pilot studies of children with autism where they will be used to monitor and quantify the children's response to treatment.
"I'm really excited - this is the future of the project," says Professor Roy.
But he also has his eye on other possible spin-offs.
For example the video-analysis algorithms designed for the project could be used in automated systems to monitor CCTV cameras and extract information about particular events.
He is also working with architects to visualise how people move around an environment and how changes to building design affect that.
The results are being fed into creating a semi-automated architectural design system.
"This could be really interesting if you're designing a retail space or if you are an architect and have a design and want to know whether it will work or how to change it."
However, Professor Roy has never forgotten his roots in robotics and still hopes to bring the project full-circle.
"What if we can build a machine that can step into the shoes of a child and learn in human-like ways?" he asks.
"Imagine transferring that into a video game character or into a domestic robot that can now learn to communicate and interact in social ways."