Volunteer Transcribers Put Millions of Pages Online
A 15th-century scribe hard at work.
CREDIT: Courtesy of Wikimedia
Transcribing texts by hand used to be part of a medieval monk's ascetic life of work and prayer. It was a surprisingly tough job, as one 10th-century monk wrote: "Only try to do it yourself and you will learn how arduous is the writer's task. It dims your eyes, makes your back ache and knits your chest and belly together. It is a terrible ordeal for the whole body." Yet historical archives are now recruiting thousands of online volunteers to transcribe texts for them. People donate their free time to reading and typing up everything from historical bird-spotting cards to turn-of-the-century menus from New York City restaurants to every public-domain book in the world.
Sure, these modern volunteers are probably more comfortable than their monastic brethren were. They can work anywhere there's an Internet connection and they can stop any time. But transcription still isn't always easy. Often it requires volunteers to decipher poor handwriting or ink-smudged printed pages. It's a slow, tedious task. "It's not appealing for everybody. You've got to be a nitpicker," said Martin Straesser, president of the Distributed Proofreaders Foundation, which provides texts for a free online library of public domain books called Project Gutenberg. Nevertheless, Distributed Proofreaders has more than 2,000 active members from around the world.
This strange mash-up of the medieval and modern is at the crossroads of several rising trends. As human culture moves online, so do our historical records and literature, because people want to be able to study and read online. Researchers are also realizing that they can use computer programs to sort and find interesting patterns in large datasets, including historical literature, so they want these records in a form that a sorting program can recognize.
At the same time, machines still can't transcribe as well as people can, and experts say programs won't be able to compete with people at reading most texts for a long time. Straesser thinks it will be 20 years before accurate reading programs are cheap enough for free projects such as Project Gutenberg.
Why texts need real people
In 2006, at the Family History Technology Workshop at Brigham Young University, BYU computer scientists Douglas Kennard and William Barrett presented a reading program they had built. They had taught the program's artificial intelligence how to read by giving it 200 pages of George Washington's letters along with correct, human-typed transcriptions. When tested with new pages of George Washington letters it hadn't seen before, the program could only recognize one letter out of three.
"And that is on a clear, consistent hand," said Ben Brumfield, a software developer in Austin, Tex., who has built software for human transcribers to use. He originally wrote his program, called FromThePage, for his own extended family to transcribe the diary of one of its ancestors from the 1910s. The San Diego Museum of Natural History and Southwestern University now use the software, which Brumfield makes available for free. "I do not think that this is something computers will be able to do in the foreseeable future," Brumfield told InnovationNewsDaily. So it's up to volunteers to get texts transcribed.
Of course, many archives already post scans of their holdings online. Unlike simple photos of pages, however, transcribed texts are tag-able and searchable, which makes them much more useful than paper pages. With FromThePage, Brumfield can find every time his progenitor talked about quilting, for example, or he could set two search variables to answer questions such as, "What farming chores did people do on snowy days?" Researchers at the North American Bird Phenology program have run computer analyses of past bird-spotting records and found that some bird species are embarking on their springtime migrations earlier in the year than they did a century ago, perhaps in response to global warming.
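To make the idea of a two-variable search concrete, here is a minimal sketch in Python. It is not FromThePage's actual code or interface; the `Entry` type, the `search` function and the sample diary lines are all invented for illustration. It shows only why typed text can answer a question that a page scan cannot.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One transcribed diary entry (hypothetical structure)."""
    date: str
    text: str


def search(entries, *terms):
    """Return the entries whose text contains every term, ignoring case."""
    lowered = [term.lower() for term in terms]
    return [e for e in entries if all(t in e.text.lower() for t in lowered)]


# Invented sample entries, not taken from any real diary.
entries = [
    Entry("1915-01-12", "Snow all day. Mended harness and fed the stock."),
    Entry("1915-01-13", "Clear and cold. Worked on the quilt with Mother."),
    Entry("1915-02-02", "Snow again. Shelled corn in the barn."),
]

# One search variable: every mention of quilting.
quilting = search(entries, "quilt")

# Two search variables: entries that mention both snow and corn.
snowy_corn = search(entries, "snow", "corn")
```

A real transcription tool would more likely rely on subject tags or an index than on plain substring matching, but the principle is the same: once the words are typed, they can be filtered and combined.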
Though work for any one transcriber is slow, collectively, volunteers can get these enhanced versions of pages online at an industrial rate. At the time of writing, Old Weather volunteers have transcribed 839,084 pages of British Royal Navy ship logs since October 2010. Distributed Proofreaders has added 22,625 books to Project Gutenberg since September 2000. All this work contributes to science and history, as well as offering free access to the world's literature.
How well does this really work?
The biggest worry experts have is the accuracy of these volunteer transcriptions. "That's a huge issue," Brumfield said. "There's still an immense amount of distrust of the general public's ability to decipher difficult handwriting" or older spelling and punctuation conventions.
There are few studies to check whether such worries are warranted, but the project managers InnovationNewsDaily contacted were pleased with the quality of the work of their volunteers. "We have an editor who is assigned to check in on things as they come in. They're pretty good," said Sharon Leon, who directs a George Mason University effort to transcribe the papers of the U.S. War Department from before 1800. Some projects try to mitigate problems by having several people check the same page. Old Weather has three volunteers independently transcribe each page, then its science team checks entries where volunteers disagree. They've found their volunteers are 97 percent accurate.
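The triple-transcription check can be sketched in a few lines of Python. This is an illustration under assumptions, not Old Weather's actual pipeline: the `reconcile` function and its whitespace normalization are hypothetical. It takes the independent transcriptions of one entry, picks the most common reading as a provisional value, and flags the entry for expert review whenever the volunteers do not all agree.

```python
from collections import Counter


def reconcile(transcriptions):
    """Return (provisional_value, needs_review) for one transcribed entry.

    Hypothetical helper: transcriptions is a list of strings, one per volunteer.
    """
    # Collapse runs of whitespace so trivial spacing differences do not count
    # as disagreements (an assumed normalization step).
    normalized = [" ".join(t.split()) for t in transcriptions]
    counts = Counter(normalized)
    value, votes = counts.most_common(1)[0]
    # Any disagreement at all sends the entry to a human checker.
    needs_review = votes != len(normalized)
    return value, needs_review


agreed = reconcile(["12.5", "12.5", "12.5"])
disputed = reconcile(["12.5", "12.5", "72.5"])
```

In this sketch the majority reading is only a suggestion. The disagreement flag is what matters, since it decides which entries a person on the science team would need to look at.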
Even if amateur transcribers are second best, however, there is often no better option. There isn't enough funding to pay experts to transcribe the world's written words. George Mason University started the War Department project because "we knew that nobody would pay for transcription of 45,000 documents," Leon said. "We knew there were a lot of enthusiasts out there."
Managers seek to understand what galvanizes the hundreds or thousands of anonymous people on the Internet volunteering for what was once the uncelebrated, belly-knotting labor of monks. Straesser liked the chance to read obscure books and chat in forums with Distributed Proofreaders' friendly online community. He also finds it satisfying to discover mistakes. The North American Bird Phenology program ran a participant-satisfaction survey in 2010 and found the top two reasons people volunteered were the importance of the program and a love of nature.
Transcribing can also be an especially immersive way of reading. Brumfield said he's not a bird-watcher, but transcribing for the North American Bird Phenology program can nevertheless bring him into the scene. "Suddenly it's 1890 and I'm out in the marshes of Minnesota, staring at a duck," he said. "It's just a powerful experience."
Corrected February 6: Previous version of story listed Martin Straesser as a board member of Distributed Proofreaders Foundation, but did not specify he is president of the board.