Saturday, December 14, 2013

Fahrenheit 451, Release 2.0

I have never approved of Google, because its business model is based on theft and the glorification of thievery. But like crap from China in stores or crooks in stock brokerages, some economic situations are too pervasive to be avoided, however much we ought to avoid them.

Lately I acquired my first e-reader, a Nook Simple Touch. It’s an outdated model, so it was being sold cheap. I prefer printed books, as I discovered years ago after downloading a few volumes from guttenberg.org onto a PC.

On the other hand, I fly a lot, usually with 20 pounds of books, coming if not going, because when I land in a new place I look up a bookstore. In order to save weight and, especially, space, it seemed like a good idea to carry my reading in a lightweight, small e-reader.

The Nook is acceptable, barely, as a book. I carry a laptop, too, but I don’t like reading books on a laptop.

Besides buying digital versions of some new books, when I got the Nook I looked through the free library of old books. In theory, this is a wonderful idea for readers. I was able to download four volumes of the Potash & Perlmutter stories written by Montague Marsden Glass a century ago. Printed copies of Potash & Perlmutter are hard to come by.

I also looked for English translations of any of the books of David Friedrich Strauss. These are almost impossible to find and cost hundreds or thousands of dollars. No soap, but I did find two volumes of Christian apologetics, contemporary with Strauss’s publications 150 years ago, purporting to refute him. So I downloaded them.

The first one I opened was a translation from French of “An Answer to Dr Strauss’ Life of Christ” by the Protestant theologian Athanase Coquerel. On each page, it says “Digitized by Google,” part of Google’s effort to commit millions of old books to cyberspace.

A complete hash they made of it, too. Using a copy from a Harvard University library, some klutz who couldn’t figure out how to get a page onto a scanner produced a weirdly distorted title page. At least it was readable. Not so the text, which was submitted to the indignities of optical character reading.

I used OCR in the newsroom for a while in the late ’70s, and the accuracy then, not high, was better than what Google achieved in whatever year Coquerel was scanned.

I am not discounting the difficulties of scanning a book from 1845, which was priced at a shilling and slovenly printed on bad paper untreated with titanium dioxide, so that today the contrast between browned page and faded ink is not strong. Still, knowing that to be the situation, someone needed to take responsibility to have a text editor correct the misreadings, especially since I understand that some libraries (with Google’s encouragement) are discarding their paper copies now that Google has done them the favor of preserving the text in the cloud.

Only Google hasn’t done that. I have not bothered to do a precise statistical analysis. The result was so bad it isn’t worth it.

Probably 80-85% of the words in the text were scanned correctly, but no more than half the sentences are free of errors. Some gremlins are irritating but minor, like inserting * or spaces into words.

No more than half the sentences are fully readable, even as the reader supplies emendations. And recall that I have been an editor for half a century. I doubt many readers could supply the gaps and reconstruct the text as well as I could.

In many places (particularly at the original page breaks), some text has simply disappeared. There is no way to tell if it is a line or a paragraph.

Worse, when dealing with proper names, the error rate rises to about 98% (near 100% in the case of Arabic numerals). If the name isn’t obvious from context, and often it isn’t, then it is nearly impossible to fix it. In endnotes, even if the author and title can be guessed, the trashing of the numerals makes the page reference impossible to guess.

Here is an example, far from the worst, from Note VI:

“NOTE yi.

“Thf loyth^ $aUed Olshauseni be it historical or philosophSca],
embellishes the idea which it contains, by mixing up vith it circum-
 sfcasoes of little importance, dravn ftom the usages and opinions of different nations. (De integritate posterioris Petri EpistoU. Sec part cap. V. $3.)”

I avoided showing the worst because I didn’t want to spend half an hour carefully retyping gibberish.

I cannot say how many thousands, perhaps millions, of volumes Google has vandalized, or whether any of these losses are remediable. It is like going back to a scriptorium of the Dark Ages where sleepy monks introduced inscrutable errors into texts, and whatever information was in the master copy was lost forever as surely as if it had been burned.

1 comment:

  1. I vastly prefer my iPad to printed books.

    It is no heavier than most, props itself up, always has exactly the right amount of light, always has a highlighter and dictionary handy, and has text search.

    I no longer have to worry about running out of reading material, or not getting the NYT or WSJ. Reading a newspaper at the table, which, because I travel so much, I frequently do, is far easier if it isn't actually a paper. Or, in the case of the NYT, news, either.

    I'm trying hard to think of one area in which books are superior. As it turns out, there is one. Oddly, it is frequently possible to get the dead tree version for less than digital.
