The OED, Big Data, and Crowdsourcing

The term “big data” was included in the most recent quarterly online update of the Oxford English Dictionary (OED). So now we have a most authoritative definition of what recently became big news: “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”

Beyond succinct definitions, the enchanting beauty of the OED, at least for those who love words and their history, lies in the collection of quotations illustrating the forms and uses of each word from the earliest known instance of its occurrence to more recent ones.

As someone who has been somewhat preoccupied with uncovering the historical antecedents for our present-day usage of the term big data (see A Very Short History of Big Data), I was delightfully surprised to find out that the OED team has traced the earliest use of the term to 1980, seventeen years before the publication of the first paper in the ACM digital library to use (and define) “big data.” Sociologist Charles Tilly wrote in a 1980 working paper surveying “The old new social history and the new old social history” that “none of the big questions has actually yielded to the bludgeoning of the big-data people.” While the context is the increasing use of computer technology and statistical methods by historians, it is clear that Tilly used the term not to describe specifically the magnitude of the data but as a flourish of the pen following the words “big questions.” The meaning of the sentence would not change if he had used only the word “data.”

While I’m quite sure that Tilly did not have in mind big data as it is defined by the OED itself, the context of his discussion is very relevant to today’s debates regarding big data and data science. In the section of the article from which the “big data” quote is taken, Tilly paraphrases the discussion in a 1979 paper by historian Lawrence Stone of the use of quantitative methods in historical research and attempts to make it a “science.”

Stone’s criticism of “cliometricians,” whose “special field is economic history,” reads like a description of the work of many “quants” (on Wall Street, in academia, or in government) in the forty-five years since he issued his warning: “[Their] great enterprises are necessarily the result of team-work, rather like building the pyramids: squads of diligent assistants assemble data, encode it, programme it, and pass it through the maw of the computer, all under the autocratic direction of a team-leader. The results cannot be tested by any of the traditional methods since the evidence is buried in private computer-tapes, not exposed in published footnotes. In any case the data are often expressed in so mathematically recondite a form that they are unintelligible to the majority of the historical profession. The only reassurance to the bemused laity is that the members of this priestly order disagree fiercely and publicly about the validity of each other’s findings.”

Anticipating today’s doubts about the effectiveness of big data and concerns about the ratio of signal to noise, Stone concludes “in general, the sophistication of the methodology has tended to exceed the reliability of the data, while the usefulness of the results seem—up to a point—to be in inverse correlation to the mathematical complexity of the methodology and the grandiose scale of data-collection.” (For a recent enthusiastic embrace of the application of data science to the humanities and a rebuttal, see Leon Wieseltier and Steven Pinker Debate the Quantified Society)

As Tilly hinted in the title to his paper, the new on many occasions is a very familiar old. Just scratch the surface and you find that the “revolution”—a word which we now tend to use liberally to describe any technological development—nicely delivers us to some place in the past while providing a soothing sense of moving forward. Indeed, the first sense of the word “revolution” in the OED is “The action or fact, on the part of celestial bodies, of moving around in an orbit or circular course” or simply “The return or recurrence of a point or period of time.”

Another word added to the OED online in the recent update affirms the notion that (almost) everything old is new again. While “crowdsourcing” was coined by Jeff Howe in 2006, this “new” (revolutionary?) practice launched the OED a century and a half ago:

In July 1857 a circular was issued by the ‘Unregistered Words Committee’ of the Philological Society of London, which had set up the Committee a few weeks earlier to organize the collection of material to supplement the best existing dictionaries. This circular, which was reprinted in various journals, asked for volunteers to undertake to read particular books and copy out quotations illustrating ‘unregistered’ words and meanings—items not recorded in other dictionaries—that could be included in the proposed supplement. Several dozen volunteers came forward, and the quotations began to pour in.

The volume of the “unregistered” material was such that in January 1858, the Philological Society decided that “efforts should be directed toward the compilation of a complete dictionary, and one of unprecedented comprehensiveness.” It took a while, but in April 1879, the newly appointed editor James Murray issued an appeal to the public, asking for volunteers to read specific books in search of quotations to be included in the future dictionary. Within a year there were close to 800 volunteers, and over the next three years, 3,500,000 quotation slips were received and processed by the OED team.

James Murray and his crowdsourced big data files

Was this the first big-data-crowdsourcing project?

[Originally published on Forbes.com]