In an endnote to his 2008 book The Numerati, author Stephen Baker acknowledges that data is the plural of the singular noun datum, but says he’s decided to use data as singular in his book. “[I]n many fields, data is treated as a singular noun, just as the singular word sand stands for lots of individual bits of silica,” he writes.
On this blog, I also consistently use the word as a singular noun. When I say “this data is interesting,” instead of “these data are interesting,” it probably grates on the nerves of a few language-savvy readers. As a guy who likes both data and words, I make no apologies. To my ear, the old singular form “datum” sounds more quaint with every passing year, and not for arbitrary reasons or due to linguistic laziness. It’s a natural result, I believe, of our changing view of what data are. (Is.)
In his book The Stuff of Thought, Steven Pinker observes that humans from early childhood display superb mental agility in drawing a conceptual distinction between an object (e.g., pebble) and a substance (e.g., gravel). We capture this distinction in our language as the difference between a “count noun” (a pebble, two pebbles) and a “mass noun” (gravel, some gravel, more gravel). The English language, more than other languages, draws sharp borders around the two types of nouns.
We seem to distinguish between things that are bounded (“delineated by a fixed shape,” made up of countable individuals) such as horses, and things that are unbounded (a multitude of individuals that are inseparable and uncountable, or a continuous mass like dust or goop) such as gravel or hair or glue. Pinker says, however, that our noun usage is reflective of “cognitive attitudes,” rather than physical properties. Therefore, we also see the distinction applied to “things” that aren’t made of matter at all: “opinions” is a count noun, while “advice” is a mass noun. As well: “stories” (count) vs. “fiction” (mass), “songs” (count) vs. “music” (mass) — and so on.
The ability to construe these differences begins as early as age three, according to experiments Pinker describes in his book. And these studies seem to suggest that our conceptual choices result from the way we have heard others describe things — the way others have used nouns. What’s also remarkable is that so many English speakers tend to agree on these usages, despite the fact they have had to be learned on a case-by-case basis, and may differ over time and even from dialect to dialect.
None of the examples Pinker chooses, drawn from everyday speech, is as contested as the “data is / data are” question, however. So here’s what I think about “data”. Are you ready?
If data is a count noun (that is, the plural of datum), then we have no choice but to say, “These data are interesting.” And if it’s a mass noun (that is, more of a substance than a collection of delineated individuals), then we are correct in saying, “This data is interesting.” The distinction is clear for nearly everything we refer to in the run of a day. Yet, there remains some dispute or uncertainty about this word “data”, because the word is in a state of transition.
Half a century or more ago, William Strunk Jr. and E.B. White said in their pithy little tome, The Elements of Style, that “data” was most certainly plural and “best used with a plural verb.” They also noted, however, that the word “is slowly gaining acceptance as a singular.”
And so it is, now more than ever. Have a look at the chart below, produced by Google’s Books Ngram Viewer. The Viewer allows you to plot the frequency of words and phrases that appeared in books published in the past few hundred years. (I wrote about it in the post Chart frequency of words and terms from books with Google, 17 Dec 2010.) You can click on the image to go directly to the chart in Google and play with the settings. This chart compares published instances of “data is” with “data are”, from 1950 to 2005:
Back in the 1950s, when Strunk & White’s famous guide was published in the edition we’re familiar with today, “data are” was firmly in the lead, as a percentage of published usages. Since 1985, however, the traditional usage has lost some ground, and “data is” may one day take the lead. (Try comparing the terms “datum” and “data point”.)
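If you have a pile of text of your own, you can run a rough version of the same comparison yourself. The Python sketch below is only an illustration, not how Google computes its chart: the function name `phrase_shares` and the sample sentence are made up for this example, and it simply counts each phrase and reports its share of the combined total, rather than normalizing against all the phrases published in a given year.

```python
import re
from collections import Counter


def phrase_shares(text, phrases=("data is", "data are")):
    """Count each phrase (case-insensitive, whole words only) and
    return the raw counts plus each phrase's share of the combined total.

    Hypothetical helper for illustration; not part of any library.
    """
    counts = Counter()
    for phrase in phrases:
        # Allow any run of whitespace between the words of the phrase,
        # and require word boundaries so "data island" does not match "data is".
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
        counts[phrase] = len(re.findall(pattern, text, flags=re.IGNORECASE))

    total = sum(counts.values())
    shares = {p: (counts[p] / total if total else 0.0) for p in phrases}
    return counts, shares


# Made-up sample text, just to show the call.
sample = "This data is interesting. These data are interesting. The data is in."
counts, shares = phrase_shares(sample)
for phrase in ("data is", "data are"):
    print(f"{phrase!r}: {counts[phrase]} occurrence(s), {shares[phrase]:.0%} of the two")
```

Run over a few decades of your own documents, grouped by year, the same counts would give you a homemade version of the chart above.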
In E.B. White’s day, data were difficult and expensive to collect — every datum was recorded by hand, and calculations were done by hand, too. That’s all changed. Today we say we are deluged by a “flood” of data, or that we are sitting on a “mountain” of data. Both images suggest mass — a liquid that bathes us, an undifferentiated heap of ore that we sit on or mine into. And all this data is digested by computers like whales snarfing up plankton (another mass noun).
The more we hear data spoken of this way, the further our brains are rewired (Pinker-like) to think of “data” as a singular noun. Data these days is less akin to facts (a count noun) and more akin to knowledge (a mass noun). I agree with those who say that common usage doesn’t make something right. But language evolves, and any usage that goes against the grain of our conceptual grasp of reality is not going to survive.