The First Law of Big Data

EMC today released the fifth annual Digital Universe study from IDC. So now we have five years’ worth of estimating, with a consistent methodology, the amount of data created and copied annually in the world. It turns out that the amount of digital data created each year has grown by a factor of 9 in the last five years. And since IDC uses the same methodology to forecast the next five years, it looks like data will grow by a factor of 61 over the ten-year period from 2005 to 2015.

2010 also marked the 10th anniversary of the first study to quantify the amount of digital data created annually, the “How much information?” study from UC Berkeley. It used a different methodology than IDC’s and did not count copies. Still, it is interesting to note that its estimate for the amount of digital data created annually was about 1.5 exabytes (1.5 billion gigabytes). If we assume 3 copies for each original (a generally accepted assumption in 1999), we get about 4.5 exabytes. This means that the amount of data created and copied annually grew by a factor of 273 (from 4.5 exabytes to 1,227 exabytes) over the last decade. Big Data, indeed!

Even if we use the smallest (and most reliable) growth rate we have, the 9x by which data grew from 2005 to 2010, we get close to 100,000 exabytes of data in 2020 (assuming data keeps growing 9x every five years throughout the 2010s, so the 2010 figure of 1,227 exabytes is multiplied by 9 twice).
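For readers who want to check the arithmetic, here is a minimal sketch of that projection. The 1,227 exabytes and the 9x five-year factor come from the figures above; the function names are my own, and the assumption that the same factor holds for two more five-year periods is exactly that, an assumption.

```python
# Figures quoted in the post: about 1,227 exabytes created and copied in 2010,
# and 9x growth over the five years from 2005 to 2010.
DATA_2010_EB = 1227
FIVE_YEAR_FACTOR = 9


def project(base_eb: float, factor_per_period: float, periods: int) -> float:
    """Compound a per-period growth factor over a number of periods."""
    return base_eb * factor_per_period ** periods


def annual_rate(factor: float, years: int) -> float:
    """Convert a multi-year growth factor into an equivalent annual growth rate."""
    return factor ** (1 / years) - 1


if __name__ == "__main__":
    # Two five-year periods take us from 2010 to 2020 (assumed constant growth).
    data_2020 = project(DATA_2010_EB, FIVE_YEAR_FACTOR, periods=2)
    print(f"Projected 2020 total: {data_2020:,.0f} exabytes")
    print(f"Equivalent annual growth: {annual_rate(FIVE_YEAR_FACTOR, 5):.0%}")
```

Two periods of 9x compound to 81x, which turns 1,227 exabytes into 99,387 exabytes, the "close to 100,000" mentioned above. Put differently, 9x over five years works out to roughly 55% growth per year.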

Now, it’s possible that the growth rate will actually slow down in the coming years, as the massive amount of analog data that had accumulated before the year 2000 has already been digitized. This was highlighted by a study by Martin Hilbert and Priscila Lopez, published earlier this year, which estimated that in 1986, 99.2% of all storage capacity was analog, but in 2007, 94% of storage capacity was digital, a complete reversal of roles (in 2002, digital data storage surpassed non-digital for the first time). But does the end of digitization portend a (radical) slowdown in the growth of data? Only if the analog-to-digital conversion was the only factor responsible for the growth rates we have observed in the last decade.

I believe there has been another important factor driving the growth of digital data, one that will continue to support the same kind of growth rates we have seen in the last ten years and may even accelerate them.  It’s called the World Wide Web.

The Web is a data-generating machine in and of itself, a perpetual motion machine violating the laws of thermodynamics. Some of this perpetual motion was captured by another study, published last fall and earlier this year, by Roger Bohn, Jim Short, and Chaitanya Baru, measuring the flows of information through enterprise servers and into consumers’ eyes and ears (Americans consumed 3,600 exabytes of data in 2008).

The Web has created a digital platform that makes the consuming, creating, and moving of data far easier than it has ever been, making any additional member of the Internet community (over 2 billion, at 30% penetration of world population) a contributor to the exponential growth of data. There were 1 billion Internet users in 2005, which means the Internet population doubled in the last 5 years. Using IDC’s estimates, each person on the Internet in 2005 produced/consumed 130 gigabytes on average. In 2010, each person on the Internet produced/consumed 613 gigabytes. Which brings me to

The First Law of Big Data: Each additional person on the Internet accelerates the rate of growth of data.
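The per-person figures behind this law are simple division, and a rough sketch makes the point visible. The 2010 total (1,227 exabytes) and the user counts (1 billion in 2005, 2 billion in 2010) are from the text above; the 2005 total of 130 exabytes is not quoted directly but is implied by 130 gigabytes per person across 1 billion users. The function name is mine, and the user counts are the rounded ones used in this post.

```python
EB_TO_GB = 1_000_000_000  # 1 exabyte = 1 billion gigabytes, as noted earlier in the post


def gb_per_user(total_eb: float, users: float) -> float:
    """Average gigabytes produced/consumed per Internet user."""
    return total_eb * EB_TO_GB / users


if __name__ == "__main__":
    # 2005 total is implied (130 GB x 1 billion users), not quoted directly.
    per_2005 = gb_per_user(130, 1e9)
    per_2010 = gb_per_user(1227, 2e9)

    user_growth = 2e9 / 1e9
    per_user_growth = per_2010 / per_2005

    print(f"2005: {per_2005:.1f} GB per user")
    print(f"2010: {per_2010:.1f} GB per user")
    print(f"Users grew {user_growth:.1f}x, data per user grew {per_user_growth:.1f}x")
```

If data per user had stayed flat, doubling the Internet population would only have doubled the total. Instead the per-user average grew by more than the user count did, which is the pattern the First Law describes.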

The Web links 1. consumers/producers of data to other consumers/producers of data; 2. devices producing/consuming data to other devices producing/consuming data; and 3. data to data. And all three are linked together in one giant Big Data cloud. Each person joining the Internet today immediately creates hundreds of links (with or without their knowledge) with each link representing megabytes and gigabytes of data.

Which is why Google today announced Google+: There’s gold in them links!

Of course, since we are talking about the future, we must also consider less-rosy scenarios. What if, after several people die of starvation while online, the government steps in and mandates a limit on digital storage set at 1982 levels?

Full Disclosure: I conceived of the Digital Universe study while working for EMC and worked with IDC on it from 2007 to 2010. For the 2011 study, I was paid to provide editorial assistance to IDC. The opinions above are mine.

Other observations about the 2011 study in the blogosphere so far: Lucas Mearian, Chuck Hollis, and Lara O’Reilly.