Great Introduction to Hadoop (Video)

Adam Shook at SpringOne2GX 2013 in Santa Clara, CA

Posted in Big Data Analytics, Big Data education, Hadoop, Training | 2 Comments

Amit Bendov, CEO of SiSense, on the Myths of Big Data

Amit Bendov, CEO of SiSense, at the GigaOm Structure event, on people doing big data analysis on a shoestring budget.

Posted in Big Data Analytics, SiSense | Leave a comment

The Web at 25: The Value of Open

Tim Berners-Lee, 2009 Photo Credit: Webb Chappell

25 years ago today (March 12, 1989), Tim Berners-Lee circulated a proposal for “Mesh” (later to be known as the World Wide Web) to his management at CERN. 45 years ago this year (October 29, 1969), the first ARPANET (later to be known as the Internet) link was established between UCLA and SRI.

The Internet started as a network for linking research centers. The World Wide Web started as a way to share information among researchers at CERN. Both have expanded to touch a third of the world’s population today because they have been based on open standards.

Creating a closed and proprietary system has been the business model of choice for many great inventors and some of the greatest inventions of the computer age. That’s where we were headed in the early 1990s: the establishment of global proprietary networks owned by a few computer and telecommunications companies, whether old (IBM, AT&T) or new (AOL). Tim Berners-Lee’s invention and CERN’s decision to offer it to the world for free in 1993 changed the course of this proprietary march, giving a new—and much expanded—life to the Internet (itself a response to proprietary systems that did not inter-communicate) and establishing a new, open platform for a seemingly infinite number of applications and services.

As Bob Metcalfe told me in 2009: “Tim Berners-Lee invented the URL, HTTP, and HTML standards… three adequate standards that, when used together, ignited the explosive growth of the Web… What this has demonstrated is the efficacy of the layered architecture of the Internet. The Web demonstrates how powerful that is, both by being layered on top of things that were invented 17 years before, and by giving rise to amazing new functions in the following decades.”

Metcalfe also touched on the power and potential of an open platform: “Tim Berners-Lee tells this joke, which I hasten to retell because it’s so good. He was introduced at a conference as the inventor of the World Wide Web. As often happens when someone is introduced that way, there are at least three people in the audience who want to fight about that, because they invented it or a friend of theirs invented it. Someone said, ‘You didn’t. You can’t have invented it. There’s just not enough time in the day for you to have typed in all that information.’ That poor schlemiel completely missed the point that Tim didn’t create the World Wide Web. He created the mechanism by which many, many people could create the World Wide Web.”

“All that information” was what the Web gave us (and what was also on the mind of one of the Internet’s many parents, J.C.R. Licklider, who envisioned it as a giant library). But this information comes in the form of ones and zeros; it is digital information. In 2007, 94% of storage capacity in the world was digital, a complete reversal from 1986, when 99.2% of all storage capacity was analog. The Web was the glue and the catalyst that would speed up the spread of digitization to all analog devices and channels for the creation, communication, and consumption of information. It has been breaking down, one by one, proprietary and closed systems with the force of its ones and zeros.

Metcalfe’s comments were first published in ON magazine, which I created and published for my employer at the time, EMC Corporation. For a special issue (PDF) commemorating the 20th anniversary of the invention of the Web, we asked some 20 members of the Inforati how the Web has changed their and our lives and what it will look like in the future. Here’s a sample of their answers:

Guy Kawasaki: “With the Web, I’ve become a lot more digital… I have gone from three or four meetings a day to zero meetings per day… Truly the best will be when there is a 3-D hologram of Guy giving a speech. You can pass your hand through him. That’s ultimate.”

Chris Brogan: “We look at the Web as this set of tools that allow people to try any idea without a whole lot of expense… Anyone can start anything with very little money, and then it’s just a meritocracy in terms of winning the attention wars.”

Tim O’Reilly: “This next stage of the Web is being driven by devices other than computers. Our phones have six or seven sensors. The applications that are coming will take data from our devices and the data that is being built up in these big user-contributed databases and mash them together in new kinds of services.”

John Seely Brown: “When I ran Xerox PARC, I had access to one of the world’s best intellectual infrastructures: 250 researchers, probably another 50 craftspeople, and six reference librarians all in the same building. Then one day to go cold turkey—when I did my first retirement—was a complete shock. But with the Web, in a year or two, I had managed to hone a new kind of intellectual infrastructure that in many ways matched what I already had. That’s obviously the power of the Web, the power to connect and interact at a distance.”

Jimmy Wales: “One of the things I would like to see in the future is large-scale, collaborative video projects. Imagine what the expense would be with traditional methods if you wanted to do a documentary film where you go to 90 different countries… with the Web, a large community online could easily make that happen.”

Paul Saffo: “I love that story of when Tim Berners-Lee took his proposal to his boss, who scribbled on it, ‘Sounds exciting, though a little vague.’ But Tim was allowed to do it. I’m alarmed because at this moment in time, I don’t think there are any institutions out there where people are still allowed to think so big.”

Dany Levy (founder of DailyCandy): “With the Web, everything comes so easily. I wonder about the future and the human ability to research and to seek and to find, which is really an important skill. I wonder, will human beings lose their ability to navigate?”

Howard Rheingold: “The Web allows people to do things together that they weren’t allowed to do before. But… I think we are in danger of drowning in a sea of misinformation, disinformation, spam, porn, urban legends, and hoaxes.”

Paul Graham: “[With the Web] you don’t just have to use whatever information is local. You can ship information to anyone anywhere. The key is to have the right filter. This is often what startups make.”

How many startups and grown-up companies today are entirely based on an idea first fleshed out in a modest proposal 25 years ago? And there is no end in sight for the expanding membership in this club, now also increasingly including the analogs of the world. All businesses, all governments, all non-profits, all activities are being eaten by ones and zeros. Tim Berners-Lee has unleashed an open, ever-expanding system for the digitization of everything.

We also interviewed Berners-Lee in 2009. He said that the Web has “changed in the last few years faster than it changed before, and it is crazy for us to imagine this acceleration will suddenly stop.” He pointed out the ongoing tendency to lock what we do with computers in a proprietary jail: “…there are aspects of the online world that are still fairly ‘pre-Web.’ Social networking sites, for example, are still siloed; you can’t share your information from one site with a contact on another site.” But he remained both realistic and optimistic, the hallmarks of an entrepreneur: “The Web, after all, is just a tool…. What you see on it reflects humanity—or at least the 20 percent of humanity that currently has access to the Web… No one owns the World Wide Web, no one has a copyright for it, and no one collects royalties from it. It belongs to humanity, and when it comes to humanity, I’m tremendously optimistic.”

The Pew Research Center is marking the 25th anniversary of the Web in a series of reports. Berners-Lee says in a press release issued today by the World Wide Web Consortium: “I hope this anniversary will spark a global conversation about our need to defend principles that have made the Web successful, and to unlock the Web’s untapped potential. I believe we can build a Web that truly is for everyone: one that is accessible to all, from any device, and one that empowers all of us to achieve our dignity, rights and potential as humans.”

See also Berners-Lee’s post on Google’s official blog: “…today is a day to celebrate. But it’s also an occasion to think, discuss—and do. Key decisions on the governance and future of the Internet are looming, and it’s vital for all of us to speak up for the web’s future. How can we ensure that the other 60 percent around the world who are not connected get online fast? How can we make sure that the web supports all languages and cultures, not just the dominant ones? How do we build consensus around open standards to link the coming Internet of Things? Will we allow others to package and restrict our online experience, or will we protect the magic of the open web and the power it gives us to say, discover, and create anything? How can we build systems of checks and balances to hold the groups that can spy on the net accountable to the public? These are some of my questions—what are yours?”

Posted in World Wide Web | 1 Comment

The Web at 25: Tim Berners-Lee on the Web of Data

Tim Berners-Lee, 2009 Photo Credit: Webb Chappell

In 2009, on the occasion of the 20th anniversary of the Web, Jason Rubin and I talked to Tim Berners-Lee about his invention and its future, the Semantic Web, which he described as “the Web of data.”

Twenty years on, the World Wide Web has proven itself both ubiquitous and indispensable. Did you anticipate it would reach this status, and in this time frame?

Tim Berners-Lee: I think while it’s very tempting for us to look at the Web and say, “Well, here it is, and this is what it is,” it has, of course, been constantly growing and changing—and it will continue to do so. So to think of this as a static “This is how the Web is” sort of thing is, I think, unwise. In fact, it’s changed in the last few years faster than it changed before, and it’s crazy for us to imagine this acceleration will suddenly stop. So yes, the 20-year point goes by in a flash, but we should realize that, and we are constantly changing it, and it’s very important that we do so.

I believe that 20 years from now, people will look back at where we are today as being a time when the Web of documents was fairly well established, such that if someone wanted to find a document, there’s a pretty good chance it could be found on the Web. The Web of data, though, which we call the Semantic Web, would be seen as just starting to take off. We have the standards but still just a small community of true believers who recognize the value of putting data on the Web for people to share and mash up and use at will. And there are other aspects of the online world that are still fairly “pre-Web.” Social networking sites, for example, are still siloed; you can’t share your information from one site with a contact on another site. Hopefully, in a few years’ time, we’ll see that quite large category of social information truly Web-ized, rather than being held in individual lockdown applications.

You mentioned a “small community” of people who see the value of the Semantic Web. Is that a repeat occurrence of the struggle 20 years ago to get people to understand the scope and potential impact of the World Wide Web?

It’s remarkably similar. It’s very funny. You’d think that once people had seen the effect of Web-izing documents to produce the World Wide Web, doing likewise with their data would seem the next logical step. But for one thing, the Web was a paradigm shift. A paradigm shift is when you don’t have in your vocabulary the concepts and the ideas with which to understand the new world. Today, the idea that a web link could connect to a document that originates anywhere on the planet is completely second nature, but back then it took a very strong imagination for somebody to understand it.

Now, with data, almost all the data you come across is locked in a database. The idea that you could access and combine data anywhere in the world and immediately make it part of your spreadsheet is another paradigm shift. It’s difficult to get people to buy into it. But in the same way as before, those who do get it become tremendously fired up. Once somebody has realized what it would be like to have linked data across the world, then they become very enthusiastic, and so we now have this corps of people in many countries all working together to make it happen.

Do you see the Semantic Web as enabling greater collaboration between and among parties, as opposed to the point-to-point or point-to-many communication that seems more prevalent in the current Web?

The original web browser was a browser editor and it was supposed to be a collaborative tool, but it only ran on the NeXT workstation on which it was developed. However, the idea that the Web should be a collaborative place has always been a very important goal for me. I think harnessing the creative energy of people is really important. When you get people who are trying to solve big problems like cure AIDS, fight cancer, and understand Alzheimer’s disease, there are a huge number of people involved, all of them with half-formed ideas in their minds. How do we get them communicating so that the half of an idea in one person’s head will connect with half of an idea in somebody else’s head, and they’ll come up with the solution?

That’s been a goal for the Web of documents, and it’s certainly a goal for the Web of data, where different pieces of data can be used for all kinds of different things. For example, a genomist may suspect that a particular protein is connected to a certain syndrome in a cell line, search for and find data relating to each area, and then suddenly put together the different strains of data and discover something new. And this is something he can do with the owners of the respective pieces of data, who might never have found each other or known that their data was connected. So the Web of data will absolutely lead to greater collaboration.

Is your vision of the Semantic Web one in which data is freely available, or are there access rights attached to it?

A lot of information is already public, so one of the simple things to do in building the new Web of data is to start with that information. And recently, I’ve been working with both the U.K. government and the U.S. government in trying not only to get more information on the Web, but also to make it linked data. But it’s also very important that systems are aware of the social aspects of data. And it’s not just access control, because an authorized user can still use the right data for the wrong purpose. So we need to focus on what are the purposes for accessing different kinds of data, and for that we’ve been looking at accountable systems.

Accountable systems are aware of the appropriate use of data, and they allow you to make sure that certain kinds of information that you are comfortable sharing with people in a social context, for example, are not able to be accessed and considered by people looking to hire you. For example, I have a GPS trail that I took on vacation. Certainly, I want to give it to my friends and my family, but I don’t necessarily wish to license people I don’t know who are curious about me and my work and let them see where I’ve been. Companies may want to do the same thing. They might say, “We’re going to give you access to certain product information because you’re part of our supply chain and you can use it to fine-tune your manufacturing schedule to meet our demand. However, we do not license you to use it to give to our competition to modify their pricing.”

You need to be able to ask the system to show you just the data that you can use for a given task, because how you wish to use it will be the difference in whether you can use it. So we need systems for recording what the appropriate use of data is, and we need systems for helping people use data in an appropriate way so they can meet an ethical standard.

Ultimately, what is one of the most significant things the Semantic Web will enable?

One thing I think we’ll be able to do is to write intelligent programs that run across the Web of data looking for patterns when something went wrong—like when a company failed, or when a product turned out to be dangerous, or when an ecological catastrophe happened. We can then identify patterns in a broad range of data types that resulted in something serious happening, and that will allow us to identify when these patterns recur, and we’ll be better able to prepare for or prevent the situation.

I think when we have a lot of data available on the Web about the world, including social data, ecological data, meteorological data, and financial data, we’ll be able to make much better models. It’s been quite evident over the last year, for example, that we have a really bad grasp of the financial system. Part of the reason for that might be that we have insufficient data from which to draw conclusions, or that the experts are too selective in which data they use. The more data we have, the more accurate our models will be.

After 20 years, what about the Web—either its current or future capabilities—excites you the most?

One of the things that gets me the most excited are the mash-ups, where there’s one market of people providing data and there’s a second layer of people mashing up the data, picking from a rich variety of data sources to create a useful new application or service. A classic example of a mash-up is when I find a seminar I want to go to, and the web page has information about the sponsor, the presenter, the topic, and the logistics. I have to write all that down on the back of an envelope and then go and put it in my address book; I have to put it in my calendar; I have to enter the address in my GPS—basically, I have to copy this information into every device I use to manage my life, which is inefficient and time-consuming. This is because there is no common format for this data to become integrated into my devices.

Now, the vision of the Semantic Web is that the seminar’s web page has information pointed at data about the event. So I just tell my computer I’m going to be attending that seminar and then, automatically, there is a calendar that shows things that I’m attending. And automatically, an address book I define as having in it the people who have given seminars that I’ve attended within the last six months appears, with a link to the presenter’s public profile. And automatically, my PDA starts pointing towards somewhere I need to be at an appropriate time to get me there. All I need to do is say, “I’m going to that seminar,” and then the rest should follow.
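To make the seminar example concrete, here is a minimal sketch of the idea that one machine-readable description of an event can feed several tools at once. It is plain Python, not RDF or any actual Semantic Web standard, and every name in it (the `Seminar` record, its fields, the `attend` function, the example.org URL) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Seminar:
    """Hypothetical structured description of the event on the seminar's page."""
    title: str
    presenter: str
    presenter_profile: str  # link to the presenter's public profile
    location: str
    starts_at: datetime


def calendar_entry(seminar):
    # What a calendar application would need from the shared record.
    return {
        "summary": seminar.title,
        "start": seminar.starts_at.isoformat(),
        "location": seminar.location,
    }


def address_book_entry(seminar):
    # What an address book would need from the same record.
    return {"name": seminar.presenter, "profile": seminar.presenter_profile}


def attend(seminar):
    """Saying "I'm going to that seminar" once populates every view."""
    return {
        "calendar": calendar_entry(seminar),
        "address_book": address_book_entry(seminar),
        "navigation": seminar.location,
    }


if __name__ == "__main__":
    talk = Seminar(
        title="Linked Data in Practice",        # illustrative values only
        presenter="A. Presenter",
        presenter_profile="https://example.org/people/a-presenter",
        location="Room 101, Example Hall",
        starts_at=datetime(2009, 6, 1, 14, 0),
    )
    for view, data in attend(talk).items():
        print(view, "->", data)
```

The point of the sketch is only that nothing is copied by hand: the calendar, the address book, and the navigation target are all derived from the same record.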

The Web is such a mélange of useful, noble content and stuff that runs the gamut from the mundane to the grotesque. Do you think humanity is using this incredible invention of yours appropriately?

Yes. The Web, after all, is just a tool. It’s a powerful one, and it reconfigures what we can do, but it’s just a tool, a piece of white paper, if you will. So what you see on it reflects humanity—or at least the 20 percent of humanity that currently has access to the Web.

As a standards body, the W3C is not interested in policing the Web or in censoring content, nor should we be. No one owns the World Wide Web, no one has a copyright for it, and no one collects royalties from it. It belongs to humanity, and when it comes to humanity, I’m tremendously optimistic. After 20 years, I’m still very excited and extremely hopeful.

[First published in ON magazine]

Posted in Data Discovery, Interviews, World Wide Web | 3 Comments

The Tom Davenport Guide to Big Data

Big Data at Work: Dispelling the Myths, Uncovering the Opportunities is a new book from Tom Davenport, a veteran observer of the data analysis scene. It’s required reading for managers who need a straightforward, hype-free introduction to big data, a clear and clarifying “signal” in the incredible noise around the confusing and mislabeled term. If Viktor Mayer-Schönberger’s and Kenneth Cukier’s book was last year’s definitive text on the subject for general audiences, Big Data at Work is the 2014 definitive guide to starting and managing the big data journey in small and large organizations.

Davenport discusses in the book the experiences of the early adopters of big data, how to develop a strategy and a plan of action regarding big data, what skills a data scientist needs, and how big data will change traditional management behaviors. He also provides a “manager-focused” overview of big data technologies, explains what is needed to succeed with big data, and outlines lessons learned (and some “lessons not learned”) from the experiences of startups, online firms, and large companies. He also offers the concept of “analytics 3.0” to describe how companies can combine the best of small data and traditional analytics with the big data approach.

This last bit, Davenport’s attempt to suggest an evolutionary path forward, while discussed only briefly as the book’s conclusion, is emblematic of the core strength of the entire book. No breathless talk about the “big data revolution” here (I think the word “revolutionary” is mentioned only once or twice—how refreshing).

In a world and business environment constantly replenished with ideas, tools, and entities that are genuinely new, it is important to distinguish the new from the old and understand what is old in the new. A good grasp of past developments is key to developing better and more useful guidance regarding what steps to take and what to expect. This is especially important for established enterprises and their executives, who need to understand how the new phenomenon relates to their previous investments in related ideas that were “revolutionary” and “transformative” just a few years ago (before big data we had analytics, business intelligence, and data mining, to name just a few predecessors).

Tom Davenport is in a unique position to do just that in the context of big data. When he says that big data is “perhaps the most sweeping change in what we do to get value from data since the 1980s,” he not only provides the best definition of “big data” I have encountered so far, but also demonstrates his intimate understanding of its evolutionary nature. He has been observing, since the 1980s, the constant trend—the ever-growing deluge of digital data—and the incremental changes in our ability to manage it better. In addition, he has been one of a handful of influential thinkers who have tried to understand the impact of this data deluge on the practice of management.

At the forefront of guiding managers through the complex and changing interrelations between information technology and management for many years—from business process re-engineering to enterprise resource planning (ERP) to knowledge management to business analytics—Davenport in this book writes the new, still-unfolding chapter in this history. He correctly points to “online firms” (e.g., Google, Facebook) as the originators of everything big data, not only in terms of the new tools and technologies and “the function of data science” they have developed but also in their new attitude to data and its analysis and their new, data-driven management practices.

There is one more dimension to the inventiveness of these firms that Davenport does not discuss as such but that is, I believe, a very important part of what “big data” means and what impact it may have in the coming years. Google, Facebook, and other web-natives did not follow the traditional IT purchasing practices of other enterprises. They created their IT infrastructure on their own and did not buy it from established IT vendors. The new “data warehousing” and “enterprise resource planning” technologies of the 1990s, driven by a new attitude towards data and its mining, gave rise to a new type of IT buying decision favorable to vendors focused on only one element of the IT infrastructure (e.g., Oracle, SAP, EMC, Cisco). In a similar way, big data altered the way IT is bought (or developed in-house) and managed in web-native companies. Is this a dimension of big data that is going to be adopted by other companies as they adopt its other facets, such as a new attitude towards data and new management practices? Is big data going to fundamentally change the IT landscape and the practice of IT?

Maybe these are questions to be examined in Davenport’s next book, which no doubt, given his publishing history, will come out in 18 months or so. Other possible topics for discussion in the next book (which could be about the new Next Big Thing—the Internet of Things) could be the enterprise-related challenges of big data that are talked about only briefly in Big Data at Work.

For example, executives must understand data privacy issues and have a good grasp of their potential solutions. Privacy is part of the larger issue of “data governance,” the comprehensive set of data and risk management policies and processes that every enterprise today must establish and follow—and few have. Recently, Varonis Systems (VRNS), a data governance provider (about which I wrote last year), had the first successful big data (and for that matter, tech-related) IPO of 2014.

Another topic that deserves a more detailed discussion is the ideology of big data, especially the misguided belief that the collection of data is a goal in itself and that the data can speak to us and answer questions we never knew we should have asked. At least this reader would have liked to hear more from the level-headed Davenport about his important warning in this regard: “Sifting [through a big pile of data] without a purpose can become very expensive and time-consuming. It’s far better to have a hypothesis in mind—particularly before gathering a lot of data, and even before analyzing it.”

Big Data at Work nevertheless has everything required to get any reader started on the big data journey, from busy executives to students who would like to understand what role it will play in their future careers. “It’s not how much data you have, but what you do with it that counts,” says Davenport. This is not only a great summation of the topic of big data, but also of the book, a practical and highly useful guide to a “phenomenon… of substantial importance to many organizations” and individuals.

[Originally published on]

Posted in Big Data Analytics, Data Science | Leave a comment

Bernard Marr on What is Big Data

Posted in Big Data Analytics | Leave a comment

Top 10 Big Data Pure-Plays

Wikibon recently published the “Big Data Vendor Revenue and Market Forecast 2013-2017” report, which lists more than 70 big data vendors with total 2013 revenues of $18.6 billion, growing at an annual rate of 58%. Here are the top ten big data vendors that derived 100% of their 2013 revenues from big data products and services (and Wikibon’s estimates for 2012 revenues):

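Vendor             2013 revenue    2012 revenue (Wikibon estimate)
Palantir           $418 million    $191 million (revised from original estimate of $78 million)
Pivotal            $300 million    (a new listing in this year’s report)
Splunk             $283 million    $186 million
Mu Sigma           $160 million    $114 million
Actian             $138 million    $46 million
Opera Solutions    $124 million    $118 million
Mark Logic         $96 million     $69 million
Syncsort           $75 million     (new)
Cloudera           $73 million     $56 million (revised from $61 million)
MongoDB            $62 million     $36 million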

Source: Wikibon 2014

Palantir, which former CIA chief David Petraeus described to FORBES as “a better mousetrap when a better mousetrap was needed,” more than doubled its revenues last year, in addition to raising more than $300 million in funding. New on the list this year are Pivotal, a new entity consisting of EMC’s and VMware’s big data-related assets, and Syncsort, founded in 1968 as a developer of mainframe software, but recently focusing on big data-related tools. MongoDB, a developer of a document-oriented database which raised $150 million in 2013, was formerly known as 10gen. Cloudera, one of the most prominent companies in the big data space, fell far short of its previous performance of doubling its revenues each year.
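For readers who want to check the growth claims against the list, the following is a small sketch that computes year-over-year growth from the 2013 and 2012 figures quoted above. The revenue numbers are copied from the list (in $ millions); the function and variable names are my own, and vendors without a 2012 figure are simply skipped.

```python
# Revenue figures in $ millions, copied from the Wikibon list above.
# Each value is (2013 revenue, 2012 estimate); None marks a new listing
# for which no 2012 figure is given.
REVENUES = {
    "Palantir": (418, 191),
    "Pivotal": (300, None),
    "Splunk": (283, 186),
    "Mu Sigma": (160, 114),
    "Actian": (138, 46),
    "Opera Solutions": (124, 118),
    "Mark Logic": (96, 69),
    "Syncsort": (75, None),
    "Cloudera": (73, 56),
    "MongoDB": (62, 36),
}


def growth_pct(rev_2013, rev_2012):
    """Year-over-year growth in percent, or None when 2012 is unavailable."""
    if rev_2012 is None:
        return None
    return (rev_2013 - rev_2012) / rev_2012 * 100


def growth_table(revenues):
    """Map each vendor to its 2012-to-2013 growth percentage."""
    return {name: growth_pct(*pair) for name, pair in revenues.items()}


if __name__ == "__main__":
    for name, pct in growth_table(REVENUES).items():
        label = "n/a (new listing)" if pct is None else f"{pct:.0f}%"
        print(f"{name:<18}{label}")
```

On these figures, Palantir's growth comes out above 100% (revenues more than doubled), while Cloudera's is roughly 30%, which is what "fell far short of doubling" refers to.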

Wikibon defines big data as “data sets whose size, type, and speed-of-creation make them impractical to process and analyze with traditional database technologies and related tools in a cost- or time-effective way.”

[Originally published on]

Posted in Big data market | Leave a comment