LucidWorks: Bringing Search to Big Data

Grant Ingersoll

“Search is the UI for data today,” Grant Ingersoll, Chief Scientist for LucidWorks, told the audience at the recent IE big data conference in Boston. Anyone on the Internet is familiar with the search box and can find the data they are looking for without knowledge of any programming language. This could make search a key factor in bringing big data to a very broad audience, giving direct access to data and putting analysis tools in the hands of decision-makers. But, observed Ingersoll, “search is not being talked about in the big data space.”

Driven by its mission “to bring search back front and center,” LucidWorks wants to be Google for the enterprise. The company was created in 2008 as Lucid Imagination; its founding team included several key contributors and “committers” (lead developers) to the Lucene project, and LucidWorks employees are involved with other Apache Software Foundation activities (Ingersoll, for example, is co-creator of the Mahout machine learning project). The recent name change to LucidWorks came after the company shifted its business model from offering support and consulting services to selling licensed technology: a development platform built on Apache Lucene/Solr open source search. (Lucene is a text search engine library; Solr, built on top of the Lucene library, is an enterprise search platform that offers full-text and other search functionality.)

These changes came about after Paul Doscher was hired as CEO last December. Doscher told me at the conference that Lucene/Solr is the most deployed search technology in the world, used by companies such as Netflix, AT&T, Sears, Ford, and Verizon. According to Ingersoll, Twitter search is powered by Lucene, handling more than a billion queries a day, with close to four hundred million tweets indexed and available within 50 milliseconds of being posted (the Twitter engineering team described its use of Lucene in a 2010 post).

Twitter is a good example of how the world of data, once dominated by numbers, has become saturated, even overwhelmed, with text. And this is where the experience with search—and the general domain of information retrieval—really helps. Says Ingersoll: “I never like to use the word ‘unstructured’ because unstructured to me means random. Language is highly structured; we are just not very good in telling computers how to deal with it yet.” But the fuzziness and ambiguity of language is “fundamentally what search and information retrieval have been trying to solve since day one.”

Another challenge of a world awash with data that “we’ve lived and breathed for so long in search,” says Ingersoll, is data cleansing: “Everybody is so obsessed with their algorithms but at the end of the day it all comes down to pre-processing of data.” Commenting on his experience with a customer whose widely used product was showing up on page ten of search results because of a data entry error, Ingersoll shared a lesson learned by many other data scientists: “You think that data quality is like taking out the garbage. It’s something you take for granted until it doesn’t work, until the garbage man doesn’t show up and you got a whole bunch of trash sitting on your front step for weeks on end.”

Bringing this experience with search and the open source community to bear on emerging big data needs, LucidWorks launched a big data beta project in May. Built on top of LucidWorks Search, LucidWorks Big Data certifies and integrates all of the Apache open source components necessary to develop and manage a big data application, including Hadoop, Mahout, HBase, ZooKeeper, Pig, and other software tools.

It’s a marriage between “two disparate systems that need to work together,” says Ingersoll. One provides ad-hoc access, consisting of the questions users ask about what they are trying to find in the data. The other is a batch-oriented, offline system, using the big data ecosystem tools and consisting of the questions—and the resulting analysis—about what users are doing with the data and how they interact with the system.

“Users are missing from the big data conversation,” argues Ingersoll. Paying attention to what users are doing helps improve the real-time, ad-hoc access to the data by improving relevance and search results. The analysis of users’ interaction with the system could also provide, as an interesting by-product, new insights about the business. In other words, what your employees do with your data may tell you a whole lot about how your business is functioning and even where it’s heading.

While providing decision-makers with easy-to-use access to the most relevant data, LucidWorks also wants to insulate them, along with application developers, from ever-changing big data technologies. Ingersoll lumped all the familiar big data tools into what he called “the big data operating system,” or the storage and computation layer of the big data ecosystem. Where LucidWorks wants to make its mark is in the layer that sits between the user (or the developer) and this underlying technology, what it calls the Search, Discovery, and Analytics (SDA) layer. SDA may or may not become a new big data market segment, but LucidWorks is sure to try to make it a new market “category” for industry analysts to talk about and customers to get interested in.

Customers that have started to work with LucidWorks fall into two groups. Some are startup companies that want to build new big data applications, for example to capture all the publicly available legal information and provide it as a searchable repository to attorneys and in-house legal departments. Others are large companies, such as telecommunications providers that are losing billions of dollars each year because of fraudulent calls and can now analyze these calls and prevent other fraudulent calls from taking place. “We are seeing a lot of companies forming committees to identify the list of use cases that are the most significant for the company and the right technology to address these requirements,” Doscher told me.

Most of the significant innovation in the information technology space in the last twenty years, I would argue, has happened outside enterprise IT, driven by companies catering to “consumers” or, more accurately, individuals creating, accessing, and sharing data. This wave of innovation had one major catalyst: the invention of the World Wide Web by Tim Berners-Lee. Ironically, the original motivation for Berners-Lee’s invention was his desire to help people at CERN, where he worked at the time, share internal information. The initial goal of the Web was to help preserve CERN’s “organizational memory” and make it accessible and searchable. But because Berners-Lee made his new application available on the Internet, it became, with the help of other innovators, a searchable memory repository for the world.

At the same time, also twenty years ago, enterprise IT saw a major wave of change with the rise of relational databases running on inexpensive servers, data mining, and data warehousing. This helped companies use and derive value from their newly established data stores. LucidWorks may represent a new wave of change, using search—the first “killer app” of the Web—to unlock the value of enterprises’ much expanded big data stores and overflowing organizational memories.

“The analogy I go to,” says Paul Doscher, “is relational databases back in the late 1980s or early 1990s. Later there were tools and eventually there were apps. I see the same evolution going on with Hadoop. Hadoop is the [underlying] data structure and there will be tools from vendors like us and eventually there will be apps for vertical or horizontal market segments that will all be based on a big data framework. But where it took ten to twelve years [for this evolution to unfold] in the relational database context, I think this is going to take three to five years.”