SAS CTO on Big Data and Big Compute

“One of my biggest challenges,” Keith Collins told me recently, “is helping SAS understand how to communicate to IT organizations. We present workloads which look odd and different. IT does not know how to have an SLA (Service Level Agreement) around them.  We take all of the compute and I/O capacity that they can give us.”

SAS, the largest independent vendor in the business intelligence market, used to be a prime example of “shadow IT,” the purchasing of information technology tools by business users without the knowledge or approval of the central IT organization. But this is changing in the era of big data. The collection and analysis of data are becoming a very large part of many business activities, and the IT organization is being asked to provide support, even leadership, in tying together these disparate efforts.

Collins is SVP and CTO at SAS, where he has spent almost 30 years, helping the company grow with the market through a number of phases (and buzzwords): statistical analysis, decision support, data mining, knowledge and risk management, business intelligence, and business analytics.  Now SAS is helping its customers, including CIOs and their IT teams, address the challenges of big data. Collins has seen this movie before: “People are all hyped up about Hadoop.  But what is it, really? It is big and wide record sizes, big block sizes, designed specifically for high-volume, sequential processing. Just like a SAS data set in 1968… The only difference between a SAS data set and Hadoop is that now the disks are cheap enough that you can do replication.”  The following is an edited transcript of our conversation.

Gil Press:  Indeed, many people talk about Hadoop as a replacement for tape.

Keith Collins:  We love that people get that as a pattern now, because it really helps them understand SAS.  So it is a really good time for us to have the conversation with IT about it. But they are still struggling.  They see it as “what is my next big data repository?”  They do not see it as “this is my next big way to answer questions.”

GP:  Which of your customers are ahead of others in terms of integrating analytics within their IT department?

KC:  Almost all of the financial institutions, whether they are retail or insurance or brokerage, have large analytic communities already, which are outside of IT.  Retailers are often growing analytics inside of IT. To some extent, it depends on where you were in the maturity curve of your journey of using analytics.

GP:  So, for retail it is more recent…

KC:  And therefore IT had the opportunity to be a proponent.  You will find that I am a big, big fan of shadow IT and facilitating shadow IT.  It is simply, “why fight it?  Why not facilitate it?”  I think that the model really does become, to some extent, infrastructure as a service; it is what half of IT is morphing to be, and that infrastructure as a service may be inside your walls or outside your walls.  A lot of the energy that the IT organization currently spends on keeping the lights on is becoming someone else’s responsibility.

The CIO is going to have this interesting challenge and opportunity to be the person who actually knows how to bring all of that data back together to facilitate the business and actually do something with the data.  Otherwise, it is living in all its little silos.  The new master data management opportunity is, “how do I synchronize and get all of the data that is in all of these software-as-a-service applications?”

GP:  What specific steps have you seen CIOs take to move the business in that direction?

KC: There is a simple litmus test of where people are on their analytics journey.  If a CIO says, “my business is unique,” they are just beginning.  If a CIO says, “can you tell me, based on your experience, what is happening in another industry that I might be able to apply?” they really get it, because they realize that all of these techniques, while we use different terms and different data, are the same techniques across industries.  If I can figure out how manufacturing solved this quality control problem, I might actually be able to leverage that at my call center.  Or, if I understand revenue optimization in retail, then it might change the way I actually process a credit card transaction.  The CIOs that really look to draw from experiences in other industries are way up on that understanding curve.

GP:  CIOs are told again and again that they should understand the business of the company they work for. But you are saying that they should also understand other people’s business and bring in best practices from other industries or completely different businesses.

KC:  I believe so.  If I tell my CMO that he is doing marketing wrong, that does not go very far.  If I bring him examples of marketing at Chico’s, for example, then it becomes an opportunity to do something different.  The CIOs that bring a broader experience and understanding to how technology can drive the business through opportunities to see things differently have a greater impact.

We are seeing now, across the board, that people are beginning to understand that the only way to really address big data is with analytics.  The complexity of our businesses now outstrips the human capacity to see it all.  The only way to cope is by using analytics to help you see things unseen, whether that is finding correlations, spotting what is distinct, optimizing flows, or forecasting.

GP:  You believe the value is not necessarily in the volume of data but in the analysis.

KC:  Yes, I call it big compute.  I’ll give you an example from a large bank: They already had all of their transaction data, but they were only able to do loan default modeling at the aggregate level until we came in with a new set of algorithms that work across a compute cluster of commodity hardware.  We took them from twenty hours to sixteen minutes, and to the transaction level instead of the aggregate level. Now the quality of the models is far better.
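
Collins does not spell out SAS’s algorithms here, but the general pattern behind speedups like this is to restate a computation as per-partition partial results that can be combined, so each node in a commodity cluster works on its own slice of the transactions. A hypothetical sketch, using Python’s `multiprocessing` as a stand-in for a cluster and ordinary least squares as a stand-in for the bank’s default model:

```python
import numpy as np
from multiprocessing import Pool

def partial_stats(chunk):
    """Per-partition sufficient statistics for least squares.
    Each worker sees only its own slice of the transactions."""
    X, y = chunk
    return X.T @ X, X.T @ y

def distributed_fit(partitions, workers=4):
    """Sum the partial X'X and X'y matrices from every partition,
    then solve once; the result is identical to fitting on the
    full, undivided data set."""
    with Pool(workers) as pool:
        parts = pool.map(partial_stats, partitions)
    XtX = sum(p[0] for p in parts)
    Xty = sum(p[1] for p in parts)
    return np.linalg.solve(XtX, Xty)

if __name__ == "__main__":
    # Toy data: 100,000 "transactions", 8 attributes, known coefficients
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100_000, 8))
    beta = np.arange(1.0, 9.0)
    y = X @ beta + rng.normal(scale=0.1, size=100_000)
    partitions = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
    # Estimates should land close to the true coefficients 1..8
    print(distributed_fit(partitions))
```

The point of the pattern is that adding nodes shortens the wall-clock time without changing the answer, which is what makes transaction-level modeling affordable where only aggregate-level modeling was before.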

Recently I worked with a customer who can currently only afford to take twenty thousand attributes through variable reduction down to the four hundred that he models with.  But he has enough data to work with one hundred thousand attributes.  The challenge for him was not the volume of data; it was algorithmic and computational. They had the data but could not process it in a cost-effective manner. That is an unsung piece of a lot of the conversation around big data.  A lot of people already had a bunch of data; they just could not afford to process and analyze it. Now they can.
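
The interview does not describe which variable-reduction method the customer uses; as a purely illustrative sketch, a simple correlation-based filter ranks every candidate attribute against the target and keeps only the strongest. The function name and toy data below are inventions for the example:

```python
import numpy as np

def reduce_variables(X, y, keep):
    """Rank attributes by absolute Pearson correlation with the
    target and return the indices of the `keep` strongest.

    X: (n_samples, n_attributes) matrix of candidate attributes
    y: (n_samples,) target, e.g. a default or response indicator
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # |r| for each column: |sum(xc*yc)| / (n * sd_x * sd_y)
    scores = np.abs(Xc.T @ yc) / (len(y) * X.std(axis=0) * y.std())
    return np.argsort(scores)[::-1][:keep]

if __name__ == "__main__":
    # Toy run: 1,000 rows, 50 candidate attributes, keep the best 5.
    # Only columns 3 and 17 actually drive the target.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))
    y = 2 * X[:, 3] - X[:, 17] + rng.normal(scale=0.1, size=1000)
    print(reduce_variables(X, y, keep=5))  # columns 3 and 17 rank highest
```

Real variable reduction on one hundred thousand attributes would use more sophisticated methods, but even this naive filter shows why the problem is computational: the scoring pass touches every attribute for every observation.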

GP: What’s next?

KC:  I start a lot of these conversations to get people’s attention by saying, “the Enterprise Data Warehouse (EDW) is dead.”  Now, in reality, EDW is not dead; it just morphed into its correct role.  It is the great repository for structured information.  Hadoop, or something similar to Hadoop, is going to evolve to be the data lake: The place where we are going to pour stuff in because it is affordable. Then there are going to be specialty databases around this for outbound activities. For example, how do we, in real time, interact with you on your device, and know you personally?

GP:  If I were a CIO of a large company, what should be my top priorities today?

KC:  You should organize yourself and your staff so you basically have your own COO, someone to run operations, to free you up to understand what is happening in the ecosystem around you, so that you can add value to the business from a business perspective.  You should also invest in your skills to understand the outbound side of information as well as the inbound side. Balance your budget and your investments in tools, training, and capability as much on the outbound side, how you provide information to the business, as on the inbound plumbing, infrastructure, and transactions.