The real scoop on Hadoop
Cloudera’s Mike Olson talks latest trends, changes and your formula for success
By Anne-Lindsay Beall, SAS Insights Editor
Mike Olson is unquestionably an expert in Hadoop. After selling his startup to Oracle in 2008, Olson co-founded Cloudera to sell a version of Hadoop that’s packed with the features and support businesses need to get value out of big data.
We sat down with Olson to find out how Hadoop technology – and the way businesses are using it – is changing.
Let’s start with your overall outlook on Hadoop and analytics. How are things changing around these technologies?
Mike Olson: The overarching theme, and the one driving all others, is innovation. Hadoop's 10 years old; since Doug Cutting and Mike Cafarella wrote the initial versions of HDFS and MapReduce, there's been an explosion of new projects expanding Hadoop's capabilities. The list reads like a strange sort of bestiary: Pig, Hive, Sentry, ZooKeeper, Impala and 20 or 30 more. The capabilities span ingest and filtering, data quality, SQL, data flow, web-scale data serving, security and multitenancy, fast text search – a long list of enterprise-grade capabilities and powerful new ways to work with data at scale.
Of particular note, lately, is the advent of interactive and real-time services. Original Hadoop was batch-mode (and was much maligned for that). Today, you can do streaming data ingest, filter and alert on events as they happen, train models and score events in real time. Tens of terabytes used to be expensive and hard to handle; today, we have customers with tens of petabytes of data.
That innovation will continue. Apache Spark is the hot new project, and our One Platform initiative is driving improvements across the big data ecosystem so that Spark is just as secure, just as easy to manage, and just as scalable as the rest of the Hadoop ecosystem. But Spark won't be the last word. I'm confident that we'll continue to see new ideas come out of the open source community that provide even more value from big data. They'll span storage, processing and analytic capabilities.
How are you seeing companies use analytics with Hadoop? What types of results are they experiencing? How have you seen analytics projects evolve since companies began implementing Hadoop?
Olson: Two major trends.
First, companies are combining data sets that have long been segregated. Digesting user behavior from web and mobile interactions, combining that with transaction flows from in-store and e-commerce sites, and adding interactions from calls, chats or emails to customer support was impossible before. We had separate systems for all of those data sets. Now, we can land them in a single place, and use a variety of analytic tools to collect and analyze them together.
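To make the idea of landing separate data sets in one place concrete, here is a minimal single-machine sketch. It is not Hadoop code and not anything Olson or Cloudera supplied: the three sample data sets, the `customer` key and the `combine` helper are all hypothetical, chosen only to show previously separate sources being grouped under one customer and queried together.

```python
from collections import defaultdict

# Hypothetical sample data standing in for three formerly separate systems.
web_events = [
    {"customer": "c1", "page": "/shoes"},
    {"customer": "c2", "page": "/hats"},
    {"customer": "c1", "page": "/cart"},
]
transactions = [{"customer": "c1", "total": 80.0}]
support_contacts = [{"customer": "c2", "channel": "chat"}]


def combine(**sources):
    """Group rows from every named source under their customer ID."""
    profiles = defaultdict(lambda: {name: [] for name in sources})
    for name, rows in sources.items():
        for row in rows:
            profiles[row["customer"]][name].append(row)
    return dict(profiles)


profiles = combine(web=web_events, purchases=transactions, support=support_contacts)

# A question that needs more than one source: who browsed but never bought?
browsed_not_bought = sorted(
    customer
    for customer, p in profiles.items()
    if p["web"] and not p["purchases"]
)
print(browsed_not_bought)
```

The point of the sketch is the last query: it can only be answered once web activity and purchases sit side by side.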
Second, really powerful new analytic techniques are available. SAS users have been at the forefront of analytics for a long time – machine learning and high-powered statistics are familiar to them. With Spark Streaming, we're now seeing those and other techniques applied in concert to complex event processing flows. The business users who get real-time results from these systems may not know or care what algorithms are hidden behind the curtain, but they have great application user interfaces supporting them in making better decisions based on data analyzed using those tools.
How does Hadoop fit into businesses’ modernization plans? Why is Hadoop a good solution for big data storage when it comes to analytics?
Olson: The core idea behind Hadoop – the insight that Google had, when it invented the technique – was that you could gang together large numbers of inexpensive industry-standard servers, and use their combined storage and CPU to catch, process and analyze more data, at dramatically lower cost than ever before. None of us in the database industry believed, back then, you could build a system big enough to ingest and store the whole internet. Google ignored the impossibility and did it.
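The divide-the-work idea can be illustrated with a toy word count in the MapReduce style mentioned earlier. This is a sketch that runs on one machine, assuming each text chunk stands in for data held on a separate server; the function names are illustrative and are not Hadoop's API.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk, as a single worker would."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Gather all values emitted for the same key, across every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # On a cluster, each chunk would be mapped on a different machine;
    # here the chunks are simply processed one after another.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


chunks = ["big data big ideas", "data at scale", "big scale"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently, adding more machines lets the same program handle more chunks, which is the scaling property the answer above describes.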
The reason that Hadoop is such a great solution is that it's inexpensive, scalable and completely flexible. You don't need to predict, in advance, what kinds of data you'll capture and store. The system is able to handle any format, including new formats as they emerge. That's crucial, since we can't predict today what sensors, what systems and what data formats we'll be using five or 10 years hence.
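Storing data first and deciding how to interpret it later is often called schema-on-read. The sketch below illustrates that idea under simple assumptions: raw records are kept exactly as they arrived, and a reader applies structure only when the data is used. The `raw_store` list, the format tags and the `read_with_schema` helper are hypothetical and not part of any Hadoop API.

```python
import csv
import io
import json

# Raw records stored as-is, each tagged with the format it arrived in.
raw_store = [
    ("json", '{"user": "ann", "amount": 12.5}'),
    ("csv", "bob,7.25"),
    ("json", '{"user": "cy", "amount": 3, "channel": "mobile"}'),
]


def read_with_schema(fmt, payload):
    """Interpret one raw record at read time as {user, amount}."""
    if fmt == "json":
        record = json.loads(payload)
        return {"user": record["user"], "amount": float(record["amount"])}
    if fmt == "csv":
        user, amount = next(csv.reader(io.StringIO(payload)))
        return {"user": user, "amount": float(amount)}
    raise ValueError(f"no reader registered for format {fmt!r}")


records = [read_with_schema(fmt, payload) for fmt, payload in raw_store]
total = sum(record["amount"] for record in records)
print(records)
print(total)
```

Supporting a new format later means adding a reader branch; nothing already stored has to be converted, and extra fields such as `channel` stay in the raw data for future use.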
Hadoop is, by a considerable margin, the most successful platform for big data storage and processing in the world. Businesses looking to modernize certainly need to plan for big data, and they ought to choose Hadoop right now, just on that basis. But its flexibility means it's the best choice for the long term, as well. It future-proofs the data center, adapting to new data and new analytic engines as they emerge.
Should all businesses be looking at Hadoop? Or only those with big data? How does an organization know when/if it’s ready for Hadoop? And if they’re not ready, what do organizations need to do to get ready for Hadoop?
Olson: Every substantial enterprise should, for sure. Businesses have been using data for a long time to make better decisions – capturing information about customers and sales, exploring them with business intelligence tools, using great analytic products from SAS to understand history and the present, and to predict the future. Big data just means that you can bring more detail, more data, to that party. More detail means a finer-grained and more useful picture of today, and a more reliable look at the future.
Even small businesses and individuals that don't want or use Hadoop themselves are getting plenty of it. You can't shop online, plan an airplane trip, get map directions or enjoy programs on television or the Internet without kicking off Hadoop analytics, and benefiting from their output. We'll see that march continue: More and more of the services we consume will be backed by big data. I could tell you great stories today about big data in health care, connected cars, the energy grid, agriculture, manufacturing and on and on.
When it comes to Hadoop, do you believe there’s a formula for success?
Olson: First of all, big data and Hadoop are new to most organizations. Like all new things, you want to learn about them before you try them out. Engage with someone who can help with advice, training and professional services.
Second, figure out in advance what you want to accomplish. Our most successful customers begin with just a handful of use cases – business problems they want to solve using big data. We know the platform very well, and of course our customers know their data and their businesses well. We often help sketch out projects that will succeed on the platform, and they choose the ones that will provide direct and measurable business value. This is important because success builds both confidence in the new capabilities and experience in how to plan new projects that can use them.
Third, information technology decisions – especially decisions about core platforms – last a very long time. Using big data means collecting terabytes or petabytes of information, and using it today and in the future to continually improve the business. You want a platform and a partner that will be there for the long haul. The main reason Hadoop has been so successful to date is that it's open source – CIOs know that no company can hijack the projects, so they'll survive any kind of marketplace upheaval. More to the point, though, the innovation I talked about above is a two-edged sword: It provides lots of runway for future improvements, but is complicated for enterprises to understand and ingest solo. A vendor that drives that innovation and delivers it in a consumable, secure, managed and governed way is essential.