What is Big Data?
Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. Why? More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiencies, cost reductions and reduced risk.
As far back as 2001, industry analyst Doug Laney (currently with Gartner) articulated the now-mainstream definition of big data as the three Vs: volume, velocity and variety [1]:
- Volume. Many factors contribute to the increase in data volume. Transaction-based data stored through the years. Unstructured data streaming in from social media. Increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.
- Velocity. Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.
- Variety. Data today comes in all types of formats. Structured, numeric data in traditional databases. Information created from line-of-business applications. Unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.
At SAS, we consider two additional dimensions when thinking about big data:
- Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage. Even more so with unstructured data involved.
- Complexity. Today's data comes from multiple sources. And it is still an undertaking to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages, or your data can quickly spiral out of control.
Examples of big data
- RFID (radio frequency ID) systems generate up to 1,000 times the data of conventional bar code systems.
- In just four hours on "Black Friday" 2012, Walmart handled 10 million cash register transactions – almost 5,000 items per second. [2]
- United Parcel Service receives on average 39.5 million tracking requests from customers per day. [3]
- VISA processes more than 172,800,000 card transactions each day. [4]
- 500 million tweets are sent per day. That's more than 5,700 tweets per second. [5]
- Facebook has more than 1.15 billion active users generating social interaction data. [6]
- More than 5 billion people are calling, texting, tweeting and browsing websites on mobile phones. [7]
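The per-second figures quoted in the list follow from dividing each count by the length of its time window. The sketch below reproduces that arithmetic in Python. The helper name `per_second` is illustrative only, and the inputs are the rounded figures from the list, so the results are approximate averages rather than measured rates.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(count, window_seconds=SECONDS_PER_DAY):
    """Average number of events per second over a time window."""
    return count / window_seconds


# Inputs are the rounded figures quoted in the list above.
tweets_per_second = per_second(500_000_000)
visa_per_second = per_second(172_800_000)
# Walmart's 10 million register transactions were spread over four hours.
walmart_tx_per_second = per_second(10_000_000, window_seconds=4 * 60 * 60)

print(f"Tweets: {tweets_per_second:,.0f} per second")
print(f"VISA: {visa_per_second:,.0f} card transactions per second")
print(f"Walmart: {walmart_tx_per_second:,.0f} register transactions per second")
```

Note that the Walmart bullet quotes items per second, not transactions per second. Taken at face value, about 694 transactions and almost 5,000 items per second would imply roughly seven items per transaction.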
The Importance of Big Data and What You Can Accomplish
The real issue is not that you are acquiring large amounts of data. It's what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smarter business decision making. For instance, by combining big data and high-powered analytics, it is possible to:
- Determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually.
- Optimize routes for many thousands of package delivery vehicles while they are on the road.
- Analyze millions of SKUs to determine prices that maximize profit and clear inventory.
- Generate retail coupons at the point of sale based on the customer's current and past purchases.
- Send tailored recommendations to mobile devices while customers are in the right area to take advantage of offers.
- Recalculate entire risk portfolios in minutes.
- Quickly identify customers who matter the most.
- Use clickstream analysis and data mining to detect fraudulent behavior.
Case Study: Big Data at UPS
UPS is no stranger to big data, having begun to capture and track a variety of package movements and transactions as early as the 1980s. The company now tracks data on 16.3 million packages per day for 8.8 million customers, with an average of 39.5 million tracking requests from customers per day. The company stores more than 16 petabytes of data.
Much of its recently acquired big data, however, comes from telematics sensors in more than 46,000 vehicles. The data on UPS trucks, for example, includes their speed, direction, braking and drive train performance. The data is used not only to monitor daily performance but also to drive a major redesign of UPS drivers' route structures. This initiative, called ORION (On-Road Integrated Optimization and Navigation), is arguably the world's largest operations research project. It also relies heavily on online map data, and will eventually reconfigure a driver's pickups and drop-offs in real time. The project has already led to savings in 2011 of more than 8.4 million gallons of fuel by cutting 85 million miles off daily routes. UPS estimates that saving only one daily mile driven per driver saves the company $30 million, so the overall dollar savings are substantial. The company is also attempting to use data and analytics to optimize the efficiency of its 2,000 aircraft flights per day. [3]
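As a rough sanity check, the UPS figures quoted in this case study can be combined into two simple ratios. This is a sketch of the arithmetic only. The variable names are illustrative, the inputs are the rounded figures from the text, and the results are implied averages, not numbers reported by UPS.

```python
# Rounded figures quoted in the case study above.
packages_per_day = 16_300_000
tracking_requests_per_day = 39_500_000
miles_cut_in_2011 = 85_000_000
gallons_saved_in_2011 = 8_400_000

# Average number of tracking requests made for each package.
requests_per_package = tracking_requests_per_day / packages_per_day

# Miles avoided for every gallon of fuel saved (an implied fleet fuel economy).
miles_per_gallon_saved = miles_cut_in_2011 / gallons_saved_in_2011

print(f"Tracking requests per package: {requests_per_package:.1f}")
print(f"Miles avoided per gallon saved: {miles_per_gallon_saved:.1f}")
```

The two ratios come out at roughly 2.4 tracking requests per package and about 10 miles avoided per gallon saved.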
Many organizations are concerned that the amount of amassed data is becoming so large that it is difficult to find the most valuable pieces of information.
- What if your data volume gets so large and varied you don't know how to deal with it?
- Do you store all your data?
- Do you analyze it all?
- How can you find out which data points are really important?
- How can you use it to your best advantage?
Until recently, organizations have been limited to using subsets of their data, or they were constrained to simplistic analyses because the sheer volumes of data overwhelmed their processing platforms. But what is the point of collecting and storing terabytes of data if you can't analyze it in full context, or if you have to wait hours or days to get results? On the other hand, not all business questions are better answered by bigger data. You now have two choices:
- Incorporate massive data volumes in analysis. If the answers you're seeking will be better provided by analyzing all of your data, go for it. High-performance technologies that extract value from massive amounts of data are here today. One approach is to apply high-performance analytics to analyze the massive amounts of data using technologies such as grid computing, in-database processing and in-memory analytics.
- Determine upfront which data is relevant. Traditionally, the trend has been to store everything (some call it data hoarding) and only when you query the data do you discover what is relevant. We now have the ability to apply analytics on the front end to determine relevance based on context. This type of analysis determines which data should be included in analytical processes and what can be placed in low-cost storage for later use if needed.
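The second choice, deciding relevance up front, can be pictured as a triage step that runs before any analysis. The sketch below is a minimal illustration, not a description of any SAS product. The `Record` fields, the `is_relevant` rule (source and age) and the sample data are all hypothetical stand-ins for whatever context a real relevance model would use.

```python
from dataclasses import dataclass


@dataclass
class Record:
    source: str     # hypothetical: which system produced the record
    age_days: int   # hypothetical: how old the record is
    payload: str


def is_relevant(record, wanted_sources, max_age_days):
    """Hypothetical relevance rule: keep recent records from wanted sources."""
    return record.source in wanted_sources and record.age_days <= max_age_days


def triage(records, wanted_sources, max_age_days):
    """Split records into those to analyze now and those to archive cheaply."""
    analyze, archive = [], []
    for record in records:
        if is_relevant(record, wanted_sources, max_age_days):
            analyze.append(record)
        else:
            archive.append(record)
    return analyze, archive


records = [
    Record("pos", 2, "sale #1"),
    Record("web", 400, "old clickstream"),
    Record("sensor", 1, "temperature reading"),
    Record("pos", 900, "archived sale"),
]
analyze, archive = triage(records, wanted_sources={"pos", "sensor"}, max_age_days=30)
print(f"{len(analyze)} records to analyze, {len(archive)} records to archive")
```

Archived records are kept rather than discarded, which matches the idea of placing less relevant data in low-cost storage for later use if needed.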
"Now you can run hundreds and thousands of models at the product level – at the SKU level – because you have the big data and analytics to support those models at that level."
A number of recent technology advancements enable organizations to make the most of big data and big data analytics:
- Cheap, abundant storage.
- Faster processors.
- Affordable open source, distributed big data platforms, such as Hadoop.
- Parallel processing, clustering, MPP, virtualization, large grid environments, high connectivity and high throughputs.
- Cloud computing and other flexible resource allocation arrangements.
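Distributed platforms such as Hadoop split a large job into independent pieces, process the pieces in parallel and then combine the partial results. The sketch below shows that split, map and reduce structure on a single machine, using a Python thread pool as a stand-in for a cluster. The function names and sample lines are illustrative, and the example is meant to show the shape of the computation, not its performance or Hadoop's actual API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, workers=4):
    # Split: deal the lines out into one chunk per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    # Map: process every chunk concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    # Reduce: merge the partial counts into a single result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["big data big analytics", "data volume velocity variety", "big value"]
totals = parallel_word_count(lines, workers=2)
print(totals.most_common(2))
```

Because each chunk is counted independently, the same structure scales from threads on one machine to processes spread across many machines, which is what clustered and grid environments provide.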
The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for better decision making.
Solutions from SAS
How can you make the most of all that data, now and in the future? It is a twofold proposition. You can only optimize your success if you weave analytics into your solution. But you also need analytics to help you manage the data itself. There are several key technologies that can help you get a handle on your big data, and more importantly, extract meaningful value from it.
- Data management. Many vendors look at big data as a discussion related to technologies such as Hadoop, NoSQL, etc. SAS takes a more comprehensive data management/data governance approach by providing a strategy and solutions that allow any amount of data to be managed and used effectively.
- High-performance analytics. By taking advantage of parallel processing power, high-performance analytics lets you do things you never thought possible because the data volumes were just too large to handle efficiently. Now you can.
- High-performance data visualization. With high-performance visualizations, you can explore huge volumes of data in mere seconds so you can quickly identify opportunities for further analysis.
- Flexible deployment options for big data. Flexible deployment models bring choices. High-performance analytics from SAS can analyze billions of variables, and those solutions can be deployed in the cloud (with SAS or another provider), on a dedicated appliance or within your existing IT infrastructure, whichever best suits your requirements.
[1] Source: META Group. "3D Data Management: Controlling Data Volume, Velocity, and Variety." February 2001.
[2] Source: http://news.walmart.com/news-archive/2012/11/23/walmart-us-reports-best-ever-black-friday-events
[3] Source: Thomas H. Davenport and Jill Dyche, "Big Data in Big Companies," May 2013.
[4] Source: https://en.bitcoin.it/wiki/Scalability
[5] Source: http://www.complex.com/tech/2012/10/twitter-ceo-dick-costolo-reveals-staggering-number-of-tweets-per-day
[6] Source: http://expandedramblings.com/index.php/by-the-numbers-17-amazing-facebook-stats/
[7] Source: http://www.itu.int/en/ITU-D/Statistics/Pages/stat/default.aspx