Nine data prep lessons for advanced analytics
You may be ready for advanced analytics, but is your data? Follow this checklist to get started.
According to a recent survey from TDWI Research, 38 percent of organizations are practicing advanced analytics today, whereas 85 percent say they will be within three years. Why the dramatic upsurge in advanced analytics? It’s driven by organizations’ need to understand constantly changing business environments (as seen in the recent recession and its aftermath), as well as to discover opportunities for cost reductions and new sales targets (keys to surviving a down economy).
Organizations will face challenges as they move into advanced analytics. Many don’t understand that reporting and analytics are different practices, often with different data requirements. This checklist is designed to help by listing and explaining data requirements that are unique to advanced analytics.
Determine what you want to accomplish with advanced analytics
Here are just a few of the possibilities:
Discover relationships – Advanced analytics can help organizations discover and quantify important relationships to help in such things as reducing fraud, optimizing marketing campaigns and improving product quality.
Anticipate the future – Predictive analytics can produce scores and statistics to help predict the outcomes of certain situations, such as when a customer might churn, and predictive models can assist with forecasting.
Understand and adapt to change – Advanced analytics can help you under-stand change in the form of rising costs or new customer behaviors. Discoveries made through analytics can lead to positive changes that help businesses adapt to an evolving world, but this level of discovery requires specialized analytic tools and analytic databases.
Scale up data integration to handle large analytical data volume
Many analytics databases begin with multiple terabytes, and data loading must scale up to handle large data volumes and load quickly. Large data extracts from operational systems must be as non-intrusive as possible. Over time, analytics should help fine-tune data collection so that only the right data is collected.
Extract, transform and load (ETL) is the preferred method of data integration for data warehousing. With data sets for advanced analytics, it makes sense to rearrange the acronym to ELT to make data available for preliminary analysis immediately. Incremental transformations, when available, can be done as needed within the enterprise data warehouse (EDW) or analytics database.
Realize that reporting and analytics have different purposes and needs
Many people confuse reporting and analytics. Predictive analytics are more exploratory and forward-looking than reporting and OLAP. The value of predictive analytics is discovery achieved through data-sampling best practices that ensure predictive models generalize to new data – future data that is yet to be collected – giving the models broad usage over a long life cycle. Reporting is based on well-known summarized information. OLAP is implemented as a form of parameterized reporting, where the parameters represent dimensions with limited analysis breadth.
Distinguish between data warehouses, data marts and analytical databases
An enterprise data warehouse (EDW) is a data warehouse that manages data for multiple lines of business, supports multiple data processing workloads and manages data in multiple data structures and models. The EDW is optimal to support any BI and advanced analytics practice. An EDW can handle both query-intense and predictive-scoring workloads and manage the low-level, detailed data that advanced analytics often require.
Organizations with a warehouse focused on reporting and OLAP need to extend it with a separate analytics database to support an analytics workload and appropriate data.
Design a data warehouse architecture that accommodates analytics
Analytics processed within the EDW – This is the ideal, but analytical tools require users to dump data into flat files – the data structure for which the tool is optimized. Today, in-database analytics allow users to manage data within a data warehouse for analysis. In-database predictive analytics require the data warehouse to support the scoring of predictive models deployed to it.
Analytics sandboxes – Data warehouses that support in-database analytics should also support analytics sandboxes (analytic databases set up within the EDW sourced with data requested by a user), so that users can work within the sandbox without unpredictable performance hits on the EDW.
Analytics databases outside the EDW – Rogue data marts and spreadsheets often proliferate outside the EDW. A new best practice is to selectively isolate disruptive analytics workloads on data marts and other analytics databases outside the EDW.
Prepare data to meet the needs of the analytics method you’ve chosen
Online analytical processing (OLAP) – OLAP’s purpose is to quickly answer multidimensional queries of summarized data. This provides good query performance, but limits queries to the data, dimensions and simple summary statistics of one or more cubes.
Query-based analytics – Query-based analytical methods depend heavily on structured query language (SQL). Users gather large volumes of operational data and load it into a data warehouse, or analytics database, to analyze it quickly in reaction to sudden business changes. This works well when the SQL processing is executed on an MPP (massively parallel processing) database platform that’s built for complex queries.
Predictive analytics – Predictive analytics demand specific data structures. Some tools have multiple algorithms, each with a unique data requirement. Most algorithms are optimized to run fast and accurately with a flat record structure of hundreds of fields, and perhaps millions of records.
Preserve analytical data’s rich details because they enable discovery
Analytical discovery depends on data nuggets: Discovery-oriented analytical applications depend on large amounts of raw source data. Data set size, and the details within the raw source data, are integral to clustering and relationship definitions produced by advanced analytics. If you strip out the details during data preparation, analytical discovery is hamstrung from the beginning.
Analytics data can also be unstructured: There is usually a need for text mining to discover useful facts that can be converted to structured data. Text mining output commonly feeds into predictive analytics, enhancing the accuracy of predictive models.
Data from an EDW can be analytical, too, of course: Advanced analytics require new data sets, but they can also tap a data warehouse to provide new discoveries with historical context. Discovered facts can be added back to the warehouse.
Improve data after working with it, not before
The perils of improving analytics data: As analysts understand large data sets, they may develop some data integration or data quality routines to improve the data. Improvements to data should occur only after business analysts have worked with it so that details that discovery-oriented analytics depend upon are not lost.
Data quality for analytics databases: Advanced analytics has the potential to identify useful information from data perceived to be poor. Advanced analytics should assess outliers prior to the application of standard data quality routines. Filling in missing field values may mask useful information. Some anomalies should not be fixed, but documented in metadata, so anyone using the data set knows of their presence and how to compensate for it.
Data modeling for analytics databases: Remodeling data can speed up queries and enable multidimensional views, but modeling can also remove data details and limit the scope of queries. Data sets intended for analytical discovery should be treated to data modeling if it is truly required and adds value.
Apply the products of advanced analytics to BI and data warehouse activities
Ultimately, the information and insights inferred from the application of advanced analytics should be applied to standard EDW data to enhance BI activities. Standard reports and ad hoc queries are necessary to run an organization, but not sufficient to propel the organization forward. Advanced analytics enable continuous learning and improvement, as they make the most of organizational data assets.
Philip Russom is the Senior Manager of TDWI Research at The Data Warehousing Institute (TDWI). He has been an industry analyst at Forrester Research, Giga Information Group and Hurwitz Group, specializing in BI issues.
Phillip Russom, The Data Warehousing Institute
- Check out this blog post on the Applying Business Analytics Webinar Series
- View the webinar series