Data preparation is the task of blending, shaping and cleansing data to get it ready for analytics or other business purposes. But what exactly does data preparation involve? How does it intersect with or differ from other data management functions and data governance activities? How does doing it well help business and IT users – and the organization overall?
Data preparation is a formal component of many enterprise systems and applications maintained by IT, such as data warehousing and business intelligence. But it’s also an informal practice conducted by the business for ad hoc reporting and analytics, with IT and more tech-savvy business users (e.g., data scientists) routinely burdened by requests for customized data preparation. These days there’s growing interest in empowering business users with self-service tools for data preparation – so they can access and manipulate data sources on their own, without technical proficiency in SQL, Python or SAS® code.
One way to understand the ins and outs of data preparation is by looking at these five D’s: discover, detain, distill, document and deliver. Let’s examine these aspects in more detail.
Data Preparation Challenges Facing Every Enterprise
Ever wanted to spend less time getting data ready for analytics and more time analyzing the data? This e-book shares best practices for speeding and simplifying the process of data preparation.
Download the free e-book
Discover is about finding the data best-suited for a specific purpose. Many users cite this as a frustrating and time-consuming exercise. An essential enabler of efficient discovery is the creation and maintenance of a comprehensive, well-documented data catalog (i.e., metadata repository). In addition to data profiling statistics and other contents, the data catalog provides a descriptive index pointing to the location of available data.
Data profiling is key to unlocking a better understanding of this data, because it provides high-level statistics about the data’s quality (such as row counts, column data types, min, max and median column values, and null counts). That makes it easier to choose when multiple applicable data sets are discovered.
It’s important to note that discovery should not be just about finding the data that’s needed right now. It should also make it easier to find data later, when similar needs arise. As the enterprise encounters new data sources, especially those external to the organization, the data catalog should be updated. That’s true even if no further data preparation steps are done immediately.
Manual data preparation, often by spreadsheet wranglers, is not only time-consuming but often redundant. That’s because different users (or even the same user) may perform the same work – and they don’t necessarily generate the same results each time.
Detain is about collecting the data selected during discovery. The term “detain” conjures the image of temporarily imprisoning a copy of the data that feeds the rest of the preparation process. For too many organizations, the cells of a spreadsheet permanently detain data both during and after preparation. A temporary staging area or workspace is required for the processing that happens in the “distill” step of data preparation. When persistent detention of intermediate or delivered data is required, it should make use of shared and managed storage – a relational database, network file system or big data repository such as a Hadoop-enabled data lake. An emerging technique here involves the use of in-memory storage areas (or alternatively the cloud) for much faster real-time blending and shaping of the data before it’s sent on to other processes.
Distill is about refining the data collected during the detain phase of data preparation. In the process of refining data, you must determine how fit the data is for its intended purpose, or use. This is an overlapping function of data quality – making data quality integral to data preparation. The extent to which data quality functions such as validation, deduplication and enrichment can be performed is often determined by the ability to reuse components from other implementations.
In data warehousing and business intelligence, for example, transformations and data quality rules are applied while integrating multiple data sources into a single data model optimized for querying and standard reporting. The point is: Don’t reinvent the wheel, reuse what’s around. The more reusable your distillation processes are, the less reliant your business will be on IT to build custom processes.
Ideally, the enterprise should strive for making data quality components a library of functions and repository of rules that can be reused to cleanse data. Distillation may also include filtering and aggregating data to create customized views or alter levels of presentation detail. Some analytical tools enable this aspect of distillation via in-memory data manipulation to avoid persistent storage of alternative views of the delivered data.
Document is about recording both business and technical metadata about discovered, detained and distilled data. This includes:
- Technical definitions.
- Business terminology.
- Source data lineage.
- History of changes applied during distillation.
- Relationships to other data.
- Data usage recommendations.
- Associated data governance policies.
- Identified data stewards.
All of this metadata is shared via the data catalog. Manual data preparation, often by spreadsheet wranglers, is not only time-consuming but often redundant. That’s because different users (or even the same user) may perform the same work – and they don’t necessarily generate the same results each time. Shared metadata enables data preparation to be completed faster, and consistently repeated. Shared metadata also enables efficient collaboration when multiple users are involved in different aspects of data preparation.
Deliver is about structuring distilled data into the format needed by the consuming process or user. The delivered data set(s) should also be evaluated for persistent detention and, if detained, the supporting metadata should be added to the data catalog. These steps allow the data to be discovered by other users.
Delivery must also abide by data governance policies, such as those minimizing the exposure of sensitive information. It’s important to note this may not be a one-time delivery. Iterative delivery of new or changed data may require the data preparation to be run as a scheduled or on-demand process. In addition, use of delivered data should be monitored – and unused data should be deleted – after a set period of time (and the corresponding entries in the data catalog should be removed).
Make data preparation a repeatable process
Data preparation needs to become a formalized enterprise best practice. Shared metadata, persistent managed storage and reusable transformation/cleansing logic will make data preparation an efficient, consistent and repeatable process. In turn, it will become easier for users to find relevant data – and they’ll be armed with the knowledge they need to quickly put that data to use. By providing self-service data preparation tools, business users can work with data on their own and free up IT to work on other tasks. In the process, the entire organization becomes more productive.
Organizations are best served when the components of data preparation can be more readily reused and deployed into operational systems and production areas. Because data preparation enables you to discover, detain, distill, document and deliver data, it empowers the entire enterprise to make the most of all its valuable data assets.
About the author
Jim Harris is a recognized data quality thought leader with 20 years of enterprise data management industry experience. Jim is an independent consultant, speaker, and freelance writer. Jim is the Blogger-in-Chief at Obsessive-Compulsive Data Quality, an independent blog offering a vendor-neutral perspective on data quality. Jim is the host of the popular podcast OCDQ Radio, and is very active on Twitter, where you can follow him @ocdqblog.
- Article What is a data lake and why does it matter?A data lake is a storage repository that quickly ingests large amounts of raw data in its native format. As containers for multiple collections of data in one convenient location, data lakes allow for self-service access, exploration and visualization. In turn, businesses can see and respond to new information faster.
- Article GDPR and AI: Friends, foes or something in between?The GDPR may not be best buddies with artificial intelligence – but GDPR and AI aren't enemies, either. Kalliopi Spyridaki explains the tricky relationship between the two.
- Article Data management backgrounderFrom data integration to data quality and data preparation, find out what these terms mean and why they’re so important for your analytics projects.
- Article How openness can supercharge event stream analyticsWhat does openness do for event stream analytics? David Loshin shows how it helps you speed and govern the full streaming analytics life cycle.