Free ebook

What is data profiling and how does it make big data easier?

John Bauman, SAS Insights Editor

Data profiling, the act of monitoring and cleansing data, is an important tool organizations can use to make better data decisions. Learn how it helps with data problems – big and small.

In our increasingly connected world, the amount of data – and the sources of this data – continue to rise. Data profiling is an often-visual assessment that uses a toolbox of business rules and analytical algorithms to discover, understand and potentially expose inconsistencies in your data. This knowledge is then used to improve data quality as an important part of monitoring and improving the health of these newer, bigger data sets.

The need for data profiling is only going to grow. Corporate data warehouses must interact with increasingly diverse and intimidatingly large sets of data from different sources like blogs, social media and emerging big data technologies like Hadoop. In the industrial world, the Internet of Things introduces a multitude of devices generating data, while organizations can access data from biometrics and human-generated sources like email and electronic medical records.

The amount of data is only one side of the equation – data quality is important, too. Data that isn’t formatted right, standardized or correctly integrated with the rest of the database can cause delays and problems that lead to missed opportunities, confused customers and bad decisions.

Data profiling helps you to get ahead of these issues. By ensuring that you run a diagnosis and examine the data that you have, you can proactively create a plan to fix many of your data problems and clean up your data warehouse before they can affect your organization.

Data Management for Artificial Intelligence

Machine learning systems don’t just extract insights from the data they are fed, as traditional analytics do. They actually change the underlying algorithm based on what they learn from the data. Now is the time for executives, particularly the chief data officer, to decide on data management strategy, technology and best practices that will be essential for continued success.

Download the e-book

Why do you need data profiling?

Data profiling helps you discover, understand and organize your data. It should be an essential part of how your organization handles its data for several reasons.

First, data profiling helps cover the basics with your data, verifying that the information in your tables matches the descriptions. Then it can help you better understand your data by revealing the relationships that span different databases, source applications or tables.

Beyond just uncovering hidden nuggets of information buried within your own data, data profiling helps you ensure that your data is up to standard statistical measures, as well as business rules specific to your company. For example, a state column might use a combination of both two-letter codes and the fully spelled out (sometimes incorrectly) name of the state. Data profiling would uncover this inconsistency and inform the creation of a standardization rule that could make them all consistent, two-letter codes.

What are the different kinds of data profiling?

Many of the data profiling techniques or processes used today fall into three major categories: structure discovery, content discovery and relationship discovery. The goals, though, are consistent – improving data quality and gaining more understanding of the data.

Structure discovery, also known as structure analysis, validates that the data that you have is consistent and formatted correctly. There are several different processes that you can use for this, such as pattern matching. For example, if you have a data set of phone numbers, pattern matching helps you find the valid sets of formats within the data set. Pattern matching also helps you understand whether a field is text- or number-based along with other format-specific information.

Structure discovery also examines simple basic statistics in the data. By using statistics like the minimum and maximum values, means, medians, modes and standard deviations, you can gain insight into the validity of the data.

Content discovery is the process of looking more closely into the individual elements of the database to check data quality. This can help you find areas that contain null values or values that are incorrect or ambiguous.

Many data management tasks start with an accounting for all the inconsistent and ambiguous entries in your data sets. The standardization process in content discovery plays a major role in fixing these little problems. For example, finding and correcting your data to fit street addresses into the correct format is an essential part of this step. The potential problems that could arise from non-standard data, like being unable to reach customers via mail because the data set includes incorrectly formatted addresses, are costly and can be addressed early in the data management process.

Finally, relationship discovery involves discovering what data is in use and trying to gain a better understanding of the connections between the data sets. This process starts with metadata analysis to determine key relationships between the data and narrows down the connections between specific fields, particularly where the data overlaps. This process can help cut down on some of the problems that arise in your data warehouse or other data sets when data is not aligned.

Data profiling can be used to troubleshoot problems within even the biggest data sets.

What are some data profiling techniques?

There are four general methods by which data profiling tools help accomplish better data quality: column profiling, cross-column profiling, cross-table profiling and data rule validation.

Column profiling scans through a table and counts the number of times each value shows up within each column. This method can be useful to find frequency distribution and patterns within a column of data.

Cross-column profiling is made up of two processes: key analysis and dependency analysis. Key analysis examines collections of attribute values by scouting for a possible primary key. Dependency analysis is a more complex process that determines whether there are relationships or structures embedded in a data set. Both techniques help analyze dependencies among data attributes within the same table.

Cross-table profiling uses foreign key analysis, which is the identification of orphaned records and determination of semantic and syntactic differences, to examine the relationships of column sets in different tables. This can help cut down on redundancy but also identify data value sets that could be mapped together.

Finally, data rule validation uses data profiling in a proactive manner to verify that data instances and data sets conform with predefined rules. This process helps find ways to improve data quality and can be achieved either through batch validation or an ongoing validation service.

What are some examples of data profiling in use today?

Data profiling can be used to troubleshoot problems within even the biggest data sets by first examining metadata. For example, by using SAS metadata and data profiling tools with Hadoop, you can troubleshoot and fix problems within the data to find the types of data that can best contribute to new business ideas.

Data profiling is a critical component of implementing a data strategy, and informs the creation of data quality rules that can be used to monitor and cleanse your data. Organizations can make better decisions with data they can trust, and data profiling is an essential first step on this journey.