Data integration: It ain't what it used to be
By: David Loshin, President of Knowledge Integrity
Data integration involves combining multiple sources of data to present unified results. The term data integration used to refer to a specific set of processes for data warehousing called “extract, transform, load,” or ETL. ETL generally consisted of three phases:
- Extracting data from multiple sources and moving it to a staging area.
- Applying a series of transformations, including data standardization and cleansing (where data values are mapped to corresponding standard formats) – followed by reorganizing the data into a format suitable for loading into a target data warehouse.
- Loading the transformed data into an analytical data warehouse environment.
For many data warehousing professionals, the phrase data integration is synonymous with ETL. Over time, though, the techniques and practices used for moving data from original sources into a data warehouse have been applied to many other data management scenarios. Today, the concept of data integration is much broader in scope. And, frankly, it’s more robust than its limited use for data warehouse populations.
An evolution
One of the first innovative twists was to reconsider the traditional order of operations. Instead of extracting, transforming and loading, some environments opted to extract the data, load it into the target environment and then apply the transformations. This approach, dubbed “ELT” (extract, load, transform), not only removes the need for an intermediate staging platform – it also enables more consistent transformations when all of the sourced data sets are available for review at the same time within the data warehouse context. In addition, the ELT approach accommodates inclusion and transformation of data from real-time data sources along with conventionally produced data extracts.
Yet the volumes of both structured and unstructured data continue to explode as the number of real-time data streams grows. In turn, the practices of data integration have expanded to incorporate a richer, more dynamic set of capabilities that support both data warehousing and analytical needs as well as growing numbers of data applications for operational processes. These processes are increasingly data driven (such as just-in-time manufacturing, real-time insurance claims processing and Internet of Things applications).
Download a paper about data integration
Data integration is essential to any business that has to manage data. With an abundance of big data and emerging technologies like IoT, some say it’s more important than ever. Learn how data integration is evolving, and how you can keep pace.
Modern data integration
In contrast to the traditional approach of ETL, data integration today encompasses holistic approaches to data accessibility, availability and movement – that is, the way data is moved from one location to another. A modern data integration practice embraces additional processes for understanding how source data objects are introduced into the environment, how they move across the organization, how information is used by different consumers, what types of transformations are applied along the way, and how to ensure interpretation consistency across different business functions. In essence, data integration products enable you to customize data system solutions that channel the flow of data from producers to consumers.
Aside from the traditional methods for standardization, cleansing and transformation, today’s data integration often includes many other capabilities, like those described next.
Data flow modeling
These techniques and tools are used to document data lineage. That includes how data objects move from their origination points across all the touch points for reading and updates, and the ways those data objects are delivered to downstream consumers. Many data integration products provide data flow modeling capabilities that display data lineage, and even provide searching and impact analysis related to specific data elements and values.
Data quality control
External standards and regulations often impose tight restrictions on the use and accessibility of data – such as the level of personal data protection required by the European General Data Protection Regulation, or GDPR. This has driven the need for continuous vigilance in enforcing data quality. It has also motivated interest in embedding data validation, monitoring and notifications of missed expectations directly into information flows. In response to these needs, more data integration tools are adding data quality controls that can be directly incorporated into business applications.
Data virtualization and data federation
Growing interest in data accessibility has led application designers to rethink their approaches to data availability, especially as rampant data copying creates numerous data replicas of varying consistency and timeliness. An attractive alternative is to leave the data objects in their original locations and use data virtualization techniques to create a semantic representative model layered on top of federated data access services that access data in their original locations. These capabilities reduce data replication while increasing data reuse.
In contrast to the traditional approach of ETL, data integration today encompasses holistic approaches to data accessibility, availability and movement – that is, the way data is moved from one location to another. David Loshin President Knowledge Integrity
Change data capture
Even in cases where data extracts have been provided, it’s possible to reduce the amount of data required to maintain consistency by using change data capture (CDC). CDC is a data integration method that monitors changes to the source data systems and propagates changes along to any replicated databases.
Data protection
Data protection methods, such as encryption at rest, encryption in motion and data masking, simply observe policies for preventing unnecessary exposure of personally identifiable information (PII). Because these means of protection are applied as data objects move from one point to another, they are increasingly part of a data integration toolset.
Data streaming and integrated business rules
The dramatic rise in analytics is influencing all data integrators to ingest and process streaming data. Streaming data integration differs from conventional data integration in that “chunks” of the data streams are processed in time windows. There are certainly some limitations on the ability to apply sets of transformations to the entire data set at one time. But integrated business rules can be applied to the data objects in real time to achieve some – if not all – of the necessary transformations prior to downstream data consumption.
Data catalogs and data services
As more organizations ingest larger volumes of both structured and unstructured data, there is growing interest in moving acquired data into a data lake that’s built using an underlying object store (which has custom metadata). To accommodate different consumer communities, organizations are using data catalogs to inventory the available data sets and register developed data services that can be used to access those managed data assets.
Today’s data integration: Know your options
When considering options for data integration tools and technologies, today’s hybrid data processing environments are much more complex than those from the good old days. Conventional servers are being linked to big data analytics platforms, and we increasingly see data situated both on-site and in the cloud. There’s also reliance on a growing number of “as-a-service” offerings to manage a wide range of corporate data assets.
Want to determine the best ways modern data integration technologies can support your data processing needs? Make sure you understand the full range of expectations your users have about how they ought to be able to access and use data.
About the Author
David Loshin is the president of Knowledge Integrity Inc., a consulting, training and development services company that works with clients on business intelligence, big data, data quality, data governance and master data management initiatives. Loshin writes for many industry publications, including the Data Roundtable, TDWI Upside, and several TechTarget websites; he also speaks frequently at conferences and creates and teaches courses for The Data Warehousing Institute and other educational organizations. In addition, Loshin is the author of numerous books, including Big Data Analytics and Business Intelligence, Second Edition: The Savvy Manager's Guide.Recommended reading
- Article Marketers and privacyCompanies that break new privacy laws get fined heavily, but there may be an even greater cost: A damaged reputation.
- Article The untapped potential in unstructured textText is the largest human-generated data source. It grows every day as we post on social media, interact with chatbots and digital assistants, send emails, conduct business online, generate reports and essentially document our daily thoughts and activities using computers and mobile devices.
- Article Data lineage: Making artificial intelligence smarterLear how data lineage plays a vital role in understanding data, making it a foundational principle of AI.
- Article A guide to machine learning algorithms and their applicationsDo you know the difference between supervised and unsupervised learning? How about the difference between decision trees and forests? Or when to use a support vector algorithm? Get all the answers here.