Data lineage: Making artificial intelligence smarter
Jim Harris, Blogger-in-Chief at Obsessive-Compulsive Data Quality (OCDQ)
Imagine you work in an office building in the bustling center of a large city. On your lunch break, you go for a walk to get some exercise and clear your head. Half an hour later, you realize that you haven’t been paying attention to your surroundings and don’t know where you are – but you need to get back to the office quickly. You pull out your smartphone and use a few trusty GPS-enabled apps to see your exact location, the path you took to get there and the fastest route back to the office. You even get some recommendations for quick lunch stops along the way. That’s a good analogy for data lineage, which details the journey data took to get from where it started to where it is now. These days, data lineage is particularly important in the context of artificial intelligence (AI). But before we delve into that, let’s look at a few definitions.
Data lineage defined
As it traces data’s path from its origins to the current location, data lineage shows many important details. These include technical, business and operational metadata – information that describes the following items:
- Origins. Data lineage shows where and when data was created or captured, and where it is stored and maintained. This applies to both internal and external data sources.
- Characteristics. What the data means in business and technical terms is known as its characteristics. Business metadata provides a glossary of human language descriptions of data that business users understand. Technical metadata provides the language that data models, applications and their proprietary interfaces use to describe the data and its structure.
- Relationships. This shows how the data is related, both within itself (e.g., hierarchies) and to other data – including key-based relationships, associations, dependencies, copies or derivatives.
- Movements. Movement is all about where the data has been. In today’s hybrid data ecosystems, data moves around a lot in multiplatform environments, from source to staging and sandboxes, to data warehouses and data lakes, and into analytics tools and reports that provide business intelligence. This point-to-point data flow – or data integration from source to current reference point to all destinations beyond – must be fully mapped to reflect a true sense of direction regarding data’s movement.
- Processes. It’s important to know what processes the data passed through that may have influenced its values, formatting or filtering, such as data quality, modeling, preparation and integration.
- Transformations. This refers to how data was altered during its journey. This includes translations, transformations, data quality rules, data quality test results and reference data values.
- Users. This relates to who or what uses the data. Which people and tools have access to the data and for what reasons? When and how often is the data consumed by these users?
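To make these metadata categories concrete, here is a minimal sketch of how one dataset's lineage might be recorded. The `LineageRecord` class, its field names and the sample values are all hypothetical and chosen only for illustration; real lineage tools define their own schemas.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class LineageRecord:
    """Hypothetical container for the lineage metadata of one dataset."""
    dataset: str
    origin: str                       # origins: where the data was created or captured
    created_at: datetime              # origins: when it was created or captured
    business_description: str         # characteristics: business glossary text
    technical_schema: Dict[str, str]  # characteristics: column name -> type
    related_to: List[str] = field(default_factory=list)       # relationships
    movements: List[str] = field(default_factory=list)        # ordered hops, source first
    processes: List[str] = field(default_factory=list)        # e.g. data quality, preparation
    transformations: List[str] = field(default_factory=list)  # rules applied along the way
    users: List[str] = field(default_factory=list)            # people and tools consuming it

    def path(self) -> str:
        """Render the movement history as a readable route ending at this dataset."""
        return " -> ".join(self.movements + [self.dataset])


# Illustrative values only; none of these names come from a real system.
record = LineageRecord(
    dataset="sales_report",
    origin="crm_export",
    created_at=datetime(2024, 1, 15),
    business_description="Monthly sales totals by region",
    technical_schema={"region": "string", "total": "decimal"},
    movements=["crm_export", "staging", "warehouse"],
    transformations=["currency converted to USD"],
    users=["finance_dashboard"],
)
print(record.path())  # crm_export -> staging -> warehouse -> sales_report
```

Keeping the movements as an ordered list is one simple design choice; a full lineage system would more likely store them as edges in a graph so that branches and merges can be represented.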
Data lineage provides a complete audit trail for data, which is increasingly important for compliance with regulations such as the EU GDPR. Data lineage enables you to trace data quality issues and other errors back to their root cause and perform impact analysis on proposed changes. As it links data in disparate systems at a logical level by showing how metadata is connected, data lineage helps identify business rule discrepancies and data incompleteness. Data lineage also helps data stewards react to issues before they become a problem, define strategies for data quality improvement and promote effective reuse of existing information.
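Root-cause tracing and impact analysis can both be viewed as walks over a lineage graph: upstream to find where a problem might have started, downstream to find what a change would touch. The sketch below assumes lineage is available as a simple mapping of datasets to the datasets they feed. The `FEEDS` mapping, the dataset names and the helper functions are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets listed as its value.
FEEDS = {
    "crm_export": ["staging_customers"],
    "web_logs": ["staging_events"],
    "staging_customers": ["warehouse_customers"],
    "staging_events": ["warehouse_customers"],
    "warehouse_customers": ["churn_model", "sales_report"],
}


def reachable(start, edges):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges):
    """Flip edge direction so the same walk can go upstream."""
    flipped = {}
    for src, targets in edges.items():
        for tgt in targets:
            flipped.setdefault(tgt, []).append(src)
    return flipped


def impact_of(dataset):
    """Downstream datasets affected by a change to dataset."""
    return reachable(dataset, FEEDS)


def sources_of(dataset):
    """Upstream datasets that could be the root cause of an issue in dataset."""
    return reachable(dataset, invert(FEEDS))


print(sorted(impact_of("staging_customers")))
# ['churn_model', 'sales_report', 'warehouse_customers']
print(sorted(sources_of("sales_report")))
# ['crm_export', 'staging_customers', 'staging_events', 'warehouse_customers', 'web_logs']
```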
Defining artificial intelligence in the context of lineage
Artificial intelligence (AI) is an umbrella term that covers a variety of techniques and approaches that make it possible for machines to learn, adjust and act with intelligence comparable to the natural intelligence of humans. Lineage has direct implications for many of the techniques and approaches of AI, such as:
- Neural networks. AI classifies data to make predictions and decisions in much the same way a human brain does. A neural network is a computing system made up of interconnected units (like neurons) that process data from external inputs, relaying information between each unit. The neural network requires multiple passes at the data to find connections and derive meaning from undefined data. Neural networks benefit greatly from the movement aspects of data lineage – because connecting those dots directs their search for meaning.
- Natural language processing. AI that enables interaction, understanding and communication between humans and machines by analyzing and generating human language, including speech, is called natural language processing (NLP). NLP allows humans to communicate with computers using normal, everyday language to perform tasks. Natural language processing relies heavily on the human language data descriptions provided by the characteristics aspect of data lineage.
- Machine learning. AI that’s focused on giving machines access to data and letting them learn for themselves is known as machine learning. Machine learning automates analytical model building using methods from neural networks, statistics, operations research and physics – and it finds hidden insights in data without being explicitly programmed where to look or what to conclude. Machine learning delves into the relationships, processes and transformations aspects of data lineage during its undirected exploration of data’s potential.
- Deep learning. With deep learning, AI uses huge neural networks with many layers of processing to learn complex patterns in large amounts of data and perform humanlike tasks, such as recognizing speech or understanding images and videos (also known as computer vision). This method takes advantage of advances in computing power and improved training techniques. Deep learning depends on the users aspect of data lineage because its education is guided by analyzing how users interact with data.
AI plays an ever-increasing role in enterprise solutions. Unlike robotics, which automates manual tasks, AI automates computing tasks. That’s especially valuable given the large and diverse data sets most organizations use today.
While the human role in enterprise solutions will never disappear, it’s foolish to argue against the advantage of AI-augmented humans. There’s a tremendous boost to human productivity when time-consuming tasks (like analyzing gigabytes of data) can be fully automated. But for AI to reach its full potential, the data feeding its algorithms and models needs to be well understood. Data lineage plays a vital role in understanding data, which makes it a foundational principle of AI.
Data lineage: GPS for data
Whether it’s by humans or machines, using data means taking a journey with data. Data flows in many directions across and through the enterprise, making it difficult to understand where the data that’s about to be used came from, and how it got into its current state. To get the full technical functionality and business value from data, you need a strong sense of direction. Data lineage provides that sense of direction, acting as GPS for your data.
Due to the complexity of enterprise data flows, it’s key to be able to visualize data lineage. Just as GPS provides you with turn-by-turn directions and a visual overview of the completely mapped route, data lineage provides point-to-point data movement and a visual overview of data’s complete journey. And just as you might want to augment your GPS directions with data that’s close by and related (as when you look for restaurants along your travel route), data lineage helps you locate data that’s nearby and related to the data that’s currently being used. That additional data can replace or augment the analysis being performed. For example, an AI application predicting customer behavior might benefit from including related social media content.
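One way to read "nearby and related" is that two datasets share an upstream source. The sketch below works under that assumption only; the `DERIVED_FROM` mapping and all dataset names are hypothetical, and a real lineage tool may define relatedness quite differently.

```python
# Hypothetical map of each dataset to the upstream sources it was derived from.
DERIVED_FROM = {
    "purchase_history": {"crm_export"},
    "support_tickets": {"crm_export", "helpdesk"},
    "social_mentions": {"social_feed", "crm_export"},
    "warehouse_inventory": {"erp_export"},
}


def related_datasets(dataset, derived_from):
    """Return other datasets that share at least one upstream source with dataset."""
    own_sources = derived_from[dataset]
    return sorted(
        name
        for name, sources in derived_from.items()
        if name != dataset and sources & own_sources
    )


print(related_datasets("purchase_history", DERIVED_FROM))
# ['social_mentions', 'support_tickets']
```

In this toy example, a customer behavior model built on `purchase_history` would surface `social_mentions` as a candidate for augmenting the analysis because both trace back to the same customer source.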
Data is often called the lifeblood of an organization. And today, streaming data is coursing through the veins of machine learning models and other AI applications with the goal of providing business intelligence. Just as it’s hard to have a good sense of direction without GPS, it’s hard to have a good sense of the data feeding AI without data lineage.
As data increasingly drives decisions and actions – and with AI independently making some of those decisions and taking some of those actions – you’d better know where your data has been before you let it get behind the wheel. Both human and artificial intelligence are naturally smarter with data lineage.
About the Author
Jim Harris is a recognized data quality thought leader with 20 years of enterprise data management industry experience. Jim is an independent consultant, speaker, and freelance writer. Jim is the Blogger-in-Chief at Obsessive-Compulsive Data Quality, an independent blog offering a vendor-neutral perspective on data quality. Jim is the host of the popular podcast OCDQ Radio, and is very active on Twitter, where you can follow him @ocdqblog.