The New Data Integration Landscape
For years, the key to success for any business intelligence solution has been the process known as extract, transform and load (ETL). Selecting the right tool to meet an organization's need for gathering data from disparate sources, and transforming the data before loading it into a target destination, was the critical factor in building a data warehouse or a data mart to support an organization's BI projects. ETL became so important that the process became synonymous with the tool, and the technology became known as ETL technology, which spawned many ETL tools.
Numerous organizations have struggled through the process of selecting tool after tool to gain access to each new data source, only to end up with several ETL tools acquired via mergers and acquisitions or by allowing departments to operate unchecked with their tools of choice. Likewise, some organizations have failed to see the benefits of tools over custom coding, which results in small armies of programmers building and maintaining code. The problem with using several tools or custom code is that it significantly increases the total cost of ownership in terms of maintenance, training, and time lost in regaining familiarity with a rarely used tool. In addition, using several tools can lead to very fragmented metadata, which turns compliance and other issues into chores rather than automatic processes delivered through self-documenting metadata.
Beyond the proliferation of ETL tools, building and maintaining a data warehouse or a data mart is no longer the only data-related activity occurring in organizations. And BI, while still a powerful driver, no longer stands alone.
Organizations are increasingly finding it necessary to take on additional non-warehouse projects such as system migration, system consolidation, and system synchronization as a result of mergers, de-mergers, acquisitions and an overall need to update older systems to their modern equivalents. While ETL supports some of these projects when they are handled in batch or near real-time form, many others demand new technologies. For example, master data management and real-time synchronization/data quality to maintain the integrity of operational systems are fast emerging as critical themes in most organizations. This expanded scope has led to the emergence of the topic of data integration.
Data integration should be a strategic topic in all organizations because it affects everything that an organization does. It's time to move away from an ad-hoc approach and look at data integration as something that can contribute significantly to your competitive advantage. It's time to think about standardizing as much of your data integration as possible, including ETL, with one "system-neutral" vendor in order to leverage synergies such as shared business rules and metadata across the spectrum of data integration. In addition, you will see reduced training and maintenance costs, and many benefits on the operational and BI fronts from having one consistent, integrated set of technologies. It's time to ensure that the lessons of the ETL era are understood and to establish a single path to success.
Data integration defined
Data integration cannot be seen as just "a means to an end." While, in many cases, data integration supports operational processes or keeps operational systems in sync, it is not necessarily directly driving things the way BI and analytics do. Perhaps it is this shift in focus that most characterizes data integration, and it is the reason why data integration technologies that come from RDBMS vendors are somewhat limited; they are still too focused on the BI world. With a cohesive data integration strategy, the major focus needs to be on the non-BI aspects that are affecting organizations today, at a time when many organizations have not yet resolved the issue of ETL/data integration for the purpose of supporting data warehouses and data marts.
If we agree that this is an important topic, how do we move forward? As with all things, a data integration strategy brings some "buy" vs. "build" choices to organizations. Because of the way that the portfolios of most vendors in the market have evolved (that is, through mergers and acquisitions), there is a third method: buy and integrate the tools, even if they are provided by a single vendor. This is the same tool integration that would be required if you took a "piecemeal" approach and bought from several vendors to meet all your needs. Organizations that want to establish a data integration strategy should learn how each capability was added to a portfolio (through in-house development or through acquisition), and whether things such as metadata and business rules can actually be shared, not just whether they exist. If they can't be shared, when will migration happen and what will the steps be? Organizations should be careful not to be "taken in" by descriptions of manual steps that are needed to get the bigger picture. Manual steps introduce overhead and risk, and many hidden costs can suddenly become apparent.
High-level guide to data integration
Let's build the data integration landscape beginning with what most people know about data integration today, and develop it to include emerging topics and technologies.
Connectivity and Metadata: Although not part of the data integration landscape, the topics of connectivity and metadata are very important in any data integration strategy because they are pervasive in all the other parts of the landscape as key enabling technologies. As such, they should be given equal consideration when selecting a solution. Any data integration solution should provide data connectivity, both through native access using standard utilities and through open standard access (such as ODBC), to all major structured data sources such as relational databases, flat files, ERP systems, and mark-up languages such as XML, for reading and writing. The data connectivity capabilities should facilitate access to information on many different systems such as z/OS, UNIX, and Windows, preferably without having to make use of intermediate files and extracts. To provide complete connectivity, the solution should also support reading and writing data from message queues and receiving and sending data through Web services. In the longer term, the solution needs to continue evolving in order to support unstructured data sources.
Metadata should be pervasive through all types of data integration. Data integration, at its core, is about relating multiple data sources and bringing them together to make your data more valuable. Metadata provides the definition across data sources that makes this possible. In addition, metadata enables you to trace what moved when, how it was changed, what business rules were applied, and what impact those changes might have. These are critical issues facing all organizations. Failure to place enough emphasis on metadata will result in problems later on, often at great cost to an organization.
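To make the tracing role of metadata concrete, here is a minimal Python sketch of a data movement that records what moved, when, and which business rule was applied. It is an illustration only: the `LineageRecord` schema, the `move_with_lineage` helper, the system names, and the upper-casing rule are all hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class LineageRecord:
    """One metadata entry describing a single data movement (hypothetical schema)."""
    source: str
    target: str
    rule_name: str
    rows_moved: int
    moved_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def move_with_lineage(rows, source, target, rule: Callable[[dict], dict], log: list):
    """Apply a business rule to a copy of every row and record the movement in `log`."""
    transformed = [rule(dict(row)) for row in rows]
    log.append(LineageRecord(source, target, rule.__name__, len(transformed)))
    return transformed


def uppercase_country(row):
    """Illustrative business rule: store country codes in upper case."""
    row["country"] = row["country"].upper()
    return row


lineage_log = []
crm_rows = [{"id": 1, "country": "us"}, {"id": 2, "country": "de"}]
warehouse_rows = move_with_lineage(
    crm_rows, "crm.customers", "dw.customers", uppercase_country, lineage_log
)
```

Because every movement appends an entry to `lineage_log`, questions such as "which rule changed this column, and when?" can be answered from the metadata rather than by reading code.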
Data Quality / Real-Time Data Quality Integration: Any data integration solution should include an INTEGRATED data-quality solution to support data-quality processes such as profiling, householding, deduplication, the creation of data-quality business rules, and cleansing of data (where required). These rules should also be callable through custom exits, messages placed onto message queues, or Web services to trigger the process and deliver what can be referred to as real-time data quality integration. A classic example is the checking of names and addresses at the point of entry into an ERP system, through the use of a custom exit, to build in data quality from the start.
Data Warehousing/Data Marts (ETL): Any data integration solution needs to provide the capability to both build and maintain data warehouses and data marts via the ETL process. This solution needs to leverage the data connectivity capabilities that were previously mentioned and have fully integrated metadata. Such a solution should also include SUPPORT. Here SUPPORT means technical support and help from professional services as a part of the solution, as well as extensions through custom coding, so that organizations have the flexibility to do more than the tool delivers without losing the support of the vendor when they use custom code, thus reducing risk. In addition, the solution must allow for the re-use of data-quality business rules that are provided by the data-quality part of the data integration offering. Data quality must take "center stage" in any integration strategy.
Data Migration: Any data integration solution needs to provide the capability to migrate data from multiple existing systems to one or more new or existing systems. You could argue that, in its most primitive form, this is just the application of the ETL process plus data quality to some target other than a data warehouse or a data mart. Organizations should be looking to build up data-quality business rules over time (and from one data migration project to the next) that can be applied whenever a migration takes place in order to get re-usable, immediate, and low-cost business benefits. These same rules should be usable when supporting data warehouse and data mart creation and maintenance. While one-off migrations do take place, they are very hard to achieve on the operational side, where the source system lives. This is because organizations often have business applications running on the operational system that is to be migrated. Moving forward therefore involves first migrating the data to a new system and verifying its correctness (again, this is where metadata becomes vitally important), then establishing an ongoing data synchronization process between the old and the new systems, and then placing the business application on top of the new system for acceptance testing. After you are satisfied that the data in the new system is up-to-date and that the business application is operating as expected on the new system, the old system can be turned off and data synchronization ended.
Data Synchronization: Any data integration solution needs the capability to ensure that changes made in one system are also reflected in other systems in the organization. There are two types of data synchronization. The first type is the movement of "changes" made in one or more systems to other systems in batch or near real-time. The second type is the movement of "changes" made in one or more systems to other systems in real-time. The first type of data synchronization is just another application of the ETL process, using change-data capture and a scheduled process to move data around. This process can be scheduled nightly, every 30 minutes, every 5 minutes, or even more frequently depending on the needs of the organization and the amount of data to be moved. However, it typically involves the movement of multiple "transactions" or "records" at a time. The second type of data synchronization involves the movement of individual "transactions" or "records" to synchronize status across multiple systems in real-time, as the transactions occur. Technologies such as message queues and brokers are often used in such circumstances. Here, a real-time server needs to be invoked by using custom exits, messages placed onto message queues, change brokers, or Web services to trigger the process. Again, it is important to note the role of data quality in data synchronization. Although bad data in one system is not good, the proliferation of bad data through data synchronization can have a devastating effect. Organizations should ensure that any data synchronization also includes the application of data-quality business rules to maintain the quality of data throughout all systems.
Master Data Management (MDM): Any data integration solution needs to provide the capability to handle the new and emerging topic of master data management. Master data management is the practice of creating a single "perceived" truth by mapping multiple disparate definitions of items such as names of customers and products, which are held in various systems. Thus, when a user asks for "customers," all the customer names can be returned in a common format that uses a standard, company-accepted definition for any application, without the user having to understand the underlying structure of the various silos throughout the organization. Tied closely to MDM are emerging topics such as customer data integration (CDI) and product data integration (PDI) that build on the basic technology and deliver a number of common mappings and definitions to get organizations up and running quickly.
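The idea of mapping disparate definitions onto one company-accepted customer definition can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the two source systems (`crm`, `billing`), their field names, and the name-formatting rule are invented for the example, and real master data management also involves matching and merging records, which is not shown here.

```python
# Hypothetical per-system field mappings onto a common customer definition.
FIELD_MAPPINGS = {
    "crm": {"cust_name": "name", "cust_no": "customer_id"},
    "billing": {"FullName": "name", "AccountId": "customer_id"},
}


def to_master(system, record):
    """Translate one source record into the common master format."""
    mapping = FIELD_MAPPINGS[system]
    master = {common: record[local] for local, common in mapping.items()}
    # Assumed standard format: single spaces, title case, identifiers as strings.
    master["name"] = " ".join(master["name"].split()).title()
    master["customer_id"] = str(master["customer_id"])
    master["source_system"] = system
    return master


def all_customers(silos):
    """Return every customer from every silo in the common format."""
    return [
        to_master(system, record)
        for system, records in silos.items()
        for record in records
    ]


silos = {
    "crm": [{"cust_name": "  ada   LOVELACE ", "cust_no": 17}],
    "billing": [{"FullName": "grace hopper", "AccountId": "B-9"}],
}
customers = all_customers(silos)
```

A caller of `all_customers` sees the same `name` and `customer_id` fields regardless of which silo a record came from, which is the "single perceived truth" the text describes. Pre-built mapping tables of this kind are, loosely, what CDI and PDI offerings supply.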
Where MDM is a topic of concern, organizations should look for the development of these more advanced solution areas that incorporate a true metadata management framework in traditional reference data management and speed up the time to deployment, thereby reducing overall costs. Ultimately, CDI and PDI are examples of real implementations that solve specific problems in the broader MDM space. Many organizations will have to solve one of these specific sets of problems first. However, the more forward-thinking organizations will have a broad MDM strategy that leverages many of the same technologies and capabilities to achieve common results within their enterprise. If a vendor says they do MDM but they do not deal with topics like CDI or PDI, then you might be getting a very limited solution that will require a lot of ongoing, manual, and expensive custom development.
Data Federation / Enterprise Information Integration (EII): Any data integration offering needs to support data federation or EII, which is basically a form of data integration that keeps data in place and allows the data to be integrated and surfaced as needed. Due to its dynamic nature, data federation is best suited to problems that do not require access to large amounts of data or to data from many underlying systems. Data federation and EII, along with data synchronization, are often the underlying technologies that are employed with MDM, and they are also often used with BI solutions where more operational or real-time views of data are required.
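As a rough sketch of the federation idea, the Python example below leaves data in two separate sources and combines it only when a question is asked. The two in-memory SQLite databases stand in for independent operational systems; the table layouts and the `customer_spend` function are assumptions made for the illustration, not a description of any EII product.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory database standing in for one operational system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


orders_db = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)
crm_db = make_source(
    "CREATE TABLE customers (customer_id INTEGER, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)


def customer_spend(customer_id):
    """Federated lookup: query each source in place and combine the results on demand."""
    name_row = crm_db.execute(
        "SELECT name FROM customers WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    if name_row is None:
        return None
    total_row = orders_db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return {"name": name_row[0], "total_spend": total_row[0]}
```

Nothing is copied into a warehouse, so a new order is visible to the next call of `customer_spend` immediately. The cost is that every request touches every underlying system, which is why this style fits small, targeted lookups better than large scans.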
Your data integration strategy
SAS – in conjunction with DataFlux, a wholly owned subsidiary of SAS that focuses on the data quality aspects of data integration and real-time data integration – delivers a variety of integrated solutions to meet various needs that can be brought together, incrementally and in a variety of ways, to suit the needs of any organization. You can start with a solution to address master data management, or with technologies to build data warehouses and data marts or to carry out rudimentary data profiling. The important thing is that, whichever choice you make and whichever direction you subsequently take, SAS and DataFlux can deliver all the technologies that you need to establish a data integration strategy while realizing the benefits of shared business rules, shared metadata, and integrated technologies, along with associated cost reductions when employees need less training, and you can re-use business rules. In addition, you'll have less inherent tool and metadata integration and fewer maintenance and management problems, which are alleviated by an integrated comprehensive approach. If you are not doing so already, today might be a good time to start deciding where the future of your data integration strategy lies. All the topics in the preceding landscape should live and work together to give you maximum benefit and value. Previously established piecemeal standards need to be challenged – and time is not on your side. The most successful organizations will have a clear and precise strategy in place that recognizes data integration as a fundamental cornerstone of their competitive differentiation. Those who succeed will be the leaders who can address all their needs by using one integrated offering, thereby having the flexibility to react to new challenges quickly. Those who hesitate will be quickly left behind in a sea of complexity and cost. Data integration should be complete, flexible, integrated, and proven. 
SAS and DataFlux provide all these strengths and are ready to help you address your challenges today.
This story appears in the First Quarter 2006 issue of