Data integration is a fundamental, yet deceptively challenging, component of any organization’s business intelligence and data warehousing strategy. Data integration involves combining data residing in different data repositories and providing business users with a unified view of this data. In addition, companies face a challenge of ensuring that data being reported is current and up-to-date. Companies are now increasingly incorporating both traditional batch-oriented techniques for query performance and real-time data integration to eliminate the annoyance of out-of-date data. The top batch-oriented technique that companies utilize is known as ETL while one of the popular real-time techniques is known as Data Federation.
Data Integration Technique – Extraction, Transformation, & Loading (ETL)
- The term ETL which stands for extraction, transformation, & loading is a batch or scheduled data integration processes that includes extracting data from their operational or external data sources, transforming the data into an appropriate format, and loading the data into a data warehouse repository. ETL enables physical movement of data from source to target data repository. The first step, extraction, is to collect or grab data from from its source(s). The second step, transformation, is to convert, reformat, cleanse data into format that can be used be the target database. Finally the last step, loading, is import the transformed data into a target database, data warehouse, or a data mart.
ETL Step 1 – Extraction
The extraction step of an ETL process involves connecting to the source systems, and both selecting and collecting the necessary data needed for analytical processing within the data warehouse or data mart. Usually data is consolidated from numerous, disparate source systems that may store the date in a different format. Thus the extraction process must convert the data into a format suitable for transformation processing. The complexity of the extraction process may vary and it depends on the type and amount of source data.
ETL Step 2 – Transformation
The transformation step of an ETL process involves execution of a series of rules or functions to the extracted data to convert it to standard format. It includes validation of records and their rejection if they are not acceptable. The amount of manipulation needed for transformation process depends on the data. Good data sources will require little transformation, whereas others may require one or more transformation techniques to to meet the business and technical requirements of the target database or the data warehouse. The most common processes used for transformation are conversion, clearing the duplicates, standardizing, filtering, sorting, translating and looking up or verifying if the data sources are inconsistent.
ETL Step 3 – Loading
The load is the last step of ETL process involves importing extracted and transformed data into a target database or data warehouse. Some load processes physically insert each record as a new row into the table of the target warehouse utilizing a SQL insert statement. Whereas other load processes include a massive bulk insert of data utilizing a bulk load routine. The SQL insert is a slower routine for imports of data, but does allow for integrity checking with every record. The bulk load routine may be faster for loads of large amounts of data, but does not allow for integrity check upon load of each individual record.
- ETL Tool Providers
- Here is a list of the most popular commercial and freeware (open-source) ETL Tools.
- Commercial ETL Tools:
- • IBM Infosphere DataStage
- • Informatica PowerCenter
- • SAP Business Objects Data Integrator (BODI)
- • SAP Business Objects Data Services
- • Oracle Warehouse Builder (OWB)
- • Oracle Data Integrator (ODI)
- • SAS Data Integration Studio
- • Microsoft SQL Server Integration Services (SSIS)
- • Ab Initio
- • SyncSort DMExpress
- • iWay DataMigrator
- • Pervasive Data Integrator
- Freeware, Open Source ETL tools:
- • Pentaho Data Integration (Kettle)
- • Talend Integrator Suite
- • CloverETL
- • Jasper ETL
Data Integration Technique – Data Federation
- Data federation is a category of data integration technology that provides the ability to query and aggregate data from disparate sources in a virtual database so it can be used by business intelligence, reporting, or analysis applications in real-time. The virtual database created by data federation technology doesn’t contain the data itself. Instead, it contains information or metadata about the actual data and its location. The actual data is physically left in place within its source data repository.
Data federation is used to create virtualized and integrated views of data and allows for execution of distributed queries against multiple data sources (relational databases, enterprise applications, data warehouses, documents, XML) at the same time. Data federation allows for accesses to data without physical movement of data and provides a layer of abstraction above the physical implementation of data.
- Data federation is synonymous with other technologies and commonly referred to as …
- • Data Virtualization
- • Enterprise Information Integration (EII)
- Pros of Data Federation
- • Access current and transactional data stored in multiple sources
- • Does not require movement of data (No ETL)
- • Only requested data returned
- • Real-time data access
- • Quicker development time – supports incremental development
- • Reduces data storage and data transfer
Cons of Data Federation
- • Still queries against original data sources
- • Only contains as much data as source system contains
- • If data is archived off source, data is no longer available in federation tool
- • Query performance is not as good as a data warehouse
- • High system performance transferred to an application server from a database server
- Top Data Federation Tools
- Below a list of the most popular enterprise data integration tools providing the data federation features:
- • SAP BusinessObjects Data Federator
- • Sybase Data Federation
- • IBM InfoSphere Federation Server
- • Oracle Data Service Integrator
- • SAS Enterprise Data Integration Server
- • JBoss Enterprise Data Services Platform
- • iWay Data Hub