Three Steps in ETL Processing

Steps within ETL Processing

The term ETL, which stands for extraction, transformation, and loading, refers to a batch or scheduled data integration process that extracts data from operational or external data sources, transforms the data into an appropriate format, and loads the data into a data warehouse repository. ETL enables the physical movement of data from source to target data repository. The first step, extraction, collects data from its source(s). The second step, transformation, converts, reformats, and cleanses the data into a format that can be used by the target database. The final step, loading, imports the transformed data into a target database, data warehouse, or data mart.

Step 1 – Extraction
The extraction step of an ETL process involves connecting to the source systems and selecting and collecting the data needed for analytical processing within the data warehouse or data mart. Data is usually consolidated from numerous, disparate source systems that may store it in different formats, so the extraction process must convert the data into a format suitable for transformation processing. The complexity of the extraction process varies with the type and amount of source data.
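As a minimal sketch of the extraction step, the snippet below connects to a single source and collects only the fields needed downstream; the database file and the orders table are assumptions invented for the example.

```python
import sqlite3

def extract(source_path: str) -> list:
    """Connect to a source system and collect the rows needed downstream."""
    conn = sqlite3.connect(source_path)
    try:
        cursor = conn.execute(
            # Hypothetical source table: select only the columns the
            # warehouse needs for analytical processing.
            "SELECT order_id, customer_name, order_date, amount FROM orders"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```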

Step 2 – Transformation
The transformation step of an ETL process applies a series of rules or functions to the extracted data to convert it to a standard format. It includes validating records and rejecting those that are not acceptable. The amount of manipulation needed depends on the data: clean data sources may require little transformation, whereas others may require one or more transformation techniques to meet the business and technical requirements of the target database or data warehouse. The most common transformation processes are conversion, clearing duplicates, standardizing, filtering, sorting, translating, and looking up or verifying data when sources are inconsistent.
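Continuing the hypothetical orders extract above, this sketch shows a few of the common transformations named here: validation with rejection, standardization, and clearing duplicates.

```python
def transform(rows: list) -> list:
    """Validate, standardize, and deduplicate extracted rows."""
    seen = set()
    clean = []
    for order_id, customer_name, order_date, amount in rows:
        # Validation: reject records that fail the business rules.
        if amount is None or amount < 0:
            continue
        # Standardization: normalize the customer name's format.
        customer_name = customer_name.strip().title()
        # Clearing duplicates: keep one row per business key.
        if order_id in seen:
            continue
        seen.add(order_id)
        clean.append((order_id, customer_name, order_date, amount))
    return clean
```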

Step 3 – Loading
The load step, the last step of the ETL process, imports the extracted and transformed data into a target database or data warehouse. Some load processes physically insert each record as a new row into the target warehouse table using a SQL INSERT statement, while others perform a massive bulk insert of data using a bulk load routine. The SQL INSERT is the slower way to import data, but it allows integrity checking of every record. The bulk load routine is faster for large volumes of data, but it does not allow an integrity check on each individual record as it loads.
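The contrast between the two load styles can be sketched as follows, again using sqlite3 and a hypothetical fact_orders target table; in a real warehouse the bulk path would typically be a vendor bulk-load utility rather than executemany.

```python
import sqlite3

def load_row_by_row(target: sqlite3.Connection, rows: list) -> None:
    """Slower SQL INSERT path: each record can be integrity-checked."""
    for row in rows:
        try:
            target.execute("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", row)
        except sqlite3.IntegrityError:
            pass  # reject (or log) the offending record and keep loading
    target.commit()

def load_bulk(target: sqlite3.Connection, rows: list) -> None:
    """Faster bulk path: no per-record integrity handling."""
    target.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    target.commit()

# Usage sketch: load_row_by_row(conn, transform(extract("source.db")))
```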


Advantages of KPI Scorecarding

A KPI scorecard is a solution that allows organizations to manage their organizational performance by linking operations with strategy. The scorecard monitors the execution of strategic objectives at each level of the organization and ensures a consistent understanding at all levels of the organization’s priorities and expectations.

Sample KPI Scorecard

Scorecards can be utilized at all levels of an organization, including executives, managers, and staff. They are usually updated in periodic snapshots, use graphical symbols and icons to present summaries of organizational performance, and chart the organization's overall progress over time.

Scorecards provide value to the organization by linking key performance indicators (KPIs) at the tactical and operational levels of an organization to its overall strategy. In addition, the scorecard establishes accountability with the appropriate individuals and enables visibility up and down the organization, from executive management to operational levels. The scorecard can be a powerful tool for assigning goals and objectives to all individuals and focusing accountability on the relevant individuals and business units.

Typically scorecards provide internal and industry benchmarks, goals, and targets that assist an individual’s understanding of their own unique contribution to the organization. Often, the scorecard spans strategic, tactical, and operational aspects and decisions of the organization and supports the specific demands of varying levels of management.
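As a minimal sketch of how a scorecard might roll actuals and targets into the usual traffic-light icons, the KPI names, figures, and thresholds below are invented for illustration.

```python
def kpi_status(actual: float, target: float) -> str:
    """Map a KPI's actual-versus-target ratio to a scorecard icon."""
    ratio = actual / target
    if ratio >= 1.0:
        return "green"   # meeting or beating the target
    if ratio >= 0.9:
        return "yellow"  # within tolerance, needs attention
    return "red"         # off track

# Illustrative higher-is-better KPIs: (name, actual, target)
scorecard = [
    ("Quarterly revenue ($M)", 11.8, 12.0),
    ("Customer satisfaction (%)", 92.0, 90.0),
    ("On-time delivery (%)", 82.0, 95.0),
]
for name, actual, target in scorecard:
    print(f"{name}: {kpi_status(actual, target)}")
```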

Benefits and Advantages of KPI Scorecards to an organization include…
•  Scorecards drive improved organizational performance
•  Scorecards translate strategy into concrete terms and help track its execution
•  Scorecards help ensure that the right measures are utilized
•  Scorecards encourage the right balance of operational and strategic factors
•  Scorecards encourage good management
•  Scorecards present a compelling picture of performance that is not distorted by any individual issue

Data Warehouses (EDW vs Data Marts)

There are two fundamental types of data warehouses:
•  Enterprise Data Warehouses (EDW)
•  Data Marts
Overview of Enterprise Data Warehouse (EDW) and Data Marts

Enterprise Data Warehouse (EDW): The enterprise data warehouse is typically a large, organization-wide database repository that crosses every business function and includes data from every organizational unit, division, and department. In essence, an enterprise data warehouse is a substantially large repository of the historical and current transaction data of an entire organization. Because an enterprise data warehouse implementation is commonly strategic in nature, the volume of data it holds tends to be quite large: enterprise data warehouses can contain hundreds of gigabytes, terabytes, and sometimes even petabytes of data.

Data Mart: A data mart is a collection of subject areas organized for decision support based on the needs of a given department or office.  Many times, a data mart will serve as the reporting and analytical solution for a particular department within an organization, such as accounting, sales, customer service, and/or marketing.  For the most part, data marts are designed with just enough data entities, fields, and records to satisfy one department’s requirements.

There are two kinds of data marts, dependent and independent (a minimal sketch of the two feeding patterns follows this list):
•  A dependent data mart is one whose source is another data warehouse, and all dependent data marts within an organization are typically fed by the same source — the enterprise data warehouse.
•  An independent data mart is one whose source is directly from transactional systems, legacy applications, or external data feeds.
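A minimal sketch of the two feeding patterns, assuming sqlite3 connections and invented table names: the dependent mart selects from the enterprise warehouse, while the independent mart pulls straight from a transactional system.

```python
import sqlite3

def load_dependent_mart(edw: sqlite3.Connection,
                        mart: sqlite3.Connection) -> None:
    """Dependent mart: fed from the enterprise data warehouse."""
    rows = edw.execute(
        # One department's subject area, carved out of an EDW fact table.
        "SELECT order_date, product_id, amount FROM fact_sales "
        "WHERE department = 'sales'"
    ).fetchall()
    mart.executemany("INSERT INTO sales_mart VALUES (?, ?, ?)", rows)
    mart.commit()

def load_independent_mart(oltp: sqlite3.Connection,
                          mart: sqlite3.Connection) -> None:
    """Independent mart: fed directly from a transactional system."""
    rows = oltp.execute(
        "SELECT order_date, product_id, amount FROM orders"
    ).fetchall()
    mart.executemany("INSERT INTO sales_mart VALUES (?, ?, ?)", rows)
    mart.commit()
```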

Data Federation Technology – Pros and Cons

Data Federation
Data federation is a category of data integration technology that provides the ability to query and aggregate data from disparate sources in a virtual database so it can be used by business intelligence, reporting, or analysis applications in real-time.  The virtual database created by data federation technology doesn’t contain the data itself.  Instead, it contains information or metadata about the actual data and its location. The actual data is physically left in place within its source data repository.

Data federation is used to create virtualized, integrated views of data and allows distributed queries to execute against multiple data sources (relational databases, enterprise applications, data warehouses, documents, XML) at the same time. Data federation provides access to data without physically moving it and adds a layer of abstraction above the physical implementation of the data.
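As a toy illustration of the idea rather than a real federation engine, the sketch below leaves two sqlite sources in place and answers one distributed query that spans both; the database and table names are invented.

```python
import sqlite3

# Two independent "source systems"; their data stays where it is.
crm = sqlite3.connect("crm.db")
crm.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT)")
crm.commit()
erp = sqlite3.connect("erp.db")
erp.execute("CREATE TABLE IF NOT EXISTS invoices (customer_id INTEGER, total REAL)")
erp.commit()

# The "virtual database" holds no data of its own, only references
# to where the real data lives.
virtual = sqlite3.connect(":memory:")
virtual.execute("ATTACH DATABASE 'crm.db' AS crm")
virtual.execute("ATTACH DATABASE 'erp.db' AS erp")

# One distributed query across both sources, answered at request time.
rows = virtual.execute(
    """
    SELECT c.name, SUM(i.total)
    FROM crm.customers AS c
    JOIN erp.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
    """
).fetchall()
```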

Data federation is synonymous with other technologies and is commonly referred to as…
•  Data Virtualization
•  Enterprise Information Integration (EII)

Pros of Data Federation

•  Access current and transactional data stored in multiple sources
•  Does not require movement of data (No ETL)
•  Only requested data returned
•  Real-time data access
•  Quicker development time – supports incremental development
•  Reduces data storage and data transfer

Cons of Data Federation

•  Queries still run against the original source systems, adding load to them
•  Contains only as much data as the source systems contain
•  If data is archived off the source, it is no longer available through the federation tool
•  Query performance is not as good as a data warehouse's
•  Heavy query-processing work shifts from the database server to the application server

General Data Appliance / Complementary Database Engine – Dataupia

Data Warehouse Appliance - Complementary Vendor
A unique data warehouse appliance is provided by Dataupia. Rather than giving an organization a complete data warehouse solution, Dataupia's product complements existing data warehouse deployments within an organization.

Dataupia’s flagship product, Satori Server

Satori Server is a data warehouse appliance in the sense that it combines software and hardware components on rack-mounted server hardware. It includes an embedded copy of the Linux operating system, a custom database engine, an aggregation engine, built-in storage, and parallel processors. But in contrast to full data warehouse appliances that are purpose-built for data warehousing, Satori Server interoperates with existing database management systems rather than requiring their replacement.

The main feature of Satori Server that makes it unique among data warehouse appliances is "omniversal transparency". This feature shields applications and users from changes to the underlying technology and database platform by complementing Oracle, MS SQL Server, or DB2 databases and their dependent applications. Unlike other solutions that displace existing technologies, Satori Server's design extends and enhances an existing information ecosystem. Further, Dataupia's Satori Server is usually added to an existing data warehouse infrastructure to augment an organization's existing data management systems as a query accelerator or capacity extender.


Partial Data Warehouse Appliances (Software Appliances)

Starting in 2006, a new wave of vendors emerged that focus on database management systems (DBMS) purpose-built for data warehousing that integrate easily with another vendor's hardware. The products from these "partial-solution" data warehouse appliance vendors have been coined "software appliances". The software appliance has proved to be a good starting point for new data warehouse vendors: it allows them to focus on their strength, database software, and to partner with commodity hardware vendors, benefiting from the hardware firms' focus and resources.

Data Warehouse Appliance - Partial Solution Vendors

The partial-solution vendor is typically a software company that has developed a proprietary database and either partners with or is a subsidiary of one or more major hardware vendors. These software vendors generally focus on developing optimal database management software suitable for large data warehouse implementations, and commonly rely on other vendors to supply the most appropriate hardware.

Software appliance vendors with products based upon the traditional row-based database model:

Kognitio – Offers a row-based “virtual” database called WX2 that does not include indexes or data partitions and runs on blade servers from IBM and hardware from Hewlett-Packard.  Kognitio can be deployed to a customer in one of three ways:  as a software-only license solution, as a fully configured data warehouse appliance, and on-demand via a Data Warehousing as a Service (DaaS) offering.

Software appliance vendors with products based upon the columnar database model, in which data is stored in columns rather than rows (see the layout sketch after this list):

ParAccel – The ParAccel Analytic Database is a proprietary column-based database that combines patent-pending innovations with a columnar, compressed, massively parallel, all-in-memory-capable architecture.  The database is available as a packaged appliance on Sun hardware known as Ultra-Fast Data Warehouse powered by Sun and ParAccel.
Vertica – The Vertica Analytic Database is a grid-based, column-oriented database that runs on grids of commodity Linux-based servers. It features column-oriented storage organization, compression, hybrid storage organization, and a shared-nothing architecture. In March 2011, Hewlett-Packard (HP) completed its acquisition of Vertica Systems, Inc.
EXASOL – Offers a proprietary, massively parallel database management system called EXASolution. The product is column-based and makes extensive use of in-memory processing; join indexes are automatically generated at run-time as required. EXASolution is available either as a software-only solution or as a pre-configured appliance running on commodity hardware from vendors such as Fujitsu Siemens, HP, Supermicro, or Hitachi. EXASolution uses its own version of Linux, which provides extended capabilities to support its clustered architecture.
Sybase IQ – Distinguishes itself from conventional databases with its column-oriented, grid-based architecture, patented data compression, and advanced query optimizer. Sybase IQ provides significantly faster query performance than traditional database environments, which were designed primarily for processing transactions; the improvement can be 10 to 1000 times greater, even as thousands of users and years' worth of data are added to the environment. Announced in May 2008, the Sybase Analytical Appliance combines pSeries hardware from IBM with the Sybase IQ database.
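A minimal illustration of the row-versus-column layouts these vendors build on; the data is invented, and real columnar engines layer compression and parallel execution on top of this idea.

```python
# Row-oriented layout: each record is stored together, as in a
# traditional transactional DBMS.
rows = [
    ("2011-03-01", "widget", 19.95),
    ("2011-03-01", "gadget", 24.50),
    ("2011-03-02", "widget", 19.95),
]

# Column-oriented layout: each column is stored contiguously.
columns = {
    "order_date": ["2011-03-01", "2011-03-01", "2011-03-02"],
    "product":    ["widget", "gadget", "widget"],
    "amount":     [19.95, 24.50, 19.95],
}

# An aggregate over one column touches only that column's storage,
# which is why analytic scans favor the columnar layout; repeated
# values within a column also compress well.
total_sales = sum(columns["amount"])
print(total_sales)
```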

Software appliance vendors with products based upon a hybrid row and column database:

EMC Greenplum – Acquired by EMC in July 2010, Greenplum has become the foundation of EMC's Data Computing Division. The EMC Greenplum database features a shared-nothing, massively parallel processing (MPP) architecture that has been purpose-built for business intelligence and analytical processing.
Aster Data – Provides a database management system (DBMS) called Aster Data nCluster, which is a massively parallel (MPP) hybrid row and column database with an integrated analytics engine. Aster Data's solution utilizes its patent-pending SQL-MapReduce to allow for parallel processing of data and applications and to deliver rich analytic insights. Aster Data products can be deployed in one of three ways: software-only, cloud-based, or as a complete appliance called the Aster Data MapReduce DW Appliance. In April 2011, Teradata completed its acquisition of Aster Data Systems, Inc.