SAP HANA Platform – Technical Overview

Sep 21, 2012 No Comments

At the Sapphire conference in May 2010, SAP announced the release of its new data warehouse appliance called HANA, or High-Performance Analytic Appliance. SAP HANA is a combination of hardware, storage, operating system, management software, and an in-memory data query engine, characterized by data being held in RAM rather than being read from disks or flash storage. Additionally, HANA has been built to split queries up and run them in parallel on multiple processors—a fundamentally different architecture from SAP's existing applications. This in-memory, parallel-processing architecture allows HANA to run queries and analytics on very large amounts of data extremely fast.

The SAP HANA platform implements a new approach to business data processing. In fact, it is much more than the traditional definition of a database, and the in-memory attribute of HANA is much more than simple caching of disk data structures in the server's main memory. SAP HANA incorporates a full database management system (DBMS) with a standard SQL interface, transactional isolation and recovery (the ACID properties: atomicity, consistency, isolation, durability), high availability, and massively parallel processing (MPP). SAP HANA supports most of entry-level SQL-92, and SQL is the standard interface to the platform; SAP applications that use Open SQL can run on SAP HANA without changes. SAP HANA is built to adapt to the dramatic advances in hardware and storage technology, on premise and in the cloud: it exploits multicore CPUs, and 64-bit address spaces offer a new level of scalability.

Traditional database management systems (DBMS) were designed to optimize performance on hardware with constrained main memory, where disk I/O is typically the main bottleneck. The focus was on optimizing disk access, for example by minimizing the number of disk pages read into main memory during processing. The SAP HANA database component, however, is designed from the ground up around the idea that memory is available in abundance. Built into the design of HANA is the consideration that roughly 18 billion gigabytes, or about 18 exabytes, is the theoretical limit of addressable memory on 64-bit systems, and that I/O access to the hard disk is no longer the constraint. Instead of optimizing hard disk I/O, SAP HANA optimizes memory access between the CPU cache and main memory. Additionally, SAP HANA is a massively parallel (distributed) data management system that runs completely in main memory, allows for both row-based and column-based storage options, and includes a sophisticated calculation engine.
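The 18-exabyte figure follows directly from the size of a 64-bit address space; a quick, purely illustrative back-of-the-envelope check in Python:

```python
# Theoretical addressable memory of a 64-bit system.
addressable_bytes = 2 ** 64            # 18,446,744,073,709,551,616 bytes
gigabytes = addressable_bytes / 10**9  # decimal gigabytes
exabytes = addressable_bytes / 10**18  # decimal exabytes

print(f"{gigabytes:.2e} GB")  # ~1.84e+10 GB, i.e. roughly 18 billion gigabytes
print(f"{exabytes:.1f} EB")   # ~18.4 EB
```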

 

SAP HANA - Combination of Technologies

 

The HANA database takes advantage of the low cost of main memory (RAM), the data processing abilities of multi-core processors, and the fast data access of solid-state drives relative to traditional hard drives to deliver better performance for analytical and transactional applications. It offers a multi-engine query processing environment that supports relational data (with both row- and column-oriented physical representations in a hybrid engine) as well as graph and text processing for semi-structured and unstructured data management within the same system.

HANA Blade Server

The SAP HANA solution was introduced on Hewlett Packard x86 servers (HP ProLiant DL580 G7 and DL980 G7) built upon Intel multi-core processors. A single server blade can contain up to 4 TB of main memory and up to 64 processor cores. With this total solution, SAP claims to have beaten the then-current performance benchmark by a factor of 20, on hardware that was several dozen times cheaper, for a claimed 200x price-performance improvement. SAP also claims that HANA either reduces or outright eliminates the need for the development and deployment of complex and expensive data marts.

The SAP HANA database manages data in a multi-core architecture, distributing data across all cores to maximize RAM locality, and supports both scale-out (horizontal) and scale-up (vertical) growth. The HANA database scales beyond a single server by allowing multiple servers in one cluster. Large tables can be distributed across multiple servers using round-robin, hash, or range partitioning, either alone or in combination, and HANA can execute queries and maintain distributed transaction safety across those servers.
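To make the partitioning schemes concrete, here is a simplified, hypothetical sketch of how hash and range partitioning might assign row keys to server nodes; the node count and range boundaries are invented for the example, and this is not SAP HANA's actual partitioning code.

```python
# Illustrative hash and range partitioning of row keys across server nodes.

NUM_NODES = 4
RANGE_BOUNDARIES = [1000, 2000, 3000]  # hypothetical upper bounds for range partitions

def hash_partition(key: int) -> int:
    """Assign a row to a node by hashing its key."""
    return hash(key) % NUM_NODES

def range_partition(key: int) -> int:
    """Assign a row to a node by comparing its key against the range boundaries."""
    for node, upper_bound in enumerate(RANGE_BOUNDARIES):
        if key < upper_bound:
            return node
    return len(RANGE_BOUNDARIES)  # keys beyond the last boundary go to the final node

for key in [42, 1500, 2999, 7777]:
    print(key, "-> hash node", hash_partition(key), "| range node", range_partition(key))
```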

Utilizing column-based data storage, SAP HANA can achieve compression rates unheard of in traditional row-based databases. In one example, analysis of SAP customers' systems showed that only about 10% of the attributes in a single financial database table were ever used in an SQL statement. Combined with columnar compression, this shrank the volume of data actually accessed from 35 GB in traditional relational database management system (RDBMS) storage to 800 MB in a column-store design, just over 2% of the volume in the traditional storage.
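Much of this compression comes from the fact that a column with few distinct values can be dictionary-encoded: each distinct value is stored once and every row keeps only a small integer reference. The snippet below is a minimal illustration of the idea, not HANA's actual compression implementation.

```python
# Dictionary encoding of a low-cardinality column: distinct values are stored once,
# and each row keeps only a small integer reference to its value.
column = ["EUR", "USD", "EUR", "EUR", "USD", "GBP", "EUR", "USD"]

dictionary = sorted(set(column))                     # ['EUR', 'GBP', 'USD']
value_to_id = {value: i for i, value in enumerate(dictionary)}
encoded = [value_to_id[value] for value in column]   # [0, 2, 0, 0, 2, 1, 0, 2]

# Decoding reverses the lookup, so no information is lost.
decoded = [dictionary[i] for i in encoded]
assert decoded == column
```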

One of the major sources of contention, and a common reason for slow performance in a traditional DBMS, is locking data while updates or inserts are being performed. SAP HANA avoids this issue and enables high levels of parallelization by using insert-only data records: instead of modifying existing records in place, changes (deltas) are inserted as net-new entries in the column store.
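Conceptually, an insert-only store never overwrites a row; each change is appended as a new versioned entry, and readers reconstruct the latest state. A simplified, hypothetical sketch of that idea (not HANA's internal mechanism):

```python
# Insert-only storage: changes are appended as new versions rather than
# overwriting existing entries, so writers do not block readers on row locks.
log = []  # each entry: (record_key, version, value)

def insert_or_update(key, value):
    version = sum(1 for k, _, _ in log if k == key) + 1
    log.append((key, version, value))

def current_value(key):
    versions = [(v, val) for k, v, val in log if k == key]
    return max(versions)[1] if versions else None

insert_or_update("customer_42", "Berlin")
insert_or_update("customer_42", "Munich")   # a delta, not an in-place update
print(current_value("customer_42"))          # -> Munich
```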

The table summarizes the benefits offered by specific features of the SAP HANA database:

Database Feature – Benefit
•  Multi-Core Architecture – Significant computation power over multiple processors (CPUs)
•  In-Memory Processing – Performance faster than reading from disk
•  Support of Row- and Column-Based Storage – Enables both transactional and analytical databases
•  Column-Based Storage – Fast SELECT query performance
•  High Data Compression Rates – Efficient use of disk storage
•  Data Partitioning – Efficient and fast analysis of very large data sets
•  Insert-Only on Deltas – Fast data loads

 

One of the differentiating attributes of SAP HANA is having both row-based and column-based stores within the same database engine. Conceptually, a database table is a two-dimensional data structure with cells organized in rows and columns, but computer memory is organized as a linear sequence. For storing a table in linear memory, two options can be chosen: row storage or column storage. Row storage stores the table as a sequence of records, each containing the fields of one row. In column storage, the entries of a column are stored in contiguous memory locations.

Row-based storage is recommended for transactional systems or when:
•  The table has a small number of rows, such as configuration tables.
•  The application needs to conduct updates of single records.
•  The application typically needs to access the complete record.
•  The columns contain mainly distinct values so the compression rate would be low.
•  Aggregations and fast searching are not required.

Column-based storage is recommended for analytical systems or when:
•  Calculations are executed on a single column or a few columns only.
•  The table is searched based on the values of a few columns.
•  The table has a large number of columns.
•  The table has a large number of records.
•  Aggregations and fast searching on large tables are required.
•  Columns contain only a few distinct values, resulting in higher compression rates.
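To make the two layouts concrete, the sketch below stores the same small table once row-wise and once column-wise in plain Python structures and shows why a single-column aggregation touches far less data in the columnar layout. It is purely illustrative; the table and column names are invented.

```python
# The same table laid out row-wise and column-wise in linear memory.
table = [
    ("Alpha", 10, "DE"),
    ("Beta",  20, "US"),
    ("Gamma", 30, "FR"),
]

# Row storage: the fields of each record are contiguous.
row_store = [field for record in table for field in record]
# -> ['Alpha', 10, 'DE', 'Beta', 20, 'US', 'Gamma', 30, 'FR']

# Column storage: the entries of each column are contiguous.
column_store = {
    "name":    [r[0] for r in table],
    "amount":  [r[1] for r in table],
    "country": [r[2] for r in table],
}

# Summing one column only scans that column's contiguous values...
total = sum(column_store["amount"])
# ...whereas the row store would have to step over every field of every record.
print(total)  # 60
```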

 

 


Benefits of a Data Warehouse

Jul 31, 2011 1 Comment

Data warehouses (DW) are centralized data repositories that integrate data from various transactional, legacy, or external systems, applications, and sources. The data warehouse provides an environment separate from the operational systems and is completely designed for decision support, analytical reporting, ad-hoc queries, and data mining. This isolation and optimization enables queries to be performed without any impact on the systems that support the business' primary transactions (i.e., transactional and operational systems).

Data Warehouse Overview
Fundamentally, a data warehouse helps solve the ongoing problem of pulling data out of transactional systems quickly and efficiently and converting that data into actionable information. Additionally, the data warehouse allows for the processing of large and complex queries in a highly efficient manner. Upon successful implementation of a data warehouse or data mart, a business will realize numerous improvements and positive gains.
Benefits from a successful implementation of a data warehouse include:
• Enhanced Business Intelligence
• Increased Query and System Performance
• Business Intelligence from Multiple Sources
• Timely Access to Data
• Enhanced Data Quality and Consistency
• Historical Intelligence
• High Return on Investment

 

Benefit of a DW:  Enhanced Business Intelligence

Insights will be gained through improved information access. Managers and executives will be freed from making their decisions based on limited data and their own "gut feelings". Decisions that affect the strategy and operations of the organization will be based upon credible facts and backed up with evidence and actual organizational data. Moreover, decision makers will be better informed, as they will be able to query actual data and retrieve information based upon their personal needs. In addition, data warehouses and the related business intelligence can be applied directly to business processes including marketing segmentation, inventory management, financial management, and sales.

 

Benefit of a DW:  Increased System and Query Performance

Data warehouses are purposely designed and constructed with a focus on speed of data retrieval and analysis. Moreover, a data warehouse is designed for storing large volumes of data and for rapidly querying that data. These analytical systems are constructed differently from operational systems, which focus on the creation and modification of data; the data warehouse is built for analysis and retrieval of data rather than the efficient upkeep of individual records (i.e., transactions). Further, the data warehouse allows a large system burden to be taken off the operational environment and effectively distributes system load across an entire organization's technology infrastructure.

 

Benefit of a DW:  Business Intelligence from Multiple Sources

For many organizations, enterprise information systems are comprised of multiple subsystems, physically separated and built on different platforms. Moreover, merging data from multiple disparate data sources is a common need when conducting business intelligence. To solve this problem, the data warehouse performs integration of existing disparate data sources and makes them accessible in one place. Consolidating enterprise data into a single data repository alleviates the burden of duplicating data gathering efforts, and enables the extraction of information that would otherwise be impossible. Additionally, the data warehouse becomes the “single view of truth” for the enterprise rather than the multiple truths that can come from reporting on individual subsystems.

 

Benefit of a DW:  Timely Access to Data

The data warehouse enables business users and decision makers to access data from many different sources whenever they need it, and to spend little time in the data retrieval process. Scheduled data integration routines, known as ETL, are leveraged within a data warehouse environment. These routines consolidate data from multiple source systems and transform the data into a useful format, and business users can then easily access the data from one interface. Further, consumers of data will be able to query data directly with less information technology support. The wait time for information technology professionals to develop reports and queries is greatly diminished because business users are given the ability to generate reports and queries on their own. The use of query and analysis tools against a consistent and consolidated data repository enables business users to spend more time performing data analysis and less time gathering data.

 

Benefit of a DW: Enhanced Data Quality and Consistency

A data warehouse implementation typically includes the conversion of data from numerous source systems and data files and the transformation of that disparate data into a common format. Data from the various business units and departments is standardized, and the inconsistencies among the unique source systems are removed. Moreover, individual business units and departments, including sales, marketing, finance, and operations, will start to utilize the same data repository as the source for their individual queries and reports. Thus each of these business units and departments will produce results that are consistent with the other business units within the organization, and the overall confidence in the organization's data is substantially increased.

 

Benefit of a DW: Historical Intelligence

Data warehouses generally contain many years' worth of data that can neither be stored within nor reported from a transactional system. Typically, transactional systems satisfy most operational reporting requirements for a given time period, but without the inclusion of historical data. In contrast, the data warehouse stores large amounts of historical data and can enable advanced business intelligence including time-period analysis, trend analysis, and trend prediction. The advantage of the data warehouse is that it allows for advanced reporting and analysis across multiple time periods.

 

Benefit of a DW:  High Return on Investment (ROI)

Return on investment (ROI) refers to the amount of increased revenue or decreased expense a business is able to realize from a project or investment of capital. Implementations of data warehouses and complementary business intelligence systems have enabled businesses to generate higher amounts of revenue and provide substantial cost savings. According to a 2002 International Data Corporation (IDC) study, "The Financial Impact of Business Analytics", analytics projects have been achieving a substantial impact on businesses' financial status. The study found that business analytics implementations generated a median five-year return on investment of 112% with a mean payback period of 1.6 years, and 54% of the businesses included in the study had a return on investment of 101% or more.

 

 


Data Integration Techniques (ETL and Data Federation)

May 31, 2011 2 Comments

Data integration is a fundamental, yet deceptively challenging, component of any organization’s business intelligence and data warehousing strategy.  Data integration involves combining data residing in different data repositories and providing business users with a unified view of this data.  In addition, companies face a challenge of ensuring that data being reported is current and up-to-date.  Companies are now increasingly incorporating both traditional batch-oriented techniques for query performance and real-time data integration to eliminate the annoyance of out-of-date data.  The top batch-oriented technique that companies utilize is known as ETL while one of the popular real-time techniques is known as Data Federation.

 

Data Integration Technique – Extraction, Transformation, & Loading (ETL)

Steps within ETL Processing

The term ETL, which stands for extraction, transformation, and loading, refers to a batch or scheduled data integration process that includes extracting data from operational or external data sources, transforming the data into an appropriate format, and loading the data into a data warehouse repository. ETL enables the physical movement of data from the source to the target data repository. The first step, extraction, is to collect or grab data from its source(s). The second step, transformation, is to convert, reformat, and cleanse the data into a format that can be used by the target database. Finally, the last step, loading, is to import the transformed data into a target database, data warehouse, or data mart.

ETL Step 1 – Extraction
The extraction step of an ETL process involves connecting to the source systems and both selecting and collecting the necessary data needed for analytical processing within the data warehouse or data mart. Usually data is consolidated from numerous, disparate source systems that may store the data in different formats, so the extraction process must convert the data into a format suitable for transformation processing. The complexity of the extraction process varies and depends on the type and amount of source data.

ETL Step 2 – Transformation
The transformation step of an ETL process involves applying a series of rules or functions to the extracted data to convert it to a standard format. It includes validation of records and their rejection if they are not acceptable. The amount of manipulation needed in the transformation process depends on the data: good data sources will require little transformation, whereas others may require one or more transformation techniques to meet the business and technical requirements of the target database or data warehouse. The most common transformation processes are conversion, de-duplication, standardization, filtering, sorting, translation, and lookups or verification when data sources are inconsistent.

ETL Step 3 – Loading
Loading, the last step of the ETL process, involves importing the extracted and transformed data into a target database or data warehouse. Some load processes physically insert each record as a new row into the table of the target warehouse using a SQL INSERT statement, whereas others perform a massive bulk insert of data utilizing a bulk load routine. The SQL INSERT is the slower routine for importing data, but it does allow for integrity checking on every record. The bulk load routine may be faster for loading large amounts of data, but it does not allow for integrity checking on each individual record.
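A minimal end-to-end sketch of the three steps, using only Python's standard library with an in-memory SQLite database standing in as the target warehouse; the source data, column names, and transformation rules are invented for the example.

```python
import csv
import sqlite3
from io import StringIO

# --- Extract: read raw records from a (simulated) source file. ---
source = StringIO("order_id,amount,currency\n1, 100.0 ,eur\n2,250.5,USD\n")
raw_rows = list(csv.DictReader(source))

# --- Transform: cleanse and standardize into the target format. ---
clean_rows = [
    (int(r["order_id"]), round(float(r["amount"]), 2), r["currency"].strip().upper())
    for r in raw_rows
]

# --- Load: bulk-insert the transformed rows into the target table. ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, currency TEXT)")
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean_rows)
conn.commit()

print(conn.execute("SELECT * FROM fact_orders").fetchall())
# -> [(1, 100.0, 'EUR'), (2, 250.5, 'USD')]
```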

ETL Tool Providers
Here is a list of the most popular commercial and freeware (open-source) ETL Tools.
Commercial ETL Tools:
•  IBM Infosphere DataStage
•  Informatica PowerCenter
•  SAP Business Objects Data Integrator (BODI)
•  SAP Business Objects Data Services
•  Oracle Warehouse Builder (OWB)
•  Oracle Data Integrator (ODI)
•  SAS Data Integration Studio
•  Microsoft SQL Server Integration Services (SSIS)
•  Ab Initio
•  SyncSort DMExpress
•  iWay DataMigrator
•  Pervasive Data Integrator
Freeware, Open Source ETL tools:
•  Pentaho Data Integration (Kettle)
•  Talend Integrator Suite
•  CloverETL
•  Jasper ETL

 

Data Integration Technique – Data Federation

Data Federation
Data federation is a category of data integration technology that provides the ability to query and aggregate data from disparate sources in a virtual database so it can be used by business intelligence, reporting, or analysis applications in real-time.  The virtual database created by data federation technology doesn’t contain the data itself.  Instead, it contains information or metadata about the actual data and its location. The actual data is physically left in place within its source data repository.

Data federation is used to create virtualized and integrated views of data and allows for the execution of distributed queries against multiple data sources (relational databases, enterprise applications, data warehouses, documents, XML) at the same time. Data federation allows access to data without physical movement of the data and provides a layer of abstraction above the physical implementation of the data.
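The sketch below imitates a federated query in miniature: two independent SQLite databases stand in for disparate sources, and a thin virtual layer combines their results at query time without copying the data into a warehouse. The schemas and names are hypothetical, and a real federation tool would of course do far more (query planning, pushdown, security).

```python
import sqlite3

# Two independent sources, left in place (here: two separate in-memory databases).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex')")

erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
erp.execute("INSERT INTO orders VALUES (1, 500.0), (1, 120.0), (2, 75.0)")

# The "federation layer": query each source on demand and join the results
# in memory, returning only the data that was requested.
def revenue_by_customer():
    names = dict(crm.execute("SELECT id, name FROM customers"))
    totals = erp.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    ).fetchall()
    return {names[cid]: total for cid, total in totals}

print(revenue_by_customer())  # -> {'Acme': 620.0, 'Globex': 75.0}
```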

Data federation is synonymous with, and commonly referred to by, the following terms:
•  Data Virtualization
•  Enterprise Information Integration (EII)
Pros of Data Federation
•  Access current and transactional data stored in multiple sources
•  Does not require movement of data (No ETL)
•  Only requested data returned
•  Real-time data access
•  Quicker development time – supports incremental development
•  Reduces data storage and data transfer

Cons of Data Federation

•  Still queries against original data sources
•  Only contains as much data as source system contains
•  If data is archived off source, data is no longer available in federation tool
•  Query performance is not as good as a data warehouse
•  Heavy query processing load is shifted from the database server onto the application (federation) server
Top Data Federation Tools
Below is a list of the most popular enterprise data integration tools providing data federation features:
•  SAP BusinessObjects Data Federator
•  Sybase Data Federation
•  IBM InfoSphere Federation Server
•  Oracle Data Service Integrator
•  SAS Enterprise Data Integration Server
•  JBoss Enterprise Data Services Platform
•  iWay Data Hub

 

 


Data Warehouse Appliance Vendors and Products

May 10, 2011 No Comments
Data warehouse appliances are purpose-built data warehouse solutions and systems that encompass a whole-technology stack including:
• Operating System (OS)
• Database Management System (DBMS)
• Server Hardware
• Storage Capabilities
Data warehouse appliances can be broken down into three main categories:
• Complete Data Warehouse Solutions
• Partial Data Warehouse Solutions / Software Appliances
• General Data Appliance / Complementary Database Engine

 

Complete Data Warehouse Solutions

Initially, DW appliances were created with proprietary, custom-built hardware and storage units. Netezza, Teradata, DATAllegro, and White Cross (now Kognitio) were the first vendors to provide solutions in this manner. Subsequently, data warehouse appliances evolved and started to utilize lower-cost, industry-standard, non-proprietary hardware components. The movement from proprietary to commodity hardware has brought down the cost of the data warehouse appliance, as commodity hardware can be integrated at a lower cost than developing and integrating proprietary hardware. Examples of commodity hardware typically include general-purpose servers from Dell, Hewlett Packard (HP), or IBM utilizing Intel processors, and popular network and storage hardware from Cisco, EMC, or Sun.

Data Warehouse Appliance - Complete Solution Vendors

Introduced in 2002, Netezza was the first vendor to offer a complete data warehouse appliance, so early definitions of the appliance were based upon Netezza's products. The Netezza Performance Server still provides all of the components of a data warehouse appliance, including the database, operating system, servers, and storage units. However, in 2009 Netezza replaced its own proprietary hardware with IBM blade servers and storage units, and in 2010 IBM completed a corporate acquisition of Netezza.

Similar to Netezza, DATAllegro launched in 2005 with a complete solution involving proprietary hardware. Soon after, DATAllegro replaced its own proprietary hardware with commodity servers from Dell and storage units from EMC. In 2008, Microsoft acquired DATAllegro and announced it would integrate DATAllegro's massively parallel processing (MPP) architecture into its own MS SQL Server platform, which also runs on commonly available hardware.

Additionally, both Kognitio and Teradata replaced the proprietary hardware within their appliances in a process similar to that of DATAllegro. Kognitio now offers a row-based, in-memory database called WX2 that does not include indexes or data partitions and runs on blade servers from IBM and Hewlett-Packard. Teradata provides a proprietary database, a variety of common operating systems (Linux, Unix, and Windows), and a proprietary networking subsystem packaged along with commodity processors and storage units.

Announced at the 2008 Oracle OpenWorld conference in San Francisco, the Oracle Exadata Database Machine is a complete package of database software, operating system, servers, and storage. The product was initially assembled in collaboration between Oracle Corporation and Hewlett Packard: Oracle developed the database, operating system, and storage software, while HP constructed the hardware. With Oracle's acquisition of Sun Microsystems, however, Oracle announced the release of Exadata Version 2 with improved performance and the use of Sun Microsystems storage and operating system technologies.

At the Sapphire conference in May, 2010 in Orlando, SAP announced the release of its new data warehouse appliance called HANA or High-Performance Analytic Appliance. SAP HANA is a combination of hardware, storage, operating system, management software, and in-memory data query engine that is characterized by data being held in RAM rather than being read from disks or flash storage.

Finally IBM bundles and integrates its own Infosphere Warehouse database software (formerly “DB2 Warehouse”) with its own servers and storage to deliver the IBM Infosphere Balanced Warehouse.

 

Partial Data Warehouse Solutions – Software Appliances

Starting in 2006, a new wave of vendors emerged that focus on database management systems (DBMS) purpose-built for data warehousing and designed to integrate easily with another vendor's hardware. The products from these "partial-solution" data warehouse appliance vendors have been coined "software appliances". The software appliance has proved to be a good starting point for new data warehouse vendors: it allows them to focus on their strength, database software, and to partner with commodity hardware vendors and benefit from the hardware firms' focus and resources.

Data Warehouse Appliance - Partial Solution Vendors

The partial-solution vendor is typically a software company that has developed a proprietary database and either partners with or is a subsidiary of one or more major hardware vendors. These software vendors generally focus on developing optimal database management software suitable for large data warehouse implementations, and commonly rely on other vendors to supply the most appropriate hardware.

 

Software appliance vendors with products based upon the traditional row-based database model:

Kognitio – Offers a row-based “virtual” database called WX2 that does not include indexes or data partitions and runs on blade servers from IBM and hardware from Hewlett-Packard. Kognitio can be deployed to a customer in one of three ways: as a software-only license solution, as a fully configured data warehouse appliance, and on-demand via a Data Warehousing as a Service (DaaS) offering.

 

Software appliance vendors with products based upon the columnar database model in which data is stored in columns rather than rows:

ParAccel – The ParAccel Analytic Database is a proprietary column-based database that combines patent-pending innovations with a columnar, compressed, massively parallel, all-in-memory-capable architecture. The database is available as a packaged appliance on Sun hardware known as Ultra-Fast Data Warehouse powered by Sun and ParAccel.
Vertica – The Vertica Analytic Database is a grid-based, column-oriented database that runs on grids of commodity Linux-based servers. It features column-oriented storage organization, compression, hybrid storage organization, and a shared nothing architecture. In March 2011, Hewlett Packard (HP) completed an acquisition of Vertica Systems, Inc.
EXASOL – Offers a proprietary, massively parallel database management system called EXASolution. The product is column-based and makes extensive use of in-memory processing. Join indexes are automatically generated at run-time, as required. EXASolution is available either as a software only solution or as a pre-configured appliance running on commodity hardware such as Fujitsu Siemens, HP, Supermicro or Hitachi. EXASolution uses its own version of Linux which provides extended capabilities to support its clustered architecture.
Sybase IQ – Distinguishes itself from conventional databases with its column-oriented, grid-based architecture, patented data compression, and advanced query optimizer. Sybase IQ provides significantly faster query performance than traditional database environments which were designed primarily for processing transactions. The improvement over such solutions can be 10 to 1000 times greater, even as thousands of users and years worth of data are added to the environment. Announced in May 2008, Sybase Analytical Appliance combines pSeries hardware from IBM with the Sybase IQ database.

 

Software appliance vendors with products based upon a hybrid row and column database:

EMC Greenplum – Acquired in July 2010 by EMC, Greenplum has become the foundation of EMC’s Data Computing Division. Moreover, the EMC Greenplum database contains a shared-nothing, massively parallel processing (MPP) architecture that has been purpose-built for business intelligence and analytical processing.
Aster Data – Provides a database management system (DBMS) called Aster Data nCluster, which is a massively parallel (MPP) hybrid row and column database with an integrated analytics engine. Aster Data’s solution utilizes Aster Data’s patent-pending SQL-MapReduce to allow for parallel processing of data and applications and to deliver rich, analytic insights. Aster Data products can be deployed to a customer in one of three ways: Software-only, Cloud-based, and Complete-appliance called Aster Data MapReduce DW Appliance. In April 2011, Teradata completed its acquisition of Aster Data Systems, Inc.

 

General Data Appliance / Complementary Database Engine

A completely unique data warehouse appliance is provided by Dataupia. Rather than giving an organization a complete data warehouse solution, Dataupia's product complements existing data warehouse deployments within an organization.

Data Warehouse Appliance - Complementary Vendor

Dataupia's flagship product, the Satori Server, is a data warehouse appliance in the sense that it combines software and hardware components on rack-mounted server hardware. Additionally, the Satori Server includes an embedded copy of the Linux operating system, a custom database engine, an aggregation engine, built-in storage, and parallel processors. But in contrast to full data warehouse appliances that are purpose-built for data warehousing, the Satori Server interoperates with existing database management systems rather than requiring their replacement.

The main feature of the Satori Server that makes it unique amongst data warehouse appliances is "omniversal transparency". This feature shields applications and users from changes to the underlying technology and database platform by complementing Oracle, MS SQL Server, or DB2 databases and their dependent applications. Unlike other solutions that displace existing technologies, the Satori Server's design extends and enhances an existing information ecosystem. Further, Dataupia's Satori Server is usually added to an existing data warehouse infrastructure to augment an organization's existing data management systems as a query accelerator or capacity extender.

 

 


Modeling Slowly Changing Dimensions in Data Warehouses

Jan 24, 2011 2 Comments

Modern data warehouse design assumes that business transactions such as sales, orders, shipments, and receivables can occur at a rapid rate and that the details of each transaction need to be recorded. Hence a fact table within a dimensional model contains a separate record for each business transaction. In contrast, the descriptive or text-based values of the transaction (the dimension attributes) often remain fairly constant, and dimension tables within the dimensional model often do not take changes into account.

But in reality, some dimensional values do change over time and numerous fields of a given row within a dimension table will need to be updated. This phenomenon in data modeling is known as “slowly changing dimensions” and it can be applied to any dimension table within a data warehouse schema. Moreover, both simple and advanced modeling techniques have been established and can be implemented for handling updates and changes within a dimension table. In addition, slowly changing dimensions assist the data warehouse in precisely recording the past values, providing an efficient method for tracking history, and allowing for the ability to respond to changes to descriptive values of transactions.

Examples of slowly changing dimensions include:
–  account name
–  customer phone number
–  vendor address
–  product description

These are good examples as they are text-based values that remain relatively constant, but can change and commonly do change over time.  Names, phone numbers, addresses are fairly intuitive and it is easy to see how these values can change slowly over time.  But let’s see how a product description could change…  A simple ingredient change or a packaging change in a product may be so trivial that the organization does not decide to give the product a new product id.  Rather the source system provides the data warehouse with a revised description of the product.  Hence the data warehouse needs to track both the old and new descriptions of the product.

 

“The data warehouse must accept the responsibility of accurately describing the past. By doing so, the data warehouse simplifies the responsibilities of the OLTP system. Not only does the data warehouse relieve the OLTP system of almost all forms of reporting, but the data warehouse contains special structures that have several ways of tracking historical data.”
Ralph Kimball, PhD

Originally pioneered by Ralph Kimball, PhD, four main data modeling techniques have been established for managing dimension tables that contain slowly changing dimensions:

Type One – Overwrite the Record
Type Two – Update Record to Inactive / Create an Active Record
Type Three – Leverage Previous and Current Value Fields
Type Four – Insert Into a History Table

These four data modeling techniques range from the complete loss of historical data with relative ease of implementation to methods of saving almost all historical data but containing more technical challenges.  Choice of the appropriate technique by the database designer can ensure that the data warehouse contains required historical values and allows for comparisons of current data or data from other time periods.

 

Type One Modeling – Overwrite the Record

Type one modeling is very simple and effectively handles updates to incorrect and/or outdated values within the dimension table. Type one models do not retain a history of changes and do not store previous values in any way. With this modeling technique, the data field that has changed is simply updated to reflect the most current value.  Since this technique does not preserve any previous value, it should only be used when there is no requirement to retain historical data.

Type one models make sense when correcting data issues, but not when the system requires retrieval of historical values. Without a doubt, type one is the simplest method for handling slowly changing dimensions. However, best practice is to utilize this technique on a very limited basis; it should only be implemented when there is no need to track the history of changes.

Example of Type One Model
First an electronics company loads its list of products within its data warehouse.  Later the company discovers that a television which was originally manufactured in the USA is now manufactured in China, and the company has in fact never received a shipment of the older American-made televisions.  The product description column in the table is simply updated, and the date and time of the update is entered in the “Last Update” column.

Type One - Slowly Changing Dimension

To correct the product description, the existing description is simply overwritten with the correct value, and the “Last Update” field is revised with the current date. Moreover, the product description is updated because the item is no longer manufactured in the United States. Utilizing the type one method of managing slowly changing dimensions, the record on the product dimension table is updated in-place in the following manner…
–  The product description is updated in-place to the new description.
–  The last update is updated to the date and time of the change.
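A minimal sketch of a type one update against a product dimension table, using an in-memory SQLite database as a stand-in; the table and column names follow the example above but are otherwise hypothetical.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_key INTEGER, description TEXT, last_update TEXT)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Television, made in USA', '2008-01-01')")

# Type one: overwrite the attribute in place; no history is retained.
conn.execute(
    "UPDATE dim_product SET description = ?, last_update = ? WHERE product_key = ?",
    ("Television, made in China", date.today().isoformat(), 1),
)

print(conn.execute("SELECT * FROM dim_product").fetchall())
```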

 

Type Two Modeling – Update Record to Inactive / Create an Active Record

Type two modeling is a very reliable and straightforward technique for preserving history of changes to dimensional tables.  It is commonly utilized when a full or partial set of the previous values of a dimension’s attributes must be retained.  In this modeling technique, every time an update occurs to any of the values in a dimensional table, a new record is physically inserted as active or current into the same table and the old records are marked inactive or historical.

The main advantage of type two modeling is that it can accommodate and record a massive amount of history and nearly unlimited changes to slowly changing dimensions. But this advantage can also become a significant drawback. Within this modeling technique, it is possible for dimension tables to grow to massive sizes and adversely affect both system and query performance. In addition, since every change also requires an update to the linked fact table, implementation is fairly difficult and may require substantial effort to design, develop, and support.

Two unique sub-methods have been established for distinguishing the active (or current) record from the inactive (or historical) records:  active flagging and tuple versioning.

In the type two – active flagging sub-method, an “Active” column is constructed within the dimensional table and acts as a flag. Flags are binary fields with permissible values being Y and N, T and F, or 1 and 0.  A positive value (Y, T, 1) indicates an active record, while a negative value (N, F, 0) indicates an inactive record.

Example of Type Two – Active Flagging Sub-Method

Suppose that a vendor changes his phone number from 202-555-6555 to 858-555-8639 because the phone company has added a new area code.  Using active flagging, the change to the vendor dimension table would be processed as follows…

Type Two - Active Flagging - Slowly Changing Dimension

•  The current active record in the vendor dimension table is updated:
–  The active flag is changed from T to F.
–  The phone number field remains as 202-555-6555.
•  A new record is inserted into the vendor dimension table:
–  The vendor key is given the next available integer number.
–  The vendor name field remains the same as in the old record.
–  The phone number has the new value of 858-555-8639.
–  The active flag is set to T.

In the type two – tuple versioning sub-method, start date and end date columns are included on the dimension table. The values of these date fields then define the period during which that record has been active.  For the most part, the start date is the date the record has been created.  Moreover, the end date is the date the record has become inactive, either because a newer record has replaced it or because the original record in the source system no longer exists.  On the active record, the end date will either be left null, blank, or identified in another way such as all-nines, according to modeler’s preferences and standards.

The main advantage of the tuple versioning method over active flagging is that it provides an audit trail of the date and sequence of all updates.

Example of Type Two – Tuple Versioning Sub-Method

Continuing with the same example above but using the tuple versioning method, the updated phone number in a record in the vendor dimension table is processed as follows:

Type Two - Tuple Versioning - Slowly Changing Dimension

•  The current active record in the vendor dimension table is updated:
–  The end date is updated to the date of the change.
–  The phone number field remains as 202-555-6555.
•  A new record is inserted into the vendor dimension table:
–  The vendor key is given the next available integer number.
–  The vendor name field remains the same as in the old record.
–  The phone number has the new value of 858-555-8639.
–  The start date is the date the record is inserted into the table.
–  The end date is intentionally left blank.
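Here is a sketch of the tuple-versioning variant of type two: the currently active record is closed out by setting its end date, and a new active record is inserted with the next surrogate key and the new value. The schema and dates are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_vendor (
    vendor_key INTEGER, vendor_name TEXT, phone TEXT, start_date TEXT, end_date TEXT)""")
conn.execute("INSERT INTO dim_vendor VALUES (7, 'Acme Supply', '202-555-6555', '2005-03-01', NULL)")

def update_phone_type2(conn, vendor_name, new_phone, change_date):
    # Close out the currently active record for this vendor.
    conn.execute(
        "UPDATE dim_vendor SET end_date = ? WHERE vendor_name = ? AND end_date IS NULL",
        (change_date, vendor_name),
    )
    # Insert a new active record with the next surrogate key and the new value.
    next_key = conn.execute("SELECT MAX(vendor_key) + 1 FROM dim_vendor").fetchone()[0]
    conn.execute(
        "INSERT INTO dim_vendor VALUES (?, ?, ?, ?, NULL)",
        (next_key, vendor_name, new_phone, change_date),
    )

update_phone_type2(conn, "Acme Supply", "858-555-8639", "2008-12-15")
for row in conn.execute("SELECT * FROM dim_vendor ORDER BY vendor_key"):
    print(row)
```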

 

Type Three Model – Leverage Previous and Current Value Fields

Type three models are defined by adding one or more columns to the dimension table, so that both the new (current or active) and old (historical or inactive) values of a field are stored.  In addition, multiple previous values can be stored within the table, depending on how many previous-value columns are included in the dimension table.

Type three modeling is a middle ground between the complete history loss of type one models and the numerous additional records of type two models.  However, type three models provide only a limited view of history, since only a predefined number of previous values of any attribute can be retained rather than a complete history. This modeling technique is fairly useful when changes to the dimension table are made on a regular schedule, such as annually, and when archival copies of the database are stored offline for historical and audit purposes. Type three modeling is not so useful when changes are more frequent and unpredictable. Multiple previous-value fields could be added to the record to provide a longer historical trail, but it may be a challenge to design the table with the optimal number of previous fields. For the most part, type three modeling makes sense only when there are a limited number of previous values that need to be retained.

Continuing with the examples of type two models, the phone number of a vendor changes from 202-555-6555 to 858-555-8639 because the phone company has added a new area code. But in this model the vendor dimension table contains a current phone and previous phone fields.  Initially the vendor record contains a null value for the previous phone number and a value of 202-555-6555 for the current phone number.

Type Three - Slowly Changing Dimension

In order to process this update using the previous and current fields, the vendor dimension table is updated:
–  The previous phone field is updated to 202-555-6555, which is now the vendor’s last previous phone number.
–  The current phone field is updated to 858-555-8639, which is now the vendor’s current number.
–  The effective date is updated to the date of the change.
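A sketch of the type three update: the current value is shifted into the previous-value column before it is overwritten, so exactly one prior value survives. Column names follow the example above; everything else is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_vendor (
    vendor_key INTEGER, vendor_name TEXT,
    current_phone TEXT, previous_phone TEXT, effective_date TEXT)""")
conn.execute("INSERT INTO dim_vendor VALUES (7, 'Acme Supply', '202-555-6555', NULL, '2005-03-01')")

# Type three: keep exactly one prior value by shifting current -> previous,
# then overwriting current with the new value.
conn.execute(
    """UPDATE dim_vendor
       SET previous_phone = current_phone,
           current_phone  = ?,
           effective_date = ?
       WHERE vendor_key = ?""",
    ("858-555-8639", "2008-12-15", 7),
)

print(conn.execute("SELECT * FROM dim_vendor").fetchall())
```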

 

Type Four Model – Insert Into a History Table

Type four models, also known as leveraging "history tables", are the most technically sophisticated of the four models and may be the most difficult to implement.  This modeling technique provides for nearly unlimited tracking of historical records while having lower storage requirements than type two models.  Rather than storing the changes in the same table, a second "history" table is created which stores only the changing values of slowly changing dimensions.

Similar to type two models, type four models accommodate infinite changes to dimensional fields and create an additional record for every change to a dimensional attribute. But in contrast to type two, type four models capture every change as a new record in a relatively compact history table, which is subsequently more efficient at storing a large amount of historical data.

Another key advantage of type four models is an efficient manner to query against a timeframe as the related search index only requires two fields (key and date fields).  Other modeling techniques require more fields in the search index for date queries.  Thus the search index utilizing a type four model is smaller, more intuitive, and quicker to retrieve the relevant record in the dimension table than the search index using other modeling techniques.

Type four models do have some important disadvantages.  Namely type four models require implementation of multiple tables, are less intuitive for query developers, require more effort to develop and maintain than other types of models, and allow for history tables to grow to massive size.

Continuing with the examples of type two and type three models and utilizing the history table of type four models, the vendor dimension table would be updated and a new record will be inserted into the vendor history table in the following manner…

Type Four - Slowly Changing Dimension

•  The vendor dimension table is updated:
–  The phone number is updated from 202-555-6555 to 858-555-8639.
•  A new record is inserted into the vendor history table:
–  The vendor key is copied from the vendor dimension table.
–  The phone number 858-555-8639 is inserted.
–  The effective date of 12/15/2008 is inserted.
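A sketch of the type four approach: the dimension table is overwritten in place (as in type one), while the change is also appended, with its effective date, to a separate and compact history table. The schema is illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_vendor (vendor_key INTEGER, vendor_name TEXT, phone TEXT)")
conn.execute("CREATE TABLE dim_vendor_history (vendor_key INTEGER, phone TEXT, effective_date TEXT)")
conn.execute("INSERT INTO dim_vendor VALUES (7, 'Acme Supply', '202-555-6555')")

def update_phone_type4(conn, vendor_key, new_phone, effective_date):
    # Overwrite the dimension table in place...
    conn.execute("UPDATE dim_vendor SET phone = ? WHERE vendor_key = ?", (new_phone, vendor_key))
    # ...and append the change to the history table, keyed by vendor and date.
    conn.execute(
        "INSERT INTO dim_vendor_history VALUES (?, ?, ?)",
        (vendor_key, new_phone, effective_date),
    )

update_phone_type4(conn, 7, "858-555-8639", "2008-12-15")
print(conn.execute("SELECT * FROM dim_vendor").fetchall())
print(conn.execute("SELECT * FROM dim_vendor_history").fetchall())
```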

 

 


Basic Concepts of EDW and Data Marts

Dec 03, 2010 1 Comment
Data Warehouse Concept
Data Warehouses (DW) are read-only, integrated databases designed to provide insight into past organizational performance and project future results.  Further, they can be defined as highly-optimized data repositories that include data sourced from numerous transactional, legacy, external systems, applications, data feeds, and data files that exist throughout an organization.

In practice, the data warehouse is designed, organized, and optimized for retrieval and analysis of data while providing managers, executives, and other decision makers with a single and complete view of the truth.  Moreover, the data warehouse is a data repository completely designed for decision-support, analytical-reporting, managing performance, and the conversion of raw data into information. The data warehouse is constructed with a focus on data retrieval and the data warehouse enables complex queries to be performed without any impact on the systems that support the business’ primary transactions (i.e. transactional, operational, and mainframe applications).  In contrast to its source databases, the data warehouse is a collection of data designed to support management decision making and contains a wide variety of data that business personnel can use to gain a better understanding of business conditions.

Fundamentally, a data warehouse is a unique database in which data is collected for analysis and to support management decision making.  A data warehouse is defined by its purpose, and it comprises data that is subject-oriented, integrated, time-variant, and non-volatile.  This means the data warehouse is focused on a business concept, such as sales, orders, or fulfillments, as opposed to a specific business process.

There are two fundamental types of data warehouses that exist:
• Enterprise Data Warehouses (EDW)
• Data Marts

Overview of Enterprise Data Warehouse (EDW) and Data Marts

Enterprise Data Warehouse (EDW): The enterprise data warehouse is a centralized data store which provides critical analytical data for an entire organization. The enterprise data warehouse is typically a large, organization-wide database repository that crosses every business function and includes data from every organizational unit, division, and department.  The enterprise data warehouse represents a truly corporate effort, and it affects and benefits multiple organizational units and departments. Typically the implementation of an enterprise data warehouse is initiated and driven by a centrally coordinated organization, such as the information technology (IT) department, and can have a lengthy implementation schedule (usually 2-5 years). The organizational and business impact of an enterprise data warehouse tends to be quite substantial, as multiple organizational units, departments, and divisions are affected and can subsequently establish a total 360-degree perspective. In essence, an enterprise data warehouse is a substantially large repository of historical and current transaction data for an entire organization.

As an implementation of an enterprise data warehouse is commonly strategic in nature, the volume of data in enterprise data warehouses tends to be quite large.  Enterprise data warehouses can contain hundreds of gigabytes, terabytes, and sometimes even petabytes of data.

Data Mart: A data mart is a collection of subject areas organized for decision support based on the needs of a given department or office. Each department within a larger organization may have its own requirements for a data mart and each department’s data mart will be unique to and specific to its own needs. Many times, a data mart will serve as the reporting and analytical solution for a particular department within an organization, such as accounting, sales, customer service, and/or marketing.

The data mart contains only a modicum of historical information and is granular only to the point that it suits the needs of the department. The data mart is typically housed in multi-dimensional technology which is great for flexibility of analysis but is not optimal for large amounts of data.

Typically designed for more tactical and quick-strike purposes, data mart implementations are commonly focused on solving a particular business issue or meeting an individual department’s needs. The data mart also usually contains data feeds from a small number of source systems, includes a manageable volume of data, and is exemplified by a rapid development and deployment schedule (usually 3-6 months). For the most part, data marts are designed with just enough data entities, fields, and records to satisfy one department’s requirements.

There are two kinds of data marts that exist — dependent and independent:

•  A dependent data mart is one whose source is another data warehouse, and all dependent data marts within an organization are typically fed by the same source — the enterprise data warehouse.
•  An independent data mart is one whose source is directly from transactional systems, legacy applications, or external data feeds.

 

 


Overview of a Data Warehouse Appliance

Nov 29, 2010 No Comments

By definition, a data warehouse appliance is a complete hardware and software solution that contains a fully integrated stack of processors, memory, storage, operating system, and database management software.  The data warehouse appliance is typically constructed to be optimized for enterprise data warehouses, designed to handle massive amounts of data and queries, and designed to scale and grow over time.

At its core, a data warehouse appliance simplifies the deployment, scaling, and management of the database and storage infrastructure, as it provides a self-managing, self-tuning, plug-and-play database management system that can be scaled out in a modular manner. Moreover, data warehouse appliances effectively integrate hardware, storage, databases, and operating systems into one unit and allow for accelerated and cost-effective implementations of immense analytical systems. Fundamentally, a data warehouse appliance provides a viable solution for the mid-to-large volume data warehouse market and offers cost-effective solutions for multi-terabyte and petabyte data environments.

The origin of the term data warehouse appliance can be traced back to 2002 or 2003 when it was coined by Foster Hinshaw, founder of Netezza and founder and current CEO of Dataupia.

Fundamentally, a data warehouse appliance is a fully-integrated solution that contains …
• Database Management System (DBMS)
• Server Hardware
• Storage Capabilities
• Operating System (OS)

The purpose for which the data warehouse appliance was invented is to simplify the implementation, scaling, and administration of the database and storage environment. Therefore appliances typically come bundled with a self-managing, self-tuning, plug-and-play database and storage system that can be expanded and scaled in a modular and cost-effective manner. Four basic criteria can be used to describe and compare the ways in which data warehouse appliances are designed to fulfill this purpose:

•  Performance Optimization:  Data warehouse appliances are designed to provide optimal performance on large-block reads, long table scans, complex queries, and other advanced and demanding tasks required in large analytical environments.

•  High-Scalability:  Data warehouse appliances are designed to perform well on large datasets that grow massively over time, and are designed specifically for databases in the multi-terabyte to petabyte range (thousands of gigabytes and beyond).

•  High-Availability:  Data warehouse appliances are designed with fail-safe and fault-tolerant architectures, so they are not susceptible to any single point of failure and can be expected to have planned up-time close to 100%.

• Simplified Management:  Data warehouse appliances include intuitive tools and controls to conduct systems administration. Therefore they require minimal knowledge to maintain and have a low learning curve to conduct basic administrative tasks including installation, setup, configuration, tuning, and support.

The primary factor for the platform scalability and large-query optimization of the data warehouse appliance is its massively parallel processing (MPP) architecture. These MPP architectures are comprised of numerous independent processors or servers that all execute in parallel.  Also known as a “shared nothing architecture”, the MPP appliance architecture is characterized by a concept in which every embedded server is self-sufficient and controls its own memory and disk operations.  Further, the MPP architecture effectively distributes data amongst a number of dedicated disk storage units connected to each server in the appliance.  Computations are moved as close to the data as possible and data is logically distributed amongst numerous system nodes.

Almost all large operations in the data warehouse appliance environment, including data loads, query processes, backups, restorations, and indexing are executed completely in parallel. This divide-and-conquer approach allows for extreme high performance and allows for systems to scale linearly as new processors can easily be added into the environment.
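As a toy illustration of this divide-and-conquer idea, the sketch below splits a dataset into shared-nothing partitions, computes partial aggregates in parallel worker processes, and then merges the partial results. It is purely illustrative; a real appliance distributes the work across physical nodes, each with its own memory and disks.

```python
from multiprocessing import Pool

def partial_sum(partition):
    """Each 'node' aggregates only the partition of data it owns."""
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    num_nodes = 4
    # Shared-nothing style: each partition is handed to exactly one worker.
    partitions = [data[i::num_nodes] for i in range(num_nodes)]

    with Pool(num_nodes) as pool:
        partial_results = pool.map(partial_sum, partitions)

    # A final, cheap merge step combines the partial aggregates.
    print(sum(partial_results))  # same result as sum(data)
```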

Data warehouse appliances have numerous benefits over more traditional and conventional data warehousing systems, including:
•  Minimal cost-of-entry for data warehouse initiatives
•  Complete data warehouse hardware and software environments
•  Massively parallel processing (MPP) architectures
•  Single hardware and software vendor
•  Tight integration of all data warehouse system components
•  Consolidated system administration of entire environment
•  Scalable for high-capacity and high-performance
•  Built-in high-availability and fault-tolerance
•  Rapid implementation and time-to-value

 

 
