
Data Vault Data Model for EDW

The Data Vault data model provides an evolution in the way that enterprise data warehouses (EDW) are designed, constructed, and deployed. It offers a new and powerful way to model data within an enterprise data warehouse that is more scalable, extensible, and auditable than conventional modeling techniques. It is an ideal modeling technique for EDWs that have a large volume of data, a large number of disparate data sources, a high degree of variety in the ingested data, or a large number of diverse information consumers. Fundamentally, the Data Vault is a hybrid data modeling technique that provides historical data representation from multiple sources and is designed to be resilient to environmental change.

The Data Vault model provides a method of looking at historical data that addresses issues such as auditing, data lineage, loading speed, and resilience to change, while emphasizing the need to trace where all the data in the database came from. This means that every row in a data vault is accompanied by record source and load date attributes, enabling an auditor to trace values back to the source.

According to Dan Linstedt, creator of Data Vault data modeling, Data Vault is defined as:

The data vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.

Data Vault Tables

The Data Vault model is based upon three distinct table types, with each table type serving a specific purpose:

•  Hub tables store a unique business key for each business entity.
•  Link tables store relationships between hub tables.
•  Satellite tables store the attributes of the business key.

Hub Table:
Consists of a list of unique business keys that represent a distinct way of identifying a business element, as well as fields that describe the origin of the record. Additionally, hub tables cannot contain any foreign key fields. Types of fields included in the hub table are:

1) Hash Key, which serves as the primary key of the hub table in a hash format.
2) Load Date, which records the date and time that the record was inserted into the hub table.
3) Record Source, which records the name of the data source from which the record originated.
4) Business Key, which is the unique identifier of the business entity as either a text or number value and can comprise more than one field in the table.
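As a rough illustration, a hub table for a hypothetical Customer entity might look like the following SQLite sketch driven from Python; the table and field names (hub_customer, customer_id, and so on) are illustrative assumptions, not a prescribed standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical hub table for a Customer business entity.
conn.execute("""
    CREATE TABLE hub_customer (
        hub_customer_hk TEXT PRIMARY KEY,      -- hash key
        customer_id     TEXT NOT NULL UNIQUE,  -- business key
        load_date       TEXT NOT NULL,         -- date/time the row was inserted
        record_source   TEXT NOT NULL          -- originating system
    )
""")
```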

Link Table:
Establishes relationships between business keys, contains foreign keys to hub tables, and includes fields about the relationship. A link table is therefore an intersection of business keys and contains the fields that represent the business keys from the related hub tables. The purpose of the link table is to capture and record the relationship of data elements at the lowest possible level of granularity. A link table must relate to at least two hub tables. Types of fields included in the link table are:

1) Hash Key, which serves as the primary key of the link table in a hash format.
2) Foreign Keys, which provide references to the primary hash keys of the related hub tables.
3) Business Keys, which provide a copy of the business key values from the related hub tables.
4) Load Date, which records the date and time that the record was inserted into the link table.
5) Record Source, which records the name of the data source from which the record originated.
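Continuing the illustrative schema from above, a hypothetical link table relating customers to orders might be sketched as follows; the stub hub tables exist only so the foreign key references resolve, and all names remain assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stub hub tables so that the foreign key references resolve.
    CREATE TABLE hub_customer (hub_customer_hk TEXT PRIMARY KEY);
    CREATE TABLE hub_order    (hub_order_hk    TEXT PRIMARY KEY);

    -- Hypothetical link table recording which customer placed which order.
    CREATE TABLE link_customer_order (
        link_customer_order_hk TEXT PRIMARY KEY,                -- hash key
        hub_customer_hk TEXT NOT NULL REFERENCES hub_customer,  -- FK to hub
        hub_order_hk    TEXT NOT NULL REFERENCES hub_order,     -- FK to hub
        customer_id     TEXT NOT NULL,   -- copied business key
        order_number    TEXT NOT NULL,   -- copied business key
        load_date       TEXT NOT NULL,
        record_source   TEXT NOT NULL
    );
""")
```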

Satellite Table:
Comprised of all the fields that describe a business entity or relationship and provide context for a hub or link at a given time or over a time period. Satellite tables consist of a foreign key linking them to a parent hub or link table, fields that describe the origin of the record, and start and end date fields. The structure and concept of a satellite table are very much like a type 2 slowly changing dimension in a dimensional model: a satellite table stores the history of changes, and change data capture (CDC) is conducted within it. Types of fields included in the satellite table are:

1) Parent Hash Key, which provides a foreign key reference to a hub or link table and is one of two fields included in the primary key of the table.
2) Start Date, which indicates the date and time that the satellite record became active and is the second of two fields included in the primary key of the table.
3) Record Source, which records the name of the data source from which the record originated.
4) Load End Date, which indicates the date and time that the satellite record became inactive.
5) Extract Date, which indicates the date and time that the record was extracted from the source system.
6) Hash Diff, which is a hash value of all of the descriptive values of a record.
7) Descriptive Fields, which are any fields that provide more detail about the entity.
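A hypothetical satellite for the customer hub might then look like the sketch below, with the two-field primary key (parent hash key plus start date) that lets each attribute change become a new row; field choices are again illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (hub_customer_hk TEXT PRIMARY KEY);

    -- Hypothetical satellite holding descriptive customer attributes.
    -- Each new version of the attributes becomes a new row, keyed by
    -- the parent hash key plus the start date.
    CREATE TABLE sat_customer (
        hub_customer_hk TEXT NOT NULL REFERENCES hub_customer,
        start_date      TEXT NOT NULL,
        load_end_date   TEXT,           -- NULL while the row is active
        extract_date    TEXT,
        record_source   TEXT NOT NULL,
        hash_diff       TEXT NOT NULL,  -- hash of all descriptive values
        customer_name   TEXT,           -- descriptive fields follow
        customer_phone  TEXT,
        PRIMARY KEY (hub_customer_hk, start_date)
    );
""")
```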

Business Keys and Hash Keys

Business Key: A text or numeric value that must be declared as a unique or alternate key constraint within the hub table. This means that each distinct value can exist only once within the entire hub table. The business key does not necessarily have to be just one field within the hub table; it can be a compound key made up of more than one column. Business keys are also included as non-unique fields within link tables. True business keys are not tied to any one source system and can be contained within multiple source systems. Examples of typical business keys include: account number, product code, customer number, employee id, invoice number, and order number.

Hash Key: One of the innovations within the latest version of the Data Vault model is the replacement of standard integer primary keys, or surrogate keys, with hash-based primary keys. This feature enables a Data Vault solution to be deployed either on a relational database management system (RDBMS) or on Hadoop systems. Hadoop systems do not have surrogate key generators like an RDBMS, but a unique MD5 hash value can be generated in Hadoop. Because the same hash key can be computed in both Hadoop and an RDBMS, tables on the two platforms can be logically joined.
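As a minimal sketch of how such a hash key might be derived from a business key, consider the following Python snippet using MD5; the normalization rules and delimiter shown are assumptions, and whatever rules are chosen must be applied identically on every platform that computes the key.

```python
import hashlib

def hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    """Derive a deterministic MD5 hash key from a (possibly compound) business key.

    Normalization (trim + uppercase) and the delimiter are illustrative
    choices; real implementations must fix them identically everywhere.
    """
    normalized = delimiter.join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

print(hash_key("CUST-00042"))                   # single-field business key
print(hash_key("CUST-00042", "ORD-2017-0001"))  # compound key, e.g. for a link
```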

Data Vault and Information Marts

The Data Vault layer within an EDW is normally used to store data, and data is never deleted from the data vault unless there is a technical error while loading. Additionally, the Data Vault layer is not optimized for business intelligence and analytical use; dimensional modeling is much better suited for that purpose. Consequently, information marts contain dimensional models and are the source for data analytics. In order to be used by end users, data contained in the data vault needs to be converted to a dimensional model and moved into the related information marts. Dimension tables in information marts are then sourced from data vault hub and related satellite tables, while fact tables in information marts are sourced from data vault link and related satellite tables. Once the data has been converted into dimensional models and moved into information marts, business intelligence tools including SAP BusinessObjects, Cognos, OBIEE, Tableau, SSAS, Power BI, and Qlik Sense can be used by end users to conduct analytics on the data.
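A toy sketch of that conversion, using the illustrative schema from earlier: the currently active satellite rows are joined to the hub to produce dimension rows. A real load process would also handle type 2 history, surrogate keys, and incremental processing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (hub_customer_hk TEXT PRIMARY KEY,
                               customer_id TEXT NOT NULL UNIQUE);
    CREATE TABLE sat_customer (hub_customer_hk TEXT NOT NULL,
                               start_date TEXT NOT NULL,
                               load_end_date TEXT,
                               customer_name TEXT,
                               PRIMARY KEY (hub_customer_hk, start_date));
    INSERT INTO hub_customer VALUES ('a1b2', 'CUST-00042');
    INSERT INTO sat_customer VALUES ('a1b2', '2008-01-01', '2008-12-15', 'Acme Corp.');
    INSERT INTO sat_customer VALUES ('a1b2', '2008-12-15', NULL, 'Acme Corporation');
""")

# The active satellite row (load_end_date IS NULL) supplies the current
# attribute values for the customer dimension.
rows = conn.execute("""
    SELECT h.customer_id, s.customer_name
    FROM hub_customer h
    JOIN sat_customer s ON s.hub_customer_hk = h.hub_customer_hk
    WHERE s.load_end_date IS NULL
""").fetchall()
print(rows)  # [('CUST-00042', 'Acme Corporation')]
```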

Summary

The best use for the Data Vault data model is within the enterprise data warehouse (EDW) of a complete business intelligence environment; indeed, the Data Vault model is specifically designed for this purpose. It is the ideal data modeling technique for databases that store large volumes of highly volatile data, contain a high degree of data variety, and contain data from multiple disparate sources.

However, the Data Vault model is only one part of the complete Data Vault architecture which contains three layers:

1) Data Staging Area which contains a copy of source data.
2) EDW which is designed using the Data Vault model.
3) Information Marts which are designed using a dimensional model and are the source of data for end-user analysis.


Column and Row Based Database Storage

Row-Based Database Storage:
The data sequence consists of the data fields in one table row.
Column-Based Database Storage:
The data sequence consists of the entries in one table column.

Conceptually, a database table is a two-dimensional data structure with cells organized in rows and columns. However, computer memory is organized as a linear sequence, so for storing a database table in linear memory one of two options must be chosen: row-based storage or column-based storage. Row-based storage stores a sequence of records, each containing the fields of one row in the table. In column-based storage, the entries of a column are stored in contiguous memory locations.
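A toy Python illustration of the two layouts: the same three-row table stored row-wise as a list of records and column-wise as one sequence per column.

```python
# The same logical table in the two physical layouts.
# Row-based: each record's fields are contiguous.
row_store = [
    (1, "Alice", "DE"),
    (2, "Bob",   "US"),
    (3, "Carol", "DE"),
]

# Column-based: each column's entries are contiguous.
column_store = {
    "id":      [1, 2, 3],
    "name":    ["Alice", "Bob", "Carol"],
    "country": ["DE", "US", "DE"],
}

# Fetching a whole record is natural in the row layout...
print(row_store[1])  # (2, 'Bob', 'US')

# ...while scanning a single column is natural in the column layout.
print(sum(1 for c in column_store["country"] if c == "DE"))  # 2
```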

Row-based database systems are designed to efficiently return data for an entire row, or record, in as few operations as possible. This matches the common use case in which the system is attempting to retrieve information about a particular object. It is particularly useful for transactional systems that perform large numbers of inserts, updates, and deletes of records.

Column-based database systems store all of the values of one column together, then the values of the next column, and so on. Within this layout, any one of the columns closely matches the structure of an index in a row-based system. The goal of a columnar database is to efficiently write and read data to and from hard disk storage in order to speed up the performance of select queries. It is particularly useful for systems that perform large amounts of analytics.


Row-based storage is recommended for transactional systems or when:

•  The table has a small number of rows, such as configuration tables.
•  The application needs to conduct updates of single records.
•  The application typically needs to access the complete record.
•  The columns contain mainly distinct values so the compression rate would be low.
•  Aggregations and fast searching are not required.

Column-based storage is recommended for analytical systems or when:

•  Calculations are executed on a single column or a few columns only.
•  The table is searched based on the values of a few columns.
•  The table has a large number of columns.
•  The table has a large number of records.
•  Aggregations and fast searching on large tables are required.
•  Columns contain only a few distinct values, resulting in higher compression rates.
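The compression point in the last bullet can be made concrete with run-length encoding, one common columnar compression technique: a sorted, low-cardinality column collapses into a handful of (value, count) pairs, and some aggregations can even operate on the compressed form directly. A toy sketch:

```python
from itertools import groupby

# A low-cardinality column, as it might be stored (sorted) in a column store.
country = ["DE"] * 4 + ["US"] * 3 + ["FR"] * 2

# Run-length encoding: each run of equal values becomes (value, count).
encoded = [(value, len(list(run))) for value, run in groupby(country)]
print(encoded)  # [('DE', 4), ('US', 3), ('FR', 2)]

# Aggregations can work on the compressed form without decompressing.
print(sum(count for value, count in encoded if value == "DE"))  # 4
```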


Data Warehouses (EDW vs DataMarts)

There are two fundamental types of data warehouses:
•  Enterprise Data Warehouses (EDW)
•  Data Marts

Overview of Enterprise Data Warehouse (EDW) and Data Marts

Enterprise Data Warehouse (EDW): The enterprise data warehouse is typically a large organization-wide database repository that crosses every business function and includes data from every organizational unit, division, and department. In essence, an enterprise data warehouse is a substantially large repository of the historical and current transaction data of an entire organization. As an implementation of an enterprise data warehouse is commonly strategic in nature, the volume of data in enterprise data warehouses tends to be quite large: hundreds of gigabytes, terabytes, and sometimes even petabytes of data.

Data Mart: A data mart is a collection of subject areas organized for decision support based on the needs of a given department or office.  Many times, a data mart will serve as the reporting and analytical solution for a particular department within an organization, such as accounting, sales, customer service, and/or marketing.  For the most part, data marts are designed with just enough data entities, fields, and records to satisfy one department’s requirements.

There are two kinds of data marts that exist — dependent and independent:
•  A dependent data mart is one whose source is another data warehouse, and all dependent data marts within an organization are typically fed by the same source — the enterprise data warehouse.
•  An independent data mart is one whose source is directly from transactional systems, legacy applications, or external data feeds.


Partial Data Warehouse Appliances (Software Appliances)

Starting in 2006, a new wave of vendors emerged focusing on database management systems (DBMS) purpose-built for data warehousing that easily integrate with another vendor’s hardware. The products from these “partial-solution” data warehouse appliance vendors have been coined “software appliances”. The software appliance has proved to be a good starting point for new data warehouse vendors, as it allows them to focus on their strength, database software, while partnering with commodity hardware vendors and benefiting from the hardware firms’ focus and resources.

Data Warehouse Appliance - Partial Solution Vendors

The partial solution vendor is typically a software company that has developed a proprietary database and either partners with or is a subsidiary of one or more major hardware vendors. These software vendors generally focus on the development of optimal database management software suitable for large data warehouse implementations, and commonly rely on other vendors to supply the most appropriate hardware.

Software appliance vendors with products based upon the traditional row-based database model:

Kognitio – Offers a row-based “virtual” database called WX2 that does not include indexes or data partitions and runs on blade servers from IBM and hardware from Hewlett-Packard. Kognitio can be deployed to a customer in one of three ways: as a software-only license solution, as a fully configured data warehouse appliance, or on demand via a Data Warehousing as a Service (DaaS) offering.

Software appliance vendors with products based upon the columnar database model in which data is stored in columns rather than rows:

ParAccel – The ParAccel Analytic Database is a proprietary column-based database that combines patent-pending innovations with a columnar, compressed, massively parallel, all-in-memory-capable architecture.  The database is available as a packaged appliance on Sun hardware known as Ultra-Fast Data Warehouse powered by Sun and ParAccel.
Vertica – The Vertica Analytic Database is a grid-based, column-oriented database that runs on grids of commodity Linux-based servers. It features column-oriented storage organization, compression, hybrid storage organization, and a shared nothing architecture.  In March 2011, Hewlett Packard (HP) completed an acquisition of Vertica Systems, Inc.
EXASOL – Offers a proprietary, massively parallel database management system called EXASolution. The product is column-based and makes extensive use of in-memory processing. Join indexes are automatically generated at run-time, as required.  EXASolution is available either as a software only solution or as a pre-configured appliance running on commodity hardware such as Fujitsu Siemens, HP, Supermicro or Hitachi. EXASolution uses its own version of Linux which provides extended capabilities to support its clustered architecture.
Sybase IQ – Distinguishes itself from conventional databases with its column-oriented, grid-based architecture, patented data compression, and advanced query optimizer. Sybase IQ provides significantly faster query performance than traditional database environments, which were designed primarily for processing transactions; the improvement over such solutions can be 10 to 1,000 times, even as thousands of users and years’ worth of data are added to the environment. Announced in May 2008, the Sybase Analytical Appliance combines pSeries hardware from IBM with the Sybase IQ database.

Software appliance vendors with products based upon a hybrid row and column database:

EMC Greenplum – Acquired in July 2010 by EMC, Greenplum has become the foundation of EMC’s Data Computing Division. The EMC Greenplum database features a shared-nothing, massively parallel processing (MPP) architecture that has been purpose-built for business intelligence and analytical processing.
Aster Data – Provides a database management system (DBMS) called Aster Data nCluster, a massively parallel processing (MPP) hybrid row and column database with an integrated analytics engine. Aster Data’s solution utilizes its patent-pending SQL-MapReduce to allow for parallel processing of data and applications and to deliver rich analytic insights. Aster Data products can be deployed to a customer in one of three ways: software-only, cloud-based, or as a complete appliance called the Aster Data MapReduce DW Appliance. In April 2011, Teradata completed its acquisition of Aster Data Systems, Inc.


Complete Data Warehouse Appliance Solutions

Basic Architecture of a Data Warehouse (DW) Appliance
Complete data warehouse appliances are purpose-built data warehouse solutions and systems that encompass a whole-technology stack including:
• Operating System (OS)
• Database Management System (DBMS)
• Server Hardware
• Storage Capabilities

Initially, DW appliances were created with proprietary custom-built hardware and storage units; Netezza, Teradata, DATAllegro, and White Cross (now Kognitio) were the first vendors to provide solutions in this manner. Subsequently, data warehouse appliances evolved to utilize lower-cost, industry-standard, non-proprietary hardware components. The movement from proprietary to commodity hardware has brought down the cost of the data warehouse appliance, as commodity hardware can be integrated at a lower cost than developing and integrating proprietary hardware. Examples of commodity hardware typically include general-purpose servers from Dell, Hewlett Packard (HP), or IBM utilizing Intel processors, and popular network and storage hardware from Cisco, EMC, or Sun.

Data Warehouse Appliance - Complete Solution Vendors

Introduced in 2002, Netezza was the first vendor to offer a complete data warehouse appliance, so early definitions of the appliance were based upon Netezza products. The Netezza Performance Server still provides all of the components of a data warehouse appliance, including the database, operating system, servers, and storage units. However, in 2009 Netezza replaced its own proprietary hardware with IBM blade servers and storage units, and in 2010 IBM completed a corporate acquisition of Netezza.

Similar to Netezza, DATAllegro launched in 2005 with a complete solution involving proprietary hardware. Soon after, DATAllegro replaced its own proprietary hardware with commodity servers from Dell and storage units from EMC. In 2008, Microsoft acquired DATAllegro and announced that it would integrate DATAllegro’s massively parallel processing (MPP) architecture into its own MS SQL Server platform, which also runs on commonly available hardware.

Additionally, both Kognitio and Teradata replaced the proprietary hardware within their appliances in a process similar to that of DATAllegro. Kognitio now offers a row-based, in-memory database called WX2 that does not include indexes or data partitions and runs on blade servers from IBM and Hewlett-Packard. Teradata provides a proprietary database, a variety of common operating systems (Linux, Unix, and Windows), and a proprietary networking subsystem packaged along with commodity processors and storage units.

Announced at the 2008 Oracle OpenWorld conference in San Francisco, the Oracle Exadata Database Machine is a complete package of database software, operating system, servers, and storage. The product was initially assembled in collaboration between Oracle Corporation and Hewlett Packard, with Oracle developing the database, operating system, and storage software while HP constructed the hardware. However, with Oracle’s acquisition of Sun Microsystems, Oracle announced the release of Exadata Version 2 with improved performance and usage of Sun Microsystems storage and operating system technologies.

At the Sapphire conference in May 2010 in Orlando, SAP announced the release of its new data warehouse appliance called HANA, or High-Performance Analytic Appliance. SAP HANA is a combination of hardware, storage, operating system, management software, and an in-memory data query engine, characterized by data being held in RAM rather than being read from disks or flash storage.

Finally, IBM bundles and integrates its own InfoSphere Warehouse database software (formerly “DB2 Warehouse”) with its own servers and storage to deliver the IBM InfoSphere Balanced Warehouse.


Data Warehouse Appliance: Oracle Exadata

Announced by CEO Larry Ellison at the 2008 Oracle OpenWorld conference in San Francisco, Oracle Exadata Database Machine is a complete database appliance with support for both transactional (OLTP) and analytical (OLAP) database systems. Delivered as a complete package of database software, operating system, servers, and storage, the Oracle Exadata Database Machine is simple and fast to implement and ready for large-scale business applications.

The product was initially assembled in collaboration between Oracle Corporation and Hewlett Packard (HP), with Oracle developing the database, operating system, and storage software while HP constructed the hardware. However, with Oracle’s acquisition of Sun Microsystems, Oracle announced the release of Exadata Version 2 with improved performance and usage of Sun Microsystems storage and operating system technologies. The main idea of Exadata is to make the storage database-aware and push query processing down to the disks for optimal scanning and performance. As a result, an Exadata machine can scan 1 TB of data in about 3.5 seconds by scanning several (or all) disks in parallel with Oracle’s Parallel Query technology.

Oracle Exadata Database Machine

Currently, the Oracle Exadata Database Machine provides a solution for all types of database systems, ranging from scan-intensive data warehouse applications to highly concurrent transactional applications. With its bundled combination of storage, database software, operating system, and standard hardware components from Sun, the Oracle Exadata Database Machine provides extreme performance within a highly available, highly secure environment. Additionally, Oracle’s unique clustering and workload management capabilities make the Oracle Exadata Database Machine well-suited for consolidating multiple databases onto a single, centralized environment.

Facts and Benefits of Oracle Exadata Database Machine

•  Accelerates data warehouse query performance by at least a factor of 10x.
•  Runs more queries concurrently for faster access to business-critical information.
•  Scales to 10x more concurrent users.
•  Provides a trusted highly-available and cost-effective platform.
•  Replaces and consolidates isolated special-purpose databases into one platform.
•  Allows for massively parallel processing of data with high bandwidth.
•  Easily expands with the connection of multiple units.
•  Includes a combination of the Oracle Exadata Storage Server, Oracle database software, Sun Solaris operating system (OS), and the latest industry-standard hardware components from Sun.


Data Warehouse Appliance: SAP HANA

At the Sapphire conference in May 2010, SAP announced the release of its new data warehouse appliance called HANA, or High-Performance Analytic Appliance. SAP HANA is a combination of hardware, storage, operating system, management software, and an in-memory data query engine, characterized by data being held in RAM rather than being read from disks or flash storage. Additionally, HANA has been built to split up queries to run in parallel on multiple processors, a fundamentally different architecture from SAP’s existing applications. This in-memory and parallel processing architecture allows for extremely fast performance of queries and analytics on very large amounts of data.

HANA Blade Server

The SAP HANA solution has been introduced on Hewlett Packard x86 servers (HP ProLiant DL580 G7 and DL980 G7 servers) and is built upon Intel’s multi-core servers. A single server blade can contain up to 2 TB of main memory (4 TB coming soon) and up to 64 processor cores. With this total solution, SAP claims to beat the current performance benchmark by a factor of 20, on hardware that is several dozen times cheaper, for a 200x price-performance improvement. SAP also claims that HANA either reduces or outright eliminates the need for the development and deployment of complex and expensive data marts.

SAP intends HANA systems to be well integrated with its own enterprise resource planning (ERP) systems, allowing transactional data in SAP ERP systems to be analyzed in real time. However, HANA is not dependent solely on SAP ERP systems as a data source; HANA is data-source “agnostic”, which means most common data sources and databases can be integrated with it.

According to an SAP document, the HANA platform includes a modeling environment that is simple enough for business users to work with. Additionally, client interfaces currently supported by HANA include Microsoft Excel and SAP’s BusinessObjects business intelligence software.

References:
SAP’s Transformation: A Work-In-Progress – Part One (ChainLink Research)
SAP Launches HANA for In-memory Analytics (PC World)


Basic Architecture of a Data Warehouse Appliance

By definition, a data warehouse appliance is a complete hardware and software solution that contains a fully integrated stack of processors, memory, storage, operating system, and database management software. The data warehouse appliance is typically constructed to be optimized for enterprise data warehouses, designed to handle massive amounts of data and queries, and designed to scale and grow over time. At its core, a data warehouse appliance simplifies the deployment, scaling, and management of the database and storage infrastructure as it provides a self-managing, self-tuning, plug-and-play database management system that can be scaled out in a modular manner.

Basic Architecture of a Data Warehouse (DW) Appliance

Fundamentally, a data warehouse appliance is a fully-integrated solution that contains:
– Database Management System (DMBS)
– Server Hardware
– Storage Capabilities
– Operating System (OS)

The primary factor behind the platform scalability and large-query optimization of the data warehouse appliance is its massively parallel processing (MPP) architecture. MPP architectures comprise numerous independent processors or servers that all execute in parallel. Also known as a “shared-nothing architecture”, the MPP appliance architecture is characterized by every embedded server being self-sufficient and controlling its own memory and disk operations. Further, the MPP architecture effectively distributes data amongst a number of dedicated disk storage units connected to each server in the appliance. Computations are moved as close to the data as possible, and data is logically distributed amongst numerous system nodes.

Almost all large operations in the data warehouse appliance environment, including data loads, query processing, backups, restorations, and indexing, are executed completely in parallel. This divide-and-conquer approach allows for extremely high performance and allows systems to scale linearly, as new processors can easily be added into the environment.
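A toy illustration of this divide-and-conquer idea using Python’s multiprocessing module: the data is partitioned across worker processes, each worker aggregates only its own partition (sharing nothing with the others), and the partial results are then combined.

```python
from multiprocessing import Pool

def partial_sum(partition):
    """Each worker aggregates only its own partition (shared-nothing)."""
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))

    # Divide: split the data into four partitions, one per "node".
    n = 4
    partitions = [data[i::n] for i in range(n)]

    # Conquer: aggregate each partition in parallel, then combine.
    with Pool(processes=n) as pool:
        partials = pool.map(partial_sum, partitions)
    print(sum(partials) == sum(data))  # True
```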


Slowly Changing Dimensions – Type Four Models

Type Four – Insert Into a History Table

Type four models, also known as leveraging “history tables”, are the most technically sophisticated of the four models and may be the most difficult to implement. This modeling technique provides nearly unlimited tracking of historical records while having smaller storage requirements than type two models. Rather than storing the changes in the same table, a second “history” table is created which stores only the previous values of slowly changing dimensions.

Similar to type two models, type four models accommodate infinite changes to dimensional fields and create an additional record for every change to a dimensional attribute. In contrast to type two, however, every change to an attribute generates a new record in a relatively compact history table. The history table is therefore more efficient at capturing a large amount of historical data.

Another key advantage of type four models is efficient querying against a timeframe, as the related search index only requires two fields (key and date). Other modeling techniques require more fields in the search index for date queries. Thus the search index of a type four model is smaller, more intuitive, and quicker at retrieving the relevant record in the dimension table than the search index of other modeling techniques.

Type four models do have some important disadvantages. Namely, type four models require the implementation of multiple tables, are less intuitive for query developers, require more effort to develop and maintain than other types of models, and allow history tables to grow to massive size.

Suppose that a vendor changes his phone number to 858-555-6555 from 202-555-8639 because the phone company has added a new area code. Utilizing a type four model, the vendor dimension table would be updated and a new record would be inserted into the vendor history table in the following manner…

Type Four Model - Slowly Changing Dimension

• The vendor dimension table is updated:
– The phone number is updated from 202-555-8639 to 858-555-6555.
• A new record is inserted into the vendor history table:
– The vendor key is copied from the vendor dimension table.
– The phone number 858-555-6555 is inserted.
– The effective date of 12/15/2008 is inserted.
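A minimal sketch of this type four update in Python with SQLite follows; the table and field names are illustrative assumptions. Note that, per the example above, the history table receives the new value together with its effective date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vendor_dim (vendor_key INTEGER PRIMARY KEY, phone TEXT);
    CREATE TABLE vendor_history (vendor_key INTEGER, phone TEXT,
                                 effective_date TEXT);
    INSERT INTO vendor_dim VALUES (17, '202-555-8639');
""")

def update_phone_type4(conn, vendor_key, new_phone, effective_date):
    # 1) Overwrite the current value in the dimension table.
    conn.execute("UPDATE vendor_dim SET phone = ? WHERE vendor_key = ?",
                 (new_phone, vendor_key))
    # 2) Insert the change into the history table.
    conn.execute("INSERT INTO vendor_history VALUES (?, ?, ?)",
                 (vendor_key, new_phone, effective_date))

update_phone_type4(conn, 17, '858-555-6555', '2008-12-15')
print(conn.execute("SELECT * FROM vendor_dim").fetchall())
# [(17, '858-555-6555')]
print(conn.execute("SELECT * FROM vendor_history").fetchall())
# [(17, '858-555-6555', '2008-12-15')]
```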


Slowly Changing Dimensions – Type Three Models

Type Three – Leverage Previous and Current Value Fields

Type three models are defined by adding one or more columns to the dimension table, so that the new (or current or active) and old (or historical or inactive) values of a field are stored. In addition, multiple previous values can be stored within the table, depending on how many previous-value columns are included in the dimension table.

Type three modeling is a middle state between the complete history loss of type one models and the numerous additional records of type two models. However, type three models provide only a limited view of history, as only a predefined number of previous values of any attribute can be retained, rather than a complete history. This modeling technique is fairly useful when changes to the dimension table are made on a regular schedule, such as annually, and when archival copies of the database are stored offline for historical and audit purposes. Type three modeling is not so useful when changes are more frequent and unpredictable. Multiple previous-value fields could be added to the record to provide a longer historical trail, but it may be a challenge to design the table with the optimum number of previous fields. For the most part, type three modeling makes the most sense when only a limited number of previous values need to be retained.

Suppose that a vendor changes his phone number to 858-555-6555 from 202-555-8639 because the phone company has added a new area code. Utilizing a type three model, the vendor dimension table will contain both a current phone and a previous phone field. Initially, the vendor record contains a null value for the previous phone number and a value of 202-555-8639 for the current phone number.

Type Three - Slowly Changing Dimension

In order to process this update using the previous and current fields, the vendor dimension table is updated:
–  The previous phone field is updated to 202-555-8639, which is now the vendor’s last previous phone number.
–  The current phone field is updated to 858-555-6555, which is now the vendor’s current number.
–  The effective date is updated to the date of the change.
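A minimal sketch of the type three update in Python with SQLite; table and field names are illustrative assumptions. The single UPDATE shifts the current value into the previous field before overwriting it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vendor_dim (vendor_key INTEGER PRIMARY KEY,
                             current_phone TEXT, previous_phone TEXT,
                             effective_date TEXT);
    INSERT INTO vendor_dim VALUES (17, '202-555-8639', NULL, NULL);
""")

def update_phone_type3(conn, vendor_key, new_phone, change_date):
    # Shift the current value into the previous field, then overwrite it.
    conn.execute("""
        UPDATE vendor_dim
        SET previous_phone = current_phone,
            current_phone  = ?,
            effective_date = ?
        WHERE vendor_key = ?
    """, (new_phone, change_date, vendor_key))

update_phone_type3(conn, 17, '858-555-6555', '2008-12-15')
print(conn.execute("SELECT * FROM vendor_dim").fetchall())
# [(17, '858-555-6555', '202-555-8639', '2008-12-15')]
```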
