Data Vault Data Model for EDW

The Data Vault data model provides an evolution in the way that enterprise data warehouses (EDW) are designed, constructed, and deployed. Moreover, Data Vault provides a new and powerful way to model data within an enterprise data warehouse (EDW) that is more scalable, extensible, and auditible than conventional modeling techniques. It is an ideal modeling technique for EDW that either have a large volume of data, a large number of disparate data sources, a high degree of variety in the ingested data, or a large number of diverse information consumers. Fundamentally, the Data Vault is a hybrid data modeling technique that provides historical data representation from multiple sources designed to be resilient to environmental changes.

The Data Vault model provides a method of looking at historical data that deals with issues such as auditing, tracing of data, loading speed and resilience to change as well as emphasizing the need to trace where all the data in the database came from. This means that every row in a data vault is accompanied by record source and load date attributes, enabling an auditor to trace values back to the source.

According to Dan Linstedt, creator of Data Vault data modeling, Data Vault is defined as:

The data vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. it is a hybrid approach encompassing the best of breed between 3rd normal form (3nf) and star schema. the design is flexible, scalable, consistent and adaptable to the needs of the enterprise. it is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.

Data Vault Tables

The Data Vault model is based upon three distinct table types which each table type serving a specific purpose:

•  Hub tables store a unique field for each business entity known as business key.
•  Link tables store relationships between hub tables.
•  Satellite tables store the attributes of the business key.

Hub Table:
Consists of a list of unique business keys that represent a distinct way of identifying a business element as well as fields that describe the origin of the record. Additionally hub tables cannot contain any foreign key fields. Types of fields included in the hub table are:

1) Hash Key which serves as the primary key of the hub table in a hash format.
2) Load Date which includes the date time that the record was inserted into the hub table.
3) Record Source which includes the name of the data source from where the record originated.
4) Business Key which is the unique identifier of the business entity as a either a text or number value and can be more than one field in the table.

Link Table:
Establishes relationships between business keys, contains foreign keys to hub tables, and includes fields about the relationship. A link table is therefore an intersection of business keys and contains the fields that represent the business keys from the related hub tables. The purpose of the link table is to capture and record the relationship of data elements at the lowest possible level of granularity. And a link table must have relationship with at least two hub tables. Types of fields included in link table are:

1) Hash Key which serves as the primary key of the link table in a hash format.
2) Foreign Keys which provide references to the primary hash key of related hub tables.
3) Business Keys which provide a copy of the business key value from the related hub tables.
4) Load Date which includes the date time that the record was inserted into the link table.
5) Record Source which includes the name of the data source from where the record originated.

Satellite Table:
Comprised of all the fields that describe a business entity or relationship and provide context at a given time or over a time period to either a hub or link. Satellite tables consist of foreign keys linking them to a parent hub table or link table, fields that describe the origin of the record, as well as a start and end dates fields. The structure and concept of a
satellite table is very much like a type 2 slowly changing dimension in a dimensional model. History of changes is stored within a
satellite table as well as change data capture (CDC) is conducted within a satellite table. Types of fields included in the satellite table are:

1) Parent Hash Key which provides a foreign key reference to a hub table and is one of two fields included in the primary key of the table.
2) Start Date which indicates the date and time that the satellite record starts being active and is the second of two fields included in the primary key of the table.
3) Record Source which includes the name of the data source from where the record originated.
4) Load End Date which indicates the date and time that the satellite record became inactive.
5) Extract Date which indicates the date and time that the records was extracted from the source system.
6) Hash Diff which is a hash value of all of the descriptive values of a record.
7) Descriptive Fields which are any fields that provide more detail about the entity.

Business Keys and Hash Keys

Business Key: A text value that must be a declared unique or alternate key constraint within the hub table. This means that only one distinct value can exist within the entire Hub table. The business key does not necessarily have to be just one field within the hub table. It can be a compound key made up of more than one column. Business keys are also included as non-unique fields within link tables. True business keys are not tied to any one source system and could be contained within multiple source systems. Examples of typical business keys include: account number, product code, customer number, employee id, invoice number, and order number.

Hash Key:  One of the innovations within the latest version of the Data Vault model is the replacement of standard integer primary keys or surrogate keys with hash-based primary keys. This feature of the Data Vault model enables a Data Vault solution to be deployed either on a relational data management system (RDBMS) or on Hadoop systems. Hadoop systems do not have surrogate key generators like an RDBMS, but a unique MD5 hash value can be generated in Hadoop. With a hash key being used in Hadoop and a hash key being used in an RDBMS, tables can be logically joined.

Data Vault and Information Marts

The Data Vault layer within an EDW is normally used to store data and data is never deleted from the data vault unless there is a technical error while loading data. Additionally the Data Vault layer is not optimized for business intelligence and analytical use. Dimensional modeling is much more suited for this purpose. Subsequently, information marts contain dimensional models and are the source for data analytics. In order to be used by end-users, data contained in the data vault needs to be converted to a dimensional model and moved into related information marts. Dimension tables in information marts will then be sourced from data vault hub and related satellite tables. While fact tables in information marts will be sourced from data vault link and related satellite tables. Upon data being converted into dimensional models and moved into information marts, business intelligence tools including SAP Business Objects, Cognos, OBIEE, Tableau, SSAS, Power BI, & Qlik Sense can be used by end-users to conduct analytics on the data.

Summary

The best use for the Data Vault data model is within the enterprise data warehouse (EDW) of a complete business intelligence environment. Moreover, the Data Vault model is specifically designed for this purpose. And the Data Vault model is the ideal data modeling technique for databases that store large volumes of highly volatile data, contain a high degree of data variety, and contain data from multiple disparate sources.

However, the Data Vault model is only one part of the complete Data Vault architecture which contains three layers:

1) Data Staging Area which contains a copy of source data.
2) EDW which is designed using the Data Vault model.
3) Information Marts which are designed using a dimensional model are the source of data for end-user analysis.

Share

Column and Row Based Database Storage

TableRowColumnStore_320x9999
Row-Based Database Storage:
The data sequence consists of the data fields in one table row.
Column-Based Database Storage:
The data sequence consists of the entries in one table column.

Conceptually, a database table is a two-dimensional data structure with cells organized in rows and columns. However, computer memory is organized as a linear sequence. For storing a database table in linear memory, two options can be chosen (row based storage or column based storage). Row based storage stores a sequence of records that contain the fields of one row in the table. In column based storage, the entries of a column are stored in contiguous memory locations.

Row-based database systems are designed to efficiently return data for an entire row, or record, in as few operations as possible. This matches the common use-case where the system is attempting to retrieve information about a particular object. This is particularly useful for transactional systems that conduct large amounts of inserts, updates, and deletes of records.

Column-based database systems combine all of the values of a column together, then the values of the next column, and so on. Within this layout, any one of the columns more closely matches the structure of an index in a row-based system. The goal of a columnar database is to efficiently write and read data to and from hard disk storage in order to speed  up performance of select queries. This is particularly useful for systems that conduct large amounts of analytics.

Column-oriented-database

Row-based storage is recommended for transactional systems or when:

•  The table has a small number of rows, such as configuration tables.
•  The application needs to conducts updates of single records.
•  The application typically needs to access the complete record.
•  The columns contain mainly distinct values so the compression rate would be low.
•  Aggregations and fast searching are not required.

Column-based storage is recommended for analytical systems or when:

•  Calculations are executed on a single column or a few columns only.
•  The table is searched based on the values of a few columns.
•  The table has a large number of columns.
•  The table has a large number of records.
•  Aggregations and fast searching on large tables are required.
•  Columns contain only a few distinct values, resulting in higher compression rates.

Share

MS SQL Server Indexed View Defined

For many years and versions, the MS SQL Server database platform has supported the ability to create virtual tables known as views.  Views in SQL Server can be defined as a stored select statement on a base table.  Data in a view is not stored in the database and the result set is determined while the view is executed. Traditionally, views provide a security mechanism that restricts users to a certain subset of data in one or more base tables or provide a mechanism that allow developers to customize how users can logically view the data stored in base tables.  Prior to the addition of indexed views in SQL Server, views never contained indexes of their own.

Starting with Microsoft SQL Server 2000, indexed views are views whose results are persisted in the database (similar to tables) and contain database indexes for fast retrieval of data.  A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space.  Indexed views work best when the underlying data is infrequently updated. The maintenance of an indexed view can be higher than the cost of maintaining a table index. If the underlying data is updated frequently, then the cost of maintaining the indexed view data may outweigh the performance benefits of using the indexed view.

The main benefit of an indexed view is query optimization and faster query performance. Because the indexed view physically exists on disk, there can be substantial savings in response time to queries that involve the indexed view. Additionally data within the  indexed view can store joins between tables, can store a subset of fields and/or records that exist within a table, can store calculations of table fields, and can store aggregations or groupings of data.  Thus less data is retrieved in a query of  the the indexed table than the  query of the base tables.  And finally, indexed views utilize performance-enhancing indexes corresponding to its own data where ordinary views only utilize indexes related to the indexes in its base table.

The main costs of an indexed view are additional maintenance of the indexes, additional disk space taken up by both the indexed view and related indexes, additional overhead of maintaining the view on disk as the base tables are modified. One must weigh the cost to maintain the index against the benefit of query optimization provided by the index. When the underlying tables are subject to significant inserts, updates, and deletes, be very careful in selecting the indexes (both table and view) that will provide the greatest coverage across your queries for the lowest cost.

Typically, environments that are best suited for indexed views are analytical data repositories such as data warehouses and data marts. Analytical data repositories require rapid query speeds and the underlying data changes infrequently.

The syntax to create a Indexed View in SQL Server is:

CREATE VIEW IndexView
WITH SCHEMABINDING
    AS SELECT Au_Fname + ‘ ‘ + Au_Lname, Au_id FROM Authors WHERE Contract = 1 GO
CREATE UNIQUE CLUSTERED Index MyINDEX ON IndexView (au_id) Go

Share

Oracle’s Materialized View Defined

materialized view within an Oracle database is an object that extends the functionality of an ordinary view and contains the results of a query (SQL select statement) in a persistent or physical format. Materialized views can use either tables, views, and other materialized views as its data source and can be both indexed and/or refreshed.

Within an Oracle database, a view is a virtual table representing the result of a database query or select command with a table or another view as its source.  Whenever a select query or an update addresses a view, the database converts these statements into select queries or updates against the source tables of the view’s select query.  Thus a view is just a stored query or virtual table with no storage of physical data, and the physical data is stored in a table.  However, a materialized view takes a different approach in which the query result is cached as a concrete table that can be refreshed from the source tables of the select query. This enables better query performance, at the cost of some data being potentially out-of-date.

Because a materialized view is manifested as a physical table, anything that can be performed to a real table can be done to the materialized view. With the most important thing that can be performed is the ability to build indexes on any field, enabling drastic speedups in query performance time. In a view, it’s typically only possible to utilize indexes on columns that come directly from (or have a mapping to) indexed columns in the source table.  In addition, a materialized view can contain denormalized data that physically exists in multiple tables, a subset of fields and records that exist in physical tables, pre-calculation of table fields, and data pre-aggregated at summary levels.

Materialized views are most useful in analytical data repositories such data warehouses or data marts, where query performance is important and data refreshes and updates occur infrequently. Materialized views are one of the single most important database tuning tools that exist allowing you to pre-join complex table and views and pre-compute summaries for fast response time of queries.

Create Materialized View Syntax

Create Materialized View DeptCheckMV
Start with Sysdate
As
Select * From EmployeeTbl
Where Salary > (Select avg (salary) from EmployeeTbl

Share

Dimensional Modeling – Conformed Dimensions

Conformed DimensionsIn dimensional modeling, conformed dimensions are common and repeatedly used dimensions that relate and are joined to multiple facts. The conformed dimension is consistent across an enterprise and has the same meaning to every fact with which it relates. Typically, a conformed dimension exists as a single dimension table and relates to multiple fact tables within the same data warehouse. But the conformed dimension can also exist as identical dimension tables in separate data marts. In a nutshell, conformed dimensions allow facts and measures to be categorized and described in the same way across dimensional stars and/or data marts, ensuring consistent reporting across an enterprise.

Properties of Conformed Dimensions:
•  Dimensions that are joined to multiple fact tables
•  Exist in multiple star schemas and/or data marts
•  Same meaning with every fact table join
•  Allow various fact tables to be used in the same query
•  Same structure and attributes in every model
Common Conformed Dimensions:
•  Customer
•  Product
•  Date/Time
•  Employee
•  Account
•  Region or Territory
•  Vendor

Share

Key Benefits of a Data Warehouse

Cool Data Warehouse

Data Warehouses are centralized data repositories that integrate data from various transactional, legacy, or external systems, applications, and sources. The data warehouse provides an environment separate from the operational systems and is completely designed for decision-support, analytical-reporting, ad-hoc queries, and data mining. This isolation and optimization enables queries to be performed without any impact on the systems that support the business’ primary transactions (i.e transactional and operational systems). Read more

Share

Three Steps in ETL Processing

Steps within ETL Processing

Steps within ETL Processing

The term ETL which stands for extraction, transformation, & loading is a batch or scheduled data integration processes that includes extracting data from their operational or external data sources, transforming the data into an appropriate format, and loading the data into a data warehouse repository. ETL enables physical movement of data from source to target data repository. The first step, extraction, is to collect or grab data from from its source(s).  The second step, transformation, is to convert, reformat, cleanse data into format that can be used be the target database.  Finally the last step, loading, is import the transformed data into a target database, data warehouse, or a data mart.

Step 1 – Extraction
The extraction step of an ETL process involves connecting to the source systems, and both selecting and collecting the necessary data needed for analytical processing within the data warehouse or data mart. Usually data is consolidated from numerous, disparate source systems that may store the date in a different format.  Thus the extraction process must convert the data into a format suitable for transformation processing. The complexity of the extraction process may vary and it depends on the type and amount of source data.

Step 2 – Transformation
The transformation step of an ETL process involves execution of a series of rules or functions to the extracted data to convert it to standard format. It includes validation of records and their rejection if they are not acceptable. The amount of manipulation needed for transformation process depends on the data. Good data sources will require little transformation, whereas others may require one or more transformation techniques to to meet the business and technical requirements of the target database or the data warehouse. The most common processes used for transformation are conversion, clearing the duplicates, standardizing, filtering, sorting, translating and looking up or verifying if the data sources are inconsistent.

Step 3 – Loading
The load is the last step of ETL process involves importing extracted and transformed data into a target database or data warehouse. Some load processes physically insert each record as a new row into the table of the target warehouse utilizing a SQL insert statement.  Whereas other load processes include a massive bulk insert of data utilizing a bulk load routine.  The SQL insert is a slower routine for imports of data, but does allow for integrity checking with every record. The bulk load routine may be faster for loads of large amounts of data, but does not allow for integrity check upon load of each individual record.

Read more

Share

Data Warehouses (EDW vs DataMarts)

There are two fundamental types of data warehouses:
•  Enterprise Data Warehouses (EDW)
•  Data Marts

Overview of Enterprise Data Warehouse (EDW) and Data Marts

Overview of Enterprise Data Warehouse (EDW) and Data Marts

Enterprise Data Warehouse (EDW): The enterprise data warehouse is typically a large organization-wide database repository that crosses over every business function and includes data from every organizational unit, division, and department.  In essence, an enterprise data warehouse is a substantially large repository of historical and current transaction data of an entire organization. As an implementation of an enterprise data warehouse is commonly strategic in nature, the volume of data in enterprise data warehouses tends to be quite large.  Enterprise data warehouses can contain hundreds of gigabytes, terabytes, and sometimes even petabytes of data.

Data Mart: A data mart is a collection of subject areas organized for decision support based on the needs of a given department or office.  Many times, a data mart will serve as the reporting and analytical solution for a particular department within an organization, such as accounting, sales, customer service, and/or marketing.  For the most part, data marts are designed with just enough data entities, fields, and records to satisfy one department’s requirements.

There are two kinds of data marts that exist — dependent and independent:
•  A dependent data mart is one whose source is another data warehouse, and all dependent data marts within an organization are typically fed by the same source — the enterprise data warehouse.
•  An independent data mart is one whose source is directly from transactional systems, legacy applications, or external data feeds.

Share

General Data Appliance / Complimentary Database Engine – Datatupia

Data Warehouse Appliance - Complimentary Vendor
A completely unique data warehouse appliance is provided by Datatupia.  Rather than providing a solution that provides an organization a complete data warehouse solution, Datatupia’s solution compliments existing data warehouse deployments within an organization.

Dataupia’s flagship product, Satori Server

Datatupia is a data warehouse appliance in the sense that it combines software and hardware components on a rack-mounted server hardware.  Additionally, Satori Server includes an embedded copy of the Linux operating system, a custom database engine, an aggregation engine, built-in storage, and parallel processors.  But in contrast to full data warehouse appliances that are purpose-built for data warehousing, Satori Server interoperates with existing database management systems rather than requiring their replacement.

The main feature of Satori Server that makes it unique amongst all data warehouse appliances is “omniversal transparency“.  This feature shields applications and users from changes to the underlying technology and database platform by complementing Oracle, MS SQL Server, or DB2 databases and their dependent applications. Unlike other solutions that displace existing technologies, the Satori Server’s innovative design extends and enhances an existing information eco-system.  Further, Datatupia’s Satori Server is usually added to an existing data warehouse infrastructure by augmenting an organization’s existing data management systems as a query accelerator or capacity extender.

Share

Partial Data Warehouse Appliances (Software Appliances)

Starting in 2006, a new wave of vendors emerged that focus on database management systems (DBMS) purpose-built for data warehousing that easily integrate with another vendor’s hardware. The products from these “partial-solution” data warehouse appliance vendors have been coined as software appliances”. The software appliance has proved to be a good starting point for new data warehouse vendors. It allows them to focus on their strength, database software, and to partner with commodity hardware vendors and benefit from the hardware firms’ focus and resources.

Data Warehouse Appliance - Partial Solution Vendors

The partial solution vendor is typically a software company that has developed a proprietary database and either partners with or is a subsidiary of one or more major hardware vendors. These software vendors generally focus on the development of an optimal database management software that is suitable for large data warehouse implementations, and commonly rely on other vendors to supply the most-appropriate hardware.

Software appliance vendors with products based upon the traditional row-based database model:

Kognitio – Offers a row-based “virtual” database called WX2 that does not include indexes or data partitions and runs on blade servers from IBM and hardware from Hewlett-Packard.  Kognitio can be deployed to a customer in one of three ways:  as a software-only license solution, as a fully configured data warehouse appliance, and on-demand via a Data Warehousing as a Service (DaaS) offering.

Software appliance vendors with products based upon the columnar database model in which data is stored in columns rather than rows:

ParAccel – The ParAccel Analytic Database is a proprietary column-based database that combines patent-pending innovations with a columnar, compressed, massively parallel, all-in-memory-capable architecture.  The database is available as a packaged appliance on Sun hardware known as Ultra-Fast Data Warehouse powered by Sun and ParAccel.
Vertica – The Vertica Analytic Database is a grid-based, column-oriented database that runs on grids of commodity Linux-based servers. It features column-oriented storage organization, compression, hybrid storage organization, and a shared nothing architecture.  In March 2011, Hewlett Packard (HP) completed an acquisition of Vertica Systems, Inc.
EXASOL – Offers a proprietary, massively parallel database management system called EXASolution. The product is column-based and makes extensive use of in-memory processing. Join indexes are automatically generated at run-time, as required.  EXASolution is available either as a software only solution or as a pre-configured appliance running on commodity hardware such as Fujitsu Siemens, HP, Supermicro or Hitachi. EXASolution uses its own version of Linux which provides extended capabilities to support its clustered architecture.
Sybase IQ – Distinguishes itself from conventional databases with its column-oriented, grid-based architecture, patented data compression, and advanced query optimizer.  Sybase IQ provides significantly faster query performance than traditional database environments which were designed primarily for processing transactions. The improvement over such solutions can be 10 to 1000 times greater, even as thousands of users and years worth of data are added to the environment.  Announced in May 2008, Sybase Analytical Appliance combines pSeries hardware from IBM with the Sybase IQ database.

Software appliance vendors with products based upon a hybrid row and column database:

EMC Greenplum – Acquired in July 2010 by EMC, Greenplum has become the foundation of EMC’s Data Computing Division. Moreover, the EMC Greenplum database contains a shared-nothing, massively parallel processing (MPP) architecture that has been purpose-built for business intelligence and analytical processing.
Aster Data – Provides a database management system (DBMS) called Aster Data nCluster, which is a massively parallel (MPP) hybrid row and column database with an integrated analytics engine.  Aster Data’s solution utilizes Aster Data’s patent-pending SQL-MapReduce to allow for parallel processing of data and applications and to deliver rich, analytic insights.  Aster Data products can be deployed to a customer in one of three ways:  Software-only, Cloud-based, and Complete-appliance called Aster Data MapReduce DW Appliance.  In April 2011, Teradata completed its acquisition of Aster Data Systems, Inc.

Share