Archive for "Data Warehousing"

Column and Row Based Database Storage

Business Intelligence, Data Warehousing Jul 01, 2014 No Comments
Row-Based Database Storage:
The data sequence consists of the data fields in one table row.
Column-Based Database Storage:
The data sequence consists of the entries in one table column.

Conceptually, a database table is a two-dimensional data structure with cells organized in rows and columns. Computer memory, however, is organized as a linear sequence, so there are two options for storing a database table in it: row-based storage and column-based storage. Row-based storage stores a sequence of records, each containing the fields of one table row. In column-based storage, the entries of a column are stored in contiguous memory locations.

Row-based database systems are designed to efficiently return data for an entire row, or record, in as few operations as possible. This matches the common use case where the system retrieves information about a particular object. Row-based storage is particularly useful for transactional systems that perform large volumes of inserts, updates, and deletes of records.

Column-based database systems store all of the values of one column together, then the values of the next column, and so on. Within this layout, any single column closely matches the structure of an index in a row-based system. The goal of a columnar database is to efficiently write and read data to and from disk storage in order to speed up the performance of select queries. This is particularly useful for systems that perform large amounts of analytics.
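As a concrete illustration, here is a minimal, hypothetical sketch using SQL Server 2014's clustered columnstore indexes (table and index names are illustrative), which physically store a table column by column for exactly these analytical workloads:

CREATE TABLE SalesFact (
    SaleID    INT           NOT NULL,
    ProductID INT           NOT NULL,
    SaleDate  DATE          NOT NULL,
    Amount    DECIMAL(10,2) NOT NULL
);
-- Store the table column-wise; aggregations that scan one column
-- (e.g., SUM(Amount)) read far less data than a row-wise scan.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_SalesFact ON SalesFact;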


Row-based storage is recommended for transactional systems or when:

•  The table has a small number of rows, such as configuration tables.
•  The application needs to conduct updates of single records.
•  The application typically needs to access the complete record.
•  The columns contain mainly distinct values so the compression rate would be low.
•  Aggregations and fast searching are not required.

Column-based storage is recommended for analytical systems or when:

•  Calculations are executed on a single column or a few columns only.
•  The table is searched based on the values of a few columns.
•  The table has a large number of columns.
•  The table has a large number of records.
•  Aggregations and fast searching on large tables are required.
•  Columns contain only a few distinct values, resulting in higher compression rates.


MS SQL Server Indexed View Defined

Data Warehousing Sep 20, 2012 No Comments

For many years and versions, the MS SQL Server database platform has supported the ability to create virtual tables known as views.  A view in SQL Server is defined as a stored select statement on one or more base tables.  Data in a view is not stored in the database; the result set is determined when the view is queried. Traditionally, views provide a security mechanism that restricts users to a certain subset of data in one or more base tables, or a mechanism that allows developers to customize how users logically view the data stored in base tables.  Prior to the addition of indexed views in SQL Server, views never contained indexes of their own.

Starting with Microsoft SQL Server 2000, indexed views are views whose results are persisted in the database (similar to tables) and that contain database indexes for fast retrieval of data.  A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space.  Indexed views work best when the underlying data is infrequently updated. The cost of maintaining an indexed view can be higher than the cost of maintaining a table index, and if the underlying data is updated frequently, that maintenance cost may outweigh the performance benefits of using the indexed view.

The main benefit of an indexed view is query optimization and faster query performance. Because the indexed view physically exists on disk, there can be substantial savings in response time for queries that involve the indexed view. Additionally, an indexed view can store joins between tables, a subset of the fields and/or records that exist within a table, calculations of table fields, and aggregations or groupings of data.  Thus a query of the indexed view retrieves less data than the equivalent query of the base tables.  And finally, indexed views utilize performance-enhancing indexes built on their own data, whereas ordinary views only utilize the indexes of their base tables.

The main costs of an indexed view are the additional maintenance of its indexes, the additional disk space taken up by both the indexed view and its related indexes, and the additional overhead of maintaining the view on disk as the base tables are modified. One must weigh the cost of maintaining the index against the benefit of query optimization it provides. When the underlying tables are subject to significant inserts, updates, and deletes, be very careful in selecting the indexes (both table and view) that will provide the greatest coverage across your queries for the lowest cost.

Typically, the environments best suited for indexed views are analytical data repositories such as data warehouses and data marts, which require rapid query speeds and whose underlying data changes infrequently.

The syntax to create an indexed view in SQL Server is:

CREATE VIEW dbo.IndexView
WITH SCHEMABINDING
AS SELECT Au_Fname + ' ' + Au_Lname AS Au_FullName, Au_id
   FROM dbo.Authors WHERE Contract = 1
GO
CREATE UNIQUE CLUSTERED INDEX MyIndex ON dbo.IndexView (Au_id)
GO
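
As a usage sketch (the sample key value is illustrative): on SQL Server editions that do not consider indexed views automatically, the NOEXPAND table hint directs the optimizer to use the view's own index rather than expanding the view to its base tables:

SELECT Au_FullName
FROM dbo.IndexView WITH (NOEXPAND)  -- use the view's clustered index directly
WHERE Au_id = '172-32-1176';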


Oracle’s Materialized View Defined

Data Warehousing Sep 18, 2012 No Comments

A materialized view within an Oracle database is an object that extends the functionality of an ordinary view and contains the results of a query (SQL select statement) in a persistent or physical format. Materialized views can use tables, views, or other materialized views as their data sources and can be indexed and/or refreshed.

Within an Oracle database, a view is a virtual table representing the result of a database query or select command with a table or another view as its source.  Whenever a select query or an update addresses a view, the database converts these statements into select queries or updates against the source tables of the view’s select query.  Thus a view is just a stored query or virtual table with no storage of physical data, and the physical data is stored in a table.  However, a materialized view takes a different approach in which the query result is cached as a concrete table that can be refreshed from the source tables of the select query. This enables better query performance, at the cost of some data being potentially out-of-date.

Because a materialized view is manifested as a physical table, anything that can be done to a real table can be done to the materialized view. Most importantly, indexes can be built on any field, enabling drastic speedups in query response time. In a view, it’s typically only possible to utilize indexes on columns that come directly from (or have a mapping to) indexed columns in the source table.  In addition, a materialized view can contain denormalized data that physically exists in multiple tables, a subset of the fields and records that exist in physical tables, pre-calculations of table fields, and data pre-aggregated at summary levels.

Materialized views are most useful in analytical data repositories such as data warehouses or data marts, where query performance is important and data refreshes and updates occur infrequently. Materialized views are one of the most important database tuning tools available, allowing you to pre-join complex tables and views and pre-compute summaries for fast query response times.

Create Materialized View Syntax

Create Materialized View DeptCheckMV
Refresh Complete Start With Sysdate Next Sysdate + 1
As
Select * From EmployeeTbl
Where Salary > (Select Avg(Salary) From EmployeeTbl);
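
As a usage sketch, the materialized view can also be refreshed on demand with Oracle’s DBMS_MVIEW package; the 'C' argument requests a complete refresh:

-- On-demand complete refresh of the materialized view defined above
EXEC DBMS_MVIEW.REFRESH('DeptCheckMV', 'C');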


Dimensional Modeling – Conformed Dimensions

Data Warehousing Nov 18, 2011 No Comments

In dimensional modeling, conformed dimensions are common, repeatedly used dimensions that relate and are joined to multiple facts. A conformed dimension is consistent across an enterprise and has the same meaning for every fact with which it relates. Typically, a conformed dimension exists as a single dimension table and relates to multiple fact tables within the same data warehouse, but a conformed dimension can also exist as identical dimension tables in separate data marts. In a nutshell, conformed dimensions allow facts and measures to be categorized and described in the same way across dimensional stars and/or data marts, ensuring consistent reporting across an enterprise (see the sketch after the lists below).

Properties of Conformed Dimensions:
•  Dimensions that are joined to multiple fact tables
•  Exist in multiple star schemas and/or data marts
•  Same meaning with every fact table join
•  Allow various fact tables to be used in the same query
•  Same structure and attributes in every model

Common Conformed Dimensions:
•  Customer
•  Product
•  Date/Time
•  Employee
•  Account
•  Region or Territory
•  Vendor
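
To make the idea concrete, here is a minimal sketch (all table and column names are hypothetical) in which a conformed Date dimension joins one fact table; because the identical DimDate also joins other fact tables, results from separate stars line up on the same attribute values:

-- Sales summarized over the conformed Date dimension
SELECT d.CalendarMonth, SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimDate d ON f.DateKey = d.DateKey
GROUP BY d.CalendarMonth;
-- The same DimDate (same keys, same attributes) also joins FactShipments,
-- so a matching query there groups on identical CalendarMonth values.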


Key Benefits of a Data Warehouse

Business Intelligence, Data Warehousing, Front Page, Posts Jul 28, 2011 No Comments


Data warehouses are centralized data repositories that integrate data from various transactional, legacy, or external systems, applications, and sources. The data warehouse provides an environment separate from the operational systems and is designed entirely for decision support, analytical reporting, ad-hoc queries, and data mining. This isolation and optimization enables queries to be performed without any impact on the systems that support the business’ primary transactions (i.e., transactional and operational systems). Read more »


Three Steps in ETL Processing

Data Warehousing, Posts May 27, 2011 No Comments
Steps within ETL Processing

The term ETL, which stands for extraction, transformation, and loading, refers to a batch or scheduled data integration process that extracts data from operational or external data sources, transforms the data into an appropriate format, and loads the data into a data warehouse repository. ETL enables the physical movement of data from source to target data repository. The first step, extraction, collects data from its source(s).  The second step, transformation, converts, reformats, and cleanses the data into a format that can be used by the target database.  Finally, the last step, loading, imports the transformed data into a target database, data warehouse, or data mart.

Step 1 – Extraction
The extraction step of an ETL process involves connecting to the source systems and both selecting and collecting the data needed for analytical processing within the data warehouse or data mart. Usually data is consolidated from numerous, disparate source systems that may store the data in different formats, so the extraction process must convert the data into a format suitable for transformation processing. The complexity of the extraction process varies with the type and amount of source data.
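
For illustration, a minimal, hypothetical T-SQL extraction sketch that pulls only newly modified rows from a source system into a staging table (all object names are illustrative):

DECLARE @LastExtractTime DATETIME = '2011-05-26T00:00:00';  -- end of prior extract window

INSERT INTO stg.Orders (OrderID, CustomerName, OrderDateText)
SELECT OrderID, CustomerName, OrderDateText
FROM SourceERP.dbo.Orders               -- source system, e.g., via a linked server
WHERE ModifiedDate > @LastExtractTime;  -- incremental extract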

Step 2 – Transformation
The transformation step of an ETL process involves applying a series of rules or functions to the extracted data to convert it to a standard format. It includes validation of records and rejection of records that are not acceptable. The amount of manipulation needed in the transformation process depends on the data: clean data sources may require little transformation, whereas others may require one or more transformation techniques to meet the business and technical requirements of the target database or data warehouse. The most common transformation processes are conversion, de-duplication, standardization, filtering, sorting, translation, and lookups to verify inconsistent data sources.
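
A hypothetical T-SQL sketch of a few of these transformations (standardization, conversion, de-duplication, and filtering) applied to the staged rows from the previous step:

SELECT DISTINCT                                             -- remove duplicate rows
       UPPER(LTRIM(RTRIM(CustomerName))) AS CustomerName,   -- trim and standardize case
       CONVERT(DATE, OrderDateText)      AS OrderDate       -- convert text to a typed date
FROM stg.Orders
WHERE ISDATE(OrderDateText) = 1;                            -- filter out unparseable dates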

Step 3 – Loading
The load, the last step of the ETL process, involves importing the extracted and transformed data into a target database or data warehouse. Some load processes physically insert each record as a new row into a table of the target warehouse using a SQL insert statement, whereas other load processes perform a massive bulk insert of data using a bulk load routine.  The SQL insert is the slower routine for importing data, but it allows integrity checking for every record.  The bulk load routine is faster for loading large amounts of data, but it does not allow integrity checks on each individual record.
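
A hedged T-SQL sketch contrasting the two load styles (object names and the file path are illustrative):

-- Row-by-row insert: slower, but integrity is checked for every record.
INSERT INTO dw.FactSales (SaleID, ProductID, Amount)
VALUES (1001, 42, 19.99);

-- Bulk load: much faster for large volumes, without per-record checks.
BULK INSERT dw.FactSales
FROM 'C:\staging\sales_extract.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);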

Read more »


Data Warehouses (EDW vs DataMarts)

Data Warehousing, Posts May 19, 2011 2 Comments
There are two fundamental types of data warehouses:
•  Enterprise Data Warehouses (EDW)
•  Data Marts
Overview of Enterprise Data Warehouse (EDW) and Data Marts

Enterprise Data Warehouse (EDW): The enterprise data warehouse is typically a large organization-wide database repository that crosses over every business function and includes data from every organizational unit, division, and department.  In essence, an enterprise data warehouse is a substantially large repository of historical and current transaction data of an entire organization. As an implementation of an enterprise data warehouse is commonly strategic in nature, the volume of data in enterprise data warehouses tends to be quite large.  Enterprise data warehouses can contain hundreds of gigabytes, terabytes, and sometimes even petabytes of data.

Data Mart: A data mart is a collection of subject areas organized for decision support based on the needs of a given department or office.  Many times, a data mart will serve as the reporting and analytical solution for a particular department within an organization, such as accounting, sales, customer service, and/or marketing.  For the most part, data marts are designed with just enough data entities, fields, and records to satisfy one department’s requirements.

There are two kinds of data marts that exist — dependent and independent:
• A dependent data mart is one whose source is another data warehouse, and all dependent data marts within an organization are typically fed by the same source — the enterprise data warehouse (a load sketch follows this list).
• An independent data mart is one whose source is directly from transactional systems, legacy applications, or external data feeds.
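
As a minimal sketch (all names hypothetical), a dependent data mart’s load reads only from the enterprise data warehouse:

-- A dependent data mart is populated solely from the EDW.
INSERT INTO SalesMart.dbo.FactSales (DateKey, ProductKey, SalesAmount)
SELECT DateKey, ProductKey, SalesAmount
FROM EDW.dbo.FactSalesTransactions
WHERE DepartmentCode = 'SALES';   -- just the subset this mart needs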

Read more »


General Data Appliance / Complementary Database Engine – Dataupia

Data Warehouse Appliance, Posts May 10, 2011 No Comments
A unique data warehouse appliance is provided by Dataupia.  Rather than offering an organization a complete data warehouse solution, Dataupia’s product complements existing data warehouse deployments within an organization.

Dataupia’s flagship product, Satori Server, is a data warehouse appliance in the sense that it combines software and hardware components on rack-mounted server hardware.  Satori Server includes an embedded copy of the Linux operating system, a custom database engine, an aggregation engine, built-in storage, and parallel processors.  But in contrast to full data warehouse appliances that are purpose-built for data warehousing, Satori Server interoperates with existing database management systems rather than requiring their replacement.

The main feature of Satori Server that makes it unique among data warehouse appliances is “omniversal transparency”.  This feature shields applications and users from changes to the underlying technology and database platform by complementing Oracle, MS SQL Server, or DB2 databases and their dependent applications. Unlike other solutions that displace existing technologies, the Satori Server’s design extends and enhances an existing information ecosystem.  Further, Dataupia’s Satori Server is usually added to an existing data warehouse infrastructure to augment an organization’s existing data management systems as a query accelerator or capacity extender.

Read more »


Partial Data Warehouse Appliances (Software Appliances)

Data Warehouse Appliance, Posts May 01, 2011 1 Comment

Starting in 2006, a new wave of vendors emerged that focus on database management systems (DBMS) purpose-built for data warehousing that integrate easily with another vendor’s hardware. The products from these “partial-solution” data warehouse appliance vendors have been coined “software appliances”. The software appliance has proved to be a good starting point for new data warehouse vendors: it allows them to focus on their strength, database software, while partnering with commodity hardware vendors and benefiting from the hardware firms’ focus and resources.

Data Warehouse Appliance - Partial Solution Vendors

The partial-solution vendor is typically a software company that has developed a proprietary database and either partners with or is a subsidiary of one or more major hardware vendors. These software vendors generally focus on developing optimal database management software suitable for large data warehouse implementations, and commonly rely on other vendors to supply the most appropriate hardware.


Software appliance vendors with products based upon the traditional row-based database model:

Kognitio – Offers a row-based “virtual” database called WX2 that does not include indexes or data partitions and runs on blade servers from IBM and hardware from Hewlett-Packard.  Kognitio can be deployed to a customer in one of three ways:  as a software-only license solution, as a fully configured data warehouse appliance, or on-demand via a Data Warehousing as a Service (DaaS) offering.


Software appliance vendors with products based upon the columnar database model in which data is stored in columns rather than rows:

ParAccel – The ParAccel Analytic Database is a proprietary column-based database that combines patent-pending innovations with a columnar, compressed, massively parallel, all-in-memory-capable architecture.  The database is available as a packaged appliance on Sun hardware known as Ultra-Fast Data Warehouse powered by Sun and ParAccel.
Vertica – The Vertica Analytic Database is a grid-based, column-oriented database that runs on grids of commodity Linux-based servers. It features column-oriented storage organization, compression, hybrid storage organization, and a shared nothing architecture.  In March 2011, Hewlett Packard (HP) completed an acquisition of Vertica Systems, Inc.
EXASOL – Offers a proprietary, massively parallel database management system called EXASolution. The product is column-based and makes extensive use of in-memory processing. Join indexes are automatically generated at run-time, as required.  EXASolution is available either as a software only solution or as a pre-configured appliance running on commodity hardware such as Fujitsu Siemens, HP, Supermicro or Hitachi. EXASolution uses its own version of Linux which provides extended capabilities to support its clustered architecture.
Sybase IQ – Distinguishes itself from conventional databases with its column-oriented, grid-based architecture, patented data compression, and advanced query optimizer.  Sybase IQ provides significantly faster query performance than traditional database environments, which were designed primarily for processing transactions; the improvement over such solutions can be 10 to 1000 times, even as thousands of users and years’ worth of data are added to the environment.  Announced in May 2008, the Sybase Analytical Appliance combines pSeries hardware from IBM with the Sybase IQ database.


Software appliance vendors with products based upon a hybrid row and column database:

EMC Greenplum – Acquired in July 2010 by EMC, Greenplum has become the foundation of EMC’s Data Computing Division. The EMC Greenplum database uses a shared-nothing, massively parallel processing (MPP) architecture that has been purpose-built for business intelligence and analytical processing.
Aster Data – Provides a database management system (DBMS) called Aster Data nCluster, which is a massively parallel (MPP) hybrid row and column database with an integrated analytics engine.  Aster Data’s solution utilizes Aster Data’s patent-pending SQL-MapReduce to allow for parallel processing of data and applications and to deliver rich, analytic insights.  Aster Data products can be deployed to a customer in one of three ways:  Software-only, Cloud-based, and Complete-appliance called Aster Data MapReduce DW Appliance.  In April 2011, Teradata completed its acquisition of Aster Data Systems, Inc.

Read more »


Complete Data Warehouse Appliance Solutions

Data Warehouse Appliance, Posts Apr 25, 2011 No Comments
Basic Architecture of a Data Warehouse (DW) Appliance
Complete data warehouse appliances are purpose-built data warehouse solutions and systems that encompass a whole-technology stack including:
• Operating System (OS)
• Database Management System (DBMS)
• Server Hardware
• Storage Capabilities

Initially, DW appliances were created with proprietary, custom-built hardware and storage units. Netezza, Teradata, DATAllegro, and White Cross (now Kognitio) were the first vendors to provide solutions in this manner. Subsequently, data warehouse appliances evolved to utilize lower-cost, industry-standard, non-proprietary hardware components. The movement from proprietary to commodity hardware has brought down the cost of the data warehouse appliance, as commodity hardware can be integrated at a lower cost than developing and integrating proprietary hardware. Examples of commodity hardware typically include general-purpose servers from Dell, Hewlett-Packard (HP), or IBM utilizing Intel processors, along with popular network and storage hardware from Cisco, EMC, or Sun.

Introduced in 2002, Netezza was the first vendor to offer a complete data warehouse appliance, so early definitions of the appliance were based upon Netezza products.  The Netezza Performance Server still provides all of the components of a data warehouse appliance, including the database, operating system, servers, and storage units.  However, in 2009, Netezza replaced its own proprietary hardware with IBM blade servers and storage units, and in 2010, IBM completed a corporate acquisition of Netezza.

Similar to Netezza, DATAllegro launched in 2005 with a complete solution involving proprietary hardware. Soon after, DATAllegro replaced its own proprietary hardware with commodity servers from Dell and storage units from EMC. In 2008, Microsoft acquired DATAllegro and announced it would integrate DATAllegro’s massively parallel processing (MPP) architecture into its own MS SQL Server platform, which also runs on commonly available hardware.

Additionally, both Kognitio and Teradata replaced the proprietary hardware within their appliances in a process similar to that of DATAllegro. Kognitio now offers a row-based, in-memory database called WX2 that does not include indexes or data partitions and runs on blade servers from IBM and Hewlett-Packard. Teradata provides a proprietary database, a variety of common operating systems (Linux, Unix, and Windows), and a proprietary networking subsystem packaged along with commodity processors and storage units.

Announced at the 2008 Oracle OpenWorld conference in San Francisco, the Oracle Exadata Database Machine is a complete package of database software, operating system, servers, and storage. The product was initially assembled in collaboration between Oracle Corporation and Hewlett-Packard, where Oracle developed the database, operating system, and storage software while HP constructed the hardware. However, with its acquisition of Sun Microsystems, Oracle announced the release of Exadata Version 2, with improved performance and usage of Sun Microsystems storage and operating system technologies.

At the Sapphire conference in May 2010 in Orlando, SAP announced the release of its new data warehouse appliance called HANA, or High-Performance Analytic Appliance. SAP HANA is a combination of hardware, storage, operating system, management software, and an in-memory data query engine, characterized by data being held in RAM rather than being read from disks or flash storage.

Finally, IBM bundles and integrates its own InfoSphere Warehouse database software (formerly “DB2 Warehouse”) with its own servers and storage to deliver the IBM InfoSphere Balanced Warehouse.

Read more »
