Posts

Column and Row Based Database Storage

TableRowColumnStore_320x9999
Row-Based Database Storage:
The data sequence consists of the data fields in one table row.
Column-Based Database Storage:
The data sequence consists of the entries in one table column.

Conceptually, a database table is a two-dimensional data structure with cells organized in rows and columns. However, computer memory is organized as a linear sequence. For storing a database table in linear memory, two options can be chosen (row based storage or column based storage). Row based storage stores a sequence of records that contain the fields of one row in the table. In column based storage, the entries of a column are stored in contiguous memory locations.

Row-based database systems are designed to efficiently return data for an entire row, or record, in as few operations as possible. This matches the common use-case where the system is attempting to retrieve information about a particular object. This is particularly useful for transactional systems that conduct large amounts of inserts, updates, and deletes of records.

Column-based database systems combine all of the values of a column together, then the values of the next column, and so on. Within this layout, any one of the columns more closely matches the structure of an index in a row-based system. The goal of a columnar database is to efficiently write and read data to and from hard disk storage in order to speed  up performance of select queries. This is particularly useful for systems that conduct large amounts of analytics.

Column-oriented-database

Row-based storage is recommended for transactional systems or when:

•  The table has a small number of rows, such as configuration tables.
•  The application needs to conducts updates of single records.
•  The application typically needs to access the complete record.
•  The columns contain mainly distinct values so the compression rate would be low.
•  Aggregations and fast searching are not required.

Column-based storage is recommended for analytical systems or when:

•  Calculations are executed on a single column or a few columns only.
•  The table is searched based on the values of a few columns.
•  The table has a large number of columns.
•  The table has a large number of records.
•  Aggregations and fast searching on large tables are required.
•  Columns contain only a few distinct values, resulting in higher compression rates.
Share

Partial Data Warehouse Appliances (Software Appliances)

Starting in 2006, a new wave of vendors emerged that focus on database management systems (DBMS) purpose-built for data warehousing that easily integrate with another vendor’s hardware. The products from these “partial-solution” data warehouse appliance vendors have been coined as software appliances”. The software appliance has proved to be a good starting point for new data warehouse vendors. It allows them to focus on their strength, database software, and to partner with commodity hardware vendors and benefit from the hardware firms’ focus and resources.

Data Warehouse Appliance - Partial Solution Vendors

The partial solution vendor is typically a software company that has developed a proprietary database and either partners with or is a subsidiary of one or more major hardware vendors. These software vendors generally focus on the development of an optimal database management software that is suitable for large data warehouse implementations, and commonly rely on other vendors to supply the most-appropriate hardware.

Software appliance vendors with products based upon the traditional row-based database model:

Kognitio – Offers a row-based “virtual” database called WX2 that does not include indexes or data partitions and runs on blade servers from IBM and hardware from Hewlett-Packard.  Kognitio can be deployed to a customer in one of three ways:  as a software-only license solution, as a fully configured data warehouse appliance, and on-demand via a Data Warehousing as a Service (DaaS) offering.

Software appliance vendors with products based upon the columnar database model in which data is stored in columns rather than rows:

ParAccel – The ParAccel Analytic Database is a proprietary column-based database that combines patent-pending innovations with a columnar, compressed, massively parallel, all-in-memory-capable architecture.  The database is available as a packaged appliance on Sun hardware known as Ultra-Fast Data Warehouse powered by Sun and ParAccel.
Vertica – The Vertica Analytic Database is a grid-based, column-oriented database that runs on grids of commodity Linux-based servers. It features column-oriented storage organization, compression, hybrid storage organization, and a shared nothing architecture.  In March 2011, Hewlett Packard (HP) completed an acquisition of Vertica Systems, Inc.
EXASOL – Offers a proprietary, massively parallel database management system called EXASolution. The product is column-based and makes extensive use of in-memory processing. Join indexes are automatically generated at run-time, as required.  EXASolution is available either as a software only solution or as a pre-configured appliance running on commodity hardware such as Fujitsu Siemens, HP, Supermicro or Hitachi. EXASolution uses its own version of Linux which provides extended capabilities to support its clustered architecture.
Sybase IQ – Distinguishes itself from conventional databases with its column-oriented, grid-based architecture, patented data compression, and advanced query optimizer.  Sybase IQ provides significantly faster query performance than traditional database environments which were designed primarily for processing transactions. The improvement over such solutions can be 10 to 1000 times greater, even as thousands of users and years worth of data are added to the environment.  Announced in May 2008, Sybase Analytical Appliance combines pSeries hardware from IBM with the Sybase IQ database.

Software appliance vendors with products based upon a hybrid row and column database:

EMC Greenplum – Acquired in July 2010 by EMC, Greenplum has become the foundation of EMC’s Data Computing Division. Moreover, the EMC Greenplum database contains a shared-nothing, massively parallel processing (MPP) architecture that has been purpose-built for business intelligence and analytical processing.
Aster Data – Provides a database management system (DBMS) called Aster Data nCluster, which is a massively parallel (MPP) hybrid row and column database with an integrated analytics engine.  Aster Data’s solution utilizes Aster Data’s patent-pending SQL-MapReduce to allow for parallel processing of data and applications and to deliver rich, analytic insights.  Aster Data products can be deployed to a customer in one of three ways:  Software-only, Cloud-based, and Complete-appliance called Aster Data MapReduce DW Appliance.  In April 2011, Teradata completed its acquisition of Aster Data Systems, Inc.
Share