SAP HANA Platform – Technical Overview

At the Sapphire conference in May 2010, SAP announced its new data warehouse appliance, HANA (High-Performance Analytic Appliance). SAP HANA combines hardware, storage, an operating system, management software, and an in-memory data query engine; its defining characteristic is that data is held in RAM rather than read from disk or flash storage. HANA is also built to split queries so that they run in parallel on multiple processors, a fundamentally different architecture from SAP’s existing applications. This in-memory, parallel-processing architecture allows very fast queries and analytics on very large amounts of data.

The SAP HANA platform implements a new approach to business data processing. It is much more than a database in the traditional sense, and its in-memory design is much more than simple caching of disk data structures in the server’s main memory. SAP HANA incorporates a full database management system (DBMS) with a standard SQL interface, transactional isolation and recovery (the ACID properties: atomicity, consistency, isolation, durability), high availability, and massively parallel processing (MPP). SQL is the standard interface to SAP HANA, which supports most of entry-level SQL-92. SAP applications that use Open SQL can run on the SAP HANA platform without changes. SAP HANA is designed to adapt to the dramatic advances in hardware and storage technology, on premises and in the cloud: it supports multicore CPUs, and 64-bit systems offer a new reality in scalability.

Traditional database management systems (DBMS) are designed to optimize performance on hardware with limited main memory, where disk I/O is typically the main bottleneck. Their focus is on optimizing disk access, for example by minimizing the number of disk pages read into main memory during processing. The SAP HANA database component, by contrast, is designed from the ground up around the idea that memory is available in abundance. Its design takes into account that the theoretical limit of memory capacity for 64-bit systems is roughly 18 billion gigabytes (18 exabytes), and that hard disk I/O is not a constraint. Instead of optimizing hard disk I/O, SAP HANA optimizes memory access between the CPU cache and main memory. In addition, SAP HANA is a massively parallel (distributed) data management system that runs completely in main memory, offers row- and column-based storage options, and includes a sophisticated calculation engine.
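The 18-exabyte figure can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a flat, byte-addressable 64-bit address space; the function and variable names are illustrative and not part of any SAP tool.

```python
# Assumption: every one of the 2**64 possible addresses refers to one byte.
ADDRESSABLE_BYTES = 2 ** 64


def to_decimal_exabytes(n_bytes: int) -> float:
    """Convert bytes to exabytes (1 EB = 10**18 bytes)."""
    return n_bytes / 10 ** 18


def to_binary_exbibytes(n_bytes: int) -> float:
    """Convert bytes to exbibytes (1 EiB = 2**60 bytes)."""
    return n_bytes / 2 ** 60


decimal_eb = to_decimal_exabytes(ADDRESSABLE_BYTES)
binary_eib = to_binary_exbibytes(ADDRESSABLE_BYTES)
billion_gb = ADDRESSABLE_BYTES / 10 ** 9 / 10 ** 9  # gigabytes, in billions

print(f"{decimal_eb:.2f} EB, {binary_eib:.0f} EiB, {billion_gb:.2f} billion GB")
# prints "18.45 EB, 16 EiB, 18.45 billion GB"
```

In decimal units the limit is about 18.4 exabytes, which matches the "roughly 18 billion gigabytes" quoted in the text; in binary units the same quantity is exactly 16 EiB.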

The HANA database takes advantage of the low cost of main memory (RAM), the data processing capabilities of multi-core processors, and the fast data access of solid-state drives relative to traditional hard drives to deliver better performance for analytical and transactional applications. It offers a multi-engine query processing environment that supports relational data (with both row- and column-oriented physical representations in a hybrid engine) as well as graph and text processing for semi-structured and unstructured data management within the same system.

The SAP HANA solution was introduced on Hewlett-Packard x86 servers (HP ProLiant DL580 G7 and DL980 G7) and is built upon Intel’s multi-core processors. A single server blade can contain up to 4 TB of main memory and up to 64 processor cores. With this complete solution, SAP claims to have beaten the current performance benchmark by a factor of 20 on hardware that was several dozen times cheaper, for a 200x price-performance improvement. SAP also claims that HANA either reduces or outright eliminates the need to develop and deploy complex and expensive data marts.

The SAP HANA database manages data in a multi-core architecture, distributing data across all cores to maximize RAM locality, and it supports both scale-out (horizontal) and scale-up (vertical) growth. The HANA database scales beyond a single server by allowing multiple servers in one cluster. Large tables can be distributed across multiple servers using round-robin, hash, or range partitioning, either alone or in combination. HANA can execute queries and maintain distributed transaction safety across multiple servers.
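The three partitioning strategies named above can be illustrated with a small sketch. This is not HANA's implementation; the function names, the use of CRC32 as the hash function, and the sample rows are assumptions chosen only to show how each strategy assigns rows to servers.

```python
import bisect
import zlib
from collections import defaultdict


def round_robin_partition(rows, n_servers):
    """Assign rows to servers in turn: row i goes to server i mod n."""
    parts = defaultdict(list)
    for i, row in enumerate(rows):
        parts[i % n_servers].append(row)
    return dict(parts)


def hash_partition(rows, n_servers, key):
    """Assign each row by a hash of its key column (CRC32 is an assumption)."""
    parts = defaultdict(list)
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[digest % n_servers].append(row)
    return dict(parts)


def range_partition(rows, boundaries, key):
    """Assign each row by value range; boundaries are sorted lower bounds
    of partitions 1..n, so partition 0 holds values below boundaries[0]."""
    parts = defaultdict(list)
    for row in rows:
        parts[bisect.bisect_right(boundaries, row[key])].append(row)
    return dict(parts)


# Hypothetical sample data: eight rows spread over fiscal years 2010-2013.
rows = [{"id": i, "year": 2010 + i % 4} for i in range(8)]

by_turn = round_robin_partition(rows, 3)
by_hash = hash_partition(rows, 3, "id")
by_range = range_partition(rows, [2011, 2013], "year")

print([r["id"] for r in by_turn[0]])   # prints [0, 3, 6]
print([r["id"] for r in by_range[1]])  # prints [1, 2, 5, 6]
```

Round-robin spreads rows evenly without looking at their content, hash partitioning keeps rows with the same key on the same server, and range partitioning lets a query on a value range touch only the relevant servers.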

Using column-based data storage, SAP HANA can achieve compression rates that are not attainable in traditional row-based databases. In one example, an analysis of SAP customers’ systems showed that only 10% of the attributes in a single financial database table were used in SQL statements. This shrank the data volume to be accessed from 35 GB in traditional relational database management system (RDBMS) storage to 800 MB in a column-store design, just over 2% of the original volume.
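The figures in this example can be worked through as follows. The sketch uses only the numbers quoted above; the "implied compression" value is an inference that assumes all attributes are of equal width, which the source does not state.

```python
ROW_STORE_GB = 35.0             # volume in traditional RDBMS storage
COLUMN_STORE_GB = 0.8           # 800 MB in the column-store design
USED_ATTRIBUTE_FRACTION = 0.10  # share of attributes actually queried

# Share of the original volume that still has to be accessed.
ratio = COLUMN_STORE_GB / ROW_STORE_GB

# Volume if only the used columns were read, with no compression at all.
# Assumption: every attribute occupies the same amount of space.
projected_gb = ROW_STORE_GB * USED_ATTRIBUTE_FRACTION

# Remaining reduction that would have to come from compression.
implied_compression = projected_gb / COLUMN_STORE_GB

print(f"{ratio * 100:.1f}% of original volume")     # prints "2.3% of original volume"
print(f"{implied_compression:.1f}x from compression")  # prints "4.4x from compression"
```

Under that equal-width assumption, reading only the used columns accounts for a tenfold reduction, and compression would account for roughly a further fourfold reduction.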

One of the major points of contention, and a reason for slow performance in traditional DBMSs, is the locking of data while updates or inserts are performed. SAP HANA avoids this issue and enables high levels of parallelization by using insert-only data records. Instead of overwriting existing records in a database table, changes are inserted as net-new delta entries for those records in the column store.
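The insert-only idea can be sketched with an append-only structure in which an update never modifies an earlier entry. The class below is a hypothetical illustration, not HANA's delta mechanism; the class and method names are assumptions.

```python
class InsertOnlyTable:
    """Append-only store: every write adds a new version, nothing is overwritten."""

    def __init__(self):
        self.versions = []  # list of (sequence_number, key, value)

    def write(self, key, value):
        """Record an insert or an update as a net-new entry."""
        self.versions.append((len(self.versions), key, value))

    def read(self, key):
        """Return the most recent value for key by scanning from the newest entry."""
        for _, k, value in reversed(self.versions):
            if k == key:
                return value
        raise KeyError(key)

    def history(self, key):
        """Return all values ever written for key, oldest first."""
        return [value for _, k, value in self.versions if k == key]


table = InsertOnlyTable()
table.write("account-1", 100)
table.write("account-2", 50)
table.write("account-1", 120)  # an "update" is just another insert

print(table.read("account-1"))     # prints 120
print(table.history("account-1"))  # prints [100, 120]
```

Because existing entries are never changed, readers can work on them without waiting for a writer to release a lock on the record being updated.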

The following table summarizes the benefits offered by specific features of the SAP HANA database:

Database Feature	Benefit
Multi-Core Architecture	Significant Computation Power over Multiple Processors (CPUs)
In-Memory Processing	Performance Faster Than Reading From Disk
Support of Row and Column Based Storage	Enables Both Transactional and Analytical Databases
Column Based Storage	Fast Select Query Performance
High Data Compression Rates	Efficient Use of Disk Storage
Data Partitioning	Efficient and Fast Analysis of Very Large Data Sets
Insert Only On Deltas	Fast Data Loads

One of the differentiating attributes of SAP HANA is that it has both row-based and column-based stores within the same database engine. Conceptually, a database table is a two-dimensional data structure with cells organized in rows and columns. Computer memory, however, is organized as a linear sequence. There are two options for storing a table in linear memory: row storage and column storage. Row storage stores a sequence of records, each containing the fields of one row of the table. In column storage, the entries of a column are stored in contiguous memory locations.
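The difference between the two layouts can be shown by flattening a small table into a linear sequence in both ways. This is a minimal sketch with made-up sample data; a Python list stands in for linear memory.

```python
# Hypothetical table with columns (id, city, amount).
table = [
    (1, "Berlin", 250),
    (2, "Paris", 310),
    (3, "Berlin", 125),
]


def to_row_layout(rows):
    """Store the fields of each row next to each other."""
    return [value for row in rows for value in row]


def to_column_layout(rows):
    """Store the entries of each column next to each other."""
    return [value for column in zip(*rows) for value in column]


row_layout = to_row_layout(table)
column_layout = to_column_layout(table)
n_rows, n_cols = len(table), len(table[0])

print(row_layout)
# prints [1, 'Berlin', 250, 2, 'Paris', 310, 3, 'Berlin', 125]
print(column_layout)
# prints [1, 2, 3, 'Berlin', 'Paris', 'Berlin', 250, 310, 125]

# Summing "amount" reads one contiguous slice in the column layout ...
amount_total = sum(column_layout[2 * n_rows:3 * n_rows])
# ... but has to skip over the other fields in the row layout.
amount_total_rows = sum(row_layout[2::n_cols])
print(amount_total, amount_total_rows)  # prints 685 685
```

Reading a complete record is a contiguous access in the row layout, while an aggregate over one column is a contiguous access in the column layout.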

Row-based storage is recommended for transactional systems or when:
• The table has a small number of rows, such as configuration tables.
• The application needs to update single records.
• The application typically needs to access the complete record.
• The columns contain mainly distinct values so the compression rate would be low.
• Aggregations and fast searching are not required.

Column-based storage is recommended for analytical systems or when:
• Calculations are executed on a single column or a few columns only.
• The table is searched based on the values of a few columns.
• The table has a large number of columns.
• The table has a large number of records.
• Aggregations and fast searching on large tables are required.
• Columns contain only a few distinct values, resulting in higher compression rates.
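The link between few distinct values and high compression can be illustrated with dictionary encoding, a common column-store technique. The text above does not say which compression scheme is meant, so the choice of dictionary encoding, the function names, and the sample column are assumptions.

```python
def dictionary_encode(column):
    """Replace each value with a small integer code into a sorted dictionary."""
    dictionary = sorted(set(column))
    index = {value: code for code, value in enumerate(dictionary)}
    codes = [index[value] for value in column]
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


def bits_per_code(dictionary):
    """Minimum number of bits needed to store one code (at least 1)."""
    return max(1, (len(dictionary) - 1).bit_length())


# Hypothetical column with only three distinct values.
country = ["DE", "FR", "DE", "US", "DE", "FR"]
dictionary, codes = dictionary_encode(country)

print(dictionary)                 # prints ['DE', 'FR', 'US']
print(codes)                      # prints [0, 1, 0, 2, 0, 1]
print(bits_per_code(dictionary))  # prints 2
```

With three distinct values each entry needs only 2 bits instead of a full string. A column of mostly distinct values would need a dictionary nearly as large as the column itself, which is why such tables compress poorly.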