Data Lineage and Data Provenance

Data lineage and data provenance commonly refer to the steps by which a data set arrives at its final state. Both concepts provide a mechanism for documenting data flows and tracing changes in data elements, either forward from source to destination or backward from destination to source. Presentations of data flows can be defined at either the summary or the detail level. At the summary level, a presentation provides only the names and types of systems with which data interacts as it flows through an organization. At the detail level, a presentation can include specifics about each data point, transport mechanisms, attribute properties, and data quality issues.

Data lineage is defined as the life cycle of data elements over time, including the origin of the data, what happens to it, and where it moves throughout an organization. Data lineage is typically represented graphically, showing the flow of data from source to destination and how the data is transformed along the way. A simple representation of data lineage can be drawn with dots and lines: the dots represent the data containers and the lines represent the movement and transformation of data between them. More complex representations also capture the type of each data container (e.g., application, database, data file, query, or report) and the transport method of the data (e.g., FTP, SFTP, HTTPS, ETL, EAI, EII, ODBC, or JDBC).
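The dots-and-lines representation can be sketched as a small graph. This is a minimal illustration with hypothetical container and transport names, not a real lineage tool: nodes are data containers, edges record the movement of data between them, and a simple traversal traces lineage forward from a source.

```python
# A minimal data-lineage graph. Nodes ("dots") are data containers;
# edges ("lines") record movement and transformation between containers.
# All container and transport names here are hypothetical examples.

lineage = {
    "nodes": {
        "crm_db":     {"type": "database"},
        "sales_etl":  {"type": "query"},
        "warehouse":  {"type": "database"},
        "kpi_report": {"type": "report"},
    },
    "edges": [
        ("crm_db", "sales_etl", {"transport": "JDBC"}),
        ("sales_etl", "warehouse", {"transport": "ETL"}),
        ("warehouse", "kpi_report", {"transport": "ODBC"}),
    ],
}

def trace_forward(graph, source):
    """Follow edges from a source container to every downstream container."""
    downstream, frontier = [], [source]
    while frontier:
        node = frontier.pop()
        for src, dst, _props in graph["edges"]:
            if src == node:
                downstream.append(dst)
                frontier.append(dst)
    return downstream

path = trace_forward(lineage, "crm_db")
```

Tracing backward from a destination works the same way with the edge direction reversed, which is how a detail-level lineage view answers "where did this report's data come from?"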

Data provenance is a type of data lineage that is specific to database systems and comprises the inputs, entities, systems, and processes that influence data within an organization. Moreover, data provenance provides a historical record of data from origin to destination. It is used to trace records through a data flow, to execute a data flow on a subset of its inputs, and to debug data flows that contain errors.

With the use of data provenance, a data scientist or data analyst can ascertain the quality of data within a system, diagnose data anomalies, determine solutions to correct data errors, analyze derivations of data elements, and provide attribution to data sources. Within an organization, data provenance can be used to drill down to the source of data in a data warehouse, track the creation of individual records, and provide an audit trail of updates to records. Fundamentally, data provenance documents the processes that influence data of interest, in effect providing a historical record of the data and its changes over time.
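The idea of a record-level audit trail can be sketched as follows. This is a minimal in-memory model with hypothetical names, assuming each update to a record appends an entry noting what changed, which source influenced it, and when:

```python
# A minimal sketch of record-level provenance: every change to a record
# is appended to an audit trail, giving a historical record of the data
# and its changes over time. Record IDs and sources are hypothetical.

from datetime import datetime, timezone

class ProvenancedRecord:
    def __init__(self, record_id, values, source):
        self.record_id = record_id
        self.values = dict(values)
        # The audit trail starts with the record's origin.
        self.audit_trail = [
            ("created", source, dict(values), datetime.now(timezone.utc))
        ]

    def update(self, changes, source):
        """Apply changes and record which source influenced the data."""
        self.values.update(changes)
        self.audit_trail.append(
            ("updated", source, dict(changes), datetime.now(timezone.utc))
        )

    def origin(self):
        """Drill down to the original source of the record."""
        return self.audit_trail[0][1]

rec = ProvenancedRecord("cust-42", {"name": "Acme Corp"}, source="crm_db")
rec.update({"name": "Acme Corporation"}, source="data_steward")
```

Walking `audit_trail` from the first entry to the last reconstructs the record's history, which is exactly the drill-down and audit capability described above.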


Master Data Management “The Golden Record”

The Golden Record is a fundamental concept within Master Data Management (MDM) that identifies and defines the single version of the truth, where truth is understood to be data that is trusted to be both accurate and correct. When building database tables from disparate data sources, there are commonly issues of duplicated records, incomplete values within a record, and records with poor data quality. The Golden Record addresses these issues by resolving duplications, by supplying values where a value may be missing, and by improving data quality within a record. In short, the Golden Record is the record that an organization assumes to be the best possible record to use.

The main consideration in the creation and maintenance of the Golden Record is the matching and merging of records that were created in different data sources. A good MDM system will include functionality to automatically merge similar records wherever possible. Additionally, a good MDM system will provide functionality that allows a data steward to manually determine the best possible record. The data steward should be able to use their knowledge of a particular data set to make a judgement about the correct values in a record, and to identify whether an individual attribute or an entire record is correct.
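Automatic matching can be sketched as clustering records under a normalized match key. This is a deliberately simplified illustration with hypothetical field names; production MDM matching uses fuzzy and probabilistic techniques, and records that cannot be grouped confidently are routed to a data steward:

```python
# A minimal sketch of automatic record matching: records are grouped under
# a normalized key built from matching fields, approximating how an MDM
# system clusters similar records before merging. Field names are examples.

def match_key(record):
    """Normalize the fields used for matching (case, punctuation, spacing)."""
    name = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    zip_code = record["zip"].strip()
    return (name, zip_code)

def group_similar(records):
    groups = {}
    for rec in records:
        groups.setdefault(match_key(rec), []).append(rec)
    return groups

records = [
    {"name": "ACME Corp.",  "zip": "10001", "source": "billing"},
    {"name": "Acme Corp",   "zip": "10001", "source": "crm"},
    {"name": "Widget Inc.", "zip": "94105", "source": "crm"},
]
groups = group_similar(records)
```

Here the two Acme records normalize to the same key and form one cluster to be merged, while the Widget record stands alone.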

Source Data From Multiple Systems

When similar records from different systems have different values, one of the records has to be chosen as the correct one. To determine the correct record, either the system or the data steward will need to consider the use of each data set, the level of quality in each data set, the attributes of each data set that are the most reliable, and the rules for determining priority for each field. Suppose, for example, that each of three source systems contains a record similar to a record in the other two, but the values in each attribute of the three records are not exactly the same. The Golden Record could be an entire record taken from one of the source systems, or a combination of attributes from the records in the source systems.

The Golden Record – The Best Choice of Attributes

In this case, the Golden Record contains a combination of values from the multiple source systems: the value for the name field is taken from source system three, while the values of all the other attributes are taken from source system one.
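This field-level survivorship can be sketched as follows. The three source records and the per-field priority rules below are hypothetical, but the mechanics mirror the example: each attribute takes its value from the highest-priority source that has one, so the name survives from system three while the remaining attributes survive from system one.

```python
# A minimal sketch of field-level survivorship for building a Golden Record.
# Source systems, values, and priority rules are hypothetical examples.

sources = {
    "system_1": {"name": "Jon Smith",      "phone": "555-0100", "city": "Boston"},
    "system_2": {"name": "J. Smith",       "phone": None,       "city": "Boston"},
    "system_3": {"name": "Jonathan Smith", "phone": "555-0100", "city": None},
}

# Per-field source priority, as a data steward might configure it.
priority = {
    "name":  ["system_3", "system_1", "system_2"],
    "phone": ["system_1", "system_2", "system_3"],
    "city":  ["system_1", "system_2", "system_3"],
}

def golden_record(sources, priority):
    golden = {}
    for field, ranked in priority.items():
        for system in ranked:
            value = sources[system].get(field)
            if value is not None:  # survivorship: first non-missing value wins
                golden[field] = value
                break
    return golden

gr = golden_record(sources, priority)
```

Because the priority lists fall back to the next source when a value is missing, this same rule set also fills incomplete values, one of the core jobs of the Golden Record.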


Types of Enterprise Data (Transactional, Analytical, Master)

All business enterprises have three varieties of physical data located within their numerous information systems. These varieties of data are characterized by their data types and their purpose within the organization:
• Transactional Data
• Analytical Data
• Master Data

Data Element Types

Transactional data supports the daily operations of an organization (i.e., it describes business events). Analytical data supports decision making, reporting, querying, and analysis (i.e., it describes business performance). Master data represents the key business entities upon which transactions are executed and the dimensions around which analysis is conducted (i.e., it describes key business entities).

Transactional Data

Transactional data consists of the elements that support the ongoing operations of an organization and is included in the application systems that automate key business processes. These can include areas such as sales, service, order management, manufacturing, purchasing, billing, accounts receivable, and accounts payable. Commonly, transactional data refers to the data that is created and updated within the operational systems. Examples of transactional data include the time, place, price, discount, and payment method used at the point of sale. Transactional data is normally stored in normalized tables within Online Transaction Processing (OLTP) systems designed for integrity. Rather than being the objects of a transaction, such as a customer or a product, transactional data is the describing data, including times and numeric values.
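The normalized OLTP design described above can be sketched with an in-memory SQLite database. Table and column names here are hypothetical: the sale row carries the describing data (time, price, payment method) and references the objects of the transaction, customer and product, by foreign key, which is what "designed for integrity" means in practice.

```python
# A minimal sketch of transactional data in normalized OLTP tables.
# The schema and sample values are hypothetical examples.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sale (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        product_id  INTEGER NOT NULL REFERENCES product(id),
        sold_at     TEXT    NOT NULL,   -- time of the business event
        price       REAL    NOT NULL,   -- describing numeric value
        payment     TEXT    NOT NULL    -- payment method at point of sale
    );
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO product  VALUES (1, 'Widget')")
conn.execute(
    "INSERT INTO sale VALUES (1, 1, 1, '2024-01-05T10:30:00', 9.99, 'card')"
)
conn.commit()
```

Note that the customer and product rows referenced here are master data; the sale row is the transactional data describing the event.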

Analytical Data

Analytical data consists of the numerical values, metrics, and measurements that provide business intelligence and support organizational decision making. Typically, analytical data is stored in Online Analytical Processing (OLAP) repositories optimized for decision support, such as enterprise data warehouses and departmental data marts. Analytical data is characterized as the facts and numerical values in a dimensional model. Normally, the data resides in fact tables surrounded by key dimensions such as customer, product, account, location, and date/time. Analytical data is thus defined as the numerical measurements rather than the describing data.
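The fact-and-dimension structure can be sketched in miniature. The tables and values below are hypothetical: the fact rows hold the numeric measure (revenue), the dimensions hold the describing data, and a roll-up along the date dimension is the kind of aggregation an OLAP query performs.

```python
# A minimal sketch of a dimensional model: a fact table of sales
# measurements surrounded by product and date dimensions. All table
# contents are hypothetical examples.

product_dim = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
}
date_dim = {
    20240105: {"month": "2024-01"},
    20240212: {"month": "2024-02"},
}

# Each fact row: (product_key, date_key, revenue)
sales_facts = [
    (1, 20240105, 9.99),
    (2, 20240105, 19.99),
    (1, 20240212, 9.99),
]

def revenue_by_month(facts, dates):
    """Roll the revenue measure up along the date dimension."""
    totals = {}
    for _product_key, date_key, revenue in facts:
        month = dates[date_key]["month"]
        totals[month] = round(totals.get(month, 0.0) + revenue, 2)
    return totals

totals = revenue_by_month(sales_facts, date_dim)
```

Grouping by `product_dim` categories instead of months would be the same roll-up along a different dimension, which is the essence of slicing analytical data.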

Master Data

Master data is usually considered to play a key role in the core operation of a business. It refers to the key organizational entities that are used by several functional groups and are typically stored in different data systems across an organization. Master data represents the business entities around which the organization's business transactions are executed and the primary elements around which analytics are conducted. It is typically persistent, non-transactional data, utilized by multiple systems, that defines the primary business entities. Master data may include data about customers, products, employees, inventory, suppliers, and sites.
