Data Lineage and Data Provenance

Data Lineage and Data Provenance commonly refer to the steps by which a data set arrives at its final state. Both concepts provide a mechanism for documenting data flows and tracing changes in data elements, either forward from source to destination or backward from destination to source. Presentations of data flows can be defined at either the summary or the detail level. At the summary level, a presentation provides only the names and types of systems that data interacts with as it flows through an organization. At the detail level, a presentation can include specifics about each data point, transport mechanisms, attribute properties, and data quality issues.

Data Lineage is the life cycle of data elements over time, including the origin of the data, what happens to it, and where it moves throughout an organization. Data lineage is typically depicted graphically, showing how data flows from source to destination and how it is transformed along the way. A simple representation of data lineage uses dots and lines: the dots represent data containers, and the lines represent the movement and transformation of data between those containers. More complex representations also capture the type of each data container (e.g., application, database, data file, query, or report) and the transport method of the data (e.g., FTP, SFTP, HTTPS, ETL, EAI, EII, ODBC, or JDBC).
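The dot-and-line representation described above can be sketched as a small graph structure. This is a minimal illustration, not a real lineage tool; all container names and transport methods below are invented for the example.

```python
# Minimal sketch of a data lineage graph: nodes are data containers,
# edges are movements/transformations between them. All names and
# transport methods are illustrative assumptions.

class LineageGraph:
    def __init__(self):
        self.containers = {}   # container name -> container type
        self.edges = []        # (source, target, transport method)

    def add_container(self, name, kind):
        self.containers[name] = kind

    def add_flow(self, source, target, transport):
        self.edges.append((source, target, transport))

    def downstream(self, name):
        """Trace forward from a container to every destination it feeds."""
        found, stack = set(), [name]
        while stack:
            current = stack.pop()
            for src, dst, _ in self.edges:
                if src == current and dst not in found:
                    found.add(dst)
                    stack.append(dst)
        return found

graph = LineageGraph()
graph.add_container("orders_app", "application")
graph.add_container("staging_db", "database")
graph.add_container("sales_report", "report")
graph.add_flow("orders_app", "staging_db", "ETL")
graph.add_flow("staging_db", "sales_report", "ODBC")

print(graph.downstream("orders_app"))  # {'staging_db', 'sales_report'}
```

Tracing backward from destination to source would work the same way, just following edges in the opposite direction.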

Data Provenance is a type of data lineage that is specific to database systems and comprises the inputs, entities, systems, and processes that influence data within an organization. Data provenance provides a historical record of data from origin to destination. It is used to trace records through a data flow, to execute a data flow on a subset of its inputs, and to debug data flows that contain errors.

With the use of data provenance, a data scientist or data analyst can ascertain the quality of data within a system, diagnose data anomalies, determine solutions for correcting data errors, analyze derivations of data elements, and provide attribution to data sources. Within an organization, data provenance can be used to drill down to the source of data in a data warehouse, track the creation of individual records, and provide an audit trail of record updates. Fundamentally, data provenance documents the processes that influence data of interest, in effect providing a historical record of the data and its changes over time.
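The record-level audit trail described above can be sketched as an append-only log of provenance events. The systems, process names, and record identifiers here are hypothetical, chosen only to show how a record is traced from origin to destination.

```python
# Illustrative sketch: record-level provenance as an append-only audit
# trail. All system names, process names, and IDs are assumptions.
from datetime import datetime, timezone

provenance_log = []

def record_event(record_id, system, process, detail):
    """Append one provenance event for a record."""
    provenance_log.append({
        "record_id": record_id,
        "system": system,
        "process": process,
        "detail": detail,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

def trace(record_id):
    """Return the historical record of a data element, origin first."""
    return [e for e in provenance_log if e["record_id"] == record_id]

record_event("cust-42", "crm_app", "create", "record entered by user")
record_event("cust-42", "staging_db", "transform", "address standardized")
record_event("cust-42", "warehouse", "load", "loaded into customer dimension")

for event in trace("cust-42"):
    print(event["system"], event["process"], event["detail"])
```

A real provenance store would add lineage links between input and output records; this sketch only captures the per-record history that makes an audit trail possible.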


Master Data Management “The Golden Record”

The Golden Record is a fundamental concept within Master Data Management (MDM): it identifies and defines the single version of truth, where truth is understood to be data that is trusted to be both accurate and correct. When building database tables from disparate data sources, there are commonly issues of duplicate records, incomplete values within a record, and records with poor data quality. The Golden Record addresses these issues by resolving duplicates, supplying values where none exist, and improving data quality within a record. In short, the Golden Record is the record that an organization considers the best possible record to use.

The main consideration in the creation and maintenance of the Golden Record is the matching and merging of records that were created in different data sources. A good MDM system includes functionality to automatically merge similar records as far as possible. Additionally, a good MDM system provides functionality that allows a data steward to manually determine the best possible record. The data steward should be able to use their knowledge of a particular data set to make a judgment about the correct values in a record, and should be able to identify whether an individual attribute or an entire record is correct.
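One simple way automatic matching can work is to score field-by-field similarity between two records and flag pairs above a threshold as merge candidates, leaving borderline pairs for the data steward. This is a toy sketch; the threshold and the use of string similarity are assumptions, not how any particular MDM product works.

```python
# Illustrative sketch of automatic record matching: average a simple
# string-similarity score across shared fields. The 0.85 threshold is
# an assumption for the example.
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio between two values, case-insensitive."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def records_match(rec1, rec2, threshold=0.85):
    """True if the average similarity across shared fields clears the
    threshold; borderline pairs would be routed to a data steward."""
    fields = rec1.keys() & rec2.keys()
    score = sum(similarity(rec1[f], rec2[f]) for f in fields) / len(fields)
    return score >= threshold

a = {"name": "John Smith", "city": "Boston"}
b = {"name": "Jon Smith", "city": "Boston"}
print(records_match(a, b))  # True
```

Production matching engines use far richer techniques (phonetic encoding, blocking, probabilistic models), but the shape of the decision is the same: score, threshold, and escalate the uncertain cases to a human.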

Source Data From Multiple Systems

When similar records from different systems have different values, one of the records must be chosen as the correct one. In order to determine the correct record, either the system or the data steward will need to consider the use of each data set, the level of quality in each data set, the attributes of each data set that are the most reliable, and the rules for determining priority for each field. In the example above, each of three systems contains a record that is similar to a record contained in the other two systems, but the values in each attribute of the three similar records are not exactly the same. The Golden Record could be an entire record taken from one of the source systems, or a combination of attributes from the records in the source systems.

The Golden Record – The Best Choice of Attributes

In this case, the Golden Record contains a combination of values from the multiple source systems. The value for the name field is taken from source system three, while the values of all the other attributes are taken from source system one.
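This field-level survivorship can be sketched as a set of per-attribute priority rules: for each field, take the value from the highest-ranked source system that actually has one. The records and rankings below are invented to mirror the example, not drawn from a real system.

```python
# Hypothetical survivorship sketch: build a Golden Record by taking each
# attribute from the source system ranked most reliable for that field.
source_records = {
    "system_1": {"name": "J. Smith",   "phone": "555-0100", "city": "Boston"},
    "system_2": {"name": "John Smith", "phone": None,       "city": "boston"},
    "system_3": {"name": "John Smith", "phone": "555-0100", "city": None},
}

# Field-level priority rules: which system wins for each attribute.
# (These rankings are illustrative assumptions.)
priority = {
    "name":  ["system_3", "system_1", "system_2"],
    "phone": ["system_1", "system_3", "system_2"],
    "city":  ["system_1", "system_2", "system_3"],
}

def build_golden_record(records, rules):
    golden = {}
    for field, ranking in rules.items():
        for system in ranking:
            value = records[system].get(field)
            if value:  # skip missing values and fall through to next system
                golden[field] = value
                break
    return golden

print(build_golden_record(source_records, priority))
# {'name': 'John Smith', 'phone': '555-0100', 'city': 'Boston'}
```

Because the rules operate per field, the result is exactly the case described above: the name survives from one system while the remaining attributes survive from another.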
