Graph NoSQL Database

Graph NoSQL Database

A graph database is a NoSQL database that organizes data as nodes, which are like records in a relational database, and relationships, which represent connections between nodes. Because the graph system stores the relationship between nodes, it can support richer representations of data relationships. Relationships are the key concept in graph databases, representing an abstraction that is not directly implemented in RDBMS or other NoSQL databases. Primarily, graph databases are applied in systems that share relationships between values, such as social networks, reservation systems, fraud detection, or customer relationship management systems. And graph databases address significant limitations of existing relational database management systems (RDBMS).

Graph databases, by design, allow simple and fast retrieval of complex hierarchical structures that are difficult to model with RDBMS.  They allow for simple queries that display the nearest neighboring nodes.  And they allow for complex queries that explore vast networks of connections and quickly find patterns in the connections.  Flexible structure enables graph databases to accommodate complex data that doesn’t conform to rigid data models required for RDBMS implementations.

Graph databases contain four types of data fields (nodes, relationships, properties, & labels):

•  Nodes: Objects that represent data entities or instances such as people, businesses, accounts, products or any other item to be tracked. They are roughly the equivalent of the record or row in a relational database, or the document in a document-store database. Each node contains several pieces of information that go together. For example, a single node might include a product name, description, price and product code. Another might have information about a customer, such as name and account number.
•  Relationships: Objects that describe how the nodes relate to each other. Relationships represent the connections, edges, or lines between nodes to other nodes. A relationship connects two nodes and enables users to find related nodes. A relationship always has a source node and a target node that provides the direction of the arrow. Meaningful patterns can emerge when examining the connections and interconnections of nodes.
•  Properties: Additional attributes of both nodes and relationships that are represented as additional key-value pairs. Properties store relevant data about the node or relationship with the entity it describes. Examples of priorities for a node with a label of person include name, age, address, & date of birth. Relationships usually have properties including time, distance, cost, rating or weights which are also stored as key-value pairs.
•  Labels: Named graph construct that is used to group nodes into sets, and all nodes with the same label belongs to the same set. Many database queries can work with these sets instead of the whole graph, making queries easier to write and more efficient to execute. A node may be labeled with any number of labels, including none, making labels an optional addition to the graph.

Each node in the graph database model directly and physically contains a list of relationships that represent the connections to other nodes. Unlike traditional RDBMS, graph databases do not utilize foreign keys or join operations. Instead, all relationships are natively stored within vertices.

Graph databases are purpose-built for the analysis of interconnections and relationships of data entities. This design relates well to analysis of data retrieved from social media, web, and mobile applications,. Graph databases are also useful for working with data in business disciplines that involve analyzing complex relationships and dynamic schema, such as supply chain management, customer relationship management, law enforcement intelligence, and fraud detection.

Document NoSQL Database

Document NoSQL Database

A document database, also called a document store or document-oriented database, is a NoSQL database used for storing, retrieving, and managing semi-structured data. Unlike traditional relational database management systems (RDBMS), the data model in a document database is not structured in a table format of rows and columns. A document database uses documents as the structure for storage and queries. Subsequently a document database aggregates data from documents and stores the documents a searchable and organized format. The schema of document databases can vary, providing far more flexibility for data modeling than RDBMS. In this case, the term “document” may refer to a MS Word, MS Excel, MS PowerPoint or Adobe PDF document but is commonly a block of extensible markup language (XML) or javascript object notation (JSON) code and values. Instead of columns with names and data types that are used in RDBMS, a document contains a description of the data type and the value for that description. Each document stored within a document database can have the same or different structure.

Document databases use a tree-like structure that begins with a root node. And beneath the root node, there is a sequence of branches, sub-branches, and values. Subsequently, each branch has a related path expression that shows to navigate from the root of the tree to any given branch, sub-branch, or value. Most document stores group documents together within document collections. And these collections are similar in look and feel to the directory structure in a Windows or UNIX/Linux file system. Document collections can be used to navigate document hierarchies, logically group similar documents, and to store business rules including permissions, indexes, and triggers. Additionally, collections can contain other collections.

Documents within document databases are identified using a unique key, which contains a simple identifier. The key usually contains either a string, a URI, or a path. And the key can be used to retrieve the document from the database. Typically, the database retains an index on the key to speed up document retrieval. And sometimes, the key is used to create or insert the document into the database.

A key advantage of a document database is that all values within the document are automatically indexed when a new document is insert into the database. That means that every value within the document can be searched upon. This also means that if a user knows any property of the document, all documents with the same property can be easily retrieved. And even if the document structure is complex, a document store search provides an easy way to select either an entire document or a sub-set of a document. Additionally, document database searches can tell the user whether the search item is included within a document as well as the search items exact location utilizing the document path.

To add additional types of data to a document database, there is no need to modify the entire database schema as is with a RDBMS. Data can simply be added by adding objects to the database. Further document databases utilize internal structure within documents in order to extract metadata that the database engine uses for further optimization and query performance. Unlike traditional RDBMS, some document databases prioritize write availability over strict data consistency. This ensures that writes will always be fast even if there is a failure in one portion of the hardware or network.

Key-Value NoSQL Databases

Key-Value Database

A key-value database, also known as a key-value store, is the most flexible type of NoSQL database. Key-value databases have emerged as an alternative to many of the limitations of traditional relational databases, where data is structured in tables and the schema must be predefined. In a key-value data store, there is no schema and the value of the data can be just about anything. Values are identified and accessed via a key, and values can be numbers, strings, counters, JSON, XML, HTML, binaries, images, videos, and more. It is the most flexible NoSQL model because the application has complete control over what is stored in the value field with no restrictions.

A key-value database is a simple database that contains a simple string (the key) that is always unique, and an arbitrary large data field (the value). Key-value stores have no query language, but they do provide a way to add and remove key-value pairs. Additionally, values cannot be queried or searched upon. Only the key can be queried. Within a key-value database, only three functions can be executed (put, get, delete).

•  Put:  Adds a new key-value pair and updates the value if the key is already present.
•  Get:  Returns the value for any given key.
•  Delete:  Removes a key-value pair.

One of the benefits of the key-value database is that data of any data type can be stored in the value field including a binary large object (BLOB) value. Additionally, the key-value database is like a dictionary, as a dictionary has a list of words and each word has one or more definitions with various length. And like a dictionary, the key-value database is uniquely indexed by the key field. Thus, rapid retrieval of values can occur regardless of the number of records / items within the database. Key-value databases work in a very different fashion from traditional relational database management systems (RBMS). Traditionally RDBMS define data structures in the database as a series of tables containing fields with well-defined data types. In contrast, key-value databases treat data values as a single opaque collection, which may have different fields for every record. This offers considerable flexibility and more closely follows modern concepts like object-oriented programming. Because optional values are not represented by placeholders or input parameters, key-value databases often use far less memory to store the same database, which can lead to significant performance gains.

Business Rationale, Core Themes, & Misconceptions of NoSQL

Business Rationale for NoSQL

According to the book, “Making Sense of NoSQL”, by Dan McCreary and Ann Kelly, increases of data volume, data velocity, and data type variability within modern business organizations has created a high demand on conventional relational database management systems (RDBMS) and requires a new paradigm for organizations to remain effective. Organizations have been realizing that they now need to rapidly capture and analyze immensely large amounts of changing data that is being received in many different formats. Data volume and data velocity refer to the ability to process large data sets as they rapidly arrive. Data type variability refers to diversity in data types that don’t easily fit into structured database tables.

•  Data Volume:  Volume refers to the incredible amounts of data generated each second from social media, email, message, text documents, smart phones, sensors, photographs, video, etc. The vast amounts of data have become so large in fact that the data can no longer be stored and analyzed using traditional database technology. Now that data is generated by machines, networks, and human interaction, the amount of data to be analyzed is massive. We now use distributed systems, where parts of the data are stored in different locations and brought together by software. Collecting and analyzing this data is clearly an engineering challenge of immensely vast proportions. More sources of data with a larger size of data combine to increase the amount of data that must be analyzed. This is a major issue for those organizations looking to put that data to use instead of letting it just disappear.

•  Data Velocity:  Velocity refers to the speed at which vast amounts of data are being generated, collected and analyzed. Additional, velocity deals with the pace at which data flows in from sources like business processes, machines, networks and human interaction. And the flow of data is both continuous and massive in amount. Real-time data can help researchers and businesses make valuable & timely decisions that provide both strategic and competitive advantages, as well as a high return on investment (ROI). Not only must the data be rapidly analyzed, but the speed of transmission, and access to the data must also remain instantaneous. In the past, companies analyzed data using long-running batch processes. That paradigm worked well when the incoming data rate was slower than the batch processing rate and when the result was useful despite the delay in analysis execution. With new sources of data such as social, web, and mobile applications, the batch process paradigm has broken down. Now data is now streaming into servers in a real-time, continuous fashion and the result is only useful if data is immediately analyzed with very little delay.

•  Data Type Variability:  Variability refers to the many sources and types of data both structured and unstructured. In the past, data was managed primarily within spreadsheets and relational databases. Now data comes in the form of emails, text, photo, audio, video, web, GPS data, sensor data, relational databases, documents, messages, pdf, flash, etc. Data structures have changed to lose its rigid structure and hundreds of data formats are now being implemented. Organizations no longer have control over the input data format. Structure can no longer be imposed like in the past in order to keep control over the analysis. Organizations that want to capture and report on exception data struggle when attempting to use rigid database schema structures imposed by traditional relational database management systems. More and more, data being created and being analyzed is of the unstructured variety. New and innovative technologies are now allowing both structured and unstructured data to be harvested, stored, and processed simultaneously.

 

Core Themes of NoSQL Databases

•  Multiple data formats:  NoSQL databases store and retrieve data from many formats: key-value stores, graph stores, wide column / column family stores, document stores, & search engines.
•  Free of table joins:  NoSQL databases allow for extraction of data using simple interfaces without the use of joins between tables.
•  Free of pre-defined schema:  NoSQL databases allow users to place data into a file folder and then query the data without defining a data schema.
•  Distributed processing:  NoSQL databases can use more than one or multiple computer processors in order to execute.
•  Horizontal scaling / scaling out:  NoSQL databases have direct increases of system performance with the addition of computer processors.
•  Design alternatives:  NoSQL databases offer multiple options to a traditional single method of storing, retrieving, and manipulating data.
•  High performance:  NoSQL database are optimized for specific data models and access patterns that enable higher performance than trying to accomplish similar functionality with relational databases.
•  Rapid implementations:  NoSQL databases generally provide flexible data schema that enable faster and more iterative development.

 

Common Misconceptions of NoSQL Databases

•  NoSQL is all not about the SQL query language:  NoSQL databases are not applications that utilize a language other than SQL. SQL as well as other query languages are used with NoSQL databases.
•  NoSQL is not all about open source projects:  Although many NoSQL database are built upon an open source model, commercial products use NoSQL concepts as well as open source initiatives.
•  NoSQL is not only used in big data projects:  Many NoSQL databases are driven by the inability of an application to efficiently scale when big data is utilized. While data volume and data velocity are important to NoSQL database implementations, NoSQL databases also focus on data type variability and the ability to rapidly implement solutions.
•  NoSQL is not only used in cloud environments:  NMany NoSQL databases do reside in cloud environments to take advantage of the cloud’s ability to rapidly scale. But NoSQL databases can run in both the cloud as well as on-premise data centers.
•  NoSQL is not all about a clever use of memory and SSD:  Many NoSQL databases do focus on the efficient use of computer memory and/or solid-state disks (SSD) to increase performance. While important, NoSQL databases can run on standard commodity hardware.
•  Design alternatives:  NoSQL databases offer multiple options to a traditional single method of storing, retrieving, and manipulating data.
•  NoSQL is not just a few products:  More and more NoSQL databases are constantly being developed. And existing NoSQL databases are constantly being enhanced to included additional functionality.
•  NoSQL databases are not just about solving one problem:  While many NoSQL databases have only been developed using one type of database model, many other NoSQL databases are multi-modal and can solve multiple types of problems.

 

What is NoSQL?

Definition of NoSQL

Recently, NoSQL databases have been developed that provide a high-performance and salable alternative to more traditional relational database management systems (RDBMS), especially when dealing with large amounts of unstructured or semi-structured data. NoSQL, which stands for “Not Only SQL” (https://en.wikipedia.org/wiki/NoSQL) is unlike RDBMS as it is designed for processing large collections of distributed data that don’t fit well into strict rows and columns. And NoSQL databases are ideal solutions for implementations of Big Data initiatives. Moreover, the substantial increase in amount, speed, and variation of Big Data in recent years has greatly increased the need for deployments of NoSQL databases. While traditional RDBMS are very useful for the processing of highly-structured data, NoSQL databases typically accommodate either semi-structured data, fully-unstructured data, documents, graphs, or dynamic schema.  And NoSQL databases are now widely recognized for their ease of development, functionality, and performance at scale.

The term NoSQL can be applied to some databases that were available before traditional RDBMS, but more often the term refers to databases developed in the mid to late 2000s for the purpose of large-scale database processing within web and mobile based applications. Within these emerging applications, requirements for performance and scalability outweighed the conventional requirement for the rigid data consistency that existing RDBMS provided to transactional applications.  Subsequently, NoSQL databases for web applications have tended to focus on very specific characteristics of data management. The ability to process very large volumes of data and quickly distribute that data across computing processors and clusters has been very desirable in large-scale web application design. There has also been a greater need for flexible data schema, or no schema at all, in order to better implement rapid changes to applications.

An advantage of NoSQL databases over traditional RBMS is that they store and manage data in ways that allow for high operational speed and great flexibility on the part of system developers. In addition, data can be stored in a schema-less or free-form fashion. Any data can be stored in any record. And unlike traditional RDBMS, many NoSQL databases can be scaled horizontally across hundreds or thousands of commodity servers. And NoSQL databases typically utilize lower amounts system memory than RDBMS. This allows for NoSQL databases to achieve much higher performance than traditional RDBMS.

 

NoSQL Databases Typically Contain the Following Types of Data:

•  Semi-structured Data:  CSV, Word, Excel, PowerPoint, Documents, PDFs, Logs, XML, JSON
•  Unstructured Data:  Emails, Text, Messages, Blog Entries, Twitter
•  Binary Data:  Graphics, Images, Audio, Video

 

NoSQL Database Types

•  Key-Value Stores: A simple data storage system that pairs a unique key with an associated value.  Typical uses include: dictionaries, image stores, document/file stores, query cache, lookup tables.
•  Document Stores:  Data stores that pair each key with a complex data structure known as a document.  Documents are typically semi-structured either in XML or JSON formats.  Typical uses include: MS Word documents, MS Excel documents, spreadsheets, presentations, PDF files, sales orders, invoices, product descriptions, web pages, forms.
•  Graph Stores:  Data stores that organize data as nodes, which are like records in a relational database, and edges, which represent connections between nodes.  Typical uses include: social networks, fraud detection, pattern matching, relationship-heavy data.
•  Wide Column / Column Family Stores:  Data stores that have the ability to hold very large numbers of dynamic columns. But unlike a relational database, the names and format of the columns can vary from row to row in the same table.  Typical uses include: web crawling, large sparsely populated tables, highly-adaptive systems, high-variance systems.
•  Native XML Databases:  Data stores that allow data to be stored in the extensible markup language (XML) format, a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. XML databases are a sub-category of document stores.
•  Search Engines:  Information retrieval systems designed to help find information stored on a computer system.
•  Multi-Modal Databases:  Data stores that contain aspects of multiple types of NoSQL database all within one product.

 

NoSQL Database Products (2018)

 

Overview of Data Lake Concept

Overview of Data Lake Concept

Data Lakes are storage repositories that hold a vast amount of raw data stored in the data’s native format.  Types of data included in a data lake include structured data, relational databases, semi-structured data, unstructured data, documents, binary files, & raw data. The concept of data lakes has been well received by organizations that need to capture and store raw data of many different types at scale and low cost to perform data processing.  The implementation characteristics of a data lake, namely inexpensive storage and schema flexibility, make it ideal for big data analysis and data science. The data lake has become a viable solution because it provides a cost-effective and technologically feasible way to store and analyze big data.

Data Lake Concept

Data Lakes include the capabilities to store data of any size, shape, and speed, and enable all types of processing and analytics across platforms and languages. Further, data lakes remove the complexities of ingesting and storing data while making it faster to get up and running with batch, streaming, and interactive analytics.  In the past, data lakes have typically been built using Hadoop, and enterprise Hadoop distributions such as Hortonworks and MapR which offer data lake architectures. Now organizations can also build data lakes by using infrastructure-as-a-service (IaaS) clouds including Amazon Web Services and Microsoft Azure. Amazon’s Elastic Compute Cloud (EC2) supports data lakes while Microsoft has a dedicated Azure Data Lake platform to store and analyze real-time data.

Data Lakes Can Include:
•  Structured data from relational databases (rows and columns)
•  Semi-structured data (CSV, Logs, XML, JSON)
•  Unstructured data (Emails, Word, Excel, Powerpoint, Documents, PDFs)
•  Binary data (graphics, images, audio, video)

Data Lakes are Characterized by Four Key Attributes:
1. Storage of All Data: Data lakes contain all types of data including structured, semi-structured, unstructured, and binary data formats.
2. Flexibility of Analysis:  Data lakes enable users across multiple business units to explore and analyze data on their own terms.
3. Multiple Access Techniques:  Data lakes enable multiple data access patterns including batch, interactive, online, search, in-memory, and other analysis engines.
4. Shared Infrastructure: Data lakes provide a single repository of data that is available to all data consumers within an organization.

What is Meant by Big Data?

The term Big Data describes a massive volume of structured, semi-structured, and unstructured data that can be collected within an organization that is so large that it is difficult to process using common database management tools or traditional data processing applications.   When dealing with extremely large datasets, organizations face difficulties in being able to create, manipulate, manage, transfer, and query the data.  In addition, big data is difficult to work with using most relational database management systems, business intelligence and analytics applications, and desktop statistics and visualization packages.  These types of applications and systems can typically handle large datasets but not the massively large datasets included in big data.  Instead big data could require massively parallel software running on tens, hundreds, or even thousands of concurrent servers.

Read more

Business Intelligence Capability – Data Visualization

Overview of Data Visualization Solutions

Data Visualization Solutions enable analysis by providing highly-graphical representations of information and data. By using visual elements including charts, graphs, and maps, data visualization is an accessible way to see and understand trends, outliers, and patterns in data. Further, data visualization refers to the techniques used to communicate data or information by encoding it as visual objects contained within graphics.

Data visualization tools go beyond the display of basic charts and graphs. Moreover, they display data in more sophisticated ways such as geographic maps, infographics, dials, gauges, sparklines, heat maps, networks, graphs, and charts. The images may include interactive capabilities, enabling users to manipulate them or drill into the data for querying and analysis.  Additionally, numerical data may be encoded within graphics using dots, lines, circles, bubbles, or bars, to visually communicate a quantitative message. Indicators designed to alert users when data has been updated or predefined conditions occur can also be included.  Because of the way the human brain processes information, using graphics to visualize large amounts of data is for more effective for rapid analysis then just looking at data within queries and reports.  Effective data visualization assists users to analyze and reason about data and to draw conclusions. These types of analytical tools make complex data sets more intuitive, understandable, and usable for decision-making.

 

Characteristics of Data Visualization Solutions

•  Highly graphical presentation of data.
•  Intuitive user interface with visual presentation of data.
•  Rapid comprehension of data.
•  Comparison of different pieces of data.
•  Ease of interpretation of data.
•  Quantitative evaluation of data.
•  Presents numerous numerical metrics in a relatively small space.
•  Numerical data encoded within graphics using dots, lines, circles, bubbles, or bars.
•  Reveals data at several levels of detail and perspectives.
•  Indication of data relationships, patterns, and outliers.
•  Makes large data sets coherent.
•  Enables analysis of trends.
•  Identification of data indicators.
•  Conversion of data into information.

Business Intelligence Capability – Data Exploration

Overview of Data Exploration Solutions

Data Exploration Solutions enable robust visual searches of vast amounts of data and are commonly used by data analysts.

Data Exploration Solutions include sophisticated data discovery capabilities that enable users to rapidly and intuitively search through large amounts of data in order to gain insights.  Subsequently data exploration solutions enable users to quickly retrieve answers about organizational data and enable users to quickly generate information about the data.  In addition data exploration solutions provide informative search capabilities in order to enable analysis of the available data and enable conversion of the data into information.  While data exploration solutions do enable queries of data, the data is exposed to users using business terminology and underlying database structures and data models are hidden to the user.

SAP BusinessObjects Explorer

Characteristics of Data Exploration Solutions

•  Rapid searches of large data sets
•  Sophisticated data discovery capabilities
•  Intuitive user interface with keyword searches
•  Filtering of data from multiple dimensions or perspectives
•  Drill-down, drill-up capabilities
•  Visualization of data in multiple formats
•  Integration with central databases and data warehouses
•  Guided discovery of data with contextual navigation
•  Abstraction of underlying database and data models
•  Conversion of data into information

November 2015 Meeting of Washington DC Business Objects User Group

WDCBOUG-User Group-Square-317x160

The November 2015 meeting of the Washington DC Business Objects User Group (WDCBOUG) is scheduled for November 16, 2015 at Pepco Holdings in Washington DC.

This meeting has no fee but you must register to attend. Register Now

Logistics:

Date: Monday, November 16, 2015
Time:  8:30 am – 2:00 pm
Location:
Pepco Holdings, Inc.
701 9th Street. NW
Washington, DC 20068

Read more