The Evolution of Data and Modeling

Garry Arko, Senior Enterprise & Solution Architect

November 12, 2018

Over 50 years ago, there was a need to codify data so that computers could understand and use it. The structuring and modeling of data became a necessity to gain value from computers, and the art and science of data modeling took shape. It didn’t take long for businesses to understand that structure was not the only aspect of data that was highly important in the computer world. The quality of the data, its source, how it was maintained, how it was replicated, etc. were also critical to obtain valued results from the computational processes relying on data. To address these other needs that went beyond the basic structure of data, the practice of Data Architecture developed and evolved. In the beginning, data modeling was all that mattered. Today, the various aspects of Data Architecture are highly relevant and key, and the need for structuring data continues to remain relevant. Looking to the future, data modeling itself will become less significant and may even eventually disappear as indicated later in this article.

The Birth of Information Architecture

As technologies evolve and the use of data expands, the way in which data is used to render information is becoming more relevant. It should be noted that in many places Data Architecture is taking on a new focus and is being termed “Information Architecture”. This is appropriate as it is not about the data per se, but about the way in which data can be used to bring value to a business. In the past, data was collected for a specific application to produce specific and well understood results. Things like Big Data and Crowdsourced Data did not exist. Today, Big Data makes it possible to produce a wide range of results using cube analytics and other data analytic advancements – some of the results may not have been anticipated when decisions were made to collect and manage the data in the first instance. And Crowdsourced Data still needs to be managed for effective use, but this data is not likely to be managed in the same way business-generated financial data would be managed. Just as Data Architecture brought a new perspective on data beyond Data Modeling, Information Architecture will expand the nature and scope of treating data as a key and critical business asset. Although we reference Data Architecture as the applied skill set that allows business value to be derived from data, Data Architecture may be replaced by Information Architecture in the near future. When this day comes, it is important to understand that some data will still need to be structured and all data will need to be managed. As this takes place, Information Architecture will be an extension and expansion of the previous skills, not a wholesale replacement of them.

The Proliferation of Unstructured Data

The use of data by interactive systems has also evolved significantly over the years. In the early years, small sets of data were used by individual applications. Later, more data needed to be organized into databases to relate varied components of the data and to feed multiple applications. Now, vast amounts of data can be ingested by AI (artificial intelligence) systems, and the relevance of and relationships between the various datasets can, in many cases, be derived by the AI models themselves. While traditionally all of this data has been “structured data”, today search engines, such as Google, and intelligent assistants, such as Siri, have begun to take unstructured data and spoken language as data inputs to generate valued results. These more advanced systems may not always produce relevant and meaningful results, or accurate and usable recommendations, but they are in their infancy and can be expected to evolve and improve in the coming years.

Unstructured data in the business world is not just about documents on Google. Take asset management for example. For certain types of assets and components of assets, there may be operating manuals, manufacturer specifications, test records, inspection reports, etc. If the information about the assets themselves is stored and managed in a structured database, what is to be done with all of the relevant unstructured information? Some asset management systems allow URL links to be made between the assets and the unstructured documents. This may not be the best solution, as links eventually break and, with potentially tens of thousands of documents and tens of thousands of assets and asset components, keeping the links valid can turn into a nightmare. Other asset systems allow the unstructured documents to be imported into the asset database and stored and managed there. This approach presents its own challenges. Depending on how the documents are used by the business and which applications are accessible to all business users, the unstructured documents may have to be duplicated, and keeping the copies consistent takes effort. Another, and perhaps better, solution might be to store and manage the unstructured documents in a document repository and add metadata to each document that associates it with specific assets or with certain groups or classes of assets. Simple queries and displays can then be developed that match the metadata on the unstructured documents against attributes in the asset database. Unstructured data can pose many challenges, but there are also numerous ways to meet those challenges with solutions that best fit a given business situation.
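The metadata approach described above can be illustrated with a small sketch. It is only one possible arrangement, not a description of any particular asset management product: the class names, field names, and sample records below are all hypothetical, and a real implementation would query a document repository and an asset database rather than in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """A row from the structured asset database (hypothetical fields)."""
    asset_id: str
    asset_class: str


@dataclass(frozen=True)
class Document:
    """An unstructured document plus the metadata attached to it."""
    title: str
    asset_ids: frozenset = frozenset()      # specific assets it applies to
    asset_classes: frozenset = frozenset()  # classes of assets it applies to


def documents_for_asset(asset, documents):
    """Return documents whose metadata names the asset or its class."""
    return [
        doc for doc in documents
        if asset.asset_id in doc.asset_ids
        or asset.asset_class in doc.asset_classes
    ]


# Illustrative sample data only.
pump = Asset("P-101", "pump")
docs = [
    Document("Pump operating manual", asset_classes=frozenset({"pump"})),
    Document("P-101 inspection report", asset_ids=frozenset({"P-101"})),
    Document("Compressor test record", asset_classes=frozenset({"compressor"})),
]
titles = [doc.title for doc in documents_for_asset(pump, docs)]
print(titles)
```

Because the association lives in the document's metadata rather than in a stored link, a document can be moved or replaced in the repository without touching the asset records.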

Failure to Recognize Data Architecture

So what about Data Architecture today and its data modeling component? Is it useful, and does it provide value-added results? Think of something as simple as managing assets using automated tools. In the vast majority of organizations today, an asset has value to the business and requires some management for effective and efficient use. Recognizing that tens of thousands of organizations have automated asset management systems, consider those that own almost all of their assets but occasionally rent or lease assets. Organizations that have overlooked Data Architecture may not have included all relevant details in their asset management systems. Rented and leased equipment got included in the same asset data set as owned assets, but rental and lease details were not appropriately captured. Eventually, discrepancies in system outputs began to highlight the fact that something was wrong, but money and effort were required to refit data and systems to get things back on track. In hindsight, it is probably safe to say that those organizations would have preferred to invest in structuring and defining two classes of assets up-front rather than go through great effort to clean up the situation afterwards. It can be argued that such obvious errors just wouldn’t be made by well-managed organizations, but experience has demonstrated that “obvious” is not always obvious and that what is truly obvious becomes much more apparent in hindsight than in foresight.
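As a rough illustration of what defining two classes of assets up-front might look like, the sketch below models owned and leased assets separately so that lease details have a defined place to live. The attribute names and sample values are assumptions made for the example; a real model would be driven by the organization's own definitions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class OwnedAsset:
    """An asset the organization owns (hypothetical attributes)."""
    asset_id: str
    description: str
    purchase_date: date
    purchase_cost: float


@dataclass
class LeasedAsset:
    """A rented or leased asset, with the lease details captured."""
    asset_id: str
    description: str
    lessor: str
    lease_start: date
    lease_end: date
    monthly_payment: float


def leases_ending_by(assets, cutoff):
    """Return leased assets whose lease ends on or before the cutoff date."""
    return [
        a for a in assets
        if isinstance(a, LeasedAsset) and a.lease_end <= cutoff
    ]


# Illustrative sample data only.
fleet = [
    OwnedAsset("TRK-001", "Service truck", date(2015, 3, 1), 62000.0),
    LeasedAsset("GEN-014", "Portable generator", "Acme Rentals",
                date(2018, 1, 1), date(2018, 12, 31), 450.0),
    LeasedAsset("EXC-002", "Excavator", "Acme Rentals",
                date(2018, 6, 1), date(2020, 5, 31), 3200.0),
]
ending = [a.asset_id for a in leases_ending_by(fleet, date(2019, 1, 31))]
print(ending)
```

With a single undifferentiated asset record, a question such as "which leases end soon?" cannot be answered at all, which is the kind of discrepancy described above.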

Take assets again as an example and ask ten people in ten different organizations (or ten people in the same organization) to define what an asset is and list the relevant pieces of data necessary to properly manage the assets. Experience has shown that the probability of getting ten consistent answers is close to zero. For certain types of data today, having clear and sound definitions of the elements to be used by applications and systems (spreadsheets included) is key, and if the data is to be stored and managed in a database, a well-designed data structure is essential. In the world of computing, the phrase “Garbage In = Garbage Out” has become a workplace cliché. Data Architecture and Data Modeling are IT disciplines for ensuring that structured data is kept from the garbage pile. Many organizations today rely on their financial systems for projections used in making management decisions. These systems are almost exclusively based on well-managed, structured data. Today, it would be inconceivable for an executive to ask an “intelligent assistant”, relying on largely unstructured and unmanaged data, what the next financial move for the business should be.

Data Quality is Key

Quality data has been described as having the following key characteristics: 1) Correct and accurate, 2) Complete, and 3) Timely. It is easy to comprehend the intent of these three simple qualities. But when looking at the entire application and system portfolio, similar data can come from a number of sources. Which source is correct or contains the most correct and accurate data? Data may be exchanged between applications or extracted for use in spreadsheets. What ensures the extracted data is complete (i.e. the definitions and rules applied to the source data remain consistent in the extracted data)? We may believe that the data is timely because it was captured and made available today. However, how can we be sure that the data was not originally sourced and stored at a different time and place? It is the intent of Data Architecture to provide a discipline to analyze, model, and operationalize data effectively and efficiently, and to ensure that data quality is maintained throughout the information lifecycle (known as Information Lifecycle Management, which is a topic for another time). Since most organizations today rely on data for their success, they must make managing their data well a top priority. This includes both structured and unstructured data.
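To make the three characteristics concrete, here is a minimal sketch of how a single record might be checked against them. The field names, the list of valid statuses, and the one-day freshness window are all assumptions chosen for illustration. Note that the timeliness check is only as good as the timestamp it is given: it assumes `captured_at` records when the data was originally sourced, not when it was last copied.

```python
from datetime import datetime, timedelta

# Hypothetical rules for an asset status record.
REQUIRED_FIELDS = ("asset_id", "status", "captured_at")
VALID_STATUSES = {"in_service", "out_of_service", "retired"}


def quality_issues(record, now, max_age=timedelta(days=1)):
    """Return a list of data quality problems found in one record."""
    issues = []

    # Complete: every required field is present and non-empty.
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        issues.append("incomplete: missing " + ", ".join(missing))

    # Correct: the status must be one of the agreed values.
    status = record.get("status")
    if status not in (None, "") and status not in VALID_STATUSES:
        issues.append(f"incorrect: unknown status {status!r}")

    # Timely: the original capture time must fall within the window.
    captured = record.get("captured_at")
    if isinstance(captured, datetime) and now - captured > max_age:
        issues.append("stale: captured_at is older than max_age")

    return issues


# Illustrative sample data only.
now = datetime(2018, 11, 12, 12, 0)
good = {"asset_id": "P-101", "status": "in_service",
        "captured_at": datetime(2018, 11, 12, 9, 0)}
bad = {"asset_id": "", "status": "unknown",
       "captured_at": datetime(2018, 11, 1, 9, 0)}
print(quality_issues(good, now))
print(quality_issues(bad, now))
```

Checks like these are most useful when the same rules are applied at every point where data is exchanged or extracted, so that a spreadsheet copy is held to the same definitions as its source.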

Reference Data Models

According to OASIS (Organization for the Advancement of Structured Information Standards), a reference data model is “an abstract framework for understanding significant relationships among the entities of some environment, and for the development of consistent standards or specifications supporting that environment”. We sometimes call them industry reference data models because they may pertain to a specific industry. You can find reference data models for all kinds of industries, including aerospace & defense, airline, automotive, banking, broadcasting, government, consumer products & electronics, energy & petroleum, education, health, insurance, life sciences, retail, telecommunications, utilities, and more. Reference data models help organizations in the same industry understand standard data and data relationships, including the data inputs/outputs supporting standard end-to-end business processes across functional areas of an organization. With regard to data models representing specialized COTS business applications such as pipelines, wells, railways, etc., these have been developed and published as public domain information because they represent commercialized business processes (from a data perspective) that would be common to any organization engaged in that specific type of business. Such models should be leveraged wherever they exist and are applicable. It is sound business practice not to spend money on something that has already been done and is reasonably accessible for use.

ERP Data Models

Vendors for ERP (Enterprise Resource Planning) systems have taken functions like Finance, Human Resources, Materials Management, and many other common business components and encapsulated them into systems where it is no longer necessary to understand the ERP data models, their table definitions, and the data relationships that exist behind the scenes of an ERP.

If we consider data common to most businesses such as Employees, Business Partners, General Ledger, Inventory, Assets, Accounts Receivable, Accounts Payable, etc., it is hard to rationalize the value of all the organizations modeling the same data over and over again – hence the value of ERPs and specific subject data models. It is likely that less than half of the data relevant to a business needs to be defined and modeled from scratch, as would have been the case a mere 50 years ago. More than ever before, applications need and rely on data to function effectively, but more and more of that data is becoming self-describing and machine-interpretable. As this trend accelerates, the need for data models will gradually disappear, except in special cases to describe and structure small and select sets of data that are very different and unique. Furthermore, the need to apply and manage a Data Architecture will decrease as the components of quality data become inherent in the data sources themselves.

Where to go from here?

While it may sound like we are beginning to understand everything there is to understand about data, keep in mind the axiom, “The more you know, the more you know there is to know.” We are only beginning to understand data as a valuable corporate resource. Looking at things like Crowdsourced Data and Big Data, which didn’t even exist as a concept a few years ago, it would be premature to say we have met all of the Data challenges. Today, there are three philosophical truths that an organization should keep in mind:

1. There is a very long road ahead before all data becomes self-describing, so don’t abandon data modeling for the foreseeable future. Understand its value and apply it where it is needed and where it will add value.
2. While the three characteristics of quality data are simple (i.e. Correct/Accurate, Complete, and Timely), making data conform to those characteristics is not automatic so don’t abandon Data Architecture for the foreseeable future. Understand where it is needed and apply it appropriately to ensure garbage does not pollute the data being used to run the business.
3. If Information Architecture is not part of an organization today, it will be coming in the near future. Organizations want to leverage their data assets to the greatest extent possible in answering business questions and supporting a variety of business processes. Information Architecture goes that next step beyond Data Architecture to help enable the best uses of well-managed data across the organization.

Mr. Garry Arko is a Senior Enterprise & Solution Architect with IT Architects in Calgary, Alberta, and has worked in various industries, including Oil & Gas, Transportation, Government, and Construction. IT Architects (www.itarchitects.ca) is an information consulting firm specializing in business process optimization, system evolution planning, and the deployment of leading-edge technologies. If you require further information, Garry can be reached at info@itarchitects.ca or 403-465-7661.