David Schlinker, Senior Data Warehouse & Analytics Architect
October 28, 2018
The Big Data and Advanced Analytics market is growing rapidly with a forecasted 11.4% annual growth rate from 2018-2027, according to Wikibon. This growth rate with the current adoption rate is of approximately 50% is a good indication of an upward trend. So, the question is, “What are the other half doing?” I’d say they’re probably still figuring it out, and I’m sure they’ll jump on the bandwagon soon enough. As Big Data grabs the headlines, a middle ground between Traditional Data Warehousing and Big Data is apparent. This “sweet spot” offers advanced analytics capabilities and support for very large data sets without the need to invest in a full-scale Big Data architecture. Leaders in this area include SAP Hana, Tableau, Qlik, Birst, and Microsoft. In this article, we’ll use Microsoft SSAS Tabular as a baseline to explain the in-memory approach and contrast it with Big Data. As you will discover, the lines are not always clear and require a little help from the experts.
Defining “Big Data”
The term “Big Data” has various meanings in the industry. One common framework to help us understand Big Data is the three “Vs” of Big Data – Volume, Variety and Velocity. Volume refers to the large amount of data that is generated by many organizations. Variety refers to the many types of data including structured, semi-structured, and unstructured. Velocity refers to the speed at which data is generated and consumed. It is often referenced as batch, near-real-time and real-time. There are numerous additions to the three “Vs” going as high as ten “Vs”! One important addition is Veracity. Veracity refers to the quality or accuracy of the data. It is somewhat ironic to add this “V” because Big Data inherently does nothing to address or improve veracity. In fact, the traditional data warehouse, although weaker at handling the original three “Vs”, is better at handling veracity because of its slower and more methodical processing of the data. When engaging in Big Data, veracity needs to be understood as a limitation and potential risk in analytics rather than a benefit.
Data Warehouse & Big Data Architectures
SSAS Tabular is a modern in-memory, column-based data store that allows organizations to store and analyze large datasets. Remember, data stores are real-time (or near real-time) while data warehouses are typically batch loaded when we’re all asleep. SSAS Tabular utilizes constructs from both the traditional data warehouse and modern Big Data. However, it does not rely on an underlying Hadoop/HDFS infrastructure. Like other in-memory database products, SSAS Tabular uses a proprietary database structure. This is often seen as a disadvantage over open-source products such as Hadoop because proprietary software tends to be more restrictive. However, very few organizations that have implemented Big Data use only the open-source software. Most often these organizations purchase proprietary software that is layered on top of Hadoop. Vendors that provide these ‘layers’ include Cloudera, IBM, MapR, Microsoft, and Oracle. These additional ‘layers’ add significant value to a Big Data environment but negate the low-cost benefit of open-source. They require a significant investment of time and money. No one said that Big Data is cheap.
The primary limitation of SSAS Tabular and other in-memory products is that performance does not scale linearly like Hadoop HDFS. Also, from a pure storage cost per byte, in-memory is more expensive than HDFS disk-based storage. Putting costs aside, in-memory databases have sophisticated compression algorithms that allow the database to store multiple terabytes of data. The SSAS Tabular compression algorithm is part of the xVelocity engine that is at the core of the product. Storage of 5TB or more is not uncommon. SAP Hana can reach 25TB or more. HDFS, on the other hand, can support petabytes of data using its divide-and-conquer approach on standard disk-based storage. As you can see, there are trade-offs to consider between in-memory and disk-based storage.
In reality, talking about in-memory and Big Data is not an “either-or” discussion. A Data Lake architecture can include both HDFS and in-memory data storage (and of course traditional relational database technologies). Everyone these days wants a Data Lake, which we will save for another article. For now, think of a Data Lake as a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics, and machine learning. Thus, Data Lakes can include structured (databases), semi-structured (CSV, logs, XML, JSON), unstructured (emails, documents, PDFs), and binary (images, audio, video) data. This discussion is further complicated by the fact that Big Data components such as Spark and HBase use in-memory technology to manage processing within the Hadoop environment. At the end of the day, Big Data is about the technologies and data management software that can deal with vast amounts of data.
Organizational data and processing needs are of course anything but homogenous. If query processing speed is critical, then an in-memory architecture is likely the best choice. If large data volumes (100TB +) and batch processing are the primary concern, then an HDFS approach is the best and probably only choice.
Traditional data warehouse architectures use a combination of relational databases and multi-dimensional databases. In the Microsoft stack, for example, this would include SQL Server and SSAS multi-dimensional data structures. The traditional data warehouse architecture works well for structured data. In addition to structured data, modern data warehouse architectures also support semi-structured and unstructured data (variety in the three “Vs”). Support for these structures is required by Data Scientists and other sophisticated users of data who perform advanced analytics. Along with large volumes, this support is one of the core values of Big Data. However, in-memory products such as SSAS Tabular are also fully capable of handling these structures and, with limitations, the volumes. For example, SSAS Tabular in Azure supports the blob data format for unstructured data. And for those of you who don’t know what a blob is, it’s the acronym for Binary Large OBject, which is a collection of binary data stored as a single entity in a database management system.
The traditional data warehouse often incorporates the star schema (multi-dimensional) data model. Many traditional data warehouse cube tools require strict adherence to this model. Big Data, on the other hand, has no prerequisites for how the data is structured. This is one of the reasons why Big Data is associated with advanced analytics where data scientists cannot be encumbered by specific data models. The in-memory products such as SSAS Tabular and others provide a “sweet spot” between these two worlds. They work well under a star schema model but are in no way restricted to it. They are as capable as the Big Data products at handling many data structures.
Mr. David Schlinker is Senior Data Warehouse & Analytics Architect affiliated with IT Architects in Calgary, Alberta, and has worked in various industries, including Energy, Mining, Utilities, Telecommunications, Transportation, Manufacturing, and Retail. IT Architects (www.itarchitects.ca) is an information consulting firm specializing in business process optimization, system evolution planning, and the deployment of leading-edge technologies. If you require further information, David can be reached at firstname.lastname@example.org or 403-874-9927.