The Evolution of Data & Analytics Technologies

Google’s paper on the Google File System, published in October 2003, and Jeffrey Dean and Sanjay Ghemawat’s MapReduce paper, published in 2004, kicked off the era of big data technologies. Shortly thereafter, in January 2006, Doug Cutting and Mike Cafarella, then working with Yahoo! on a search engine called Apache Nutch (based on Apache Lucene indexing), created the Hadoop subproject for running large-scale computations on large clusters of commodity hardware.

Since these early efforts, the big data technology landscape has been enriched by numerous innovations and has evolved by leaps and bounds. In part one of this blog series, we’ll look at how the data and analytics space has evolved across its core aspects.

Data Platforms Evolution

OSS based → Packaged Distributions → Cloud-Native Stack → Cloud Agnostic stack → Composable Data Analytics

Soon after Hadoop was released as an Apache open-source project, several frameworks were built on top of it to perform different types of processing. Apache Pig, Apache Hive, Apache HBase, Apache Giraph, and Apache Mahout were a few of the diverse frameworks that offered different ways to process data stored in a Hadoop cluster.

In addition, there were parallel stacks that replaced one or more frameworks with Kafka, Cassandra, or Cascading. The initial deployments required teams to build and deploy software on commodity hardware based on open-source Hadoop ecosystem components. 

Next came commercial distributions from vendors such as Cloudera, Hortonworks, MapR, Hadapt, DataStax, and Pivotal Greenplum, which packaged the required software in a user-friendly fashion and provided premium support. Then, Amazon EMR arrived as the first cloud-based Hadoop distribution.

Now there are cloud-specific Hadoop- and Spark-based distributions, such as Azure Synapse Analytics and GCP Dataproc, that come pre-installed with the required software and computing power.

From there, cloud-agnostic stacks such as Snowflake and Databricks evolved to work efficiently across different clouds. These platforms keep adding innovative features that target key performance and cost metrics. As a result, they are becoming quite popular, with many enterprises now moving towards such cloud-agnostic stacks.

Enterprises are increasingly looking at a multi-cloud strategy to avoid lock-in to a particular cloud and to use the best technology for each purpose. The trend for the future is a move towards composable data analytics, where companies build data platforms using components from two to three different technologies and cloud providers.

Data Architecture Evolution

Data Warehouse → Data Lakes / LakeHouse → Data Mesh / Data Fabric

For decades, data warehouses such as Teradata and Oracle have been used as central repositories for storing integrated data from one or more disparate sources. These data warehouses store current and historical data in one place, from which analytical reports can be created for workers throughout the enterprise.

With the advent of big data frameworks like Hadoop, the concept of a data lake became incredibly popular. Typically, a data lake is a single data store that holds raw copies of source system data, sensor data, and social data, as well as transformed data used for reporting, visualization, advanced analytics, and machine learning tasks.

Data mesh is an organizational and architectural paradigm for managing big data that began to gain popularity in 2019. It delegates responsibility for specific data sets to the members of the business who have the subject matter expertise to use the data properly. In a data mesh architecture, data domains become prominent, with a base data platform that individual domain teams can use to run their own data pipelines.

In this situation, data resides within the foundational data platform storage layer (data lake or data warehouse) in its original form. Individual teams will choose how to process this data and then serve the datasets they own in a domain-specific manner.

Data fabric is a design concept defined by Gartner that serves as an integrated layer (fabric) of data and connecting processes. A data fabric utilizes continuous analytics over existing, discoverable, and inferenced metadata assets to support the design, deployment, and utilization of integrated and reusable data across all environments, including hybrid and multi-cloud platforms.

Data mesh and data fabric are concepts that provide a unified view of the data distributed across the underlying data lakes and data warehouses. They, however, differ in how users access them. While data mesh is about people and processes, a data fabric is an architectural approach to tackle the complexity of the underlying data. Experts believe these concepts can be used simultaneously and will work on top of the existing data lakes and warehouses.

Data Processing Evolution

Batch Processing → Stream / Real-time Processing → Lambda → Kappa / Delta

Initially, big data solutions were typically long-running batch jobs that filtered, aggregated, and otherwise prepared data for analysis. Usually, these jobs read source files from scalable storage like the Hadoop Distributed File System (HDFS), processed them, and wrote the output to new files in scalable storage. The key requirement for a batch processing engine is the ability to scale out computations in order to handle a large volume of data.
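As a rough illustration of the batch pattern described above, the following Python sketch reads every input file in a directory, filters and aggregates the records, and writes the result to a new file. The `user,bytes` record format, the function name `run_batch_job`, and the `min_bytes` parameter are assumptions made for this example, and a local directory stands in for scalable storage such as HDFS.

```python
import csv
from collections import defaultdict
from pathlib import Path


def run_batch_job(input_dir: Path, output_file: Path, min_bytes: int = 0) -> dict:
    """Filter and aggregate all CSV files in input_dir, then write one output file.

    Assumed (hypothetical) record format: user,bytes
    """
    totals = defaultdict(int)
    # Read every source file; sorted() keeps the run deterministic.
    for path in sorted(input_dir.glob("*.csv")):
        with path.open(newline="") as handle:
            for user, raw_bytes in csv.reader(handle):
                size = int(raw_bytes)
                if size >= min_bytes:      # filter step
                    totals[user] += size   # aggregate step
    # Write the prepared data to a new file for downstream analysis.
    with output_file.open("w", newline="") as handle:
        writer = csv.writer(handle)
        for user in sorted(totals):
            writer.writerow([user, totals[user]])
    return dict(totals)
```

A real batch engine would split the input files across many workers and merge the partial totals; this single-process version only shows the read, process, and write shape of the job.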

Stream, or real-time, processing deals with data streams that are captured in real time and processed with minimal latency to generate real-time (or near-real-time) reports or automated responses. Frameworks such as Apache Kafka, Apache Storm, Apache Spark Streaming, and Amazon Kinesis help enable this capability.
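One common building block of stream processing is the windowed aggregation. The sketch below counts events per key in fixed (tumbling) time windows and emits each window as soon as it closes. The event shape `(timestamp, key)` and the function name `tumbling_window_counts` are assumptions for illustration; this is plain Python, not the API of any of the frameworks named above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[float, str]  # (event time in seconds, key); a hypothetical shape


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Yield (window_start, counts) each time a window closes.

    Assumes events arrive in timestamp order; real engines also handle
    late and out-of-order data, which this sketch ignores.
    """
    current_start = None
    counts: Dict[str, int] = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        if current_start is not None and window_start != current_start:
            yield current_start, dict(counts)  # emit the closed window
            counts.clear()
        current_start = window_start
        counts[key] += 1
    if current_start is not None:
        yield current_start, dict(counts)  # flush the final window
```

Because the function is a generator, results become available while the input is still being consumed, which is the property that separates this style from a batch job. Windows that contain no events are simply skipped.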

The next evolution was the Lambda architecture, a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data while simultaneously using real-time stream processing to provide views of online data.
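The interplay of the two paths can be sketched in a few lines of Python. In this toy model, a batch layer periodically recomputes a view from the full dataset, a speed layer tracks only the events the batch view has not yet covered, and a query merges the two. The class name `LambdaViews`, its methods, and the `(sequence number, key)` event shape are all hypothetical and chosen only for this illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[int, str]  # (sequence number, key); a hypothetical shape


class LambdaViews:
    """Toy serving layer combining a batch view with a real-time view."""

    def __init__(self) -> None:
        self.batch_view: Counter = Counter()
        self.batch_upto = -1           # highest sequence number in the batch view
        self.recent: List[Event] = []  # events not yet covered by a batch run

    def on_event(self, event: Event) -> None:
        """Speed layer: record the event as soon as it arrives."""
        self.recent.append(event)

    def run_batch(self, master_dataset: Iterable[Event]) -> None:
        """Batch layer: recompute the view from the full master dataset."""
        events = list(master_dataset)
        self.batch_view = Counter(key for _, key in events)
        self.batch_upto = max((seq for seq, _ in events), default=-1)
        # Drop speed-layer events that the batch view now covers.
        self.recent = [e for e in self.recent if e[0] > self.batch_upto]

    def query(self, key: str) -> int:
        """Serving layer: merge the batch view with the real-time view."""
        realtime = sum(1 for _, k in self.recent if k == key)
        return self.batch_view[key] + realtime
```

The sketch also hints at the main cost of the approach: the counting logic exists twice, once in `run_batch` and once in `query`, and the two copies must be kept in agreement.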

Jay Kreps then proposed the Kappa architecture as an alternative to the Lambda architecture. It has the same primary goals as the Lambda architecture but with an important distinction: all data flows through a single path, using a stream processing system.
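A minimal sketch of the single-path idea, assuming a replayable, append-only log: one stream job maintains the view, and changing the processing logic means starting a second job that replays the same log from the beginning. The `EventLog` and `StreamJob` classes are hypothetical stand-ins (the log plays the role that a system such as Kafka would play), not a real framework API.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, int]  # (key, value); a hypothetical shape
Reducer = Callable[[int, int], int]


class EventLog:
    """Append-only log standing in for a replayable stream."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def read_from(self, offset: int) -> List[Event]:
        return self._events[offset:]


class StreamJob:
    """A single stream-processing path that maintains a keyed view."""

    def __init__(self, log: EventLog, reducer: Reducer) -> None:
        self.log = log
        self.reducer = reducer
        self.offset = 0  # position of the next unread event
        self.view: Dict[str, int] = {}

    def poll(self) -> None:
        """Process any events appended since the last poll."""
        for key, value in self.log.read_from(self.offset):
            if key in self.view:
                self.view[key] = self.reducer(self.view[key], value)
            else:
                self.view[key] = value
            self.offset += 1
```

For example, a job created as `StreamJob(log, lambda x, y: x + y)` keeps running sums; to switch to running maximums, a new `StreamJob(log, max)` starts at offset 0, replays the log, and takes over once it has caught up. There is no separate batch code path to maintain.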

Another alternative is the Delta architecture, which introduces a table storage layer so that streaming and batch workloads read and write the same tables through a single code base. Databricks proposed this architecture, with Delta Lake at its center. Delta Lake is an open-source table storage layer that provides atomicity, consistency, isolation, and durability (ACID) transactions over cloud object stores.
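To give a feel for how a table layer can offer atomic commits on top of immutable files, the toy below keeps an ordered commit log in which each entry adds or removes data files, and readers rebuild the table state by replaying the log. This is a simplified illustration of the general transaction-log idea, not the Delta Lake API; `ToyTableLog` and its methods are hypothetical, and the conflict rule is stricter than what a production system would apply.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Set


class ToyTableLog:
    """Ordered commit log over immutable data files (a toy, not the Delta Lake API)."""

    def __init__(self) -> None:
        self._commits: List[Dict[str, FrozenSet[str]]] = []

    @property
    def version(self) -> int:
        return len(self._commits) - 1  # -1 means no commits yet

    def commit(
        self,
        read_version: int,
        add: Iterable[str] = (),
        remove: Iterable[str] = (),
    ) -> int:
        """Append one commit, or fail if the table changed since read_version."""
        if read_version != self.version:
            raise RuntimeError("conflict: table changed since read_version")
        self._commits.append({"add": frozenset(add), "remove": frozenset(remove)})
        return self.version

    def snapshot(self, version: Optional[int] = None) -> Set[str]:
        """Replay the log up to `version` to get the set of live data files."""
        upto = self.version if version is None else version
        files: Set[str] = set()
        for entry in self._commits[: upto + 1]:
            files -= entry["remove"]
            files |= entry["add"]
        return files
```

In this model a batch writer and a streaming writer both go through `commit`, so readers always see a whole commit or none of it, and passing an older `version` to `snapshot` reads the table as it was at that point.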

The Lambda architecture and its variants, the Kappa and Delta architecture, will continue to be valuable architectures in the near future.

This concludes the first part of the blog series. We’ll continue to explore the evolution of the data and analytics space in subsequent posts over the coming months.

Resources:

Questioning the Lambda Architecture, Jay Kreps, 2014.

Author

Arun Viswanathan

Principal Architect
