100 open source Big Data architecture papers.

Big Data technology has been extremely disruptive with open source playing a dominant role in shaping its evolution. While on one hand it has been disruptive, on the other it has led to a complex ecosystem where new frameworks, libraries and tools are being released pretty much every day, creating confusion as technologists struggle and grapple with the deluge.

If you are a Big Data enthusiast or a technologist ramping up (or scratching your head), it is important to spend some serious time deeply understanding the architecture of key systems to appreciate its evolution. Understanding the architectural components and subtleties would also help you choose and apply the appropriate technology for your use case. In my journey over the last few years, some literature has helped me become a better educated data professional. My goal here is to not only share the literature but consequently also use the opportunity to put some sanity into the labyrinth of open source systems.

One caution, most of the reference literature included is hugely skewed towards providing a deep architecture overview (in most cases it refers to the original research papers).

Jumping right in…

Key architecture layers

  • File Systems- Distributed file systems which provide storage, fault tolerance, scalability, reliability, and availability.
  • Data Stores– Evolution of application databases into Polyglot storage with application specific databases instead of one size fits all. Common ones are Key-Value, Document, Column and Graph.
  • Resource Managers– provide resource management capabilities and support schedulers for high utilization and throughput.
  • Coordination– systems that manage state, distributed coordination, consensus and lock management.
  • Computational Frameworks– a lot of work is happening at this layer with highly specialized compute frameworks for Streaming, Interactive, Real Time, Batch and Iterative Graph (BSP) processing. Powering these are complete computation runtimes like BDAS (Spark) & Flink.
  • DataAnalytics –Analytical (consumption) tools and libraries, which support exploratory, descriptive, predictive, statistical analysis and machine learning.
  • Data Integration– these include not only the orchestration tools for managing pipelines but also metadata management.
  • Operational Frameworks — these provide scalable frameworks for monitoring & benchmarking.

Architecture Evolution

The modern data architecture has evolved with a goal of reduced latency between data producers and consumers. This has lead to real time, low latency processing, bridging the traditional batch and interactive layers into hybrid architectures like Lambda and Kappa.

  • Lambda- Established architecture for a typical data pipeline.
  • Kappa– An alternative architecture which moves the processing upstream to the Stream layer.
  • SummingBird– a reference model on bridging the online and traditional processing models.

Before you deep dive into the actual layers, here are some general documents which can provide you a great background on NoSQL, Data Warehouse Scale Computing and Distributed Systems.

File Systems

As the focus shifted to low latency processing, there has been an evolution from traditional disk based storage file systems to an emergence of in memory file systems — which drastically reduced the I/O & disk serialization cost. Alluxio(Tachyon) and Spark RDD are examples of that evolution.

  • Google File System- The seminal work on Distributed File Systems which shaped the Hadoop File System.
  • Hadoop File System– Historical context/architecture on evolution of HDFS.
  • Ceph File System– An alternative to HDFS which converges block, file and object storage.
  • Alluxio, Inc. (formerly Tachyon Nexus) Tachyon– An in memory storage system to handle the modern day low latency data processing.

File Systems have also seen an evolution on the file formats and compression techniques. The following references gives you a great background on the merits of row and column formats and the shift towards newer nested column oriented formats which are highly efficient for Big Data processing. Erasure codes are using some innovative techniques to reduce the triplication (3 replicas) schemes without compromising data recoverability and availability.

  • Column Oriented vs Row-Stores– good overview of data layout, compression and materialization.
  • RCFile– Hybrid PAX structure which takes the best of both the column and row oriented stores.
  • Parquet– column oriented format first covered in Google’s Dremel’s paper.
  • ORCFile– an improved column oriented format used by Hive.
  • Compression– compression techniques and their comparison on the Hadoop ecosystem.
  • Erasure Codes– background on erasure codes and techniques.

Data Stores

Broadly, the distributed data stores are classified on ACID & BASE stores depending on the continuum of strong to weak consistency respectively. BASE further is classified into KeyValue, Document, Column and Graph — depending on the underlying schema & supported data structure. While there are multitude of systems and offerings in this space, I have covered few of the more prominent ones. I apologize if I have missed a significant one…

BASE

Key Value Stores

  • Dynamo — key-value distributed storage system
  • Cassandra — Inspired by Dynamo; a multi-dimensional key-value/column oriented data store.
  • Voldemort — another one inspired by Dynamo, developed at LinkedIn.

Column Oriented Stores

  • BigTable — seminal paper from Google on distributed column oriented data stores.
  • HBase — while there is no definitive paper , this provides a good overview of the technology.
  • Hypertable — provides a good overview of the architecture.

Document Oriented Stores

  • CouchDB — a popular document oriented data store.
  • MongoDB — a good introduction to MongoDB architecture.

Graph

  • Neo4j — most popular Graph database.
  • Titan — open source Graph database under the Apache license.

ACID

I see a lot of evolution happening in the open source community that is catching up with what Google has done — 3 out of the prominent papers below are from Google , they have solved the globally distributed consistent data store problem.

  • Megastore — a highly available distributed consistent database. Uses Bigtable as its storage subsystem.
  • Spanner — Globally distributed synchronously replicated linearizable database which supports SQL access.
  • MESA– provides consistency, high availability, reliability, fault tolerance and scalability for large data and query volumes.
  • CockroachDB — An open source version of Spanner (led by former engineers) in active development.
  • Snowflake — elastic data warehouse.

Resource Managers

While the first generation of Hadoop ecosystem started with monolithic schedulers like YARN, it evolved towards hierarchical schedulers (Mesos), that can manage distinct workloads, across different kind of compute workloads, to achieve higher utilization and efficiency.

  • YARN — Hadoop compute framework.
  • Mesos — scheduling between multiple diverse cluster computing frameworks.

These are loosely coupled with schedulers whose primary function is schedule jobs based on scheduling policies/configuration.

Schedulers

Coordination

These are systems that are used for coordination and state management across distributed data systems.

  • Paxos — a simple version of the classical paper; used for distributed systems consensus and coordination. NoPaxos– simplifies by Replacing Consensus with Network Ordering.
  • Chubby — Google’s distributed locking service that implements Paxos.
  • Zookeeper — open source version inspired from Chubby though is general coordination service than simply a locking service
  • Raft Consensus Algorithm

Computational Frameworks

The execution runtimes provide an environment for running distinct kinds of compute. The most common runtimes are

Spark — its popularity and adoption is challenging the traditional Hadoop ecosystem.

Flink — very similar to Spark ecosystem; strength over Spark is in iterative processing.

The frameworks broadly can be classified based on the model and latency of processing

Batch

MapReduce — The seminal paper from Google on MapReduce.
MapReduce Survey — A dated, yet a good paper; survey of Map Reduce frameworks.

Iterative (BSP)

  • Pregel — Google’s paper on large scale graph processing
  • Giraph — large-scale distributed Graph processing system modelled around Pregel
  • GraphX — graph computation framework that unifies graph-parallel and data parallel computation.
  • Hama — general BSP computing engine on top of Hadoop
  • Open source graph processing survey of open source systems modelled around Pregel BSP.
  • GraphTauis a new programming model for graph computation on time-evolving graphs which is built on top of Apache Spark.

Streaming

Streaming Data Architecture Overview- O’Reilly report on the state of stream processing.

  • Twitter Heron– exposes the same API interface as storm, however improves upon it to have higher scalability, better debugability, better performance, and easier to manage. Builds capabilities like backpressure, replaces Nimbus with Aurora scheduler.
  • Samza — stream processing framework from LinkedIn.
  • Spark Streaming — introduced the micro batch architecture bridging the traditional batch and interactive processing.
  • Kafka Streaming- stream processing over Kafka.

Interactive

  • Dremel — Google’s paper on how it processes interactive big data workloads, which laid the groundwork for multiple open source SQL systems on Hadoop.
  • Impala — MPI style processing on make Hadoop performant for interactive workloads.
  • Drill — A open source implementation of Dremel.
  • Shark — provides a good introduction to the data analysis capabilities on the Spark ecosystem.
  • Shark — another great paper which goes deeper into SQL access.
  • Dryad — Configuring & executing parallel data pipelines using DAG.
  • Tez — open source implementation of Dryad using YARN.
  • BlinkDB — enabling interactive queries over data samples and presenting results annotated with meaningful error bars
  • Apache Kudu- Fast processing of OLAP workloads that operates on columnar storage which provides high-throughput on both sequential (such as HDFS) and random (such as HBase or Cassandra) workloads simultaneously.

RealTime

  • Druid — a real time OLAP data store. Operationalized time series analytics databases
  • Pinot — LinkedIn OLAP data store very similar to Druid.

Data Analysis

The analysis tools range from declarative languages like SQL to procedural languages like Pig. Libraries on the other hand are supporting out of the box implementations of the most common data mining and machine learning libraries.

Tools

  • Apache Zeppelin– for interactive data analytics and visualization
  • Pig — Provides a good overview of Pig Latin, while Pig provides an introduction of how to build data pipelines using Pig.
  • Hive — provides an introduction of Hive, while Hive another good paper tries to share the motivations behind Hive at Facebook.
  • Phoenix — SQL on Hbase.
  • Join Algorithms for Map Reduce — provides a great introduction to different join algorithms on Hadoop.

Machine Learning

  • MLlib — Machine language library on Spark.
  • SparkR — Distributed R on Spark framework.
  • Tensor Flow — most popular framework for distributed ML.
  • Apache SystemML– open sourced by IBM, it allows a developer to write a single machine learning algorithm and automatically scale it up using Spark or Hadoop.
  • MxNet — A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems.

Data Integration

Data integration frameworks provide good mechanisms to ingest and outgest data between Big Data systems. It ranges from orchestration pipelines to metadata framework with support for lifecycle management and governance.

Ingest/Messaging

Sqoop– a tool to move data between Hadoop and Relational data stores.

Kafka — distributed messaging system for data processing

ETL/Workflow

  • Apache Nifi- data distribution and processing system; provides a way to move data from one place to another, making routing decisions and transformations as necessary along the way.
  • Apache Beam — An open source version of Google’s Cloud DataFlowFlumeJava & MillWheel- which unifies the model for batch and streaming data processing (uber-API for big data).
  • Apache Airflow- author, schedule and monitor workflows.
  • Crunch — library for writing, testing, and running MapReduce pipelines.
  • Falcon — data management framework that helps automate movement and processing of Big Data.
  • Oozie — a workflow scheduler system to manage Hadoop jobs.

Metadata

  • Apache Atlas- Data governance platform, designed to exchange metadata, track lineage with other tools and processes within and outside of the Hadoop stack — enables enterprises to effectively and efficiently meet their compliance requirements and allows integration with the whole enterprise data ecosystem.
  • Ground — a system to manage all the information that informs the use of data.

Security

  • Hadoop Security Design– seminal paper which captures key aspects of Hadoop design.
  • Apache Metron- is a cyber security application framework that provides organizations the ability to ingest, process and store diverse security data feeds to detect anomalies.
  • Apache Knox- is the Web/REST API Gateway solution for Hadoop. It provides a single access point to access all of Hadoop resources over REST. It acts as a virtual firewall enforcing authentication and usage policies on inbound requests and blocking everything else.
  • Apache Ranger — is a policy administration tool for Hadoop clusters. It includes a broad set of management functions, including auditing, key management, and fine grained data access policies across HDFS, Hive, YARN, Solr, Kafka and other modules.
  • Apache Sentry- fine-grained authorization to data stored in Apache Hadoop. Enforces a common set of policies across multiple data access path in Hadoop

Serialization

ProtocolBuffers — language neutral serialization format popularized by Google. Avro — modeled around Protocol Buffers for the Hadoop ecosystem.

Operational Frameworks

Finally the operational frameworks provide capabilities for metrics, benchmarking and performance optimization to manage workloads.

Monitoring Frameworks

  • OpenTSDB — a time series metrics systems built on top of HBase.
  • Ambari — system for collecting, aggregating and serving Hadoop and system metrics

Benchmarking

  • Background on big data benchmarking with the key challenges associated.
  • NDBench- open-source project from Netflix which is used benchmark data systems like Cassandra, Redis, and Elasticsearch for throughput and latency.
  • YCSB — performance evaluation of NoSQL systems.
  • GridMix– provides benchmark for Hadoop workloads by running a mix of synthetic jobs

Summary

I hope that the papers are useful as you embark or strengthen your journey. I am sure there are few hundred more papers that I might have inadvertently missed and a whole bunch of systems that I might be unfamiliar with — apologies in advance as don’t mean to offend anyone though happy to be educated….

Vice President, Data Platforms Walmart

Vice President, Data Platforms Walmart