Scientific Programming

Review Article

Big Data in Cloud Computing: A Resource Management Perspective

Comparison of big data frameworks.


Attribute	Hadoop	Spark	Storm	Samza	Flink

Current stable version	2.8.1	2.2.0	1.1.1	0.13.0	1.3.2

Batch processing	Yes	Yes	Yes	No	Yes

Computational model	MapReduce	Streaming (microbatches)	Streaming (microbatches)	Streaming	Supports continuous flow streaming, microbatch, and batch

Data flow	Chain of stages	Directed acyclic graph	Directed acyclic graphs (DAGs) with spouts and bolts	Streams (acyclic graph)	Controlled cyclic dependency graph through machine learning

Resource management	YARN	YARN/Mesos	HDFS (YARN)/Mesos	YARN/Mesos	Zookeeper/YARN/Mesos

Language support	All major languages	Java, Scala, Python, and R	Any programming language	JVM languages	Java, Scala, Python, and R

Job management/optimization	MapReduce approach	Catalyst extension	Storm-YARN/3rd-party tools like Ganglia	Internal JobRunner	Internal optimizer

Interactive mode	None (3rd-party tools like Impala can be integrated)	Interactive shell	None	Limited API of Kafka streams	Scala shell

Machine learning libraries	Apache Mahout/H2O	Spark ML and MLlib	Trident-ML/Apache SAMOA	Apache SAMOA	Flink-ML

Maximum reported nodes (scalability)	Yahoo Hadoop cluster with 42,000 nodes	8,000	300	LinkedIn clusters of around a hundred nodes	Alibaba customized Flink cluster with thousands of nodes
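
The "Computational model" row names three styles of processing: MapReduce-style batch, microbatch streaming, and continuous flow streaming. The sketch below illustrates the difference on a word count in plain Python. It is not code from any of the frameworks in the table; all function names (`map_reduce_word_count`, `micro_batches`, `micro_batch_word_count`, `continuous_word_count`) are hypothetical, and a finite list stands in for an unbounded stream.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def map_reduce_word_count(lines: Iterable[str]) -> Counter:
    """Batch model: map each record to (word, 1) pairs, then reduce by key."""
    mapped = ((word, 1) for line in lines for word in line.split())
    counts: Counter = Counter()
    for word, one in mapped:
        counts[word] += one  # reduce step: sum the values for each key
    return counts


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Microbatch model: cut a stream into small bounded batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def micro_batch_word_count(stream: Iterable[str], batch_size: int) -> Iterator[Counter]:
    """Run the batch job once per microbatch and emit a snapshot after each."""
    running: Counter = Counter()
    for batch in micro_batches(stream, batch_size):
        running.update(map_reduce_word_count(batch))
        yield Counter(running)  # copy, so earlier snapshots stay unchanged


def continuous_word_count(stream: Iterable[str]) -> Iterator[Counter]:
    """Continuous model: update state and emit a result for every record."""
    running: Counter = Counter()
    for line in stream:
        running.update(line.split())
        yield Counter(running)


if __name__ == "__main__":
    records = ["a b a", "b c", "a"]
    print(map_reduce_word_count(records))
    print(list(micro_batch_word_count(records, batch_size=2)))
    print(list(continuous_word_count(records)))
```

All three variants arrive at the same final counts. They differ in how often a result becomes available: once for the whole input, once per microbatch, or once per record. This is the latency trade-off that the row summarizes.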