Review Article

Big Data in Cloud Computing: A Resource Management Perspective

Table 1

Comparison of big data frameworks.

| Attribute | Hadoop | Spark | Storm | Samza | Flink |
|---|---|---|---|---|---|
| Current stable version | 2.8.1 | 2.2.0 | 1.1.1 | 0.13.0 | 1.3.2 |
| Batch processing | Yes | Yes | Yes | No | Yes |
| Computational model | MapReduce | Streaming (microbatches) | Streaming (microbatches) | Streaming | Supports continuous flow streaming, microbatch, and batch |
| Data flow | Chain of stages | Directed acyclic graph | Directed acyclic graphs (DAGs) with spouts and bolts | Streams (acyclic graph) | Controlled cyclic dependency graph through machine learning |
| Resource management | YARN | YARN/Mesos | HDFS (YARN)/Mesos | YARN/Mesos | Zookeeper/YARN/Mesos |
| Language support | All major languages | Java, Scala, Python, and R | Any programming language | JVM languages | Java, Scala, Python, and R |
| Job management/optimization | MapReduce approach | Catalyst extension | Storm-YARN/3rd-party tools like Ganglia | Internal JobRunner | Internal optimizer |
| Interactive mode | None (3rd-party tools like Impala can be integrated) | Interactive shell | None | Limited API of Kafka streams | Scala shell |
| Machine learning libraries | Apache Mahout/H2O | Spark ML and MLlib | Trident-ML/Apache SAMOA | Apache SAMOA | Flink-ML |
| Maximum reported nodes (scalability) | Yahoo Hadoop cluster with 42,000 nodes | 8000 | 300 | LinkedIn with around a hundred node clusters | Alibaba customized Flink cluster with thousands of nodes |
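
As an illustration of how the comparison in Table 1 might be used in practice, the following minimal sketch encodes two of its rows (batch processing and resource management) and filters the frameworks by those attributes. The names `FRAMEWORKS` and `shortlist` are hypothetical and not part of any framework's API; the Storm entry simplifies "HDFS (YARN)/Mesos" to YARN and Mesos, which is an assumption made for this example.

```python
# Two rows of Table 1, transcribed as a small lookup structure.
# Assumption: Storm's "HDFS (YARN)/Mesos" is recorded here as YARN and Mesos.
FRAMEWORKS = {
    "Hadoop": {"batch": True, "resource_managers": {"YARN"}},
    "Spark": {"batch": True, "resource_managers": {"YARN", "Mesos"}},
    "Storm": {"batch": True, "resource_managers": {"YARN", "Mesos"}},
    "Samza": {"batch": False, "resource_managers": {"YARN", "Mesos"}},
    "Flink": {"batch": True, "resource_managers": {"Zookeeper", "YARN", "Mesos"}},
}


def shortlist(batch=None, resource_manager=None):
    """Return framework names matching the requested attributes.

    A criterion left as None is ignored. This is a hypothetical helper
    for reading the table, not a recommendation procedure.
    """
    selected = []
    for name, attrs in FRAMEWORKS.items():
        if batch is not None and attrs["batch"] != batch:
            continue
        if resource_manager is not None and resource_manager not in attrs["resource_managers"]:
            continue
        selected.append(name)
    return selected


if __name__ == "__main__":
    # Frameworks listed with batch support that can run under Mesos.
    print(shortlist(batch=True, resource_manager="Mesos"))
    # Frameworks listed without batch support.
    print(shortlist(batch=False))
```

Further rows of the table (for example language support or machine learning libraries) could be added as extra keys in the same way; free-text cells such as the interactive mode entries would need a deliberate encoding decision before they can be filtered on.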