Review Article
Big Data in Cloud Computing: A Resource Management Perspective
Table 1
Comparison of big data frameworks.
| Attribute | Hadoop | Spark | Storm | Samza | Flink |
| --- | --- | --- | --- | --- | --- |
| Current stable version | 2.8.1 | 2.2.0 | 1.1.1 | 0.13.0 | 1.3.2 |
| Batch processing | Yes | Yes | Yes | No | Yes |
| Computational model | MapReduce | Streaming (microbatches) | Streaming (microbatches) | Streaming | Supports continuous flow streaming, microbatch, and batch |
| Data flow | Chain of stages | Directed acyclic graph | Directed acyclic graphs (DAGs) with spouts and bolts | Streams (acyclic graph) | Controlled cyclic dependency graph through machine learning |
| Resource management | YARN | YARN/Mesos | HDFS (YARN)/Mesos | YARN/Mesos | Zookeeper/YARN/Mesos |
| Language support | All major languages | Java, Scala, Python, and R | Any programming language | JVM languages | Java, Scala, Python, and R |
| Job management/optimization | MapReduce approach | Catalyst extension | Storm-YARN/3rd-party tools like Ganglia | Internal JobRunner | Internal optimizer |
| Interactive mode | None (3rd-party tools like Impala can be integrated) | Interactive shell | None | Limited API of Kafka streams | Scala shell |
| Machine learning libraries | Apache Mahout/H2O | Spark ML and MLlib | Trident-ML/Apache SAMOA | Apache SAMOA | Flink-ML |
| Maximum reported nodes (scalability) | Yahoo Hadoop cluster with 42,000 nodes | 8,000 | 300 | LinkedIn clusters of around a hundred nodes | Alibaba customized Flink cluster with thousands of nodes |
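The "Computational model" row distinguishes microbatch streaming from continuous flow streaming. The following is a minimal sketch of that distinction in plain Python, not code from any of the frameworks compared: the function names `record_at_a_time` and `microbatch`, the `batch_size` parameter, and the sample word stream are all illustrative assumptions. Both functions maintain a running word count; they differ only in how often a result is emitted.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def record_at_a_time(events: Iterable[str]) -> Iterator[Counter]:
    """Continuous-flow style: update state and emit a result after every record."""
    counts: Counter = Counter()
    for word in events:
        counts[word] += 1
        yield Counter(counts)  # snapshot of the state after this record


def microbatch(events: Iterable[str], batch_size: int) -> Iterator[Counter]:
    """Microbatch style: buffer up to batch_size records, then emit once per batch."""
    it = iter(events)
    counts: Counter = Counter()
    while True:
        batch: List[str] = list(islice(it, batch_size))
        if not batch:
            break
        counts.update(batch)
        yield Counter(counts)  # snapshot of the state after this batch


# Hypothetical input stream used only for illustration.
events = ["a", "b", "a", "c", "a"]
per_record = list(record_at_a_time(events))
per_batch = list(microbatch(events, batch_size=2))
```

With five input records and a batch size of two, the record-at-a-time version emits five intermediate results and the microbatch version emits three, while both end in the same final counts. In this simplified picture, the trade-off is between result latency (lower when emitting per record) and per-result overhead (lower when emitting per batch).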