Spark Standalone Cluster Internals
Version: 1.0
Author(s): Sandeep Mahendru
Creation Date: Jan 15, 2018

Introduction

This article describes the internal workings of a Spark cluster operating in standalone mode. The primary motivation is to understand the internal architecture and topology of the Spark cluster execution platform.

I have experience building distributed systems with other clustering frameworks, such as Coherence and Hazelcast:

· Their architecture is based on a multicast group with master and slave nodes. The execution managers in these systems follow a task-execution model operating on an in-memory distributed key-value data structure [a ConcurrentHashMap].