호빗의 인간세상 탐험기



Spark

딜레이라마 2017. 2. 23. 19:05

Apache Spark Overview

Apache Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala and consists of Spark core and several related projects:

• Spark SQL - Module for working with structured data. Allows you to seamlessly mix SQL queries with Spark programs (a short sketch follows this list).

• Spark Streaming - API that allows you to build scalable fault-tolerant streaming applications. 

• MLlib - API that implements common machine learning algorithms.

• GraphX - API for graphs and graph-parallel computation.
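The Spark SQL bullet above is the easiest to illustrate: a SQL query and the DataFrame API can be combined freely in one program. Below is a minimal sketch, assuming Spark 2.x; the JSON file path and its fields are hypothetical placeholders, not something from the original guide.

    // Minimal Spark SQL sketch (Spark 2.x assumed); /tmp/people.json is a
    // hypothetical input file with "name" and "age" fields.
    import org.apache.spark.sql.SparkSession

    object SparkSqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SparkSqlSketch")
          .getOrCreate()

        // Load structured data into a DataFrame and register it as a SQL view.
        val people = spark.read.json("/tmp/people.json")
        people.createOrReplaceTempView("people")

        // A SQL query returns an ordinary DataFrame, so SQL and the
        // programmatic API mix seamlessly in the same job.
        val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
        adults.groupBy("age").count().show()

        spark.stop()
      }
    }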

You can run Spark applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Running Spark applications interactively is commonly performed during the data-exploration phase and for ad-hoc analysis. To run applications distributed across a cluster, Spark requires a cluster manager. Cloudera supports two cluster managers: YARN and Spark Standalone. When run on YARN, Spark application processes are managed by the YARN ResourceManager and NodeManager roles. When run on Spark Standalone, Spark application processes are managed by Spark Master and Worker roles.
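As a rough sketch of those two ways of running, the lines below are written as they might be typed into spark-shell, where the shell pre-creates sc and spark and itself acts as the driver; the cluster-submission path appears only as illustrative comments, and the input path and host name are hypothetical.

    // Interactive use: these lines can be pasted into spark-shell.
    val lines  = sc.textFile("hdfs:///tmp/input.txt")     // hypothetical input path
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.take(10).foreach(println)

    // The same logic, packaged into a jar, could instead be submitted to a
    // cluster manager, e.g. (illustrative commands):
    //   spark-submit --master yarn --deploy-mode cluster app.jar       (YARN)
    //   spark-submit --master spark://master-host:7077 app.jar         (Standalone)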

Spark Application Overview

Spark Application Model

Apache Spark is widely considered to be the successor to MapReduce for general purpose data processing on Apache Hadoop clusters. Like MapReduce applications, each Spark application is a self-contained computation that runs user-supplied code to compute a result. As with MapReduce jobs, Spark applications can use the resources of multiple hosts. However, Spark has many advantages over MapReduce. In MapReduce, the highest-level unit of computation is a job. A job loads data, applies a map function, shuffles it, applies a reduce function, and writes data back out to persistent storage. In Spark, the highest-level unit of computation is an application. A Spark application can be used for a single batch job, an interactive session with multiple jobs, or a long-lived server continually satisfying requests. A Spark job can consist of more than just a single map and reduce. MapReduce starts a process for each task. In contrast, a Spark application can have processes running on its behalf even when it's not running a job. Furthermore, multiple tasks can run within the same executor. Both combine to enable extremely fast task startup time as well as in-memory data storage, resulting in orders of magnitude faster performance over MapReduce.
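A minimal sketch of that application model, assuming Spark 2.x: one application holds a single SparkContext, chains several transformations rather than a single map and reduce, and runs two jobs against the same cached data. The log-file path and its line format are invented for illustration.

    import org.apache.spark.sql.SparkSession

    object MultiJobSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("MultiJobSketch").getOrCreate()
        val sc = spark.sparkContext

        // A pipeline of several transformations, not just one map and one reduce.
        val errorsPerDay = sc.textFile("/tmp/app.log")        // hypothetical log file
          .filter(_.contains("ERROR"))
          .map(line => (line.split(" ")(0), 1))               // assume the first token is a date
          .reduceByKey(_ + _)
          .cache()                                            // keep results in executor memory

        // Job 1: the first action materializes and caches the dataset.
        println(s"days with errors: ${errorsPerDay.count()}")

        // Job 2: a second action in the same application reuses the cached
        // partitions, because the executors outlive any single job.
        errorsPerDay.sortBy(_._2, ascending = false).take(5).foreach(println)

        spark.stop()
      }
    }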

Spark Execution Model

Spark application execution involves runtime concepts such as driver, executor, task, job, and stage. Understanding these concepts is vital for writing fast and resource-efficient Spark programs. At runtime, a Spark application maps to a single driver process and a set of executor processes distributed across the hosts in a cluster. The driver process manages the job flow and schedules tasks and is available the entire time the application is running. Typically, this driver process is the same as the client process used to initiate the job, although when run on YARN, the driver can run in the cluster. In interactive mode, the shell itself is the driver process. The executors are responsible for executing work, in the form of tasks, as well as for storing any data that you cache. Executor lifetime depends on whether dynamic allocation is enabled. An executor has a number of slots for running tasks, and will run many concurrently throughout its lifetime.
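One concrete way to see these runtime concepts is through configuration. The sketch below sets standard executor-related properties when building the session; the property names are real Spark settings, but the values are arbitrary examples, and in practice they are often passed with spark-submit --conf instead.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ExecutorSizingSketch")
      .config("spark.executor.memory", "4g")              // memory per executor process
      .config("spark.executor.cores", "4")                // task slots per executor
      .config("spark.dynamicAllocation.enabled", "true")  // executor count may grow and shrink
      .getOrCreate()

    // With 4 cores per executor, up to 4 tasks of a stage can run concurrently
    // inside each executor; the driver (this process, unless deployed in
    // cluster mode on YARN) schedules those tasks and tracks the job flow.
    // On YARN, dynamic allocation typically also requires the external
    // shuffle service to be enabled on the NodeManagers.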

Invoking an action inside a Spark application triggers the launch of a job to fulfill it. Spark examines the dataset on which that action depends and formulates an execution plan. The execution plan assembles the dataset transformations into stages. A stage is a collection of tasks that run the same code, each on a different subset of the data.
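A small sketch of that behavior, assuming an existing SparkContext named sc: the transformations only build the plan, the wide reduceByKey introduces a stage boundary, and the final action is what launches the job.

    val pairs = sc.parallelize(1 to 1000)
      .map(n => (n % 10, n))       // narrow transformation: stays within a stage
      .reduceByKey(_ + _)          // wide transformation: shuffle => stage boundary

    // Nothing has executed yet; transformations only describe the plan.
    println(pairs.toDebugString)   // prints the lineage, indented per stage

    // The action launches the job: one set of tasks per stage, each task
    // working on a different partition of the data.
    pairs.collect().foreach(println)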

Developing Spark Applications

When you are ready to move beyond running core Spark applications in an interactive shell, you need best practices for building, packaging, and configuring applications and using the more advanced APIs. This section describes how to develop, package, and run Spark applications, aspects of using Spark APIs beyond core Spark, how to access data stored in Amazon S3, and best practices in building and configuring Spark applications. 
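As a packaging sketch rather than the guide's own instructions, a minimal sbt build might look like the following; the project name and version numbers are chosen only for illustration, and Spark is marked "provided" because the cluster supplies it at runtime.

    // build.sbt (illustrative versions)
    name := "my-spark-app"
    version := "0.1.0"
    scalaVersion := "2.11.12"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.4.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.4.0" % "provided"
    )

    // Package with 'sbt package' and run the resulting jar with spark-submit.
    // To read data stored in Amazon S3, the application can point at an
    // s3a:// URI, e.g. spark.read.text("s3a://my-bucket/path/") with a
    // hypothetical bucket name, provided the S3 credentials are configured.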


Reference: Cloudera Spark Guide


