Different Modes of Submitting Spark Job on Yarn

If you are using Spark on Yarn, you must have observed that there are different ways a job can be run on a Yarn cluster. In this post we will explore them.
A Spark job can be submitted to Yarn in two modes:

1. yarn-client
2. yarn-cluster

Yarn-Client : When we submit a spark job to Yarn and the mode is set to yarn-client, the spark driver runs on the client machine. Let me elaborate on that: as we know, the spark driver is kind of the controller of the job. When a spark job is submitted in client mode, the driver runs on the local machine and while the job runs we can see its logs right there on the client. This keeps your terminal session tied up until the job completes. You can do this as follows:

spark-submit --master yarn --deploy-mode client --class com.mycomp.Example ./test-code.jar

Here --master yarn --deploy-mode client is the important part (older Spark versions also accept --master yarn-client). You can also use nohup and divert all logs to a log file and still run the spark job in client mode.
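For example, a minimal sketch of the nohup approach (the jar, class name and log file name are just the placeholders from the command above):

nohup spark-submit --master yarn --deploy-mode client --class com.mycomp.Example ./test-code.jar > spark-job.log 2>&1 &

This sends the driver output to spark-job.log and frees up your terminal, while the driver still runs on the client machine.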

Yarn-Cluster : Yarn-cluster mode is another way of submitting spark jobs to Yarn. In this case, the driver of the job is started on one of the nodes of the cluster. This mode is used when you want to submit a spark job and start doing some other activity without waiting for the job to complete. This can be done as follows:

spark-submit --master yarn --deploy-mode cluster --class com.mycomp.Example ./test-code.jar

Here --deploy-mode cluster tells spark-submit to run the job in cluster mode on Yarn.
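Since the driver runs on a cluster node, its logs do not show up on your terminal. One way to check on the job later is with the Yarn CLI (the application id below is just a placeholder; the first command or the spark-submit output will show you the real one):

yarn application -list
yarn logs -applicationId application_1234567890123_0001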

When your job is handling huge data and the driver needs more memory, I prefer to use yarn-cluster mode. That way resources on my local machine are not choked and I can use one of the nodes on the cluster to give the driver more memory. Moreover, whenever you are processing a huge amount of data and your job takes a long time to complete, prefer cluster mode.
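For example, a sketch of giving the driver more memory in cluster mode (the 4g value is just an illustration; tune it for your data and cluster):

spark-submit --master yarn --deploy-mode cluster --driver-memory 4g --class com.mycomp.Example ./test-code.jar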
