Different Modes of Submitting a Spark Job on YARN
If you are using Spark on YARN, you have probably noticed that there is more than one way a job can be run on the YARN cluster. In this post we will explore these modes.
A Spark job can be submitted to YARN in two modes:
1. yarn-client
2. yarn-cluster
Yarn-Client : When we submit a Spark job to YARN in yarn-client mode, the Spark driver runs on the client machine. To elaborate, the Spark driver is essentially the controller of the job. When a job is submitted in client mode, the driver runs on the local machine, and while the job runs you can watch its logs on the client machine. This also means you cannot run anything else in that session on the client until the job completes. You can do this as follows:
spark-submit --master yarn --deploy-mode client --class com.mycomp.Example ./test-code.jar
Here --master yarn --deploy-mode client is the important part: it tells Spark to run the driver on the client. You can also use nohup and redirect all logs to a log file and still run the Spark job in client mode.
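For example, one way to do this (the log-file name spark-job.log is just an illustration, and the class and jar names are the placeholders from the command above) is:

nohup spark-submit --master yarn --deploy-mode client --class com.mycomp.Example ./test-code.jar > spark-job.log 2>&1 &

The > spark-job.log 2>&1 part sends both stdout and stderr to the file, and the trailing & returns the shell prompt while the driver keeps running on the client machine.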
Yarn-Cluster : Yarn-cluster mode is the other way of submitting Spark jobs to YARN. In this case, the driver of the job is started on one of the cluster nodes (inside the YARN ApplicationMaster). This mode is useful when you want to submit a Spark job and move on to other work without waiting for the job to complete. This can be done as follows:
spark-submit --master yarn --deploy-mode cluster --class com.mycomp.Example ./test-code.jar
Here --deploy-mode cluster tells Spark to run the driver on the cluster rather than on the client.
When your jobs handle huge amounts of data and the driver needs more memory, I prefer yarn-cluster mode. That way the resources on my local machine are not choked, and I can use one of the cluster nodes to give the driver more memory. Likewise, whenever you are processing a huge amount of data and your job takes a long time to complete, prefer cluster mode.
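As an illustration, suppose you want to give the driver 4g of memory (the value is just an example; tune it for your workload):

spark-submit --master yarn --deploy-mode cluster --driver-memory 4g --class com.mycomp.Example ./test-code.jar

Since the driver now runs inside a container on a cluster node, that 4g is allocated by YARN from the cluster node's resources, not from your client machine.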