Posts

Showing posts from June, 2016

Different Modes of Submitting Spark Job on Yarn

If you are using Spark on YARN, you may have noticed that there are different ways a job can be run on the YARN cluster. In this post we will explore them. A Spark job can be submitted to YARN in two modes: 1. yarn-client 2. yarn-cluster

Yarn-Client : When we submit a Spark job to YARN with the mode set to yarn-client, the Spark driver runs on the client machine. Let me elaborate on that: the Spark driver acts as the controller of the job. When a Spark job is submitted in client mode, the driver runs on the local machine, and while the job runs we can see its logs on the client machine. This will not allow you to run anything else on the client until the job completes. You can submit in this mode as follows:

spark-submit --master yarn-client --class com.mycomp.Example ./test-code.jar

Here --master yarn-client is the important parameter that needs to be added. You can also use nohup and divert all logs to a log file and still run the Spark job in client mode.

Yarn-Cluster : Yarn-cluster mode is
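As a sketch of the two modes side by side, assuming the standard spark-submit CLI (the class and jar names come from the example above and are placeholders):

```shell
# Client mode (Spark 1.x shorthand): the driver runs on the submitting
# machine, so driver logs stream to this terminal until the job completes.
spark-submit --master yarn-client --class com.mycomp.Example ./test-code.jar

# Cluster mode: the driver runs inside an ApplicationMaster container on
# the cluster, so the client terminal is free once the job is accepted.
spark-submit --master yarn-cluster --class com.mycomp.Example ./test-code.jar

# Detaching a client-mode run with nohup, diverting all logs to a file:
nohup spark-submit --master yarn-client --class com.mycomp.Example \
  ./test-code.jar > spark-job.log 2>&1 &
```

Note that these commands need a configured YARN cluster (HADOOP_CONF_DIR pointing at your cluster configuration) to actually run.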

Custom UDF in Apache Spark

Apache Spark has become a very widely used framework for building Big Data applications. Spark SQL has made ad-hoc analysis on structured data very easy, so it is very popular among users who deal with huge amounts of structured data. However, many times you will realise that some functionality is not available in Spark SQL. For example, in Spark 1.3.1, when you are using sqlContext, basic functions like variance, standard deviation, and percentile are not available (you can access these using HiveContext). Similarly, you may want to perform some string operation on an input column, but the functionality may not be available in sqlContext. In such situations, we can use a Spark UDF to add new constructs to sqlContext. I have found them very handy, especially while processing string-type columns. In this post we will discuss how to create UDFs in Spark and how to use them. Use Case : Let us keep this example very simple. We will create a UDF which will take a string as input and it will conve
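A minimal sketch of that use case, assuming the intended transformation is upper-casing the input string (the excerpt is cut off). The pure function is plain Python; the Spark 1.3.x registration calls are shown as comments because they require a running SparkContext:

```python
# Hypothetical example (not from the original post): a UDF that
# upper-cases a string column.

def to_upper(s):
    """Return the upper-cased string, passing None through unchanged."""
    return s.upper() if s is not None else None

# With a live sqlContext (pyspark.sql.SQLContext), registration and use
# would look like this:
#   sqlContext.udf.register("toUpper", to_upper)
#   sqlContext.sql("SELECT toUpper(name) FROM people").collect()
```

Handling None explicitly matters here: SQL columns are nullable, and calling .upper() on a null value would otherwise raise an error inside the UDF.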