Posts

Showing posts from 2016

Enterprise Kafka and Spark: Kerberos-Based Integration

In previous posts we have seen how to integrate Kafka with Spark Streaming. However, in a typical enterprise environment the connection between Kafka and Spark should be secure. If you are using the Cloudera or Hortonworks distribution of Hadoop/Spark, it will most likely be a Kerberos-enabled connection. In this post we will see which configurations we have to make to enable a Kerberized Kafka-Spark integration. Before getting into the details of the integration, I want to make it clear that I assume you are using a Cloudera/Hortonworks/MapR provided Kafka and Spark installation with Kerberos enabled; we will see what additional things you, as a developer, have to do. For this we first have to select the security protocol provided by Kafka. If you look at the Kafka documentation, the security.protocol property is used for this. We can select one of the following values: 1. PLAINTEXT 2. SASL_PLAINTEXT 3. SASL_SSL. We will select SASL_PLAINTEXT for Kerberos. This property has to be set while creating…
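For reference, here is a minimal, hedged sketch of what the client-side configuration typically looks like, assuming the newer Kafka consumer API (Kafka 0.9+ / spark-streaming-kafka-0-10); the broker address, group id and JAAS file name below are illustrative assumptions, not values from the post.

  import org.apache.kafka.common.serialization.StringDeserializer

  // Kerberos over plain TCP: security.protocol = SASL_PLAINTEXT
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "kafka-broker:9092",            // hypothetical broker
    "security.protocol" -> "SASL_PLAINTEXT",
    "sasl.kerberos.service.name" -> "kafka",               // service name of the Kafka principal
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "kafka-grp"
  )

  // The JAAS file (pointing at your keytab) is usually shipped to driver and executors at submit time:
  // spark-submit --files kafka_client_jaas.conf,user.keytab \
  //   --driver-java-options "-Djava.security.auth.login.config=kafka_client_jaas.conf" \
  //   --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf" ...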

Advanced Spark: Custom Receiver for Spark Streaming

Apache Spark has become a very widely used tool in the big data world. It gives us a one-stop shop for many data activities: you can do data-warehouse work, build machine learning applications, build reporting solutions, and create stream-processing applications. Spark Streaming provides APIs for connecting to different frameworks. As the big data world grows fast, there are always new tools that we may need to use in our projects and integrate with Spark or Spark Streaming. In this post, we will see how we can integrate any system with Spark Streaming using the custom receiver API. There can be many reasons to do this; I am listing a few for your understanding. 1. Spark Streaming does not provide integration with your data source. 2. The Spark Streaming integration provided uses a different version of the API than your source. 3. You want to handle the lower-level details of how Spark connects to the source. 4. A situation where you are using an older…
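To make the idea concrete, below is a minimal, hedged sketch of a custom receiver; the socket-based source and the class name are only illustrative, not taken from the post.

  import java.io.{BufferedReader, InputStreamReader}
  import java.net.Socket
  import java.nio.charset.StandardCharsets
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class SocketLineReceiver(host: String, port: Int)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

    def onStart(): Unit = {
      // Receive on a separate thread so onStart() returns immediately.
      new Thread("Socket Line Receiver") {
        override def run(): Unit = receive()
      }.start()
    }

    def onStop(): Unit = { /* the receive loop exits once isStopped turns true */ }

    private def receive(): Unit = {
      try {
        val socket = new Socket(host, port)
        val reader = new BufferedReader(
          new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
        var line = reader.readLine()
        while (!isStopped && line != null) {
          store(line)                     // hand each record to Spark Streaming
          line = reader.readLine()
        }
        reader.close()
        socket.close()
        restart("Trying to connect again")
      } catch {
        case e: Throwable => restart("Error receiving data", e)
      }
    }
  }

  // Usage: val lines = streamingContext.receiverStream(new SocketLineReceiver("some-host", 9999))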

Kafka Spark Integration Issue - Task Not Serializable

When you integrate Kafka with Spark Streaming, there are some very small things that need to be taken care of. These things are often ignored, and we end up wasting a lot of time resolving the issues they cause. In this post, I will discuss these small issues with you. 1. Task Not Serializable: If you have tried posting messages from Spark Streaming to Kafka, there is a very high probability that you have faced this issue. Let us understand the problem first. When does it occur? When you create an object of a class and try to use it in any transformation logic for a DStream or RDD. Example - example code is below. If the KafkaSink class is not declared as serializable and you use it as below, you will definitely face the Task not serializable issue.
  kafkaStream.foreachRDD { rdd =>
    rdd.foreachPartition { part =>
      part.foreach(msg => kafkaSink.value.send("kafka_out_topic", msg.toString))
    }
  }
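For completeness, one common way to make this work is a small serializable wrapper that creates the (non-serializable) Kafka producer lazily on each executor and is then broadcast once. The KafkaSink below is an assumed implementation matching the usage above, not necessarily the post's exact code.

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

  // Serializable wrapper: only the factory function is shipped to executors;
  // the producer itself is created lazily on first use on each executor.
  class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
    lazy val producer = createProducer()
    def send(topic: String, value: String): Unit =
      producer.send(new ProducerRecord[String, String](topic, value))
  }

  object KafkaSink {
    def apply(config: Properties): KafkaSink =
      new KafkaSink(() => new KafkaProducer[String, String](config))
  }

  // Hypothetical usage: broadcast the sink once, then call kafkaSink.value.send(...) inside foreachPartition.
  // val kafkaSink = sparkContext.broadcast(KafkaSink(producerProps))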

Spark Streaming and Kafka Integration

Spark Streaming and Kafka are two much talked about technologies these days. Kafka is a distributed queue: we can post messages to a Kafka queue and also read messages from it. Spark Streaming can keep listening to a source and process the incoming data. Let us understand how we can integrate these two technologies. As part of this post, we will learn the following: 1. Read messages from a Kafka queue. 2. Write messages to a Kafka queue. Reading messages from a Kafka queue: First of all we should create a stream which will keep listening to a Kafka topic. Following is the code for this.
  import org.apache.spark.streaming.kafka._
  @transient val kafkaStream = KafkaUtils.createStream(streamingContext, "zookpr-host:2181", "kafka-grp", Map("Kafka_input" -> 1))
There is another way to read messages from Kafka topics, i.e. Kafka Direct. Kafka Direct is out of scope for this post; we will discuss that in another one. Once kafkaStream…
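As a quick illustration of both directions the post lists, here is a hedged sketch built on the kafkaStream above; the broker address and output topic name are assumptions, not values from the post.

  // Reading: createStream yields (key, message) pairs, so pull out the message text.
  val messages = kafkaStream.map { case (_, msg) => msg }
  messages.print()

  // Writing: create a producer per partition and post each record to an output topic.
  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

  messages.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val props = new Properties()
      props.put("bootstrap.servers", "kafka-broker:9092")   // hypothetical broker
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      partition.foreach(msg => producer.send(new ProducerRecord[String, String]("kafka_output", msg)))
      producer.close()
    }
  }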

Different Modes of Submitting a Spark Job on YARN

If you are using Spark on YARN, you must have observed that there are different ways a job can be run on a YARN cluster. In this post we will explore this. A Spark job can be submitted to YARN in two modes: 1. yarn-client 2. yarn-cluster. Yarn-Client: When we submit a Spark job to YARN with the mode set to yarn-client, the Spark driver runs on the client machine. Let me elaborate on that: as we know, the Spark driver is a kind of controller of the job. When a Spark job is submitted in client mode, the driver runs on the local machine, and while the job runs we can see its logs on the client machine. This will not let you run anything else on the client until the job completes. You can do this as follows: spark-submit --master yarn-client --class com.mycomp.Example ./test-code.jar. Here --master yarn-client is the important parameter that needs to be added. You can also use nohup, divert all logs to a log file, and still run the Spark job in client mode. Yarn-Cluster: yarn-cluster mode is…
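For reference, the two modes are typically submitted like this (Spark 1.x syntax, matching the command above; the class and jar names are the post's placeholders):

  spark-submit --master yarn-client  --class com.mycomp.Example ./test-code.jar   # driver runs on the client machine
  spark-submit --master yarn-cluster --class com.mycomp.Example ./test-code.jar   # driver runs inside the YARN cluster

  # On newer Spark versions the same is written with an explicit deploy mode:
  spark-submit --master yarn --deploy-mode client  --class com.mycomp.Example ./test-code.jar
  spark-submit --master yarn --deploy-mode cluster --class com.mycomp.Example ./test-code.jar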

Custom UDF in Apache Spark

Apache Spark has become a very widely used framework for building big data applications. Spark SQL has made ad hoc analysis on structured data very easy, so it is very popular among users who deal with huge amounts of structured data. However, you will often realise that some functionality is not available in Spark SQL. For example, in Spark 1.3.1, when you are using sqlContext, basic functions like variance, standard deviation and percentile are not available (you can access these using HiveContext). Similarly, you may want to do some string operation on an input column whose functionality is not available in sqlContext. In such situations, we can use a Spark UDF to add new constructs to sqlContext. I have found them very handy, especially while processing string-type columns. In this post we will discuss how to create UDFs in Spark and how to use them. Use case: Let us keep the example very simple. We will create a UDF which will take a string as input and convert…
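As a concrete illustration, here is a hedged sketch against the Spark 1.x sqlContext API the post mentions; the upper-case conversion, the "people" table and the df DataFrame are assumed examples, not necessarily the post's use case.

  // A simple string function to expose as a UDF.
  val toUpper = (s: String) => if (s == null) null else s.toUpperCase

  // Register it for use in SQL statements...
  sqlContext.udf.register("to_upper", toUpper)
  sqlContext.sql("SELECT to_upper(name) FROM people").show()   // assumes a registered "people" table

  // ...or wrap it for use with the DataFrame API.
  import org.apache.spark.sql.functions.udf
  val toUpperUdf = udf(toUpper)
  df.withColumn("name_upper", toUpperUdf(df("name"))).show()   // df is an assumed DataFrame with a "name" column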

Parquet File Format

Hadoop supports many file formats. These range from plain text files to Hadoop-specific formats like Sequence Files. There are also more sophisticated file formats like Avro and Parquet. Every file format in Hadoop brings its own strengths. In this blog post we will discuss what the Parquet file format is and how it is useful for us. The Parquet file format was created by Twitter and Cloudera to make an efficient file format for HDFS. Parquet belongs to the class of columnar file formats. Columnar file formats are most useful when you plan to access only a few columns of the data, and they are very useful for columnar databases. They have the following advantages. 1. Columnar file formats are more compression friendly, because the probability of having common values in a column is higher than at the row level. 2. While reading, only those columns which are required are read, so you end up saving a lot of time on I/O and decompression of data.
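As a small illustration of that column-pruning benefit, here is a hedged sketch using the Spark DataFrame API (1.4+); the path and column names are assumptions.

  // Writing a DataFrame out as Parquet.
  df.write.parquet("hdfs:///data/events_parquet")

  // Reading it back and selecting only two columns: with a columnar format,
  // only the bytes for these columns need to be read and decompressed.
  val events = sqlContext.read.parquet("hdfs:///data/events_parquet")
  events.select("event_type", "event_time").show()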

Spark Performance Tuning

Getting started with Spark to develop big data applications is very easy, and Spark provides many options for doing even a simple thing. Performance tuning in big data is a very important aspect: if you can make even minor improvements, at huge scale you can save a lot of time and resources. In this blog post, we will discuss the small things we can do to get the maximum out of our Spark cluster. Before jumping to the specifics, let us try to understand how Spark works. Actors: Spark has two major components, 1. Driver 2. Executor. The driver is a kind of master process which controls everything. Spark runs multiple executors. Executors are like slaves; they do the actual execution of tasks. We should also understand that a job is just a set of tasks: the driver hands out the work and the executors handle the tasks. One executor can run multiple tasks at a time, depending on how many resources we have. Once the job is submitted to Spark, Spark will create an execution plan for execu…
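To make the resource discussion concrete, here is a hedged sketch of the driver/executor knobs that control how many tasks can run in parallel; the numbers are placeholders, not recommendations from the post.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("tuning-example")
    .set("spark.executor.instances", "4")   // how many executors YARN should start
    .set("spark.executor.cores", "2")       // concurrent tasks per executor
    .set("spark.executor.memory", "4g")     // heap available to each executor
    .set("spark.driver.memory", "2g")       // heap for the driver process

  // Equivalent spark-submit flags:
  //   --num-executors 4 --executor-cores 2 --executor-memory 4g --driver-memory 2g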