Advanced Spark: Custom Receiver for Spark Streaming

Apache Spark has become a very widely used tool in the big data world. It provides a one-stop shop for a lot of data activities: you can do data warehouse work, build machine learning applications, build reporting solutions, and create stream processing applications. Spark Streaming provides APIs for connecting to different frameworks. As the big data world grows fast, there are always new tools that we may need to use in our projects and integrate with Spark or Spark Streaming. In this post, we will see how we can integrate any system with Spark Streaming using the custom receiver API. There can be many reasons one has to do this. I am listing a few below for your understanding.

1. Spark Streaming does not provide an integration with your data source.
2. The integration Spark Streaming provides uses a different version of the API than your source does.
3. You want to handle the lower-level details of how Spark connects to the source.
4. You are using an older version of Spark (1.3.x or 1.5.x) and you want a Kerberized connection to the source in an enterprise environment.

While exploring custom receivers, we will implement our own way for Spark to connect to Kafka; we will not use the KafkaUtils API provided by the Spark distribution.

What is a custom receiver? As mentioned earlier, you can ask Spark to receive data from any source. As a developer, you only need to know the following:

1. How to connect to the source.
2. How to read data from the source.

For this, we have to extend the abstract Receiver class and implement the following methods:
1. onStart()        // what to do to start receiving messages
2. onStop()         // what to do to stop receiving messages

Let us implement this for Kafka.

Here are the imports we need.
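A minimal sketch of the imports, assuming the Kafka 0.9+ consumer client (kafka-clients) is used alongside the Spark Streaming API:

import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver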


Now we need to create a Custom Receiver class.
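A sketch of such a receiver, assuming the Kafka 0.9+ KafkaConsumer API; the class name KafkaCustomReceiver and its constructor parameters are illustrative placeholders, not the exact code from the original listing:

class KafkaCustomReceiver(bootstrapServers: String, groupId: String, topic: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Called by Spark when the receiver starts. It must not block,
  // so the polling loop runs on its own thread.
  override def onStart(): Unit = {
    new Thread("Kafka Receiver Thread") {
      override def run(): Unit = receive()
    }.start()
  }

  // Called by Spark when the receiver stops. Nothing to do here:
  // the polling loop checks isStopped() and exits on its own.
  override def onStop(): Unit = {}

  private def receive(): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("group.id", groupId)
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    try {
      consumer.subscribe(Collections.singletonList(topic))
      while (!isStopped()) {
        // Poll Kafka for new messages, waiting up to 500 ms.
        val records = consumer.poll(500)
        val it = records.iterator()
        while (it.hasNext) {
          // Hand each message over to Spark for storage.
          store(it.next().value())
        }
      }
    } catch {
      case e: Exception => restart("Error receiving data from Kafka", e)
    } finally {
      consumer.close()
    }
  }
}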


In the receiver above, I have shown how to connect to Kafka so that it keeps on getting messages. Once a message is received, it is stored using the store function. store is a method provided by the Receiver class, and it comes in many flavours. The fault tolerance and reliability of your receiver depend on which store method you use. Read the Spark docs before concluding what is suitable for your use case.
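For example, the single-record store(object) used above is best-effort, while the multi-record overload blocks until Spark has saved the whole block, which is the basis of a reliable receiver. A sketch of a helper that could live inside the receiver class (storeBatch is an illustrative name, not part of the Spark API):

import scala.collection.mutable.ArrayBuffer

// Buffers a batch of messages and stores them as a single block.
// store(ArrayBuffer[T]) returns only after Spark has put the block
// into the block store, unlike the per-record store(object) call.
private def storeBatch(messages: Iterator[String]): Unit = {
  val buf = new ArrayBuffer[String]()
  messages.foreach(buf += _)
  if (buf.nonEmpty) store(buf)
}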

Finally, we can run this example with the following code.
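A minimal driver sketch; the application name, master URL, batch interval, and Kafka connection details (localhost:9092, my-group, my-topic) are placeholder assumptions:

object CustomReceiverApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaCustomReceiverApp").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Streaming context with a 10-second batch interval; tune for your workload.
    val ssc = new StreamingContext(sc, Seconds(10))

    // Plug our custom receiver into the streaming context to get a DStream.
    val stream = ssc.receiverStream(
      new KafkaCustomReceiver("localhost:9092", "my-group", "my-topic"))

    // From here on we continue with normal DStream logic.
    stream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}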


In this code, we create a Spark context and a Spark streaming context. After this, we set our receiver in the streaming context using receiverStream. Once this is done, we get a DStream as output and we continue with our logic.

Thanks a lot for reading this post. Please share your thoughts on it. You can get this code at my GitHub.
