
Kafka Spark Integration Issue - Task Not Serializable

When you integrate Kafka with Spark Streaming, there are a few small things that need to be taken care of. They are often ignored, and we end up wasting a lot of time resolving the issues they cause. As part of this post, I will discuss these small issues with you.

1. Task Not Serializable: If you have tried posting messages from Spark Streaming to Kafka, there is a very high probability that you will face this issue. Let us understand the problem first.

When does it occur - when you create an object of a class on the driver and try to use it inside the transformation logic of a DStream or RDD.

Example - sample code is below. If the KafkaSink class is not declared as serializable and you use it as shown, you will definitely face the Task not serializable issue.

kafkaStream.foreachRDD { rdd =>
  rdd.foreachPartition { part =>
    part.foreach(msg => kafkaSink.value.send("kafka_out_topic", msg.toString))
  }
}
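One common way around this (a minimal sketch, not necessarily the exact fix used here) is to wrap the producer in a small serializable class that creates the real KafkaProducer lazily on each executor, and share that wrapper through a broadcast variable. The sketch below reuses the KafkaSink name and the kafka_out_topic topic from the snippet above; the broker address and serializer settings are illustrative assumptions.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Serializable wrapper: only the factory function is shipped to the executors;
// the real KafkaProducer is created lazily on each executor.
class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
  lazy val producer = createProducer()
  def send(topic: String, value: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, value))
}

object KafkaSink {
  def apply(brokers: String): KafkaSink = {
    val createFn = () => {
      val props = new Properties()
      props.put("bootstrap.servers", brokers)   // placeholder broker address
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      new KafkaProducer[String, String](props)
    }
    new KafkaSink(createFn)
  }
}

// Driver side: broadcast the serializable sink once and reuse it inside DStream closures.
val kafkaSink = streamingContext.sparkContext.broadcast(KafkaSink("kafka-broker:9092"))

kafkaStream.foreachRDD { rdd =>
  rdd.foreachPartition { part =>
    part.foreach(msg => kafkaSink.value.send("kafka_out_topic", msg.toString))
  }
}

Because only the factory function travels inside the closure and the producer itself is built lazily on the executor, nothing non-serializable ever has to be shipped from the driver.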

Spark Streaming and Kafka Integration

Spark Streaming and Kafka are two much talked about technologies these days. Kafka is a distributed queue: we can post messages to a Kafka queue and also read messages from it. Spark Streaming can keep listening to a source and process the incoming data. Let us understand how we can integrate these two technologies. As part of this post, we will learn the following:

1. Read messages from a Kafka queue
2. Write messages to a Kafka queue

Reading messages from a Kafka queue: First of all we should create a stream which keeps listening to a Kafka topic. Following is the code for this.

import org.apache.spark.streaming.kafka._

@transient val kafkaStream = KafkaUtils.createStream(streamingContext, "zookpr-host:2181", "kafka-grp", Map("Kafka_input" -> 1.toInt))

There is another way to read messages from Kafka topics, i.e. Kafka Direct. Kafka Direct is out of scope for this post; we will discuss it in another post. Once kafkaStream is created, we can process the incoming messages like any other DStream.
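To put the snippet in context, a minimal runnable sketch of this receiver-based approach could look like the following. The ZooKeeper host, consumer group id and topic name are the same placeholders as above; the app name and the 5-second batch interval are assumptions made for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka._

// Assumed driver setup: app name and batch interval are illustrative choices.
val sparkConf = new SparkConf().setAppName("KafkaSparkIntegration")
val streamingContext = new StreamingContext(sparkConf, Seconds(5))

// Receiver-based stream: connects through ZooKeeper, one receiver thread for the topic.
val kafkaStream = KafkaUtils.createStream(
  streamingContext,
  "zookpr-host:2181",    // ZooKeeper quorum (placeholder host)
  "kafka-grp",           // consumer group id
  Map("Kafka_input" -> 1))

// Each record is a (key, value) pair; here we simply print the values of each batch.
kafkaStream.map(_._2).print()

streamingContext.start()
streamingContext.awaitTermination()

Each record arriving on kafkaStream is a (key, value) tuple, which is why the value is extracted with _._2 before printing.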