Spark Performance Tuning

Getting started with Spark to develop big data applications is very easy, and Spark provides many ways to do the same thing. Performance tuning is an important aspect of big data work: even minor improvements, applied at scale, can save a lot of time and resources. In this blog post, we will discuss the small things we can do to get the maximum out of our Spark cluster.

Before jumping into the specifics, let us try to understand how Spark works.

Actors: Spark has two major components:
1. Driver
2. Executor

The Driver is a kind of master process which controls everything. Spark runs multiple Executors, and the executors do the actual execution of tasks. We should also understand that a job is just a set of tasks: the driver hands out the work and the executors handle the tasks. One executor can run multiple tasks at a time, depending on how many resources we have.
Once a job is submitted to Spark, Spark creates an execution plan for executing it. This plan divides the whole job into a set of stages, and each stage contains multiple tasks. The sequence in which stages run is decided by their dependencies: at first, only those stages run whose input data is already available and which do not depend on any other stage, and this repeats until the job completes. The creation of stages also depends on the shuffling of data: within a single stage no shuffle is required, and a new stage begins wherever the data has to be shuffled.

One more important thing to consider here is that the more shuffles there are, the slower things run. A shuffle is normally required by wide transformations. For those who are not aware, Spark has two kinds of transformations, narrow and wide. A narrow transformation generates its output using data from only one partition, while a wide transformation may require data from multiple partitions. For example, map is a narrow transformation and groupByKey is a wide transformation. I will discuss this in detail in another post.
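
To make the difference concrete, here is a minimal sketch in Scala, assuming a spark-shell session where sc is already available; toDebugString prints the RDD lineage, and the ShuffledRDD in its output marks the stage boundary introduced by groupByKey:

    // narrow: each output partition is computed from exactly one input partition
    val words = sc.parallelize(Seq("a", "b", "a", "c"), 2)
    val pairs = words.map(w => (w, 1))      // narrow transformation, no shuffle

    // wide: values for one key may live in several partitions, so a shuffle is needed
    val grouped = pairs.groupByKey()        // wide transformation, new stage begins here

    println(grouped.toDebugString)          // lineage shows the ShuffledRDD boundary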

Now that we understand some of the basics of Spark, let us explore where we can add more power.

1. Define the Right Number of Executors and Executor Cores - You can change the number of executors while submitting a job or while starting the shell using the --num-executors property. That alone is not enough: you also need to define the number of executor cores. Executor cores define the number of concurrent tasks each executor can run. If we set the executor cores very high, Spark will run a very high number of concurrent tasks on each executor; these tasks will compete with each other for resources and reduce data I/O throughput.

E.g. spark-submit --class com.test.Example --master yarn --num-executors 15 --executor-cores 5 test.jar
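
The same sizing can also be set as configuration properties when the application builds its own session; here is a minimal sketch in Scala (the application name and the values are just placeholders, and spark.executor.instances is the property behind --num-executors):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ExecutorSizingExample")           // placeholder name
      .config("spark.executor.instances", "15")   // equivalent of --num-executors
      .config("spark.executor.cores", "5")        // equivalent of --executor-cores
      .getOrCreate()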

2. Change the Number of Partitions - The number of tasks created in a stage equals the number of partitions of the data, so having a very large number of partitions can reduce your cluster's performance. This is a very common problem when you run Spark code locally: if your data has a huge number of partitions, you will see your code running very slowly. Use the coalesce method to reduce the number of partitions without a full shuffle, or the repartition method (available on both RDDs and data frames) when you need to increase them; repartition triggers a full shuffle.
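
A minimal sketch of both calls, assuming an existing data frame df (the partition counts are just illustrative):

    // reduce the number of partitions without a full shuffle
    val fewer = df.coalesce(8)

    // increase or rebalance partitions; note that this triggers a full shuffle
    val more = df.repartition(200)

    println(fewer.rdd.getNumPartitions)     // verify the new partition counts
    println(more.rdd.getNumPartitions)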

3. Use Broadcast Variables - These are like the distributed cache in Hadoop: we can share data across the worker nodes in a read-only way. If used intelligently, we can optimize our code with this, for example by broadcasting a small lookup table instead of shipping it with every task.
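
A minimal sketch of that pattern, assuming a spark-shell session; the lookup map and the codes are made-up placeholders:

    // small read-only lookup table shipped once to every executor
    val countryNames = Map("IN" -> "India", "US" -> "United States")
    val bcNames = sc.broadcast(countryNames)

    // each task reads the broadcast value locally instead of serialising the map per task
    val codes = sc.parallelize(Seq("IN", "US", "IN"))
    val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))
    println(resolved.collect().mkString(", "))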

4. Cache your Data - If you are doing a lot of exploration on a single data frame, caching that data frame will speed up every execution that you try on it, because the data is kept in memory instead of being recomputed from the source each time.
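
A minimal sketch, assuming a data frame df that is queried repeatedly and has an amount column (the column name is just an illustration):

    df.cache()                            // keep df in memory once an action materialises it

    df.count()                            // first action reads the source and fills the cache
    df.filter("amount > 100").count()     // later actions reuse the cached data
    df.unpersist()                        // release the memory when exploration is done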

5. Prefer reduceByKey over groupByKey - If I explain it in Hadoop terms, reduceByKey is a reduce operation with a combiner and groupByKey is a reduce without a combiner. Both are wide transformations, so a data shuffle happens in either case, but much more data is transferred in the case of groupByKey because nothing is combined on the map side, which makes it slower than reduceByKey.
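
A minimal sketch of the two ways to compute per-key sums, assuming a spark-shell session:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // combines values inside each partition before the shuffle, so less data moves
    val sumsFast = pairs.reduceByKey(_ + _)

    // ships every (key, value) pair across the network and sums only after the shuffle
    val sumsSlow = pairs.groupByKey().mapValues(_.sum)

    println(sumsFast.collect().toMap)     // Map(a -> 4, b -> 2)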

6. flatMap-join-groupBy vs cogroup - Use cogroup wherever possible, because it has better performance than the flatMap-join-groupBy pattern: it avoids the extra overhead of packing and unpacking the joined data before grouping it again.
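
A minimal sketch on two made-up pair RDDs, assuming a spark-shell session:

    val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (2, "lamp")))
    val payments = sc.parallelize(Seq((1, 20.0), (2, 35.0)))

    // join emits one row per matching pair, which then has to be regrouped by key
    val viaJoin = orders.join(payments).groupByKey()

    // cogroup groups both sides by key in one pass: (key, (Iterable[String], Iterable[Double]))
    val viaCogroup = orders.cogroup(payments)
    viaCogroup.collect().foreach(println)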

These were some of my thoughts on improving the performance of the work that we do in Spark. Please feel free to share your thoughts.
     
Here is a video on my YouTube channel which covers this topic in more detail with more scenarios.

