Posts

Showing posts from January, 2016

Parquet File Format

Hadoop supports many file formats. These range from plain text files to Hadoop-specific formats such as Sequence Files, and there are more sophisticated formats like Avro and Parquet. Every file format in Hadoop brings its own strengths. In this blog post we will discuss what the Parquet file format is and how it is useful for us. Parquet was created by Twitter and Cloudera as an efficient file format for HDFS.

Parquet belongs to the class of columnar file formats. Columnar file formats are most useful when you plan to access only a few columns of the data, which makes them a good fit for columnar databases. They have the following advantages.

1. Columnar file formats are more compression friendly, because values within a single column are more likely to repeat than values across a row.
2. While reading, only the required columns are read, so you save a lot of time on IO and on decompressing data.
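To make the two advantages concrete, here is a minimal sketch in plain Python. It does not use Parquet itself: it only mimics a row layout and a column layout with the standard `json` and `zlib` modules. The sample records, the column names, and the helper functions `row_layout` and `column_layout` are all hypothetical and exist only for illustration.

```python
import json
import zlib

# Hypothetical sample records; the column names are illustrative only.
rows = [
    {"user_id": i, "country": "IN" if i % 3 else "US", "status": "active"}
    for i in range(1000)
]


def row_layout(records):
    """Serialize record by record, the way a row-oriented file would."""
    return "\n".join(json.dumps(r) for r in records).encode("utf-8")


def column_layout(records):
    """Serialize each column as its own byte stream (a toy columnar layout)."""
    columns = {}
    for name in records[0]:
        values = [str(r[name]) for r in records]
        columns[name] = "\n".join(values).encode("utf-8")
    return columns


row_bytes = row_layout(rows)
col_bytes = column_layout(rows)

# Advantage 1: similar values sit next to each other, so each column
# stream can be compressed on its own.
row_compressed = len(zlib.compress(row_bytes))
col_compressed = sum(len(zlib.compress(b)) for b in col_bytes.values())

# Advantage 2: a query that needs only "country" touches only that stream.
country_only = len(col_bytes["country"])

print("row layout, compressed bytes:   ", row_compressed)
print("column layout, compressed bytes:", col_compressed)
print("bytes read for one column:", country_only, "of", len(row_bytes))
```

On this toy data the per-column streams should compress to fewer bytes than the row stream, and reading a single column touches only a small fraction of the bytes. Real Parquet files add column-specific encodings, row groups, and metadata on top of this basic idea, so the numbers here are only indicative.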

Spark Performance Tuning

Getting started with Spark to develop big data applications is very easy. Spark provides many ways to do even a simple thing. Performance tuning is a very important aspect of big data work: even minor improvements, at huge scale, can save a lot of time and resources. In this blog post we will discuss the small things we can do to get the maximum out of our Spark cluster. Before jumping into the specifics, let us try to understand how Spark works.

Actors: Spark has two major components,

1. Driver
2. Executor

The Driver is a kind of master process that controls everything. Spark runs multiple Executors. Executors are the workers; they do the actual execution of tasks. We should also understand that a unit of work is just a set of tasks. The Driver hands out the work and the Executors handle the tasks. One executor can run multiple tasks at a time, depending on how many resources we have.

Once the job is submitted to Spark, Spark will create an execution plan for execu