Parquet File Format

Hadoop supports many file formats. These range from plain text files to Hadoop-specific formats like Sequence Files, along with more sophisticated formats such as Avro and Parquet. Every file format in Hadoop brings its own strengths. In this blog post we will discuss what the Parquet file format is and how it is useful for us. The Parquet file format was created by Twitter and Cloudera to provide an efficient file format for HDFS.
        
The Parquet file format belongs to the class of columnar file formats. Columnar file formats are most useful when you plan to access only a few columns of the data, which is why they are popular in columnar databases. They have the following advantages.

1. Columnar file formats are more compression friendly, because the probability of finding common values within a single column is higher than across a whole row.
2. While reading, only the required columns are read, so you save a lot of time on IO and on decompressing data (see the sketch after this list).
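As an illustration of point 2, here is a minimal Spark (Scala) sketch; the path "users.parquet" and the columns "name" and "age" are just made-up names for this example. Because Parquet is columnar, selecting two columns means only those column chunks are scanned from disk.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ColumnPruningExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("column-pruning").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Load an existing Parquet file; only the footer (metadata) is touched at this point.
    val users = sqlContext.read.parquet("users.parquet")

    // Only the "name" and "age" column chunks are read; the other columns are skipped.
    users.select("name", "age").show()

    sc.stop()
  }
}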

Columnar storage for Hadoop has evolved over a period of time. Hive has the RCFile format, which is a very popular columnar file format. Another file format is the ORC (Optimized Row Columnar) file format. I would say ORC and Parquet are very similar; both are mature descendants of the RCFile format. The Parquet file format has the following advantages over RCFile.

1. Parquet is more compression efficient because you can specify column-specific compression (a Spark-level sketch follows this list).
2. The metadata of a Parquet file is stored as part of the file, at the end, so the format is self-describing.
3. The Parquet file format can also be used with Avro and Thrift.
4. Parquet files are tuned for better query performance than the RCFile format.
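To give a feel for point 1, here is a small Spark (Scala) sketch that sets the Parquet compression codec before writing a DataFrame. The toy data, output path, and the choice of gzip are assumptions for the example; from Spark this sets the codec for the whole file, while finer per-column settings live at the Parquet writer level.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetCompressionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-compression").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Toy data just for the example.
    val df = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("key", "value")

    // Pick the codec Spark uses when writing Parquet (e.g. "gzip" or "snappy").
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
    df.write.parquet("compressed-output.parquet")

    sc.stop()
  }
}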

Spark provides very good support for Parquet files; you can load them very easily. In fact, Spark supported Parquet well before Avro: Avro support was added in Spark 1.6, while Parquet support has existed much longer.

Eg. val data = sqlContext.read.parquet("testFile")

The advantage of using Parquet files with Spark is that you do not need to bother about the schema of the file. Spark reads the schema from the Parquet file itself, builds a DataFrame from it, and hands it to you.
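For example, here is a minimal sketch (reusing the same made-up "testFile" path) that shows the schema coming straight out of the file.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetSchemaExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-schema").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    val data = sqlContext.read.parquet("testFile")

    // Column names and types are taken from the Parquet footer; nothing is declared by hand.
    data.printSchema()
    data.show(5)

    sc.stop()
  }
}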

To Be Continued...
