Spark DataFrame

Spark has evolved over the years and now provides multiple ways to write applications. You can develop a Spark application using any of the following APIs. You can also watch the video on this topic as part of the free Spark course.

  1. RDD
  2. DataFrame
  3. Dataset

As of today, these are the three base data structures in Spark for writing applications. In this post we will focus on the DataFrame API.

When Spark was first developed, it had only the RDD API for building applications. The RDD API is completely functional in nature, and people coming from a Java or SQL background often found it difficult to write large applications with it. Even today, when I take interviews and ask candidates to write code with RDDs, very few of them seem confident solving the problem using the RDD API. So the Spark developers, inspired by the world of pandas and R, brought in the concept of the DataFrame. A DataFrame lets you imagine your data as a table, and you can perform traditional operations on it such as groupBy, count, max, min, etc. DataFrames turned out to be so much easier to learn that developers from the SQL and Java worlds adopted them very quickly, and currently most Spark batch applications are developed with the DataFrame API.
Let us now look at this data structure from a definition perspective. The Spark documentation says:
       "A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood"

Don't worry about the first line of that definition; let us try to simplify the statement. Think of a DataFrame as a table in a relational database. You cannot edit the data in this table, but you can process its data and create a new table (i.e. a new DataFrame). It is very similar to the data frames you use in pandas and R. When you write your application logic with DataFrames, Spark automatically generates optimized code on top of the RDD API at runtime. As a user of the DataFrame API, you can also run SQL queries on a DataFrame using Spark SQL.
The key features are:

  1. Automatic optimization of your code (see the sketch just after this list).
  2. You can run SQL queries on a DataFrame using Spark SQL.
  3. Language support is available for Python (PySpark), Scala, R and Java.
  4. A Datasource API to read a DataFrame from multiple formats.

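To make the first two features concrete, here is a minimal sketch (assuming a spark-shell session, where the SparkSession is available as spark and its implicits can be imported). We express the same filter once through the DataFrame API and once through SQL, and explain(true) shows that Spark compiles both down to essentially the same optimized plan:

        // Minimal sketch, assuming spark-shell (SparkSession available as `spark`)
        import spark.implicits._

        val demo = Seq(("harry", 27), ("kevin", 28)).toDF("name", "age")

        // Same logic expressed twice: DataFrame API vs. Spark SQL
        val viaApi = demo.filter($"age" > 27).select("name")
        demo.createOrReplaceTempView("demo_people")
        val viaSql = spark.sql("SELECT name FROM demo_people WHERE age > 27")

        // explain(true) prints the parsed, analyzed, optimized and physical plans;
        // both versions end up with essentially the same optimized plan
        viaApi.explain(true)
        viaSql.explain(true)
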
Let us create a DataFrame. For this, we will first create some dummy data:

          // Dummy data: a Seq of (name, age) tuples
          val data = Seq(("harry", 27),
                         ("kevin", 28),
                         ("mel", 28),
                         ("stuart", 32),
                         ("bob", 25))


Create an RDD of Rows:

               import org.apache.spark.sql.Row

               // Wrap each (name, age) tuple in a Row so we can attach a schema later
               val rdd = sc.makeRDD(data).map(x => Row(x._1, x._2))

Define the schema for the DataFrame, along with the required import:

              import org.apache.spark.sql.types._

              // Schema: a nullable string column "name" and a nullable integer column "age"
              val schema = StructType(Seq(
                StructField("name", StringType, nullable = true),
                StructField("age", IntegerType, nullable = true)))

Create the DataFrame:

            val df_frm_rdd = spark.createDataFrame(rdd, schema)
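
As a side note, when the data is already an in-memory Seq of tuples like this, a shorter alternative (a sketch, assuming the SparkSession implicits are in scope, as they are in spark-shell) is to skip the Row RDD and schema and let toDF infer the column types:

            import spark.implicits._

            // Column names are given explicitly; types are inferred from the tuples
            val df_frm_seq = data.toDF("name", "age")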

Now let us experiment.
Show the data:
            df_frm_rdd.show()
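
The output should look roughly like this (the exact row order may vary):

            +------+---+
            |  name|age|
            +------+---+
            | harry| 27|
            | kevin| 28|
            |   mel| 28|
            |stuart| 32|
            |   bob| 25|
            +------+---+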

Print the schema:

         df_frm_rdd.printSchema()
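
This should print something like:

         root
          |-- name: string (nullable = true)
          |-- age: integer (nullable = true)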

Create a temp view and try some Spark SQL queries:

        df_frm_rdd.createOrReplaceTempView("people")
        spark.sql("select * from people").show()
        spark.sql("select count(*),max(age),min(age) from people").show()
        spark.sql("select * from people where age>27").show()
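
Given the dummy data above, the second query should report a count of 5, a maximum age of 32 and a minimum age of 25 (the exact column headers of aggregate results vary slightly between Spark versions), and the last query should return only the rows with age greater than 27, roughly:

        +------+---+
        |  name|age|
        +------+---+
        | kevin| 28|
        |   mel| 28|
        |stuart| 32|
        +------+---+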

If you want to read a file and create a DataFrame directly from it, use the Datasource API of Spark:

        val df = spark.read.json("/databricks-datasets/samples/people/people.json")
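
The same reader API works for other formats as well. For example (a sketch; the paths below are placeholders, not real datasets):

        // CSV with a header row, letting Spark infer the column types
        val csvDf = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/path/to/people.csv")

        // Parquet files carry their schema with them
        val parquetDf = spark.read.parquet("/path/to/people.parquet")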

I hope this blog was useful for you. I have a YouTube channel where I have videos on different Spark topics. Please click here.

You can get the code for the above example here.

To find out how RDD, DataFrame and Dataset differ, please check the following video.


