Spark DataFrame
Spark has evolved over the years and now provides multiple ways to write applications; you can develop a Spark application using any of the following APIs. You can also watch a video on this topic as part of the free Spark course.
- RDD
- DataFrame
- Dataset
When Spark was first developed, it had only the RDD API for writing applications. The RDD API was completely functional in nature, and people from a Java or SQL background often found it difficult to write large applications with it. Even today, when I take interviews and ask candidates to write code with RDDs, very few of them are confident enough to solve the problem using the RDD API. So the Spark developers, inspired by the world of pandas and R, introduced the concept of the DataFrame. A DataFrame lets you imagine your data as a table, and you can perform traditional operations on it such as groupBy, count, max, and min. DataFrames proved so much easier to learn that developers from the SQL and Java worlds adopted them very quickly; today, most Spark batch applications are written with the DataFrame API.
Let us understand this data structure from a definition perspective. The Spark documentation says:
"A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood"
Don't worry about the first line of the definition; let us simplify the statement. Think of a DataFrame as a table in a relational database. You cannot edit the data in this table, but you can process its data and create a new table (i.e. a new DataFrame). It is very similar to the data frames you use in pandas and R. When you write your application logic using DataFrames, at runtime Spark automatically generates optimized code on top of the RDD API. As a user of the DataFrame API, you can also run SQL queries on a DataFrame using Spark SQL.
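To make the immutability point concrete, here is a small sketch; it assumes a DataFrame named `df` with an integer `age` column already exists in a spark-shell session:

```scala
// filter() never modifies df; it returns a brand-new DataFrame
val adults = df.filter(df("age") >= 18)
// df still holds all the original rows; adults holds only the filtered copy
```

Every DataFrame operation works this way: you chain transformations that each produce a new DataFrame, and the inputs stay untouched.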
Key features are the following:
- Automatic optimization of code
- You can run SQL queries on a DataFrame using Spark SQL
- Language support is available for Python (PySpark), Scala, R, and Java
- A Datasource API is provided to read DataFrames from multiple formats
Let us create a DataFrame. For this, we will first create some dummy data:
val data = Seq(("harry", 27),
  ("kevin", 28),
  ("mel", 28),
  ("stuart", 32),
  ("bob", 25))
Create a Row RDD (the Row class needs to be imported):
import org.apache.spark.sql.Row
val rdd = sc.makeRDD(data).map(x => Row(x._1, x._2))
Define the schema for the DataFrame, along with the required imports:
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
Create the DataFrame:
val df_frm_rdd = spark.createDataFrame(rdd, schema)
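When the data is already a Seq of tuples, there is a shorter route than building a Row RDD and schema by hand; a sketch, assuming the same `data` Seq and a spark-shell session (where the `spark.implicits._` conversions are available):

```scala
import spark.implicits._

// toDF() infers the column types (String, Int) from the tuple types;
// we only have to supply the column names
val df2 = data.toDF("name", "age")
```

The explicit Row-RDD-plus-schema approach shown above is still useful when the schema is not known at compile time, e.g. when it is built from configuration.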
Now let us experiment.
Show the data:
df_frm_rdd.show()
Print the schema:
df_frm_rdd.printSchema()
Create a temporary view and try Spark SQL:
df_frm_rdd.createOrReplaceTempView("people")
spark.sql("select * from people").show()
spark.sql("select count(*),max(age),min(age) from people").show()
spark.sql("select * from people where age>27").show()
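The same queries can also be written with DataFrame methods instead of SQL strings; a sketch using the `df_frm_rdd` DataFrame from above (the column functions come from `org.apache.spark.sql.functions`):

```scala
import org.apache.spark.sql.functions.{col, count, max, min}

// equivalent of: select count(*), max(age), min(age) from people
df_frm_rdd.agg(count("*"), max("age"), min("age")).show()

// equivalent of: select * from people where age > 27
df_frm_rdd.filter(col("age") > 27).show()
```

Both styles compile down to the same optimized plan, so the choice between SQL strings and DataFrame methods is mostly a matter of taste and type safety.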
If you want to read a file and create a DataFrame directly from it, use the Datasource API of Spark:
val df = spark.read.json("/databricks-datasets/samples/people/people.json")
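The Datasource API is not limited to JSON. For example, a CSV file can be read the same way; a sketch, where the file path is hypothetical and the `header`/`inferSchema` options are standard Spark CSV reader options:

```scala
// "/path/to/people.csv" is a placeholder; point it at any CSV file
val csvDf = spark.read
  .option("header", "true")       // treat the first line as column names
  .option("inferSchema", "true")  // let Spark infer the column types
  .csv("/path/to/people.csv")
```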
I hope this blog was useful for you. I have a YouTube channel with videos on different Spark topics. Please click here.
You can get the code for the above example here.
To find out how RDD, DataFrame, and Dataset differ, please check the following video.