Apache Spark Data Frames: Part 1

In this post we will discuss how to use the Data Frame API in Apache Spark to process data. Many Spark users who come from a Python or R data science background used to complain that Spark RDDs are difficult to work with. As a Spark developer I also found the RDD API cumbersome at times. With the addition of Data Frames, Spark has become a much easier tool to use: developers can query and manipulate data with far less code. Let us first understand the basics of Spark Data Frames.

     A Spark Data Frame is a distributed collection of data organized into named columns. It provides functions to easily select specific columns, filter rows, and group and aggregate data. People from a SQL background can simply write a SQL query and pass it to the SQLContext, which does everything for us. People from an R or Python background will already be familiar with data frames, and people from other backgrounds can think of a data frame as a table.
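For readers who want a quick taste, here is a minimal sketch of that SQL style. It assumes a Data Frame named df already exists with hypothetical columns name and age, and that sqlc is the SQLContext we create in the next section:

     // Register the Data Frame as a temporary table and query it with plain SQL
     df.registerTempTable("people")
     val adults = sqlc.sql("SELECT name, age FROM people WHERE age > 18")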

Creating a data frame from a file in HDFS: In this section we will see how we can create a Data Frame from a text file stored in HDFS.

    1. Creating a SparkContext and SQLContext.
       
       import org.apache.spark.{SparkConf, SparkContext}
       import org.apache.spark.sql.SQLContext

       // Create a Spark conf and set the app name and master
       val conf = new SparkConf().setAppName("Data Frame Test").setMaster("local")

       // Use the Spark conf to create a SparkContext
       val sc: SparkContext = new SparkContext(conf)

       // Create a SQLContext on top of the SparkContext
       val sqlc: SQLContext = new SQLContext(sc)

       // This import enables the implicit conversions required for Data Frames.
       // One example is converting an RDD to a Data Frame using the rdd.toDF() method;
       // another is converting column names to Column objects when a column name is
       // passed to a Data Frame function.
       import sqlc.implicits._
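With those implicits in scope, an RDD can be converted to a Data Frame directly. The Person case class below is only a hypothetical illustration of the rdd.toDF() conversion mentioned in the comment above:

       // toDF() comes from sqlc.implicits._ and infers the schema from the case class
       case class Person(name: String, age: Int)
       val peopleDF = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 35))).toDF()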

    2. The next step is to load the file from HDFS.

val dataRDD = sc.textFile("hdfs:///user/harjeet/data.csv")
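To verify the load, we can print the first few lines of the raw text; the sample content in the comments is only an assumed example of what data.csv might contain:

    // Inspect the first two lines of the file
    dataRDD.take(2).foreach(println)
    // name,age,city      <- header line (assumed sample content)
    // harjeet,30,delhi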

    3. Get the schema of the file from its header.

     // Get the header line from the RDD
     val header = dataRDD.first()

     // Get all lines from the RDD other than the header
     val data = dataRDD.filter(x => x != header)

     // Build the schema: one nullable string column per header field
     import org.apache.spark.sql.types.{StructType, StructField, StringType}
     val columns = StructType(header.split(",").map(col => StructField(col, StringType, nullable = true)))
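Assuming the header line is name,age,city (a hypothetical example), the resulting schema is equivalent to:

     StructType(Array(
       StructField("name", StringType, nullable = true),
       StructField("age", StringType, nullable = true),
       StructField("city", StringType, nullable = true)))

Note that every column is a string at this point, because textFile gives us plain text; columns can be cast to proper types later through the Data Frame API.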

    4. Converting the RDD to a Data Frame.
   
    // For this step we need to know the number of columns in the data set;
    // the following line should change accordingly
    import org.apache.spark.sql.Row
    val rowRDD = data.map(x => { val p = x.split(","); Row(p(0), p(1), p(2)) })

  // Create the Data Frame from the RDD of Rows and the schema
  val df = sqlc.createDataFrame(rowRDD, columns)
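If we do not want to hard-code the number of columns, one option (a sketch, not part of the original steps) is to build each Row from the whole split array:

  // Row.fromSeq builds a Row from however many fields the line contains
  val rowRDD2 = data.map(x => Row.fromSeq(x.split(",")))
  val df2 = sqlc.createDataFrame(rowRDD2, columns)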

These steps create a data frame which can be used to filter, query, or aggregate the data, as shown in the short example further below. We can also print the schema of the data frame using the following code.

  df.printSchema()
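As a small preview of Part 2, the same data frame can also be filtered and aggregated through the column API; the column names name, age and city below are only an assumed example:

  // Filter and select with the Data Frame API (casting the string column to int)
  df.filter(df("age").cast("int") > 18).select("name", "city").show()

  // Group by a column and count
  df.groupBy("city").count().show()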

This is how we can create data frames from an RDD in Apache Spark. In Part 2 we will look at the kinds of operations we can perform on data frames.
