Apache Spark Data Frames : Part 2

In the previous post, we saw how we can create a data frame in Spark. In this post, we will see what kinds of operations we can perform on data frames.
One thing that I missed mentioning in the previous post: if we have import sqlc.implicits._ added in the code, then we can also use the toDF() method. We can create a data frame using the following statement

   val df = data.toDF()

After this statement, we can use the created data frame as usual.

To follow the next steps, let us assume the data frame has the following columns

1. name
2. city
3. age
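For illustration, such a data frame could be built from a case class. The Person class and the three sample rows below are made up for this sketch; sc and sqlc are the SparkContext and SQLContext from the previous post.

```scala
// Hypothetical sample data matching the columns above
case class Person(name: String, city: String, age: Int)

import sqlc.implicits._

val data = sc.parallelize(Seq(
  Person("asha", "bangalore", 30),
  Person("ravi", "pune", 25),
  Person("john", "bangalore", 40)
))

// Convert the RDD of case-class instances into a data frame;
// column names come from the case class fields
val df = data.toDF()
```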

Let us now see what operations we can perform on data frames

1. Select a Column : Following is the statement to select a specific column from the data frame.

     val name_col = df.select("name")

2. All Distinct values  :

     val distinct_names = df.select("name").distinct

3. Counting Number of rows  :

     val row_count = df.select("name").count

4. Aggregate operation  : Example of average age in every city

    // $ converts the column name into a Column type;
    // avg comes from org.apache.spark.sql.functions
    import org.apache.spark.sql.functions.avg
    val avg_age_per_city = df.groupBy("city").agg(avg($"age"))

5. Aggregate operation  : Example of overall average age

   val avg_age = df.agg(avg($"age"))

6. Joining two data frames : For this example, let us assume there is one more data frame named state having city and state_name as columns. Note that the join condition is passed as a Column expression, not a string.

   val joined = df.join(state, df("city") === state("city"), "inner")
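Since both data frames share the column name city, we can alternatively join on the column name itself, which keeps only a single city column in the result (this sketch assumes the df and state data frames described above):

```scala
// Equi-join on the shared "city" column; the result carries
// one "city" column instead of a duplicate pair
val joined_on_city = df.join(state, "city")
```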

7. Filter data frames : Get all rows from Bangalore city

   val filtered_data = df.filter("city = 'bangalore'")
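filter also accepts a Column expression instead of a SQL-style string, which lets the compiler catch typos in the condition (again assuming the df from above):

```scala
// Same filter expressed as a Column condition
val filtered_expr = df.filter(df("city") === "bangalore")
```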

8. Get Execution plan :

   df.filter("city = 'bangalore'").explain

9. Processing data frame using SQL : Once we register a data frame as a table, we can run SQL queries on it as follows

   df.registerTempTable("temp_tb")
   val results = sqlc.sql("select * from temp_tb")
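As a sketch of how the SQL route lines up with the earlier operations, the per-city average from step 4 can also be written as a query against the registered table (assuming the temp_tb registration above):

```scala
// Same aggregation as step 4, expressed in SQL
val avg_age_sql = sqlc.sql(
  "select city, avg(age) as avg_age from temp_tb group by city")
```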


These are some of the ways we can process data using data frames. If you come from a SQL background, you can process data very easily using SQL queries. Feel free to comment on this post and share your thoughts.
