
Showing posts from December, 2015

Spark Streaming Tutorial

Spark has evolved a great deal as a distributed computing framework. In the beginning it was intended only for in-memory batch processing. Starting with Spark 0.7, it began to provide support for Spark Streaming, which opened the door to a whole new set of applications that can be built with Spark. Streaming systems are required when you want to keep processing data as it arrives in the system, for example calculating the distance a person has covered from their starting point using a stream of GPS signals. Batch processing systems are not suitable for such problems because they have very high latency, which means the result is available only after a considerable delay. Let us understand some of the basic concepts of Spark Streaming, and then run through some code as part of this post. DStream: The DStream is the core abstraction that makes Spark Streaming possible. A DStream is a stream of RDDs. …
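To make the DStream idea concrete, here is a minimal sketch assuming the Spark 1.x streaming API; the socket source on localhost:9999 and the 5-second batch interval are placeholders chosen only for illustration. Each batch of the DStream is an RDD covering one batch interval of input.

    // Minimal Spark Streaming sketch: count words in each 5-second batch
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
        // Every 5 seconds the DStream produces one RDD of the lines received
        val ssc = new StreamingContext(conf, Seconds(5))

        val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
        val counts = lines.flatMap(_.split("\\s+"))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)
        counts.print()

        ssc.start()              // start consuming the stream
        ssc.awaitTermination()   // block until the job is stopped
      }
    }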

Apache Avro

Hadoop, as we all know, is a distributed processing system, and any distributed processing system is incomplete without a serialization and de-serialization API. A serialization and de-serialization API lets us write data to a file, transfer it to another machine and read it back at the receiving end. Hadoop has the Writable and WritableComparable interfaces for this purpose: any record in Hadoop can be written to disk by implementing Writable or WritableComparable. Hadoop's Writable API is compact, but it is not easy to extend and cannot be used from languages other than Java, even though people who use the streaming API come from many different backgrounds. There are other serialization frameworks, such as Thrift from Facebook and Protocol Buffers from Google, but the most popular in the Hadoop ecosystem is Avro, which was specifically designed to overcome the disadvantages of Writables. Avro was designed by Doug Cutting. Avro Schema: Before storing any data using Avro, we need to create a schema file. …
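As a small illustration (not taken from the original post), here is a sketch in Scala using Avro's generic API: the schema JSON, the record fields and the output file name are made-up examples.

    // Define an Avro schema and write one record to an Avro container file
    import java.io.File
    import org.apache.avro.Schema
    import org.apache.avro.file.DataFileWriter
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

    object AvroWriteExample {
      def main(args: Array[String]): Unit = {
        // Avro schemas are declared in JSON; this one describes a simple User record
        val schemaJson =
          """{
            |  "type": "record",
            |  "name": "User",
            |  "fields": [
            |    {"name": "name", "type": "string"},
            |    {"name": "age",  "type": "int"}
            |  ]
            |}""".stripMargin
        val schema = new Schema.Parser().parse(schemaJson)

        // Build a record that conforms to the schema
        val user: GenericRecord = new GenericData.Record(schema)
        user.put("name", "alice")
        user.put("age", 30)

        // Serialize the record to an Avro data file on local disk
        val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
        writer.create(schema, new File("users.avro"))
        writer.append(user)
        writer.close()
      }
    }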

Apache Spark Data Frames : Part 2

In the previous post, we saw how we can create a data frame in Spark. In this post we will see what kind of operations we can do on data frames. One thing I missed mentioning in the previous post is that if we have import sqlc.implicits._ added in the code, then we can also use the toDF() method. We can create a data frame with the following statement:

    val df = data.toDF()

After this statement, we can use the created data frame as usual. For the next steps, let us assume the data frame has the following columns: 1. name 2. city 3. age. Let us now see what operations we can do on data frames (a combined sketch follows below).

1. Select a column: the following statement selects a specific column from the data frame.
       val name_col = df.select("name")
2. All distinct values:
       val distinct_names = df.select("name").distinct
3. Counting the number of rows:
       val row_count = df.select("name").count
4. Aggregate operations: …
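Putting the operations above together, here is a self-contained sketch assuming a Spark 1.x SQLContext named sqlc; the Person case class, the sample rows and the groupBy aggregation are assumptions made only for illustration.

    // Create a data frame from an RDD of case classes and run the basic operations
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object DataFrameOps {
      case class Person(name: String, city: String, age: Int)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("DataFrameOps").setMaster("local[*]"))
        val sqlc = new SQLContext(sc)
        import sqlc.implicits._   // brings toDF() into scope

        val data = sc.parallelize(Seq(
          Person("alice", "pune", 30),
          Person("bob", "delhi", 25),
          Person("alice", "delhi", 28)))
        val df = data.toDF()

        val name_col       = df.select("name")              // 1. select a column
        val distinct_names = df.select("name").distinct     // 2. distinct values
        val row_count      = df.select("name").count        // 3. number of rows
        val avg_age        = df.groupBy("city").avg("age")  // 4. aggregate per group

        distinct_names.show()
        avg_age.show()
        println(s"rows: $row_count")
      }
    }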

Apache Spark Data Frames : Part 1

In this post we will discuss how we can use the Data Frame API in Apache Spark to process data. Many Apache Spark users who come from a Python or R data science background complain that Spark RDDs are difficult to work with; as a Spark developer I also used to find the RDD API difficult to use. With the addition of Data Frame support, Spark has become a much easier tool to use, and developers can easily query and manipulate data. Let us first understand the basics of Spark Data Frames. A Spark Data Frame is a distributed collection of data organized into columns. It provides functions to easily select specific columns, filter data, group by and aggregate data. People from an SQL background can simply write a SQL query and pass it to the SQLContext, which does everything for us. People from an R or Python background will already be familiar with data frames; people from other backgrounds can imagine a data frame as a table. Creating a data frame from a file in HDFS: In this section …
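As a rough sketch of the idea described above (assuming Spark 1.x, a JSON file in HDFS at a placeholder path, and the SQLContext SQL interface), loading a file into a data frame and querying it with plain SQL looks roughly like this:

    // Load a JSON file from HDFS into a data frame and query it with SQL
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object DataFrameFromHdfs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("DataFrameFromHdfs"))
        val sqlc = new SQLContext(sc)

        // Each line of the JSON file becomes a row; the schema is inferred
        val people = sqlc.read.json("hdfs:///user/hadoop/people.json")   // placeholder path

        // Register the data frame as a temporary table so SQL can be run against it
        people.registerTempTable("people")
        val adults = sqlc.sql("SELECT name, age FROM people WHERE age >= 18")

        adults.show()
      }
    }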