Posts

Showing posts from 2015

Spark Streaming Tutorial

Spark has evolved a lot as a distributed computing framework. In the beginning it was intended only for in-memory batch processing; starting with Spark 0.7 it added support for Spark Streaming, which opened the door to a new set of applications that can be built on Spark. Streaming systems are needed when you want to keep processing data as it arrives in the system, for example calculating the distance covered by a person from their starting point using a stream of GPS signals. Batch processing systems are not suitable for such problems because their latency is very high, which means results arrive only after a considerable delay. Let us understand some of the basic concepts of Spark Streaming, then run through some code as part of this post. DStream: The DStream is the core concept that makes Spark Streaming possible; a DStream is a stream of RDDs. …
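
As a minimal sketch of the DStream idea (assuming a local Spark Streaming setup; the socket host, port, and 10-second batch interval are illustrative), each batch of input becomes one RDD in the DStream and the same word-count logic is applied to every batch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        // Batch interval of 10 seconds: each batch becomes one RDD in the DStream
        val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))

        // DStream of text lines read from a socket source (host and port are placeholders)
        val lines = ssc.socketTextStream("localhost", 9999)

        // Classic word count, applied to every RDD that arrives in the stream
        val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
        counts.print()

        ssc.start()             // start receiving and processing data
        ssc.awaitTermination()  // keep the application running
      }
    }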

Apache Avro

Hadoop, as we all know, is a distributed processing system, and any distributed processing system is incomplete without serialization and de-serialization APIs. Serialization and de-serialization let us write data to a file, transfer it to another machine, and read it back at the receiving end. Hadoop has the Writable and WritableComparable interfaces for this purpose: any record in Hadoop can be written to disk by implementing Writable or WritableComparable. Hadoop's Writable API is compact, but it is not very easy to extend and cannot be used from languages other than Java, while people who use the streaming API come from different backgrounds. There are other serialization frameworks such as Thrift from Facebook and Protocol Buffers from Google, but the most popular is Avro, which was designed by Doug Cutting specifically to overcome the disadvantages of Writables. Avro Schema: Before storing any data using Avro, we need to create a schema file. This schema file should …
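
As a minimal sketch of that workflow (using Avro's generic API from Scala; the User schema, its fields, and the output file name are illustrative), defining a schema and writing a record to an Avro container file looks roughly like this:

    import java.io.File
    import org.apache.avro.Schema
    import org.apache.avro.file.DataFileWriter
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

    object AvroWriteExample {
      def main(args: Array[String]): Unit = {
        // A simple schema with two fields; in practice this usually lives in a .avsc file
        val schemaJson =
          """{
            |  "type": "record",
            |  "name": "User",
            |  "fields": [
            |    {"name": "name", "type": "string"},
            |    {"name": "age",  "type": "int"}
            |  ]
            |}""".stripMargin
        val schema = new Schema.Parser().parse(schemaJson)

        // Build a record that conforms to the schema
        val user: GenericRecord = new GenericData.Record(schema)
        user.put("name", "alice")
        user.put("age", Int.box(30))

        // Serialize the record into an Avro container file on local disk
        val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
        writer.create(schema, new File("users.avro"))
        writer.append(user)
        writer.close()
      }
    }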

Apache Spark Data Frames : Part 2

In the previous post we saw how we can create a data frame in Spark; in this post we will see what kinds of operations we can perform on data frames. One thing I missed mentioning in the previous post: if we have import sqlc.implicits._ in the code, we can also use the toDF() method to create a data frame with the following statement.

    val df = data.toDF()

After this statement we can use the created data frame as usual. For the next steps, let us assume the data frame has the columns name, city, and age. Let us now see what operations we can do on data frames.
1. Select a column: the following statement selects a specific column from the data frame.
    val name_col = df.select("name")
2. All distinct values:
    val distinct_names = df.select("name").distinct
3. Counting the number of rows:
    val row_count = df.select("name").count
4. Aggregate operations: …
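
As a minimal sketch of an aggregate operation (assuming the same df with the name, city, and age columns above; the choice of aggregation is illustrative), a groupBy followed by agg computes the average age per city:

    import org.apache.spark.sql.functions.avg

    // Average age per city, using the columns assumed above
    val avg_age_by_city = df.groupBy("city").agg(avg("age"))
    avg_age_by_city.show()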

Apache Spark Data Frames : Part 1

In this post we will discuss how to use the Data Frame API in Apache Spark to process data. Many Spark users who come from Python or R data science backgrounds complain that Spark RDDs are difficult to work with, and as a Spark developer I also used to find RDDs hard to use. After Data Frames support was added, Spark became a much easier tool to use: developers can easily query and manipulate data. Let us first understand the basics of Spark Data Frames. A Spark Data Frame is a distributed collection of data organized into columns. It provides functions to easily select specific columns, filter data, group by, and aggregate data. People from a SQL background can simply write a SQL query and pass it to SQLContext, which does everything for us; people from an R or Python background will already be familiar with data frames; people from other backgrounds can think of a data frame as a table. Creating a data frame from a file in HDFS: In this section …
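
As a minimal sketch of both routes mentioned above (using the Spark 1.x SQLContext API; the HDFS path, JSON input, and table name are illustrative), creating a data frame and querying it with SQL looks roughly like this:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object DataFrameBasics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("DataFrameBasics"))
        val sqlc = new SQLContext(sc)

        // Create a data frame from a JSON file stored in HDFS (path is a placeholder)
        val people = sqlc.read.json("hdfs:///data/people.json")

        // Register it as a temporary table so it can be queried with plain SQL
        people.registerTempTable("people")
        val adults = sqlc.sql("SELECT name, age FROM people WHERE age >= 18")
        adults.show()
      }
    }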

Using Lzo compression codec in Hadoop

While working on Hadoop, most of the files we handle are very large, so it is usually necessary to compress them before using them with Hive or Pig. Hadoop supports various compression formats, and each format has its own advantages and disadvantages. Let us start with the compression options available to us:

    Name      Tool      Splittable
    gzip      gzip      No
    LZO       lzop      Yes (if indexed)
    bzip2     bzip2     Yes
    Snappy    N/A       No

Normally you will want to choose an option that lets you split the file and use the power of MapReduce to process it; otherwise you will be forced to process the whole file with a single mapper. I normally prefer the LZO format …
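
As a minimal sketch of reading an LZO-compressed file from Spark (this assumes the hadoop-lzo library, which provides com.hadoop.mapreduce.LzoTextInputFormat, is installed and on the classpath; the HDFS path is a placeholder):

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.{SparkConf, SparkContext}

    object ReadLzoExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ReadLzoExample"))

        // Read an LZO-compressed (and ideally indexed) file through hadoop-lzo's input format
        val lines = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](
          "hdfs:///data/logs/events.lzo"
        ).map { case (_, text) => text.toString } // copy out of the reused Text object

        println(lines.count())
      }
    }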

Data Science : Kaggle Titanic Problem

Introduction: When I started participating in Kaggle competitions, I started with this problem. I feel it is a very easy problem for learning data science, so I thought I would share how I approached it. Please feel free to share your comments. I used IPython to solve this problem and then converted the IPython notebook into an HTML file, so if the layout of this page looks broken, you can see the notebook here. The problem statement is very simple: given example data about people who survived and who died on the Titanic, we need to predict, for a new person's data, whether that person would be saved or not. You can read more about the problem statement here. Data: You can get the data from the same location. There are two important files there: train.csv and test.csv. train.csv is the file that contains the examples; we will analyze this data and create a model that learns the patterns of who was saved and who died. …