LXJS in review

October 13th, 2013

Last week I spent some days in Lisbon attending LXJS (Lisbon JavaScript). This was the second edition of this JavaScript conference. Considering the quality and the very positive feedback of the first edition I decided not to miss it this year.

The event was hosted at Cinema São Jorge, an old-style cinema theatre located in the heart of Lisbon. It was two days packed with talks about JavaScript. The event is structured as a single track. There’s not much detail about the specific contents of each talk until the very last day. Instead of listing each talk individually, they are organized by topic (infrastructure, gaming, robots, etc). I particularly liked this approach. First, no stress about which talk to go to. Second, although you may not be interested in a topic, attending all the talks gives you a good understanding of what’s going on in JavaScript nowadays. And if you’re really not interested in a particular topic, you can skip the whole block.

The quality of the talks was very good in general. Some of the speakers had participated in other JavaScript events, like JSConf EU, and most of them were international speakers, which gives an idea of the global scope of the event. Without further ado, here’s a brief summary of the talks I found most interesting:

  • Robert Nyman. Review of Firefox OS. Robert Nyman is a technology evangelist at Mozilla. Good review of the current state of Firefox OS.
  • Laura Kalbag. Designing for accessibility. A good reminder of what to do and what not to do to improve website design and accessibility. If you’ve read “Don’t Make Me Think” or followed some UX course like hackdesign, you may be familiar with most of the concepts. Good review of tools at the end of the talk.
  • Jonathan Lipps. Appium. Appium is a functional testing environment for web and mobile apps. Don’t miss Jonathan’s live performance with a guitar.
  • Tim Park. Nitrogen.js. Nitrogen is a framework that makes it easier to build connected real-world devices and the applications that use them. Very interesting if you are keen on home automation. Park worked at Nest, so all his experience with home-automation devices is built into Nitrogen.
  • Michal Budzynski. Gaming in JavaScript. Michal is a Firefox OS developer and JavaScript game developer. This talk, conducted like an arcade game, gives a review of his latest work on JavaScript game development and some of his experiments. Very nice and enjoyable talk.
  • Mark Boas. HyperAudio. This is an idea that’s been around for a while. The first time I heard about HyperAudio was about two years ago, if I remember correctly. It was good to see the effort is still alive, mostly pushed by people from Mozilla. To put it briefly, HyperAudio’s goal is to make audio a first-class citizen on the web. Think of all the audio available on the internet, how little of that content is searchable and how difficult it is to share. Imagine you would like to share a fragment of an interview with your favourite writer broadcast by the BBC a year ago. Still today, impossible to do.
  • Bill Mills and Angelina Fabro. JavaScript for Science. Good talk conducted by two speakers. Bill is a physicist working at TRIUMF and Angelina is an experienced Mozillian. This talk is about how good software engineering practices, which are widely adopted in open-source development, can help improve scientific software.
  • Alex Feyerke and Caolan McMahon. Hoodie. Presentation of the Hoodie framework, an interesting framework for rapid prototyping in JavaScript.
  • Hannah Wolfe. Ghost. Review of the current state of Ghost, a new blogging platform built on node.js. The project was funded through Kickstarter and it’s currently available for download. It looks promising.
  • Vyacheslav Egorov. Performance and benchmarking. Review of what to do and what not to do when writing performance tests in JavaScript. A very nice and enjoyable talk to watch.
  • James Halliday. Modularidade para todos. Good insight into the anarchic, but operative, structure and organization of npm. Very fun talk.
  • Raquel Velez. NodeBots. Good introduction to hacking on robots with JavaScript.
  • Andrew Hebbit. Nodecopter Hacks. Andrew is an organizer of many nodecopter events in the UK. In this talk he gives a review of all the things he and other people have hacked on a nodecopter. Andrew also helped with the NodeCopter event held the day after the conf.

There were some other interesting talks, especially on gaming, infrastructure and development, but whether those are interesting for you or not depends very much on your needs. This was just a brief selection of the ones I particularly enjoyed.

Modularidade para todos!!

Lastly, I’d like to thank all the people involved in the organization of the event for their work, their time, the parties and for making sure everything worked just perfectly; a big hug to all the people I met there during those two days and, last but not least, thanks to Igalia for sponsoring my attendance. I hope to be there again next year.

Hacking nodecopters at the Nodecopter workshop

And of course, parties!

Farewell party at Ministerium

Categories: igalia, javascript Tags:

Kandy: a Google Reader app for your Kindle

March 14th, 2013

For the last couple of weeks I’ve been working on a Google Reader web app for the Kindle. I love my Kindle and I use it for reading Wikipedia and blog posts too. Lately I was using it to read Google Reader, but the interface, aimed at desktop computers, was hard to use on the Kindle.

So, as a pet project, I started to code a new web app that could access the Unofficial Google Reader API and present the subscriptions and entries in a more pleasant way.

Today Google announced it will discontinue Google Reader on July 1st. Kandy is still very basic but it’s functional. There were things I wanted to polish and I had some other ideas in mind, but with today’s news I don’t feel like investing more time in this app.

You can check Kandy online here: http://kandy.herokuapp.com (better accessed from your Kindle :)). The code is available on GitHub: https://github.com/dpino/Kandy

Enjoy it while you can :)

 

Categories: google, igalia, javascript, web Tags:

Some Vim tuning

February 17th, 2013

Recently I took some time off to improve my Vim skills. My rule of thumb is that whenever you find yourself doing a repetitive task, that’s a hint there might be a better way of doing it. I learned some new tricks and discovered a few extensions.

First, I rediscovered Exuberant Ctags. Exuberant Ctags is an external tool that builds an index of the symbols in your source code, so you can navigate through the code more comfortably. First run ctags from the project home folder:

$ ctags -R -f .tags

Then open a source code file. Position the cursor over a function call, and press Ctrl + ] to jump to the declaration of that function. Press Ctrl + T to come back.

By default, Vim searches for a file named tags. I like to name it .tags, so I adjusted my .vimrc file like this:

set tags=./tags,./.tags,tags,.tags;

As a side note, another great discovery was using Ubuntu One for synchronizing my .vimrc file and .vim folder on my desktop and laptop.

With regard to new Vim commands, I discovered ci{. Amazing, I think I won’t be able to live without it now. It deletes all the code within the inner brace block and leaves you in insert mode. A similar trick can be applied to the condition of an if by typing ci(.

With regard to plugins, I installed a few and I’m still trying most of them out, but there are some that I now consider a must in any Vim setup.

  • Pathogen. To handle vim plugins.
  • NERDTree. To navigate through the filesystem. Mapping Ctrl + N to open the navigation window makes it very convenient. Before NERDTree, what I used to do was open a shell with :!sh, locate the file, go back to Vim and open it.
  • NERDCommenter. This is so good! A plugin that understands how to properly comment code in a myriad of programming languages. Simply visually select a block and type “\c<space>” to comment it; do the same to uncomment. Very convenient for programming languages that only allow per-line comments, such as Perl.
  • Fugitive. A nice git plugin. At first I thought I wouldn’t use it, as I like Disintegrated Development Environments, but the :Gblame command is so good it already makes the plugin worth it. I’m eager to learn more features of this plugin.
  • Supertab. Text completion by pressing Tab. Before that, I used a function I had copied from somewhere into my .vimrc.
  • SnipMate. A snippet manager. I think it’s an implementation of the same feature from TextMate. I’ve yet to learn how to use it properly, but it’s already saving me some typing, especially for loops.
  • ZenCoding. I’ve known about it for a long time, although I never used it much. Zen Coding is a special syntax for writing CSS and HTML. The result is that you can write more with less typing, and the syntax follows very much the same principles as CSS selectors. So, if you know jQuery or CSS you’ll get it very fast. There’s support for various editors.

And this is what I’m using now. Now that I know about these things, I ask myself why I waited so long to pause and learn about them. However, I think I know the answer.

Once we learn how to do something, even if it’s not the most efficient way of doing it, we stick to it as it allows us to get our way through. It does the job, and at the end of the day that’s what really matters. The point is that we never think about how much time we waste daily when we are not doing things in the best possible way. And that’s particularly true when doing things better doesn’t cost a big effort.

To sum up, next time you get absorbed by a software project, don’t fool yourself into thinking that only writing the project’s code is work. Work is also everything that allows us to work better, and that includes learning a new tool, learning more about your favourite editor or reading a book or an article.

Categories: Uncategorized Tags:

Connecting HBase to Hadoop

December 14th, 2012

Continuing with the series about Hadoop, this post covers how to connect HBase and Hadoop together. This makes it possible, for instance, to feed a MapReduce job from an HBase database or to write MapReduce results to an HBase table.

Taking the Hadoop Word Count example as a starting point, I’m going to change it so it writes its output to an HBase table instead of to the filesystem.

The first thing is to add the HBase dependency to the pom.xml.

<!-- HBase -->
<dependency>
   <groupId>org.apache.hbase</groupId>
   <artifactId>hbase</artifactId>
   <version>0.90.5</version>
</dependency>

In my case I’m using HBase 0.90.5, which is compatible with my current version of Hadoop (1.0.2). Before trying a higher version of Hadoop or HBase, check that both versions are compatible with each other. Upgrading either tool may imply modifications to the example code below.

I also need to create a table to write the results to.

$ hbase shell
> create 'words', 'number'

In the words table each row key will be a word, which is unique, and the number column family will store the number of repetitions of that word.

Writing results to an HBase table basically means changing the Reducer task. Other changes are needed too, such as the job setup.

public static class Reduce extends
      TableReducer<Text, IntWritable, ImmutableBytesWritable> {

   @Override
   protected void reduce(Text key, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException {

      int sum = 0;
      for (IntWritable value : values) {
         sum += value.get();
      }
      // toBytes() is Bytes.toBytes(), imported statically
      Put put = new Put(toBytes(key.toString()));
      put.add(toBytes("number"), toBytes(""), toBytes(sum));
      context.write(null, put);
   }
}

// Changes in the job setup:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

FileInputFormat.setInputPaths(job, new Path("/tmp/wordcount/in"));
TableMapReduceUtil.initTableReducerJob(
   OUTPUT_TABLE,
   WordCount.Reduce.class,
   job);

Code is available at Hadoop-Word-Count under the remote branch word-count-hbase-write.

Likewise, it’s possible to read the input of a MapReduce job from an HBase table and write the results to the filesystem or to another HBase table. Continuing with this example, I’m going to modify the Mapper class so it reads its contents from an HBase table.

The first thing is to bulk load some files into an HBase table. To ease this step I created a basic tool called HBaseLoader. To run it:

$ mvn exec:java -Dexec.mainClass=com.igalia.hbaseloader.HBaseLoader \
   -Dtablename=files -Ddir=dir

And this is how the Mapper changes:

public static class MapClass extends
      TableMapper<Text, IntWritable> {

   @Override
   protected void map(ImmutableBytesWritable key, Result row,
         Context context) throws IOException, InterruptedException {
      // Do stuff
   }
}

// In the job setup ('scan' is a configured Scan instance):
TableMapReduceUtil.initTableMapperJob(
   INPUT_TABLE,
   scan,
   WordCount.MapClass.class,
   Text.class, IntWritable.class,
   job);
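
For reference, here is a minimal sketch of what the map body of the MapClass stub above could look like. The column family and qualifier names are assumptions (the real names depend on how HBaseLoader stores the file contents), and the usual Hadoop, HBase and java.util imports are assumed too, so treat this as an illustration rather than the code from the repository:

public static class MapClass extends
      TableMapper<Text, IntWritable> {

   private final static IntWritable one = new IntWritable(1);
   private Text word = new Text();

   @Override
   protected void map(ImmutableBytesWritable key, Result row,
         Context context) throws IOException, InterruptedException {
      // Hypothetical layout: the raw text lives in the "content" family,
      // empty qualifier. Adjust to whatever HBaseLoader actually uses.
      byte[] contents = row.getValue(Bytes.toBytes("content"),
            Bytes.toBytes(""));
      if (contents == null) {
         return;
      }
      // Same word counting as the original example, but fed from HBase
      StringTokenizer itr = new StringTokenizer(Bytes.toString(contents));
      while (itr.hasMoreTokens()) {
         word.set(itr.nextToken());
         context.write(word, one);   // emit (word, 1)
      }
   }
}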

You can check this implementation at https://github.com/dpino/Hadoop-Word-Count.git under the word-count-hbase-read-write branch.

Categories: big data, hadoop, igalia Tags:

Introduction to HBase and NoSQL systems

October 31st, 2012

HBase is an open source, non-relational, distributed database modelled after Google’s BigTable (Source: HBase, Wikipedia). BigTable is a data store that relies on GFS (Google Filesystem). Since Hadoop is an open source implementation of GFS and MapReduce, it made perfect sense to build HBase on top of Hadoop.

Usually HBase is categorized as a NoSQL database. NoSQL is a term often used to refer to non-relational databases: graph databases, object-oriented databases, key-value data stores and columnar databases are all NoSQL databases. In recent years there has been growing interest in this type of system, as the relational model has proved ineffective for solving certain problems, especially those related to storing and handling large amounts of data.

In the year 2000, Berkeley researcher Eric Brewer formulated what is now known as the CAP theorem. This theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency. All nodes see the same data at the same time.
  • Availability. A guarantee that every request receives a response about whether it was successful or failed.
  • Partition tolerance. The system continues to operate despite arbitrary message loss or failure of part of the system.

According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three (Source: CAP theorem, Wikipedia).

Usually NoSQL systems are depicted within a triangle representing the CAP theorem, with each vertex being one of the aforementioned guarantees: consistency, availability and partition tolerance. Each system is located on one of the sides of the triangle, depending on the pair of guarantees it favours.

Visual Guide to NoSQL Systems (Source: http://blog.nahurst.com/visual-guide-to-nosql-systems)

As shown in the figure above, HBase is a columnar database that guarantees consistency of data and partition tolerance. On the other hand, systems like Cassandra or Tokyo Cabinet favour availability and partition tolerance.

Why have NoSQL systems become relevant in recent years? Over the last 20 years the storage capacity of hard drives has multiplied by several orders of magnitude; however, seek and transfer times have not evolved at the same pace. Websites like Twitter receive more data every day than they can write to a single hard drive, so data has to be written across clusters. Twitter users generate more than 12 TB per day, about 4 PB per year. Other sites, such as Google, Facebook or Netflix, handle similar figures, which means facing the same type of problems. But big data storage and analysis is not something that only affects websites; for instance, the Large Hadron Collider in Geneva produces about 12 PB per year.

HBase Data Model

HBase is a columnar data store, also called a tabular data store. The main difference between a column-oriented database and a row-oriented database (RDBMS) is how data is stored on disk. Check how the following table would be serialized using a row-oriented and a column-oriented approach (Source: Columnar Database, Wikipedia).

EmpId  Lastname  Firstname  Salary
1      Smith     Joe        40000
2      Jones     Mary       50000
3      Johnson   Cathy      44000

Row-oriented

1,Smith,Joe,40000;
2,Jones,Mary,50000;
3,Johnson,Cathy,44000;

Column-oriented

1,2,3;
Smith,Jones,Johnson;
Joe,Mary,Cathy;
40000,50000,44000;

Physical organization has an impact on features such as partitioning, indexing, caching, views, OLAP cubes, etc. For instance, since values of the same column are stored together, column-oriented stores excel at data aggregation operations.

In HBase data is stored in tables, the same as in the relational model. Each table consists of rows, each identified by a RowKey. Each row has a fixed number of column families. Each column family can contain a sparse number of columns. Columns also support versioning, which means that different versions of the same column can exist at the same time. Versioning is usually implemented using a timestamp.

So, to fetch a value from a table we will need to specify three values: <RowKey, ColumnFamily, Timestamp>.
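
As an illustration, this is roughly how such a fetch looks with the Java client API of the 0.90.x line used in these posts. The table and column names match the students example below; no explicit timestamp is set, so the most recent version is returned. Treat it as a sketch rather than canonical code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class FetchValue {
   public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "students");

      // RowKey + ColumnFamily (prefix:qualifier); without a timestamp
      // the latest version of the cell is returned
      Get get = new Get(Bytes.toBytes("Student1"));
      get.addColumn(Bytes.toBytes("courses"), Bytes.toBytes("history"));

      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("courses"),
            Bytes.toBytes("history"));
      System.out.println(Bytes.toString(value));   // e.g. H0112

      table.close();
   }
}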

Perhaps the most obscure concept of this model is column families. Column families consist of two parts: a prefix and a qualifier. The prefix is always fixed and has to be specified when the table is created. However, the qualifier is dynamic and new qualifiers can be added to a prefix at run time. This makes it possible to create an unbounded collection of columns inside a column family. Take a look at the table below representing information about students.

RowKey    Timestamp  ColumnFamily
Student1  t1         courses:history="H0112"
Student1  t2         courses:math="M0212"
Student2  t3         courses:history="H0112"
Student2  t4         courses:geography="G0112"
Student2  t5         courses:geography="G0212"

It is possible to add new courses just by storing new columns with the prefix courses. To get the code of the history subject Student1 is enrolled in, we need to provide three values: Student1; courses:history; timestamp. It is also possible to retrieve all the values of a column that has been written several times with different timestamps. In the table above, we can see how Student2 enrolled in the Geography course (timestamp=t4, courses:geography="G0112") and then enrolled again because the code of the subject changed (timestamp=t5, courses:geography="G0212").

Column families work as a sort of light schema for the tables. Column families have to be specified when a table is created, and it is actually very hard to modify this schema afterwards. However, as the qualifier part of a column family is dynamic, it is very easy to add new columns to existing column families.

For further explanation about HBase Data Model I recommend the following articles: Hadoop Wiki –  HBase DataModel, Understanding HBase and BigTable.

Features of HBase

HBase is built on top of Hadoop, which means it relies on HDFS and integrates very well with the MapReduce framework. Relying on HDFS provides a series of benefits:

  • A distributed data storage running on top of commodity hardware
  • Redundancy of data
  • Fault tolerance

In addition, HBase provides a series of further benefits, among other features:

  • Random reads and writes (this is not possible with plain Hadoop).
  • Autosharding. Sharding, horizontal data distribution, is done automatically.
  • Automatic failover based on Apache Zookeeper.
  • Linear scaling of capacity. Just add new nodes as you need them.

Installing HBase

HBase depends on Hadoop, so it is necessary to install Hadoop before installing HBase. Currently there are several Hadoop branches being developed at the same time, so it is strongly recommended to install an HBase version that is compatible with your specific version of Hadoop (v1.0, v0.22, v0.23, etc). HBase 0.90.6 is compatible with Hadoop 1.x. To install HBase, first download the hbase-0.90.6 tarball from an Apache mirror.

Now, uncompress it:

$ sudo tar xfvz hbase-0.90.6.tar.gz

Run HBase:

$ bin/start-hbase.sh

By default, the HBase master web UI listens on port 60010. When an HBase server is running it is possible to check its state by connecting to http://localhost:60010.

Interacting with the HBase shell

First, start an HBase shell:

$ hbase shell

Once you are in an HBase session, create a new table:

> create 'students', 'courses'

Now insert some data into the table ‘students’.

> put 'students', 'Student1', 'courses:history', 'H0112'
> put 'students', 'Student1', 'courses:math', 'M0212'
> put 'students', 'Student2', 'courses:history', 'H0112'
> put 'students', 'Student2', 'courses:geography', 'G0112'
> put 'students', 'Student2', 'courses:geography', 'G0212'

And now try the command ‘scan’ to show the contents of a table.

> scan 'students'
ROW                                    COLUMN+CELL
Student1                              column=courses:history, timestamp=1351332046854, value=H0112
Student1                              column=courses:math, timestamp=1351332046914, value=M0212
Student2                              column=courses:geography, timestamp=1351332047022, value=G0212
Student2                              column=courses:history, timestamp=1351332046950, value=H0112
2 row(s) in 0.0390 seconds

The command ‘scan’ returns two rows (Student1 and Student2). Notice that the column ‘courses:geography’ was inserted twice with different values, but only one value is shown. One of the features of HBase is versioning of data; this means that different versions of the same data can exist in the same table. Generally, this is implemented via a timestamp. So, why don’t those two values show up? To make them appear, it is necessary to tell scan how many versions of a row we would like to retrieve.

> scan 'students', {VERSIONS => 3}
ROW                                    COLUMN+CELL
Student1                              column=courses:history, timestamp=1351332046854, value=H0112
Student1                              column=courses:math, timestamp=1351332046914, value=M0212
Student2                              column=courses:geography, timestamp=1351332047022, value=G0212
Student2                              column=courses:geography, timestamp=1351332046990, value=G0112
Student2                              column=courses:history, timestamp=1351332046950, value=H0112
2 row(s) in 0.0170 seconds

By default, when a table is created, the number of versions per Column Family is 3. However, it is possible to specify more:

> create 'students', {NAME => 'courses', VERSIONS => 10}

To fetch one Student:

> get 'students', 'Student2'
COLUMN                                 CELL
courses:geography                     timestamp=1351332047022, value=G0212
courses:history                       timestamp=1351332046950, value=H0112
2 row(s) in 0.3100 seconds

The result displays all the current subjects Student2 is enrolled in.

If now we want to disenroll Student2 from subject ‘history’:

> delete 'students', 'Student2', 'courses:history'

 

> get 'students', 'Student2'
COLUMN                                 CELL
courses:geography                     timestamp=1351332047022, value=G0212
1 row(s) in 0.0140 seconds

Lastly, we drop the table ‘students’. Dropping a table is a two-step operation. First, disable it:

> disable 'students'
0 row(s) in 2.0420 seconds

And now we can actually drop the table:

> drop 'students'
0 row(s) in 1.0790 seconds

Summary

This post was a brief introduction to HBase. HBase is a columnar database, usually categorized as a NoSQL database. HBase is built on top of Hadoop and shares many concepts with Google’s BigTable, mainly its data model. In HBase data is stored in tables, with each table composed of rows and column families. A column family is a pair prefix:qualifier, where the prefix is fixed and the qualifier is variable. HBase also supports versioning by storing timestamp information for every column family. So, to retrieve a single value from an HBase table, the user has to specify three keys: <RowKey, ColumnFamily, Timestamp>. This way of storing and retrieving data is why HBase is sometimes referred to as a large sparse hash map.

As HBase relies on Hadoop, it comes with the common features Hadoop provides: a distributed filesystem, data node fault tolerance and good integration with the MapReduce framework. In addition, HBase provides other very interesting features such as autosharding, automatic failover, linear scaling of capacity and random reads and writes. The last one is a particularly interesting feature, as plain Hadoop only allows processing the whole dataset in a batch.

And this is all for now. In a following post I will explain how to connect HBase with Hadoop, so it is possible to run MapReduce jobs over data stored in an HBase data store.

Categories: big data, databases, igalia Tags:

Starting with Hadoop

October 14th, 2012

Recently I gave a talk about the Hadoop ecosystem at Igalia’s Master Open Sessions. The presentation is available here: “The Hadoop Stack: An overview of the Hadoop ecosystem”. Together with the presentation, I promised the audience I would post some introductory articles about Hadoop and HBase: how to install them, use them, create a basic Hadoop project, etc.

So in this post I will explain how to create a basic project with access to the Hadoop API. The project will be managed with Maven, and it will be possible to execute Hadoop jobs from Eclipse.

Installing Maven

Maven is a project management tool, similar to Ant or Make, that allows us to build a project, add dependencies and control many different aspects of the project. Maven 3 is the latest version.

$ sudo apt-get install maven3

It is possible to have both Maven 3 and Maven 2 on the same system. In case you already had Maven 2 installed, installing Maven 3 will make the mvn command point to Maven 3. The command mvn2 will execute Maven 2.

$ mvn --version

/usr/java/default
Apache Maven 3.0.3 (rNON-CANONICAL_2011-10-11_11-57_mockbuild; 2011-10-11 13:57:02+0200)
Maven home: /usr/share/maven
Java version: 1.6.0_25, vendor: Sun Microsystems Inc.
Java home: /usr/java/jdk1.6.0_25/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.1.9-1.fc16.i686", arch: "i386", family: "unix"

Now install Eclipse. If you don’t have it on your system, go to http://www.eclipse.org/downloads/, download Eclipse Classic and install it.

Installing the Maven plugin for Eclipse

Together with Eclipse we will install the Maven plugin for Eclipse. This plugin makes working with Maven from Eclipse easier. The plugin is called M2E. To install it follow these steps:

  • Help->Install new software…
  • Click on Add and set the following values and click Ok:
  • Name: m2e
  • Location: http://download.eclipse.org/technology/m2e/releases/
  • Click Next to install plugin

Creating a Maven project

Now we will create a basic skeleton of a project:

  • File->New->Project…
  • Select->Maven->Maven project. Next.
  • Click Next again.
  • Archetype: org.apache.maven.archetypes, maven-archetype-quickstart (selected option by default)
  • Set group, artifact and version for project:
  • Group: com.igalia
  • Artifact: wordcount
  • Version: 0.0.1-snapshot
  • Click Finish.

To run the project from Eclipse:

  • Right-click on project
  • Run as->Java application

or simply type ‘Shift + Alt + X’ and then ‘J’.

To run the project from the command line, it is first necessary to add the Exec Maven Plugin. All Maven projects have a pom.xml file in their root folder. This file contains all the descriptions needed to manage a software project (configurations, plugins, dependencies, etc). Open it and add the following:

<!-- Exec Maven Plugin -->
<build>
    <plugins>
        <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <version>1.2.1</version>
            <executions>
                <execution>
                    <goals>
                        <goal>java</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <mainClass>com.igalia.wordcount.App</mainClass>
            </configuration>
        </plugin>
    </plugins>
</build>

It is also a good idea to add a section specifying compiler configuration. This way we can tell Maven which version of Java to use to compile our project. Add the following under <build><plugins>:

<!-- Compiler configuration -->
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-compiler-plugin</artifactId>
    <configuration>
        <verbose>true</verbose>
        <source>1.6</source>
        <target>1.6</target>
        <encoding>UTF-8</encoding>
    </configuration>
</plugin>

Now to run the project from command line:

mvn exec:java -Dexec.mainClass="com.igalia.wordcount.App"

The output will show:

Hello World!

Adding Hadoop as a dependency

The development of Hadoop is very active. Currently, there are several branches going on; the latest one is 0.24. At the same time, the 0.20, 0.22 and 0.23 branches are also being developed. The current 1.0 version is a fork of 0.20.205. We will use the latest version of the 1.0 branch.

<dependencies>
    <!-- Hadoop -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.0.2</version>
    </dependency>
</dependencies>

Now compile the project:

mvn clean install

Check that the variable M2_REPO is added to the list of accessible libraries:

  • Right-click on Project, select Properties (Alt + Enter)
  • Click on Java Build Path option.
  • Click on the Libraries tab.
  • Click on the Add Variable button.
  • Select variable M2_REPO and click OK.
  • Click OK again to close the window.

Now we should get access to the Hadoop API. Let’s add the following statement in the main method to check if Eclipse can resolve the import.

public static void main( String[] args ) {
    IntWritable one = new IntWritable(1);

    System.out.println("This is a one:" + one);
}

Run it:

mvn exec:java -Dexec.mainClass="com.igalia.wordcount.App"

Output:

This is a one:1

NOTE: Pressing Ctrl + Shift + O will make Eclipse include the import statement automatically. If Eclipse still cannot import the library, try refreshing the Maven dependencies: right-click on the project, Maven->Update dependencies.

Introduction to MapReduce

Remember that Hadoop is two things:

  • A distributed file system (HDFS)
  • An implementation of MapReduce.

MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks spread over a large number of computers (nodes). It consists of two steps:

  • Mapping step. The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. The worker node processes the smaller problem, and passes the answer back to its master node.
  • Reduce step. The master node then collects the answers to all the sub-problems and combines them in some way to form the output.

MapReduce was introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.

The WordCount example

The WordCount example is the Hello World of MapReduce and Hadoop. It consists of a MapReduce job that counts the number of words in a set of files. The Map function distributes the dataset to many nodes. Each node counts the words in its smaller dataset. After that, the results are combined by the Reduce function.

Firstly, we will add a new class called WordCount.

public class WordCount extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // TODO Auto-generated method stub
        return 0;
    }

}

And in the main class, App, create a new Configuration and run the WordCount job through ToolRunner.

public static void main( String[] args ) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
}

Now, we should implement the MapReduce job inside the class WordCount. In pseudo-code it would be like this:

mapper (filename, file-contents):
   for each word in file-contents:
      emit (word, 1)

reducer (word, values):
   sum = 0
   for each value in values:
      sum = sum + value
   emit (word, sum)

So now we proceed to implement the MapClass class.

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}

The mapper receives a pair (key, value) and will emit another pair (key, value). The input pair is formed by two values: a Long and a Text. The Long is the offset within the file that contains the Text; the Text is a piece of text from a file. The output pair is formed by a Text and an Int. The Text (the key) will be a word and the Int (the value) a 1. So what we are doing is emitting a 1 for each word of the text received as input. Later, all these output pairs will be grouped, sorted and shuffled together in a process called Sort & Shuffle. The reducers will receive an input pair formed by a key and a list of values.

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

The reducer takes each word and the list of ones emitted for it, sums them up and returns another pair (key, value) where the key is the word and the value is the number of times that word was repeated in the text.

Lastly, we need to implement the run method. Generally, here is where the MapReduce job is setup.

public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf());
    job.setJarByClass(WordCount.class);
    job.setJobName("wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(MapClass.class);
    job.setCombinerClass(Reduce.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.setInputPaths(job, new Path("/tmp/wordcount/in"));
    FileOutputFormat.setOutputPath(job, new Path("/tmp/wordcount/out"));

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;   // exit code 0 on success
}

Place the files to count words from at /tmp/wordcount/in. Note that the output directory /tmp/wordcount/out must not exist beforehand; Hadoop creates it and will refuse to run the job if it is already there. After running the job there should be a part-* file (e.g. part-00000) at /tmp/wordcount/out with the results.

And this is all for the moment. In the next post I will give a brief introduction to HBase: how to install it, basic commands and how to interface HBase with Hadoop.

Lastly, you can check out all the code explained in this post here: Hadoop-Word-Count

Categories: big data, igalia, java Tags:

Metamail: MapReduce Jobs Benchmarking

August 14th, 2012

Thanks everybody for the feedback about Metamail. Some people have asked me about the performance of the MapReduce jobs. I’ve added some code to the git repo to benchmark the jobs. I’ve executed each job 20 times on my laptop and the results are the following:

Job                        Average Time (seconds)  Standard Deviation
Messages By Size           39.94                   3.91
Messages By Thread Length  39.12                   1.60
Messages By Time Period    44.58                   3.18
Messages Received          43.12                   1.41
Messages Sent              37.82                   1.41

More or less all the jobs take the same amount of time, except for ‘Messages By Time Period’ and ‘Messages Received’. Both jobs do more work parsing the body of an email than any other job. ‘Messages By Time Period’ does several calculations in the same job to compute messages by year, month, day of the week and hour of the day. On the other hand, although ‘Messages Received’ and ‘Messages Sent’ are similar tasks and therefore should take the same amount of time, the code for parsing the ‘To:’ header is more expensive than the code for parsing the ‘From:’ header.

What does take a lot of time is importing the more than 500,000 mails (2.6 GB) that make up the Enron Email Dataset into HBase: approximately 16 minutes. When I started coding Metamail I used some sample data from my own mail archive, compiled into a single mbox file. The code had to read a file and split it into several records, each record being an email. There was no database. If I remember correctly, a job crunching a 15 MB file, with the same hardware settings, used to take longer than a minute. It definitely looks like a good idea to use HBase to store thousands of small files and feed Hadoop with data records.

Hardware settings:

Laptop: Lenovo X220
OS: Fedora 16
CPU: Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz (4 CPUs)
RAM: 8GB DDR3
HD: INTEL SSDSA2M160G2LE
Categories: big data, igalia, java Tags:

Metamail: Email Analysis with Hadoop

August 7th, 2012

Hadoop has become a sort of standard for processing large amounts of information (Big Data). Once a niche product, Hadoop has seen its adoption by large corporations, like Facebook or Twitter, and by small startups increase dramatically in recent years. Hadoop, an open-source implementation of Google’s MapReduce + GFS, provides a high-level API that makes distributed programming easier.

In the MapReduce paradigm, a dataset is divided into small chunks of data. Every split is processed by a node, transforming a given input into an output (Map). Later, the results of several nodes are put together again and processed (Reduce). MapReduce is a shared-nothing architecture, which means that no node can depend on the results of another node. Surprisingly, many problems can be solved using this approach, from creating a page index to analyzing the interconnections within a network, etc. For the multiple applications of MapReduce I recommend the book “Data-Intensive Processing with MapReduce”.

Together with Hadoop, there’s a whole stack of other tools built on top of it. Such tools are, for example, HBase, a key-value data store built upon HDFS (the Hadoop filesystem); Hive, a sort of SQL for coding MapReduce jobs, initially developed by Facebook; Pig, another high-level MapReduce dialect; or Mahout, a framework for machine learning and data mining. It’s really a big ecosystem. Some people compare the term Hadoop to Linux, in the sense that although it’s a small part of a bigger thing it has come to mean the whole thing. If you want to know more about the Hadoop ecosystem I recommend the talk “Apache Hadoop – Petabytes and Terawatts” by Jakob Homan.

If you have ever taken a look at Hadoop, you have probably coded the famous WordCount example, the Hello World! of Hadoop. The WordCount example reads several files, tokenizes them into words, and counts the total number of words. As with the classic Hello World! example, it’s helpful for understanding the basics but pretty much useless. We wanted to go beyond that and build something useful, a real-world example that could help people understand the potential of Hadoop. With that goal in mind, we decided to implement an email analysis tool.

Metamail is an email analysis tool that relies on Hadoop for data analysis and processing. The application is divided into three modules:

  • Email Analyzer. A collection of MapReduce jobs for computing statistics. For example, number of emails sent by year, by month, by day of the week, etc; size of emails; top 50 thread lengths, etc
  • Database importer. Before running the MapReduce jobs, it’s necessary to import the emails into an HBase database. Later those emails are retrieved and processed by the MapReduce jobs.
  • Web UI. Results from the MapReduce jobs are meaningless if there’s no way to manipulate, understand or visualize those results. The UI shows different charts for each type of statistic crunched.

You can check an online version of the Web UI at http://igalia-metamail.herokuapp.com. As sample data, I have used the Enron Email Dataset, a small dataset (0.5 GB) released during the Enron trial. In any case, the analysis jobs are valid for any email dataset, so it’s possible to use Metamail for analysing your personal email or the email of your organization. You can check the code of the project and instructions about how to deploy it at the Github repository.

Working on this project has been very interesting. It has helped me understand several things about Hadoop and data analysis:

  • Hadoop is pretty much useless on its own. Hadoop provides a wonderful architecture for distributed computing. Compared to other distributed computing architectures like PVM and MPI, it’s definitely much easier to use. However, data analysis, machine learning, natural-language processing, graph processing, etc are the other part of the equation that makes the magic of BigData happen. One thing without the other is useless.
  • Hadoop excels at processing large amounts of data. Initially, I did everything with Hadoop, starting from this article by Cloudera: “Hadoop for Archiving Email”. Hadoop expects the minimum chunk of data to be 64 MB. When the data processed is much smaller than this size, as happens with almost all emails, a previous step consists of packing the data together into a single chunk. Processing the data with Hadoop also required implementing a custom RecordReader for extracting emails from a split. Finally, I decided to store everything in HBase and let HBase feed the data to Hadoop. This approach is good when you want to process pieces of data that are smaller than an HDFS data block.
  • Visualization is essential to understand data. Measuring is the first step to understanding a system, but if you cannot easily visualize the results of your measurements it’s difficult to understand and interpret the data. In Metamail, I have used D3 for visualizing the results. D3 is a JavaScript library for data visualization which is becoming more and more popular. Initially, I also post-processed the results with R, the popular statistical programming language, to generate the charts. Both tools are commonly used together with Hadoop.
  • People tend to work less on Mondays and Fridays :) I knew that, but now the data confirms it. There’s less email sent on Monday and Friday than on the rest of the working days. I also observed this pattern when analysing some of our email at Igalia. In compensation, there are also people who send emails on Saturday and Sunday (those workaholics :))

Emails by day of week

URL: http://igalia-metamail.herokuapp.com

Categories: big data, igalia, java Tags:

Libreplan: Improving performance

May 31st, 2012

Recently I’ve been working on improving LibrePlan performance. There are many strategies to improve the performance of a web application, from improving JavaScript code and optimizing SQL queries to business logic refactoring. When it comes to a Java application built upon Hibernate, there is a well-known set of features that can be tuned to boost performance.

Hibernate provides some strategies to improve performance. One common strategy is to use batch fetching. A 1-to-N relationship between two entities generally means that entity A has a set of entities B. Take the following LibrePlan relationship between WorkReports and WorkReportLines:

<set name="workReportLines" cascade="all-delete-orphan" inverse="true">
   <key column="work_report_id"/>
   <one-to-many class="WorkReportLine"/>
</set>

This means an entity WorkReport has many WorkReportLines. When the lazy workReportLines collections are initialized, a SQL query is executed for each collection; if many work reports are loaded, this means many queries. Batch fetching allows prefetching the collections of several work reports at once, significantly reducing the number of queries.

<set name="workReportLines" cascade="all-delete-orphan"
   inverse="true" batch-size="10">
   <key column="work_report_id"/>
   <one-to-many class="WorkReportLine"/>
</set>

The book “Java Persistence with Hibernate” recommends using batch sizes between 3 and 15. Hibernate also provides a global parameter (hibernate.default_batch_fetch_size) to turn on batch fetching for all collections; however, I wouldn’t recommend using it, and would turn on batch fetching only for large collections. Lastly, Hibernate also provides other fetching strategies such as join fetching, select fetching and subselect fetching.
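
To picture the kind of access pattern that batch fetching improves, here is a simplified sketch; the HQL query and the getter names are illustrative rather than the exact LibrePlan code, and a Hibernate SessionFactory is assumed to be available:

// Without batch fetching: one query loads the work reports, and then
// one extra query runs for each work report the first time its lazy
// workReportLines collection is touched (the classic N+1 problem).
// With batch-size="10", Hibernate initializes the collections of up
// to 10 work reports in a single query instead.
Session session = sessionFactory.openSession();
List<WorkReport> reports = session.createQuery("from WorkReport").list();
for (WorkReport report : reports) {
   // Accessing the lazy collection triggers its initialization
   int lines = report.getWorkReportLines().size();
   System.out.println(report.getId() + ": " + lines + " lines");
}
session.close();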

Another mechanism to improve performance in Hibernate is the second-level cache. But what does that mean? Well, perhaps to understand what the second-level cache is, I should first explain what the first-level cache is.

The first-level cache is tied to the session lifespan. It’s active by default. When a transaction is being executed, all the objects retrieved are cached in the same session. So, think of the first-level cache as the cache attached to a session (a transaction-scope cache); it allows reusing objects within a session.

But how could we cache objects retrieved across different sessions? That’s what the second-level cache allows. Think of the second-level cache as a process-scope cache. The second-level cache is pluggable, meaning it can be turned on or off. In addition, it can be configured on a per-class and per-collection basis.
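
A quick sketch to picture the difference between the two caches (a SessionFactory and a numeric identifier for Label are assumptions here, not the exact LibrePlan code):

// First-level cache: within the SAME session, the second get() finds
// the object in the session cache and issues no SQL.
Session s1 = sessionFactory.openSession();
Label first = (Label) s1.get(Label.class, 1L);
Label again = (Label) s1.get(Label.class, 1L);   // no SQL issued
s1.close();

// A NEW session starts with an empty first-level cache. Without a
// second-level cache this get() goes back to the database; with the
// second-level cache enabled for Label (see below) it can be served
// from the process-scope cache instead.
Session s2 = sessionFactory.openSession();
Label other = (Label) s2.get(Label.class, 1L);
s2.close();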

To activate second-level cache, first modify your Hibernate default settings.

<property name="hibernate.cache.provider_class">net.sf.ehcache.hibernate.EhCacheProvider</property>
<property name="hibernate.cache.use_second_level_cache">true</property>
<property name="hibernate.cache.use_query_cache">false</property>
<property name="hibernate.cache.provider_configuration_file_resource_path">classpath:ehcache.xml</property>

Then activate caching for a specific class. I do it for Label in LibrePlan.

<class name="Label" table="label">
   <cache usage="nonstrict-read-write"/>
   ...
</class>

One useful tip to check whether the second-level cache is working is to add a log appender to the Log4j configuration.

<appender name="second-level-cache-file">
   <param name="file" value="/tmp/libreplan-second-level-cache.log"/>
   <param name="MaxFileSize" value="5000KB"/>
   <param name="MaxBackupIndex" value="4"/>
   <layout>
      <param name="ConversionPattern" value="%d [%t] %-5p %l - %m%n"/>
   </layout>
</appender>

 

<logger name="org.hibernate.cache">
   <level value="ALL"/>
   <appender-ref ref="second-level-cache-file" />
</logger>

When checking the log, you should see something like this:

2012-05-28 12:05:52,523 [19765316@qtp-4334864-0] DEBUG org.hibernate.cache.ReadWriteCache.get(ReadWriteCache.java:85) - Cache hit: org.libreplan.business.calendars.entities.BaseCalendar#202
2012-05-28 12:06:15,049 [19765316@qtp-4334864-0] DEBUG org.hibernate.cache.ReadWriteCache.put(ReadWriteCache.java:148) - Caching: org.libreplan.business.resources.entities.Resource#1718
2012-05-28 12:06:15,050 [19765316@qtp-4334864-0] DEBUG org.hibernate.cache.ReadWriteCache.put(ReadWriteCache.java:169) - Item was already cached: org.libreplan.business.resources.entities.Resource#1718

Going back to the per-class cache configuration, there are 4 possible values: transactional, read-write, nonstrict-read-write and read-only. Use transactional and read-only for read-mostly data; nonstrict-read-write doesn’t guarantee consistency. Lastly, use read-write for mostly-read data with occasional writes.

Another important part of the second-level cache configuration is the cache provider. Different cache providers support different cache operations and features. I won’t go deeper into this. In LibrePlan we use EhCache, which is the most popular open-source second-level cache provider and is widely used in many Hibernate projects. I recommend the article Hibernate Caching to learn more about fetching, caching and second-level cache providers.

So, after configuring all these settings it was time to do some benchmarking and see what the real gain was. To do the benchmarking I used JMeter; we had already used it in LibrePlan some time ago for measuring performance. The benchmark consisted of a large dataset and 10 use cases. After executing 60 samples for each use case I stopped the benchmark and got the following results:

  • Average. It’s the average response time. On average there was no gain.
  • Aggregate_report_min. It’s the min response time. With cache the min is 0, without cache 1.
  • Aggregate_report_max. It’s the max response time. With cache the max is 9, without cache 34, so there’s a 75% gain.
  • Aggregate_report_stddev. Standard deviation. Without cache it was high because the average is 2 while the min and max are 1 and 34 respectively.

Summarizing, the second-level cache and the new batch-fetching strategies provided a big drop in maximum response time, which also reduces the standard deviation dramatically. Without any doubt, a big gain.

Notice that the benchmark times are measured in milliseconds. To learn more about aggregate reports in JMeter check JMeter Aggregate Report.

And this is all. If you’re a LibrePlan user I hope you enjoyed learning more about the kind of things we do to make LibrePlan run faster. If you’re a LibrePlan developer, or a Java developer in general, I hope you found this information useful and that it can help you in your future projects.

Categories: igalia, java, libreplan Tags:

Back from Fosdem

February 8th, 2012

This was my first Fosdem and I have to say I enjoyed the experience a lot, despite the freezing cold in Brussels last weekend :) The amount and quality of the talks were overwhelming and there were many parties all along the event. I loved the Fosdem atmosphere. Now I understand why so many igalians say Fosdem is one of the best events of the year.

My main goal this Fosdem was to give a talk about LibrePlan. The talk was held on Sunday morning and was scheduled in the Lightning Talk track. I was happy to introduce LibrePlan to so many people and answer some questions that arose after the talk. Thanks everybody for coming. If you couldn’t attend Fosdem this year and would like to know more about LibrePlan, check out the slides and video online (not available yet).

Apart from giving the LibrePlan talk, I also attended some other talks. Fosdem features many tracks with many interesting talks happening at the same time. That makes it really difficult to arrange a solid agenda with all the talks you’d like to see. Fortunately some of the tracks were fully recorded, and those talks will be available online soon.

The event started with the traditional welcome speech at Janson room. Actually, Fosdem started the day before with the also traditional get-all-together meeting at Delirium Tremens :)

After the welcome speech I moved to the room holding the Cloud track. The room was next to Mozilla’s room, which was also holding some talks I was interested in. I ended up spending more time in Mozilla’s room than in any other room. I especially liked the talk ‘Hacking Gecko‘ by Bobby Holley. It was a great introduction to Gecko and how to start hacking on Firefox. Bobby has a deep knowledge of the subject, but he was also very confident and showed a lot of attitude. It was, together with Bdale Garbee’s keynote, the best talk I attended at Fosdem.

At the Mozilla track I also attended:

  • Developing Firefox in 2012: Add-ons, Jetpack, Github and more by Dietrich Ayala. Also a great talk, although Ayala was a bit chilly. It was good to see Ayala in person; I still remember when I was a student about to graduate and I sent him an email regarding his NuSOAP library, one of the first implementations of SOAP for PHP.
  • Hack the web by Jeff Griffiths and Matteo Ferretti. It was a nice introduction to Jetpack, the new mechanism for developing extensions for Firefox.
  • Howto: Extensions for Thunderbird by Jonathan Protzenko. It was an introduction to developing extensions for Thunderbird, with real examples. Developing extensions for Thunderbird has always been harder than for Firefox as there is much less documentation. Fortunately, Jetpack will arrive soon to Thunderbird.
  • The State of Firefox Mobile by Lucas Rocha and Chris Lord. It was a good summary of the challenges of making Firefox Mobile (Fennec) faster. The talk was particularly focused on Fennec for Android. It was a nice and very humble talk; I cannot think of many corporations that would openly acknowledge their product takes up to 30 seconds to boot. I guess the first step to overcoming limitations is to admit them.
  • Boot to Gecko and Web API by Chris Jones and Andreas Gal. Boot to Gecko is an operating system based on open web technologies. The idea is to provide a single open platform where developers can build new applications, instead of targeting three different native mobile platforms (iPhone, Android and Windows). It definitely looks like the way to go and it is good that Mozilla is pushing for it, but convincing developers to drop successful platforms with very proven business models is going to be hard. In any case, Boot to Gecko is still in its infancy.

On Sunday I gave my talk about LibrePlan, and later I was talking with some people interested in LibrePlan. I also got the chance to talk with Apostolos Bessas from Transifex. Transifex is a localization platform that makes localizing applications very easy. We are using it now for LibrePlan and we love it. It was nice to meet you all.

Then in the afternoon I stuck to the Dev track and attended the following talks:

  • The Apache Cassandra storage engine by Silvain Brisbaine. It was a good talk but it went into too much detail on certain topics for my taste. I’d have preferred a broader and more introductory talk.
  • From Dev to Devops by Carlos Sánchez. Before the talk I was completely ignorant about the devops movement. Actually I was not thinking of attending, but Carlos López, fellow igalian, sat next to me and convinced me to stay. I knew Carlos Sánchez from his work maintaining Maven and also because he studied at the University of Coruña and is a friend of some Igalians. He gave an explanatory talk about what the devops movement is about. To put it in one line, devops is basically applying the principles of agile to system operations. Imagine using development tools like Maven, Hudson, etc for system operations. The term is a combination of development + operations. After an introduction to the devops principles, he introduced Puppet, a declarative language for managing system operations.

Fosdem closed with a fascinating keynote by Bdale Garbee. Last year Fosdem introduced a vision called FreedomBox. Every day more and more of our lives is being uploaded online, while we entrust our communications and data to third-party applications. FreedomBox is about taking back control of our own data and building federated networks that can communicate together, knowing that our data won’t be compromised or abused by any corporation or public administration. Bdale Garbee gave a summary of all the progress made during the last year in the FreedomBox project. It was good to see that a foundation has been set up, groups have been structured and things are rolling. However, there’s still much left to do. FreedomBox is a big challenge. I don’t know if it will be successful or even if it will ever be fully implemented one day, but I’m sure something will come out of FreedomBox. Progress and innovation happen when there are constraints and challenges to overcome, and the FreedomBox project is full of them.

Unfortunately, due to the tight schedule I couldn’t attend some of the talks by other Igalians. There were quite a few. API had a talk about the status of accessibility in Gnome 3.4. Guillaume talked about hacking in the real world, with a room packed full of curious people. Mario and Philippe did a review of the latest progress and future roadmap of WebKitGTK+ in WebKitGTK+ status and roadmap to WebKit2. Xan López closed the Cross-desktop track with Web Applications in Gnome, featuring a live demo of Angry Birds that really put the crowd in his pocket.

In summary, Fosdem was great. I’m looking forward to attending next year, with even more fellow igalians if possible. This year we were 14 in total, not bad. By the way, the beer was terrific, but I guess I don’t need to say that ;-)

Categories: igalia Tags: