Contents:
CloudCamp Barcelona 2009
Last Monday, the first CloudCamp ever held in Barcelona took place. Although I was expecting more technical content, it was good to be there and listen to what people had to say.
The first part of the event consisted of quick presentations from different companies related to cloud computing. Basically, they explained the cloud options and advantages they offer. The one I enjoyed most was Abiquo’s presentation of their new software, Abicloud. Through a really nice GUI developed with Flex, Abicloud, among other things, lets you set up virtual machines and automatically configure an Apache server, a MySQL database… with just a few drag & drop actions. You can use your own machines, servers from an ISP or even combine both. You can elastically increase or decrease the number of virtual machines. This can be very convenient for sites with high traffic peaks or for testing environments.
I am not going to say more about it, as a five-minute presentation only gave me the main idea. I can’t wait to have some free time to start playing with it. I will just add that Abicloud is completely open source.
After the quick talks, the following topics were discussed:
- What guarantees do I have with Cloud Computing?
- What legal issues are there with your data?
- Are standards important? If so, which ones?
- What is the benefit for a company with only a few dozen servers?
- Best platform for starting a cloud hosting company?
- Is cloud computing green? If so, how?
In the end, people split into groups depending on which topic they wanted to explore further. I attended “How to develop applications that are going to run in the cloud”. There I had an interesting quick chat about application scalability and how to dump MySQL databases to HDFS using Cloudera’s tool Sqoop.

Performance measurement with JMeter 2.3.3
A new release of JMeter came out last week. JMeter 2.3.3 is a powerful Java application designed for functional testing and performance measurement of web applications, and it lets you run heavy server stress tests.
I have been practicing with it and I really liked how easily you can set up a test plan and start stressing your machines to check response times when lots of threads are making requests.
You just need to create a .jmx file, which will contain all the information needed to make the requests: host name, port number, protocol, method, URL path, URL variables… You can actually tell JMeter to read the URL variables from an external .dat file. This lets you give the variables different values for each request.
The .jmx file can be written manually, but it’s much easier to create it via JMeter’s GUI.
You have to tell JMeter the number of threads that will execute requests and the number of requests per thread. You can also leave the threads making requests indefinitely.
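To make the threads-times-requests idea concrete, here is a minimal Java sketch. It is not JMeter code and it sends no real requests: the class name ThreadGroupSketch and the empty "request" are assumptions made for illustration only. It just shows that N threads each looping M times produce N x M samples.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of a thread group; not part of JMeter.
class ThreadGroupSketch {

    // Runs `request` loopsPerThread times on each of `threads` workers
    // and returns the total number of samples executed.
    static int run(int threads, int loopsPerThread, Runnable request) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger samples = new AtomicInteger();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < loopsPerThread; i++) {
                    request.run();              // stand-in for one HTTP request
                    samples.incrementAndGet();  // one more sample executed
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for every worker to finish its loop.
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return samples.get();
    }

    public static void main(String[] args) {
        int total = run(4, 25, () -> { /* no real request is sent */ });
        System.out.println("samples executed: " + total);
    }
}
```

In a real test plan the empty lambda would be the HTTP sampler, and JMeter takes care of the thread handling for you.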
Once a test is launched, you can see in real time the number of samples that have been executed and the Deviation, Throughput, Average and Median of the requests made by the threads (think of a thread as a user making a request via a browser).
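As a rough idea of what those four figures mean, this Java sketch computes them from a list of response times in milliseconds. The class SampleStats and its formulas are my own assumptions (population standard deviation, samples per second); JMeter’s exact calculations may differ.

```java
import java.util.Arrays;

// Hypothetical helper; the formulas are assumptions, not JMeter's source.
class SampleStats {

    // Arithmetic mean of the response times.
    static double average(long[] ms) {
        double sum = 0;
        for (long v : ms) sum += v;
        return sum / ms.length;
    }

    // Middle value of the sorted times (mean of the two middle ones if even).
    static double median(long[] ms) {
        long[] sorted = ms.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Population standard deviation of the response times.
    static double deviation(long[] ms) {
        double avg = average(ms);
        double squares = 0;
        for (long v : ms) squares += (v - avg) * (v - avg);
        return Math.sqrt(squares / ms.length);
    }

    // Samples per second over the elapsed wall-clock time.
    static double throughput(int samples, long elapsedMs) {
        return samples * 1000.0 / elapsedMs;
    }

    public static void main(String[] args) {
        long[] times = {100, 200, 300, 400, 500};
        System.out.println("average    = " + average(times));
        System.out.println("median     = " + median(times));
        System.out.println("deviation  = " + deviation(times));
        System.out.println("throughput = " + throughput(times.length, 2500));
    }
}
```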
This is just how to build a basic test plan; the application is much more complete than this and has many more interesting features.
Analyzing java heaps with jmap and jhat
Jmap and jhat are a couple of really useful tools for analyzing the memory consumption of a Java program. Both are included in the JDK 1.6, so there is no need to install anything extra.
Jmap lets you create a dump of the Java heap at any moment in the life of your running application. It will contain all the objects and classes alive at that moment. Creating the heap dump is as easy as:
jmap -dump:file=my_stack.bin 4365
Here my_stack.bin is the name of the file where you want the dump, and 4365 is the pid of the Java application process.
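If you would rather trigger the dump from inside the application than from the command line, HotSpot JVMs expose a management bean for it. This is an alternative to jmap, not something jmap itself requires, and it only works on HotSpot-based JVMs. The class name HeapDumpSketch and the file name my_heap.hprof are made up for the example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;

// Hypothetical helper; assumes a HotSpot-based JVM.
class HeapDumpSketch {

    // Writes a heap dump of the current JVM to `target`.
    // liveOnly = true dumps only objects that are still reachable.
    static File dump(File target, boolean liveOnly) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // The target file must not exist yet; some JDK versions also
            // reject names that do not end in .hprof, so we use that suffix.
            bean.dumpHeap(target.getAbsolutePath(), liveOnly);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: dumps into a fresh temporary directory.
    static File dumpToTempDir() {
        try {
            File dir = Files.createTempDirectory("heapdump").toFile();
            return dump(new File(dir, "my_heap.hprof"), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File dumpFile = dumpToTempDir();
        System.out.println("heap dump written to " + dumpFile);
    }
}
```

The resulting file can be opened with jhat in the same way as a dump produced by jmap.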
If you are running a servlet application under a java server and it ends with a:
java.lang.OutOfMemoryError: Java heap space
You can trigger a dump of the Java heap at the moment of the OutOfMemoryError by passing these parameters to the server’s JVM:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/sturlese/stack_test/
This will create a .hprof file (named after the process’s pid) containing the dump in the specified path.
The HeapDumpPath parameter is not compulsory. If we don’t specify it, the dump will be created in the working directory of the JVM (with Tomcat, the folder it was launched from).
Now we have the dump of the Java heap. To analyze it we will use jhat. Once we launch jhat with the dump to analyze, it starts an HTTP server (on port 7000 by default) and lets you browse all the classes and objects. You will be able to check how many instances of each class were alive at the moment the dump was created. To launch jhat:
jhat my_stack.bin
It’s easy to get an OutOfMemoryError when opening the heap dump. Loading the dump file can take a lot of memory if your app was using a lot at the moment it was taken. If you run into this problem, give the JVM as much memory as you can:
jhat -J-mx2000m my_stack.bin
Now it is time to point your web browser to http://localhost:7000 and start analyzing the heap!
JAD Java Decompiler
Today I needed to check some old Java source of which I had kept only the class files.
Finding a Java decompiler for my Ubuntu was not as easy as I thought. I couldn’t find one in the repositories, and everything I found on the net was out of date.
JAD Java Decompiler is definitely not new, but it is really easy to use and did a pretty good job for me. The problem was that almost all links pointed me to http://www.kpdus.com, but the software is not available there anymore.
In the end I found it, not just for Ubuntu but for other platforms as well.
Here is the JAD version for Ubuntu (and other Linux distributions) that worked for me.
Lucene TrieRangeQuery
Lucene TrieRangeQuery is a cool contrib in Lucene (I think it is not yet in the official release) created by Uwe Schindler. I had heard about it before, but I really learned about it at the Lucene meetup at ApacheCon EU. Uwe gave a great talk about it. As I found it a really useful feature, I will try to explain the basics.
TrieRangeQuery mainly solves some RangeQuery problems:
- A typical RangeQuery can end in a BooleanQuery.TooManyClauses exception if the ranges are too large.
- A typical RangeQuery, or even a ConstantScoreRangeQuery, is slow when it has to match large ranges or the index is huge.
To put it simply, what TrieRangeQuery does is search the data values skipping the less relevant “digits”, depending on a precision parameter.
Let’s say, for example, we need to match thousands of 6-digit numbers. This could be a slow process using ConstantScoreRangeQuery on a huge index, but not with TrieRangeQuery. Ranges are divided recursively according to a precision parameter (set at index time). Numbers in the middle of the range are matched using the lowest precision, while numbers at the extremes use a higher precision. This makes the query run much faster.
Depending on the precisionStep parameter given at index time, we will be able to search with more or less precision. The more precision levels we choose, the more space the Lucene document will take up. This is because we have to index the field several times, once for each precision.
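The splitting idea can be illustrated with a small Java sketch. This is a deliberately simplified decimal version that I wrote for this post, not Lucene’s implementation: the real code works on bits and uses precisionStep, while here each level simply drops one decimal digit. The class TrieSplitSketch and the "x" notation for dropped digits are assumptions. A term like 43x stands for every number from 430 to 439.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical decimal illustration of trie range splitting; not Lucene code.
class TrieSplitSketch {

    // Returns the prefix terms that together cover [lower, upper] exactly.
    // Assumes 0 <= lower <= upper.
    static List<String> split(long lower, long upper) {
        List<String> terms = new ArrayList<>();
        collect(lower, upper, 0, terms);
        return terms;
    }

    // `level` is the number of low digits already dropped from lo and hi.
    private static void collect(long lo, long hi, int level, List<String> out) {
        // Values at the lower edge that do not start a full block of ten
        // must be matched at this (higher) precision.
        while (lo <= hi && lo % 10 != 0) out.add(term(lo++, level));
        // Same for values at the upper edge that do not end a full block.
        while (lo <= hi && hi % 10 != 9) out.add(term(hi--, level));
        // What remains is made of full blocks, so drop one more digit.
        if (lo <= hi) collect(lo / 10, hi / 10, level + 1, out);
    }

    // Renders a prefix with one 'x' per dropped digit, for example "43x".
    private static String term(long prefix, int level) {
        StringBuilder sb = new StringBuilder(Long.toString(prefix));
        for (int i = 0; i < level; i++) sb.append('x');
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> terms = split(423, 642);
        System.out.println(terms.size() + " terms instead of 220 values: " + terms);
    }
}
```

For the range 423 to 642, the edges (423 to 429 and 640 to 642) are matched digit by digit, while the middle of the range collapses into a single low-precision term, 5xx.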
We need to index data in a special way to be able to search it using Lucene TrieRangeQuery. We must index our fields using TrieUtils. We can index numbers directly. It supports Java signed int, long, float and double. There is no loss of precision for doubles or floats. There is no rounding involved in creating them; instead, a long/int representation is used for cents.
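How can a double be indexed as a long without losing precision? A common technique is to take the raw IEEE 754 bits and adjust them so that comparing the longs gives the same order as comparing the doubles. The Java sketch below shows that general trick; the class and method names are mine, and I am not claiming this is exactly what TrieUtils does internally.

```java
// Hypothetical illustration of a sortable double encoding; not the TrieUtils source.
class SortableDoubleSketch {

    // Maps a double to a long whose signed order matches the numeric order.
    static long toSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative doubles have the sign bit set and sort "backwards" as raw
        // bits; flipping the other 63 bits puts them in ascending order.
        if (bits < 0) bits ^= 0x7fffffffffffffffL;
        return bits;
    }

    // Exact inverse of toSortableLong: no rounding happens in either direction.
    static double fromSortableLong(long sortable) {
        if (sortable < 0) sortable ^= 0x7fffffffffffffffL;
        return Double.longBitsToDouble(sortable);
    }

    public static void main(String[] args) {
        double[] values = {-2.5, -1.5, 0.0, 0.1, 2.25};
        for (double v : values) {
            long encoded = toSortableLong(v);
            System.out.println(v + " -> " + encoded + " -> " + fromSortableLong(encoded));
        }
    }
}
```

Because the encoding only rearranges bits, decoding gives back exactly the original value.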
Indexing numbers with TrieUtils lets us forget about manual padding.
We can index dates as well (from Java timestamps).
As seen, Lucene TrieRangeQuery is definitely a step forward for the scalability of Lucene queries.
ApacheCon Europe 2009
Last week I had the chance to go to ApacheCon Europe 2009. The event took place at the Mövenpick Hotel, Amsterdam. I had a really good time there.
It was good to share use cases and experiences in person with people I had only spoken with in forums.
I spent the first two days in the hackathon doing some research and testing different ASF projects. I took a special interest in Pig.
There were really interesting talks. I found the Mahout project especially great. I had discovered it at ApacheCon 2008 in New Orleans, where I had barely heard about it, but I paid more attention this time and it looks full of possibilities. It is used for machine learning and runs on Hadoop.
It was also good to get some info about Servlet 3.0 and learn about the servlet doFilter function and some other stuff.
HBase is another project I was interested in. It looks good for use as a “data warehouse”, but it seems really difficult (at least at first impression) to deal with the stored data.
The meetups were very good too. There was a presentation about the new Lucene contrib TrieRangeQuery. It is still not available in the official release, but you can use it by grabbing a nightly build. In the next few days I will try to write in more detail about this and other projects presented.
Lucene 2.4.1 available from today
A new official release of Lucene is now available! Lucene 2.4.1 is a bug fix version.
We will see more new features in the Lucene 2.9 release (currently available as a development version).
Here are all the improvements in Lucene 2.4.1, which I took from the official Lucene site:
- Fixed silent data-loss case whereby binary fields are truncated to 0 bytes during merging if the segments being merged are non-congruent (same field name maps to different field numbers).
- Don’t throw incorrect IllegalStateException from IndexWriter.close() if you’ve hit an OOM when autoCommit is true.
- If IndexReader.flush() is called twice when there were pending deletions, it could lead to later false AssertionError during IndexReader.open.
- Fix false AlreadyClosedException from IndexReader.open (masking an actual IOException) that takes String or File path.
- Multiple-valued NOT_ANALYZED fields can double-count token offsets.
- Ensure IndexReader.reopen() does not result in incorrectly closing the shared FSDirectory. This bug would only happen if you use IndexReader.open with a File or String argument.
- Fix possible overflow bugs during binary searches.
- Fix CachingWrapperFilter to not throw exception if both bits() and getDocIdSet() methods are called.
- Fix int overflow bug during segment merging.
- Fix int overflow bug when flushing segment.
- Fix deadlock in IndexWriter.addIndexes(IndexReader[]).
- NearSpansOrdered returns payloads from first possible match rather than the correct, shortest match; Payloads could be returned even if the max slop was exceeded; The wrong payload could be returned in certain situations.
- Add Analyzer.close() to free internal ThreadLocal resources.
- Fix IndexWriter.addIndexes(IndexReader[]) to properly rollback IndexWriter’s internal state on hitting an exception.
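The binary search overflow mentioned in the list is a classic bug, so here is a generic Java illustration of it. This is not the Lucene code that was fixed; the class MidpointSketch is just an example I made up to show why computing the midpoint as (low + high) / 2 breaks for large int values, and two usual ways to avoid it.

```java
// Generic illustration of the midpoint overflow; not taken from Lucene.
class MidpointSketch {

    // Overflows when low + high exceeds Integer.MAX_VALUE.
    static int naiveMid(int low, int high) {
        return (low + high) / 2;
    }

    // Safe: high - low never overflows when 0 <= low <= high.
    static int safeMid(int low, int high) {
        return low + (high - low) / 2;
    }

    // Also safe for non-negative inputs: the unsigned shift treats the
    // overflowed sum as an unsigned value before halving it.
    static int shiftMid(int low, int high) {
        return (low + high) >>> 1;
    }

    public static void main(String[] args) {
        int low = 1_500_000_000;
        int high = 2_000_000_000;
        System.out.println("naive: " + naiveMid(low, high));
        System.out.println("safe:  " + safeMid(low, high));
        System.out.println("shift: " + shiftMid(low, high));
    }
}
```

With these inputs the naive version returns a negative index, while the other two return 1750000000.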
Index scalability using Pig
Here is a really interesting example of how to build an inverted index using Pig. As far as I have seen in Hadoop, to create a Lucene index you must start from a text file and use MapReduce jobs to build it. Pig, however, allows you to retrieve data not just from a text file but also from SQL databases, HBase or other data sources.
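In case the term is new to you, an inverted index maps each word to the documents that contain it. The following Java sketch builds one in memory. It is neither the Pig example nor anything distributed; the class InvertedIndexSketch, the tokenizing rule and the sample documents are all assumptions made for illustration. In a MapReduce job, the inner loop would roughly be the map phase (emit word and document id) and the grouping into sets would be the reduce phase.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory illustration; not the Pig script from the example.
class InvertedIndexSketch {

    // Builds word -> sorted set of ids of the documents containing that word.
    static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Very naive "analyzer": lowercase and split on non-word characters.
            String[] tokens = doc.getValue().toLowerCase(Locale.ROOT).split("\\W+");
            for (String token : tokens) {
                if (token.isEmpty()) continue;
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "Pig runs on Hadoop");
        docs.put(2, "Lucene and Hadoop");
        build(docs).forEach((word, ids) -> System.out.println(word + " -> " + ids));
    }
}
```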
After checking the example in detail, what comes to my mind now is whether it would be possible to create a Lucene index using Pig and MapReduce jobs, retrieving data from a distributed HBase data store… I am wondering whether there would be problems with Lucene analyzers (or anything else), for example.
I have read that Pig is not especially fast at accessing data. However, for indexing, this would probably be more than compensated for by the MapReduce jobs.
How fast would it be? I still have lots of research and tests to do…
Solr and Hadoop integration against scalability problems
Recently I read an article explaining how Rackspace solved their problem of dealing with huge amounts of log data. They have implemented the best Hadoop and Solr integration I have seen so far; it really looks amazing.
I don’t know Hadoop in detail, but running Solr instances from a Tomcat server stored in HDFS (Hadoop’s distributed file system) sounds like a pretty good job!
The whole process is well described in the article; I just want to mention the basic steps they followed:
- Store huge amounts of log data in the HDFS.
- MapReduce is used to create Lucene indexes from the stored data using Solr.
- Once built, indexes are compressed in Hadoop nodes.
- These indexes are merged using Solr webapps, deployed in Tomcat servers which are stored in Hadoop nodes too (that is for me the most impressive part). These Solr instances allow fast search requests as well.
This kind of architecture could probably be used to solve scalability problems in other fields, not just log processing. Search engines, for example. There the amount of data to deal with might be smaller, but many more features would probably be needed.

SeedRocket & EyeOS
Last Tuesday I had the chance to go to one of the SeedRocket talks. I went to listen to the founder of EyeOS. EyeOS is an interesting open source project. We could say EyeOS is a simple operating system in the cloud. It has its own file system. Once installed, you can edit different types of documents, use widgets and FTP, listen to mp3 files, use it from a mobile phone, manage processes, read feeds… and if you dig deeper you can install hundreds of applications.
The talk was mainly about how to build a user community and the advantages and disadvantages of making a project open source. I found it really interesting and learned some stuff!
