Meeting Minutes - 2012-09-27

Attendees: Ziyan, Gang, Fabio

Time: 2:30 p.m.

Secretary: Fabio

Progress report

Gang reported on the progress made since the previous meeting.

  • The database dump obtained from the DIRAC instance of the LHCb experiment is about 5 GB in size and contains accounting records for the period 2008-2012. Since LHCb's DIRAC instance is very busy, this figure gives a useful estimate of the data volumes involved. A data set of this size can be held entirely in memory, which opens the door to tools that work on the full data set in memory and considerably speed up data processing. We need to bear this in mind.
  • Gang's work over the previous week focused on two points: first, comparing the strengths and weaknesses of several unstructured databases, and second, implementing a prototype tool that generates the same kind of plots as the DIRAC web portal.
  • In the attached presentation prepared by Gang, we can see his findings about Cassandra, CouchDB, HBase and Riak.
  • In the second part of the presentation, Gang showed the results obtained with Python code he wrote to generate plots from the accounting records stored in a MySQL database, running on his laptop. The time needed to generate a plot is too high to be practical (on the order of 100 seconds per plot), but it is not considered representative of what can be achieved on a server-class machine. For comparison, the production DIRAC web portal of LHCb generates the same kind of plots in less than 10 seconds.
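
To make timings such as those reported above comparable between machines, each plot type can be timed in the same way. The following is only a minimal sketch of such a harness: the two functions `cpu_time_by_site` and `jobs_by_status` are hypothetical stand-ins, not Gang's actual plotting code, and would be replaced by the real routines that query MySQL and render a plot.

```python
import time
from statistics import median


def time_plot(generate, repeats=3):
    """Return the median wall-clock time in seconds over `repeats` calls."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - start)
    return median(samples)


def benchmark(plot_generators, repeats=3):
    """Map each plot type name to its median generation time in seconds."""
    return {name: time_plot(fn, repeats) for name, fn in plot_generators.items()}


# Hypothetical stand-ins for the real plot routines (assumption: the real
# ones would query the accounting database and render a plot).
def cpu_time_by_site():
    sum(i * i for i in range(100_000))


def jobs_by_status():
    sum(i for i in range(50_000))


if __name__ == "__main__":
    results = benchmark(
        {"cpu_time_by_site": cpu_time_by_site, "jobs_by_status": jobs_by_status}
    )
    for name, seconds in sorted(results.items()):
        print(f"{name}: {seconds:.4f} s")
```

Using the median of a few repetitions reduces the influence of a single slow run, for example one that hits a cold database cache.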

We agreed on the following actions:

  • Perform the same plot generation tests using a server machine and record the time needed for generating each type of plot. This will be our baseline for comparison of other solutions.
  • Explore the use of HBase as the first disk-based unstructured data store to evaluate for this problem. HBase was selected because IHEP already runs a Hadoop cluster, and HBase is built on top of the Hadoop file system (HDFS). The goal is to avoid spending time on installing and configuring a Hadoop cluster by using the one already available at IHEP. This work involves modelling the DIRAC job accounting data to fit the HBase data model and writing some MapReduce jobs to extract and format the data for the plot generation step. Gang will measure the time needed to generate the plots this way and compare it with the results obtained on a comparable machine using MySQL.
  • At a later stage, explore MongoDB and Redis as a way of keeping a cache of recent accounting data in memory to speed up the generation of plots. Given the size of the accounting data set, this approach seems worth exploring.
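
As an illustration of the modelling and MapReduce step in the HBase action, the sketch below aggregates CPU time per site and per day in the map/reduce style. It runs in-process in plain Python and does not call HBase or Hadoop. The record fields (`site`, `end_time`, `cpu_seconds`), the site names and the `site|YYYYMMDD` row key layout are all assumptions made for the example, not the actual DIRAC accounting schema.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical accounting records; field names and values are assumptions.
RECORDS = [
    {"site": "SITE-A", "end_time": 1348704000, "cpu_seconds": 3600},
    {"site": "SITE-A", "end_time": 1348707600, "cpu_seconds": 1800},
    {"site": "SITE-B", "end_time": 1348790400, "cpu_seconds": 7200},
]


def row_key(record):
    """Build a 'site|YYYYMMDD' key so that one site's days sort together."""
    day = datetime.fromtimestamp(record["end_time"], tz=timezone.utc)
    return f"{record['site']}|{day.strftime('%Y%m%d')}"


def map_phase(records):
    """Emit one (row key, CPU seconds) pair per accounting record."""
    for record in records:
        yield row_key(record), record["cpu_seconds"]


def reduce_phase(pairs):
    """Sum the CPU seconds of all pairs that share the same row key."""
    totals = defaultdict(int)
    for key, cpu_seconds in pairs:
        totals[key] += cpu_seconds
    return dict(totals)


if __name__ == "__main__":
    for key, total in sorted(reduce_phase(map_phase(RECORDS)).items()):
        print(key, total)
```

Because HBase keeps rows sorted by row key, a key that starts with the site name would let a plot for one site over a date range be served by a single range scan. Whether this layout suits the real plots is something the evaluation would have to confirm.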

Next meeting

Because of the October national holiday and the HEPiX workshop organized at IHEP, the next meeting is scheduled for October 25th.

-- FabioHernandez - 2012-10-24

Topic attachments
Comparison.pptx (PowerPoint, r1, 821.5 K, uploaded 2012-10-24 05:24 by FabioHernandez): Comparison of NoSQL implementation and measurements of plotting generation with MySQL backend
Topic revision: r1 - 2012-10-24 - FabioHernandez