
Meeting Minutes - 2012-11-08

Attendees: Ziyan, Gang, Fabio

Time: 2:30 p.m.

Secretary: Fabio

Progress report

Gang reports on the progress made since the previous meeting. His presentation is attached.

  • Gang continued his study of HBase. He found some reports that led him to think that the database schema he used for his initial test of HBase may not be the most suitable. First, he used a timestamp as the key for each entry in the database, which is known to carry a performance penalty because records with consecutive keys are all stored in the same region of the key space. It is however not clear how this affects the performance of an HBase installation composed of a single machine.
    Second, using multiple column families (he used 2) for such a low number of columns (21) is not a good decision.
  • In spite of those shortcomings, Gang used the 'sqoop' tool to bulk-insert the 35M job accounting records currently in MySQL into HBase. The whole process took less than 4 hours on his personal computer, and the 5GB of MySQL data needs 35GB of storage in HBase. In other words, HBase incurs a storage overhead factor of 7 relative to MySQL. The explanation is that each record in HBase contains not only the data but also the column name. One way to reduce this overhead is to design a schema with shorter column names.
  • We acknowledge the information provided by the DIRAC developers about their desire to select a database for accounting records that allows for a "single box" deployment of the DIRAC servers. We will therefore explore the suitability of Cassandra for storing accounting records on a single machine. We agreed to test the performance of a single-box deployment of Cassandra and compare the results against the figures obtained with MySQL. We are concerned that if MySQL and Cassandra are both deployed in the same box they will compete for disk access, so such a configuration may not be beneficial. A cleaner separation of accounting records (in particular historical records) from live data would allow for better scalability in the long term.
  • We also agreed to look at open source implementations of column-oriented databases, in particular MonetDB, which is used by several scientific projects. It could offer us an alternative way to store accounting records in the "single box" deployment model.
  • Given the "single box" constraint, Fabio also suggests considering MongoDB, which could be well suited to such a deployment model.
  • Fabio mentions a recent performance comparison of several NoSQL databases available here. As usual with this kind of report, the conclusions may not apply to all possible use cases, but it is worth reading, if only to understand the methodology of the test.
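
As an illustration of the row-key concern raised in the first item, one commonly described way to avoid concentrating time-ordered records in a single region is to prefix the key with a small hash-derived "salt". The sketch below is not the schema used in Gang's test; the field name `job_id`, the bucket count and the key layout are assumptions made only for this example.

```python
import hashlib

# Hypothetical number of salt buckets; a real value would depend on the cluster.
NUM_BUCKETS = 16


def salted_row_key(job_id: str, timestamp: int, buckets: int = NUM_BUCKETS) -> bytes:
    """Build a row key whose leading bytes do not grow monotonically with time.

    The bucket is derived from the (hypothetical) job identifier, so records
    written at nearly the same time are spread over up to `buckets` key ranges
    instead of all falling next to each other in the key space.
    """
    digest = hashlib.md5(job_id.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    # Layout (an assumption for this sketch): <bucket>|<timestamp>|<job_id>
    return f"{bucket:02d}|{timestamp:010d}|{job_id}".encode("utf-8")


if __name__ == "__main__":
    # Three records written within the same few seconds.
    for i in range(3):
        print(salted_row_key(f"job-{i}", 1352385000 + i))
```

The trade-off is that a time-range query must then scan every bucket and merge the results, so whether this helps depends on the query pattern; on a single machine the benefit is, as noted above, unclear.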


We agreed on the following actions:

  • Investigate possible ways to tune MySQL so that it is as representative as possible of the configuration of the current DIRAC service for LHCb. The goal is to collect performance figures that will serve as the baseline against which other tools are compared.
  • Deploy Cassandra on a single box and collect performance figures on query response time and storage overhead. Those figures are to be compared to MySQL on the same machine.
  • Study and deploy MonetDB and collect performance figures for the same use cases as for Cassandra and MySQL.
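
The two figures named in these actions, query response time and storage overhead, could be collected with a small database-agnostic helper along the following lines. This is only a sketch: the function names are hypothetical, and `run_query` stands for whatever callable issues a query through the MySQL, Cassandra or MonetDB client actually used.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Run `run_query` several times and summarise the wall-clock response time in seconds."""
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def storage_overhead(baseline_size: float, candidate_size: float) -> float:
    """Ratio of the candidate store's on-disk size to the baseline's (same unit for both)."""
    if baseline_size <= 0:
        raise ValueError("baseline_size must be positive")
    return candidate_size / baseline_size


if __name__ == "__main__":
    # Sizes reported in these minutes: 5GB in MySQL, 35GB in HBase.
    print(storage_overhead(5, 35))
    # Placeholder workload standing in for a real database query.
    print(time_query(lambda: sum(range(100_000)), repeats=3))
```

Reporting the median alongside the minimum and maximum makes it easier to spot runs disturbed by caching or by other processes competing for the disk on the same box.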

Next meeting

November 22nd, 2012

-- FabioHernandez - 2012-11-09

Topic attachments

  • Discussion.pptx (PowerPoint, r1, 393.6 K), uploaded 2012-11-09 06:40 by FabioHernandez
Topic revision: r1 - 2012-11-09 - FabioHernandez