
Meeting Minutes - 2013-06-13

Attendees: Ricardo (via Skype), Ziyan, Gang, Fabio

Time: 4:00 p.m.

Secretary: Fabio

Progress Report

The goal of this meeting was to summarize the work done by Gang so far and to collect feedback from DIRAC experts. The slides presented during the meeting are attached.

Gang presented an overview of MongoDB and how it works: its storage engine, its data model, its options for scaling out, its replication capabilities, etc. MongoDB was selected after the first round of tests showed that it was a good candidate as the storage backend for DIRAC job accounting records. The first phase of those tests was devoted to implementing the functionality needed by the DIRAC portal to generate the most typical plots. That phase was summarized in a previous meeting on XXX. The current phase of tests was devoted to exploring MongoDB as the storage engine for individual accounting records, stored in such a way that detailed interactive analysis can be performed.

Compared to MySQL, MongoDB needs more than twice the storage for our use case: storing these data requires 5.6GB of disk in MySQL and 14GB in MongoDB, a factor of about 2.5. This is mainly because MongoDB stores the field names in every document, so there is a lot of inherent redundancy, as expected for this kind of database.
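The overhead figure and the effect of repeated field names can be illustrated with a small sketch. It uses JSON text length as a rough stand-in for MongoDB's on-disk document format, which is an assumption made only for illustration; the field names are those mentioned in these minutes, while the values and the helper name field_name_share are invented.

```python
import json

# Figures quoted in the minutes.
MYSQL_GB = 5.6
MONGO_GB = 14.0
overhead_factor = MONGO_GB / MYSQL_GB  # about 2.5


def field_name_share(record):
    """Return the fraction of a serialized record taken up by its field names.

    JSON is used here as an approximation of the stored document; the real
    on-disk format differs, so this only shows the order of magnitude.
    """
    with_names = len(json.dumps(record, separators=(",", ":")))
    values_only = len(json.dumps(list(record.values()), separators=(",", ":")))
    return (with_names - values_only) / with_names


# Hypothetical record: field names come from the minutes, values are invented.
example_record = {
    "User": "someuser",
    "JobType": "user",
    "JobClass": "unknown",
    "ProcessingType": "unknown",
    "FinalMajorStatus": "Done",
}

if __name__ == "__main__":
    print(f"overhead factor: {overhead_factor:.2f}")
    print(f"share of field names: {field_name_share(example_record):.0%}")
```

For records with short values and long field names, as in the example, the names account for a large part of each document, which is consistent with the overhead observed.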

The testbed used for evaluating MongoDB was composed of 2 physical machines, with one shard configured on each machine (2 shards in total). The sharding criterion (shard key) was the user id.
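As a sketch of how such a setup might be declared from Python, the function below issues the two MongoDB administrative commands enableSharding and shardCollection through a pymongo client connected to a mongos router. The namespace "accounting.jobs" and the key field "User" are assumptions chosen for illustration, not the names used on the testbed.

```python
def shard_by_user(client, namespace="accounting.jobs", key_field="User"):
    """Shard one collection on the user field.

    `client` is expected to behave like a pymongo.MongoClient connected to a
    mongos router. The default namespace and key field are hypothetical.
    """
    database = namespace.split(".", 1)[0]
    # Allow collections of this database to be distributed across shards.
    client.admin.command("enableSharding", database)
    # Distribute the collection using an ascending key on the user field.
    client.admin.command("shardCollection", namespace, key={key_field: 1})
```

With the user as shard key, all records of a given user land on the same shard, so per-user queries touch a single machine while queries across all users are spread over both.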

For querying, MongoDB offers two mechanisms: the aggregation framework and "find". The aggregation framework is modelled as a pipeline of operations, at the end of which a response document (in the MongoDB sense, i.e. a JSON structure) is returned to the application. This output document cannot be bigger than 16MB. In contrast, a "find" query returns a cursor which the application uses to retrieve documents at its own pace. Several indexes were created to speed up the queries. The indexes take a significant amount of storage relative to the volume required for the individual records. In addition, they need to be small enough to be kept in the memory of the machine running MongoDB.
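The difference between the two query styles can be sketched as follows, assuming a pymongo collection object is passed in. The field names User and FinalMajorStatus are taken from these minutes; the function names and the chosen query (jobs per final status for one user) are only illustrative. Note that recent pymongo releases return a cursor from aggregate as well, so the behaviour may differ from the version tested here.

```python
def jobs_per_status_pipeline(user):
    """Build an aggregation pipeline counting one user's jobs per final status."""
    return [
        {"$match": {"User": user}},
        {"$group": {"_id": "$FinalMajorStatus", "jobs": {"$sum": 1}}},
        {"$sort": {"jobs": -1}},
    ]


def summarize_with_aggregation(collection, user):
    """Let the server do the grouping; only the small summary is returned."""
    return list(collection.aggregate(jobs_per_status_pipeline(user)))


def summarize_with_find(collection, user, batch_size=1000):
    """Fetch individual records through a cursor and count on the client side."""
    cursor = collection.find(
        {"User": user},
        {"_id": 0, "FinalMajorStatus": 1},  # projection: only the needed field
    ).batch_size(batch_size)
    counts = {}
    for doc in cursor:
        status = doc.get("FinalMajorStatus")
        counts[status] = counts.get(status, 0) + 1
    return counts
```

The aggregation variant keeps the data transfer small, while the find variant moves every matching record to the client, which is what plots based on individual data points require.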

Gang performed several tests to generate the detailed analysis example plots documented in the DIRAC paper. The total time required for those plots (i.e. data retrieval plus graphic output) varies from 1 second to 2 minutes. It is not yet clear in which step most of the time is spent, so this has to be measured precisely. Although it would be desirable to shorten the time needed to generate those plots, the current timings already allow an acceptable level of interactivity. Some of those plots require handling more than 4M data points, so this is a non-trivial operation in any case.

A solution based on MongoDB could be implemented for the DIRAC portal, likely in conjunction with the current MySQL-based solution. MongoDB could be an optional component used to store individual, non-aggregated job accounting records for installations of DIRAC that require that kind of functionality. Such a solution requires more hardware (ideally 3 machines, for high-availability purposes), which may be a constraint for some installations of DIRAC.

Next steps

We agreed to do some further work:

  • measure the time taken to generate the detailed analysis plots, to understand whether the time is spent in the data retrieval phase or in the graphic generation phase.
  • explore the possibility of using the individual job accounting records to generate summary plots (such as the one shown in slide 12). The goal is to understand whether the data aggregation step (currently by period of time) is still required if DIRAC uses MongoDB.
  • measure the CPU and memory consumption of the machines running MongoDB while they respond to queries. This is useful to understand whether the typical queries of our use case push the testbed machines to their limits or whether there is still margin for growth.
  • explore the possibility of performing queries against MongoDB that involve several fields, even if no specific index has been created for such a query. The goal is to determine whether a single index containing the discriminant fields used in the queries (such as FinalMajorStatus, User, JobType, JobClass, ProcessingType, ...) can be used, avoiding the creation of several specific indexes.
  • write a summary of this work to make it known, and prepare a presentation for an upcoming meeting (for LHCb, for the DIRAC users' forum or for a conference).
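The first measurement in the list above could be approached with a small timing harness such as the following sketch. The names timed and profile_plot are hypothetical, and retrieve and render stand for whatever functions fetch the records and draw the plot.

```python
import time


def timed(step, *args, **kwargs):
    """Run one step and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = step(*args, **kwargs)
    return result, time.perf_counter() - start


def profile_plot(retrieve, render):
    """Time data retrieval and graphic generation separately.

    `retrieve` must return fully materialized data (e.g. a list built from the
    cursor); otherwise the fetching cost leaks into the rendering step.
    """
    data, retrieval_s = timed(retrieve)
    _, render_s = timed(render, data)
    total = retrieval_s + render_s
    return {
        "retrieval_s": retrieval_s,
        "render_s": render_s,
        "retrieval_share": retrieval_s / total if total > 0 else 0.0,
    }
```

Running this for each example plot, ideally several times to smooth out caching effects, would show whether the 1 second to 2 minutes range is dominated by the database or by the plotting code.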

Next meeting: date to be selected by e-mail.

-- FabioHernandez - 2013-06-14

Topic attachments

  • discussion(06.13).pptx (PowerPoint, revision r1, 1822.3 K, uploaded 2013-06-14 01:54 by FabioHernandez)

Topic revision: r1 - 2013-06-14 - FabioHernandez