Meeting Minutes - 2013-05-02

Attendees: Ziyan, Gang, Fabio

Time: 2:30 p.m.

Secretary: Fabio

Progress Report

Gang reported on his work since the last meeting. Gang's presentation is attached.

  • Gang worked on getting more familiar with MongoDB v2.4.
  • The data in the MySQL JobType table was imported into MongoDB, where it takes about 14 GB (compared to 5 GB in MySQL). An index was also created for this data.
  • Gang explored the possibilities for indexing the data to support specific queries. The indexes required by the queries need to fit into the RAM of the MongoDB servers: MongoDB keeps the whole index and the working set (i.e. a frequently-accessed fraction of the data) in RAM.
  • MongoDB offers options for configuring the database in a highly-available way using replica sets. Members of a replica set can be primary, secondary or arbiter. The members of a replica set replicate the data among themselves and ensure automated failover. The MongoDB client (i.e. the application) knows which server is the primary. When the primary goes down, a new primary is elected from among the secondary servers. In a typical installation, a replica set is composed of 3 MongoDB servers. Writes go only through the primary server, but secondary servers can serve read requests.
  • After configuring a 3-member replica set, Gang ran some tests to understand how the election of a new primary server works, by explicitly shutting down the primary. After about 10 seconds, a new primary was elected and the application received the response to its query.
  • For scalability, MongoDB's approach is to create shards. In a sharded configuration, each MongoDB server stores a fraction of the data. Usually, each shard is itself configured as a replica set. A MongoDB cluster is composed of router processes (called mongos), which route the client application's reads and writes to the shards; config servers, which store metadata about the cluster; and the shards themselves, which actually store the data.
  • Sharding the data is recommended when the dataset approaches the storage capacity of a node, when the working set approaches the maximum amount of available RAM, or when there is a large amount of write activity.
  • For testing purposes, Gang configured a MongoDB cluster using 2 physical nodes. The cluster was configured with 2 shards, one on each machine. Each shard is a replica set with 3 copies (all on the same machine).
  • For querying the cluster, MongoDB provides an aggregation framework. Gang used this framework to query the data required for reproducing a plot that shows the CPU efficiency distribution for a particular user. This kind of plot appears in the DIRAC accounting paper provided by Ricardo, so it is a typical example of the detailed analysis that a DIRAC operator would be able to perform on the job accounting data.
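
As a sketch of the replica-set setup described above, a 3-member set is initiated with a configuration document like the one below (built here as plain Python dicts for illustration; the host names are placeholders, and in practice the document would be passed to the replSetInitiate admin command, e.g. via the mongo shell's rs.initiate()):

```python
# Sketch of a replica-set configuration document for a 3-member set.
# Host names are placeholders; the real ones depend on the test machines.
replica_set_config = {
    "_id": "rs0",  # name of the replica set
    "members": [
        {"_id": 0, "host": "node1.example.org:27017"},  # initial primary candidate
        {"_id": 1, "host": "node1.example.org:27018"},  # secondary
        {"_id": 2, "host": "node1.example.org:27019"},  # secondary (or arbiter)
    ],
}

# With 3 members, the set tolerates the loss of one: the two survivors
# still form a majority and can elect a new primary.
majority = len(replica_set_config["members"]) // 2 + 1
print(majority)  # → 2
```

This majority requirement is why an automated failover with 3 members can survive the loss of the primary, as observed in the election test.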

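The CPU-efficiency query could be expressed as an aggregation pipeline along the following lines (a sketch only: the field names user, cpu_time and wall_time are assumptions about the JobType schema, not the actual fields used in Gang's tests):

```python
# Sketch of an aggregation pipeline for the CPU efficiency distribution
# of one user. Field names are assumed, not taken from the real schema.
pipeline = [
    # keep only the jobs of the user of interest
    {"$match": {"user": "some_user"}},
    # compute the CPU efficiency of each job
    {"$project": {"efficiency": {"$divide": ["$cpu_time", "$wall_time"]}}},
    # count the number of jobs per efficiency value
    {"$group": {"_id": "$efficiency", "count": {"$sum": 1}}},
    # sort by efficiency so the result is ready for plotting
    {"$sort": {"_id": 1}},
]
```

The pipeline would then be executed with collection.aggregate(pipeline) (collection name assumed) and the resulting counts plotted as the distribution.
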
Next steps

  • Use the MongoDB cluster's query mechanisms to reproduce the plots for the detailed analysis use-cases reported in the DIRAC accounting paper, and measure the time to produce each one of them.
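
The time-to-plot measurement could use a small helper like the following (a generic sketch; run_query stands in for the call that actually executes the aggregation against the cluster):

```python
import time

def timed(run_query):
    """Run a query function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = run_query()
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example with a placeholder query function; in the real measurement,
# run_query would call collection.aggregate(pipeline) on the cluster.
result, elapsed = timed(lambda: sum(range(1000)))
print(result, elapsed)
```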

Next meeting: Friday May 24th, 14:30, Fabio's office

-- FabioHernandez - 2013-05-22
