Meeting Minutes - 2013-04-11

Attendees: Ziyan, Gang, Fabio

Time: 2:30 p.m.

Secretary: Fabio

Progress Report

Gang reported on his work since the last meeting. Gang's presentation is attached.

  • Gang ran tests with a Cassandra cluster of 3 nodes, configured with 2 seed nodes and a replication factor of 2. The test data was extracted from the bucket MySQL table, which aggregates job records over time (hour, day, week) as a function of data age. The amount of data in Cassandra was ~400MB.
  • Testing this cluster with the same queries needed for generating the plots did not produce faster results than the single-node configuration: generating the data for a single query still takes several hundred seconds.
  • Gang decided to deploy and test MongoDB. MongoDB uses memory-mapped files, so the working set is effectively kept in memory. The dataset used for testing was the same bucket table used against Cassandra: 1.3M records, totaling 312MB in the MySQL dump. Once imported into Mongo, the data volume was 618MB, a storage overhead of roughly 2×. MongoDB uses BSON (Binary JSON) for serializing the data.
  • Post-meeting, Gang confirmed that the MySQL table containing 35M records, for a total volume of 6GB, needs 14GB in MongoDB: again a storage overhead of roughly 2×.
  • The current test did not include the insertion of the descriptive text of several fields, such as the user identity: only the numeric user id was stored in Mongo.
  • The insertion rate was considerably faster than inserting into Cassandra (see slide 15).
  • Slide 17 compares the time required to retrieve the data for a typical plot of CPU time grouped by site over a period of 2 years: 335 seconds with Cassandra versus 3 seconds with Mongo. See slide 18 for the generated plot.
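The "CPU time grouped by site over 2 years" query above maps naturally onto MongoDB's aggregation pipeline. The sketch below shows what such a pipeline could look like; the collection schema is not described in the minutes, so the field names ("site", "cpu_time", "date") and the collection name are assumptions, not Gang's actual code.

```python
from datetime import datetime

# Hypothetical field names -- the real schema of the bucket
# collection is not given in the minutes.
start = datetime(2011, 4, 1)
end = datetime(2013, 4, 1)

pipeline = [
    # Keep only records inside the 2-year window.
    {"$match": {"date": {"$gte": start, "$lt": end}}},
    # Sum CPU time per site.
    {"$group": {"_id": "$site", "total_cpu": {"$sum": "$cpu_time"}}},
    # Largest consumers first, for plotting.
    {"$sort": {"total_cpu": -1}},
]

# Against a live server one would then run something like:
#   from pymongo import MongoClient
#   coll = MongoClient()["accounting"]["buckets"]   # names assumed
#   results = list(coll.aggregate(pipeline))
print(pipeline[1]["$group"]["_id"])
```

Because the whole pipeline executes server-side, only the per-site totals cross the network, which is consistent with the 3-second plot-generation time reported on slide 17.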

Next steps

We agreed to perform the following tests:

  • Test with more data, much more than the RAM capacity of a single server. The goal is to understand what scalability Mongo offers. There are several publicly available horror stories of people trying to make Mongo scale out, so we need to understand what the issues may be.
  • Test simultaneous requests to the Mongo server, to understand its behavior when serving multiple concurrent clients: 10, 50, 100.
  • Understand the data replication possibilities offered by Mongo. What happens if the machine hosting MongoDB breaks? Can the data still be retrieved from other machines? Is this handled transparently by Mongo itself, or does it require human intervention?

Information sent by Gang post-meeting

After this meeting, Gang performed additional tests. Here are his (slightly edited) notes:

  • Replication -- MongoDB supports two modes: master-slave replication and replica sets. Replica sets are recommended because they provide automatic failover. I set up a 3-node replica set on badger01: when I shut down the primary node, a secondary automatically became the primary and served read/write requests, and when the former primary restarted, it rejoined as a secondary.
  • Scalability -- MongoDB supports sharding. I tested a 2-shard cluster on badger01: as long as a reasonable shard key is chosen, MongoDB automatically balances the data between the shards.
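Gang's failover observation is visible from the client side too: a driver connected to a replica set rediscovers the primary on its own. The sketch below only builds the connection URI for the 3-node set he describes; the port numbers and the replica-set name "rs0" are assumptions (the minutes only say the set ran on badger01).

```python
# Assumed layout: three mongod processes on badger01, one per port,
# forming a replica set named "rs0".
hosts = ["badger01:27017", "badger01:27018", "badger01:27019"]
uri = "mongodb://%s/?replicaSet=rs0" % ",".join(hosts)

# With pymongo, the client discovers the current primary from any
# reachable member, so a failover needs no client-side intervention:
#   from pymongo import MongoClient
#   client = MongoClient(uri)
#   client.accounting.buckets.insert_one({...})  # routed to the primary
print(uri)
```

Listing all members in the URI means the client can still bootstrap even if the node it would contact first is the one that went down.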

Next meeting

Thursday April 25th, 2:30 pm, Fabio's office.

-- FabioHernandez - 2013-04-16

