
Meeting Minutes - 2012-12-20

Attendees: Ziyan, Gang, Fabio

Time: 2:30 p.m.

Secretary: Fabio

Progress report

Gang reports on the progress made since the previous meeting. His presentation is attached.

  • The problem with bulk insertion into Cassandra has been solved. It was very likely related to the limited hardware configuration of the test machine (insufficient memory). Gang is now using a server in the IHEP computing center that is better configured for these tests. The whole test dataset is now in Cassandra and occupies 21 GB [reminder: the MySQL dump is 5 GB]. The data is modeled as a Cassandra column family.
  • Gang performed an initial test querying Cassandra and MySQL on the same machine to compare the time needed to retrieve the response. The results of this test are shown in slide 3. The time taken by MySQL is consistently about 31 seconds, independent of the size of the result set. For Cassandra, the smaller the result set, the shorter the time taken to retrieve the response. According to Gang's tests, if the result set is below 20,000 records, Cassandra is faster than MySQL for this particular use case.
  • Gang explored 2 ways of modeling the job records data in Cassandra, in order to understand which is better for our purposes. In both cases, data is aggregated by day and separated according to the field we need to plot. For instance, there is a column family for storing the aggregated CPU time consumption: in this column family, there is a row per user id and each column stores the aggregated CPU time consumed by that user on that particular day. Using this representation, it is possible to retrieve all the data necessary for generating a pie plot of CPU time grouped by user in less than 1 second. However, this data representation implies that the data must be preprocessed (to perform the aggregation) and that there must be a column family per type of plot that we need to generate (e.g. CPU time, execution time, number of jobs, etc.).
  • Another way of modeling was explored. It exploits a Cassandra feature that allows for creating columns with composite (a.k.a. aggregate) keys. The comparison in slide 15 shows that better query performance is obtained when the data is organized in super-column families.
  • Some tests of aggregating the data per hour were also performed and can be used if necessary. However, it was felt that for data older than say 1 month (or even 1 week) a per-hour granularity does not seem very useful in practice.
  • We agreed that Cassandra will store the raw job accounting records, even though the plots will be generated from the preprocessed data in order to speed up the process.
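The per-plot aggregation scheme described above can be sketched in a few lines of Python. This is only an illustrative, in-memory model (the dictionary stands in for a Cassandra column family, and all names are hypothetical), but it shows the layout discussed: one column family per metric, one row per user id, one column per day holding the aggregated value.

```python
from collections import defaultdict

# In-memory stand-in for the "CPU time" column family:
# {user_id: {day: aggregated_cpu_seconds}}
cpu_time_cf = defaultdict(dict)

def insert_aggregated(user_id, day, cpu_seconds):
    """Add one job record's CPU time to the per-user, per-day aggregate."""
    row = cpu_time_cf[user_id]
    row[day] = row.get(day, 0.0) + cpu_seconds

# Simulated raw job records: (user_id, day, cpu_seconds)
records = [
    ("alice", "2012-12-19", 120.0),
    ("alice", "2012-12-19", 60.0),
    ("bob",   "2012-12-20", 300.0),
]
for user, day, cpu in records:
    insert_aggregated(user, day, cpu)

# Data for a pie plot of CPU time grouped by user: one pass over the rows,
# summing the per-day columns of each row.
pie_data = {user: sum(cols.values()) for user, cols in cpu_time_cf.items()}
print(pie_data)  # {'alice': 180.0, 'bob': 300.0}
```

Note that generating a different plot (execution time, number of jobs, ...) would require a separate structure of the same shape, which is exactly the "one column family per type of plot" cost mentioned above.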

Next steps

Below are the agreed next steps:

  • to perform tests confirming that the data retrieved from MySQL is exactly the same as the data retrieved from Cassandra for the same queries. This is to make sure that no information is lost in the insertion process or in the way the data is modeled in Cassandra.
  • to choose the keys for the columns and rows in a way that is meaningful for humans. For instance, if a column name represents a date, something like '2012-12-20' should be used.
  • to test the data aggregation on the fly. In other words, using simulated job records, perform the aggregation while inserting each job record (as opposed to aggregating the data once all the raw records are already in the data store), because this better reflects the way the system needs to work in a production environment.
  • for the next meeting, to prepare a demonstration of generating several kinds of plots using the Cassandra back-end.
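The first of these steps, the MySQL-versus-Cassandra consistency check, could look roughly like the sketch below. The back-end queries are stubbed with plain lists here (in the real test they would come from the respective database drivers), and the normalization step is an assumption: it compensates for back-end-specific row ordering and numeric types so that only genuine content differences cause a mismatch.

```python
def normalize(rows):
    """Sort rows and coerce all values to strings, so that differences in
    row ordering or numeric representation between back-ends do not
    produce false mismatches."""
    return sorted(tuple(str(v) for v in row) for row in rows)

# Stand-ins for the results of the same logical query on each back-end.
mysql_rows = [(1, "alice", 120.0), (2, "bob", 300.0)]
cassandra_rows = [(2, "bob", 300.0), (1, "alice", 120.0)]  # different order

assert normalize(mysql_rows) == normalize(cassandra_rows)
print("result sets match")
```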

Next meeting

Thursday, January 10th, 2013, 2:30 p.m., Fabio's office. NOTE: the start time of this meeting may be delayed because of a scheduling constraint on Fabio's side; the final time will be confirmed by e-mail.

-- FabioHernandez - 2012-12-21
