Meeting Minutes - 2013-01-24

Attendees: IHEP: Ziyan, Gang, Fabio | Remote: Ricardo Graciani, Adrian Casajús

Time: 4:30-6:30 p.m.

Secretary: Fabio


The main goal of this meeting is to present the status of this work to the DIRAC team and to get feedback from them. Thanks to Ricardo and Adrian for the time devoted to this discussion and for the very useful feedback provided to us.

Progress report

Gang reported on the progress made since he began working on this in August 2012. His presentation is attached.

  • The first part of the presentation was devoted to showing the progress made in understanding NoSQL technologies and their suitability for the DIRAC accounting data use case. Several tools were explored and Cassandra was selected as the backend for developing a detailed prototype solution.
  • Gang showed in some detail how, after several iterations, he decided to organize the accounting data in Cassandra in order to efficiently answer the queries that are to be made against this system. In the testing configuration, Cassandra runs on a single host (4 cores@2.4GHz, 16GB RAM).
    Question: can this setup be used to test the scalability of Cassandra, for instance by running more Cassandra processes on the same host?
  • The way the accounting data is organized in Cassandra can be summarized as follows (see slide 23 onwards). Six column families were created, one for each of the grouping criteria needed when generating the plots (i.e. group by user, group, site, job processing type, country, grid). In each of those column families, the data associated with each row is aggregated by day and by the relevant accounting metric (i.e. consumed CPU time, used disk space, execution time, size of the input and output sandboxes, and number of jobs). For instance, in the column family that groups by site, the row for site 'LCG.FZKGLEXEC.de' contains a map associating dates to values. Each value represents the aggregated amount of CPU time (or used disk space, or execution time, etc.) provided by that particular site on that particular date (see slide 25).
  • In addition, the raw data, i.e. the individual job accounting records, are also stored in Cassandra, but are not used when retrieving data for plot generation.
  • The current data organization is well suited to retrieving the data for plotting when no detailed selection is performed. Plots like 'cpu time consumed over a time period grouped by site' can be easily generated with this structure. However, no tests have been performed to understand whether this organization is also well suited to retrieving the data with further selection criteria, such as the data needed to generate a plot of 'cpu time consumed by jobs of type simulation over a time period grouped by site'. The DIRAC portal allows this kind of selection, so the accounting backend storage needs to support it as well.
  • Gang is using a Cassandra feature that allows cells to be defined as counters. This makes updates fast in the data aggregation phase because the application does not need to retrieve the value, increase it and write it back into the cell. Cassandra does the increment itself.
  • The dataset being used for testing contains 35 million accounting records over the period 2008 to mid-2012. The total amount of storage taken by these data when stored in MySQL is 4GB. In Cassandra, the same raw data needs 21GB of disk storage. The significant storage overhead is very likely due to the redundancy in the way Cassandra stores its data to support unstructured datasets. For instance, for each row Cassandra stores every column name along with its associated value. This allows for rows with different columns and different names in the same column family.
  • The amount of storage used by Cassandra for storing the 6 column families which contain aggregated data per day of the whole data set is about 22MB.
  • The process of aggregating the data per day is performed at insertion time. Each job record is inserted in the raw data space and also the column families containing aggregated data per day are updated. This process takes 18 seconds for aggregating and storing the information of every 1100 records in 4 column families.
  • Q: is the day the best granularity we can have? Is Cassandra suitable for storing finer granularity, for instance per-hour granularity? What is the storage overhead and what is the performance?
  • Q: is it possible to re-aggregate (or 'rebucket') the data at other granularities from the raw records already stored in Cassandra, as opposed to doing this at insertion time?
  • Gang set up a demonstration portal that partially mimics the user interface provided by the DIRAC production portal. This portal uses pycassa, the Python client for Cassandra, for querying the data and Matplotlib for generating the plots. Slides 28 onwards show some plots generated with this portal, compared (when possible) to equivalent plots generated by the LHCb DIRAC portal, from which the testing dataset was extracted. The figures also show the time needed to query the data from Cassandra, before passing it to Matplotlib for generating the images. Some discrepancies in the plots (slide 29) can be explained by the different aggregation methods used in the production portal (per week) and the prototype portal (per day). However, the plot in slide 30 looks strange. This needs to be understood.
  • Regarding the implementation (slide 26), it would be wiser to use a cursor for paging the results than to hardcode the maximum number of results expected by the client (currently 365.000). This maximum value depends on the aggregation criteria (per hour, per day, per week), so to be accurate it should change with the granularity selected for storing the data. An alternative way of retrieving the results should be investigated in order to allow for more flexibility.
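
The data organization, the counter-style updates and the cursor-based paging discussed above can be sketched in plain Python. This is only an in-memory illustration and not Gang's implementation: the names GROUPINGS, METRICS, new_store, insert_record, fetch_page and iter_row are hypothetical, nested dictionaries stand in for Cassandra column families, and the sandbox size metrics are left out for brevity.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical names for the six grouping criteria and some of the metrics.
GROUPINGS = ("user", "group", "site", "job_type", "country", "grid")
METRICS = ("cpu_time", "disk_space", "exec_time", "jobs")


def new_store():
    """One 'column family' per grouping: (row key, metric) -> day -> total.

    defaultdict(int) plays the role of a row of counter columns.
    """
    return {g: defaultdict(lambda: defaultdict(int)) for g in GROUPINGS}


def insert_record(store, record):
    """Update every per-day aggregate touched by one job record."""
    day = record["end_date"].isoformat()  # ISO dates sort chronologically
    for grouping in GROUPINGS:
        for metric in METRICS:
            value = 1 if metric == "jobs" else record[metric]
            # Increment in place, with no read-modify-write in the caller.
            store[grouping][(record[grouping], metric)][day] += value


def fetch_page(row, start, finish, count):
    """Stand-in for one bounded slice query: at most `count` columns."""
    names = sorted(n for n in row if start <= n <= finish)
    return [(n, row[n]) for n in names[:count]]


def iter_row(row, start, finish, page_size=100):
    """Page through a row instead of assuming a maximum column count."""
    while True:
        page = fetch_page(row, start, finish, page_size)
        for name, value in page:
            yield name, value
        if len(page) < page_size:
            return
        # Resume strictly after the last column name already returned.
        start = page[-1][0] + "\x00"
```

Because the column names are ISO date strings, a slice between two names corresponds to a time interval, and the same paging loop would work unchanged for per-hour or per-week column names.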

Next steps

Gang was congratulated for the progress achieved so far and encouraged to continue this exploration.

Below we present the agreed next steps to continue this work.

  • Try to do the aggregation of the whole dataset per hour, quantify and compare the results.
  • Perform the tests with more detailed selection criteria to verify that the data organization also suits this need.
  • Determine whether Cassandra can be used for storing job accounting records at the finest usable granularity (i.e. the individual job record) while still allowing the database to be queried efficiently. This requires understanding all the current and desirable queries we want the database to be able to answer. We can already identify the queries needed for generating the accounting plots supported by the DIRAC portal, but there may be other kinds of queries that experience has shown are also desirable. We need to document them.
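
As a starting point for the per-hour aggregation and selection tests listed above, the following sketch shows how raw job records could be re-aggregated at different granularities, with an optional selection filter. It is a plain Python illustration under stated assumptions: the function names (bucket_start, rebucket), the record field names and the sample records are all hypothetical, and it says nothing about how such a scan would perform against Cassandra.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucket_start(ts, granularity):
    """Truncate a timestamp to the start of its hour, day or week."""
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "day":
        return day
    if granularity == "week":
        return day - timedelta(days=day.weekday())  # Monday of that week
    raise ValueError("unknown granularity: %r" % granularity)


def rebucket(raw_records, group_by, metric, granularity, where=None):
    """Aggregate raw records into group -> bucket start -> total.

    `where` is an optional dict of field -> required value, e.g.
    {"job_type": "simulation"}, applied before aggregation.
    """
    where = where or {}
    totals = defaultdict(lambda: defaultdict(int))
    for rec in raw_records:
        if any(rec.get(field) != wanted for field, wanted in where.items()):
            continue
        bucket = bucket_start(rec["end_time"], granularity)
        totals[rec[group_by]][bucket] += rec[metric]
    return totals


# Invented sample records, for illustration only.
SAMPLE = [
    {"site": "A", "job_type": "simulation", "cpu_time": 10,
     "end_time": datetime(2013, 1, 24, 10, 15)},
    {"site": "A", "job_type": "simulation", "cpu_time": 20,
     "end_time": datetime(2013, 1, 24, 10, 45)},
    {"site": "A", "job_type": "user", "cpu_time": 5,
     "end_time": datetime(2013, 1, 24, 23, 0)},
    {"site": "A", "job_type": "simulation", "cpu_time": 7,
     "end_time": datetime(2013, 1, 25, 1, 0)},
    {"site": "B", "job_type": "simulation", "cpu_time": 3,
     "end_time": datetime(2013, 1, 28, 9, 0)},
]

per_day = rebucket(SAMPLE, "site", "cpu_time", "day")
per_day_sim = rebucket(SAMPLE, "site", "cpu_time", "day",
                       where={"job_type": "simulation"})
```

Running the same function with "hour", "day" and "week" on the same records gives a simple reference against which the results of the stored aggregates can be compared.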

Next meeting

We will call for a next meeting with Ricardo and Adrian when we have made further progress, in a few weeks' time. In the meantime, Gang and ZiYan will enjoy their well deserved holiday period around Chinese New Year.

The next meeting will be scheduled by e-mail.

-- FabioHernandez - 2013-01-25

Topic revision: r3 - 2013-01-28 - FabioHernandez