
Meeting Minutes - 2012-10-25

Attendees: Ziyan, Gang, Fabio

Time: 2:30 p.m.

Secretary: Fabio

Progress report

Gang reports on the progress made since the previous meeting.

  • An instance of MySQL has been installed on a server machine at IHEP. The DIRAC accounting data was inserted into the database, and a Python script was used to generate plots similar to those produced by the DIRAC web portal. The goal of this exercise was to collect baseline performance figures for plot generation using MySQL. The results are tabulated in the attached presentation. They show a substantial improvement over the results obtained on Gang's own computer, but performance is still significantly lower than that of the production DIRAC web portal.
    As the installed MySQL instance has not been tuned, it is suspected that it lacks the indexes that could speed up data retrieval. Gang will look at how to tune the MySQL database to improve its performance and obtain more realistic figures.
  • Gang tried to use IHEP's Hadoop cluster without much success: not all the tools he needs are currently available on that cluster. He instead installed Hadoop and HBase on a "single-node cluster", i.e. a standalone configuration. This allowed him to get familiar with the software installation process and to make initial tests using HBase.
  • An HBase database was configured with 2 column families, using the grouping criteria relevant for plot generation. A set of about 30,000 accounting records was inserted into the database for testing. Gang developed a Python script to query the HBase database and perform the data aggregation within his program. This approach is not convenient when a large set of records needs to be processed and aggregated, because a significant amount of data must be transferred from the database server to the machine where the script runs in order to perform the aggregation. For comparison, in the production environment the aggregation is performed by the MySQL engine itself.
  • A MapReduce-style approach may be more convenient, provided the data is distributed over several physical machines.
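The cost difference discussed above can be sketched in a few lines of Python. This is a minimal illustration, not Gang's actual script: the record layout (time bucket, site, CPU seconds) and the sample values are hypothetical. With HBase lacking server-side aggregation, every matching record crosses the network before grouping; with MySQL, a single GROUP BY query returns only the aggregated rows.

```python
from collections import defaultdict

# Hypothetical accounting records: (time_bucket, site, cpu_seconds).
records = [
    ("2012-10", "IHEP", 120.0),
    ("2012-10", "IHEP", 30.0),
    ("2012-10", "CC-IN2P3", 45.0),
    ("2012-11", "IHEP", 60.0),
]

def aggregate_client_side(rows):
    """Group by (time_bucket, site) and sum CPU time in the client.

    With HBase, every raw record must first be shipped from the
    database server to this process, which is what made the script
    slow for large record sets.
    """
    totals = defaultdict(float)
    for bucket, site, cpu in rows:
        totals[(bucket, site)] += cpu
    return dict(totals)

print(aggregate_client_side(records))

# With MySQL the same result comes from one server-side query, e.g.:
#   SELECT time_bucket, site, SUM(cpu_seconds)
#   FROM accounting GROUP BY time_bucket, site;
# so only the (much smaller) aggregated result crosses the network.
```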

We agreed on the following next steps:

  • Try again to use IHEP's Hadoop cluster. ZiYan and Gang will contact the appropriate person at IHEP for getting this sorted out.
  • Install and configure Thrift, so that Python scripts can interact with Hadoop.
  • Insert the whole set of accounting records into Hadoop.
  • Develop a MapReduce program to exploit (i.e. select and aggregate) the data using the tools provided by Hadoop.
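The planned select-and-aggregate MapReduce program can be prototyped in pure Python before moving to the Hadoop tooling. The sketch below assumes the same hypothetical (time_bucket, site, cpu_seconds) record layout as above; the shuffle/sort phase that Hadoop provides between map and reduce is emulated here by sorting the mapper output by key.

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    """Select the fields of interest and emit a (key, value) pair,
    as a Hadoop mapper would do for each input record."""
    bucket, site, cpu = record
    yield (bucket, site), cpu

def reducer(key, values):
    """Aggregate all values emitted for one key."""
    return key, sum(values)

def run_mapreduce(records):
    # Emulated shuffle/sort: collect mapper output and sort by key,
    # so all values for a key are adjacent for the reduce phase.
    pairs = sorted(kv for rec in records for kv in mapper(rec))
    return dict(
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    )

records = [
    ("2012-10", "IHEP", 120.0),
    ("2012-10", "IHEP", 30.0),
    ("2012-10", "CC-IN2P3", 45.0),
    ("2012-11", "IHEP", 60.0),
]
print(run_mapreduce(records))
```

In a real Hadoop deployment the mapper and reducer would run on the nodes holding the data, so the aggregation happens where the records live instead of in a single client process.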

Next meeting:

Thursday, November 8th, 2:30 p.m., Fabio's office. This date is to be confirmed by Gang, depending on the progress he manages to make during the next 2 weeks. The meeting may be postponed to Nov. 15th.

-- FabioHernandez - 2012-10-26

Topic attachments

  • Hadoop_Hbase.pptx (582.9 K, attached 2012-10-26 by FabioHernandez)