Exploring Python Client

To gain hands-on experience, I used the Cassandra Python API (pycassa) to implement several simple namespace operations.

1. Design of data model

To fully represent the metadata of a namespace, two column families (CFs) are used: Directory and FFile.
1.1 Column Family Directory
CF Directory describes the metadata of a directory. The basic metadata is owner, group, and content, where content means the subdirectories and files inside the directory together with their associated keys. Each entry of the column family is identified by a RowKey and can have a different number of columns. Each column stores one piece of metadata, e.g. column "owner", column "group", column "dir3" (the name of a subdirectory), or column "f2" (the name of a file), and carries a timestamp recording when the column was last updated. For a column named after a subdirectory or file, the value is the UUID of that child, i.e. the value of column "dir3" is the UUID from which the RowKey of "dir3" is built. RowKeys are generated from a UUID with a naming convention, e.g. dirkey_uuid. An example entry in CF Directory:
RowKey: dirkey_372c5d87-4567-11e0-bc71-001a64631cb0
=> (column=dir3, value=3e180f00-459b-11e0-8846-001a64631cb0, timestamp=1299159388519845)
=> (column=f2, value=c69f2ac2-45a6-11e0-9c79-001a64631cb0, timestamp=1299329058698329)
=> (column=f3, value=ddd77c2e-45a5-11e0-934f-001a64631cb0, timestamp=1299328989534849)
=> (column=group, value=root, timestamp=1299137043078874)
=> (column=owner, value=root, timestamp=1299137043078874)
=> (column=p3, value=edf0ed73-45a6-11e0-bf90-001a64631cb0, timestamp=1299164408007020)
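
To make the layout concrete, here is a minimal sketch that models one Directory entry as a plain Python dictionary and separates the metadata columns from the child entries. The function split_row and the contents of META_KEY are assumptions made for illustration; the row values are taken from the example above.

```python
# Column names treated as metadata; every other column is a child entry.
# The contents of META_KEY are an assumption based on the columns shown above.
META_KEY = ("owner", "group")

def split_row(row):
    """Split a Directory row into (metadata, children) dictionaries."""
    meta = {}
    children = {}
    for name, value in row.items():
        if name in META_KEY:
            meta[name] = value
        else:
            children[name] = value  # value is the UUID of the child
    return meta, children

# Part of the example entry, written as column name -> value.
row = {
    "dir3": "3e180f00-459b-11e0-8846-001a64631cb0",
    "f2": "c69f2ac2-45a6-11e0-9c79-001a64631cb0",
    "group": "root",
    "owner": "root",
}
meta, children = split_row(row)
print(sorted(children))
# The RowKey of a child directory is the naming prefix plus the stored UUID.
print("dirkey_" + children["dir3"])
```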

1.2 Column Family FFile
CF FFile describes the metadata of a file. The basic metadata is owner, group, size, and content, where content is the content of the file itself. At present a whole file is stored in a single column. However, because the maximum size of each entry is limited to 64 MB by Thrift, larger files would have to be split into chunks and stored in different entries. As in CF Directory, each entry of CF FFile has columns such as owner, group, size, and content, and its RowKey is generated from a UUID with a naming convention (filekey_uuid). An example entry in CF FFile:
RowKey: filekey_edf0ed73-45a6-11e0-bf90-001a64631cb0
=> (column=content, value=
127.0.0.1       localhost.localdomain   localhost
202.122.33.12   lcg002.ihep.ac.cn   lcg002
192.168.56.11     lwn011.ihep.ac.cn    lwn011
....,timestamp=1299164408007882)
=> (column=group, value=root, timestamp=1299164408007882)
=> (column=owner, value=root, timestamp=1299164408007882)
=> (column=size, value=11281, timestamp=1299164408007882)
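
Chunking is described above but not implemented. The following is only a sketch of how a file could be split so that each chunk fits in its own entry. The chunk size, the function names, and the row key pattern filekey_uuid_index are all hypothetical and are not part of the data model used on this page.

```python
import uuid

# Assumed chunk size, chosen to stay well below the 64 MB limit mentioned above.
CHUNK_SIZE = 4 * 1024 * 1024

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Return a list of byte strings, each at most chunk_size bytes long."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def chunk_rows(data, chunk_size=CHUNK_SIZE):
    """Build one FFile-style row per chunk (hypothetical key: filekey_<uuid>_<index>)."""
    file_id = str(uuid.uuid1())
    rows = {}
    for index, chunk in enumerate(split_into_chunks(data, chunk_size)):
        rows["filekey_%s_%d" % (file_id, index)] = {"content": chunk}
    return file_id, rows

# A 10-byte payload with a 4-byte chunk size yields chunks of 4, 4 and 2 bytes.
file_id, rows = chunk_rows(b"x" * 10, chunk_size=4)
print(len(rows))
```

Reading the file back would then mean fetching the chunk rows in index order and concatenating their content columns.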

2. Create data model in Cassandra

You can use the Cassandra command-line client (cassandra-cli) to create the keyspace, the column families, and their attributes.

To connect to Cassandra:

[root@lcg003 apache-cassandra-0.7.0]# bin/cassandra-cli --host localhost

To create a keyspace (like a database in an SQL DB):

[default@unknown] create KeySpace FSNS;
9111f9d3-6be2-11e0-a3c6-e700f669bcfc
To create a column family (like a table in an SQL DB): "comparator" defines how column names are compared, and therefore how the columns within an entry are sorted; "validation_class" verifies the data type of column values. The available validation classes include AsciiType, BytesType, LexicalUUIDType, LongType, TimeUUIDType, and UTF8Type. Here we use UTF8Type for general columns such as owner and group, LongType for size, and BytesType for the content of a file (for both ASCII and binary files). To change an existing column family, use the command "update column family".

[default@unknown] use FSNS          
...   ;
Authenticated to keyspace: FSNS
[default@FSNS] create column family FFile with comparator=UTF8Type  and column_metadata=[{column_name:size,validation_class:LongType,Comparator:LongType},{column_name:content,validation_class:BytesType}];
0ccd2084-6be4-11e0-a3c6-e700f669bcfc

[default@FSNS] create column family Directory with comparator=UTF8Type and default_validation_class=UTF8Type;
6cbdac85-6be4-11e0-a3c6-e700f669bcfc
Insert the first entry in CF Directory. The RowKey of the root directory (/) is always fixed (dirkey_1), so it can be referred to as the starting point for walking through the namespace tree.
[default@FSNS] set Directory[dirkey_1][owner]='filestore';
Value inserted.
[default@FSNS] set Directory[dirkey_1][group]='filestore';
Value inserted.
[default@FSNS] list Directory;                            
Using default limit of 100
-------------------
RowKey: dirkey_1
=> (column=group, value=filestore, timestamp=1303369145333000)
=> (column=owner, value=filestore, timestamp=1303369135423000)

3. How to implement

The namespace has a tree-like structure. To get the metadata of a file or directory at a given path, you need to walk from the root (/) down to the leaf (the file or directory). Some scenarios:

3.1 make a directory: fs_mkdir /testdir1/testdir2/testdir3 (/testdir1/testdir2 already exists)
1. generate a key for the new entry: new_key=dirkey_`uuid`

2. walk from the root directory (/, key is dirkey_1) to get the key of the parent directory (testdir2), assuming the key is dirkey_XXX

3. insert a column in the parent directory entry (testdir2, with key dirkey_XXX). The column name is the name of the directory being created (testdir3), and its value is the UUID part of new_key

4. create a new entry with key new_key for the new directory, with all the metadata columns (owner, group)
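
The four steps can be sketched as follows. This is an illustration only: a plain dictionary stands in for CF Directory, the function names walk and fs_mkdir are made up for the example, and the short ids "aaa" and "bbb" replace real UUIDs. With pycassa, the dictionary reads and writes would become get and insert calls on the column family.

```python
import uuid

def walk(directory, path):
    """Follow each path component from the root row and return the final row key."""
    key = "dirkey_1"  # the root directory always has this key
    for name in [part for part in path.split("/") if part]:
        # The column named after the child holds the child's UUID.
        key = "dirkey_" + directory[key][name]
    return key

def fs_mkdir(directory, path, owner="root", group="root"):
    """Create the last component of path; the parent must already exist."""
    parent_path, _, name = path.rstrip("/").rpartition("/")
    new_id = str(uuid.uuid1())
    new_key = "dirkey_" + new_id                           # step 1
    parent_key = walk(directory, parent_path)              # step 2
    directory[parent_key][name] = new_id                   # step 3
    directory[new_key] = {"owner": owner, "group": group}  # step 4
    return new_key

# In-memory stand-in for CF Directory: row key -> {column name: value}.
directory = {
    "dirkey_1": {"owner": "root", "group": "root", "testdir1": "aaa"},
    "dirkey_aaa": {"owner": "root", "group": "root", "testdir2": "bbb"},
    "dirkey_bbb": {"owner": "root", "group": "root"},
}
new_key = fs_mkdir(directory, "/testdir1/testdir2/testdir3")
print(new_key)
```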

3.2 list a directory (fs_ls /testdir1/testdir2)
1. walk from the root directory (/, key is dirkey_1) to get the key of the directory (testdir2), assuming the key is dirkey_XXX

2. list the column names of the Directory entry with key dirkey_XXX, except for the metadata columns (group, owner, ...)

3.3 list a file (fs_ls /testdir1/testdir2/file1)
1. walk from the root directory (/, key is dirkey_1) to get the key of the parent directory (testdir2), assuming the key is dirkey_XXX

2. get the key of the file "file1" from the Directory entry with key dirkey_XXX, assuming the key of file1 is filekey_YYY

3. list the values of the metadata columns (owner/group/size) in the FFile entry with key filekey_YYY
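
Scenarios 3.2 and 3.3 share the same walk and differ only in the last step, which the sketch below illustrates. Plain dictionaries stand in for CF Directory and CF FFile, the names resolve and fs_ls are chosen for the example, and the ids "aaa", "bbb" and "ccc" replace real UUIDs.

```python
# Assumed metadata column names of a Directory entry.
META_KEY = ("owner", "group")

def resolve(directory, path):
    """Walk from the root (id "1") and return the id stored for the last component."""
    current_id = "1"
    for name in [part for part in path.split("/") if part]:
        current_id = directory["dirkey_" + current_id][name]
    return current_id

def fs_ls(directory, ffile, path):
    """Return (owner, group, size) for a file, or the sorted child names for a directory."""
    target_id = resolve(directory, path)
    file_row = ffile.get("filekey_" + target_id)
    if file_row is not None:
        return (file_row["owner"], file_row["group"], file_row["size"])
    row = directory["dirkey_" + target_id]
    return sorted(name for name in row if name not in META_KEY)

directory = {
    "dirkey_1": {"owner": "root", "group": "root", "testdir1": "aaa"},
    "dirkey_aaa": {"owner": "root", "group": "root", "testdir2": "bbb"},
    "dirkey_bbb": {"owner": "root", "group": "root", "file1": "ccc"},
}
ffile = {
    "filekey_ccc": {"owner": "root", "group": "root", "size": 1093, "content": b""},
}
print(fs_ls(directory, ffile, "/testdir1/testdir2"))
print(fs_ls(directory, ffile, "/testdir1/testdir2/file1"))
```

A missing path component surfaces here as a KeyError, which a command-line tool would turn into a "does not exist" message.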

Example Code in Python

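import sys

import pycassa

# FileStore is a helper module that is not shown on this page; the script
# uses its parse_path() function and its META_KEY list of metadata columns.
from FileStore import *


def list_dir(cf_dir, dir_id):
    '''List the contents of a directory.'''
    dir_key = "dirkey_" + str(dir_id)
    try:
        row = cf_dir.get(dir_key)
        for key in row.keys():
            # Columns that are not metadata are names of files or subdirectories.
            if key not in META_KEY:
                print(key)
    except Exception:
        # The entry could not be read: print an empty line.
        print("")


def usage():
    print('''
    %s file or dir path''' % (sys.argv[0]))


if len(sys.argv) != 2:
    usage()
    sys.exit(-1)

dirs = parse_path(sys.argv[1])
# Connect to the FSNS keyspace created in section 2.
pool = pycassa.connect('FSNS', ['localhost:9160'])
cf_dir = pycassa.ColumnFamily(pool, 'Directory')
cf_file = pycassa.ColumnFamily(pool, 'FFile')

# The root directory always has the fixed id 1 (RowKey dirkey_1).
p_dir_id = 1
if dirs[0] == "":
    list_dir(cf_dir, p_dir_id)
    sys.exit(0)

# Walk down the tree: each path component is a column of the current
# directory entry, and its value is the id of the next entry.
for c_dir in dirs:
    p_dir_key = "dirkey_" + str(p_dir_id)
    try:
        p_dir_id = cf_dir.get(p_dir_key)[c_dir]
    except Exception:
        print("%s does not exist" % sys.argv[1])
        sys.exit(2)

# If the last component is a file, print its metadata; otherwise treat it
# as a directory and list its contents.
file_key = "filekey_" + str(p_dir_id)
try:
    row = cf_file.get(file_key)
    owner = row.get('owner', 'wenjing')  # default if the column is missing
    group = row.get('group', 'atlas')    # default if the column is missing
    size = row['size']
    print("%8s%8s%10s%10s" % (owner, group, size, dirs[-1]))
except Exception:
    list_dir(cf_dir, p_dir_id)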
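
The script imports parse_path and META_KEY from the FileStore module, which is not shown on this page. The following is a guess at what they might look like, inferred only from how the script uses them: parse_path must return [""] for the root path, and META_KEY must contain the metadata column names. Treat both definitions as hypothetical.

```python
# Hypothetical stand-ins for the two names the script takes from FileStore.

# Metadata column names of a Directory entry (assumed).
META_KEY = ("owner", "group")

def parse_path(path):
    """Split a path into its components; the root path yields [""]."""
    parts = [part for part in path.split("/") if part]
    # The script tests dirs[0] == "" to detect the root directory.
    return parts or [""]

print(parse_path("/"))
print(parse_path("/testdir1/testdir2/file1"))
```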

4. Basic name space operations

The following operations have been implemented with the Python API: fs_ls, fs_cpr, fs_cpw, fs_mkdir, fs_mv, fs_rename, fs_rm.
fs_cpw
Write a file to the namespace and store the content of the file in Cassandra.
[root@lcg003 Cassandra]# fs_cpw /etc/profile /testdir2/p123
[root@lcg003 Cassandra]# fs_ls /testdir2/p123
    root    root      1093      p123
fs_cpr
Read a file from the namespace.
[root@lcg003 Cassandra]# fs_cpr /testdir2/p123 /tmp/profile
[root@lcg003 Cassandra]# diff /etc/profile /tmp/profile 
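
The source of fs_cpw and fs_cpr is not listed here, so the following is only a sketch of what the two operations could do against the data model from section 1. Dictionaries stand in for the two column families, the functions take and return bytes instead of local file paths, and the order of the two writes in fs_cpw is an assumption.

```python
import uuid

def resolve(directory, path):
    """Walk from the root (id "1") and return the id stored for the last component."""
    current_id = "1"
    for name in [part for part in path.split("/") if part]:
        current_id = directory["dirkey_" + current_id][name]
    return current_id

def fs_cpw(directory, ffile, data, dest, owner="root", group="root"):
    """Store data as a new FFile entry and link it into the parent directory."""
    parent_path, _, name = dest.rpartition("/")
    parent_id = resolve(directory, parent_path)
    file_id = str(uuid.uuid1())
    ffile["filekey_" + file_id] = {
        "owner": owner, "group": group, "size": len(data), "content": data,
    }
    directory["dirkey_" + parent_id][name] = file_id
    return file_id

def fs_cpr(directory, ffile, src):
    """Return the content column of the file at src."""
    return ffile["filekey_" + resolve(directory, src)]["content"]

directory = {
    "dirkey_1": {"owner": "root", "group": "root", "testdir2": "aaa"},
    "dirkey_aaa": {"owner": "root", "group": "root"},
}
ffile = {}
data = b"export PATH=/usr/bin\n"
file_id = fs_cpw(directory, ffile, data, "/testdir2/p123")
copy = fs_cpr(directory, ffile, "/testdir2/p123")
print(copy == data)
```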

fs_mkdir

Make a directory in the namespace.
[root@lcg003 Cassandra]# fs_mkdir /testdir3
[root@lcg003 Cassandra]# fs_ls /
file5
p6
testdir2
testdir3
fs_rename
Rename a file or a directory in the namespace.
[root@lcg003 Cassandra]# fs_rename -h

   ./fs_rename file_name1(directory_name1) file_name2(direcotry_name2)
[root@lcg003 Cassandra]# fs_rename /testdir3 /testdir_rename3
[root@lcg003 Cassandra]# fs_ls /
testdir2
testdir_rename3
[root@lcg003 Cassandra]# 
[root@lcg003 Cassandra]# fs_rename /p6 /p6_rename
[root@lcg003 Cassandra]# fs_ls /p6_rename
    root    root      1093 p6_rename
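
With the data model from section 1, a rename only has to touch the parent directory entry: the child's UUID moves from the old column name to the new one, and the child's own entry stays unchanged. A minimal sketch on a dictionary that stands in for one Directory entry (the function name rename_entry is made up for the example):

```python
def rename_entry(parent_row, old_name, new_name):
    """Move the UUID stored under old_name to new_name within one Directory row."""
    if new_name in parent_row:
        raise ValueError("%s already exists" % new_name)
    parent_row[new_name] = parent_row.pop(old_name)

# Stand-in for the root directory entry; "aaa" and "bbb" replace real UUIDs.
root = {"owner": "root", "group": "root", "testdir2": "aaa", "testdir3": "bbb"}
rename_entry(root, "testdir3", "testdir_rename3")
print(sorted(name for name in root if name not in ("owner", "group")))
```

Against Cassandra this would presumably be one column insert plus one column removal on the same row.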
fs_mv
Move a file or a directory. A file can be moved to another directory under the same name or a different name, or renamed within the same directory. A directory can likewise be moved to another directory under the same name or a different name.

[root@lcg003 Cassandra]# fs_mv -h

   ./fs_mv file_path1 file_path2
[root@lcg003 Cassandra]# fs_mv /file3 /file_rename3
[root@lcg003 Cassandra]# fs_ls /
file_rename3
testdir2
testdir3

[root@lcg003 Cassandra]# fs_mv /file_rename3 /testdir2/dir3/
[root@lcg003 Cassandra]# fs_ls /testdir2/dir3/file_rename3
    root    root      1093file_rename3
[root@lcg003 Cassandra]# fs_ls /testdir2
dir5
p123
p6_rename

[root@lcg003 Cassandra]# fs_mv /testdir2/p123 /testdir2/dir5/p123_rename
[root@lcg003 Cassandra]# fs_ls /testdir2/dir5/p123_rename
    root    root      1093p123_rename

[root@lcg003 Cassandra]# fs_mv /testdir2/dir4 /testdir2/dir5/
[root@lcg003 Cassandra]# fs_ls /testdir2/dir5
dir4
p123_rename
[root@lcg003 Cassandra]# fs_ls /testdir2/dir5/dir4/
p6_rename

[root@lcg003 Cassandra]# fs_mv /testdir2/dir5/dir4 /dir4_rename
[root@lcg003 Cassandra]# fs_ls /dir4_rename
p6_rename
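
A move generalizes a rename: the child's UUID is removed from the source parent entry and inserted into the destination parent entry, possibly under a new name. The sketch below follows the convention visible in the transcript, where a destination ending in "/" keeps the original name. It uses a dictionary in place of CF Directory, and the function names are illustrative rather than taken from the actual fs_mv script.

```python
def resolve(directory, path):
    """Walk from the root (id "1") and return the id stored for the last component."""
    current_id = "1"
    for name in [part for part in path.split("/") if part]:
        current_id = directory["dirkey_" + current_id][name]
    return current_id

def fs_mv(directory, src, dest):
    """Relink the entry at src under dest; a trailing "/" on dest keeps the name."""
    src_parent, _, src_name = src.rstrip("/").rpartition("/")
    if dest.endswith("/"):
        dest_parent, dest_name = dest, src_name
    else:
        dest_parent, _, dest_name = dest.rpartition("/")
    src_row = directory["dirkey_" + resolve(directory, src_parent)]
    dest_row = directory["dirkey_" + resolve(directory, dest_parent)]
    if dest_name in dest_row:
        raise ValueError("%s already exists" % dest)
    # Only the parent entries change; the moved entry keeps its own key.
    dest_row[dest_name] = src_row.pop(src_name)

directory = {
    "dirkey_1": {"owner": "root", "group": "root", "file3": "fff", "testdir2": "aaa"},
    "dirkey_aaa": {"owner": "root", "group": "root", "dir3": "bbb"},
    "dirkey_bbb": {"owner": "root", "group": "root"},
}
fs_mv(directory, "/file3", "/file_rename3")            # rename in place
fs_mv(directory, "/file_rename3", "/testdir2/dir3/")   # move, keep the name
print(sorted(directory["dirkey_bbb"]))
```

Because only column values are relinked, moving a directory costs the same as moving a file, regardless of how much it contains.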

fs_rm
Remove a file, or a directory together with its associated files and subdirectories, from the namespace.
[root@lcg003 Cassandra]# fs_rm -h

   ./fs_rm
   -r remove a direcotry and files associated with the direcotry
   -h print help information
[root@lcg003 Cassandra]# fs_rm /testdir2/dir3/file4
remove file /testdir2/dir3/file4

[root@lcg003 Cassandra]# fs_rm -r /testdir2/dir3
remove file  file_rename3
remove dir  dir3
[root@lcg003 Cassandra]# fs_ls /testdir2
dir5
p123
p6_rename
[root@lcg003 Cassandra]# 
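
Recursive removal has to visit every entry below the directory, because each file and subdirectory has its own row. The following sketch shows one possible depth-first approach on dictionaries that stand in for the two column families. The function names, the META_KEY contents, and the log messages are assumptions modeled on the transcript above, not the actual fs_rm code.

```python
# Assumed metadata column names of a Directory entry.
META_KEY = ("owner", "group")

def remove_tree(directory, ffile, entry_id, name, log):
    """Delete the file or directory with the given id, children first."""
    if "filekey_" + entry_id in ffile:
        del ffile["filekey_" + entry_id]
        log.append("remove file " + name)
        return
    row = directory.pop("dirkey_" + entry_id)
    for child_name, child_id in sorted(row.items()):
        if child_name not in META_KEY:
            remove_tree(directory, ffile, child_id, child_name, log)
    log.append("remove dir " + name)

def fs_rm(directory, ffile, parent_id, name):
    """Unlink name from its parent entry and delete everything below it."""
    parent_row = directory["dirkey_" + parent_id]
    log = []
    remove_tree(directory, ffile, parent_row.pop(name), name, log)
    return log

directory = {
    "dirkey_aaa": {"owner": "root", "group": "root", "dir3": "bbb", "p123": "ppp"},
    "dirkey_bbb": {"owner": "root", "group": "root", "file_rename3": "fff"},
}
ffile = {
    "filekey_fff": {"owner": "root", "group": "root", "size": 1093},
    "filekey_ppp": {"owner": "root", "group": "root", "size": 1093},
}
log = fs_rm(directory, ffile, "aaa", "dir3")
print("\n".join(log))
```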

5. Install the pycassa Python module (using easy_install)

5.1 install easy_install (the example below is for Python 2.4; download the installation file that matches your Python version)

$wget http://pypi.python.org/packages/2.4/s/setuptools/setuptools-0.6c11-py2.4.egg#md5=bd639f9b0eac4c42497034dec2ec0c2b

$sh setuptools-0.6c11-py2.4.egg

5.2 prerequisites:

Make sure you have the following Python modules installed:

$easy_install thrift

$easy_install uuid

Make sure you have the following RPMs installed:

python-libs
python-devel

5.3 install pycassa

$easy_install pycassa

5.4 test

$python

>>> import pycassa

Additional Material

  • Presentation made by Wenjing Wu for the project meeting on May 17th 2011: here

-- WenjingWu - 2011-08-25

Topic revision: r1 - 2011-08-25 - WenjingWu
 