
Reading notes: papers, blogs, and books

1 Investigation of the "small files problem"

Need Help with "scale out NAS" or "Object-based system"

  • application scenario:
    • file services for PDFs and scanned documents
    • millions of small files (<20 MB?)
    • file retrieval time rises sharply

The Best File Storage Solutions for Small Files

  • introduces online file-service providers classified by file type
    • for work and school documents, which may need online editing, two providers are recommended: Dropbox and Box.com
    • for static photos and other documents kept only for archiving, two providers are recommended as well: Mozy and ZipCloud

2 Object-based Storage System investigation



Small files storage problem with Lustre

Small files storage problem with the application on Lustre

Why are small files produced?



  • Gluster combines object and file storage
    • it has been mentioned that: "Most object storage products such as EMC Atmos, Scality, OpenStack and Amazon S3 don't have file systems"
    • if OpenStack Swift can solve the "small files storage" problem well, then how about developing a file system interface for Swift?

Small files storage problem in Gluster



Small files storage problem in Ceph

3 Cloud storage system

Openstack Swift

HDFS: Hadoop Distributed File System


Small files problem in HDFS

  • The small files problem
    • A small file is one significantly smaller than the HDFS block size (default 64 MB).
    • Files, directories, and blocks are represented as objects in the namenode's memory, each occupying roughly 150 bytes. With lots of (for example, millions or billions of) small files this adds up fast: 10 million one-block files cost ~3 GB of namenode memory, and a billion cost ~300 GB, which is not acceptable for current hardware capacity.
    • Besides, the access pattern of HDFS makes retrieving small files slow, because every read has to hop from the namenode to the datanodes.
    • Two situations generate (lots of) small files: one is the appending operation in HDFS; the other is the inherent nature of the files themselves, e.g., photos, which are naturally small.
    • To alleviate the small files problem in HDFS (really, the poor access-time performance), there are three methods:
      • use HAR (Hadoop Archives) files, which work by adding another filesystem layer on top of HDFS; the drawback is that reading files through the HAR filesystem is less efficient than reading large files directly from HDFS, because two index reads are required first
      • use sequence files, where the key is the file name and the value is the file data
      • use MapFile files, i.e., a sorted SequenceFile with an index that allows lookups by key
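The namenode memory estimate and the "pack many small files into one indexed container" idea above can be sketched in plain Python. This is a simplified stand-in for illustration only, not the real Hadoop SequenceFile/MapFile API; the 150-bytes-per-object figure is the commonly cited namenode cost, and all function names here are made up for the sketch:

```python
import struct

# Each file, directory, and block is held in namenode memory as an
# object of roughly 150 bytes.
BYTES_PER_OBJECT = 150

def namenode_memory(num_files, blocks_per_file=1):
    """Rough namenode memory cost: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 10 million one-block files -> 3,000,000,000 bytes (~2.8 GiB)

def pack(container_path, files):
    """Write {name: bytes} records into one container file, returning an
    in-memory index of byte offsets (the SequenceFile/MapFile idea in
    miniature)."""
    index = {}
    with open(container_path, "wb") as out:
        for name, data in files.items():
            index[name] = out.tell()
            name_b = name.encode()
            # record layout: name length, data length, name, data
            out.write(struct.pack(">II", len(name_b), len(data)))
            out.write(name_b)
            out.write(data)
    return index

def read_one(container_path, index, name):
    """Fetch a single small file via the index, without scanning the rest."""
    with open(container_path, "rb") as f:
        f.seek(index[name])
        name_len, data_len = struct.unpack(">II", f.read(8))
        f.seek(name_len, 1)  # skip the stored name
        return f.read(data_len)
```

The point of the sketch: one large container replaces millions of tiny namenode objects, and the index keeps single-file reads cheap, which is exactly the trade the HAR/SequenceFile/MapFile approaches make.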

4 Small files storage requirements


in the Grid Computing environment

other scientific areas

in high energy physics field

-- DuRan - 2013-01-16

Topic revision: r7 - 2013-01-17 - DuRan