Questions to be answered

  • From "the problem" perspective
    • Q1: What is the "small files" problem?
    • Q2: Why is there a "small files" problem?
    • Q3: How to fix the "small files" problem?
  • From the use case perspective
    • Q1: How are small files generated in the HEP field? Any other scientific fields?
    • Q2: What are the small files used for? Read? Write? Copy? Transfer?
    • Q3: What is the scale of clients? Thousands? Millions?

Investigation of "small files" problem

Q2: Why is there a "small files" problem?

Efficient Access to Many Small Files in a Filesystem for Grid Computing, Douglas Thain et al., Sep. 2007

  • Unfortunately, the data throughput of small file operations on both networks and filesystems is many orders of magnitude worse than the bulk transfer speeds available with large files. On the network, this is because protocols such as
    • FTP [15] treat a single file as a distinct heavyweight transaction that requires an individual network stream and authentication step.
    • A network filesystem such as NFS [16] has the opposite problem: files are accessed on demand in small page-sized chunks, resulting in many network round trips and poor performance. This is particularly harmful in grids, where high network latencies are common.
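
The per-file overhead described above can be illustrated with a rough cost model. This is only a sketch under stated assumptions: the function names, the 50 ms per-file setup cost, and the 100 MB/s link speed are illustrative values chosen here, not measurements from the paper.

```python
def transfer_time(num_files, file_size_bytes, per_file_overhead_s, bandwidth_bytes_per_s):
    """Seconds needed to move num_files files when every file pays a fixed
    setup cost (connection, authentication, round trips) before its payload."""
    per_file = per_file_overhead_s + file_size_bytes / bandwidth_bytes_per_s
    return num_files * per_file


def effective_throughput(num_files, file_size_bytes, per_file_overhead_s, bandwidth_bytes_per_s):
    """Useful bytes per second actually delivered, overhead included."""
    total_bytes = num_files * file_size_bytes
    return total_bytes / transfer_time(
        num_files, file_size_bytes, per_file_overhead_s, bandwidth_bytes_per_s
    )


# Assumed (hypothetical) link: 100 MB/s bandwidth, 50 ms setup cost per file.
BANDWIDTH = 100 * 10**6
OVERHEAD = 0.05

# The same 1 GB of data, moved as one large file or as a million 1 kB files.
one_large = effective_throughput(1, 10**9, OVERHEAD, BANDWIDTH)
many_small = effective_throughput(10**6, 10**3, OVERHEAD, BANDWIDTH)

print(f"one 1 GB file    : {one_large / 1e6:.1f} MB/s")
print(f"1e6 files of 1 kB: {many_small / 1e3:.1f} kB/s")
```

With these assumed numbers the fixed cost dominates the small-file case almost entirely, which is the effect the excerpt attributes to FTP-style per-file transactions.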

Q3: How to fix the "small files" problem?


  • from the asker: The filesystems that I tried (ext4, btrfs) have some problems with positioning of files on disk. Over a longer span of time, the physical positions of files on the disk (rotating media, not solid state disk) are becoming more randomly distributed. The negative consequence of this random distribution is that the filesystem is getting slower (such as: 4 times slower than a fresh filesystem).
  • from the answer, possible candidate:

Investigation of "small files" use cases

Q1: How are small files generated in the HEP field? Any other scientific fields?

Efficient Access to Many Small Files in a Filesystem for Grid Computing, Douglas Thain et al., Sep. 2007

  • There are also many production workloads that manipulate primarily large numbers of small files. For example
    • in bioinformatics applications such as BLAST [14], it is common to run thousands of small (less than 100 bytes) string queries against a constant genomic database of several
    • In other cases, a standard grid application may depend on the installation of a complex software package consisting of executables, dynamic libraries, and configuration files. The package must either be accessed at run-time over the network, resulting in many small network operations, or installed on a worker node, resulting in a large number of small file creations.
    • Or, a grid computing system may create a large number of small files internally in the course of executing a workload for the inputs, outputs, log files, and so forth.

Distributed File System


  • local file systems
    • ext[34]'s focus is familiarity
    • XFS's focus is higher scale - more disks, more cores, etc.
    • btrfs's focus is advanced technology, esp. snapshots and online repair capability.
  • non-local file systems
    • NFSv3 largely about familiarity, broad feature support, good performance for common workloads
    • GFS2 & OCFS2, shared-storage filesystems, focused on high availability & high consistency
    • Lustre focused on very high performance on "embarrassingly parallel" I/O workloads when backed by top-of-the-line hardware
    • PVFS2 has historically pursued a similar track (as Lustre), though the OrangeFS branch is trying to address general-purpose needs more
    • GlusterFS is focused on general-purpose use with cheap commodity hardware, and on flexibility/modularity
    • Ceph is focused on the very latest algorithms to scale up to petabytes, even though the production-level implementation of those algorithms might take longer than a simpler approach would have
    • HDFS and GoogleFS are highly specialized for the needs of their respective creators
    • worthy of mention are XtreemFS and Gfarm (both focused on wide-area distribution), Pomegranate (optimizing for very many small files), and Tahoe-LAFS (privacy/security)



  • Network = one server, or appearance thereof.
  • Cluster = multiple servers with a "shared disk" model (SAN with failover)
  • Distributed = multiple servers with a "shared nothing" model (server-local disks with or without replication)
  • NFSv3 is a network filesystem.
  • GFS2 is a cluster filesystem.
  • GlusterFS is a distributed filesystem.
  • NFSv4, especially with pNFS, is a bit of a weird network/distributed hybrid. It's still fundamentally a single-server model, but with little bits of multi-server support grafted on


  • from Jeff Darcy, HekaFS (née CloudFS) project lead
    • For local filesystems:
      • there's still a lot of work on B-tree based filesystems like ZFS and btrfs, but I think that has passed its peak as a research area and the focus will shift elsewhere
      • Making all layers of the storage stack work better with SSDs was a huge topic at FAST'11
      • Better repair and data-integrity guarantees are also getting more attention as capacities continue to outstrip speeds by ever greater ratios
    • For distributed filesystems
      • the biggest challenge IMO (In My Opinion) - and this is said as a filesystem developer - is keeping them relevant
      • I think Gluster is on the right track offering both filesystem and object (S3/Swift) APIs on top of the same basic infrastructure, as is Ceph with filesystem and block APIs (RBD)
      • Another area of inquiry is providing ways for higher-level systems such as Hadoop to reason about the physical location of data that has been put into a distributed filesystem (or for that matter any other kind of storage besides Hadoop's own HDFS)
  • from Kartik Ayyar, Distributed filesystem developer
  • from Ravi Tandon
    • I believe file systems are moving towards a flat, object-oriented structure. File system sizes are getting too large to store them hierarchically. Flat file systems are those where files, directories, and symbolic links are objects that can be tagged. These tags will be semantically generated depending on the content of the files. These tags can then be hashed to locate a file. It has two primary advantages:
      • First, file system objects can be clustered together in a better manner semantically.
      • Secondly, search time would be greatly enhanced due to object tagging. File system storage technologies are moving towards log-structured file systems. Flash disks and SSDs would propel file systems of the future.
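
The "tags hashed to locate a file" idea above can be sketched as a small in-memory store. This is a toy illustration under assumptions, not a description of any existing filesystem: the class name `FlatTagStore`, the bucket count, and the choice of SHA-256 are all hypothetical, and tags are supplied by the caller rather than generated from file content.

```python
import hashlib
from collections import defaultdict


class FlatTagStore:
    """Toy flat namespace: objects have no directory path, only tags.
    Each tag is hashed to pick the bucket that holds its object list."""

    def __init__(self, num_buckets=16):
        self.num_buckets = num_buckets
        # One tag -> object-id mapping per bucket (a bucket stands in for a server).
        self.buckets = [defaultdict(set) for _ in range(num_buckets)]
        self.objects = {}

    def _bucket(self, tag):
        # Hash the tag so the bucket can be computed without any tree traversal.
        digest = hashlib.sha256(tag.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_buckets

    def put(self, object_id, data, tags):
        self.objects[object_id] = data
        for tag in tags:
            self.buckets[self._bucket(tag)][tag].add(object_id)

    def find(self, tag):
        """Return the ids of all objects carrying this tag."""
        return set(self.buckets[self._bucket(tag)].get(tag, set()))


store = FlatTagStore()
store.put("obj-1", b"run 1 histogram", tags=["physics", "histogram"])
store.put("obj-2", b"run 1 log", tags=["physics", "log"])
print(sorted(store.find("physics")))
```

A lookup touches exactly one bucket regardless of how many objects exist, which is the search-time advantage the answer claims for tagging over hierarchical traversal.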

  • from Jim Dennis
    • need to resolve the problem of traversal
      • I think we need to see filesystems adding support for their own indexing (essentially providing continuously/incrementally updated indexes for various attributes and a query language and engine for receiving subsets of those) and for providing enhanced "inotify/dnotify" related APIs (to allow applications to register themselves for event-driven notification of changes to directories, file contents, and/or file metadata for whole trees).
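
The combination asked for above (an incrementally maintained attribute index plus tree-wide change notification) can be sketched in memory. This is only an illustration of the idea: `IndexedTree` and its methods are hypothetical names, and it does not use the real inotify or dnotify kernel interfaces.

```python
from collections import defaultdict


class IndexedTree:
    """Toy namespace that keeps attribute indexes up to date on every change
    and notifies watchers registered on a path prefix (a whole subtree)."""

    def __init__(self):
        self._meta = {}                                       # path -> attributes
        self._index = defaultdict(lambda: defaultdict(set))   # attr -> value -> paths
        self._watchers = []                                   # (prefix, callback)

    def watch(self, prefix, callback):
        """Register callback(path, attrs) for every change under prefix."""
        self._watchers.append((prefix, callback))

    def update(self, path, **attrs):
        """Create or replace a file's attributes, fixing the indexes in place."""
        for name, value in self._meta.get(path, {}).items():
            self._index[name][value].discard(path)
        self._meta[path] = dict(attrs)
        for name, value in attrs.items():
            self._index[name][value].add(path)
        for prefix, callback in self._watchers:
            if path.startswith(prefix):
                callback(path, attrs)

    def query(self, **criteria):
        """Return paths matching all attribute=value pairs, with no traversal."""
        result = None
        for name, value in criteria.items():
            paths = self._index[name][value]
            result = set(paths) if result is None else result & paths
        return set(self._meta) if result is None else result


tree = IndexedTree()
events = []
tree.watch("/data/", lambda path, attrs: events.append(path))
tree.update("/data/a.root", owner="alice", ext="root")
tree.update("/logs/b.log", owner="alice", ext="log")
print(sorted(tree.query(owner="alice")), events)
```

Because the index is adjusted on each update, a query never has to walk the tree, which is the traversal problem the answer wants filesystems to solve natively.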

  • from Roy Lacombe -- from the perspective of hardware changes
    • I think adoption of memristors (why isn't this word in my Safari's spellchecker already?) alone can result in big paradigm shifts not only in file systems, but in OS philosophy and in programming too.

  • from Alan Cohen -- local filesystem
    • I think we will see file systems that are more like ZFS.

Cloud storage system


OpenStack Swift


Object-based storage system





Object-based storage by Oracle



  • Why are traditional file systems rapidly becoming obsolete?
    • First, there's the "lots of objects" problem
    • Second, there's the metadata problem.
    • Third, there's the policies and services problem
  • Object-based information stores are different than filesystems in several important ways
    • First, you use a token or other uniform identifier to get your information
    • Second, they have the ability to associate all sorts of metadata with the object itself
    • Third, the ability to hang metadata off the object gives us the ability to create all sorts of useful policies and services around the information without having to put everything in some sort of database or repository.
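
The three properties listed above (token-based access, metadata attached to the object, policies driven by that metadata) can be sketched as a minimal in-memory store. The class `ObjectStore`, its method names, and the `retention_days` field are hypothetical and chosen only for illustration; they are not the API of any real object storage product.

```python
import uuid


class ObjectStore:
    """Toy object store: data is addressed by an opaque token, not a path,
    and arbitrary metadata travels with each object."""

    def __init__(self):
        self._objects = {}  # token -> (data, metadata)

    def put(self, data, **metadata):
        """Store data and return the token needed to retrieve it."""
        token = uuid.uuid4().hex
        self._objects[token] = (data, dict(metadata))
        return token

    def get(self, token):
        return self._objects[token][0]

    def metadata(self, token):
        return dict(self._objects[token][1])

    def select(self, predicate):
        """Return tokens whose metadata satisfies predicate: the hook on
        which policies and services can be built without a separate database."""
        return [t for t, (_, meta) in self._objects.items() if predicate(meta)]


objects = ObjectStore()
raw = objects.put(b"raw detector data", experiment="demo", retention_days=3650)
tmp = objects.put(b"scratch output", experiment="demo", retention_days=7)

# Example policy (assumed): find objects that may be purged after a week.
purgeable = objects.select(lambda meta: meta.get("retention_days", 0) <= 7)
print(purgeable == [tmp])
```

The policy is expressed purely against per-object metadata, so no directory layout or external catalogue has to encode it.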


  • In short, any data-intensive job is a good target for parallel filesystems. However, you're likely to see more gains on large I/Os than you are on small I/Os because smaller I/Os have a heavier metadata component.



FUSE: Filesystem in Userspace

B Tree


-- DuRan - 2013-01-22
Topic revision: r8 - 2013-01-29 - DuRan