GFS Primer

Best practices

  • Organize your data hierarchically and avoid putting more than a hundred
    files under the same directory. A large, flat directory is hard on any file
    system, even a local ext3/ext4 one; it is worse on NFS, and probably worst
    on a GFS file system.
  • Create your temporary file(s) on a local file system, such as /tmp or a
    locally mounted area under /home, depending on the size of the data (see
    the sketch at the end of this section).
  • Use the scratch and no-backup folders for large files; they provide
    additional space for temporary storage of large files. You can create
    these folders by running the makescratch and makenobackup commands.
  • Open files read-only by default when running programs to access/analyze
    data in the same directory, especially when doing so from different nodes
    (see the sketch at the end of this section).
  • Disable colored output in ls. When a GFS file system is slow to respond,
    the first reaction of many users is to run ls to investigate. If the
    --color option is enabled, ls must run a stat() against every entry, which
    creates additional lock requests and can create contention for those files
    with other processes. This can exacerbate the problem and further slow
    down processes accessing that file system. In general it is best to avoid
    excessive use of the ls command altogether because of this locking
    overhead. By the same token, it is probably a good idea to disable the
    file-completion feature in tcsh/bash as well.

By default, RHEL systems are configured with the following aliases from
/etc/profile.d/colorls.sh and colorls.csh:

alias l.='ls -d .* --color=tty' 
alias ll='ls -l --color=tty' 
alias ls='ls --color=tty'

If disabling the ls coloring aliases altogether is too inconvenient (since the
home directory is shared among the unified Linux platforms, disabling them
would also take effect when you are not logged in to the 4 cluster machines),
you can bypass an alias for a single invocation, rather than for the whole
login session, by prefixing the command with “\” or by using the full path, as
shown in the following examples:

\ls a*

or

/bin/ls a*
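
As a concrete illustration of the temporary-file and read-only practices
above, here is a minimal sketch in C (the paths and file names are
hypothetical placeholders, not site conventions). It opens a shared data file
read-only, so other nodes can cache the same inode simultaneously, and writes
intermediate results to a scratch file on the local /tmp file system, where
writes generate no cluster lock traffic:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Shared input on the GFS mount (placeholder path): O_RDONLY lets
     * every node cache this inode at the same time. */
    int in = open("/gfs/project/data/input.dat", O_RDONLY);
    if (in < 0) { perror("open input"); return 1; }

    /* Scratch output goes to the local file system, not to GFS. */
    char tmpl[] = "/tmp/analysis-XXXXXX";
    int out = mkstemp(tmpl);
    if (out < 0) { perror("mkstemp"); return 1; }

    char buf[65536];
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0) {
        /* ... analyze buf; keep intermediate results local ... */
        if (write(out, buf, n) != n) { perror("write"); return 1; }
    }

    unlink(tmpl);   /* remove the scratch file when done */
    close(out);
    close(in);
    return 0;
}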

Advanced topics

Theory of operation

Both GFS and GFS2 work like local file systems, except with regard to caching.
In GFS/GFS2, caching is controlled by glocks. There are two essential things
to know about caching in order to understand GFS/GFS2 performance
characteristics:

  • The cache is split between nodes: either only a single node may cache a
    particular part of the file system at one time, or, when a particular part
    of the file system is being read but not modified, multiple nodes may
    cache the same part of the file system simultaneously. Caching granularity
    is per inode or per resource group, so each object is associated with a
    glock that controls its caching.
  • There is no other form of communication between GFS/GFS2 nodes in the file
    system. All cache-control information comes from the glock layer and the
    underlying lock manager (DLM).

When a node makes an exclusive-use access request (for a write or modification
operation) to locally cache some part of the file system that is currently in
use elsewhere in the cluster, all the other cluster nodes must write out any
pending changes and empty their caches. If a write or modification operation
has just been performed on another node, this requires both log flushing and
writing back of data, which can be tremendously slower than accessing data
that is already cached locally.

These caching principles apply to directories as well as to files. Adding or
removing a directory entry (i.e., creating or deleting a file in a directory) is
the same (from a caching point of view) as writing to a file, and reading the
directory or looking up a single entry is the same as reading a file.
Operations are slower on larger files and directories, although the cost also
depends on how much of the file or directory must be read to complete the
operation.

Reading cached data can be very fast. In GFS2, the code path used to read cached
data is almost identical to that used by the ext3/ext4 file system: the read
path goes directly to the page cache in order to check the page state and copy
the data to the application. A call into the file system to refresh the pages
is made only if the pages are missing or out of date. GFS works slightly
differently: it wraps the read call directly in a glock; however, reading data
that is already cached this way is still fast. You can read the
same data at the same speed in parallel across multiple nodes, and the effective
transfer rate can be very large. It is generally possible to achieve acceptable
performance for most applications by being careful about how files are accessed.

Read/Write

Read/write performance should be acceptable for most applications, provided you
are careful not to cause too many cross-node accesses that require cache sync
and/or invalidation. Streaming writes on GFS2 are currently slower than on GFS.
This is a direct consequence of the locking hierarchy and results from GFS2
performing writes on a per-page basis like other (local) file systems. Each page
written has a certain amount of overhead. Due to the different lock ordering,
GFS does not suffer from the same problem since it is able to perform the
overhead operations once for multiple pages. Speed of multiple-page write calls
aside, there are many advantages to the GFS2 file system, including faster
performance for cached reads and simpler code for deadlock avoidance during
complicated write calls (for example, when the source page being written is from
a memory-mapped file on a different file system type).

Memory Mapping

GFS and GFS2 implement memory mapping differently. In GFS (and some earlier GFS2
kernels), a page fault on a writable shared mapping would always result in an
exclusive lock being taken for the inode in question. This is a consequence of
an
optimization that was originally introduced for local file systems where pages
would be made writable on the initial page fault in order to avoid a potential
second fault later (if the first access was a read and a subsequent access was a
write). In Red Hat Enterprise Linux 6 (and some later Red Hat Enterprise
Linux 5) kernels, GFS2 instead provides a read-only mapping for read requests,
which significantly improves scalability. A file
that is mapped on multiple nodes of a GFS2 cluster in a shared writable manner
can be cached on all nodes, provided no writes occur.
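
As a rough sketch of the access pattern this change rewards (hypothetical: the
path is a placeholder, and the caching behavior assumes a GFS2 kernel with the
read-only-mapping optimization described above), the program below maps a file
shared-writable but only reads it, so every node can keep the pages cached
until one of them actually stores to the mapping:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gfs2/shared.dat", O_RDWR);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Shared writable mapping: on a read-fault-aware GFS2 kernel, read
     * faults take only a shared lock, so all nodes can cache the pages. */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)   /* reads only: caches survive */
        sum += p[i];
    printf("checksum: %ld\n", sum);

    /* A single store, e.g. p[0] = 0, would fault the page writable,
     * taking an exclusive lock and invalidating other nodes' caches. */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}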

Cache Control (fsync/fadvise/madvise)

Both GFS and GFS2 support fsync(2), which functions the same way as in any
local file system. When using fsync(2) with numerous small files, Red Hat
recommends sorting them by inode number. This will improve performance by
reducing the disk seeks required to complete the operation. If it is possible
to defer fsync(2) for a set of files and then sync them together, performance
will be better than with O_SYNC or an fsync(2) after each individual write. To
improve performance with GFS2, you can use the fadvise and/or
madvise pair of system calls to request read ahead or cache flushing when it is
known that data will not be used again (GFS does not support the
fadvise/madvise interface). Overall performance can be significantly
improved by flushing the page cache for an inode when it will not be used from a
particular node again and is likely to be requested by another node.
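
The sketch below (hypothetical: flush_batch is an illustrative helper, not an
existing API) combines these recommendations. It fsyncs a batch of written
files in inode order to reduce seeks, then calls
posix_fadvise(POSIX_FADV_DONTNEED) to drop the cached pages for files this
node will not touch again, so another node can acquire them more cheaply:

#define _XOPEN_SOURCE 600   /* for posix_fadvise */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

struct synced_file {
    int   fd;
    ino_t ino;
};

static int by_inode(const void *a, const void *b)
{
    ino_t ia = ((const struct synced_file *)a)->ino;
    ino_t ib = ((const struct synced_file *)b)->ino;
    return (ia > ib) - (ia < ib);
}

/* Flush a batch of already-written files in inode order. */
int flush_batch(struct synced_file *files, size_t n)
{
    struct stat st;
    size_t i;

    for (i = 0; i < n; i++) {
        if (fstat(files[i].fd, &st) < 0)
            return -1;
        files[i].ino = st.st_ino;
    }

    qsort(files, n, sizeof files[0], by_inode);   /* sort by inode number */

    for (i = 0; i < n; i++) {
        if (fsync(files[i].fd) < 0)
            return -1;
        /* This node is done with the file: drop its cached pages. */
        posix_fadvise(files[i].fd, 0, 0, POSIX_FADV_DONTNEED);
    }
    return 0;
}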

File Locking

The locking methods below are advisory only, as GFS and GFS2 do not support
mandatory locks.

The flock system call

The flock system call is implemented by type 6 glocks and works across the
cluster in the normal way. It is affected by the localflocks mount option, as
described below. Flocks are a relatively fast method of file locking and are
preferred to fcntl locks on performance grounds (the difference becomes greater
on clusters with larger node counts).
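
A minimal sketch of cluster-wide mutual exclusion with flock (the lock-file
path is a hypothetical placeholder): because flocks work across the cluster,
the critical section below is serialized against every process on every node
that locks the same file:

#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gfs2/jobs/job.lock", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* LOCK_EX blocks until no other holder remains, on any node;
     * add LOCK_NB to fail immediately instead of waiting. */
    if (flock(fd, LOCK_EX) < 0) { perror("flock"); return 1; }

    /* ... critical section: exclusive use of the shared resource ... */

    flock(fd, LOCK_UN);   /* also released automatically on close/exit */
    close(fd);
    return 0;
}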

Which Inode is Contended?

Glock numbers are made up of two parts. In the glock dump, glock numbers are
represented as type, number. Type 2 indicates an inode, and type 3 indicates a
resource group. There are additional types of glocks, but the majority of
slowdown and glock-contention issues will be associated with these two glock
types.

The number of a type 2 glock (an inode glock) gives the disk location (in
file-system blocks) of the inode and also serves as the inode identification
number. You can convert the inode number listed in the glock dump from
hexadecimal to decimal and use it to track down the inode associated with that
glock. Identifying the contended inode should be possible using find -inum,
preferably on an otherwise idle file system, since the search will try to read
all the inodes in the file system, making any contention problem worse.
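
For example (the glock number and mount point here are hypothetical), if the
glock dump shows heavy contention on glock 2/805f, the inode glock number
0x805f converts to decimal 32863, and the corresponding file can then be
located with:

find /mnt/gfs2 -inum 32863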