Power-safe file system

The power-safe file system is a reliable disk file system that can withstand power failures without losing or corrupting data. It is designed for traditional rotating hard disk drive media. This file system is supported by the fs-qnx6.so shared object.

Problems with existing disk file systems

Although existing disk file systems are designed to be robust and reliable, there's still the possibility of losing data, depending on what the file system is doing when a catastrophic failure (such as a power failure) occurs. For example:

  • Each sector on a hard disk includes a 4-byte error-correcting code (ECC) that the drive uses to catch hardware errors and so on. If the drive is writing to the disk when the power fails, the heads are retracted to prevent them from crashing on the surface, leaving the sector half-written with the new content. The next time you try to read that block (or sector), the inconsistent ECC causes the read to fail, so you lose both the old and new content.

    You can get hard drives that offer atomic sector updates and promise you that either all of the old or all of the new data in the sector is readable, but these drives are rare and expensive.

  • Some file system operations require updating multiple on-disk data structures. For example, if a program calls unlink(), the file system has to update a bitmap block, a directory block, and an inode, which means it has to write three separate blocks. If the power fails between writing these blocks, the file system is in an inconsistent state on the disk. Critical file system data, such as updates to directories, inodes, extent blocks, and the bitmap, is written synchronously to the disk in a carefully chosen order to reduce (but not eliminate) this risk.
  • You can use chkfsys to check for consistency on a QNX 4 file system, but it checks only the file system's structure and metadata, not the user's file data, and it can be slow if the disk is large or there are many directories on it.
  • If the root directory, the bitmap, or inode file (all in the first few blocks of the disk) gets corrupted, you wouldn't be able to mount the file system at all. You might be able to manually repair the system, but you need to be very familiar with the details of the file system structure.

Copy-on-write file system

To address the problems associated with existing disk file systems, the power-safe file system never overwrites live data; it does all updates using copy-on-write (COW), assembling a new view of the file system in unused blocks on the disk. The new view of the file system becomes live only when all the updates are safely written on the disk. Everything is COW: both metadata and user data are protected.

To see how this works, let's consider how the data is stored. A power-safe file system is divided into logical blocks, the size of which you can specify when you use mkqnx6fs to format the file system. Each inode includes 16 pointers to blocks. If the file is smaller than 16 blocks, the inode points to the data blocks directly. If the file is any bigger, those 16 blocks become pointers to more blocks, and so on.

The final block pointers to the real data are all in the leaves and are all at the same level. In some other file systems—such as EXT2—a file always has some direct blocks, some indirect ones, and some double indirect, so you go to different levels to get to different parts of the file. With the power-safe file system, all the user data for a file is at the same level.
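
As a rough illustration of this layout, the following Python sketch computes how many levels of indirect pointers sit between an inode and a file's data. The 16 pointers per inode come from the text above. The number of pointers that fit in one indirect block (`ptrs_per_block`) is an assumption made for the example (a 4096-byte block holding 4-byte pointers), not a documented fs-qnx6 value, and the sketch treats a file of exactly 16 blocks as fitting directly.

```python
def indirect_levels(file_blocks, inode_ptrs=16, ptrs_per_block=1024):
    """Return the number of indirect levels between the inode and the data.

    0 means the inode points at the data blocks directly. All data blocks
    of a file sit at the same depth, so one number describes the whole file.
    ptrs_per_block is an assumed value (4096-byte block / 4-byte pointer).
    """
    if file_blocks < 0:
        raise ValueError("file_blocks must be non-negative")
    levels = 0
    reachable = inode_ptrs           # data blocks addressable at this depth
    while file_blocks > reachable:
        levels += 1
        reachable *= ptrs_per_block  # each pointer fans out one more time
    return levels


if __name__ == "__main__":
    for blocks in (1, 16, 17, 16 * 1024, 16 * 1024 + 1):
        print(blocks, "blocks ->", indirect_levels(blocks), "indirect level(s)")
```

Unlike a layout with separate direct, indirect, and double-indirect pointers, a single depth applies to every block of the file, so growing past a threshold deepens the whole tree by one level.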

Diagram showing inodes, indirect pointers, and blocks.

If you change some data, it's written in one or more unused blocks, and the original data remains unchanged. The list of indirect block pointers must be modified to refer to the newly used blocks, but again the file system copies the existing block of pointers and modifies the copy. The file system then updates the inode (again by modifying a copy) to refer to the new block of indirect pointers. When the operation is complete, the original data and the pointers to it remain intact, but there's a new set of blocks, indirect pointers, and inode for the modified data:

Diagram showing COW inodes, indirect pointers, and blocks.
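
The copy-and-modify sequence described above can be modeled in a few lines of Python. This is a toy sketch, not the fs-qnx6 implementation: `Disk`, `cow_write`, and `read` are hypothetical names, blocks live in a dictionary, and a pointer block is simply a tuple of child addresses.

```python
class Disk:
    """Toy block store: a block is written once and never modified in place."""

    def __init__(self):
        self.blocks = {}
        self.next_free = 0

    def alloc(self, content):
        """Store content in a previously unused block and return its address."""
        addr = self.next_free
        self.next_free += 1
        self.blocks[addr] = content
        return addr


def cow_write(disk, root, path, data):
    """Write data at the leaf reached by path; return the new root address.

    Every block along the path is copied to a fresh address. Blocks that
    are not on the path are shared between the old and new trees.
    """
    if not path:
        return disk.alloc(data)                # new data block
    pointers = list(disk.blocks[root])         # copy the pointer block
    index = path[0]
    pointers[index] = cow_write(disk, pointers[index], path[1:], data)
    return disk.alloc(tuple(pointers))         # modified copy, new address


def read(disk, root, path):
    """Follow path from root and return the block found at the end."""
    addr = root
    for index in path:
        addr = disk.blocks[addr][index]
    return disk.blocks[addr]


if __name__ == "__main__":
    disk = Disk()
    a = disk.alloc(b"old A")
    b = disk.alloc(b"old B")
    indirect = disk.alloc((a, b))
    inode = disk.alloc((indirect,))
    new_inode = cow_write(disk, inode, [0, 1], b"new B")
    print(read(disk, inode, [0, 1]), read(disk, new_inode, [0, 1]))
```

After the write, the old inode still leads to the old data, while the new inode leads to a new indirect block that shares the untouched data block with the old tree.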

This has several implications for the COW file system:

  • The bitmap and inodes are treated in the same way as user files.
  • Any file system block can be relocated, so there aren't any fixed locations, such as those for the root block or bitmap in the QNX 4 file system.
  • The file system must be completely self-referential.

A superblock is a global root block that contains the inodes for the system bitmap and inodes files. A power-safe file system maintains two superblocks:

  • A stable superblock that reflects the original version of all the blocks
  • A working superblock that reflects the modified data

The working superblock can include pointers to blocks in the stable superblock. These blocks contain data that hasn't yet been modified. The inodes and bitmap for the working superblock grow from it.

Diagram showing stable and working superblocks.

A snapshot is a consistent view of the file system (simply a committed superblock). To take a snapshot, the file system:

  1. Locks the file system to make sure that it's in a stable state; all client activity is suspended, and there must be no active operations.
  2. Writes all the copied blocks to disk. The order isn't important (as it is for the QNX 4 file system), so it can be optimized.
  3. Forces the data to be synchronized to disk, including flushing any hardware track cache.
  4. Constructs the superblock, recording the new location of the bitmap and inodes, incrementing its sequence number, and calculating a CRC.
  5. Writes the superblock to disk.
  6. Switches between the working and committed views. The old versions of the copied blocks are freed and become available for use.
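
The ordering of these steps is what makes the commit safe, and a toy model can show why. In the Python sketch below, `ToyVolume` is a hypothetical in-memory stand-in for a disk. Writing each new superblock to the slot not used by the current one is an assumption of the sketch, and the CRC from step 4 is omitted for brevity.

```python
class ToyVolume:
    """In-memory stand-in for a disk with two superblock slots (assumed layout)."""

    def __init__(self, root_addr, root_data):
        self.blocks = {root_addr: root_data}
        self.superblocks = [{"seq": 0, "root": root_addr}, None]

    def live(self):
        """Return the committed superblock with the highest sequence number."""
        committed = [sb for sb in self.superblocks if sb is not None]
        return max(committed, key=lambda sb: sb["seq"])

    def snapshot(self, pending, new_root, freed, power_fails_before_commit=False):
        # Step 1: the real file system suspends client activity; this toy
        # is single-threaded, so nothing needs to be locked.
        # Step 2: write the copied blocks; no particular order is required.
        self.blocks.update(pending)
        # Step 3: a real implementation forces the data to the media here.
        if power_fails_before_commit:
            return self.live()       # the old superblock is still the live one
        # Steps 4 and 5: build the new superblock and write it to the slot
        # that the current superblock does not occupy (assumption).
        seq = self.live()["seq"] + 1
        self.superblocks[seq % 2] = {"seq": seq, "root": new_root}
        # Step 6: blocks used only by the old view become free.
        for addr in freed:
            del self.blocks[addr]
        return self.live()


if __name__ == "__main__":
    vol = ToyVolume(root_addr=10, root_data=b"v1")
    print(vol.snapshot({11: b"v2"}, 11, [], power_fails_before_commit=True))
    print(vol.snapshot({11: b"v2"}, 11, [10]))
```

If power is lost at any point before the new superblock is written, the previously committed superblock still describes a complete, untouched set of blocks.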

To mount the disk at startup, the file system simply reads the superblocks from disk, validates their CRCs, and then chooses the one with the higher sequence number. There's no need to run chkfsys or replay a transaction log. The time it takes to mount the file system is the time it takes to read a couple of blocks.
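
The selection rule at mount time might look like the following Python sketch. The superblock layout (`SB_FORMAT`), the use of `zlib.crc32`, and the function names are all assumptions for illustration; the actual on-disk format and CRC algorithm of fs-qnx6 are not described here.

```python
import struct
import zlib

SB_FORMAT = "<QQ"   # hypothetical layout: sequence number, root block address


def pack_superblock(seq, root):
    """Serialize a toy superblock and append a CRC-32 of its body."""
    body = struct.pack(SB_FORMAT, seq, root)
    return body + struct.pack("<I", zlib.crc32(body))


def parse_superblock(raw):
    """Return (seq, root) if the CRC matches, otherwise None."""
    size = struct.calcsize(SB_FORMAT)
    if len(raw) != size + 4:
        return None
    body = raw[:size]
    (crc,) = struct.unpack("<I", raw[size:])
    if zlib.crc32(body) != crc:
        return None                  # torn or corrupted copy
    return struct.unpack(SB_FORMAT, body)


def choose_superblock(raw_copies):
    """Pick the valid copy with the highest sequence number."""
    valid = [sb for sb in map(parse_superblock, raw_copies) if sb is not None]
    if not valid:
        raise ValueError("no valid superblock found")
    return max(valid, key=lambda sb: sb[0])


if __name__ == "__main__":
    old = pack_superblock(7, 100)
    new = pack_superblock(8, 200)
    print(choose_superblock([old, new]))
```

A copy that was only partly written fails its CRC check and is ignored, so the mount falls back to the other copy without scanning the rest of the disk.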

Note: If the drive doesn't support synchronizing, fs-qnx6.so can't guarantee that the file system is power-safe. Before using this file system on devices—such as USB/Flash devices—other than traditional rotating hard disk drive media, check to make sure that your device meets the file system's requirements. For more information, see Required properties of the device.

Performance

The copy-on-write (COW) method has some drawbacks:

  • Each change to user data can cause up to a dozen blocks to be copied and modified, because the file system never modifies the inode and indirect block pointers in place; it has to copy the blocks to a new location and modify the copies. Thus, write operations take longer.
  • When taking a snapshot, the file system must force all blocks fully to disk before it commits the superblock.

However:

  • There's no constraint on the order in which the blocks (aside from the superblock) can be written.
  • The new blocks can be allocated from any free, contiguous space.

The performance of the file system depends on how much buffer cache is available, and on the frequency of the snapshots. Snapshots occur periodically (every 10 seconds, or as specified by the snapshot option to fs-qnx6.so), and also when you call sync() for the entire file system, or fsync() for a single file.

Note: Synchronization is at the file system level, not at that of individual files, so fsync() is potentially an expensive operation. The power-safe file system ignores the O_SYNC flag.
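
Because each fsync() involves the whole file system, one way an application might limit the cost is to batch its writes and request durability once at the end. The Python sketch below uses only standard `os` calls and nothing specific to QNX; `write_records` is a hypothetical helper name.

```python
import os


def write_records(path, records):
    """Write all records, then request durability with a single fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for record in records:
            view = memoryview(record)
            while len(view) > 0:          # os.write may write fewer bytes than asked
                written = os.write(fd, view)
                view = view[written:]
        os.fsync(fd)                      # one sync for the whole batch
    finally:
        os.close(fd)
```

Calling fsync() after every record would give the same final file contents but would request a synchronization per record rather than one per batch.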

You can also turn snapshots off if you're doing a long operation and the intermediate states aren't useful to you. For example, suppose you are copying a very large file into a power-safe file system. The cp utility is really just a sequence of basic operations:

  • An open(O_CREAT|O_TRUNC) to make the file
  • A bunch of write() operations to copy the data
  • A close(), chmod(), and chown() to copy the metadata

If the file is big enough so that copying it spans snapshots, you have on-disk views that include the file not existing, the file existing at a variety of sizes, and finally the complete file copied and its IDs and permissions set:

Diagram showing a timeline of cp operations and snapshots.

Each snapshot is a valid point-in-time view of the file system (that is, if you have copied 50 MB, the size is 50 MB, and all data up to 50 MB is also correctly copied and available). If there's a power failure, the file system is restored to the most recent snapshot. But the file system has no concept that the sequence of open(), write(), and close() operations is really one higher-level operation, cp. If you want the higher-level semantics, disable the snapshots around the cp, and then the middle snapshots won't happen, and if a power failure occurs, the file is either complete, or not there at all.
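
The effect of snapshot timing on what survives a power failure can be sketched with a small simulation. The event times and state labels below are invented for illustration, and `visible_states` is a hypothetical helper; the sketch only models which state each snapshot would capture.

```python
def visible_states(events, snapshot_times):
    """Return the file state captured by each snapshot.

    events: (time, state) pairs in time order, where state describes the
    file after the operation at that time completes.
    snapshot_times: times at which a snapshot is committed.
    """
    captured = []
    for snap in snapshot_times:
        state = "absent"                  # before open(), the file doesn't exist
        for when, after in events:
            if when <= snap:
                state = after             # latest operation finished by snap
        captured.append((snap, state))
    return captured


if __name__ == "__main__":
    # Hypothetical timeline of a cp: open, two batches of writes, then
    # close/chmod/chown.
    cp_events = [(1, "empty"), (12, "50 MB"), (24, "100 MB"), (25, "complete")]
    print(visible_states(cp_events, [10, 20, 30]))   # periodic snapshots
    print(visible_states(cp_events, [0, 30]))        # snapshots off during cp
```

With periodic snapshots, a failure can restore the file at an intermediate size. With snapshots taken only before and after the copy, the restored view shows the file either absent or complete.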

For information about using this file system, see Power-safe file system.