
The Virtual Filesystem (VFS), also known as the Virtual Filesystem Switch, is the kernel layer that provides a uniform filesystem interface to userspace while allowing entirely different storage backends — on-disk filesystems, network filesystems, pseudo-filesystems, and in-memory filesystems — to coexist transparently. System calls such as open(2), read(2), write(2), and stat(2) all pass through the VFS before reaching any filesystem-specific code.

VFS architecture

The VFS sits between the POSIX system call interface and individual filesystem implementations. When a process calls open("/mnt/data/file.txt", O_RDONLY), the kernel:
  1. Walks the path one component at a time, consulting the directory entry cache (dcache) first.
  2. For components not in the dcache, calls the filesystem’s lookup() inode operation and caches the result as a new dentry.
  3. Allocates a struct file and inserts it into the process’s file descriptor table.
  4. Calls the filesystem’s open() method from struct file_operations.
The VFS is built around four core object types — superblock, inode, dentry, and file — plus struct file_system_type, which describes each registered filesystem. Each comes with an operations table that filesystem implementations fill in:
struct file_system_type  →  describes the filesystem type
struct super_block        →  one per mounted filesystem instance
struct inode              →  one per filesystem object (file, dir, symlink, …)
struct dentry             →  one per path component; cached in the dcache
struct file               →  one per open file description
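At runtime these objects point at one another. A small illustrative sketch (the helper name is hypothetical) of how kernel code holding a struct file can reach the rest of the chain:
#include <linux/fs.h>

/* Illustrative only: walk from an open struct file to the related VFS objects */
static void show_vfs_objects(struct file *file)
{
    struct inode *inode    = file_inode(file);     /* the underlying inode */
    struct dentry *dentry  = file->f_path.dentry;  /* the name it was opened through */
    struct super_block *sb = inode->i_sb;          /* the mounted filesystem instance */

    pr_info("%pd lives on an %s filesystem\n", dentry, sb->s_type->name);
}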

Registering and mounting a filesystem

A filesystem registers itself with the VFS at module load time:
#include <linux/fs.h>
#include <linux/module.h>

static struct file_system_type my_fs_type = {
    .name            = "myfs",
    .init_fs_context = myfs_init_fs_context,
    .kill_sb         = kill_block_super,   /* standard teardown for block-device filesystems */
    .owner           = THIS_MODULE,
};

static int __init myfs_init(void)
{
    return register_filesystem(&my_fs_type);
}

static void __exit myfs_exit(void)
{
    unregister_filesystem(&my_fs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");
All registered filesystems appear in /proc/filesystems. When mount(2) is called, the VFS invokes the init_fs_context method to initialise a struct fs_context, then calls get_tree() to obtain the superblock.
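A minimal sketch of what these hooks could look like for a block-device-backed filesystem, reusing the hypothetical myfs_* names from above (a real fill_super does considerably more work):
#include <linux/fs_context.h>

static int myfs_fill_super(struct super_block *sb, struct fs_context *fc)
{
    /* Read the on-disk superblock, set sb->s_op, create the root dentry, ... */
    return 0;
}

static int myfs_get_tree(struct fs_context *fc)
{
    /* get_tree_bdev() opens the block device and calls myfs_fill_super() */
    return get_tree_bdev(fc, myfs_fill_super);
}

static const struct fs_context_operations myfs_context_ops = {
    .get_tree = myfs_get_tree,
    /* .parse_param would handle mount options here */
};

static int myfs_init_fs_context(struct fs_context *fc)
{
    fc->ops = &myfs_context_ops;
    return 0;
}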

The superblock object

A superblock represents a mounted filesystem instance. It holds filesystem-wide metadata and a pointer to a struct super_operations table, whose methods the VFS calls to manage the filesystem’s lifecycle:
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*dirty_inode)(struct inode *, int flags);
    int  (*write_inode)(struct inode *, struct writeback_control *wbc);
    void (*evict_inode)(struct inode *);
    void (*put_super)(struct super_block *);
    int  (*sync_fs)(struct super_block *sb, int wait);
    int  (*statfs)(struct dentry *, struct kstatfs *);
    /* ... */
};
Notable operations:
  • dirty_inode — called when inode metadata (not data) is modified.
  • write_inode — called by the writeback thread to flush inode metadata to disk.
  • evict_inode — cleans up page cache and on-disk structures when an inode is removed from memory.
  • sync_fs — called by sync(2) to flush all dirty data.
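Filled-in tables usually mix filesystem-specific callbacks with generic helpers from fs/libfs.c. A sketch using hypothetical myfs_* callbacks:
static const struct super_operations myfs_super_ops = {
    .alloc_inode   = myfs_alloc_inode,      /* allocate from a filesystem-private inode cache */
    .destroy_inode = myfs_destroy_inode,
    .write_inode   = myfs_write_inode,      /* write the on-disk inode during writeback */
    .evict_inode   = myfs_evict_inode,      /* truncate the page cache, free on-disk blocks */
    .put_super     = myfs_put_super,
    .sync_fs       = myfs_sync_fs,
    .statfs        = simple_statfs,         /* generic statfs helper from fs/libfs.c */
};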

The inode object

An inode represents a single filesystem object: a regular file, directory, symbolic link, device node, FIFO, or socket. It holds the object’s metadata — ownership, permissions, timestamps, size, and block pointers — and a pointer to struct inode_operations.
struct inode_operations {
    struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int);
    int (*create)(struct mnt_idmap *, struct inode *, struct dentry *,
                  umode_t, bool);
    int (*link)(struct dentry *, struct inode *, struct dentry *);
    int (*unlink)(struct inode *, struct dentry *);
    int (*mkdir)(struct mnt_idmap *, struct inode *, struct dentry *, umode_t);
    int (*rmdir)(struct inode *, struct dentry *);
    int (*rename)(struct mnt_idmap *, struct inode *, struct dentry *,
                  struct inode *, struct dentry *, unsigned int);
    int (*permission)(struct mnt_idmap *, struct inode *, int);
    int (*setattr)(struct mnt_idmap *, struct dentry *, struct iattr *);
    int (*getattr)(struct mnt_idmap *, const struct path *, struct kstat *,
                   u32, unsigned int);
    /* ... */
};
A single inode may be referenced by multiple dentries — this is the mechanism behind hard links.
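This is easy to observe from userspace: two hard-linked names report the same inode number and a link count of 2.
echo hello > a
ln a b                                # second dentry (name), same inode
stat -c '%n  inode=%i  links=%h' a b  # both lines show the same inode number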

The dentry cache

Dentries map path components (names) to inodes and are cached in the dentry cache (dcache) to make repeated path lookups fast. Dentries live only in RAM; they are never written to disk. A negative dentry caches the fact that a name does not exist, avoiding repeated disk lookups.
struct dentry_operations {
    /* Called when a name lookup hits a dcache entry */
    int (*d_revalidate)(struct inode *, const struct qstr *,
                        struct dentry *, unsigned int);
    /* Hash function for dentry name */
    int (*d_hash)(const struct dentry *, struct qstr *);
    /* Name comparison */
    int (*d_compare)(const struct dentry *, unsigned int,
                     const char *, const struct qstr *);
    /* Called when the last reference to a dentry is dropped */
    void (*d_release)(struct dentry *);
    /* ... */
};
Network filesystems like NFS implement d_revalidate to check with the server whether a cached dentry is still valid. Local filesystems typically leave it as NULL, since local dentries do not go stale.
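The dcache can be observed from userspace, for example:
# Dentry cache statistics (total dentries, unused dentries, ...)
cat /proc/sys/fs/dentry-state

# Slab usage of the dentry and inode caches
sudo slabtop -o | grep -E 'dentry|inode_cache'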

The file object

Opening a file allocates a struct file, which represents a single open file description. Entries in the process’s file descriptor table point to it, and several descriptors can share one struct file after dup(2) or fork(2). It holds the current seek position, open flags, and a pointer to struct file_operations.
struct file_operations {
    loff_t  (*llseek)(struct file *, loff_t, int);
    ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);
    int     (*iterate_shared)(struct file *, struct dir_context *);
    __poll_t (*poll)(struct file *, struct poll_table_struct *);
    long    (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
    int     (*mmap)(struct file *, struct vm_area_struct *);
    int     (*open)(struct inode *, struct file *);
    int     (*release)(struct inode *, struct file *);
    int     (*fsync)(struct file *, loff_t, loff_t, int datasync);
    ssize_t (*splice_read)(struct file *, loff_t *,
                           struct pipe_inode_info *, size_t, unsigned int);
    /* ... */
};
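For a conventional on-disk filesystem that goes through the page cache, many of these slots can simply point at generic VFS helpers. A sketch for a hypothetical myfs:
const struct file_operations myfs_file_operations = {
    .llseek      = generic_file_llseek,
    .read_iter   = generic_file_read_iter,   /* serve reads from the page cache */
    .write_iter  = generic_file_write_iter,  /* copy into the page cache, mark pages dirty */
    .mmap        = generic_file_mmap,
    .open        = generic_file_open,
    .fsync       = generic_file_fsync,       /* write back data, then the inode */
    .splice_read = filemap_splice_read,      /* recent kernels; older ones use generic_file_splice_read */
};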

Page cache and writeback

The page cache is the kernel’s unified buffer for file data. When a process reads a file, the kernel loads pages into the page cache and serves subsequent reads from there. Writes go to the page cache first (making pages dirty), and a background thread — the writeback daemon (flusher threads) — periodically flushes dirty pages to disk. Pages are organised per-inode in an address_space structure (sometimes called the inode’s mapping):
struct address_space_operations {
    int     (*read_folio)(struct file *, struct folio *);
    int     (*writepages)(struct address_space *,
                          struct writeback_control *);
    bool    (*dirty_folio)(struct address_space *, struct folio *);
    void    (*readahead)(struct readahead_control *);
    /* ... */
};
Writeback is triggered in several ways: flusher threads start writing once dirty memory exceeds the background threshold (/proc/sys/vm/dirty_background_ratio) or pages have been dirty longer than dirty_expire_centisecs; writers are throttled and forced to write back once dirty memory exceeds /proc/sys/vm/dirty_ratio; and applications can force it explicitly with fsync(2) or sync(2).
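The current state and thresholds are visible from userspace:
# Dirty and under-writeback page totals
grep -E 'Dirty|Writeback' /proc/meminfo

# Background and foreground dirty thresholds (percent of available memory)
cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio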

Key filesystem implementations

ext4 is the default filesystem for many Linux distributions. Its journal (stored in a special inode) records metadata operations before they are applied to the main filesystem, ensuring consistency after a crash. Three journalling modes are supported: journal (data + metadata), ordered (metadata journalled, data flushed first; the default), and writeback (metadata only). Key features: extents for large files, delayed allocation, online resizing, and dir_index (htree directories) for large directories.
mkfs.ext4 -L mydata /dev/sdb1
mount -t ext4 /dev/sdb1 /mnt/data
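The journalling mode can be chosen at mount time with the data= option, and the journal feature can be inspected with dumpe2fs:
# Full data journalling instead of the default "ordered" mode
mount -t ext4 -o data=journal /dev/sdb1 /mnt/data

# Show superblock features, including has_journal
dumpe2fs -h /dev/sdb1 | grep -i feature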
btrfs uses a copy-on-write (COW) B-tree structure for all metadata and optionally for data. This enables instant snapshots, online scrubbing and RAID-like redundancy across multiple devices, transparent compression, and data checksumming.
# Create a snapshot
btrfs subvolume snapshot /mnt/data /mnt/data-snap

# Enable compression
mount -t btrfs -o compress=zstd /dev/sdb1 /mnt/data
xfs was designed for large files and high-throughput workloads. It uses B+ trees throughout, supports online growing, and provides excellent parallel I/O performance thanks to its allocation group design (each group manages its own free space independently, reducing contention). xfs does not support online shrinking.
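Typical usage; note that xfs_growfs operates on the mounted filesystem and can only grow it:
mkfs.xfs -L bigdata /dev/sdb1
mount -t xfs /dev/sdb1 /mnt/data

# Grow online to fill an enlarged underlying device (there is no shrink)
xfs_growfs /mnt/data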
tmpfs stores all data in the page cache (anonymous or swap-backed pages). There is no persistent backing store; data is lost on unmount. It is used for /tmp, /run, and shared memory (shm_open()). Unlike ramfs, tmpfs obeys memory limits and can swap pages out under pressure.
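A size-limited tmpfs can be mounted anywhere, for example:
# 512 MiB limit; pages are swappable and everything disappears on unmount
mount -t tmpfs -o size=512M tmpfs /mnt/scratch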
overlayfs stacks a read-write upper layer on top of a read-only lower layer. Reads come from whichever layer has the file; writes go to the upper layer via copy-on-write. This is the mechanism behind container image layers in Docker and Podman.
mount -t overlay overlay \
  -o lowerdir=/lower,upperdir=/upper,workdir=/work \
  /merged

FUSE — userspace filesystems

FUSE (Filesystem in Userspace) allows filesystem implementations to run as ordinary userspace processes. The kernel FUSE module translates VFS calls into requests that are sent over a /dev/fuse file descriptor to the userspace daemon, which processes them and returns responses.
/* Userspace FUSE operation (libfuse API) */
static int my_getattr(const char *path, struct stat *stbuf,
                      struct fuse_file_info *fi) {
    (void) fi;
    memset(stbuf, 0, sizeof(*stbuf));
    if (strcmp(path, "/") == 0) {
        stbuf->st_mode = S_IFDIR | 0755;
        stbuf->st_nlink = 2;
        return 0;
    }
    return -ENOENT;   /* any other path does not exist in this filesystem */
}
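Wiring the operation into a table and handing it to libfuse’s event loop looks roughly like this (libfuse 3, compiled with FUSE_USE_VERSION 31 defined before including the fuse header):
static const struct fuse_operations my_ops = {
    .getattr = my_getattr,
    /* readdir, open, read, ... would be added here */
};

int main(int argc, char *argv[])
{
    /* fuse_main() parses the mount point from argv, mounts, and runs the request loop */
    return fuse_main(argc, argv, &my_ops, NULL);
}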
FUSE has higher per-operation overhead than in-kernel filesystems because each VFS call requires a context switch to userspace and back. For latency-critical workloads, prefer in-kernel implementations.

Filesystem mounting and namespaces

Mounts are tracked in struct mount objects organised into a mount tree. Each process has a reference to a mount namespace (struct mnt_namespace) that determines its view of the filesystem hierarchy. Different namespaces can have entirely different mount trees, which is the foundation for container filesystem isolation.
# Create an isolated mount namespace
unshare --mount bash

# Inspect current mount namespace
ls -la /proc/self/ns/mnt
Shared subtrees (MS_SHARED, MS_PRIVATE, MS_SLAVE) control whether mount and unmount events propagate between namespaces. The behaviour is described in detail in Documentation/filesystems/sharedsubtree.rst.
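Propagation can be adjusted from the command line with mount(8):
# Stop mount events from propagating into or out of this subtree
mount --make-rprivate /

# Restore bidirectional propagation for a subtree
mount --make-rshared /mnt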

Direct I/O and io_uring

Direct I/O (O_DIRECT) bypasses the page cache entirely, transferring data directly between userspace buffers and the block device. This is useful for databases that implement their own caching. The buffer address, file offset, and transfer length must all be aligned to the device’s logical block size.
void *buf;
posix_memalign(&buf, 512, 4096);          /* O_DIRECT requires block-aligned buffers */

int fd = open("file", O_RDWR | O_DIRECT);
read(fd, buf, 4096);
io_uring (merged in kernel 5.1) provides a high-performance, low-latency asynchronous I/O interface based on two shared ring buffers — a submission queue (SQ) and a completion queue (CQ) — between userspace and the kernel. It supports both buffered and direct I/O, network sockets, and many other operations with a single interface.
#include <fcntl.h>
#include <liburing.h>

char buf[4096];
int fd = open("file", O_RDONLY);

struct io_uring ring;
io_uring_queue_init(32, &ring, 0);

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);   /* read 4096 bytes from offset 0 */
io_uring_submit(&ring);

struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
/* cqe->res contains bytes read or a negative error code */
io_uring_cqe_seen(&ring, cqe);

io_uring_queue_exit(&ring);

Memory management

The page cache is part of the memory management subsystem; dirty page reclaim interacts directly with the MM layer.

Locking primitives

VFS operations use a mix of inode semaphores, RCU for dcache lookups, and spinlocks in the page cache.

Networking

Network filesystems (NFS, SMB, 9P) sit below the VFS and above the networking stack.

Scheduling

Writeback kthreads and flusher daemons are scheduled by the same scheduler as ordinary tasks.
