The Virtual File System

The Virtual File System (VFS) is the software layer in the kernel that presents a uniform filesystem interface to user-space programs and provides an abstraction within the kernel that allows different filesystem implementations — ext4, btrfs, xfs, tmpfs, NFS, and many others — to coexist. System calls such as open(2), read(2), write(2), and stat(2) always go through the VFS, which dispatches them to the appropriate filesystem-specific code.

Core data structures

The VFS is built around four primary objects. Each maps to a concept in a Unix filesystem:
A super_block represents one mounted instance of a filesystem. It holds the block size, filesystem flags, a pointer to the root dentry, and the super_operations dispatch table. When a filesystem is unmounted, the VFS calls put_super on the superblock to allow cleanup.
An inode represents a file, directory, symbolic link, device node, FIFO, or socket. It stores the object’s permissions, size, timestamps, and a pointer to inode_operations. A single inode may be referenced by multiple dentry objects (hard links). Inodes for block-device filesystems are cached in memory and written back to disk when dirty.
A dentry is the kernel’s in-memory representation of a pathname component. The dentry cache (dcache) translates a pathname like /home/user/file.c into a chain of dentries terminating at an inode. Dentries are never written to disk; they exist purely as a performance cache.
A file object is created when a process calls open(2). It holds the current file offset, open flags, and a pointer to file_operations. When the last reference to a file object is dropped (all descriptors closed and mappings gone), the VFS calls release and drops the references on the underlying dentry and inode.
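
The four objects are linked by pointers that can be followed from an open file all the way down to its superblock. A minimal sketch, using field names and the d_inode() helper from <linux/fs.h> and <linux/dcache.h>:
/* Sketch: navigating from an open file to its mounted filesystem. */
#include <linux/dcache.h>
#include <linux/fs.h>

static struct super_block *file_to_sb(struct file *filp)
{
    struct dentry *dentry = filp->f_path.dentry; /* pathname component */
    struct inode  *inode  = d_inode(dentry);     /* underlying object  */

    return inode->i_sb;                          /* mounted instance   */
}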

super_operations

super_operations is the dispatch table that the VFS uses to manage a mounted filesystem instance:
/* Documentation/filesystems/vfs.rst */
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*free_inode)(struct inode *);

    void (*dirty_inode)(struct inode *, int flags);
    int  (*write_inode)(struct inode *, struct writeback_control *wbc);
    int  (*drop_inode)(struct inode *);
    void (*evict_inode)(struct inode *);
    void (*put_super)(struct super_block *);
    int  (*sync_fs)(struct super_block *sb, int wait);
    int  (*statfs)(struct dentry *, struct kstatfs *);
    void (*umount_begin)(struct super_block *);
    int  (*show_options)(struct seq_file *, struct dentry *);
};
All methods are called without locks held unless documented otherwise.
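As an illustration, a simple in-memory filesystem can often build its table entirely from generic helpers in libfs; the myfs_ name below is hypothetical:
/* Sketch: a minimal super_operations table built from libfs helpers. */
static const struct super_operations myfs_super_ops = {
    .statfs     = simple_statfs,        /* generic statfs(2) answers   */
    .drop_inode = generic_delete_inode, /* evict inodes on last iput() */
};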

inode_operations

inode_operations describes how the VFS manipulates individual inodes:
/* Documentation/filesystems/vfs.rst */
struct inode_operations {
    int          (*create)(struct mnt_idmap *, struct inode *,
                           struct dentry *, umode_t, bool);
    struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int);
    int          (*link)(struct dentry *, struct inode *, struct dentry *);
    int          (*unlink)(struct inode *, struct dentry *);
    int          (*symlink)(struct mnt_idmap *, struct inode *,
                            struct dentry *, const char *);
    struct dentry *(*mkdir)(struct mnt_idmap *, struct inode *,
                            struct dentry *, umode_t);
    int          (*rmdir)(struct inode *, struct dentry *);
    int          (*rename)(struct mnt_idmap *, struct inode *,
                           struct dentry *, struct inode *,
                           struct dentry *, unsigned int);
    int          (*permission)(struct mnt_idmap *, struct inode *, int);
    int          (*setattr)(struct mnt_idmap *, struct dentry *,
                            struct iattr *);
    int          (*getattr)(struct mnt_idmap *, const struct path *,
                            struct kstat *, u32, unsigned int);
};
lookup is among the most important methods: the VFS calls it whenever it needs to resolve a pathname component against a parent directory inode. A successful lookup installs the found inode into the dentry, typically by calling d_add() or d_splice_alias().
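
For a filesystem with no persistent directory contents, lookup can be as simple as installing a negative dentry, which is essentially what the libfs simple_lookup() helper does. A sketch (myfs_ is a hypothetical name):
/* Sketch: a trivial lookup that caches the miss as a negative dentry. */
static struct dentry *myfs_lookup(struct inode *dir, struct dentry *dentry,
                                  unsigned int flags)
{
    /* Nothing to find: d_add(dentry, NULL) creates a negative dentry
     * so the dcache remembers the miss. A disk filesystem would read
     * the directory, locate the inode, and install it instead,
     * typically via d_splice_alias(). */
    d_add(dentry, NULL);
    return NULL;
}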

file_operations

file_operations defines how the VFS operates on open file descriptions. An abridged version of the structure (the full definition in <linux/fs.h> has many more methods):
/* Documentation/filesystems/vfs.rst */
struct file_operations {
    struct module  *owner;
    loff_t         (*llseek)(struct file *, loff_t, int);
    ssize_t        (*read)(struct file *, char __user *, size_t, loff_t *);
    ssize_t        (*write)(struct file *, const char __user *, size_t, loff_t *);
    ssize_t        (*read_iter)(struct kiocb *, struct iov_iter *);
    ssize_t        (*write_iter)(struct kiocb *, struct iov_iter *);
    int            (*iterate_shared)(struct file *, struct dir_context *);
    __poll_t       (*poll)(struct file *, struct poll_table_struct *);
    long           (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
    int            (*mmap)(struct file *, struct vm_area_struct *);
    int            (*open)(struct inode *, struct file *);
    int            (*release)(struct inode *, struct file *);
    int            (*fsync)(struct file *, loff_t, loff_t, int datasync);
    loff_t         (*remap_file_range)(struct file *, loff_t,
                                       struct file *, loff_t, loff_t, unsigned int);
};
Modern kernel code uses read_iter/write_iter rather than read/write because iov_iter supports scatter-gather I/O and integrates with the io_uring subsystem.
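
A regular file backed by the page cache can often be served entirely by generic VFS helpers; a sketch with a hypothetical myfs_ prefix:
/* Sketch: file_operations for a page-cache-backed regular file. */
static const struct file_operations myfs_file_ops = {
    .owner      = THIS_MODULE,
    .llseek     = generic_file_llseek,
    .read_iter  = generic_file_read_iter,   /* reads via the page cache  */
    .write_iter = generic_file_write_iter,  /* writes via the page cache */
    .mmap       = generic_file_mmap,
    .fsync      = generic_file_fsync,
};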

Registering a filesystem

To make a filesystem mountable, you register it with the VFS using register_filesystem:
/* Documentation/filesystems/vfs.rst */
#include <linux/fs.h>

extern int   register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
struct file_system_type describes the filesystem to the VFS:
/* Documentation/filesystems/vfs.rst */
struct file_system_type {
    const char                  *name;
    int                          fs_flags;
    int                         (*init_fs_context)(struct fs_context *);
    const struct fs_parameter_spec *parameters;
    void                        (*kill_sb)(struct super_block *);
    struct module               *owner;
    struct file_system_type     *next;
    struct hlist_head            fs_supers;
};
init_fs_context is the modern entry point for mounting. It populates a fs_context with filesystem-specific state; the VFS then invokes the context's get_tree operation to obtain or create a super_block. You can see all registered filesystems in /proc/filesystems.
Implementing a mountable filesystem typically takes four steps (a combined sketch follows the list):

1. Define file_system_type. Populate the name, fs_flags, init_fs_context, and kill_sb fields, and set owner to THIS_MODULE.

2. Implement init_fs_context. Allocate a private context structure, assign fc->ops to your fs_context_operations, and set fc->fs_private.

3. Implement get_tree. Call one of the helpers, get_tree_bdev (block device filesystem), get_tree_nodev (pseudo filesystem), or get_tree_single (singleton mount), to fill fc->root.

4. Register at module init. Call register_filesystem(&my_fs_type) in your module's init function and unregister_filesystem in the exit function.
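
Putting the steps together, here is a minimal sketch of a pseudo filesystem module; the myfs names and the magic number are hypothetical, and error handling is abbreviated:
/* Sketch: registering a minimal pseudo filesystem. */
#include <linux/fs.h>
#include <linux/fs_context.h>
#include <linux/module.h>

static int myfs_fill_super(struct super_block *sb, struct fs_context *fc)
{
    static const struct tree_descr no_files[] = { { "" } };

    /* libfs helper: sets up a root inode and dentry for us. */
    return simple_fill_super(sb, 0x6d796673 /* hypothetical magic */, no_files);
}

static int myfs_get_tree(struct fs_context *fc)
{
    return get_tree_nodev(fc, myfs_fill_super); /* no backing device */
}

static const struct fs_context_operations myfs_context_ops = {
    .get_tree = myfs_get_tree,
};

static int myfs_init_fs_context(struct fs_context *fc)
{
    fc->ops = &myfs_context_ops; /* no private state needed here */
    return 0;
}

static struct file_system_type myfs_type = {
    .owner           = THIS_MODULE,
    .name            = "myfs",
    .init_fs_context = myfs_init_fs_context,
    .kill_sb         = kill_litter_super,
};

static int __init myfs_init(void)
{
    return register_filesystem(&myfs_type);
}

static void __exit myfs_exit(void)
{
    unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");
Once the module is loaded, myfs appears in /proc/filesystems and can be mounted with mount -t myfs none /mnt.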

Major filesystem implementations

ext4 is the default Linux filesystem for most distributions. It is a journalling filesystem descended from ext2 and ext3, supporting extents (contiguous block ranges), online defragmentation, delayed allocation, and large volumes. The journal writes metadata changes to a circular log before applying them, ensuring consistency after a crash.

Key features: extents, dir_index (htree directories), flex_bg, 64bit mode for volumes over 16 TiB, inline_data for small files, metadata_csum.

Page cache and writeback

The page cache is the kernel’s unified buffer for file data. When a process reads from a file, the VFS checks the page cache first; if the page is present and up to date, no device I/O occurs. When data is written, the page is marked dirty in the cache and the write returns to user space immediately. Per-backing-device writeback workers (the bdi_writeback infrastructure, which replaced the older bdflush and pdflush threads) periodically write dirty pages back to storage. The vm.dirty_background_ratio sysctl sets the threshold at which background writeback starts, and vm.dirty_ratio sets the point at which writing processes are throttled and must wait for writeback.
/* address_space_operations callbacks relevant to writeback */
struct address_space_operations {
    int    (*read_folio)(struct file *, struct folio *);
    int    (*writepages)(struct address_space *,
                         struct writeback_control *);
    bool   (*dirty_folio)(struct address_space *, struct folio *);
    void   (*readahead)(struct readahead_control *);
    int    (*write_begin)(const struct kiocb *, struct address_space *,
                           loff_t pos, unsigned len,
                           struct folio **foliop, void **fsdata);
    int    (*write_end)(const struct kiocb *, struct address_space *,
                         loff_t pos, unsigned len, unsigned copied,
                         struct folio *folio, void *fsdata);
    bool   (*release_folio)(struct folio *, gfp_t);
};
Use fsync(2) or fdatasync(2) after critical writes to guarantee that dirty pages have been flushed to stable storage. The kernel’s writeback subsystem reports errors to fsync callers on all file descriptions that were open when the error occurred.
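
From user space, the corresponding pattern is to check the result of fsync(2) rather than only write(2); a sketch:
/* Sketch: a durable write from user space that checks fsync(2). */
#include <fcntl.h>
#include <unistd.h>

int write_durably(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    /* Force dirty pages to stable storage; a writeback error that
     * occurred while this description was open is reported here. */
    return fsync(fd);
}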

Further reading

Memory management

The page cache is backed by the memory management subsystem; understanding zones and reclaim is essential.

Locking and concurrency

VFS operations use a mix of mutexes, spinlocks, and RCU to protect shared data structures.
