The Virtual File System

The Virtual File System (VFS) is the software layer in the kernel that presents a uniform filesystem interface to user-space programs and provides an abstraction within the kernel that allows different filesystem implementations — ext4, btrfs, xfs, tmpfs, NFS, and many others — to coexist. System calls such as open(2), read(2), write(2), and stat(2) always go through the VFS, which dispatches them to the appropriate filesystem-specific code.

Core data structures

The VFS is built around four primary objects. Each maps to a concept in a Unix filesystem:
A super_block represents one mounted instance of a filesystem. It holds the block size, filesystem flags, a pointer to the root dentry, and the super_operations dispatch table. When a filesystem is unmounted, the VFS calls put_super on the superblock to allow cleanup.
An inode represents a file, directory, symbolic link, device node, FIFO, or socket. It stores the object’s permissions, size, timestamps, and a pointer to inode_operations. A single inode may be referenced by multiple dentry objects (hard links). Inodes for block-device filesystems are cached in memory and written back to disk when dirty.
A dentry is the kernel’s in-memory representation of a pathname component. The dentry cache (dcache) translates a pathname like /home/user/file.c into a chain of dentries terminating at an inode. Dentries are never written to disk; they exist purely as a performance cache.
A file object is created when a process calls open(2). It holds the current file offset, open flags, and a pointer to file_operations. When the last reference to a file object is dropped (all descriptors closed and mappings gone), the VFS calls release and drops the references on the underlying dentry and inode.
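
The four objects are linked by pointers that can be followed from an open file all the way down to its superblock. A minimal sketch, using field names and the d_inode() helper from <linux/fs.h> and <linux/dcache.h>:
/* Sketch: navigating from an open file to its mounted filesystem. */
#include <linux/dcache.h>
#include <linux/fs.h>

static struct super_block *file_to_sb(struct file *filp)
{
    struct dentry *dentry = filp->f_path.dentry; /* pathname component */
    struct inode  *inode  = d_inode(dentry);     /* underlying object  */

    return inode->i_sb;                          /* mounted instance   */
}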

super_operations

super_operations is the dispatch table that the VFS uses to manage a mounted filesystem instance:
/* Documentation/filesystems/vfs.rst */
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*free_inode)(struct inode *);

    void (*dirty_inode)(struct inode *, int flags);
    int  (*write_inode)(struct inode *, struct writeback_control *wbc);
    int  (*drop_inode)(struct inode *);
    void (*evict_inode)(struct inode *);
    void (*put_super)(struct super_block *);
    int  (*sync_fs)(struct super_block *sb, int wait);
    int  (*statfs)(struct dentry *, struct kstatfs *);
    void (*umount_begin)(struct super_block *);
    int  (*show_options)(struct seq_file *, struct dentry *);
};
All methods are called without locks held unless documented otherwise.
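As an illustration, a simple in-memory filesystem can often build its table entirely from generic helpers in libfs; the myfs_ name below is hypothetical:
/* Sketch: a minimal super_operations table built from libfs helpers. */
static const struct super_operations myfs_super_ops = {
    .statfs     = simple_statfs,        /* generic statfs(2) answers   */
    .drop_inode = generic_delete_inode, /* evict inodes on last iput() */
};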

inode_operations

inode_operations describes how the VFS manipulates individual inodes:
/* Documentation/filesystems/vfs.rst */
struct inode_operations {
    int          (*create)(struct mnt_idmap *, struct inode *,
                           struct dentry *, umode_t, bool);
    struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int);
    int          (*link)(struct dentry *, struct inode *, struct dentry *);
    int          (*unlink)(struct inode *, struct dentry *);
    int          (*symlink)(struct mnt_idmap *, struct inode *,
                            struct dentry *, const char *);
    struct dentry *(*mkdir)(struct mnt_idmap *, struct inode *,
                            struct dentry *, umode_t);
    int          (*rmdir)(struct inode *, struct dentry *);
    int          (*rename)(struct mnt_idmap *, struct inode *,
                           struct dentry *, struct inode *,
                           struct dentry *, unsigned int);
    int          (*permission)(struct mnt_idmap *, struct inode *, int);
    int          (*setattr)(struct mnt_idmap *, struct dentry *,
                            struct iattr *);
    int          (*getattr)(struct mnt_idmap *, const struct path *,
                            struct kstat *, u32, unsigned int);
};
lookup is among the most important methods: the VFS calls it whenever it needs to resolve a pathname component against a parent directory inode. A successful lookup installs the found inode into the dentry, typically by calling d_add() or d_splice_alias().
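
For a filesystem with no persistent directory contents, lookup can be as simple as installing a negative dentry, which is essentially what the libfs simple_lookup() helper does. A sketch (myfs_ is a hypothetical name):
/* Sketch: a trivial lookup that caches the miss as a negative dentry. */
static struct dentry *myfs_lookup(struct inode *dir, struct dentry *dentry,
                                  unsigned int flags)
{
    /* Nothing to find: d_add(dentry, NULL) creates a negative dentry
     * so the dcache remembers the miss. A disk filesystem would read
     * the directory, locate the inode, and install it instead,
     * typically via d_splice_alias(). */
    d_add(dentry, NULL);
    return NULL;
}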

file_operations

file_operations defines how the VFS operates on open file descriptions. An abridged version of the structure (the full definition in <linux/fs.h> has many more methods):
/* Documentation/filesystems/vfs.rst */
struct file_operations {
    struct module  *owner;
    loff_t         (*llseek)(struct file *, loff_t, int);
    ssize_t        (*read)(struct file *, char __user *, size_t, loff_t *);
    ssize_t        (*write)(struct file *, const char __user *, size_t, loff_t *);
    ssize_t        (*read_iter)(struct kiocb *, struct iov_iter *);
    ssize_t        (*write_iter)(struct kiocb *, struct iov_iter *);
    int            (*iterate_shared)(struct file *, struct dir_context *);
    __poll_t       (*poll)(struct file *, struct poll_table_struct *);
    long           (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
    int            (*mmap)(struct file *, struct vm_area_struct *);
    int            (*open)(struct inode *, struct file *);
    int            (*release)(struct inode *, struct file *);
    int            (*fsync)(struct file *, loff_t, loff_t, int datasync);
    loff_t         (*remap_file_range)(struct file *, loff_t,
                                       struct file *, loff_t, loff_t, unsigned int);
};
Modern kernel code uses read_iter/write_iter rather than read/write because iov_iter supports scatter-gather I/O and integrates with the io_uring subsystem.
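
A regular file backed by the page cache can often be served entirely by generic VFS helpers; a sketch with a hypothetical myfs_ prefix:
/* Sketch: file_operations for a page-cache-backed regular file. */
static const struct file_operations myfs_file_ops = {
    .owner      = THIS_MODULE,
    .llseek     = generic_file_llseek,
    .read_iter  = generic_file_read_iter,   /* reads via the page cache  */
    .write_iter = generic_file_write_iter,  /* writes via the page cache */
    .mmap       = generic_file_mmap,
    .fsync      = generic_file_fsync,
};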

Registering a filesystem

To make a filesystem mountable, you register it with the VFS using register_filesystem:
/* Documentation/filesystems/vfs.rst */
#include <linux/fs.h>

extern int   register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
struct file_system_type describes the filesystem to the VFS:
/* Documentation/filesystems/vfs.rst */
struct file_system_type {
    const char                  *name;
    int                          fs_flags;
    int                         (*init_fs_context)(struct fs_context *);
    const struct fs_parameter_spec *parameters;
    void                        (*kill_sb)(struct super_block *);
    struct module               *owner;
    struct file_system_type     *next;
    struct hlist_head            fs_supers;
};
init_fs_context is the modern entry point for mounting. It populates a fs_context with filesystem-specific state; the VFS then invokes the context's get_tree operation to obtain or create a super_block. You can see all registered filesystems in /proc/filesystems.
Implementing a mountable filesystem typically takes four steps (a combined sketch follows the list):

1. Define file_system_type. Populate the name, fs_flags, init_fs_context, and kill_sb fields, and set owner to THIS_MODULE.

2. Implement init_fs_context. Allocate a private context structure, assign fc->ops to your fs_context_operations, and set fc->fs_private.

3. Implement get_tree. Call one of the helpers, get_tree_bdev (block device filesystem), get_tree_nodev (pseudo filesystem), or get_tree_single (singleton mount), to fill fc->root.

4. Register at module init. Call register_filesystem(&my_fs_type) in your module's init function and unregister_filesystem in the exit function.
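
Putting the steps together, here is a minimal sketch of a pseudo filesystem module; the myfs names and the magic number are hypothetical, and error handling is abbreviated:
/* Sketch: registering a minimal pseudo filesystem. */
#include <linux/fs.h>
#include <linux/fs_context.h>
#include <linux/module.h>

static int myfs_fill_super(struct super_block *sb, struct fs_context *fc)
{
    static const struct tree_descr no_files[] = { { "" } };

    /* libfs helper: sets up a root inode and dentry for us. */
    return simple_fill_super(sb, 0x6d796673 /* hypothetical magic */, no_files);
}

static int myfs_get_tree(struct fs_context *fc)
{
    return get_tree_nodev(fc, myfs_fill_super); /* no backing device */
}

static const struct fs_context_operations myfs_context_ops = {
    .get_tree = myfs_get_tree,
};

static int myfs_init_fs_context(struct fs_context *fc)
{
    fc->ops = &myfs_context_ops; /* no private state needed here */
    return 0;
}

static struct file_system_type myfs_type = {
    .owner           = THIS_MODULE,
    .name            = "myfs",
    .init_fs_context = myfs_init_fs_context,
    .kill_sb         = kill_litter_super,
};

static int __init myfs_init(void)
{
    return register_filesystem(&myfs_type);
}

static void __exit myfs_exit(void)
{
    unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");
Once the module is loaded, myfs appears in /proc/filesystems and can be mounted with mount -t myfs none /mnt.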

Major filesystem implementations

ext4 is the default Linux filesystem for most distributions. It is a journalling filesystem descended from ext2 and ext3, supporting extents (contiguous block ranges), online defragmentation, delayed allocation, and large volumes. The journal writes metadata changes to a circular log before applying them, ensuring consistency after a crash.

Key features: extents, dir_index (htree directories), flex_bg, 64bit mode for volumes over 16 TiB, inline_data for small files, metadata_csum.

Page cache and writeback

The page cache is the kernel’s unified buffer for file data. When a process reads from a file, the VFS checks the page cache first; if the page is present and up to date, no device I/O occurs. When data is written, the page is marked dirty in the cache and the write returns to user space immediately. Per-backing-device writeback workers (the bdi_writeback infrastructure, which replaced the older bdflush and pdflush threads) periodically write dirty pages back to storage. The vm.dirty_background_ratio sysctl sets the threshold at which background writeback starts, and vm.dirty_ratio sets the point at which writing processes are throttled and must wait for writeback.
/* address_space_operations callbacks relevant to writeback */
struct address_space_operations {
    int    (*read_folio)(struct file *, struct folio *);
    int    (*writepages)(struct address_space *,
                         struct writeback_control *);
    bool   (*dirty_folio)(struct address_space *, struct folio *);
    void   (*readahead)(struct readahead_control *);
    int    (*write_begin)(const struct kiocb *, struct address_space *,
                           loff_t pos, unsigned len,
                           struct folio **foliop, void **fsdata);
    int    (*write_end)(const struct kiocb *, struct address_space *,
                         loff_t pos, unsigned len, unsigned copied,
                         struct folio *folio, void *fsdata);
    bool   (*release_folio)(struct folio *, gfp_t);
};
Use fsync(2) or fdatasync(2) after critical writes to guarantee that dirty pages have been flushed to stable storage. The kernel’s writeback subsystem reports errors to fsync callers on all file descriptions that were open when the error occurred.
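
From user space, the corresponding pattern is to check the result of fsync(2) rather than only write(2); a sketch:
/* Sketch: a durable write from user space that checks fsync(2). */
#include <fcntl.h>
#include <unistd.h>

int write_durably(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    /* Force dirty pages to stable storage; a writeback error that
     * occurred while this description was open is reported here. */
    return fsync(fd);
}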

Further reading

Memory management

The page cache is backed by the memory management subsystem; understanding zones and reclaim is essential.

Locking and concurrency

VFS operations use a mix of mutexes, spinlocks, and RCU to protect shared data structures.
