The Virtual Filesystem (VFS), also known as the Virtual Filesystem Switch, is the kernel layer that provides a uniform filesystem interface to userspace while allowing entirely different storage backends — on-disk filesystems, network filesystems, pseudo-filesystems, and in-memory filesystems — to coexist transparently. System calls such as open(2), read(2), write(2), and stat(2) all pass through the VFS before reaching any filesystem-specific code.
VFS architecture
The VFS sits between the POSIX system call interface and individual filesystem implementations. When a process calls open("/mnt/data/file.txt", O_RDONLY), the kernel:
- Resolves the path through the directory entry cache (dcache).
- Looks up or creates the inode for each path component.
- Allocates a struct file and inserts it into the process’s file descriptor table.
- Calls the filesystem’s open() method.
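The uniform interface is visible from userspace. The sketch below (illustrative only; the two paths are arbitrary examples) runs the same open/fstat/read sequence against files that may live on different filesystems, and the VFS routes each call to whichever filesystem owns the mount:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Whether the path lives on ext4, tmpfs, NFS, or procfs, the same
     * syscalls apply; the VFS dispatches them to the owning filesystem. */
    const char *paths[] = { "/etc/hostname", "/proc/version" };

    for (int i = 0; i < 2; i++) {
        int fd = open(paths[i], O_RDONLY);   /* new entry in the fd table */
        if (fd < 0) {
            perror(paths[i]);
            continue;
        }

        struct stat st;
        fstat(fd, &st);                      /* metadata comes from the inode */

        char buf[128];
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("%s: %s", paths[i], buf);
        }
        close(fd);
    }
    return 0;
}
```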
Registering and mounting a filesystem
A filesystem registers itself with the VFS at module load time via register_filesystem(); registered filesystems are listed in /proc/filesystems. When mount(2) is called, the VFS invokes the filesystem’s init_fs_context method to initialise a struct fs_context, then calls get_tree() to obtain the superblock.
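A minimal registration sketch against the fs_context API; the name "myfs" and every myfs_* function are hypothetical placeholders, and get_tree is stubbed rather than implemented:

```c
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/fs_context.h>
#include <linux/module.h>

static int myfs_get_tree(struct fs_context *fc)
{
	/* A real filesystem would call get_tree_bdev() or get_tree_nodev()
	 * here to create (or reuse) the superblock. Stubbed for brevity. */
	return -ENOSYS;
}

static const struct fs_context_operations myfs_context_ops = {
	.get_tree = myfs_get_tree,
};

static int myfs_init_fs_context(struct fs_context *fc)
{
	fc->ops = &myfs_context_ops;	/* invoked by the VFS during mount(2) */
	return 0;
}

static struct file_system_type myfs_type = {
	.owner           = THIS_MODULE,
	.name            = "myfs",	/* shows up in /proc/filesystems */
	.init_fs_context = myfs_init_fs_context,
	.kill_sb         = kill_anon_super,
};

static int __init myfs_init(void)
{
	return register_filesystem(&myfs_type);	/* module load time */
}

static void __exit myfs_exit(void)
{
	unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");
```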
The superblock object
A superblock represents a mounted filesystem instance. It holds filesystem-wide metadata and a pointer to struct super_operations, which the VFS calls to manage the filesystem’s lifecycle:
- dirty_inode — called when inode metadata (not data) is modified.
- write_inode — called by the writeback thread to flush inode metadata to disk.
- evict_inode — cleans up page cache and on-disk structures when an inode is removed from memory.
- sync_fs — called by sync(2) to flush all dirty data.
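A sketch of how a filesystem might wire up these hooks; the myfs_* callbacks are hypothetical and the bodies are stubs:

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/writeback.h>

static void myfs_dirty_inode(struct inode *inode, int flags)
{
	/* inode metadata (ownership, timestamps, size) changed; a real
	 * filesystem queues it so write_inode() runs later */
}

static int myfs_write_inode(struct inode *inode, struct writeback_control *wbc)
{
	/* flush the inode's on-disk metadata; wbc->sync_mode says whether
	 * the caller is willing to wait for the I/O */
	return 0;
}

static void myfs_evict_inode(struct inode *inode)
{
	truncate_inode_pages_final(&inode->i_data);	/* drop cached pages */
	clear_inode(inode);				/* release VFS state */
}

static int myfs_sync_fs(struct super_block *sb, int wait)
{
	/* push all dirty filesystem-wide state out; reached via sync(2) */
	return 0;
}

static const struct super_operations myfs_super_ops = {
	.dirty_inode = myfs_dirty_inode,
	.write_inode = myfs_write_inode,
	.evict_inode = myfs_evict_inode,
	.sync_fs     = myfs_sync_fs,
};
```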
The inode object
An inode represents a single filesystem object: a regular file, directory, symbolic link, device node, FIFO, or socket. It holds the object’s metadata — ownership, permissions, timestamps, size, and block pointers — and a pointer to struct inode_operations.
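An abridged, illustrative view of the structure; the exact field names and layout vary between kernel versions, so treat this as a reading aid rather than a reference:

```c
/* Abridged sketch of struct inode from <linux/fs.h>. */
struct inode {
	umode_t			i_mode;		/* object type + permission bits */
	kuid_t			i_uid;		/* owner */
	kgid_t			i_gid;		/* group */
	loff_t			i_size;		/* size in bytes */
	struct timespec64	i_atime;	/* access time */
	struct timespec64	i_mtime;	/* data modification time */
	struct timespec64	i_ctime;	/* metadata change time */
	const struct inode_operations *i_op;	/* lookup, create, mkdir, ... */
	const struct file_operations  *i_fop;	/* default f_op for opened files */
	struct address_space	*i_mapping;	/* this inode's page cache */
	/* ... many more fields ... */
};
```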
The dentry cache
Dentries map path components (names) to inodes and are cached in the dentry cache (dcache) to make repeated path lookups fast. Dentries live only in RAM; they are never written to disk. A negative dentry caches the fact that a name does not exist, avoiding repeated disk lookups. Network filesystems implement d_revalidate to check with the server whether a cached dentry is still valid; local filesystems typically leave it as NULL, since local dentries do not go stale.
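A sketch of how a network filesystem might supply d_revalidate; the myfs_* names are hypothetical, and the two-argument signature matches long-standing kernels (recent releases pass additional arguments):

```c
#include <linux/dcache.h>
#include <linux/errno.h>
#include <linux/namei.h>

static int myfs_d_revalidate(struct dentry *dentry, unsigned int flags)
{
	if (flags & LOOKUP_RCU)
		return -ECHILD;	/* can't sleep in RCU-walk mode; retry in ref-walk */

	/* ...ask the server whether this name still maps to the same inode... */

	return 1;		/* 1 = dentry still valid, 0 = drop it */
}

static const struct dentry_operations myfs_dentry_ops = {
	.d_revalidate = myfs_d_revalidate,
};
```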
The file object
Opening a file allocates a struct file, the open file description that a file descriptor in the process’s file descriptor table points to. It holds the current seek position, open flags, and a pointer to struct file_operations.
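A sketch of the corresponding operations table, with hypothetical myfs_* callbacks and stub bodies:

```c
#include <linux/fs.h>
#include <linux/module.h>

/* The VFS reaches these through file->f_op once open(2) has allocated
 * the struct file. */
static int myfs_open(struct inode *inode, struct file *file)
{
	/* per-open state can be hung off file->private_data */
	return 0;
}

static ssize_t myfs_read(struct file *file, char __user *buf,
			 size_t count, loff_t *ppos)
{
	/* *ppos is the seek position kept in this struct file; a real
	 * implementation copies data to buf and advances *ppos */
	return 0;	/* 0 bytes = end of file */
}

static const struct file_operations myfs_file_ops = {
	.owner  = THIS_MODULE,
	.open   = myfs_open,
	.read   = myfs_read,
	.llseek = generic_file_llseek,
};
```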
Page cache and writeback
The page cache is the kernel’s unified buffer for file data. When a process reads a file, the kernel loads pages into the page cache and serves subsequent reads from there. Writes go to the page cache first (making pages dirty), and a background thread — the writeback daemon (flusher threads) — periodically flushes dirty pages to disk.
Pages are organised per-inode in an address_space structure (sometimes called the inode’s mapping). Writeback is triggered when the amount of dirty memory crosses the dirty ratio (/proc/sys/vm/dirty_ratio) or the dirty background ratio (dirty_background_ratio), and by fsync(2) called by the application.
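A short userspace sketch of forcing writeback early; the file path is an arbitrary placeholder:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/var/tmp/wb-demo.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* write() only dirties pages in the page cache... */
    if (write(fd, "commit\n", 7) != 7) { perror("write"); return 1; }

    /* ...fsync() forces writeback of those pages (and the inode) now,
     * instead of waiting for the flusher threads or dirty thresholds. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```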
Key filesystem implementations
ext4 — journalling block filesystem
ext4 is the default filesystem for many Linux distributions. Its journal (in a special inode) records metadata operations before they are applied to the main filesystem, ensuring consistency after a crash. Three journalling modes are supported:
journal (data + metadata), ordered (metadata journalled, data flushed first; the default), and writeback (metadata only).
Key features: extents for large files, delayed allocation, online resizing, and dir_index (htree directories) for large directories.
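The journalling mode is chosen per mount with the data= option. A sketch using mount(2); the device, mount point, and option string are placeholders, and the call requires CAP_SYS_ADMIN:

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* data=ordered is the default mode; it is spelled out here only to
     * show where journalling options are passed. */
    if (mount("/dev/sdb1", "/mnt/data", "ext4", 0,
              "data=ordered,errors=remount-ro") < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```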
btrfs — copy-on-write filesystem
btrfs uses a copy-on-write (COW) B-tree structure for all metadata and optionally for data. This enables instant snapshots, online scrubbing and RAID-like redundancy across multiple devices, transparent compression, and data checksumming.
xfs — high-performance 64-bit filesystem
xfs was designed for large files and high-throughput workloads. It uses B+ trees throughout, supports online growing, and provides excellent parallel I/O performance due to its allocation group design (each group manages its own free space independently, reducing contention). xfs does not support online shrinking.
tmpfs — memory-backed filesystem
tmpfs stores all data in the page cache (anonymous or swap-backed pages). There is no persistent backing store; data is lost on unmount. It is used for /tmp, /run, and shared memory (shm_open()). Unlike ramfs, tmpfs obeys memory limits and can swap pages out under pressure.
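A small sketch of the shm_open() path (the object name is a placeholder; older glibc needs -lrt): the shared-memory object is an ordinary file on the tmpfs mounted at /dev/shm, so its pages live in the page cache and may be swapped out under pressure.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello from tmpfs-backed shared memory");

    munmap(p, 4096);
    close(fd);
    shm_unlink("/demo_shm");   /* remove the name; the data vanishes with it */
    return 0;
}
```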
overlayfs — union filesystem for containers
overlayfs stacks a read-write upper layer on top of a read-only lower layer. Reads come from whichever layer has the file; writes go to the upper layer via copy-on-write. This is the mechanism behind container image layers in Docker and Podman.
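A sketch of assembling such a stack with mount(2); all directory paths are placeholders that must already exist, and workdir must be an empty directory on the same filesystem as upperdir:

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    const char *opts =
        "lowerdir=/var/lib/image/rootfs,"     /* read-only image layer */
        "upperdir=/var/lib/container/upper,"  /* receives all writes (COW) */
        "workdir=/var/lib/container/work";    /* scratch space for overlayfs */

    if (mount("overlay", "/var/lib/container/merged", "overlay", 0, opts) < 0) {
        perror("mount overlay");
        return 1;
    }
    return 0;
}
```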
FUSE — userspace filesystems
FUSE (Filesystem in USErspace) allows filesystem implementations to run as ordinary userspace processes. The kernel FUSE module translates VFS calls into requests that are sent over a /dev/fuse file descriptor to the userspace daemon, which processes them and returns responses.
FUSE has higher per-operation overhead than in-kernel filesystems because each VFS call requires a context switch to userspace and back. For latency-critical workloads, prefer in-kernel implementations.
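As an illustration, a minimal read-only FUSE filesystem against the libfuse 3 high-level API; the hellofs_* names and file contents are arbitrary, and this is a sketch rather than production code:

```c
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>

static const char *msg = "hello from FUSE\n";

static int hellofs_getattr(const char *path, struct stat *st,
                           struct fuse_file_info *fi)
{
    (void) fi;
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
        return 0;
    }
    if (strcmp(path, "/hello") == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = (off_t) strlen(msg);
        return 0;
    }
    return -ENOENT;
}

static int hellofs_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                           off_t off, struct fuse_file_info *fi,
                           enum fuse_readdir_flags flags)
{
    (void) off; (void) fi; (void) flags;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    fill(buf, ".", NULL, 0, 0);
    fill(buf, "..", NULL, 0, 0);
    fill(buf, "hello", NULL, 0, 0);
    return 0;
}

static int hellofs_read(const char *path, char *buf, size_t size, off_t off,
                        struct fuse_file_info *fi)
{
    (void) fi;
    size_t len = strlen(msg);
    if (strcmp(path, "/hello") != 0)
        return -ENOENT;
    if ((size_t) off >= len)
        return 0;
    if (off + size > len)
        size = len - (size_t) off;
    memcpy(buf, msg + off, size);
    return (int) size;
}

static const struct fuse_operations hellofs_ops = {
    .getattr = hellofs_getattr,
    .readdir = hellofs_readdir,
    .read    = hellofs_read,
};

int main(int argc, char *argv[])
{
    /* Each operation arrives as a request over /dev/fuse and is handled
     * here, in an ordinary userspace process. */
    return fuse_main(argc, argv, &hellofs_ops, NULL);
}
```

Build against libfuse 3 (for example, gcc hellofs.c $(pkg-config fuse3 --cflags --libs)) and run the binary with a mount-point argument; the kernel FUSE module then forwards every VFS call on that mount point to this process.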
Filesystem mounting and namespaces
Mounts are tracked in struct mount objects organised into a mount tree. Each process has a reference to a mount namespace (struct mnt_namespace) that determines its view of the filesystem hierarchy. Different namespaces can have entirely different mount trees, which is the foundation for container filesystem isolation.
Mount propagation types (MS_SHARED, MS_PRIVATE, MS_SLAVE) control whether mount and unmount events propagate between namespaces. The behaviour is described in detail in Documentation/filesystems/sharedsubtrees.rst.
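A sketch of the container-style pattern (requires CAP_SYS_ADMIN or a user namespace; the tmpfs mount point is a placeholder): create a private mount namespace, stop propagation back to the parent, and mount something only this namespace can see.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Give this process its own mount namespace. */
    if (unshare(CLONE_NEWNS) < 0) { perror("unshare"); return 1; }

    /* Mark the whole tree private so mounts made here do not propagate
     * back to the parent namespace (see sharedsubtrees.rst). */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0) {
        perror("mount MS_PRIVATE");
        return 1;
    }

    /* This tmpfs is visible only inside the new namespace. */
    if (mount("tmpfs", "/mnt", "tmpfs", 0, "size=16m") < 0) {
        perror("mount tmpfs");
        return 1;
    }
    return 0;
}
```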
Direct I/O and io_uring
Direct I/O (O_DIRECT) bypasses the page cache entirely, transferring data directly between userspace buffers and the block device. This is useful for databases that implement their own caching. Buffers must be aligned to the block size.
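A sketch of an O_DIRECT write; the path is a placeholder on a block-backed filesystem (tmpfs, which often backs /tmp, rejects O_DIRECT), and 4096 stands in for the device's real alignment requirement:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096, len = 4096;

    int fd = open("/mnt/data/direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Buffer address, file offset, and length must all be aligned. */
    void *buf;
    if (posix_memalign(&buf, align, len) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 'x', len);

    /* Bypasses the page cache: data moves straight from buf to the device. */
    if (pwrite(fd, buf, len, 0) != (ssize_t) len) { perror("pwrite"); return 1; }

    free(buf);
    close(fd);
    return 0;
}
```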
Memory management
The page cache is part of the memory management subsystem; dirty page reclaim interacts directly with the MM layer.
Locking primitives
VFS operations use a mix of inode semaphores, RCU for dcache lookups, and spinlocks in the page cache.
Networking
Network filesystems (NFS, SMB, 9P) sit below the VFS and above the networking stack.
Scheduling
Writeback kthreads and flusher daemons are scheduled by the same scheduler as ordinary tasks.
