
The Linux networking stack is a layered architecture that processes packets from hardware up through protocol handlers to userspace sockets, and back down in reverse for transmission. Understanding how packets move through the stack — and where to intercept or accelerate them — is essential for network driver development, performance tuning, and packet filtering.

The socket buffer: sk_buff

Every packet in the Linux networking stack is represented by a struct sk_buff (skb). This is the central data structure that carries a packet and its metadata through every layer of the stack.
/* Simplified view of struct sk_buff (include/linux/skbuff.h) */
struct sk_buff {
    /* Packet data pointers */
    unsigned char   *head;   /* start of allocated buffer */
    unsigned char   *data;   /* start of actual packet data */
    unsigned char   *tail;   /* end of packet data */
    unsigned char   *end;    /* end of allocated buffer */

    /* Length fields */
    unsigned int    len;     /* length of actual data */
    unsigned int    data_len;/* length of paged (fragmented) data */

    /* Protocol metadata */
    __be16          protocol;      /* L3 protocol (ETH_P_IP, etc.) */
    __u16           transport_header;
    __u16           network_header;
    __u16           mac_header;

    /* Routing / connection tracking */
    struct dst_entry *dst;
    struct nf_conntrack *nfct;

    /* Checksum info, timestamps, priority ... */
};
The four pointers — head, data, tail, end — define a linear buffer. Protocol headers are prepended or stripped by adjusting data. Additional data can be stored in page fragments (skb_shinfo(skb)->frags[]) to support zero-copy I/O.
skb_put() extends the data area toward end; skb_push() extends it toward head (prepending a header). skb_pull() removes bytes from the front (stripping a header during receive).
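As a sketch of how these helpers cooperate on the transmit side (struct my_hdr, hdr_len, payload, and payload_len are illustrative placeholders, not kernel names):
/* Illustrative sketch: reserve headroom, append the payload, then prepend a header */
struct sk_buff *skb = alloc_skb(hdr_len + payload_len, GFP_KERNEL);

skb_reserve(skb, hdr_len);                   /* move data/tail forward: headroom for the header */
skb_put_data(skb, payload, payload_len);     /* append payload: tail += payload_len, len += payload_len */
struct my_hdr *h = skb_push(skb, hdr_len);   /* prepend header: data -= hdr_len, len += hdr_len */
/* h now points at the new header at the front of the packet */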

Protocol layers

The stack follows the classic layered model. On receive, each layer strips its header and passes the skb upward; on transmit, each layer prepends its header and passes the skb downward.
ip_rcv() is the entry point for IPv4. It validates the IP header and hands the packet to ip_rcv_finish(), which performs the routing lookup (ip_route_input()) and either delivers the packet locally (ip_local_deliver()) or forwards it (ip_forward()).
/* IPv4 receive chain */
ip_rcv()
  -> ip_rcv_finish()
    -> ip_local_deliver()            /* local destination */
      -> ip_local_deliver_finish()
        -> tcp_v4_rcv() / udp_rcv() / ...
Fragmented datagrams are reassembled at this layer, in ip_local_deliver(), before local delivery. IP options (record route, timestamps) are also processed here.
TCP (tcp_v4_rcv()) locates the matching socket via a hash table lookup, validates the segment, and inserts it into the socket’s receive queue. The TCP state machine handles SYN/ACK processing, retransmission timers, and congestion control.
UDP (udp_rcv()) is simpler: the socket is found and the datagram is enqueued. If no socket is found, an ICMP port-unreachable message is sent.
/* Reading from a TCP socket in userspace → kernel path */
recv() → sock_recvmsg() → tcp_recvmsg()
  → skb_copy_datagram_iter()   /* copy data to userspace */
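The userspace side of that chain is an ordinary blocking read. A minimal sketch (fd is assumed to be a connected TCP socket; the helper name drain_socket is ours):
/* Drain a connected TCP socket until EOF; each recv() enters sock_recvmsg() -> tcp_recvmsg() */
#include <sys/types.h>
#include <sys/socket.h>

static long drain_socket(int fd) {
    char buf[4096];
    long total = 0;
    ssize_t n;

    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0)
        total += n;      /* n == 0 means the peer closed the connection */
    return total;
}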

Netfilter hooks and iptables/nftables

Netfilter inserts five hook points into the packet path. Packet filtering frameworks (iptables, nftables, conntrack) register callbacks at these hooks:
Hook                   Location
NF_INET_PRE_ROUTING    After L2, before the routing decision
NF_INET_LOCAL_IN       After routing, for locally-destined packets
NF_INET_FORWARD        For packets being forwarded
NF_INET_LOCAL_OUT      Locally generated packets, before routing
NF_INET_POST_ROUTING   After routing, before transmission
/* Registering a netfilter hook */
static struct nf_hook_ops my_hook = {
    .hook     = my_hook_fn,
    .pf       = NFPROTO_INET,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};
nf_register_net_hook(&init_net, &my_hook);
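The callback itself follows the nf_hookfn signature. A minimal sketch that accepts every packet (in a real module it would be defined before the nf_hook_ops structure above):
/* Hook callback referenced above; NF_DROP would discard the packet, NF_QUEUE would hand it to userspace */
static unsigned int my_hook_fn(void *priv, struct sk_buff *skb,
                               const struct nf_hook_state *state) {
    return NF_ACCEPT;
}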
Connection tracking (nf_conntrack) maintains a table of active connections. Each packet belonging to a tracked flow is associated with an nf_conn entry that records the connection state, enabling stateful filtering and NAT.
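As an illustration, a hook callback can ask conntrack which flow a packet belongs to. A minimal sketch (assumes nf_conntrack is loaded; the helper name skb_is_established is ours):
/* Look up the conntrack entry attached to an skb and test its state */
#include <net/netfilter/nf_conntrack.h>

static bool skb_is_established(const struct sk_buff *skb) {
    enum ip_conntrack_info ctinfo;
    struct nf_conn *ct = nf_ct_get(skb, &ctinfo);

    /* ct is NULL for untracked or invalid packets */
    return ct && ctinfo == IP_CT_ESTABLISHED;
}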

eBPF and XDP

eBPF (extended Berkeley Packet Filter) and XDP (eXpress Data Path) enable programmable packet processing without modifying the kernel. XDP runs eBPF programs at the earliest possible point in the receive path — either in the NIC driver (native XDP) or just after the skb is allocated (generic XDP). This makes it the fastest packet processing option available in-kernel.
/* XDP program that drops all UDP packets (compiled to BPF bytecode) */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_udp(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Every access must be bounds-checked against data_end for the verifier */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    if (ip->protocol == IPPROTO_UDP)
        return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
XDP return codes: XDP_DROP discards the packet immediately; XDP_PASS continues normal processing; XDP_TX retransmits on the same interface; XDP_REDIRECT sends to another interface or CPU. TC eBPF programs attach at the traffic control layer (after skb allocation) and can inspect or modify packets in both ingress and egress directions with full access to the skb.
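For comparison, a minimal TC ingress program might look like the sketch below (the section name and return codes follow common libbpf conventions; this is illustrative, not taken from the kernel tree):
/* TC ingress program: unlike XDP, it receives a __sk_buff with full packet metadata */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int tc_ingress(struct __sk_buff *skb) {
    /* skb->len, skb->protocol, skb->mark, etc. are available here */
    return TC_ACT_OK;   /* accept the packet; TC_ACT_SHOT would drop it */
}

char _license[] SEC("license") = "GPL";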

Network device and NAPI polling

Network devices are registered and managed through struct net_device. Drivers call alloc_netdev() to allocate the structure, then register_netdev() to make it visible to the system.
/* Simplified PCI probe: allocate and register a net_device */
static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id) {
    struct net_device *dev;
    int err;

    dev = alloc_netdev(sizeof(struct my_priv),   /* my_priv holds driver-private state */
                       "eth%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
        return -ENOMEM;

    /* configure device ... */
    err = register_netdev(dev);
    if (err)
        free_netdev(dev);
    return err;
}
To handle high-speed receive efficiently, Linux uses NAPI (New API) — an interrupt mitigation technique. When a packet arrives, the NIC raises an interrupt and the driver schedules a NAPI poll. Further interrupts are masked until the poll completes, allowing the driver to drain the receive ring in a single softirq context.
/* NAPI poll callback, called from the net_rx_action() softirq */
static int my_poll(struct napi_struct *napi, int budget) {
    int work_done = 0;

    while (work_done < budget && packet_available()) {
        process_one_packet();   /* build an skb and hand it up the stack */
        work_done++;
    }
    if (work_done < budget) {
        /* Ring drained: stop polling; the driver re-enables NIC RX interrupts after this */
        napi_complete(napi);
    }
    return work_done;
}
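The scheduling side of this is the interrupt handler. A hedged sketch (mask_rx_interrupts() and struct my_priv are illustrative placeholders for device-specific code):
/* Interrupt handler: mask further RX interrupts and hand off to NAPI */
static irqreturn_t my_irq(int irq, void *dev_id) {
    struct my_priv *priv = dev_id;

    mask_rx_interrupts(priv);      /* device-specific: stop further RX interrupts */
    napi_schedule(&priv->napi);    /* my_poll() will run from the softirq */
    return IRQ_HANDLED;
}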
Multi-queue NICs expose multiple RX/TX queue pairs, typically one per CPU, and use RSS (Receive Side Scaling) to hash flows across queues, avoiding lock contention and enabling parallel packet processing.
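A multi-queue driver allocates its net_device with the queue count up front. A small sketch (the count of 8 and struct my_priv are illustrative):
/* Allocate an Ethernet net_device with 8 TX and 8 RX queues */
struct net_device *dev = alloc_etherdev_mq(sizeof(struct my_priv), 8);
netif_set_real_num_rx_queues(dev, 8);   /* advertise the RX queue count to the stack */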

Traffic control

The Linux traffic control (tc) subsystem implements queueing disciplines (qdiscs) that control how packets are enqueued and dequeued on a network device’s TX path. Common qdiscs:
Qdisc        Use case
pfifo_fast   Default; three-band priority FIFO
fq_codel     Flow-aware fair queuing with AQM; reduces bufferbloat
htb          Hierarchical Token Bucket; rate limiting and shaping
tbf          Token Bucket Filter; simple rate limiting
netem        Network emulation; adds delay, loss, and reordering
# Replace the default qdisc with fq_codel
tc qdisc replace dev eth0 root fq_codel

# Add an HTB root qdisc with a 100 Mbit/s class
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit

Key networking syscalls

The POSIX socket API maps to in-kernel operations as follows:
Syscall              Kernel entry       Purpose
socket()             __sys_socket()     Create a socket and allocate a struct socket
bind()               __sys_bind()       Assign a local address
connect()            __sys_connect()    Initiate a connection (TCP) or set remote addr (UDP)
listen()             __sys_listen()     Mark socket as passive; set backlog
accept()             __sys_accept4()    Dequeue a completed connection
send() / sendmsg()   sock_sendmsg()     Transmit data
recv() / recvmsg()   sock_recvmsg()     Receive data
sendmsg() with MSG_ZEROCOPY allows the kernel to DMA data directly from userspace buffers, avoiding a copy into kernel memory for large transmissions on supported NICs.
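A minimal sketch of the userspace side (Linux 4.14+ and current glibc headers assumed; the socket must opt in with SO_ZEROCOPY, and completion handling via the error queue is omitted here):
/* Send with MSG_ZEROCOPY: the pages backing buf stay pinned until the kernel
 * signals completion on the socket's error queue (recvmsg() with MSG_ERRQUEUE). */
#include <sys/types.h>
#include <sys/socket.h>

static ssize_t send_zerocopy(int fd, const void *buf, size_t len) {
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;
    return send(fd, buf, len, MSG_ZEROCOPY);
}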

Memory management

How the kernel allocates and frees socket buffers and network data structures.

Locking primitives

RCU usage in routing tables and spinlocks in the socket and netdev layer.

Filesystems

Socket file descriptors, the VFS file object model, and splice/sendfile internals.

Scheduling

Softirq scheduling, NAPI budget interaction with the CPU scheduler.
