
The Linux networking stack is a layered architecture that processes packets from hardware up through protocol handlers to userspace sockets, and back down in reverse for transmission. Understanding how packets move through the stack — and where to intercept or accelerate them — is essential for network driver development, performance tuning, and packet filtering.

The socket buffer: sk_buff

Every packet in the Linux networking stack is represented by a struct sk_buff (skb). This is the central data structure that carries a packet and its metadata through every layer of the stack.
/* Simplified view of struct sk_buff (include/linux/skbuff.h) */
struct sk_buff {
    /* Packet data pointers */
    unsigned char   *head;   /* start of allocated buffer */
    unsigned char   *data;   /* start of actual packet data */
    unsigned char   *tail;   /* end of packet data */
    unsigned char   *end;    /* end of allocated buffer */

    /* Length fields */
    unsigned int    len;     /* length of actual data */
    unsigned int    data_len;/* length of paged (fragmented) data */

    /* Protocol metadata */
    __be16          protocol;      /* L3 protocol (ETH_P_IP, etc.) */
    __u16           transport_header;
    __u16           network_header;
    __u16           mac_header;

    /* Routing / connection tracking */
    struct dst_entry *dst;
    struct nf_conntrack *nfct;

    /* Checksum info, timestamps, priority ... */
};
The four pointers — head, data, tail, end — define a linear buffer. Protocol headers are prepended or stripped by adjusting data. Additional data can be stored in page fragments (skb_shinfo(skb)->frags[]) to support zero-copy I/O.
skb_put() extends the data area toward end; skb_push() extends it toward head (prepending a header). skb_pull() removes bytes from the front (stripping a header during receive).
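As a sketch of how these helpers cooperate on the transmit side (struct my_hdr, hdr_len, payload, and payload_len are illustrative placeholders, not kernel names):
/* Illustrative sketch: reserve headroom, append the payload, then prepend a header */
struct sk_buff *skb = alloc_skb(hdr_len + payload_len, GFP_KERNEL);

skb_reserve(skb, hdr_len);                   /* move data/tail forward: headroom for the header */
skb_put_data(skb, payload, payload_len);     /* append payload: tail += payload_len, len += payload_len */
struct my_hdr *h = skb_push(skb, hdr_len);   /* prepend header: data -= hdr_len, len += hdr_len */
/* h now points at the new header at the front of the packet */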

Protocol layers

The stack follows the classic layered model. On receive, each layer strips its header and passes the skb upward; on transmit, each layer prepends its header and passes the skb downward.
ip_rcv() is the entry point for IPv4. It validates the IP header and hands the packet to ip_rcv_finish(), which performs the routing lookup (ip_route_input()) and either delivers the packet locally (ip_local_deliver()) or forwards it (ip_forward()).
/* IPv4 receive chain */
ip_rcv()
  -> ip_rcv_finish()
    -> ip_local_deliver()            /* local destination */
      -> ip_local_deliver_finish()
        -> tcp_v4_rcv() / udp_rcv() / ...
Fragmented datagrams are reassembled at this layer, in ip_local_deliver(), before local delivery. IP options (record route, timestamps) are also processed here.
TCP (tcp_v4_rcv()) locates the matching socket via a hash table lookup, validates the segment, and inserts it into the socket’s receive queue. The TCP state machine handles SYN/ACK processing, retransmission timers, and congestion control.
UDP (udp_rcv()) is simpler: the socket is found and the datagram is enqueued. If no socket is found, an ICMP port-unreachable message is sent.
/* Reading from a TCP socket in userspace → kernel path */
recv() → sock_recvmsg() → tcp_recvmsg()
  → skb_copy_datagram_iter()   /* copy data to userspace */
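The userspace side of that chain is an ordinary blocking read. A minimal sketch (fd is assumed to be a connected TCP socket; the helper name drain_socket is ours):
/* Drain a connected TCP socket until EOF; each recv() enters sock_recvmsg() -> tcp_recvmsg() */
#include <sys/types.h>
#include <sys/socket.h>

static long drain_socket(int fd) {
    char buf[4096];
    long total = 0;
    ssize_t n;

    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0)
        total += n;      /* n == 0 means the peer closed the connection */
    return total;
}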

Netfilter hooks and iptables/nftables

Netfilter inserts five hook points into the packet path. Packet filtering frameworks (iptables, nftables, conntrack) register callbacks at these hooks:
Hook                   Location
NF_INET_PRE_ROUTING    After L2, before the routing decision
NF_INET_LOCAL_IN       After routing, for locally-destined packets
NF_INET_FORWARD        For packets being forwarded
NF_INET_LOCAL_OUT      Locally generated packets, before routing
NF_INET_POST_ROUTING   After routing, before transmission
/* Registering a netfilter hook */
static struct nf_hook_ops my_hook = {
    .hook     = my_hook_fn,
    .pf       = NFPROTO_INET,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};
nf_register_net_hook(&init_net, &my_hook);
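The callback itself follows the nf_hookfn signature. A minimal sketch that accepts every packet (in a real module it would be defined before the nf_hook_ops structure above):
/* Hook callback referenced above; NF_DROP would discard the packet, NF_QUEUE would hand it to userspace */
static unsigned int my_hook_fn(void *priv, struct sk_buff *skb,
                               const struct nf_hook_state *state) {
    return NF_ACCEPT;
}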
Connection tracking (nf_conntrack) maintains a table of active connections. Each packet belonging to a tracked flow is associated with an nf_conn entry that records the connection state, enabling stateful filtering and NAT.
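As an illustration, a hook callback can ask conntrack which flow a packet belongs to. A minimal sketch (assumes nf_conntrack is loaded; the helper name skb_is_established is ours):
/* Look up the conntrack entry attached to an skb and test its state */
#include <net/netfilter/nf_conntrack.h>

static bool skb_is_established(const struct sk_buff *skb) {
    enum ip_conntrack_info ctinfo;
    struct nf_conn *ct = nf_ct_get(skb, &ctinfo);

    /* ct is NULL for untracked or invalid packets */
    return ct && ctinfo == IP_CT_ESTABLISHED;
}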

eBPF and XDP

eBPF (extended Berkeley Packet Filter) and XDP (eXpress Data Path) enable programmable packet processing without modifying the kernel. XDP runs eBPF programs at the earliest possible point in the receive path — either in the NIC driver (native XDP) or just after the skb is allocated (generic XDP). This makes it the fastest packet processing option available in-kernel.
/* XDP program that drops all UDP packets (compiled to BPF bytecode) */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_udp(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Every access must be bounds-checked against data_end for the verifier */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    if (ip->protocol == IPPROTO_UDP)
        return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
XDP return codes: XDP_DROP discards the packet immediately; XDP_PASS continues normal processing; XDP_TX retransmits on the same interface; XDP_REDIRECT sends to another interface or CPU. TC eBPF programs attach at the traffic control layer (after skb allocation) and can inspect or modify packets in both ingress and egress directions with full access to the skb.
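For comparison, a minimal TC ingress program might look like the sketch below (the section name and return codes follow common libbpf conventions; this is illustrative, not taken from the kernel tree):
/* TC ingress program: unlike XDP, it receives a __sk_buff with full packet metadata */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int tc_ingress(struct __sk_buff *skb) {
    /* skb->len, skb->protocol, skb->mark, etc. are available here */
    return TC_ACT_OK;   /* accept the packet; TC_ACT_SHOT would drop it */
}

char _license[] SEC("license") = "GPL";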

Network device and NAPI polling

Network devices are registered and managed through struct net_device. Drivers call alloc_netdev() to allocate the structure, then register_netdev() to make it visible to the system.
/* Simplified PCI probe: allocate and register a net_device */
static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id) {
    struct net_device *dev;
    int err;

    dev = alloc_netdev(sizeof(struct my_priv),   /* my_priv holds driver-private state */
                       "eth%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
        return -ENOMEM;

    /* configure device ... */
    err = register_netdev(dev);
    if (err)
        free_netdev(dev);
    return err;
}
To handle high-speed receive efficiently, Linux uses NAPI (New API) — an interrupt mitigation technique. When a packet arrives, the NIC raises an interrupt and the driver schedules a NAPI poll. Further interrupts are masked until the poll completes, allowing the driver to drain the receive ring in a single softirq context.
/* NAPI poll callback, called from the net_rx_action() softirq */
static int my_poll(struct napi_struct *napi, int budget) {
    int work_done = 0;

    while (work_done < budget && packet_available()) {
        process_one_packet();   /* build an skb and hand it up the stack */
        work_done++;
    }
    if (work_done < budget) {
        /* Ring drained: stop polling; the driver re-enables NIC RX interrupts after this */
        napi_complete(napi);
    }
    return work_done;
}
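The scheduling side of this is the interrupt handler. A hedged sketch (mask_rx_interrupts() and struct my_priv are illustrative placeholders for device-specific code):
/* Interrupt handler: mask further RX interrupts and hand off to NAPI */
static irqreturn_t my_irq(int irq, void *dev_id) {
    struct my_priv *priv = dev_id;

    mask_rx_interrupts(priv);      /* device-specific: stop further RX interrupts */
    napi_schedule(&priv->napi);    /* my_poll() will run from the softirq */
    return IRQ_HANDLED;
}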
Multi-queue NICs expose multiple RX/TX queue pairs, typically one per CPU, and use RSS (Receive Side Scaling) to hash flows across queues, avoiding lock contention and enabling parallel packet processing.
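A multi-queue driver allocates its net_device with the queue count up front. A small sketch (the count of 8 and struct my_priv are illustrative):
/* Allocate an Ethernet net_device with 8 TX and 8 RX queues */
struct net_device *dev = alloc_etherdev_mq(sizeof(struct my_priv), 8);
netif_set_real_num_rx_queues(dev, 8);   /* advertise the RX queue count to the stack */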

Traffic control

The Linux traffic control (tc) subsystem implements queueing disciplines (qdiscs) that control how packets are enqueued and dequeued on a network device’s TX path. Common qdiscs:
Qdisc        Use case
pfifo_fast   Default; three-band priority FIFO
fq_codel     Flow-aware fair queuing with AQM; reduces bufferbloat
htb          Hierarchical Token Bucket; rate limiting and shaping
tbf          Token Bucket Filter; simple rate limiting
netem        Network emulation; adds delay, loss, and reordering
# Replace the default qdisc with fq_codel
tc qdisc replace dev eth0 root fq_codel

# Add an HTB root qdisc with a 100 Mbit/s class
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit

Key networking syscalls

The POSIX socket API maps to in-kernel operations as follows:
Syscall              Kernel entry       Purpose
socket()             __sys_socket()     Create a socket and allocate a struct socket
bind()               __sys_bind()       Assign a local address
connect()            __sys_connect()    Initiate a connection (TCP) or set remote addr (UDP)
listen()             __sys_listen()     Mark socket as passive; set backlog
accept()             __sys_accept4()    Dequeue a completed connection
send() / sendmsg()   sock_sendmsg()     Transmit data
recv() / recvmsg()   sock_recvmsg()     Receive data
sendmsg() with MSG_ZEROCOPY allows the kernel to DMA data directly from userspace buffers, avoiding a copy into kernel memory for large transmissions on supported NICs.
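A minimal sketch of the userspace side (Linux 4.14+ and current glibc headers assumed; the socket must opt in with SO_ZEROCOPY, and completion handling via the error queue is omitted here):
/* Send with MSG_ZEROCOPY: the pages backing buf stay pinned until the kernel
 * signals completion on the socket's error queue (recvmsg() with MSG_ERRQUEUE). */
#include <sys/types.h>
#include <sys/socket.h>

static ssize_t send_zerocopy(int fd, const void *buf, size_t len) {
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;
    return send(fd, buf, len, MSG_ZEROCOPY);
}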

Memory management

How the kernel allocates and frees socket buffers and network data structures.

Locking primitives

RCU usage in routing tables and spinlocks in the socket and netdev layer.

Filesystems

Socket file descriptors, the VFS file object model, and splice/sendfile internals.

Scheduling

Softirq scheduling, NAPI budget interaction with the CPU scheduler.
