Deep dive into x86/x86_64 instruction emulation and the threaded code interpreter
iSH emulates a complete x86 or x86_64 CPU in user mode, allowing Linux binaries to run on iOS devices. The emulation layer is the performance-critical heart of the system.
iSH uses a threaded code interpreter rather than traditional interpretation or JIT compilation.
From the README: “It’s not quite a JIT since it doesn’t target machine code. Instead it generates an array of pointers to functions called gadgets, and each gadget ends with a tailcall to the next function; like the threaded code technique used by some Forth interpreters. The result is a speedup of roughly 3-5x compared to emulation using a simpler switch dispatch.”
```c
// Array of function pointers
void (*gadgets[])() = {
    &gadget_mov_eax_imm,
    &gadget_add_eax_ebx,
    &gadget_ret,
    // ...
};

// Each gadget tail-calls the next
void gadget_mov_eax_imm() {
    cpu->eax = *ip++;  // Read inline immediate from the gadget stream
    (*ip++)();         // Tail call to next gadget
}
```
Pros: roughly 3-5x faster, no switch dispatch overhead
Cons: complex implementation, gadgets written in assembly
For comparison, a traditional switch-dispatch interpreter looks like this:

```c
while (true) {
    switch (*ip++) {
    case MOV_EAX_IMM:
        cpu->eax = *ip++;
        break;
    case ADD_EAX_EBX:
        cpu->eax += cpu->ebx;
        break;
    // ...
    }
}
```
```asm
// Pseudo-assembly for a MOV gadget
gadget_mov_reg_imm:
    mov REG, IMM     // Execute the instruction
    jmp next_gadget  // Tail call to next gadget
```
From the README: “I made the decision to write nearly all of the gadgets in assembly language. This was probably a good decision with regards to performance (though I’ll never know for sure), but a horrible decision with regards to readability, maintainability, and my sanity.”
iSH uses lazy flag evaluation to optimize performance:
emu/cpu.h:262
```c
// Instead of computing flags immediately:
// cpu->zf = (result == 0);
// cpu->sf = (result < 0);
// cpu->pf = compute_parity(result);

// Store the result and operands:
cpu->res = result;
cpu->op1 = operand1;
cpu->op2 = operand2;
cpu->zf_res = 1; // ZF should be computed from res
cpu->sf_res = 1; // SF should be computed from res
cpu->pf_res = 1; // PF should be computed from res

// Compute only when accessed:
#define ZF (cpu->zf_res ? cpu->res == 0 : cpu->zf)
#define SF (cpu->sf_res ? (int32_t)cpu->res < 0 : cpu->sf)
#define PF (cpu->pf_res ? !__builtin_parity(cpu->res & 0xff) : cpu->pf)
```
This avoids computing flag values that are never read.
The TLB (translation lookaside buffer), which caches guest-to-host address translations, is critical for performance. Memory accesses that hit in the TLB are nearly as fast as native code. The TLB is automatically flushed when:
- Memory mappings change (mmap, munmap, mprotect)
- Process context switches
Source: emu/tlb.c:4
Gadget Caching
For frequently executed code paths, the gadget code stays hot in the host CPU's instruction cache (and the gadget-pointer arrays in its data cache), reducing memory access overhead.
Lazy Flag Evaluation
Most instructions set flags but don't read them. Lazy evaluation saves significant computation by only calculating flag values when actually needed (e.g., for conditional jumps).
Source: emu/cpu.h:262