GLM-4 is a 9B parameter language model from Tsinghua University with a distinctive architecture featuring partial RoPE, fused MLP projections, and extra layer normalization for improved stability.
Features
Partial RoPE: Rotary position embedding on half of the head dimensions
Fused MLP: Combined gate_up_proj for better efficiency
Extra LayerNorms: Post-attention and post-MLP normalization layers
4-bit quantization: Required for consumer hardware (6 GB vs 18 GB)
Step-based KV cache: Memory-efficient generation
Installation
Add to your Cargo.toml:
[dependencies]
glm4-mlx = { path = "../glm4-mlx" }
mlx-rs = "0.18"
Quick start
Download model
Download the 4-bit quantized model (recommended):

huggingface-cli download mlx-community/glm-4-9b-chat-4bit \
  --local-dir ./models/GLM-4-9B-4bit

Or the full-precision model (requires 18 GB+ of memory):

huggingface-cli download mlx-community/glm-4-9b-chat-bf16 \
  --local-dir ./models/GLM-4-9B
Run generation example
cargo run --release --example generate_glm4 -- \
./models/GLM-4-9B-4bit "你好"
Use in your code
use glm4_mlx::{load_model, load_tokenizer, Generate, KVCache};
use mlx_rs::ops::indexing::{IndexOp, NewAxis};

let tokenizer = load_tokenizer("./models/GLM-4-9B-4bit")?;
let mut model = load_model("./models/GLM-4-9B-4bit")?;

// Tokenize the prompt and add a leading batch dimension.
let encoding = tokenizer.encode("你好,", true)?;
let prompt = mlx_rs::Array::from(encoding.get_ids()).index(NewAxis);

let mut cache = Vec::new();
let generator = Generate::<KVCache>::new(&mut model, &mut cache, 0.7, &prompt);
for token in generator.take(100) {
    let token = token?;
    print!("{}", tokenizer.decode(&[token.item::<u32>()], true)?);
}
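Note that print! does not flush stdout on its own, so streamed tokens may appear in bursts. A flush after each write keeps the stream smooth (plain standard-library Rust, not a crate API):

use std::io::Write;

print!("{}", tokenizer.decode(&[token.item::<u32>()], true)?);
std::io::stdout().flush()?;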
Architecture details
GLM-4 uses several unique architectural features:
Partial RoPE
Unlike standard transformers, which apply rotary position embeddings to all head dimensions, GLM-4 applies RoPE only to the first half (partial_rotary_factor = 0.5). This reduces computation while preserving positional awareness:
// Standard RoPE: applied to all head dimensions
let rope_dims = head_dim;     // e.g., 128

// GLM-4 partial RoPE: applied to the first half only
let rope_dims = head_dim / 2; // e.g., 64
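To make the mechanics concrete, here is a minimal, self-contained sketch of partial RoPE applied to a single head vector. This is not the crate's kernel; it assumes an interleaved (even, odd) pair layout for the rotated half, whereas real implementations may use a split-half layout instead:

// Rotate only the first `rope_dims` features; the rest pass through unchanged.
fn partial_rope(x: &mut [f32], pos: usize, rope_dims: usize, theta: f32) {
    for i in (0..rope_dims).step_by(2) {
        // Pair (i, i+1) rotates at frequency theta^(-i/rope_dims).
        let freq = 1.0 / theta.powf(i as f32 / rope_dims as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (x[i], x[i + 1]);
        x[i] = a * cos - b * sin;
        x[i + 1] = a * sin + b * cos;
    }
    // x[rope_dims..] carries no positional rotation (the non-rotary half).
}

fn main() {
    let head_dim = 128;
    let rope_dims = head_dim / 2; // partial_rotary_factor = 0.5
    let mut q = vec![1.0f32; head_dim];
    partial_rope(&mut q, 7, rope_dims, 10000.0);
    println!("rotated q[0] = {:.4}, untouched q[{}] = {}", q[0], head_dim - 1, q[head_dim - 1]);
}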
Fused gate_up_proj
The MLP layer uses a single projection from hidden_size to 2×intermediate_size, then splits the result into the gate and up paths:
x → gate_up_proj(dim → 2×ffn_dim) → split → [gate, up]
                                               ↓     ↓
                                            silu()   │
                                               ↓     │
                                            multiply ←┘
                                               ↓
                                    down_proj(ffn_dim → dim)
This is more efficient than separate projections:
// Traditional approach (two matrix multiplies)
let gate = gate_proj.forward(&x)?;
let up = up_proj.forward(&x)?;

// GLM-4 fused approach (one matrix multiply plus a split)
let gate_up = gate_up_proj.forward(&x)?;
let (gate, up) = gate_up.split(2, -1)?;
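For intuition, the same idea as a self-contained sketch on plain vectors, with W_gate and W_up packed row-wise into a single matrix (helper names are illustrative, not the crate's API):

// Naive matrix-vector product: w is [rows][cols], x is [cols].
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter().map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum()).collect()
}

fn silu(v: f32) -> f32 { v / (1.0 + (-v).exp()) }

// One fused matmul produces both halves; split, gate, then project back down.
fn fused_mlp(x: &[f32], w_gate_up: &[Vec<f32>], w_down: &[Vec<f32>]) -> Vec<f32> {
    let gate_up = matvec(w_gate_up, x);                   // [2 × ffn]
    let (gate, up) = gate_up.split_at(gate_up.len() / 2); // two [ffn] halves
    let hidden: Vec<f32> = gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect();
    matvec(w_down, &hidden)                               // [ffn] → [dim]
}

fn main() {
    let (dim, ffn) = (4, 6);
    let w_gate_up = vec![vec![0.1f32; dim]; 2 * ffn]; // packed [W_gate; W_up]
    let w_down = vec![vec![0.1f32; ffn]; dim];
    println!("{:?}", fused_mlp(&[1.0, 2.0, 3.0, 4.0], &w_gate_up, &w_down));
}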
Extra LayerNorms
Each decoder layer has four LayerNorm operations:
input_layernorm - Before attention
post_self_attn_layernorm - After attention, before residual
post_attention_layernorm - Before MLP
post_mlp_layernorm - After MLP, before residual
This provides better gradient flow and training stability compared to standard transformers with only 2 LayerNorms per block.
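The sketch below shows where the four norms sit in the residual stream, assuming (as the list above states) that the two post-norms are applied to the sublayer output before the residual add; `attn`, `mlp`, and `norm` are stand-ins for the real modules:

// Residual stream of one GLM-4 decoder layer with its four norm sites.
fn decoder_layer(
    x: Vec<f32>,
    attn: impl Fn(&[f32]) -> Vec<f32>,
    mlp: impl Fn(&[f32]) -> Vec<f32>,
    norm: impl Fn(&[f32]) -> Vec<f32>, // one stand-in reused for all four RMSNorms
) -> Vec<f32> {
    // 1. input_layernorm, then attention
    let h = attn(&norm(&x));
    // 2. post_self_attn_layernorm on the attention output, then the residual add
    let x: Vec<f32> = x.iter().zip(norm(&h)).map(|(a, b)| a + b).collect();
    // 3. post_attention_layernorm, then the MLP
    let h = mlp(&norm(&x));
    // 4. post_mlp_layernorm on the MLP output, then the final residual add
    x.iter().zip(norm(&h)).map(|(a, b)| a + b).collect()
}

fn main() {
    let id = |v: &[f32]| v.to_vec(); // identity stand-ins just to run the sketch
    println!("{:?}", decoder_layer(vec![1.0, 2.0], id, id, id));
}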
Code example
From examples/generate_glm4.rs:
use glm4_mlx::{load_model, load_tokenizer, Generate, KVCache};
use mlx_rs::ops::indexing::{IndexOp, NewAxis};
use mlx_rs::transforms::eval;
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model_dir = "./models/GLM-4-9B-4bit";
    let prompt = "你好,请介绍一下自己。";

    println!("Loading model from: {}", model_dir);
    let start = Instant::now();
    let tokenizer = load_tokenizer(model_dir)?;
    let mut model = load_model(model_dir)?;
    println!("Model loaded in {:.2}s", start.elapsed().as_secs_f32());

    // Tokenize the prompt and add a leading batch dimension.
    let encoding = tokenizer.encode(prompt, true)?;
    let prompt_tokens = mlx_rs::Array::from(encoding.get_ids()).index(NewAxis);
    println!("Prompt ({} tokens): {}", encoding.get_ids().len(), prompt);
    println!("---");

    // Generate with temperature sampling.
    let mut cache = Vec::new();
    let generator = Generate::<KVCache>::new(&mut model, &mut cache, 0.7, &prompt_tokens);

    let mut tokens = Vec::new();
    for (i, token) in generator.enumerate() {
        let token = token?;
        tokens.push(token.clone());

        // Decode in batches of 10 to amortize the cost of eval().
        if tokens.len() % 10 == 0 {
            eval(&tokens)?;
            let slice: Vec<u32> = tokens.drain(..).map(|t| t.item::<u32>()).collect();
            let text = tokenizer.decode(&slice, true)?;
            print!("{}", text);
        }
        if i >= 100 {
            break;
        }
    }

    // Flush any remaining tokens.
    if !tokens.is_empty() {
        eval(&tokens)?;
        let slice: Vec<u32> = tokens.drain(..).map(|t| t.item::<u32>()).collect();
        print!("{}", tokenizer.decode(&slice, true)?);
    }
    println!();

    Ok(())
}
Supported models
GLM-4-9B (bf16)
Size: 18 GB
Precision: bfloat16
Use case: Maximum quality (requires 32 GB+ RAM)
Download:
huggingface-cli download mlx-community/glm-4-9b-chat-bf16 \
  --local-dir ./models/GLM-4-9B

GLM-4-9B (4-bit)
Size: 6 GB
Precision: 4-bit quantized
Use case: Recommended for consumer hardware
Download:
huggingface-cli download mlx-community/glm-4-9b-chat-4bit \
  --local-dir ./models/GLM-4-9B-4bit
Converting models
Convert from HuggingFace with 4-bit quantization:
pip install mlx-lm
mlx_lm.convert --hf-path THUDM/glm-4-9b-chat -q
Without quantization:
mlx_lm.convert --hf-path THUDM/glm-4-9b-chat
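If you want to control the output directory and quantization settings explicitly, mlx-lm exposes them as flags (the values shown are its documented defaults; the output path is illustrative):

mlx_lm.convert --hf-path THUDM/glm-4-9b-chat \
  --mlx-path ./models/GLM-4-9B-4bit \
  -q --q-bits 4 --q-group-size 64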
Model configuration
GLM-4-9B configuration:
{
  "hidden_size": 4096,
  "num_hidden_layers": 40,
  "num_attention_heads": 32,
  "num_key_value_heads": 2,
  "intermediate_size": 13696,
  "partial_rotary_factor": 0.5,
  "vocab_size": 151552,
  "rope_theta": 10000.0,
  "rms_norm_eps": 1.5625e-07
}
Key parameters:
Grouped Query Attention: 32 query heads, 2 KV heads (16:1 ratio)
Partial RoPE: factor 0.5, so RoPE is applied to 64 of the 128 head dimensions (derived in the sketch below)
Large intermediate size: 13696 dims (3.34× hidden size)
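A quick sanity check that derives these numbers from the configuration values (plain Rust, no crate dependencies):

fn main() {
    let hidden_size = 4096usize;
    let num_attention_heads = 32usize;
    let num_key_value_heads = 2usize;
    let partial_rotary_factor = 0.5f64;

    let head_dim = hidden_size / num_attention_heads;                    // 4096 / 32 = 128
    let rope_dims = (head_dim as f64 * partial_rotary_factor) as usize;  // 128 × 0.5 = 64
    let gqa_ratio = num_attention_heads / num_key_value_heads;           // 32 / 2 = 16

    println!("head_dim = {head_dim}, rope_dims = {rope_dims}, GQA = {gqa_ratio}:1");
}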
Memory requirements
Model             Weights   KV Cache (2K ctx)   Total
GLM-4-9B (bf16)   18 GB     ~1 GB               ~19 GB
GLM-4-9B (4-bit)  6 GB      ~1 GB               ~7 GB
The 4-bit model fits comfortably on M1/M2/M3 devices with 16GB+ unified memory.
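For intuition on why the cache stays small, here is a back-of-envelope calculation of the raw K/V tensor size at 2K context, assuming a 16-bit cache; treating the table's ~1 GB as a conservative budget that also covers step-based preallocation and runtime buffers is my assumption:

fn main() {
    // 2 tensors (K and V) × layers × KV heads × head_dim × context × bytes per element
    let (layers, kv_heads, head_dim, ctx, bytes) = (40usize, 2, 128, 2048, 2);
    let kv_bytes = 2 * layers * kv_heads * head_dim * ctx * bytes;
    println!("raw K/V tensors: {} MiB", kv_bytes / (1024 * 1024)); // 80 MiB
    // GQA (2 KV heads instead of 32) is what keeps this an order of magnitude
    // smaller than a full multi-head cache would be.
}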
Inference speed
On Apple M3 Max (estimated based on architecture):
Prompt processing: ~200 tok/s (4-bit)
Token generation: ~50 tok/s (4-bit)
Throughput should be similar to Qwen3-8B, given the comparable parameter count and architectural complexity.
Chinese language support
GLM-4 is optimized for Chinese language understanding with:
A large vocabulary (151K tokens) with extensive Chinese coverage
Training on large Chinese corpora
Better tokenization efficiency for Chinese text (see the check below)
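To see the difference yourself, compare token counts using the same tokenizer API as the quick start (the model path and sample strings are illustrative):

use glm4_mlx::load_tokenizer;

let tokenizer = load_tokenizer("./models/GLM-4-9B-4bit")?;
let zh = tokenizer.encode("你好,请介绍一下自己。", true)?;
let en = tokenizer.encode("Hello, please introduce yourself.", true)?;
println!("zh: {} tokens, en: {} tokens", zh.get_ids().len(), en.get_ids().len());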
Use Chinese prompts for best results:
let prompt = "你好,请介绍一下自己。"; // "Hello, please introduce yourself."
API reference
Loading functions
pub fn load_model(model_dir: impl AsRef<Path>) -> Result<Model, Error>
pub fn load_tokenizer(model_dir: impl AsRef<Path>) -> Result<Tokenizer, Error>
Generation
pub struct Generate<C: KeyValueCache> {
    // fields omitted
}

impl<C: KeyValueCache> Generate<C> {
    pub fn new(
        model: &mut Model,
        cache: &mut Vec<C>,
        temperature: f32,
        prompt: &Array,
    ) -> Self
}
Iterator yielding generated tokens. Use temperature = 0.0 for greedy decoding.
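For deterministic output, pass a temperature of 0.0 to the same constructor:

// Greedy decoding: always take the highest-probability token
let generator = Generate::<KVCache>::new(&mut model, &mut cache, 0.0, &prompt);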
Troubleshooting
Model loads slowly
GLM-4-9B has 40 layers, which take time to load. Quantizing the bf16 weights to 4-bit reduces both load time and memory use:

# Quantize an existing bf16 model
mlx_lm.convert --hf-path ./models/GLM-4-9B -q
Out of memory
GLM-4-9B (bf16) requires 20GB+ memory. Solutions:
Use 4-bit quantized model instead
Close other applications
Reduce generation length
Unexpected Chinese output
GLM-4 is trained primarily on Chinese text. For English prompts, you may get mixed-language responses. This is expected behavior.
See also
Qwen3 - an alternative model of comparable size with similar performance