What you’ll learn
By the end of this tutorial, you’ll know how to:
- Understand when to use continued pre-training vs fine-tuning
- Prepare raw text data for continued pre-training
- Perform domain adaptation on language-specific corpora
- Follow up with instruction fine-tuning for translation
- Test your adapted model on translation tasks
- Export models for deployment
Prerequisites
- GPU: This tutorial requires a GPU. You can run it for free on Google Colab using an NVIDIA T4 GPU
- Python: Python 3.8 or higher
- Basic knowledge: Familiarity with language modeling concepts
What is continued pre-training?
Continued pre-training (CPT) extends the pre-training phase on domain-specific or language-specific data. Unlike fine-tuning on instruction pairs, CPT:
- Trains on raw, unlabeled text
- Teaches the model new vocabulary and concepts
- Adapts to specific languages or domains
- Improves downstream task performance
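The "raw, unlabeled text" point can be made concrete: CPT uses the standard next-token objective, so the labels are just the input shifted by one position. A minimal sketch, using a toy whitespace "vocabulary" rather than a real tokenizer:

```python
# Minimal sketch of the CPT training objective. The toy whitespace
# vocabulary is an illustration; real runs use the model's tokenizer.
def make_cpt_example(text, vocab, eos_id=0):
    """Turn raw text into (input_ids, labels) for next-token prediction."""
    ids = [vocab[w] for w in text.split()] + [eos_id]  # append EOS
    input_ids = ids[:-1]  # model sees everything but the final token
    labels = ids[1:]      # and must predict the next token at each step
    return input_ids, labels

vocab = {"the": 1, "model": 2, "learns": 3, "korean": 4}
x, y = make_cpt_example("the model learns korean", vocab)
print(x)  # [1, 2, 3, 4]
print(y)  # [2, 3, 4, 0]
```

Every position contributes to the loss, which is what lets CPT absorb vocabulary and style from unlabeled text.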
When to use CPT
Use continued pre-training when you need to:
Adapt to new languages
- Add support for low-resource languages
- Improve performance on non-English languages
- Teach language-specific grammar and syntax
Domain specialization
- Medical, legal, or technical domains
- Industry-specific terminology
- Specialized writing styles
Knowledge injection
- New facts and information
- Recent events or developments
- Proprietary knowledge bases
When not to use CPT
- Simple instruction following (use SFT instead)
- Tasks with abundant labeled data
- When the base model already performs well
Tutorial overview
The tutorial uses a two-phase approach:
Phase 1: Continued pre-training
- Data preparation: Load Korean Wikipedia dataset
- Tokenization: Process raw text into training format
- CPT training: Train on unlabeled Korean text
- Evaluation: Test language understanding
Phase 2: Instruction fine-tuning
- Translation data: Prepare instruction-response pairs
- SFT training: Fine-tune on translation examples
- Inference: Test translation capabilities
- Export: Save for deployment
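The Phase 1 tokenization step typically concatenates documents and splits them into fixed-length blocks. A hedged sketch of that packing logic (the function name and block size are illustrative, not taken from the notebook):

```python
# Illustrative "packing" step for CPT tokenization: concatenate token id
# streams with an EOS id between documents, then chunk into equal blocks.
def pack_into_blocks(token_streams, block_size, eos_id):
    """Concatenate token id lists, then chunk into fixed-length blocks."""
    flat = []
    for ids in token_streams:
        flat.extend(ids)
        flat.append(eos_id)  # document boundary marker
    n_full = len(flat) // block_size  # drop the ragged tail
    return [flat[i * block_size:(i + 1) * block_size] for i in range(n_full)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
print(pack_into_blocks(docs, block_size=4, eos_id=0))
# [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```

Packing keeps every batch the same length, which avoids wasting compute on padding during long CPT runs.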
Key concepts
Base model vs instruct model
This tutorial starts with a base model (LFM2.5-1.2B-Base):
- Pre-trained on raw text without instruction tuning
- Better starting point for CPT
- More flexible for domain adaptation
By contrast, starting from an instruct model would risk:
- Catastrophic forgetting of instruction-following
- Need for careful learning rate tuning
- Longer adaptation time
Two-phase training rationale
Phase 1 (CPT): Teaches the model the Korean language
- Vocabulary and grammar
- Common phrases and expressions
- Language-specific patterns
Phase 2 (SFT): Teaches the translation task
- Input-output format
- Translation conventions
- Specific translation patterns
Important: CCE disabling
For CPT, you must disable Cut Cross-Entropy (CCE):
- CPT uses raw text completion
- CCE is optimized for instruction tuning
- If left enabled, training will fail
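The underlying difference is in how labels are built. Instruction tuning masks prompt tokens so only the response is scored, while CPT scores every token. A toy contrast (label conventions follow the common Hugging Face `ignore_index = -100`; the helper names are illustrative):

```python
# Contrast between instruction-tuning labels (prompt masked out) and
# CPT labels (every token is a target). -100 is the conventional
# ignore_index used by Hugging Face cross-entropy losses.
IGNORE = -100

def sft_labels(prompt_ids, response_ids):
    """Mask the prompt; only response tokens contribute to the loss."""
    return [IGNORE] * len(prompt_ids) + list(response_ids)

def cpt_labels(ids):
    """Every token is a prediction target in continued pre-training."""
    return list(ids)

prompt, response = [11, 12, 13], [21, 22]
print(sft_labels(prompt, response))   # [-100, -100, -100, 21, 22]
print(cpt_labels(prompt + response))  # [11, 12, 13, 21, 22]
```

A loss path optimized around the masked, instruction-style layout is the wrong fit for CPT's all-token labels, which is why the tutorial disables CCE for Phase 1.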
Data preparation
The tutorial uses two datasets:
Phase 1 - Korean Wikipedia:
- Large corpus of Korean text
- Unlabeled, raw text format
- Teaches language fundamentals
Phase 2 - Translation pairs:
- Instruction-response format
- Korean ↔ English examples
- Teaches translation task
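The Phase 2 records are instruction-response pairs. A hedged sketch of formatting one Korean ↔ English pair (the template string and field names are assumptions, not the notebook's exact prompt format):

```python
# Illustrative formatting of one translation pair into an
# instruction-response record. Template and keys are assumptions.
def format_translation_pair(src, tgt, src_lang="Korean", tgt_lang="English"):
    return {
        "instruction": f"Translate the following {src_lang} text to {tgt_lang}.",
        "input": src,
        "output": tgt,
    }

pair = format_translation_pair("안녕하세요", "Hello")
print(pair["instruction"])  # Translate the following Korean text to English.
```

Swapping `src` and `tgt` (and the language names) yields the reverse direction, so both Korean → English and English → Korean examples can come from the same parallel data.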
Deployment options
After adaptation, you can deploy your model to:
- Mobile: Android and iOS apps using the LEAP SDK
- Desktop: Mac (MLX), Windows/Linux (llama.cpp, Ollama, LM Studio)
- Cloud: vLLM, Modal, Baseten, Fal for production deployments
- Edge: On-device inference for low-latency applications
Run the tutorial
You can run this tutorial in two ways:
- Google Colab (recommended): Click the “Open in Colab” badge at the top
- Local environment: Clone the LFM Cookbook repository and run the notebook locally
Access the notebook
The complete notebook is available at:
- GitHub: cpt_translation_with_unsloth.ipynb
- Colab: Click the badge above to open directly in Google Colab
Adapting to your language
To adapt this tutorial for another language:
- Find a raw text corpus: Wikipedia dumps, CommonCrawl, or domain-specific text
- Adjust dataset: Replace Korean Wikipedia with your target language
- Prepare translation pairs: Create instruction data for your language
- Tune training duration: More data may need longer training
- Evaluate thoroughly: Test on diverse examples
Expected results
After training, your model should:
- Understand and generate text in the target language
- Perform translation between languages
- Maintain instruction-following capabilities
- Show improved fluency compared to the base model
Next steps
After completing this tutorial, you can:
- Use CPT for creative-writing text completion
- Apply CPT to other languages or domains
- Experiment with different corpus sizes and mixing ratios
- Deploy your model using the inference guides
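Experimenting with mixing ratios can be as simple as interleaving the target-language corpus with some general-domain text. A small sketch, assuming a deterministic round-robin for clarity (real pipelines usually sample stochastically):

```python
# Illustrative corpus mixing: `ratio` primary examples per secondary
# example, e.g. target-language text interleaved with general text.
def mix_corpora(primary, secondary, ratio=3):
    """Interleave ratio primary examples per secondary example."""
    mixed, p, s = [], 0, 0
    while p < len(primary) or s < len(secondary):
        for _ in range(ratio):
            if p < len(primary):
                mixed.append(primary[p]); p += 1
        if s < len(secondary):
            mixed.append(secondary[s]); s += 1
    return mixed

print(mix_corpora(["k1", "k2", "k3", "k4"], ["e1", "e2"], ratio=2))
# ['k1', 'k2', 'e1', 'k3', 'k4', 'e2']
```

Varying `ratio` trades adaptation speed against retention of the base model's general abilities; evaluate both after each run.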