Open In Colab

This tutorial shows you how to use Continued Pre-training (CPT) with Unsloth to adapt a language model for translation tasks. We’ll use the LFM2.5-1.2B-Base model and perform continued pre-training on Korean Wikipedia data, followed by instruction fine-tuning on Korean translation examples.

What you’ll learn

By the end of this tutorial, you’ll know how to:
  • Decide when to use continued pre-training vs fine-tuning
  • Prepare raw text data for continued pre-training
  • Perform domain adaptation on language-specific corpora
  • Follow up with instruction fine-tuning for translation
  • Test your adapted model on translation tasks
  • Export models for deployment

Prerequisites

  • GPU: This tutorial requires a GPU. You can run it for free on Google Colab using an NVIDIA T4 GPU
  • Python: Python 3.8 or higher
  • Basic knowledge: Familiarity with language modeling concepts

What is continued pre-training?

Continued pre-training (CPT) extends the pre-training phase on domain-specific or language-specific data. Unlike fine-tuning on instruction pairs, CPT:
  • Trains on raw, unlabeled text
  • Teaches the model new vocabulary and concepts
  • Adapts to specific languages or domains
  • Improves downstream task performance

When to use CPT

Use continued pre-training when you need to:

Adapt to new languages

  • Add support for low-resource languages
  • Improve performance on non-English languages
  • Teach language-specific grammar and syntax

Domain specialization

  • Medical, legal, or technical domains
  • Industry-specific terminology
  • Specialized writing styles

Knowledge injection

  • New facts and information
  • Recent events or developments
  • Proprietary knowledge bases

Don’t use CPT for:
  • Simple instruction following (use SFT instead)
  • Tasks with abundant labeled data
  • Cases where the base model already performs well

Tutorial overview

The tutorial uses a two-phase approach:

Phase 1: Continued pre-training

  1. Data preparation: Load Korean Wikipedia dataset
  2. Tokenization: Process raw text into training format
  3. CPT training: Train on unlabeled Korean text
  4. Evaluation: Test language understanding
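
Steps 1–2 of this phase boil down to turning raw articles into training text. A minimal sketch, assuming the Wikipedia dataset exposes a `text` column; `EOS_TOKEN` is a placeholder here, and in the notebook you would use `tokenizer.eos_token` from the loaded model’s tokenizer:

```python
# Placeholder EOS marker; in practice use tokenizer.eos_token.
EOS_TOKEN = "<|endoftext|>"

def format_cpt_batch(examples: dict) -> dict:
    """Append an EOS token to each raw article so the model
    learns document boundaries during continued pre-training."""
    return {"text": [article + EOS_TOKEN for article in examples["text"]]}

# Toy batch standing in for rows of the Korean Wikipedia dataset.
batch = {"text": ["위키백과는 한국어 백과사전이다.", "서울은 대한민국의 수도이다."]}
formatted = format_cpt_batch(batch)
```

With Hugging Face `datasets`, this kind of function is typically applied via `dataset.map(format_cpt_batch, batched=True)`.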

Phase 2: Instruction fine-tuning

  1. Translation data: Prepare instruction-response pairs
  2. SFT training: Fine-tune on translation examples
  3. Inference: Test translation capabilities
  4. Export: Save for deployment
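
Step 1 of this phase wraps each Korean↔English pair in an instruction template. The exact template below is an assumption for illustration; match whatever prompt format the notebook uses for SFT:

```python
# Hypothetical instruction template for translation pairs.
PROMPT = (
    "### Instruction:\n"
    "Translate the following Korean text to English.\n\n"
    "### Input:\n{source}\n\n"
    "### Response:\n{target}"
)

def format_pair(source: str, target: str) -> str:
    """Render one Korean-to-English pair as an SFT training example."""
    return PROMPT.format(source=source, target=target)

example = format_pair("안녕하세요", "Hello")
```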

Key concepts

Base model vs instruct model

This tutorial starts with a base model (LFM2.5-1.2B-Base):
  • Pre-trained on raw text without instruction tuning
  • Better starting point for CPT
  • More flexible for domain adaptation

If you start with an instruct model instead, you risk:
  • Catastrophic forgetting of instruction-following abilities
  • Needing careful learning rate tuning
  • Longer adaptation time

Two-phase training rationale

Phase 1 (CPT) teaches the model the Korean language:
  • Vocabulary and grammar
  • Common phrases and expressions
  • Language-specific patterns

Phase 2 (SFT) teaches the translation task:
  • Input-output format
  • Translation conventions
  • Specific translation patterns
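
For Phase 1, Unsloth’s continued-pretraining setup lets the embedding layers train more gently than the LoRA adapters via a separate embedding learning rate. A configuration sketch, assuming Unsloth’s `UnslothTrainer`/`UnslothTrainingArguments` API; `model`, `tokenizer`, and `cpt_dataset` come from earlier setup, and the hyperparameter values are illustrative, not tuned:

```python
# Sketch only: assumes Unsloth's UnslothTrainer API and prior setup of
# model (LFM2.5-1.2B-Base with LoRA), tokenizer, and cpt_dataset.
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=cpt_dataset,        # formatted Korean Wikipedia text
    dataset_text_field="text",
    max_seq_length=2048,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,           # LoRA adapter modules
        embedding_learning_rate=5e-6, # ~10x smaller for embed_tokens/lm_head
        max_steps=120,
        output_dir="outputs_cpt",
    ),
)
```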

Important: CCE disabling

For CPT, you must disable Cut Cross-Entropy (CCE):
%env UNSLOTH_RETURN_LOGITS=1
This is required because:
  • CPT uses raw text completion
  • CCE is optimized for instruction tuning
  • Without disabling it, training will fail
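
The `%env` magic only works inside a notebook. In a plain Python script, the equivalent is to set the variable with `os.environ` before importing unsloth, so it takes effect at import time:

```python
import os

# Equivalent of the notebook magic `%env UNSLOTH_RETURN_LOGITS=1`.
# Must run before `import unsloth` so the setting is picked up.
os.environ["UNSLOTH_RETURN_LOGITS"] = "1"
```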

Data preparation

The tutorial uses two datasets.

Phase 1 - Korean Wikipedia:
  • Large corpus of Korean text
  • Unlabeled, raw text format
  • Teaches language fundamentals

Phase 2 - Translation pairs:
  • Instruction-response format
  • Korean ↔ English examples
  • Teaches translation task

Deployment options

After adaptation, you can deploy your model to:
  • Mobile: Android and iOS apps using the LEAP SDK
  • Desktop: Mac (MLX), Windows/Linux (llama.cpp, Ollama, LM Studio)
  • Cloud: vLLM, Modal, Baseten, Fal for production deployments
  • Edge: On-device inference for low-latency applications
See the deployment documentation for detailed guides.

Run the tutorial

You can run this tutorial in two ways:
  1. Google Colab (recommended): Click the “Open in Colab” badge at the top
  2. Local environment: Clone the LFM Cookbook repository and run the notebook locally

Access the notebook

The complete notebook is available in the LFM Cookbook repository.

Adapting to your language

To adapt this tutorial for another language:
  1. Find raw text corpus: Wikipedia dumps, CommonCrawl, or domain-specific text
  2. Adjust dataset: Replace Korean Wikipedia with your target language
  3. Prepare translation pairs: Create instruction data for your language
  4. Tune training duration: More data may need longer training
  5. Evaluate thoroughly: Test on diverse examples
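
Step 2 often amounts to swapping a dataset config name. A small helper, assuming the `SNAPSHOT.LANGCODE` naming pattern used by the `wikimedia/wikipedia` dataset on the Hugging Face Hub; verify the exact snapshot date and language codes on the dataset card before relying on them:

```python
def wikipedia_config(lang_code: str, snapshot: str = "20231101") -> str:
    """Build a Wikipedia dataset config name, e.g. '20231101.ko'.
    Assumes the wikimedia/wikipedia SNAPSHOT.LANGCODE convention."""
    return f"{snapshot}.{lang_code}"

korean = wikipedia_config("ko")
japanese = wikipedia_config("ja")
```

You would then load the corpus with something like `load_dataset("wikimedia/wikipedia", wikipedia_config("ja"), split="train")`.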

Expected results

After training, your model should:
  • Understand and generate text in the target language
  • Perform translation between languages
  • Maintain instruction-following capabilities
  • Show improved fluency compared to the base model

Next steps

After completing this tutorial, you can explore the deployment options above or adapt the workflow to other languages and domains.

Getting help

Need assistance? Join the Liquid AI Discord Community.
