What you’ll learn
By the end of this tutorial, you’ll know how to:
- Understand when to use continued pre-training vs fine-tuning
- Prepare raw text data for continued pre-training
- Perform domain adaptation on language-specific corpora
- Follow up with instruction fine-tuning for translation
- Test your adapted model on translation tasks
- Export models for deployment
Prerequisites
- GPU: This tutorial requires a GPU. You can run it for free on Google Colab using an NVIDIA T4 GPU
- Python: Python 3.8 or higher
- Basic knowledge: Familiarity with language modeling concepts
What is continued pre-training?
Continued pre-training (CPT) extends the pre-training phase on domain-specific or language-specific data. Unlike fine-tuning on instruction pairs, CPT:
- Trains on raw, unlabeled text
- Teaches the model new vocabulary and concepts
- Adapts to specific languages or domains
- Improves downstream task performance
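The "raw, unlabeled text" point can be made concrete: CPT uses the standard next-token objective, so the labels are just the input shifted by one position. A minimal sketch, using a toy whitespace "vocabulary" rather than a real tokenizer:

```python
# Minimal sketch of the CPT training objective. The toy whitespace
# vocabulary is an illustration; real runs use the model's tokenizer.
def make_cpt_example(text, vocab, eos_id=0):
    """Turn raw text into (input_ids, labels) for next-token prediction."""
    ids = [vocab[w] for w in text.split()] + [eos_id]  # append EOS
    input_ids = ids[:-1]  # model sees everything but the final token
    labels = ids[1:]      # and must predict the next token at each step
    return input_ids, labels

vocab = {"the": 1, "model": 2, "learns": 3, "korean": 4}
x, y = make_cpt_example("the model learns korean", vocab)
print(x)  # [1, 2, 3, 4]
print(y)  # [2, 3, 4, 0]
```

Every position contributes to the loss, which is what lets CPT absorb vocabulary and style from unlabeled text.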
When to use CPT
Use continued pre-training when you need to:
Adapt to new languages
- Add support for low-resource languages
- Improve performance on non-English languages
- Teach language-specific grammar and syntax
Domain specialization
- Medical, legal, or technical domains
- Industry-specific terminology
- Specialized writing styles
Knowledge injection
- New facts and information
- Recent events or developments
- Proprietary knowledge bases
When not to use CPT
- Simple instruction following (use SFT instead)
- Tasks with abundant labeled data
- When the base model already performs well
Tutorial overview
The tutorial uses a two-phase approach:
Phase 1: Continued pre-training
- Data preparation: Load Korean Wikipedia dataset
- Tokenization: Process raw text into training format
- CPT training: Train on unlabeled Korean text
- Evaluation: Test language understanding
Phase 2: Instruction fine-tuning
- Translation data: Prepare instruction-response pairs
- SFT training: Fine-tune on translation examples
- Inference: Test translation capabilities
- Export: Save for deployment
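The Phase 1 tokenization step typically concatenates documents and splits them into fixed-length blocks. A hedged sketch of that packing logic (the function name and block size are illustrative, not taken from the notebook):

```python
# Illustrative "packing" step for CPT tokenization: concatenate token id
# streams with an EOS id between documents, then chunk into equal blocks.
def pack_into_blocks(token_streams, block_size, eos_id):
    """Concatenate token id lists, then chunk into fixed-length blocks."""
    flat = []
    for ids in token_streams:
        flat.extend(ids)
        flat.append(eos_id)  # document boundary marker
    n_full = len(flat) // block_size  # drop the ragged tail
    return [flat[i * block_size:(i + 1) * block_size] for i in range(n_full)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
print(pack_into_blocks(docs, block_size=4, eos_id=0))
# [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```

Packing keeps every batch the same length, which avoids wasting compute on padding during long CPT runs.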
Key concepts
Base model vs instruct model
This tutorial starts with a base model (LFM2.5-1.2B-Base):
- Pre-trained on raw text without instruction tuning
- Better starting point for CPT
- More flexible for domain adaptation
By contrast, starting from an instruct model would risk:
- Catastrophic forgetting of instruction-following
- Need for careful learning rate tuning
- Longer adaptation time
Two-phase training rationale
Phase 1 (CPT): Teaches the model the Korean language
- Vocabulary and grammar
- Common phrases and expressions
- Language-specific patterns
Phase 2 (SFT): Teaches the translation task
- Input-output format
- Translation conventions
- Specific translation patterns
Important: CCE disabling
For CPT, you must disable Cut Cross-Entropy (CCE):
- CPT uses raw text completion
- CCE is optimized for instruction tuning
- If left enabled, training will fail
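The underlying difference is in how labels are built. Instruction tuning masks prompt tokens so only the response is scored, while CPT scores every token. A toy contrast (label conventions follow the common Hugging Face `ignore_index = -100`; the helper names are illustrative):

```python
# Contrast between instruction-tuning labels (prompt masked out) and
# CPT labels (every token is a target). -100 is the conventional
# ignore_index used by Hugging Face cross-entropy losses.
IGNORE = -100

def sft_labels(prompt_ids, response_ids):
    """Mask the prompt; only response tokens contribute to the loss."""
    return [IGNORE] * len(prompt_ids) + list(response_ids)

def cpt_labels(ids):
    """Every token is a prediction target in continued pre-training."""
    return list(ids)

prompt, response = [11, 12, 13], [21, 22]
print(sft_labels(prompt, response))   # [-100, -100, -100, 21, 22]
print(cpt_labels(prompt + response))  # [11, 12, 13, 21, 22]
```

A loss path optimized around the masked, instruction-style layout is the wrong fit for CPT's all-token labels, which is why the tutorial disables CCE for Phase 1.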
Data preparation
The tutorial uses two datasets:
Phase 1 - Korean Wikipedia:
- Large corpus of Korean text
- Unlabeled, raw text format
- Teaches language fundamentals
Phase 2 - Translation pairs:
- Instruction-response format
- Korean ↔ English examples
- Teaches translation task
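The Phase 2 records are instruction-response pairs. A hedged sketch of formatting one Korean ↔ English pair (the template string and field names are assumptions, not the notebook's exact prompt format):

```python
# Illustrative formatting of one translation pair into an
# instruction-response record. Template and keys are assumptions.
def format_translation_pair(src, tgt, src_lang="Korean", tgt_lang="English"):
    return {
        "instruction": f"Translate the following {src_lang} text to {tgt_lang}.",
        "input": src,
        "output": tgt,
    }

pair = format_translation_pair("안녕하세요", "Hello")
print(pair["instruction"])  # Translate the following Korean text to English.
```

Swapping `src` and `tgt` (and the language names) yields the reverse direction, so both Korean → English and English → Korean examples can come from the same parallel data.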
Deployment options
After adaptation, you can deploy your model to:
- Mobile: Android and iOS apps using the LEAP SDK
- Desktop: Mac (MLX), Windows/Linux (llama.cpp, Ollama, LM Studio)
- Cloud: vLLM, Modal, Baseten, Fal for production deployments
- Edge: On-device inference for low-latency applications
Run the tutorial
You can run this tutorial in two ways:
- Google Colab (recommended): Click the “Open in Colab” badge at the top
- Local environment: Clone the LFM Cookbook repository and run the notebook locally
Access the notebook
The complete notebook is available at:
- GitHub: cpt_translation_with_unsloth.ipynb
- Colab: Click the badge above to open directly in Google Colab
Adapting to your language
To adapt this tutorial for another language:
- Find a raw text corpus: Wikipedia dumps, CommonCrawl, or domain-specific text
- Adjust dataset: Replace Korean Wikipedia with your target language
- Prepare translation pairs: Create instruction data for your language
- Tune training duration: More data may need longer training
- Evaluate thoroughly: Test on diverse examples
Expected results
After training, your model should:
- Understand and generate text in the target language
- Perform translation between languages
- Maintain instruction-following capabilities
- Show improved fluency compared to the base model
Next steps
After completing this tutorial, you can:
- Use CPT for creative-writing text completion
- Apply CPT to other languages or domains
- Experiment with different corpus sizes and mixing ratios
- Deploy your model using the inference guides
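Experimenting with mixing ratios can be as simple as interleaving the target-language corpus with some general-domain text. A small sketch, assuming a deterministic round-robin for clarity (real pipelines usually sample stochastically):

```python
# Illustrative corpus mixing: `ratio` primary examples per secondary
# example, e.g. target-language text interleaved with general text.
def mix_corpora(primary, secondary, ratio=3):
    """Interleave ratio primary examples per secondary example."""
    mixed, p, s = [], 0, 0
    while p < len(primary) or s < len(secondary):
        for _ in range(ratio):
            if p < len(primary):
                mixed.append(primary[p]); p += 1
        if s < len(secondary):
            mixed.append(secondary[s]); s += 1
    return mixed

print(mix_corpora(["k1", "k2", "k3", "k4"], ["e1", "e2"], ratio=2))
# ['k1', 'k2', 'e1', 'k3', 'k4', 'e2']
```

Varying `ratio` trades adaptation speed against retention of the base model's general abilities; evaluate both after each run.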