What you’ll learn
By the end of this tutorial, you’ll know how to:
- Prepare raw text data for continued pre-training
- Train models on creative text generation datasets
- Adapt models to specific writing styles and patterns
- Generate creative text completions
- Export models for deployment
Prerequisites
- GPU: This tutorial requires a GPU. You can run it for free on Google Colab using an NVIDIA T4 GPU
- Python: Python 3.8 or higher
- Basic knowledge: Familiarity with language modeling concepts
What is text completion training?
Text completion training teaches models to:
- Continue partial text in a coherent way
- Match specific writing styles and patterns
- Generate creative content (stories, dialogue, etc.)
- Maintain consistency across longer generations
Key characteristics of text completion training:
- No instruction-response format needed
- Trains on raw, continuous text
- Optimizes for natural text continuation
- Focuses on style and creativity
When to use CPT for text completion
Use this approach when you want to:
Creative writing
- Novel and story generation
- Dialogue and screenplay writing
- Poetry and creative content
Style adaptation
- Mimic specific author styles
- Match brand voice and tone
- Adapt to genre conventions
Domain text generation
- Technical documentation
- Legal or medical text
- Code documentation and comments
Avoid CPT when you need:
- Question answering (use instruction tuning instead)
- Task-specific outputs (use SFT)
- Structured generation (use specialized fine-tuning)
Tutorial overview
The tutorial covers the following steps:
- Installation: Set up Unsloth and dependencies
- Data preparation: Load and format the Tiny Stories dataset
- Model setup: Configure LFM2.5-1.2B-Base for CPT
- Training configuration: Set up for text completion training
- Training: Run continued pre-training
- Generation: Test creative text completion
- Export: Save your model for deployment
Key concepts
The Tiny Stories dataset
Tiny Stories is a dataset of short, simple stories designed for language models, featuring:
- Narrative structure
- Story progression
- Character consistency
- Creative language patterns
Single-column format
Unlike instruction datasets with question-answer pairs, CPT uses only:
- text column: Contains the raw text for completion
There is no need for:
- Instruction formatting
- System prompts
- Response templates
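As an illustration, raw stories can be shaped into this single-column format with a few lines of Python. The `to_text_column` helper and the `"<|endoftext|>"` literal below are hypothetical stand-ins for this sketch; in the actual notebook the EOS token comes from the tokenizer (`tokenizer.eos_token`).

```python
# Sketch: shaping raw stories into the single "text" column used for CPT.
# The EOS marker tells the model where one story ends and the next begins.
EOS_TOKEN = "<|endoftext|>"  # stand-in; use tokenizer.eos_token in practice

def to_text_column(stories):
    """Return CPT-ready rows: one 'text' field per story, EOS-terminated."""
    return [{"text": story.strip() + EOS_TOKEN} for story in stories]

rows = to_text_column([
    "Once upon a time, a little fox found a shiny red ball.  ",
    "Tom and his dog ran to the park to play in the sun.",
])
```

Each resulting row carries only raw text, with no chat template or role markers, which is exactly the shape CPT expects.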
Important: CCE disabling
For CPT, you must disable Cut Cross-Entropy (CCE):
- CPT uses raw text completion
- CCE is optimized for instruction tuning
- Without disabling it, training will fail
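As a hedged sketch, one switch Unsloth exposes that bypasses its fused loss path is the `UNSLOTH_RETURN_LOGITS` environment variable; the notebook is authoritative on the exact mechanism it uses, so verify against it. Whatever the switch, it must be set before `import unsloth`.

```python
# Hedged sketch: disabling Cut Cross-Entropy for CPT.
# UNSLOTH_RETURN_LOGITS makes Unsloth return real logits instead of using
# the fused CCE loss. It must be set BEFORE importing unsloth; check the
# notebook in case it uses a different switch.
import os

os.environ["UNSLOTH_RETURN_LOGITS"] = "1"

# from unsloth import FastLanguageModel  # import only after the flag is set
```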
Base model selection
This tutorial uses LFM2.5-1.2B-Base instead of an instruct model:
- Base models are better for CPT
- Avoid catastrophic forgetting of instruction abilities
- More flexible for style adaptation
- Better starting point for creative tasks
Training configuration
Key differences from instruction fine-tuning:
Data format
- No chat templates
- No instruction wrapping
- Plain text completion
Loss computation
- Compute loss on the entire sequence
- No special token handling
- Simple next-token prediction
Hyperparameters
- Learning rates similar to SFT
- Longer context windows for narrative
- Larger batch sizes for stability
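The configuration above can be sketched with TRL's `SFTTrainer`, which the Unsloth workflow builds on. This is a sketch under assumptions: `model`, `tokenizer`, and `dataset` are presumed to exist already (e.g. from `FastLanguageModel.from_pretrained`), the hyperparameter values are illustrative rather than tuned recommendations, and argument names (`dataset_text_field`, `tokenizer` vs. `processing_class`) vary across TRL versions.

```python
# Hedged sketch of a CPT training setup with TRL's SFTTrainer.
# Note: no chat template, no instruction wrapping -- the trainer reads the
# raw "text" column and trains plain next-token prediction over it.
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,                        # assumed loaded beforehand
    tokenizer=tokenizer,                # assumed loaded beforehand
    train_dataset=dataset,              # single-column raw-text dataset
    dataset_text_field="text",          # the raw-text column from data prep
    args=SFTConfig(
        max_seq_length=2048,            # longer context helps narrative flow
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,  # larger effective batch for stability
        learning_rate=5e-5,             # similar range to SFT
        num_train_epochs=1,
        output_dir="cpt_outputs",
    ),
)
trainer.train()
```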
Generation strategies
After training, you can use various generation strategies:
Greedy decoding
- Deterministic; always picks the most likely token
- Good for consistent outputs
Sampling
- Introduces randomness for creativity
- Adjust temperature for control
Top-k / Top-p sampling
- Balance between quality and diversity
- Recommended for creative text
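To make the top-k / top-p idea concrete, here is a self-contained sketch of the filtering step. In practice you would simply pass `top_k=` / `top_p=` to `model.generate()`; `filter_probs` is a hypothetical helper written only for illustration.

```python
# Illustrative sketch of the filtering behind top-k / top-p sampling.
def filter_probs(probs, top_k=None, top_p=None):
    """Drop tokens outside the top-k set and/or the top-p nucleus,
    then renormalize. `probs` maps token -> probability."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]           # keep only the k most likely tokens
    if top_p is not None:
        kept, cum = [], 0.0
        for tok, p in items:            # smallest set covering top_p mass
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        items = kept
    total = sum(p for _, p in items)
    return {tok: p / total for tok, p in items}

dist = {"cat": 0.5, "dog": 0.3, "fox": 0.15, "owl": 0.05}
print(filter_probs(dist, top_k=2))      # only "cat" and "dog" survive
```

Lower `top_p` or `top_k` makes generations safer and more repetitive; higher values admit rarer tokens and increase diversity, which is why a moderate setting is recommended for creative text.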
Deployment options
After training, you can deploy your model to:
- Mobile: Android and iOS apps using the LEAP SDK
- Desktop: Mac (MLX), Windows/Linux (llama.cpp, Ollama, LM Studio)
- Cloud: vLLM, Modal, Baseten, Fal for production deployments
- Edge: On-device inference for low-latency applications
Run the tutorial
You can run this tutorial in two ways:
- Google Colab (recommended): Click the “Open in Colab” badge at the top
- Local environment: Clone the LFM Cookbook repository and run the notebook locally
Access the notebook
The complete notebook is available at:
- GitHub: cpt_text_completion_with_unsloth.ipynb
- Colab: Click the badge above to open directly in Google Colab
Adapting to your dataset
To use your own text corpus:
- Prepare data: Collect raw text in a single-column format
- Clean text: Remove unwanted formatting and artifacts
- Chunk appropriately: Split long documents into training chunks
- Adjust context length: Set max_seq_length based on your text
- Monitor generation: Check outputs during training
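The chunking step above can be sketched as follows. This is word-based for simplicity; `chunk_text` is a hypothetical helper, and a real pipeline would count tokenizer tokens and tie `chunk_size` to `max_seq_length`.

```python
# Minimal sketch: split a long document into overlapping fixed-size chunks.
# A small overlap preserves continuity across chunk boundaries.
def chunk_text(text, chunk_size=256, overlap=32):
    """Split text into word chunks of ~chunk_size words, overlapping by `overlap`."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(600))
chunks = chunk_text(doc, chunk_size=256, overlap=32)
```

Each chunk then becomes one row of the single-column dataset described earlier.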
Use cases
Examples of what you can build:
Story generator
- Train on story collections
- Generate new stories in similar style
- Continue partial story prompts
Dialogue system
- Train on screenplay or dialogue data
- Generate natural conversations
- Maintain character voices
Code documentation
- Train on well-documented codebases
- Generate code comments
- Write technical documentation
Brand content
- Train on brand materials
- Maintain consistent voice
- Generate marketing content
Next steps
After completing this tutorial, you can:
- Try CPT for translation for language adaptation
- Apply CPT to your own text corpus
- Experiment with different generation strategies
- Deploy your model using the inference guides