Introduction
Data labeling is critical for supervised learning. This guide covers deploying Argilla for human annotation and generating synthetic datasets with LLMs.
Why Data Labeling Matters
Quality: High-quality labels directly improve model performance
Consistency: Clear guidelines ensure inter-annotator agreement
Efficiency: Proper tools accelerate the labeling process
Cost: Plan the labeling budget based on dataset size and complexity
Argilla
Argilla is an open-source platform for data labeling and feedback collection.
Key Features
Modern UI: Intuitive interface for annotators
Flexible: Text, token, ranking, and custom tasks
Python SDK: Programmatic dataset creation
Collaboration: Multi-user support with workspaces
Feedback: Collect model predictions for RLHF
Integration: Works with HuggingFace, OpenAI
Quick Start with Docker
docker run -it --rm --name argilla -p 6900:6900 \
argilla/argilla-quickstart:v2.0.0rc1
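Access: once the container is running, the Argilla UI is served on port 6900 of the Docker host (the port published by the command above).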
Alternative Deployments
Kubernetes
Deploy on K8s for production:
kubectl apply -f https://raw.githubusercontent.com/argilla-io/argilla/develop/examples/deployments/k8s/argilla.yaml
Full K8s examples
Railway
One-click deployment. Automatically provisions:
Argilla server
PostgreSQL database
Persistent storage
Docker Compose
Production-ready setup:
version: '3.8'
services:
  argilla:
    image: argilla/argilla-server:latest
    ports:
      - "6900:6900"
    environment:
      ARGILLA_DATABASE_URL: postgresql://user:pass@db/argilla
    depends_on:
      - db
  db:
    image: postgres:14
    environment:
      POSTGRES_PASSWORD: pass
Creating Labeling Datasets
Simple Text-to-SQL Dataset
labeling/create_dataset.py
from datasets import load_dataset
import argilla as rg

client = rg.Argilla(
    api_url="http://0.0.0.0:6900",
    api_key="argilla.apikey",
)
WORKSPACE_NAME = "admin"


def create_text2sql_dataset():
    # Define guidelines
    guidelines = """
    Please examine the given SQL question and context.
    Write the correct SQL query that accurately answers
    the question based on the context provided.
    Ensure the query follows SQL syntax and logic correctly.
    """
    # Create dataset settings
    settings = rg.Settings(
        guidelines=guidelines,
        fields=[
            rg.TextField(
                name="query",
                title="Query",
                use_markdown=False,
            ),
            rg.TextField(
                name="schema",
                title="Schema",
                use_markdown=True,
            ),
        ],
        questions=[
            rg.TextQuestion(
                name="sql",
                title="Please write SQL for this query",
                description="Please write SQL for this query",
                required=True,
                use_markdown=True,
            )
        ],
    )
    # Create dataset
    dataset = rg.Dataset(
        name="text2sql-123",
        settings=settings,
        workspace=WORKSPACE_NAME,
        client=client,
    )
    dataset.create()
    # Load and add data
    data = load_dataset("b-mc2/sql-create-context")
    records = []
    for row in data["train"]:
        x = rg.Record(
            fields={
                "query": row["question"],
                "schema": row["context"],
            },
        )
        records.append(x)
    dataset = client.datasets(name="text2sql-123")
    dataset.records.log(records, batch_size=1000)
Run:
uv run ./labeling/create_dataset.py
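The record-building loop above can be split into a field-mapping step and a batching step, which makes it easy to inspect a small pilot batch before logging the full dataset. The following is a minimal sketch that works on plain dicts with the same `question` and `context` keys; `rows_to_fields` and `batched` are hypothetical helper names, not part of the Argilla SDK.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def rows_to_fields(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Map source rows to the field names used in the dataset settings."""
    for row in rows:
        yield {"query": row["question"], "schema": row["context"]}


def batched(items: Iterable[Dict[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `size` items until the input is exhausted."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Stand-in rows; in the script above these would come from data["train"].
demo_rows = [{"question": f"q{i}", "context": f"c{i}"} for i in range(5)]
demo_batches = list(batched(rows_to_fields(demo_rows), 2))
print([len(batch) for batch in demo_batches])
```

Each dict in a batch has the shape expected by the `fields` argument of `rg.Record` in the script above.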
Synthetic Data Generation
Use LLMs to generate training data programmatically.
labeling/create_dataset_synthetic.py
import sqlite3


def get_sqllite_schema(db_name: str) -> str:
    with sqlite3.connect(db_name) as conn:
        cursor = conn.cursor()
        # sqlite_master.sql already holds each table's full CREATE TABLE
        # statement, so it is selected directly instead of being wrapped
        # in a second "CREATE TABLE name (...)".
        cursor.execute("SELECT sql FROM sqlite_master WHERE type='table';")
        db_schema_records = cursor.fetchall()
    db_schema = [x[0] for x in db_schema_records if x[0]]
    db_schema = "\n".join(db_schema)
    return db_schema
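To see what the schema string looks like without downloading a database, the sketch below builds a tiny throwaway SQLite file and reads the `sql` column of `sqlite_master`, which stores each table's `CREATE TABLE` statement. The table names and the helpers `build_demo_db` and `read_schema` are illustrative assumptions, not part of the original script.

```python
import os
import sqlite3
import tempfile


def build_demo_db(db_path: str) -> None:
    """Create a two-table database loosely modelled on a music catalogue."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
        conn.execute(
            "CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)"
        )
        conn.commit()
    finally:
        conn.close()


def read_schema(db_path: str) -> str:
    """Return the stored CREATE TABLE statements, one per line, sorted by table name."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' ORDER BY name;"
        ).fetchall()
    finally:
        conn.close()
    return "\n".join(row[0] for row in rows if row[0])


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.db")
    build_demo_db(demo_path)
    demo_schema = read_schema(demo_path)

print(demo_schema)
```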
Generate Synthetic Examples
labeling/create_dataset_synthetic.py
import json
from typing import Dict

from openai import OpenAI
from retry import retry


@retry(tries=3, delay=1)
def generate_synthetic_example(db_schema: str) -> Dict[str, str]:
    # OpenAI() reads the API key from the OPENAI_API_KEY environment variable
    client = OpenAI()
    prompt = f"""
    Corresponding database schema: {db_schema}
    Please generate an example of what user might ask
    from this database: in plain text and in SQL.
    Return only JSON with format {{ "user_text": "...", "sql": "..." }}
    """
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are SQLite and SQL expert.",
            },
            {
                "role": "user",
                "content": prompt,
            },
        ],
        model="gpt-4o",
        response_format={"type": "json_object"},
        temperature=1,
    )
    sample = json.loads(chat_completion.choices[0].message.content)
    # A failed check raises, which triggers another attempt via @retry
    assert "user_text" in sample
    assert "sql" in sample
    return sample
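The two `assert` checks above only confirm that the keys exist. A slightly stricter validation step can be sketched as below; `parse_sample` is a hypothetical helper, and the choice to require non-empty strings is an assumption about what counts as a usable sample. Raising `ValueError` instead of using `assert` keeps the check active under `python -O`, which strips assert statements.

```python
import json
from typing import Dict

REQUIRED_KEYS = ("user_text", "sql")


def parse_sample(raw: str) -> Dict[str, str]:
    """Parse a model response and check that both fields are non-empty strings."""
    sample = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(sample, dict):
        raise ValueError("expected a JSON object")
    for key in REQUIRED_KEYS:
        value = sample.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key}")
    # Drop any extra keys and surrounding whitespace
    return {key: sample[key].strip() for key in REQUIRED_KEYS}


demo_sample = parse_sample(
    '{"user_text": " How many albums are there? ", "sql": "SELECT COUNT(*) FROM albums;"}'
)
print(demo_sample)
```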
Create Synthetic Dataset
labeling/create_dataset_synthetic.py
from tqdm import tqdm
import argilla as rg

# Same connection settings as in labeling/create_dataset.py
client = rg.Argilla(
    api_url="http://0.0.0.0:6900",
    api_key="argilla.apikey",
)


def create_text2sql_dataset_synthetic(num_samples: int = 10):
    db_schema = get_sqllite_schema("examples/chinook.db")
    # Generate samples
    samples = []
    for _ in tqdm(range(num_samples)):
        sample = generate_synthetic_example(db_schema=db_schema)
        samples.append(sample)
    # Create guidelines with schema
    guidelines = f"""
    Please examine the given SQL question and context.
    Write the correct SQL query that accurately answers
    the question based on the context provided.
    DB schema: \n\n{db_schema}\n\n
    To verify the query:
    - Download: https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip
    - Install SQLite
    - Run: sqlite3 chinook.db
    """
    # Create dataset
    settings = rg.Settings(
        guidelines=guidelines,
        fields=[
            rg.TextField(name="schema", title="Schema", use_markdown=True),
            rg.TextField(name="sync_query", title="Query", use_markdown=False),
            rg.TextField(name="sync_sql", title="SQL", use_markdown=True),
        ],
        questions=[
            rg.BooleanQuestion(
                name="valid",
                title="Is this SQL query correct?",
                description="Validate the SQL query",
                required=True,
            )
        ],
    )
    dataset = rg.Dataset(
        name="text2sql-chinook-synthetic-123",
        workspace="admin",
        settings=settings,
        client=client,
    )
    dataset.create()
    # Add records
    records = [
        rg.Record(
            fields={
                "sync_sql": sample["sql"],
                "sync_query": sample["user_text"],
                "schema": db_schema,
            }
        )
        for sample in samples
    ]
    dataset.records.log(records, batch_size=1000)
Run:
uv run ./labeling/create_dataset_synthetic.py
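The guidelines ask annotators to verify each query in `sqlite3`. Part of that check can be automated before records reach annotators: the sketch below asks SQLite to compile a generated statement with `EXPLAIN`, which prepares the query against the schema without running it. `sql_error` is a hypothetical helper and the in-memory table is a stand-in for the real database. A query that compiles can still answer the wrong question, so this only filters out statements that cannot run at all.

```python
import sqlite3
from typing import Optional


def sql_error(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if SQLite can compile the statement, else the error message."""
    try:
        # EXPLAIN compiles the statement and returns its bytecode listing
        # instead of executing it.
        conn.execute(f"EXPLAIN {sql}")
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return str(exc)
    return None


# Stand-in schema; the real script would connect to examples/chinook.db.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT)")

print(sql_error(demo_conn, "SELECT Title FROM albums"))
print(sql_error(demo_conn, "SELECT Title FROM missing_table"))
```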
Labeling Guidelines
Good guidelines are essential for consistent annotations.
Guidelines Template
# [Task Name] Labeling Guidelines
## Objective
[Clear description of what annotators should accomplish]
## Task Definition
[Detailed explanation of the task]
## Label Definitions
### Label 1
- **Description** : ...
- **Example** : ...
- **Non-example** : ...
### Label 2
- **Description** : ...
- **Example** : ...
- **Non-example** : ...
## Decision Tree
1. First, check if...
2. Then, determine if...
3. Finally, assign...
## Edge Cases
- **Case 1** : How to handle...
- **Case 2** : What to do when...
## Quality Checks
- [ ] Label makes sense given context
- [ ] Followed decision tree
- [ ] Checked edge cases
## Examples
### Example 1
**Input** : ...
**Correct Label** : ...
**Rationale** : ...
### Example 2
**Input** : ...
**Correct Label** : ...
**Rationale** : ...
Best Practices
Use simple, unambiguous language
Provide concrete examples
Include visual aids when helpful
Define domain-specific terms
Cover all edge cases
Provide decision flowcharts
Include non-examples
Address ambiguous cases
Start with pilot labeling (50 samples)
Measure inter-annotator agreement
Update guidelines based on confusion
Re-label if agreement < 80%
Use gold-standard test sets
Calculate Cohen’s kappa
Review disagreements
Provide ongoing feedback
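Raw agreement and Cohen's kappa can give quite different pictures, because kappa subtracts the agreement expected by chance. A minimal pure-Python sketch for two annotators follows; the label lists are made-up pilot data, and the function names are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


annotator_1 = ["pos"] * 6 + ["neg"] * 4
annotator_2 = ["pos"] * 5 + ["neg"] * 4 + ["pos"]

print(f"agreement: {percent_agreement(annotator_1, annotator_2):.2f}")
print(f"kappa:     {cohens_kappa(annotator_1, annotator_2):.2f}")
```

With these made-up labels the annotators agree on 8 of 10 items, yet kappa is about 0.58, since two annotators who both favor "pos" would often agree by chance.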
Cost Estimation
Pilot Study Process
1. Label 50 samples
2. Time your labeling process:
   import time
   start = time.time()
   # Label 50 samples
   elapsed = time.time() - start
   time_per_sample = elapsed / 50
   print(f"Average: {time_per_sample:.2f}s per sample")
3. Calculate total time:
   total_samples = 10000
   time_per_sample = 30  # seconds
   total_hours = (total_samples * time_per_sample) / 3600
   print(f"Total: {total_hours:.1f} hours")
4. Estimate cost:
   hourly_rate = 15  # USD
   total_cost = total_hours * hourly_rate
   # Add 20% for quality control
   total_cost *= 1.2
   print(f"Estimated cost: ${total_cost:,.2f}")
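The pilot-study arithmetic can be wrapped in one function so that different rates and dataset sizes are easy to compare. This is a small sketch using the same numbers as the steps above; the function name and the `qc_overhead` parameter are illustrative.

```python
from typing import Dict


def estimate_labeling_cost(
    total_samples: int,
    seconds_per_sample: float,
    hourly_rate: float,
    qc_overhead: float = 0.2,
) -> Dict[str, float]:
    """Return labeling hours, base cost, and cost including quality control."""
    hours = total_samples * seconds_per_sample / 3600
    base_cost = hours * hourly_rate
    return {
        "hours": hours,
        "base_cost": base_cost,
        "total_cost": base_cost * (1 + qc_overhead),
    }


estimate = estimate_labeling_cost(
    total_samples=10000, seconds_per_sample=30, hourly_rate=15
)
print(f"Total: {estimate['hours']:.1f} hours")
print(f"Estimated cost: ${estimate['total_cost']:,.2f}")
```

For 10,000 samples at 30 seconds each and 15 USD per hour, this gives about 83.3 hours, 1,250 USD before quality control, and 1,500 USD with the 20% overhead.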
Typical Ranges
| Task Type | Time/Sample | Cost/1000 Samples |
|---|---|---|
| Binary classification | 5-15s | $20-$100 |
| Multi-class | 15-30s | $60-$200 |
| Named entity recognition | 30-60s | $150-$400 |
| Semantic segmentation | 2-5 min | $500-$2000 |
| Question answering | 1-3 min | $250-$1000 |
Data Validation
Ensure label quality with automated checks.
Using Cleanlab
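from cleanlab.classification import CleanLearning

# YourClassifier is a placeholder for any scikit-learn-compatible classifier;
# X_train and noisy_labels are your features and (possibly noisy) labels.
cl = CleanLearning(clf=YourClassifier())
cl.fit(X_train, noisy_labels)
# Find label issues: get_label_issues() returns one row per sample,
# so count only the rows flagged as issues
issues = cl.get_label_issues()
num_issues = int(issues["is_label_issue"].sum())
print(f"Found {num_issues} potential label errors")
# Predict with the model trained on the cleaned data
cleaned_labels = cl.predict(X_train)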
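To build intuition for what a label-issue detector looks for, here is a simplified heuristic in the spirit of confident learning. It is not cleanlab's implementation: it computes a per-class confidence threshold (the mean predicted probability of class k among samples labeled k) and flags a sample when the class it confidently belongs to differs from its given label. The probabilities are made up, and in practice they should come from out-of-sample (cross-validated) predictions.

```python
from typing import List, Sequence


def class_thresholds(
    labels: Sequence[int], pred_probs: Sequence[Sequence[float]], num_classes: int
) -> List[float]:
    """Mean predicted probability of class k over the samples labeled k."""
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else 1.0)
    return thresholds


def find_label_issues(
    labels: Sequence[int], pred_probs: Sequence[Sequence[float]]
) -> List[int]:
    """Indices whose confidently predicted class differs from the given label."""
    num_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, num_classes)
    issues = []
    for idx, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if confident and max(confident, key=lambda k: p[k]) != y:
            issues.append(idx)
    return issues


given_labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
print(find_label_issues(given_labels, probs))
```

Here only the third sample (index 2) is flagged: it is labeled 0, but the model assigns class 1 a probability above that class's threshold.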
Using Deepchecks
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import data_integrity

# Create dataset
ds = Dataset(df, label="target", cat_features=["cat1", "cat2"])
# Run integrity checks
suite = data_integrity()
result = suite.run(ds)
# View results
result.show()
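Some integrity problems specific to labeled text can be caught with a few lines of plain Python before running a full suite. The sketch below is not a Deepchecks check; it is a hypothetical helper that reports missing labels, duplicate texts, and duplicates that carry conflicting labels. The `text` and `label` keys and the sample rows are assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence


def integrity_report(
    rows: Sequence[Dict[str, Any]], text_key: str = "text", label_key: str = "label"
) -> Dict[str, Any]:
    """Report missing labels, duplicate texts, and conflicting duplicate labels."""
    missing = [i for i, row in enumerate(rows) if row.get(label_key) in (None, "")]
    # Group row indices by normalized text (trimmed, lower-cased)
    by_text: Dict[str, List[int]] = defaultdict(list)
    for i, row in enumerate(rows):
        by_text[row[text_key].strip().lower()].append(i)
    duplicates = {text: idxs for text, idxs in by_text.items() if len(idxs) > 1}
    conflicts = sorted(
        text
        for text, idxs in duplicates.items()
        if len({rows[i].get(label_key) for i in idxs}) > 1
    )
    return {"missing_labels": missing, "duplicates": duplicates, "conflicts": conflicts}


demo_rows = [
    {"text": "Great product", "label": "pos"},
    {"text": "great product ", "label": "neg"},
    {"text": "Terrible", "label": "neg"},
    {"text": "Okay", "label": None},
    {"text": "Terrible", "label": "neg"},
]
report = integrity_report(demo_rows)
print(report)
```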
Production Labeling Workflow
Active Learning
Prioritize labeling of informative samples:
from modAL.uncertainty import uncertainty_sampling
from modAL.models import ActiveLearner

# Initialize learner
learner = ActiveLearner(
    estimator=classifier,
    query_strategy=uncertainty_sampling,
    X_training=X_initial,
    y_training=y_initial,
)
# Query most uncertain samples
query_idx, query_inst = learner.query(X_pool, n_instances=100)
# Label and teach
y_new = get_labels(query_inst)
learner.teach(query_inst, y_new)
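The idea behind uncertainty sampling can be shown without any library. The sketch below uses least-confidence scoring, one common form of uncertainty sampling: each pool sample is scored as 1 minus its highest predicted class probability, and the highest-scoring samples are sent for labeling first. The probabilities are made up and the function names are illustrative.

```python
from typing import List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def select_most_uncertain(
    pred_probs: Sequence[Sequence[float]], n_instances: int
) -> List[int]:
    """Indices of the n_instances pool samples with the highest uncertainty."""
    scores = [least_confidence(p) for p in pred_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_instances]


pool_probs = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30], [0.50, 0.50]]
to_label = select_most_uncertain(pool_probs, n_instances=2)
print(to_label)
```

The two samples closest to a 50/50 split (indices 3 and 1) are selected, while the confidently classified sample at index 0 is left for later.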
Label Studio
docker run -p 8080:8080 heartexlabs/label-studio
Features:
Rich media support
ML-assisted labeling
Export to many formats
Prodigy
Commercial tool by the spaCy team:
prodigy textcat.teach my_dataset model data.jsonl
Features:
Active learning built-in
Scriptable recipes
Fast annotation UI
Labelbox
Enterprise platform. Features:
Workforce management
Quality assurance
Model-assisted labeling
Analytics dashboard