Quick Start Guide
This guide will help you get started with Apache Iceberg quickly. You’ll learn how to add Iceberg to your project, create your first table, and perform basic operations.
The latest version of Iceberg can be found on the releases page. This guide uses examples compatible with Iceberg 1.0+.
Installation
Add Dependencies
Add Iceberg to your project using Maven or Gradle.
<dependencies>
  <!-- Core Iceberg API -->
  <dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-core</artifactId>
    <version>1.7.1</version>
  </dependency>
  <!-- For Parquet file format -->
  <dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-parquet</artifactId>
    <version>1.7.1</version>
  </dependency>
  <!-- For Hive Metastore catalog -->
  <dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-hive-metastore</artifactId>
    <version>1.7.1</version>
  </dependency>
</dependencies>
Module guide:
iceberg-core - The core API and implementations (required)
iceberg-parquet - For Parquet file format support
iceberg-orc - For ORC file format support
iceberg-hive-metastore - For Hive Metastore catalog
iceberg-data - For direct JVM data access
Choose a Catalog
Iceberg uses catalogs to manage tables. Choose the catalog that fits your environment:
Hadoop Catalog - File-based catalog for HDFS or S3
Hive Metastore - Uses existing Hive Metastore
AWS Glue - For AWS environments
Nessie - Git-like data catalog
REST Catalog - HTTP-based catalog service
Create Your First Table
Follow the examples below to create a table.
Using Spark (Recommended for Getting Started)
Spark is the most feature-rich engine for Iceberg and the easiest way to get started.
Start Spark Shell with Iceberg
Launch Spark with the Iceberg runtime package:
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1
Or for spark-sql, with the Iceberg SQL extensions and a Hadoop catalog named local:
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
Replace 3.5 with your Spark version (e.g., 3.3, 3.4, 3.5) and 2.12 with your Scala version. The SQL examples below use the local catalog, so pass the same --conf options when launching spark-shell.
Create a Table
Create your first Iceberg table using SQL:
CREATE TABLE local.db.users (
  id bigint,
  name string,
  email string,
  created_at timestamp
) USING iceberg;
Or create a partitioned table:
CREATE TABLE local.db.events (
  event_id bigint,
  user_id bigint,
  event_type string,
  event_time timestamp,
  payload string
) USING iceberg
PARTITIONED BY (days(event_time), event_type);
Insert Data
Insert data using standard SQL:
INSERT INTO local.db.users
VALUES
  (1, 'Alice', 'alice@example.com', current_timestamp()),
  (2, 'Bob', 'bob@example.com', current_timestamp()),
  (3, 'Charlie', 'charlie@example.com', current_timestamp());
Or insert from another table:
INSERT INTO local.db.users
SELECT id, name, email, timestamp
FROM source_table
WHERE active = true;
Query Data
Query your Iceberg table:
SELECT * FROM local.db.users WHERE name LIKE 'A%';
Use time travel to query historical data:
-- Query as of a timestamp
SELECT * FROM local.db.users
TIMESTAMP AS OF '2024-01-01 10:00:00';
-- Query a specific snapshot
SELECT * FROM local.db.users
VERSION AS OF 5678901234;
View table history:
SELECT * FROM local.db.users.snapshots;
Update and Merge Data
Iceberg supports row-level updates and merges:
-- Update rows
UPDATE local.db.users
SET email = 'newemail@example.com'
WHERE id = 1;
-- Delete rows
DELETE FROM local.db.users
WHERE created_at < '2023-01-01';
-- Merge (upsert) data
MERGE INTO local.db.users t
USING updates u ON t.id = u.id
WHEN MATCHED THEN
  UPDATE SET t.email = u.email, t.name = u.name
WHEN NOT MATCHED THEN
  INSERT *;
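To make the two MERGE branches concrete, the following is a minimal in-memory sketch of the same matching rules in plain Java. It is not how Iceberg or Spark executes a merge; the MergeSketch class, the User type, and the map-based tables are hypothetical stand-ins chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: mimics the MERGE statement's matching rules on in-memory maps. */
final class MergeSketch {

    /** Hypothetical row type; the map key plays the role of the id column. */
    static final class User {
        final String name;
        final String email;
        final long createdAt;

        User(String name, String email, long createdAt) {
            this.name = name;
            this.email = email;
            this.createdAt = createdAt;
        }
    }

    /** Returns a new table: matched ids get name and email updated, unmatched ids are inserted. */
    static Map<Long, User> merge(Map<Long, User> target, Map<Long, User> updates) {
        Map<Long, User> result = new LinkedHashMap<>(target);
        for (Map.Entry<Long, User> entry : updates.entrySet()) {
            User existing = result.get(entry.getKey());
            User incoming = entry.getValue();
            if (existing != null) {
                // WHEN MATCHED: only email and name change, created_at is kept
                result.put(entry.getKey(), new User(incoming.name, incoming.email, existing.createdAt));
            } else {
                // WHEN NOT MATCHED: the whole source row is inserted
                result.put(entry.getKey(), incoming);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, User> target = new LinkedHashMap<>();
        target.put(1L, new User("Alice", "alice@example.com", 100L));
        Map<Long, User> updates = new LinkedHashMap<>();
        updates.put(1L, new User("Alice", "alice@new.example.com", 999L));
        updates.put(4L, new User("Dana", "dana@example.com", 200L));
        merge(target, updates).forEach((id, u) -> System.out.println(id + " " + u.name + " " + u.email));
    }
}
```

The matched branch keeps created_at from the existing row because the SQL statement only sets email and name.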
Using the Java API
For programmatic access, use the Iceberg Java API.
Initialize a Catalog
Choose and initialize a catalog. For example, with a Hadoop catalog:
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopCatalog;
Configuration conf = new Configuration();
String warehousePath = "hdfs://host:8020/warehouse";
HadoopCatalog catalog = new HadoopCatalog(conf, warehousePath);
Define a Schema
Create a schema for your table:
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.LongType.get()),
    Types.NestedField.required(2, "name", Types.StringType.get()),
    Types.NestedField.optional(3, "email", Types.StringType.get()),
    Types.NestedField.required(4, "created_at",
        Types.TimestampType.withZone())
);
Field IDs must be unique within the schema. When a table is created, Iceberg reassigns the IDs to ensure uniqueness.
Define Partitioning
Create a partition spec:
import org.apache.iceberg.PartitionSpec;
// Unpartitioned table
PartitionSpec spec = PartitionSpec.unpartitioned();
// Or partition by day and identity (pass this spec to createTable instead)
PartitionSpec partitionedSpec = PartitionSpec.builderFor(schema)
    .day("created_at")
    .identity("email")
    .build();
Partition transforms include: identity, bucket[N], truncate[L], year, month, day, and hour.
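As a rough illustration of what these transforms produce, the sketch below computes the time-based and truncate values in plain Java, following the definitions in the Iceberg table spec (time transforms count whole units since 1970-01-01 UTC). The TransformSketch class and its method names are hypothetical and not part of the Iceberg API. The bucket transform is left out because it depends on a specific 32-bit Murmur3 hash.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

/** Illustrative only: re-derives partition values outside of Iceberg. */
final class TransformSketch {

    private static LocalDate utcDate(Instant ts) {
        return ts.atOffset(ZoneOffset.UTC).toLocalDate();
    }

    /** year: whole years since 1970. */
    static int year(Instant ts) {
        return utcDate(ts).getYear() - 1970;
    }

    /** month: whole months since 1970-01. */
    static int month(Instant ts) {
        LocalDate date = utcDate(ts);
        return (date.getYear() - 1970) * 12 + (date.getMonthValue() - 1);
    }

    /** day: whole days since 1970-01-01. */
    static long day(Instant ts) {
        return utcDate(ts).toEpochDay();
    }

    /** hour: whole hours since 1970-01-01T00:00:00Z. */
    static long hour(Instant ts) {
        return Math.floorDiv(ts.getEpochSecond(), 3600L);
    }

    /** truncate[L] on strings: keep at most the first L code points. */
    static String truncate(String value, int width) {
        int codePoints = value.codePointCount(0, value.length());
        return codePoints <= width ? value : value.substring(0, value.offsetByCodePoints(0, width));
    }

    /** truncate[L] on integers: round down to a multiple of L, also for negative values. */
    static long truncate(long value, long width) {
        return value - Math.floorMod(value, width);
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2024-01-01T10:00:00Z");
        System.out.println("day=" + day(ts) + " hour=" + hour(ts));
        System.out.println(truncate("iceberg", 3) + " " + truncate(17L, 10L));
    }
}
```

Because the partition value is derived from the column, queries filter on the original column (for example created_at) and never need to reference the derived value directly.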
Create the Table
Create the table using the catalog:
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
TableIdentifier name = TableIdentifier.of("db", "users");
Table table = catalog.createTable(name, schema, spec);
System.out.println("Created table: " + table.location());
Write Data
Append data files to the table:
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
// Create a data file (simplified example)
DataFile dataFile = DataFiles.builder(spec)
    .withPath("/path/to/data-file.parquet")
    .withFileSizeInBytes(1024)
    .withRecordCount(100)
    .build();
// Append to table
table.newAppend()
    .appendFile(dataFile)
    .commit();
This is a simplified example. In practice, you would write data using a file writer or compute engine like Spark.
Read Data
Scan and read table data:
import org.apache.iceberg.TableScan;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.FileScanTask;
// Create a scan
TableScan scan = table.newScan()
    .filter(Expressions.greaterThan("id", 100))
    .select("id", "name", "email");
// Get the files to read
try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
  for (FileScanTask task : tasks) {
    System.out.println("File: " + task.file().path());
    System.out.println("Records: " + task.file().recordCount());
  }
}
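A scan filter lets Iceberg skip data files whose column statistics show they cannot contain matching rows. The sketch below shows the idea for the id > 100 filter using per-file minimum and maximum values. It is a simplified stand-in, not Iceberg's planner; PruningSketch and FileStats are hypothetical names, and real bounds are stored in table metadata rather than passed in by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: min/max based file skipping for the predicate id > bound. */
final class PruningSketch {

    /** Hypothetical per-file statistics for the id column. */
    static final class FileStats {
        final String path;
        final long minId;
        final long maxId;

        FileStats(String path, long minId, long maxId) {
            this.path = path;
            this.minId = minId;
            this.maxId = maxId;
        }
    }

    /** Keeps only files whose id range could contain a value greater than the bound. */
    static List<FileStats> mayContainIdGreaterThan(List<FileStats> files, long bound) {
        List<FileStats> kept = new ArrayList<>();
        for (FileStats file : files) {
            // If even the largest id in the file is <= bound, no row can match.
            if (file.maxId > bound) {
                kept.add(file);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<FileStats> files = Arrays.asList(
            new FileStats("a.parquet", 1, 50),
            new FileStats("b.parquet", 90, 150),
            new FileStats("c.parquet", 200, 300));
        for (FileStats file : mayContainIdGreaterThan(files, 100)) {
            System.out.println("Would read: " + file.path);
        }
    }
}
```

A kept file may still contain non-matching rows (b.parquet above holds ids from 90 upward), so the engine still applies the filter to the rows it reads.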
Common Operations
Schema Evolution
Modify your table schema without rewriting data:
// Add a new column
table.updateSchema()
    .addColumn("phone", Types.StringType.get())
    .commit();
// Rename a column
table.updateSchema()
    .renameColumn("email", "email_address")
    .commit();
// Update a column type (only compatible promotions, such as int to long, are allowed)
table.updateSchema()
    .updateColumn("id", Types.LongType.get())
    .commit();
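These changes avoid rewriting data because Iceberg tracks columns by field ID rather than by name. The following sketch shows the principle with plain maps: the schema maps names to IDs, stored rows are keyed by ID, and a rename only touches the schema. FieldIdSketch and its methods are hypothetical and greatly simplified compared with Iceberg's actual metadata.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: why renaming a column does not require touching stored data. */
final class FieldIdSketch {

    /** Resolves a column name to its field ID, then reads the value stored under that ID. */
    static Object read(Map<String, Integer> schema, Map<Integer, Object> storedRow, String column) {
        Integer fieldId = schema.get(column);
        return fieldId == null ? null : storedRow.get(fieldId);
    }

    /** Returns a copy of the schema with one column renamed; the field ID is carried over. */
    static Map<String, Integer> rename(Map<String, Integer> schema, String from, String to) {
        Map<String, Integer> renamed = new HashMap<>(schema);
        Integer fieldId = renamed.remove(from);
        if (fieldId == null) {
            throw new IllegalArgumentException("Unknown column: " + from);
        }
        renamed.put(to, fieldId);
        return renamed;
    }

    public static void main(String[] args) {
        Map<String, Integer> schema = new HashMap<>();
        schema.put("id", 1);
        schema.put("email", 3);

        // A row as written before the rename, keyed by field ID.
        Map<Integer, Object> storedRow = new HashMap<>();
        storedRow.put(1, 1L);
        storedRow.put(3, "alice@example.com");

        Map<String, Integer> renamed = rename(schema, "email", "email_address");
        System.out.println(read(renamed, storedRow, "email_address"));
    }
}
```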
Time Travel
Access historical versions of your table:
// Read from a specific snapshot (snapshotId is a long taken from the table's history)
TableScan snapshotScan = table.newScan()
    .useSnapshot(snapshotId);
// Read as of a timestamp
TableScan asOfScan = table.newScan()
    .asOfTime(System.currentTimeMillis() - 3600000); // 1 hour ago
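Reading as of a timestamp amounts to picking the most recent snapshot committed at or before that time. The sketch below models that lookup over a hypothetical in-memory history list; TimeTravelSketch and HistoryEntry are illustrative names, and the exact error behavior of the real API is an assumption here.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: resolving a timestamp to a snapshot ID from the table history. */
final class TimeTravelSketch {

    /** Hypothetical history entry: which snapshot became current, and when. */
    static final class HistoryEntry {
        final long snapshotId;
        final long timestampMillis;

        HistoryEntry(long snapshotId, long timestampMillis) {
            this.snapshotId = snapshotId;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Returns the ID of the latest snapshot committed at or before the given time. */
    static long snapshotIdAsOf(List<HistoryEntry> historyOldestFirst, long timestampMillis) {
        Long match = null;
        for (HistoryEntry entry : historyOldestFirst) {
            if (entry.timestampMillis <= timestampMillis) {
                match = entry.snapshotId; // later qualifying entries replace earlier ones
            }
        }
        if (match == null) {
            throw new IllegalArgumentException("No snapshot at or before " + timestampMillis);
        }
        return match;
    }

    public static void main(String[] args) {
        List<HistoryEntry> history = Arrays.asList(
            new HistoryEntry(10L, 1000L),
            new HistoryEntry(20L, 2000L),
            new HistoryEntry(30L, 3000L));
        System.out.println(snapshotIdAsOf(history, 2500L));
    }
}
```

A timestamp earlier than the first snapshot has nothing to resolve to, which is why the sketch throws in that case.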
Table Maintenance
Keep your tables healthy:
-- Expire old snapshots (remove history)
CALL local.system.expire_snapshots(
  table => 'db.users',
  older_than => TIMESTAMP '2024-01-01 00:00:00'
);
-- Remove orphan files
CALL local.system.remove_orphan_files(
  table => 'db.users'
);
-- Compact small files
CALL local.system.rewrite_data_files(
  table => 'db.users'
);
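To show what an older_than cutoff means in practice, here is a small sketch that selects snapshots to expire from a linear history while always keeping the most recent ones. It assumes a simple oldest-first list of commit timestamps and a retainLast count; ExpireSketch is a hypothetical name, and the real procedure also accounts for branches, tags, and file cleanup, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: choosing snapshots to expire from a linear history. */
final class ExpireSketch {

    /**
     * Returns the commit timestamps of snapshots that are older than the cutoff,
     * never touching the newest retainLast snapshots.
     */
    static List<Long> toExpire(List<Long> timestampsOldestFirst, long olderThanMillis, int retainLast) {
        List<Long> expired = new ArrayList<>();
        // Index of the first snapshot that is protected regardless of age.
        int protectedFrom = Math.max(0, timestampsOldestFirst.size() - retainLast);
        for (int i = 0; i < protectedFrom; i++) {
            long timestamp = timestampsOldestFirst.get(i);
            if (timestamp < olderThanMillis) {
                expired.add(timestamp);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        List<Long> history = Arrays.asList(100L, 200L, 300L, 400L);
        System.out.println(toExpire(history, 350L, 1));
    }
}
```

Expired snapshots are no longer available for time travel, so the cutoff should be chosen with the oldest point you still need to query in mind.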
Next Steps
Now that you’ve created your first Iceberg table, explore more advanced features:
Java API Deep Dive - Learn advanced Java API usage
Partitioning - Master hidden partitioning
Schema Evolution - Safely evolve table schemas
Performance - Optimize table performance
Troubleshooting
ClassNotFoundException or NoClassDefFoundError
Make sure you have all required dependencies:
iceberg-core for the core API
iceberg-parquet or iceberg-orc for file formats
iceberg-hive-metastore for Hive catalog
Hadoop dependencies for HDFS access
For Spark, use the runtime JAR, which bundles Iceberg's dependencies (the artifact name carries both the Spark version, 3.5, and the Scala version, 2.12):
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1
Connection refused to Hive Metastore
Verify:
The catalog name is correct
The database/namespace exists
You have permissions to access the table
The table was created successfully
List tables to debug:
import java.util.List;
import org.apache.iceberg.catalog.Namespace;
List<TableIdentifier> tables = catalog.listTables(Namespace.of("db"));
tables.forEach(System.out::println);