BZIP2 is a single-file compression format built on the Burrows-Wheeler transform. It typically compresses better than GZIP at the cost of speed.

Format Overview

BZIP2 provides:
  • High compression ratio - 10-15% better than GZIP
  • Block-based compression - Independent 100-900 KB blocks
  • Burrows-Wheeler algorithm - Advanced text transformation
  • Error recovery - Block independence aids recovery
  • No file size limit - Unlimited file sizes
  • Open source - Free implementation (libbzip2)
BZIP2 excels at compressing text and source code, often producing output 10-15% smaller than GZIP, at the cost of more CPU time and memory.

Format Structure

From source/CPP/7zip/Archive/Bz2Handler.cpp:107-121, BZIP2 files have this structure:
+--------------------------------+
| File Header (4 bytes)          |
+--------------------------------+
| Block(s)                       |
| - Block Header                 |
| - Compressed Data              |
+--------------------------------+
| Stream End (6-byte marker      |
|   + 4-byte stream CRC)         |
+--------------------------------+

File Header

// Signature check (Bz2Handler.cpp:109-114)
if (p[0] != 'B' || p[1] != 'Z' || p[2] != 'h' || p[3] < '1' || p[3] > '9')
  return k_IsArc_Res_NO;

// Header format:
// Bytes 0-1:  Magic "BZ" (0x42 0x5A)
// Byte 2:     'h' (0x68) - Huffman coding
// Byte 3:     Block size ('1' to '9') = 100KB to 900KB
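
The same signature check is easy to reproduce outside 7-Zip. The following is a minimal Python sketch, not 7-Zip's code: `parse_bz2_header` is a hypothetical helper name, and it assumes the block size digit maps to multiples of 100,000 bytes, as the header comments above describe.

```python
import bz2


def parse_bz2_header(data: bytes) -> int:
    """Return the block size in bytes, or raise ValueError if this is not BZIP2."""
    if len(data) < 4:
        raise ValueError("too short for a BZIP2 header")
    # Bytes 0-2 must be "BZh"; byte 3 must be an ASCII digit '1'..'9'.
    if data[:3] != b"BZh" or not (ord("1") <= data[3] <= ord("9")):
        raise ValueError("not a BZIP2 stream")
    return (data[3] - ord("0")) * 100_000


# Python's bz2 module writes its compresslevel into the fourth header byte.
sample = bz2.compress(b"hello", compresslevel=3)
print(sample[:4], parse_bz2_header(sample))
```

Raising an exception rather than returning a flag is a choice made for the sketch; the handler itself returns `k_IsArc_Res_NO`.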

Block Size

The fourth byte indicates block size:
Byte   Block Size   Memory (Compress)   Memory (Decompress)
'1'    100 KB       ~1 MB               ~400 KB
'2'    200 KB       ~2 MB               ~800 KB
'3'    300 KB       ~3 MB               ~1.2 MB
'9'    900 KB       ~9 MB               ~3.6 MB
Larger block sizes generally provide better compression but require more memory. The default is usually '9' (900 KB blocks).

Compression Algorithm

BZIP2 uses a multi-stage compression pipeline:
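  1. Run-Length Encoding (RLE) - First pass; collapses runs of four or more identical bytes
  2. Burrows-Wheeler Transform (BWT) - Reorders each block so that bytes with similar context end up adjacent
  3. Move-To-Front (MTF) - Turns those clusters of repeated bytes into small integers, mostly zeros
  4. Run-Length Encoding - Second pass; encodes the runs of zeros produced by MTF
  5. Huffman Coding - Final entropy encoding
Each stage reshapes the data so that the next one works better: by the time a block reaches the Huffman coder, it is dominated by a few very frequent symbols.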
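
To make stages 2 and 3 concrete, here is a toy Python sketch. It is an illustration only: it sorts all rotations naively (real encoders use far more efficient sorting and work block by block), it skips both RLE passes and the Huffman stage, and the function names `bwt` and `mtf` are chosen for this example.

```python
def bwt(data: bytes) -> tuple[bytes, int]:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column.

    Returns the transformed bytes and the row index of the original string,
    which a decoder would need in order to invert the transform.
    """
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last_column = bytes(data[(i - 1) % n] for i in order)
    return last_column, order.index(0)


def mtf(data: bytes) -> list[int]:
    """Move-to-front: emit each byte's position in a recency list."""
    recent = list(range(256))
    out = []
    for byte in data:
        index = recent.index(byte)
        out.append(index)
        recent.insert(0, recent.pop(index))  # most recent byte moves to slot 0
    return out


transformed, origin = bwt(b"banana")
print(transformed, origin)   # b'nnbaaa' 3
print(mtf(transformed))      # [110, 0, 99, 99, 0, 0]
```

Even on this tiny input the BWT groups equal letters together, and MTF then turns each repeat into a zero, which is what the second RLE pass and the Huffman coder exploit.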

Usage Examples

Compress Single File

7z a -tbzip2 file.txt.bz2 file.txt

Compress with Specific Block Size

7z a -tbzip2 -mx=1 file.bz2 file.txt  # 100KB blocks (fastest)
7z a -tbzip2 -mx=3 file.bz2 file.txt  # 500KB blocks
7z a -tbzip2 -mx=9 file.bz2 file.txt  # 900KB blocks, extra passes (best)

Decompress File

7z x file.txt.bz2

Create TAR.BZ2 Archive

7z a -ttar archive.tar files/
7z a -tbzip2 -mx=9 archive.tar.bz2 archive.tar
Or using system tar:
tar -cjf archive.tar.bz2 files/

Test Archive Integrity

7z t file.bz2

List Archive Information

7z l -slt file.bz2
Shows compressed and uncompressed sizes.

Handler Implementation

From source/CPP/7zip/Archive/Bz2Handler.cpp:23-47:
Z7_CLASS_IMP_CHandler_IInArchive_3(
  IArchiveOpenSeq,
  IOutArchive,
  ISetProperties
)
  CMyComPtr<IInStream> _stream;
  CMyComPtr<ISequentialInStream> _seqStream;
  
  bool _isArc;
  bool _needSeekToStart;
  bool _dataAfterEnd;
  bool _needMoreInput;

  bool _packSize_Defined;
  bool _unpackSize_Defined;
  bool _numStreams_Defined;
  bool _numBlocks_Defined;

  UInt64 _packSize;
  UInt64 _unpackSize;
  UInt64 _numStreams;
  UInt64 _numBlocks;

Archive Properties

The BZIP2 handler tracks several properties (Bz2Handler.cpp:49-82):

Archive-Level Properties

  • kpidPhySize - Compressed size
  • kpidUnpackSize - Uncompressed size
  • kpidNumStreams - Number of BZIP2 streams
  • kpidNumBlocks - Number of compression blocks
  • kpidErrorFlags - Error indicators

Item Properties

  • kpidPackSize - Compressed size
  • kpidSize - Uncompressed size

Block Independence

BZIP2’s block structure enables:

Parallel Decompression

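Blocks can be decompressed independently, which parallel tools exploit:
# Using pbzip2 (parallel bzip2); it parallelizes decompression only for files that pbzip2 itself created
pbzip2 -d -p4 large_file.bz2  # Use 4 cores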

Error Recovery

If one block is corrupted, the other blocks remain intact and can still be decoded:
# Attempt extraction (-y answers "yes" to all prompts)
7z x -y corrupted.bz2

Seeking

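Because each block decodes on its own, a reader can start at any block boundary without decompressing what precedes it. The format stores no block index and blocks are bit-aligned rather than byte-aligned, so the boundaries must first be located by scanning for the block marker.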
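
Recovery tools and parallel decompressors find block boundaries by searching for the 48-bit block marker `0x314159265359` at every bit position. The Python sketch below shows that scan in its simplest form. `find_block_offsets` is a hypothetical name, the scan is bit-by-bit and therefore slow on large inputs, and a match inside compressed data is possible in principle (though very unlikely), so treat the result as candidates rather than guaranteed boundaries.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit marker that starts every block
MASK48 = (1 << 48) - 1


def find_block_offsets(data: bytes) -> list[int]:
    """Return the bit offsets at which the block marker appears."""
    offsets = []
    window = 0  # rolling view of the last 48 bits read, MSB first
    for byte_index, byte in enumerate(data):
        for bit in range(7, -1, -1):
            window = ((window << 1) | ((byte >> bit) & 1)) & MASK48
            if window == BLOCK_MAGIC:
                bits_read = byte_index * 8 + (8 - bit)
                offsets.append(bits_read - 48)
    return offsets


# 250,000 bytes with no repeated neighbours, compressed with 100 KB blocks,
# so the input has to be split across several blocks.
payload = bytes((31 * i + i // 7) % 256 for i in range(250_000))
compressed = bz2.compress(payload, compresslevel=1)
print(find_block_offsets(compressed))
```

The first offset is 32 because the first block follows the 4-byte file header directly; later blocks usually start in the middle of a byte.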

Compression Performance

Compression Levels

Level    Block Size   Ratio    Speed    Memory
-mx=1    100 KB       Good     Fast     Low
-mx=3    500 KB       Better   Medium   Medium
-mx=5    900 KB       Better   Medium   High
-mx=9    900 KB       Best     Slow     High

Sample Performance (100 MB text file)

Format          Time   Size    Ratio   Relative Speed
GZIP (-mx=9)    12s    28 MB   28%     1.0x (baseline)
BZIP2 (-mx=1)   15s    26 MB   26%     0.8x
BZIP2 (-mx=5)   18s    24 MB   24%     0.67x
BZIP2 (-mx=9)   22s    23 MB   23%     0.55x
XZ (-mx=9)      35s    20 MB   20%     0.34x
BZIP2 provides a good balance between GZIP’s speed and XZ’s compression ratio.
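
Figures like these depend heavily on the data and the machine, so it is worth measuring on your own files. The sketch below uses Python's standard `zlib`, `bz2` and `lzma` modules as stand-ins for the three formats; it does not call 7-Zip, and the levels chosen (including the default XZ preset, picked to keep memory use modest) are assumptions of this example rather than the settings behind the table above.

```python
import bz2
import lzma
import time
import zlib


def compare_codecs(data: bytes) -> dict[str, tuple[int, float]]:
    """Compress data with three codecs; return {name: (size, seconds)}."""
    codecs = {
        "gzip (zlib level 9)": lambda d: zlib.compress(d, 9),
        "bzip2 (level 9)": lambda d: bz2.compress(d, 9),
        "xz (default preset)": lambda d: lzma.compress(d),
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        size = len(compress(data))
        results[name] = (size, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    text = b"The quick brown fox jumps over the lazy dog.\n" * 20_000
    for name, (size, seconds) in compare_codecs(text).items():
        print(f"{name:22} {size:>8} bytes  {seconds:.3f}s")
```

Highly repetitive input like the sample text flatters every codec; a real log file or source tree gives a more useful comparison.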

Advanced Usage

Compress from Standard Input

cat large_file.txt | 7z a -tbzip2 -si output.bz2
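
The same streaming pattern can be sketched in Python with the standard `bz2` module: read fixed-size chunks from any binary stream and feed them to an incremental compressor, so the whole input never has to fit in memory. `compress_stream` and its 64 KB chunk size are choices made for this example.

```python
import bz2
import io
from typing import BinaryIO


def compress_stream(src: BinaryIO, dst: BinaryIO, level: int = 9,
                    chunk_size: int = 64 * 1024) -> int:
    """Copy src to dst as BZIP2 data; return the number of bytes written."""
    compressor = bz2.BZ2Compressor(level)
    written = 0
    while chunk := src.read(chunk_size):
        # compress() may return b"" until a full block has been buffered.
        written += dst.write(compressor.compress(chunk))
    written += dst.write(compressor.flush())  # emit the final block and trailer
    return written


source = io.BytesIO(b"line of log output\n" * 50_000)
target = io.BytesIO()
compress_stream(source, target)
print(len(target.getvalue()), "compressed bytes")
```

To mirror the shell pipeline, pass `sys.stdin.buffer` as `src` and a file opened in `"wb"` mode as `dst`.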

Decompress to Standard Output

7z x -so file.bz2 | less

Multiple Streams

BZIP2 supports concatenated streams:
7z a -tbzip2 file1.bz2 file1.txt
7z a -tbzip2 file2.bz2 file2.txt
cat file1.bz2 file2.bz2 > combined.bz2
7z l combined.bz2  # Shows multiple streams
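
Concatenation can be checked from Python as well: `bz2.decompress` reads multi-stream input and returns the joined output. The sketch below also counts the streams by decoding them one at a time; `count_streams` is a hypothetical helper, not something 7-Zip or the `bz2` module provides.

```python
import bz2


def count_streams(data: bytes) -> int:
    """Count the concatenated BZIP2 streams in data."""
    count = 0
    while data:
        decoder = bz2.BZ2Decompressor()  # one decoder handles one stream
        decoder.decompress(data)
        if not decoder.eof:
            raise ValueError("last stream is truncated")
        count += 1
        data = decoder.unused_data  # whatever follows the stream just decoded
    return count


first = bz2.compress(b"contents of file1\n")
second = bz2.compress(b"contents of file2\n")
combined = first + second

print(count_streams(combined))   # 2
print(bz2.decompress(combined))  # b'contents of file1\ncontents of file2\n'
```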

Recovery Archives

# Salvage intact blocks from a damaged file (bzip2recover ships with bzip2)
bzip2recover damaged.bz2  # writes rec00001damaged.bz2, rec00002damaged.bz2, ...

Implementation Details

BZIP2 compression codec:
source/CPP/7zip/Compress/BZip2Decoder.h
source/CPP/7zip/Compress/BZip2Encoder.h
Archive handler:
source/CPP/7zip/Archive/Bz2Handler.cpp
Key components:
#include "../Compress/BZip2Decoder.h"
#include "../Compress/BZip2Encoder.h"
#include "../Compress/CopyCoder.h"

Comparison with Other Formats

BZIP2 vs GZIP

Advantages:
  • 10-15% better compression
  • Block structure enables recovery
  • Better for text files
Disadvantages:
  • Slower compression (2-3x)
  • Slower decompression (2x)
  • Higher memory usage

BZIP2 vs XZ

Advantages:
  • Faster compression than XZ at its higher levels
  • Lower memory requirements
  • Block independence
Disadvantages:
  • Lower compression ratio (output 15-20% larger)
  • Slower decompression
  • Less efficient for binary data

BZIP2 vs 7z/LZMA

Advantages:
  • Simpler format
  • Wider tool support on Unix-like systems
  • Lower memory requirements
Disadvantages:
  • Lower compression ratio
  • Slower decompression
  • Single file only (needs TAR)
  • No encryption

Best Practices

For Text Files

Use -mx=9 for excellent compression of source code and documents

For Log Archives

Combine with TAR for compressed log archives

For Balance

Use BZIP2 when GZIP is too weak and XZ is too slow

For Recovery

Rely on the block structure: when part of a file is damaged, bzip2recover can usually salvage the undamaged blocks

Common Use Cases

Source Code Distribution

tar -cjf project-1.0.tar.bz2 project-1.0/

Log Compression

find /var/log -name "*.log.1" -exec bzip2 {} \;

Database Dumps

pg_dump mydb | 7z a -tbzip2 -mx=9 -si backup-$(date +%Y%m%d).sql.bz2

System Backup

tar -cjf backup-$(date +%Y%m%d).tar.bz2 /home /etc /var/www

Memory Requirements

Compression

Memory = Block_Size × 8 + overhead
For -mx=9 (900KB blocks):
Memory ≈ 7.2 MB + 1.5 MB = ~9 MB

Decompression

Memory = Block_Size × 4 + overhead
For 900KB blocks:
Memory ≈ 3.6 MB + 0.5 MB = ~4 MB
BZIP2 uses less memory than LZMA/XZ but more than GZIP, making it suitable for resource-constrained environments.
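
The two formulas can be wrapped in a small helper for quick estimates. This Python sketch only restates the arithmetic above: the overhead values (1.5 MB and 0.5 MB) come from the 900 KB worked examples and are assumed here to be supplied by the caller, since actual overhead depends on the implementation and thread count. `estimate_memory_mb` is a hypothetical name.

```python
def estimate_memory_mb(block_kb: int, multiplier: int, overhead_mb: float) -> float:
    """Estimate working memory as block size x multiplier plus fixed overhead."""
    return block_kb * multiplier / 1000 + overhead_mb


# Compression: block x 8 plus roughly 1.5 MB of overhead.
print(round(estimate_memory_mb(900, 8, 1.5), 1))  # 8.7, i.e. about 9 MB
# Decompression: block x 4 plus roughly 0.5 MB of overhead.
print(round(estimate_memory_mb(900, 4, 0.5), 1))  # 4.1, i.e. about 4 MB
```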

Error Handling

From Bz2Handler.cpp:73-81:
case kpidErrorFlags: {
  UInt32 v = 0;
  if (!_isArc) v |= kpv_ErrorFlags_IsNotArc;
  if (_needMoreInput) v |= kpv_ErrorFlags_UnexpectedEnd;
  if (_dataAfterEnd) v |= kpv_ErrorFlags_DataAfterEnd;
  prop = v;
  break;
}
Error detection:
  • IsNotArc - Invalid BZIP2 signature
  • UnexpectedEnd - Truncated file
  • DataAfterEnd - Extra data after stream
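
The three conditions can be approximated with Python's `bz2` module, which is a convenient way to triage a suspect file. This is a rough analogue rather than a port of the handler: the returned strings reuse the flag names only as labels, `DataError` is an extra label invented for this sketch, and any trailing bytes are reported as data after the end, even if they form another valid stream.

```python
import bz2


def classify_bz2(data: bytes) -> set[str]:
    """Return a set of error labels; an empty set means one clean stream."""
    flags: set[str] = set()
    # Same signature rule as the handler: "BZh" followed by '1'..'9'.
    if len(data) < 4 or data[:3] != b"BZh" or not (0x31 <= data[3] <= 0x39):
        flags.add("IsNotArc")
        return flags
    decoder = bz2.BZ2Decompressor()
    try:
        decoder.decompress(data)
    except OSError:
        flags.add("DataError")  # corrupt block or CRC mismatch
        return flags
    if not decoder.eof:
        flags.add("UnexpectedEnd")  # input ran out before the stream end marker
    elif decoder.unused_data:
        flags.add("DataAfterEnd")  # bytes remain after the stream end
    return flags


good = bz2.compress(b"hello world\n" * 100)
print(classify_bz2(good))             # set()
print(classify_bz2(good[:-5]))        # {'UnexpectedEnd'}
print(classify_bz2(good + b"junk"))   # {'DataAfterEnd'}
print(classify_bz2(b"plain text"))    # {'IsNotArc'}
```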

Limitations

BZIP2 limitations:
  • Single file compression (use TAR for multiple files)
  • No encryption support
  • Slower than GZIP
  • Higher memory usage than GZIP
  • No solid compression

Compatibility

Universal Support

BZIP2 is widely supported:
  • Linux/Unix - bzip2, bunzip2 commands
  • macOS - Built-in bzip2
  • Windows - 7-Zip, WinZip, various tools
  • Programming - libbzip2 library in many languages

Archive Tools

  • 7-Zip - Full support
  • bzip2 - Reference implementation
  • pbzip2 - Parallel implementation
  • lbzip2 - Multi-threaded implementation
  • WinZip - Windows support

Performance Optimization

Parallel Compression

Use parallel implementations:
# pbzip2 (parallel bzip2)
pbzip2 -9 -p4 large_file.txt  # 4 cores

# lbzip2 (multi-threaded)
lbzip2 -9 -n4 large_file.txt  # 4 threads
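
The idea behind these tools can be sketched in a few lines of Python: split the input into chunks, compress each chunk as its own BZIP2 stream on a worker thread, and concatenate the results. This is a simplified illustration, not how pbzip2 or lbzip2 are implemented internally; `parallel_compress` and its defaults are assumptions of the example, and it relies on CPython's `bz2` module releasing the GIL while it compresses.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_compress(data: bytes, chunk_size: int = 900_000,
                      workers: int = 4, level: int = 9) -> bytes:
    """Compress chunks concurrently and join them into a multi-stream file."""
    # An empty input still needs one (empty) stream to be a valid file.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the streams stay in sequence.
        streams = list(pool.map(lambda chunk: bz2.compress(chunk, level), chunks))
    return b"".join(streams)


payload = bytes(range(256)) * 10_000
packed = parallel_compress(payload, chunk_size=500_000)
print(len(payload), "->", len(packed), "bytes")
```

The output is an ordinary concatenated-stream file, so any decoder that supports multiple streams can read it. The trade-off is slightly worse compression, because each chunk starts with no shared state and carries its own header and trailer.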

Choosing Block Size

  • Small files (less than 1 MB) - Use -mx=1 or -mx=3
  • Medium files (1-10 MB) - Use -mx=5
  • Large files (greater than 10 MB) - Use -mx=9

Memory-Constrained Systems

# Low memory compression
7z a -tbzip2 -mx=3 file.bz2 large_file.txt
