PostgresIndexer API reference

PostgresIndexer creates and populates BM25-style full-text (sparse) or pgvector (dense) indexes in a PostgreSQL database. It implements the abstract Indexer base class and supports loading data from both JSONL and Parquet files.

from quackir.index import PostgresIndexer

Dense indexing requires the pgvector extension to be installed in your PostgreSQL instance. The embedding column is created as a vector(embedding_dim) type.

Constructor

PostgresIndexer(db_name="quackir", user="postgres")

Opens a psycopg2 connection to the specified PostgreSQL database.

db_name

string

default:"quackir"

Name of the PostgreSQL database to connect to.

user

string

default:"postgres"

PostgreSQL username.

Methods

init_table

indexer.init_table(table_name, index_type, embedding_dim=768)

Drops any existing table with the given name and creates a new one:

Sparse schema: (id TEXT PRIMARY KEY, contents TEXT)
Dense schema: (id TEXT PRIMARY KEY, embedding vector(embedding_dim))
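To make the two schemas concrete, the sketch below builds the kind of SQL that init_table might issue. The helper build_init_sql, the stand-in IndexType enum, and the use of DROP TABLE IF EXISTS are assumptions made for illustration; they are not taken from QuackIR's source.

```python
from enum import Enum


class IndexType(Enum):
    """Stand-in for quackir.IndexType so the sketch runs on its own."""
    SPARSE = "sparse"
    DENSE = "dense"


def build_init_sql(table_name: str, index_type: IndexType, embedding_dim: int = 768) -> list[str]:
    """Return DROP and CREATE statements for the documented schemas (hypothetical helper)."""
    # Quote the identifier and escape any embedded double quotes.
    table = '"' + table_name.replace('"', '""') + '"'
    if index_type is IndexType.SPARSE:
        columns = "id TEXT PRIMARY KEY, contents TEXT"
    else:
        # int() rejects anything that is not a whole-number dimension.
        columns = f"id TEXT PRIMARY KEY, embedding vector({int(embedding_dim)})"
    return [
        f"DROP TABLE IF EXISTS {table}",
        f"CREATE TABLE {table} ({columns})",
    ]


for statement in build_init_sql("dense_corpus", IndexType.DENSE, embedding_dim=768):
    print(statement)
```

Because the existing table is dropped, calling init_table on a populated table discards its rows.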

table_name

string

required

Name of the table to create.

index_type

IndexType

required

IndexType.SPARSE or IndexType.DENSE.

embedding_dim

number

default:"768"

Dimension of the embedding vectors. Only used when index_type is IndexType.DENSE.

load_table

indexer.load_table(table_name, file_path, index_type=None, pretokenized=False)

Dispatches to load_jsonl_table or load_parquet_table based on the file extension. If index_type is None, it is detected via get_index_type.

Parquet loading is currently restricted to IndexType.DENSE. Passing a .parquet file with IndexType.SPARSE raises a ValueError.
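The dispatch rule described above can be sketched roughly as follows. The function choose_loader and the stand-in IndexType enum are hypothetical, and the handling of other file extensions and of upper-case suffixes is an assumption, since this reference does not specify either.

```python
from enum import Enum
from pathlib import Path


class IndexType(Enum):
    """Stand-in for quackir.IndexType so the sketch runs on its own."""
    SPARSE = "sparse"
    DENSE = "dense"


def choose_loader(file_path: str, index_type: IndexType) -> str:
    """Return the name of the loader method a file would be routed to (hypothetical helper)."""
    # Assumption: the suffix comparison is case-insensitive.
    suffix = Path(file_path).suffix.lower()
    if suffix == ".jsonl":
        return "load_jsonl_table"
    if suffix == ".parquet":
        if index_type is not IndexType.DENSE:
            raise ValueError("Parquet loading only supports IndexType.DENSE")
        return "load_parquet_table"
    # Assumption: unsupported extensions are rejected with a ValueError.
    raise ValueError(f"Unsupported file extension: {suffix!r}")


print(choose_loader("corpus.jsonl", IndexType.SPARSE))
print(choose_loader("embeddings.parquet", IndexType.DENSE))
```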

table_name

string

required

Target table name.

file_path

string

required

Path to a .jsonl or .parquet source file.

index_type

IndexType

default:"None"

Override the index type. When None, get_index_type(table_name) is called.

pretokenized

boolean

default:"false"

When False, contents values are tokenized with Pyserini’s Lucene Analyzer before insertion. Set to True when the contents are already tokenized. Applies to sparse indexing only.

load_jsonl_table

indexer.load_jsonl_table(table_name, file_path, index_type, pretokenized=False)

Reads a JSONL file and bulk-inserts rows using psycopg2.extras.execute_values. Each line must be a JSON object with id and either contents (sparse) or vector (dense).
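For reference, input files in the expected shape can be produced with a few lines of standard-library Python. The document IDs, texts, vectors, and file names below are invented sample data, and write_jsonl is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# Sparse records carry "id" and "contents"; dense records carry "id" and "vector".
sparse_records = [
    {"id": "doc1", "contents": "postgres full text search"},
    {"id": "doc2", "contents": "dense retrieval with pgvector"},
]
dense_records = [
    {"id": "doc1", "vector": [0.1, 0.2, 0.3]},
    {"id": "doc2", "vector": [0.4, 0.5, 0.6]},
]


def write_jsonl(path: Path, records: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


out_dir = Path(tempfile.mkdtemp())
write_jsonl(out_dir / "corpus.jsonl", sparse_records)
write_jsonl(out_dir / "embeddings.jsonl", dense_records)
print((out_dir / "corpus.jsonl").read_text(encoding="utf-8"))
```

The length of each vector should match the embedding_dim passed to init_table.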

Null bytes (\x00) in contents fields are replaced with the Unicode replacement character (\uFFFD) before insertion because PostgreSQL does not allow null characters in text columns.
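The replacement amounts to a single string substitution. The function name sanitize_contents is hypothetical; only the substitution itself comes from the description above.

```python
def sanitize_contents(text: str) -> str:
    """Replace NUL characters, which PostgreSQL text columns reject, with U+FFFD."""
    return text.replace("\x00", "\ufffd")


raw = "broken\x00text"
clean = sanitize_contents(raw)
print(repr(clean))
```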

table_name

string

required

Target table name.

file_path

string

required

Path to the .jsonl file.

index_type

IndexType

required

IndexType.SPARSE or IndexType.DENSE.

pretokenized

boolean

default:"false"

Skip tokenization when True. Applies to sparse indexing only.

load_parquet_table

indexer.load_parquet_table(table_name, file_path, index_type, pretokenized=False)

Reads a Parquet file with pandas, formats the vector column as a pgvector-compatible string [f1, f2, ...], and bulk-copies rows using COPY … FROM STDIN WITH CSV.
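The vector formatting and CSV staging steps might look roughly like the sketch below. It uses only the standard library rather than pandas, and the helpers to_pgvector_literal and rows_to_csv_buffer are hypothetical names, not part of QuackIR.

```python
import csv
import io


def to_pgvector_literal(vec) -> str:
    """Format a sequence of numbers as a pgvector text literal such as [0.1, 0.2]."""
    return "[" + ", ".join(str(float(x)) for x in vec) + "]"


def rows_to_csv_buffer(rows) -> io.StringIO:
    """Stage (id, vector) pairs as CSV text in memory (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for doc_id, vec in rows:
        # The literal contains commas, so csv.writer wraps it in double quotes.
        writer.writerow([doc_id, to_pgvector_literal(vec)])
    buf.seek(0)
    return buf


buffer = rows_to_csv_buffer([("d1", [1, 2.5]), ("d2", [0.25, 0.5])])
print(buffer.getvalue())
```

A buffer like this could then be passed to psycopg2's cursor.copy_expert together with a COPY … FROM STDIN WITH CSV statement; whether QuackIR uses copy_expert specifically is an assumption here.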

table_name

string

required

Target table name.

file_path

string

required

Path to the .parquet file.

index_type

IndexType

required

Must be IndexType.DENSE.

pretokenized

boolean

default:"false"

Unused; present for interface consistency.

fts_index

indexer.fts_index(table_name="corpus")

Creates a GIN index on the contents column using to_tsvector('simple', contents), enabling efficient full-text search with to_tsquery.

CREATE INDEX "corpus_contents_gin" ON "corpus"
    USING gin(to_tsvector('simple', contents));
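As an illustration of how the statement above generalizes to other table names, the sketch below assembles it in Python. The naming pattern <table>_contents_gin is inferred from the single example shown, and quote_ident and build_fts_index_sql are hypothetical helpers.

```python
def quote_ident(name: str) -> str:
    """Double-quote a PostgreSQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_fts_index_sql(table_name: str = "corpus") -> str:
    """Build the CREATE INDEX statement for a sparse table (hypothetical helper)."""
    # Assumption: the index is named after the table, as in the "corpus" example.
    index_name = quote_ident(f"{table_name}_contents_gin")
    return (
        f"CREATE INDEX {index_name} ON {quote_ident(table_name)} "
        "USING gin(to_tsvector('simple', contents));"
    )


print(build_fts_index_sql("corpus"))
```

Because this is an expression index, PostgreSQL can only use it for queries that apply the same to_tsvector('simple', contents) expression, including the explicit 'simple' configuration.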

table_name

string

default:"corpus"

Name of the sparse table to index.

get_index_type

indexer.get_index_type(table_name)

Queries information_schema.columns to detect index type from column names.

table_name

string

required

Table to inspect.

return

IndexType

IndexType.SPARSE if a contents column exists; IndexType.DENSE if an embedding column exists.

Raises ValueError if neither column is found.
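The detection rule can be sketched as a pure function over the column names returned by the catalog query. The query text, the helper detect_index_type, the stand-in IndexType enum, and the precedence given to contents when both columns exist are all assumptions for illustration.

```python
from enum import Enum


class IndexType(Enum):
    """Stand-in for quackir.IndexType so the sketch runs on its own."""
    SPARSE = "sparse"
    DENSE = "dense"


# Assumed shape of the lookup; the exact query QuackIR runs is not shown in this reference.
COLUMNS_QUERY = "SELECT column_name FROM information_schema.columns WHERE table_name = %s"


def detect_index_type(column_names) -> IndexType:
    """Map a table's column names to an index type (hypothetical helper)."""
    columns = set(column_names)
    if "contents" in columns:
        return IndexType.SPARSE
    if "embedding" in columns:
        return IndexType.DENSE
    raise ValueError("Table has neither a contents nor an embedding column")


print(detect_index_type(["id", "contents"]))
print(detect_index_type(["id", "embedding"]))
```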

get_num_rows

indexer.get_num_rows(table_name)

Returns the number of rows in the given table.

table_name

string

required

Table to count.

return

number

Row count as an integer.

close

indexer.close()

Closes the underlying psycopg2 connection.

Examples

Sparse (BM25) indexing

from quackir import IndexType
from quackir.index import PostgresIndexer

indexer = PostgresIndexer(db_name="mydb", user="myuser")

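# Drop and recreate the table with the sparse schema (id, contents).
indexer.init_table("corpus", IndexType.SPARSE)
# index_type is omitted, so it is detected from the table; contents are tokenized on load.
indexer.load_table("corpus", "corpus.jsonl")
# Build the GIN full-text index over the contents column.
indexer.fts_index("corpus")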

print(indexer.get_num_rows("corpus"), "documents indexed")
indexer.close()

Dense (vector) indexing from Parquet

from quackir import IndexType
from quackir.index import PostgresIndexer

indexer = PostgresIndexer(db_name="mydb", user="myuser")

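# Drop and recreate the table with the dense schema (id, embedding vector(768)).
indexer.init_table("dense_corpus", IndexType.DENSE, embedding_dim=768)
# The .parquet extension routes to load_parquet_table; the index type is detected from the table.
indexer.load_table("dense_corpus", "embeddings.parquet")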

print(indexer.get_num_rows("dense_corpus"), "vectors indexed")
indexer.close()
