Once OpenKnowledgeStream is running, every Wikipedia change event flowing through the recent_change_stream Kafka topic is indexed into an OpenSearch index named wiki-changes. OpenSearch exposes a powerful REST API and Query DSL that let you search, filter, aggregate, and analyze that data in real time. This guide covers the structure of the indexed documents and the most useful queries to get started.

Document structure

OpensearchIndexer indexes each Change object directly, using the page title as the document ID (id(change.getTitle())). The Change model (in wiki-common) is a flat, four-field class:
| Field | JSON key | Type | Description |
| --- | --- | --- | --- |
| type | type | string | Change type: edit, new, or log |
| title | title | string | Page title as it appears on Wikipedia |
| pageId | pageid | number | Wikipedia’s numeric page identifier |
| tags | tags | string[] | Editor-supplied tags, e.g. "mobile edit" |
A typical document in the index looks like this:
{
  "type": "edit",
  "title": "Albert Einstein",
  "pageid": 736,
  "tags": ["mobile edit", "mobile web edit"]
}
Because the document ID is set to the page title, indexing the same title a second time upserts (overwrites) the existing document rather than creating a duplicate. The wiki-changes index therefore holds at most one document per Wikipedia page title — always reflecting the most recently indexed change for that page.
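The overwrite behavior can be pictured with a small in-memory model. This is only a sketch in Python: a plain dict stands in for the index, index_changes is a hypothetical helper (not part of OpenKnowledgeStream or any OpenSearch client), and the "Example page" entry is a made-up sample.

```python
# In-memory stand-in for an index whose document ID is the page title.
# index_changes is a hypothetical helper for illustration only.
def index_changes(changes):
    index = {}
    for change in changes:
        # Same title -> same ID -> the earlier document is replaced.
        index[change["title"]] = change
    return index


changes = [
    {"type": "new", "title": "Albert Einstein", "pageid": 736, "tags": []},
    {"type": "edit", "title": "Albert Einstein", "pageid": 736, "tags": ["mobile edit"]},
    {"type": "edit", "title": "Example page", "pageid": 1, "tags": []},  # made-up sample
]

index = index_changes(changes)
print(len(index))                        # 2: one document per title
print(index["Albert Einstein"]["type"])  # edit: the later change wins
```

One consequence is that the document count of wiki-changes reflects the number of distinct page titles seen, not the number of change events consumed.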

Common queries

Check index health and document count

GET /wiki-changes/_count
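_count reports the number of documents and a shard summary, not the health status; GET /_cluster/health/wiki-changes returns the index's green, yellow, or red status. Sample _count response: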
{
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "count": 1503
}

Get the most recently indexed documents

GET /wiki-changes/_search?size=10&sort=_id:desc
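Despite the heading, this request does not sort by recency: the documents carry no timestamp field, so it returns the 10 documents whose IDs (page titles) come last in sort order. For true recency ordering, add an ingestion timestamp field to each document and sort on that instead.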
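A minimal sketch of the timestamp approach, in Python. The field name indexed_at and both helper functions are hypothetical: the Change model has no such field, and in the real pipeline the indexer would have to add it before indexing. The sort also assumes the field ends up mapped as a date.

```python
from datetime import datetime, timezone

# "indexed_at" is a hypothetical field name; the Change model does not define it.
TIMESTAMP_FIELD = "indexed_at"


def with_ingestion_timestamp(change, now=None):
    """Return a copy of the change document with an ISO 8601 UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    doc = dict(change)
    doc[TIMESTAMP_FIELD] = now.isoformat()
    return doc


def latest_changes_query(size=10):
    """Search body that returns the most recently stamped documents first."""
    return {"size": size, "sort": [{TIMESTAMP_FIELD: {"order": "desc"}}]}


doc = with_ingestion_timestamp(
    {"type": "edit", "title": "Albert Einstein", "pageid": 736, "tags": []},
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(doc["indexed_at"])       # 2024-01-01T00:00:00+00:00
print(latest_changes_query())  # body for POST /wiki-changes/_search
```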

Search by title keyword

GET /wiki-changes/_search
{
  "query": {
    "match": {
      "title": "Albert Einstein"
    }
  }
}
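match performs full-text analysis: it tokenizes the query string and scores results by relevance. By default the resulting terms are combined with OR, so a title containing only "Albert" or only "Einstein" also matches, with a lower score. Use match_phrase to require the exact phrase in order.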
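The difference between the two query types is only the clause name in the request body. A small Python sketch that builds either body; title_query is a hypothetical helper, not part of any client library.

```python
import json


def title_query(text, exact_phrase=False):
    """Build a search body for the title field.

    match: any analyzed term may match.
    match_phrase: all terms must appear adjacent and in order.
    """
    kind = "match_phrase" if exact_phrase else "match"
    return {"query": {kind: {"title": text}}}


print(json.dumps(title_query("Albert Einstein"), indent=2))
print(json.dumps(title_query("Albert Einstein", exact_phrase=True), indent=2))
```

Either body can be sent as the payload of GET /wiki-changes/_search.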

Filter by change type

GET /wiki-changes/_search
{
  "query": {
    "term": {
      "type": "new"
    }
  }
}
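Valid values for type are edit (an existing page was modified), new (a page was created), and log (an administrative log entry). term compares the value against indexed tokens without analyzing the query string. Under the dynamic mapping shown in "Inspect the index mapping", type is a text field with a keyword sub-field; term on type works here only because these values are single lowercase words that the analyzer leaves unchanged, so type.keyword is the more robust target.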

Filter by tag

GET /wiki-changes/_search
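{
  "query": {
    "terms": {
      "tags.keyword": ["mobile edit", "mobile web edit"]
    }
  }
}
terms is the multi-value equivalent of term: it returns documents where the tags array contains any of the provided values. The query targets tags.keyword because, under the default dynamic mapping, tags is an analyzed text field: "mobile edit" is indexed as the separate tokens mobile and edit, so an exact-value lookup against tags itself would not match.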

Combine filters — new pages tagged as mobile edits

GET /wiki-changes/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "type": "new" } }
      ],
      "filter": [
        { "terms": { "tags.keyword": ["mobile edit"] } }
      ]
    }
  }
}
Queries inside filter are not scored, which makes them faster and cacheable. Prefer filter over must for exact-match criteria that don’t affect relevance ranking; the term clause on type above could equally be moved into filter.
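As a sketch of that advice, the following Python helper places every exact-match clause in filter and targets the .keyword sub-fields. filtered_changes_query is a hypothetical name, and the .keyword sub-fields are assumed to exist, as they do under the default dynamic mapping.

```python
def filtered_changes_query(change_type=None, tags=None):
    """Build a non-scoring search body from optional exact-match criteria."""
    filters = []
    if change_type is not None:
        filters.append({"term": {"type.keyword": change_type}})
    if tags:
        # Matches documents carrying at least one of the given tags.
        filters.append({"terms": {"tags.keyword": list(tags)}})
    if not filters:
        return {"query": {"match_all": {}}}
    return {"query": {"bool": {"filter": filters}}}


print(filtered_changes_query("new", ["mobile edit"]))
```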

Inspect the index mapping

OpenSearch infers the mapping from the first documents it receives. To see what was auto-detected:
GET /wiki-changes/_mapping
Sample response showing the inferred types:
{
  "wiki-changes": {
    "mappings": {
      "properties": {
        "pageid": {
          "type": "long"
        },
        "tags": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "type": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
For term and terms queries on type, title, or tags, target the .keyword sub-field so the value is compared against the whole, unanalyzed string rather than against individual tokens:
{
  "query": {
    "term": {
      "type.keyword": "edit"
    }
  }
}
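The same .keyword sub-fields are what a terms aggregation would need, since bucketing on an analyzed text field is not enabled by default. A rough Python sketch of a change-type count follows; both helper names are hypothetical, and the response fragment is hand-written in the shape of a terms aggregation result with invented counts.

```python
def terms_aggregation(field, name, size=10):
    """Search body that counts documents per distinct value of a keyword field."""
    return {
        "size": 0,  # skip the hits; only the buckets are of interest
        "aggs": {name: {"terms": {"field": field, "size": size}}},
    }


def bucket_counts(response, name):
    """Flatten the buckets of a terms aggregation into {key: doc_count}."""
    buckets = response["aggregations"][name]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}


body = terms_aggregation("type.keyword", "by_type")

# Hand-written fragment for illustration; the counts are invented.
sample_response = {
    "aggregations": {
        "by_type": {
            "buckets": [
                {"key": "edit", "doc_count": 3},
                {"key": "new", "doc_count": 1},
            ]
        }
    }
}
print(bucket_counts(sample_response, "by_type"))  # {'edit': 3, 'new': 1}
```

Swapping the field for tags.keyword gives a tag frequency count with the same code.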

The examples above cover the most common access patterns. For aggregations (e.g., a histogram of change types, top-edited pages, or tag frequency counts), pagination with search_after, and custom index mappings, refer to the OpenSearch Query DSL documentation.
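For the search_after pagination mentioned above, a minimal Python sketch follows. It assumes sorting on title.keyword, which gives a unique sort key because the index holds one document per title. page_query and next_cursor are hypothetical helpers, and the response fragment is hand-written sample data.

```python
def page_query(size, after=None):
    """Search body for one page, sorted by title; 'after' is the previous page's cursor."""
    body = {"size": size, "sort": [{"title.keyword": "asc"}]}
    if after is not None:
        body["search_after"] = list(after)
    return body


def next_cursor(response):
    """Sort values of the last hit, or None when the page is empty."""
    hits = response["hits"]["hits"]
    return hits[-1]["sort"] if hits else None


first_page = page_query(2)

# Hand-written fragment for illustration, shaped like a sorted search response.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "Albert Einstein", "sort": ["Albert Einstein"]},
            {"_id": "Example page", "sort": ["Example page"]},
        ]
    }
}
second_page = page_query(2, after=next_cursor(sample_response))
print(second_page)
```

Paging stops when a response comes back with no hits, at which point next_cursor returns None.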
