The `in_memory_similarity_search` module computes cosine similarity between a query string and a list of candidate strings. It is suited for small to medium candidate sets that fit comfortably in memory.
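For reference, cosine similarity between two vectors is their dot product divided by the product of their Euclidean norms. The block below is a minimal pure-Python sketch of that formula. The function name `cosine_similarity` is illustrative and is not taken from the module, which relies on NumPy.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
same_direction = cosine_similarity([2.0, 0.0], [5.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

This sketch does not guard against zero-length vectors; the Edge cases section describes how the module treats them.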
Installation
NumPy is a required dependency for this module. The `[np]` extra installs it automatically. If NumPy is absent, an `ImportWarning` is raised at import time and the module will not be usable.
Quick start
find_most_similar
Returns the single best-matching candidate.
- `embeddings`: An object that implements the `Embeddings` protocol. It must expose an `embed_documents(texts)` method.
- `input_text`: The query text to compare against the candidates.
- `*args`: Positional candidate strings. Pass them as individual arguments: `find_most_similar(emb, query, cand1, cand2, cand3)`.
- Logging flag: When `True`, logs the similarity score for each candidate at INFO level.

Returns `tuple[int, str]`: the zero-based index and the text of the most similar candidate.
find_most_similar_with_scores
Returns all candidates as `((index, text), score)` tuples, sorted by similarity score in descending order.
- `embeddings`: An object implementing the `Embeddings` protocol.
- `input_text`: The query text.
- `*args`: Candidate strings to rank.
- Logging flag: Log ranked results at INFO level.

Returns `list[tuple[tuple[int, str], float]]`: all candidates with scores, highest similarity first. When `*args` is empty, returns `[((-1, input_text), 0.0)]`.
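To make the return shape concrete, here is a self-contained sketch of a ranking function that produces the same structure. It is not the module's implementation: `rank_by_similarity` and `ToyEmbeddings` are hypothetical names, the scoring is done in pure Python rather than NumPy, and embedding the query together with the candidates in one call is an assumption.

```python
import math


def rank_by_similarity(embeddings, /, input_text, *candidates):
    """Return ((index, text), score) tuples, highest score first."""
    if not candidates:
        # Same value the documentation gives for an empty candidate list.
        return [((-1, input_text), 0.0)]
    # Assumption: query and candidates are embedded in a single batch.
    query_vec, *cand_vecs = embeddings.embed_documents([input_text, *candidates])
    scored = []
    for index, (text, vec) in enumerate(zip(candidates, cand_vecs)):
        dot = sum(q * c for q, c in zip(query_vec, vec))
        norm = math.hypot(*query_vec) * math.hypot(*vec)
        score = dot / norm if norm else 0.0
        scored.append(((index, text), score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


class ToyEmbeddings:
    """Hypothetical embedder: counts of 'a' and 'b' in each text."""

    def embed_documents(self, texts):
        return [[float(t.count("a")), float(t.count("b"))] for t in texts]


ranked = rank_by_similarity(ToyEmbeddings(), "aab", "bbb", "aa", "ab")
best_index, best_text = ranked[0][0]  # the (index, text) pair of the top match
```

The first element of the ranked list carries the same `(index, text)` pair that a best-match lookup would return.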
Embeddings protocol
Any object with an `embed_documents` method satisfies the protocol.
The `embeddings` parameter is positional-only (note the `/` in the signature), so it must always be passed as the first positional argument.
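A structural protocol of this kind can be expressed with `typing.Protocol`. The sketch below is illustrative only: `EmbeddingsLike` and `LengthEmbeddings` are hypothetical names, and the type hints (a sequence of strings in, one list of floats per text out) are an assumption about the expected shapes rather than the module's actual definition.

```python
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class EmbeddingsLike(Protocol):
    """Anything exposing embed_documents(texts) matches structurally."""

    def embed_documents(self, texts: Sequence[str]) -> List[List[float]]:
        ...


class LengthEmbeddings:
    """Hypothetical embedder; note that it does not inherit from the protocol."""

    def embed_documents(self, texts: Sequence[str]) -> List[List[float]]:
        # One two-dimensional vector per text: its length and a constant.
        return [[float(len(text)), 1.0] for text in texts]


# runtime_checkable only verifies that the method exists, not its signature.
conforms = isinstance(LengthEmbeddings(), EmbeddingsLike)
rejected = isinstance(object(), EmbeddingsLike)
```

No inheritance is required: matching the method name is enough for the object to be accepted.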
SimpleEmbeddings
SimpleEmbeddings is a minimal, self-contained embedding implementation suitable
for unit tests and learning exercises.
Dimension of each output vector. The default matches the dimension of OpenAI's `text-embedding-ada-002` model (1536) for compatibility in tests.
`SimpleEmbeddings` maps each character in a text to a fixed-size vector by hashing the character to an index and incrementing that position. This produces character-frequency vectors rather than semantic embeddings.
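The character-hashing idea can be sketched as follows. This is not the `SimpleEmbeddings` source: the class name `CharCountEmbeddings`, the `dims` parameter name, and the use of `ord(ch) % dims` as the hash are all assumptions chosen to keep the example deterministic.

```python
class CharCountEmbeddings:
    """Illustrative character-frequency embedder (not the library class)."""

    def __init__(self, dims: int = 1536):
        self.dims = dims

    def embed_documents(self, texts):
        vectors = []
        for text in texts:
            vec = [0.0] * self.dims
            for ch in text:
                # Assumed hash: the code point modulo the vector size.
                vec[ord(ch) % self.dims] += 1.0
            vectors.append(vec)
        return vectors


emb = CharCountEmbeddings(dims=8)
listen, silent = emb.embed_documents(["listen", "silent"])
# Anagrams share character counts, so their vectors are identical.
anagrams_match = listen == silent
```

Because only character counts are recorded, word order and meaning are lost, which is why such vectors are useful for tests but not for semantic search.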
Using a production embedding backend
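Since the search functions only require an `embed_documents` method, a production backend whose client already exposes that method can be passed in directly, and any other backend can be wrapped in a thin adapter. The following is a minimal sketch of such an adapter. `CallableEmbeddings` is a hypothetical class, and `embed_batch` is a stand-in for a real client call, not an API of any particular provider.

```python
class CallableEmbeddings:
    """Adapts any batch-embedding callable to the embed_documents interface."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn

    def embed_documents(self, texts):
        # Normalise the backend output to plain lists of floats.
        return [[float(x) for x in vec] for vec in self._embed_fn(list(texts))]


def embed_batch(texts):
    """Hypothetical stand-in for a network call to an embedding service."""
    return [[len(text), text.count(" ") + 1] for text in texts]


backend = CallableEmbeddings(embed_batch)
vectors = backend.embed_documents(["hello world", "hi"])
```

The adapter object can then be supplied wherever an `Embeddings`-compatible object is expected.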
Edge cases
| Situation | Behaviour |
|---|---|
| No candidates (`*args` is empty) | Returns `(-1, input_text)` from `find_most_similar`; `[((-1, input_text), 0.0)]` from `find_most_similar_with_scores`. |
| All-zero embedding vectors | Cosine similarity is 0.0 (division-by-zero is caught and set to 0.0). |
| NaN or Inf in similarity matrix | Replaced with 0.0 automatically. |
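The zero-vector and non-finite rows of the table can be illustrated with a guarded cosine function. This is a pure-Python sketch of the documented outcomes, not the module's NumPy code, and `safe_cosine` is a hypothetical name.

```python
import math


def safe_cosine(a, b):
    """Cosine similarity that falls back to 0.0 instead of failing."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        # An all-zero vector would otherwise cause a division by zero.
        return 0.0
    score = sum(x * y for x, y in zip(a, b)) / norm
    # NaN or Inf results are replaced with 0.0.
    return score if math.isfinite(score) else 0.0


zero_case = safe_cosine([0.0, 0.0], [1.0, 2.0])
inf_case = safe_cosine([float("inf"), 1.0], [1.0, 1.0])
nan_case = safe_cosine([float("nan"), 1.0], [1.0, 1.0])
```

All three degenerate inputs yield `0.0`, so such candidates sort to the bottom of a ranking rather than raising an error.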
