Embeddings & Search

How vector embeddings and semantic search work in the Knowledge Base.

What Are Embeddings?

An embedding is a numerical vector representation of text that captures its semantic meaning. Texts with similar meanings produce vectors that are close together in embedding space. YokeBot uses embeddings to power semantic search — agents can find relevant documents even when the exact words differ.
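The "close together in embedding space" idea can be shown with a toy example. The vectors below are made up for illustration (real Qwen3 embeddings have 1024 dimensions and come from the model, not by hand):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: ~1.0 means same direction (similar meaning), ~0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings (illustrative values only).
refund_policy = [0.9, 0.1, 0.0]
money_back    = [0.8, 0.2, 0.1]   # similar meaning, different words
gpu_setup     = [0.0, 0.1, 0.9]   # unrelated topic

print(cosine_similarity(refund_policy, money_back))  # high (~0.98)
print(cosine_similarity(refund_policy, gpu_setup))   # low  (~0.01)
```

Even though "refund policy" and "money back" share no words, their vectors point in nearly the same direction, which is exactly what semantic search exploits.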

Embedding Model

YokeBot uses the Qwen3 embedding model to generate vectors. This model produces high-quality embeddings optimized for retrieval tasks across multiple languages.

| Property         | Value           |
| ---------------- | --------------- |
| Model            | Qwen3 Embedding |
| Dimensions       | 1024            |
| Max Input Tokens | 8192            |
| Provider         | Configurable    |

How Search Works

When an agent queries the Knowledge Base, the following steps occur:

  1. The query text is converted to an embedding vector using the same Qwen3 model.
  2. The vector is compared against all stored chunk embeddings using cosine similarity.
  3. The top-K most similar chunks are returned (K is configurable, default 5).
  4. The chunks' original text is injected into the agent's LLM context.
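The steps above can be sketched as follows. This is a minimal illustration, not YokeBot's actual implementation: the toy 3-dimensional vectors stand in for real Qwen3 output, and step 1 (embedding the query) is faked with a hard-coded vector:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def search(query_vector, chunk_index, top_k=5):
    """Steps 2-3: score every stored chunk, return the top-K by similarity."""
    scored = [
        (cosine_similarity(query_vector, vec), text)
        for text, vec in chunk_index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy index mapping chunk text to its embedding (illustrative values).
chunk_index = {
    "Refunds are issued within 14 days.": [0.9, 0.1, 0.0],
    "GPUs require driver version 550+.":  [0.0, 0.1, 0.9],
    "Money-back requests go to support.": [0.8, 0.2, 0.1],
}

# Step 1 would embed the query with the same Qwen3 model; we fake it here.
query_vector = [0.85, 0.15, 0.05]
for score, text in search(query_vector, chunk_index, top_k=2):
    print(f"{score:.2f}  {text}")
```

Step 4 then simply concatenates the returned chunk texts into the agent's prompt.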

Tuning Search Quality

You can adjust search behavior with these parameters:

| Parameter              | Default | Description                                                                              |
| ---------------------- | ------- | ---------------------------------------------------------------------------------------- |
| `top_k`                | 5       | Number of chunks to retrieve per query.                                                   |
| `similarity_threshold` | 0.7     | Minimum similarity score (0–1) for a chunk to be included.                                |
| `chunk_size`           | 500     | Approximate chunk size in tokens. Smaller chunks give more precise matches but less context. |
| `chunk_overlap`        | 50      | Overlap in tokens between adjacent chunks, to preserve context across chunk boundaries.   |
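To see how these parameters interact, here is a rough sketch of overlap-aware chunking and threshold filtering. It operates on a plain token list and is only illustrative; YokeBot's actual chunker (tokenizer, boundary handling) may differ:

```python
def chunk_tokens(tokens, chunk_size=500, chunk_overlap=50):
    """Split a token list into overlapping chunks; each new chunk starts
    chunk_size - chunk_overlap tokens after the previous one."""
    stride = chunk_size - chunk_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), stride)]

def apply_threshold(scored, similarity_threshold=0.7):
    """Drop (score, text) pairs below the minimum similarity score."""
    return [(s, t) for s, t in scored if s >= similarity_threshold]

# A 1200-token document with defaults yields chunks starting at 0, 450, 900.
tokens = list(range(1200))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [500, 500, 300]
```

Note that raising `chunk_overlap` increases storage and embedding cost, since more tokens are embedded twice.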

Hybrid Search

For best results, YokeBot combines vector similarity search with keyword matching. If the agent's query contains specific names, codes, or identifiers that may not be captured well by embeddings alone, keyword search ensures those documents are still surfaced.
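One simple way to combine the two signals is a weighted blend of scores. The blending formula, the `alpha` weight, and the naive keyword scorer below are illustrative assumptions, not YokeBot's actual weighting:

```python
def keyword_score(query, text):
    """Fraction of query terms that appear verbatim in the chunk."""
    terms = query.lower().split()
    return sum(term in text.lower() for term in terms) / len(terms)

def hybrid_score(vector_score, kw_score, alpha=0.5):
    """Blend vector similarity with keyword overlap; alpha weights the two."""
    return alpha * vector_score + (1 - alpha) * kw_score

# An exact identifier like "ERR-4102" can score poorly on embeddings alone,
# but the keyword signal pulls the right chunk back to the top.
chunks = {
    "Error ERR-4102 means the index is corrupt.": 0.40,  # weak vector match
    "General troubleshooting guide.":             0.75,  # strong vector match
}
query = "what is ERR-4102"
for text, vec_score in chunks.items():
    print(f"{hybrid_score(vec_score, keyword_score(query, text)):.2f}  {text}")
```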

Performance Considerations

Embedding generation happens once per document upload and is the most compute-intensive step. Queries are fast even for large knowledge bases because vector search is optimized with approximate nearest neighbor (ANN) indexing.
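To give a flavor of why ANN indexing makes queries fast, here is a minimal sketch of one ANN technique, random-hyperplane locality-sensitive hashing: vectors are bucketed by which side of a few random hyperplanes they fall on, so a query only scans its own bucket instead of the whole index. This is just one ANN approach for illustration; the source does not specify which index YokeBot uses:

```python
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def signature(vec, hyperplanes):
    """Hash a vector to a bit-tuple: which side of each random hyperplane it lies on."""
    return tuple(sum(v * h for v, h in zip(vec, plane)) >= 0 for plane in hyperplanes)

random.seed(0)
dim, n_planes = 8, 4  # 4 planes -> up to 2**4 = 16 buckets
hyperplanes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

# Bucket 1000 stored vectors by signature (done once, at index-build time).
vectors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(1000)]
buckets = {}
for vec in vectors:
    buckets.setdefault(signature(vec, hyperplanes), []).append(vec)

# At query time, only the matching bucket is scanned exhaustively.
query = vectors[42]
candidates = buckets[signature(query, hyperplanes)]
best = max(candidates, key=lambda v: cosine_similarity(query, v))
print(len(candidates), "candidates scanned instead of", len(vectors))
```

The trade-off is the usual ANN one: a vector whose true nearest neighbor landed in a different bucket can be missed, which is why the search is "approximate".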

Tip: If you notice slow embedding on self-hosted instances, consider using a GPU-enabled machine or offloading embedding to a hosted provider.