---
name: search-and-embeddings
description: Model search and semantic retrieval for tibi website projects. Covers embedding provider configuration, collection search modes, auto-regeneration, regenerate-search admin flows, and how later agents should decide between no search, classic search, ngram search, and vector search.
---

# search-and-embeddings

## When to use this skill

Use this skill when:

- a project needs explicit search behavior beyond generic CRUD filtering
- search should be typo-tolerant, weighted, or semantic
- embedding providers must be configured
- later agents need a clear yes/no decision for search instead of vague optionality

## Goal

Give later agents a practical workflow for deciding whether search is needed and, if yes, which search mode belongs to the project.

This skill is separate from editor AI features. Search and embeddings affect content retrieval, operational setup, and index/regeneration behavior, not just editor assistance.

## Source of truth

Use these sources when implementing or reviewing search behavior:

- `tibi-server/docs/02-configuration.md`
- `tibi-server/docs/04-collections.md`
- `tibi-server/docs/09-llm-integration.md`
- `.agents/skills/nova-ai-editor-features/SKILL.md`
- `.agents/skills/mongodb-and-indexes/SKILL.md`

## First decision: no search vs explicit search

Do not leave search in an implied state.

Make one explicit decision:

- no search in this project
- classic keyword search only
- fuzzy substring search (`ngram`)
- semantic/vector search
- hybrid search with deliberate ranking behavior

If the answer is “not used”, document that clearly so later agents do not accidentally wire up providers or regress into half-configured search.

## Server-level provider setup

Embedding providers are configured server-side:

```yaml
embedding:
  providers:
    - name: bge-m3
      type: native
      modelPath: /models/bge-m3
      dimensions: 1024
    - name: openai-embed
      type: openai
      model: text-embedding-3-small
      apiKey: ${EMBEDDING_OPENAI-EMBED_APIKEY}
      baseURL: https://api.openai.com/v1
      dimensions: 1536
```

Important:

- collection search config references the provider by name
- embedding secrets and model paths can come from environment variables
- vector search is not only a collection concern; the server must actually provide the embedding backend

## Collection search modes

Tibi supports multiple search modes via collection `search:` config:

- `text`
- `regex`
- `eval`
- `filter`
- `ngram`
- `vector`
- `combined`

Use explicit search configs when search is a real product feature. Auto-fallback is useful, but it is not a substitute for a deliberate retrieval model.

## Choosing the right mode

### `text`

Use when:

- MongoDB text indexing is sufficient
- exact field ownership of the text index is clear
- keyword search is enough

Requires a MongoDB text index, either a wildcard index (`{ "$**": "text" }`) or one on specific fields.

### `regex`

Use when:

- the searchable fields are explicit
- case-insensitive matching is enough
- weighted field scoring is useful (via `regex.weights: { "meta.title": 10, path: 5 }`)

Good for smaller datasets or precise keyed fields. It needs no external dependencies and is easy to configure. Example:

```yaml
search:
  - name: default
    mode: regex
    fields: [title, "alt.de", description]
```

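The weighted variant mentioned in the list above might look like the following sketch. The nesting of `weights` under a `regex:` key is inferred from the dotted name `regex.weights`, and the block name `weighted` is hypothetical; confirm the exact shape in `tibi-server/docs/04-collections.md` before relying on it.

```yaml
search:
  - name: weighted # hypothetical block name
    mode: regex
    fields: ["meta.title", path, description]
    regex:
      # Assumed nesting, derived from the dotted key `regex.weights`
      weights: { "meta.title": 10, path: 5 }
```

How fields without an explicit weight (here `description`) are scored is not stated in this skill, so verify that against the server docs rather than assuming a default.
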
### `filter` or `eval`

Use when:

- search logic depends on auth, project context, or business-specific filtering
- plain keyword matching is not the full contract

Treat these as controlled power tools. The resulting filters are still sanitized against blocked operators.

### `ngram`

Use when:

- typo tolerance or substring matching is needed
- users search codes, names, transliterated terms, or partial inputs

This is enrichment-based search. It stores generated `_search` data, so define up front when that data is regenerated.
_Note:_ Field weighting is not natively supported inside a single `ngram` block, because all `fields` are concatenated into one large ngram index block per document.

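A standalone `ngram` block, reduced from the `combined` example in this skill, could be as small as the sketch below. The block name `fuzzy` and the chosen fields are illustrative only.

```yaml
search:
  - name: fuzzy # illustrative block name
    mode: ngram
    # All listed fields are concatenated into one ngram index per document
    fields: [name, "meta.title"]
    autoRegenerate: true
```
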
### `vector`

Use when:

- semantic similarity matters more than literal keyword overlap
- the project can support embedding-provider setup (e.g. `bge-m3` in `api/config.yml`)
- search quality justifies the added complexity

Vector mode requires a provider registered under `embedding.providers` on the server.

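As a minimal sketch, a standalone `vector` block taken from the `combined` example in this skill would reference the server-side provider by name. The field selection is illustrative; the provider name must match an entry in `embedding.providers`.

```yaml
search:
  - name: semantic
    mode: vector
    fields: [name, "meta.title", "blocks.text"]
    # Must match the `name` of a provider in the server's embedding.providers list
    vector: { provider: bge-m3 }
    autoRegenerate: true
```
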
### `combined` (RRF)

Use when:

- hybrid search is required (e.g. `vector` + `ngram` to catch both typos and semantic meaning)
- field weighting must be simulated for `vector` or `ngram` by splitting them into multiple search blocks and fusing those with different weights

`mode: combined` uses Reciprocal Rank Fusion (RRF). It delegates execution to other configured search blocks (which should be hidden in the admin UI via `meta.hide: true`).

**Field-Weighting Workaround with combined:**
Because `vector` and `ngram` concatenate all fields, you can weight highly important fields (like titles) above deep content fields by creating multiple ngram/vector blocks and boosting the important one in the `combined` weights:

```yaml
search:
  - name: main_search
    mode: combined
    rrf:
      k: 60
      topK: 100
      weights:
        semantic: 1.5
        fuzzy_important: 2.0 # Boosts matches in title/headline
        fuzzy_content: 0.5 # Lowers weight for deep text matches
    meta:
      label: { de: "Suche", en: "Search" }

  - name: fuzzy_important
    mode: ngram
    fields: [name, "meta.title", "blocks.headline"]
    autoRegenerate: true
    meta: { hide: true }

  - name: fuzzy_content
    mode: ngram
    fields: ["blocks.text", "blocks.items.answer"]
    autoRegenerate: true
    meta: { hide: true }

  - name: semantic
    mode: vector
    fields: [name, "meta.title", "blocks.text"]
    vector: { provider: bge-m3 }
    autoRegenerate: true
```

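To see why the weights in the example above shift ranking, it helps to write out the fusion score. Standard RRF scores a document by summing `1 / (k + rank)` over every result list that contains it. The formula below assumes tibi applies each block's weight as a multiplier on that term, which is a common weighted-RRF variant; this skill does not spell out the exact formula, so treat it as an illustration rather than the server's definition.

```math
\mathrm{score}(d) = \sum_{i \in \text{blocks}} \frac{w_i}{k + \mathrm{rank}_i(d)}
```

Under that assumption and with `k: 60`, a document ranked 1st in `fuzzy_important` and 3rd in `semantic` scores 2.0/61 + 1.5/63 ≈ 0.0566. A document ranked 1st in both `semantic` and `fuzzy_content` scores 1.5/61 + 0.5/61 ≈ 0.0328. The title match ranks higher even though the second document tops two lists.
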
## Auto-regeneration and admin flows

For `ngram` and `vector`, `autoRegenerate: true` can refresh stale enrichment data after config changes.

If regeneration is needed manually, the admin flow depends on project admin tokens with:

- `allowRegenerateSearch: true`

Treat regeneration as part of the search contract, not as an implementation footnote.

## Search and LLM are related but not identical

The LLM system and the embedding system are adjacent, but they are not the same thing.

- `llm.providers` drive chat/completion features
- `embedding.providers` drive vector search enrichment
- org/user budgets affect LLM usage workflows
- search design still needs its own retrieval and operator decisions

Do not assume that enabling editor AI automatically defines a sound search architecture.

## Anti-patterns

- leaving search unspecified and hoping auto-fallback is “good enough”
- enabling vector search without a real provider/runtime plan
- forgetting text indexes for `mode: text`
- enabling enrichment modes without a regeneration story
- mixing editor AI decisions with search decisions until neither is clear

## Verification checklist

After search-related changes, verify all of these:

1. the project has an explicit yes/no search decision
2. server-side embedding providers exist when vector search is configured
3. required text or search indexes exist
4. `?q=` and `?qName=` behavior matches the intended search contract
5. regeneration behavior is defined for enrichment-based modes

## What an LLM should inspect first

When asked to add or review search on this starter, inspect in this order:

1. `tibi-server/docs/04-collections.md`
2. `tibi-server/docs/02-configuration.md`
3. existing collection `search:` config
4. whether the project needs keyword, fuzzy, semantic, or no search
5. operator expectations for regeneration and provider secrets

This prevents over-engineered vector setups and under-specified search behavior.