---
name: search-and-embeddings
description: Model search and semantic retrieval for tibi website projects. Covers embedding provider configuration, collection search modes, auto-regeneration, regenerate-search admin flows, and how later agents should decide between no search, classic search, ngram search, and vector search.
---
# search-and-embeddings
## When to use this skill
Use this skill when:
- a project needs explicit search behavior beyond generic CRUD filtering
- search should be typo-tolerant, weighted, or semantic
- embedding providers must be configured
- later agents need a clear yes/no decision for search instead of vague optionality
## Goal
Give later agents a practical workflow for deciding whether search is needed and, if yes, which search mode belongs to the project.
This skill is separate from editor AI features. Search and embeddings affect content retrieval, operational setup, and index/regeneration behavior, not just editor assistance.
## Source of truth
Use these sources when implementing or reviewing search behavior:
- `tibi-server/docs/02-configuration.md`
- `tibi-server/docs/04-collections.md`
- `tibi-server/docs/09-llm-integration.md`
- `.agents/skills/nova-ai-editor-features/SKILL.md`
- `.agents/skills/mongodb-and-indexes/SKILL.md`
## First decision: no search vs explicit search
Do not leave search in an implied state.
Make one explicit decision:
- no search in this project
- classic keyword search only
- fuzzy substring search (`ngram`)
- semantic/vector search
- hybrid search with deliberate ranking behavior
If the answer is “not used”, document that clearly so later agents do not accidentally wire providers or regress into half-configured search.
## Server-level provider setup
Embedding providers are configured server-side:
```yaml
embedding:
  providers:
    - name: bge-m3
      type: native
      modelPath: /models/bge-m3
      dimensions: 1024
    - name: openai-embed
      type: openai
      model: text-embedding-3-small
      apiKey: ${EMBEDDING_OPENAI-EMBED_APIKEY}
      baseURL: https://api.openai.com/v1
      dimensions: 1536
```
Important:
- collection search config references the provider by name
- embedding secrets and model paths can come from environment variables
- vector search is not only a collection concern; the server must actually provide the embedding backend
## Collection search modes
Tibi supports multiple search modes via collection `search:` config:
- `text`
- `regex`
- `eval`
- `filter`
- `ngram`
- `vector`
- `combined`
Use explicit search configs when search is a real product feature. Auto-fallback is useful, but it is not a substitute for a deliberate retrieval model.
## Choosing the right mode
### `text`
Use when:
- MongoDB text indexing is sufficient
- exact field ownership of the text index is clear
- keyword search is enough
Requires a MongoDB text index, either a wildcard index (`{ "$**": "text" }`) or one on specific fields.
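A minimal sketch of a `text` search block, assuming it follows the same shape as the other modes in this skill. The block name `keyword` is hypothetical, and the text index itself must be defined separately (see `.agents/skills/mongodb-and-indexes/SKILL.md`):

```yaml
search:
  # Hypothetical block name; matching is delegated to the MongoDB text index,
  # so the index definition decides which fields are searchable.
  - name: keyword
    mode: text
```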
### `regex`
Use when:
- the searchable fields are explicit
- case-insensitive matching is enough
- weighted field scoring is useful (via `regex.weights: { "meta.title": 10, path: 5 }`)
Good for smaller datasets or precise keyed fields. It is easy to configure and needs no external dependencies. Example:
```yaml
search:
  - name: default
    mode: regex
    fields: [title, "alt.de", description]
```
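Weighted scoring can be sketched the same way. This assumes the dotted `regex.weights` notation mentioned above maps to a nested `regex:` key, in the same style as `vector: { provider: ... }`; verify the exact shape against `tibi-server/docs/04-collections.md`:

```yaml
search:
  - name: default
    mode: regex
    fields: ["meta.title", path, description]
    regex:
      # Assumed nesting: matches in meta.title count most, then path;
      # description has no explicit weight here.
      weights: { "meta.title": 10, path: 5 }
```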
### `filter` or `eval`
Use when:
- search logic depends on auth, project context, or business-specific filtering
- plain keyword matching is not the full contract
Treat these as controlled power tools. The resulting filters are still sanitized against blocked operators.
### `ngram`
Use when:
- typo tolerance or substring matching is needed
- users search codes, names, transliterated terms, or partial inputs
This is enrichment-based search. It stores generated `_search` data and benefits from clear regeneration expectations.
_Note:_ Field weighting is not natively supported inside a single `ngram` mode, because all `fields` are concatenated into one large ngram index block per document.
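A minimal sketch of a standalone `ngram` block, built only from keys that appear elsewhere in this skill. The block name `fuzzy` and the field selection are illustrative:

```yaml
search:
  - name: fuzzy               # hypothetical block name
    mode: ngram
    # All listed fields are concatenated into one ngram index per document,
    # so they cannot be weighted against each other inside this block.
    fields: [name, "meta.title"]
    autoRegenerate: true      # refresh stale enrichment data after config changes
```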
### `vector`
Use when:
- semantic similarity matters more than literal keyword overlap
- the project can support embedding-provider setup (e.g. `bge-m3` in `api/config.yml`)
- search quality justifies added complexity
Vector mode requires a registered provider.
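A minimal sketch of a collection-level `vector` block, assuming a provider named `bge-m3` is already registered under `embedding.providers` in the server config. The block name and field list are illustrative:

```yaml
search:
  - name: semantic
    mode: vector
    fields: [name, "meta.title", "blocks.text"]
    # Must match the `name` of a server-side embedding provider;
    # the collection config alone does not supply an embedding backend.
    vector: { provider: bge-m3 }
    autoRegenerate: true
```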
### `combined` (RRF)
Use when:
- hybrid search is required (e.g. `vector` + `ngram` to catch both typos and semantic meaning)
- you need to simulate field weighting for `vector` or `ngram` by splitting them into multiple search blocks and fusing them with different weights
`mode: combined` uses Reciprocal Rank Fusion (RRF). It delegates execution to other configured search blocks (which should be hidden in admin UI via `meta.hide: true`).
**Field-weighting workaround with `combined`:**
Because `vector` and `ngram` each concatenate all of a block's fields, you can make important fields (like titles) count more than deep content fields by creating multiple ngram/vector blocks and boosting the important one in the `combined` weights:
```yaml
search:
  - name: main_search
    mode: combined
    rrf:
      k: 60
      topK: 100
      weights:
        semantic: 1.5
        fuzzy_important: 2.0 # Boosts matches in title/headline
        fuzzy_content: 0.5   # Lowers weight for deep text matches
    meta:
      label: { de: "Suche", en: "Search" }
  - name: fuzzy_important
    mode: ngram
    fields: [name, "meta.title", "blocks.headline"]
    autoRegenerate: true
    meta: { hide: true }
  - name: fuzzy_content
    mode: ngram
    fields: ["blocks.text", "blocks.items.answer"]
    autoRegenerate: true
    meta: { hide: true }
  - name: semantic
    mode: vector
    fields: [name, "meta.title", "blocks.text"]
    vector: { provider: bge-m3 }
    autoRegenerate: true
```
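As a rough guide to how `k` and `weights` in the example above interact, weighted Reciprocal Rank Fusion is commonly written as follows. Whether tibi-server uses exactly this form (1-based ranks, per-block weights as plain multipliers) is an assumption; confirm against `tibi-server/docs/04-collections.md` before tuning.

$$
\text{score}(d) = \sum_{b \in B} \frac{w_b}{k + \text{rank}_b(d)}
$$

Here $B$ is the set of delegated search blocks, $w_b$ is the weight of block $b$, and $\text{rank}_b(d)$ is the position of document $d$ in that block's result list. Under these assumptions, with `k: 60`, a document ranked 1st in `fuzzy_important` and 3rd in `semantic` scores $2.0/61 + 1.5/63 \approx 0.0566$, while a document ranked 1st only in `fuzzy_content` scores $0.5/61 \approx 0.0082$. A larger `k` flattens the difference between adjacent ranks, so the weights matter relatively more.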
## Auto-regeneration and admin flows
For `ngram` and `vector`, `autoRegenerate: true` can refresh stale enrichment data after config changes.
If regeneration is needed manually, the admin flow depends on project admin tokens with:
- `allowRegenerateSearch: true`
Treat regeneration as part of the search contract, not as an implementation footnote.
## Search and LLM are related but not identical
The LLM system and the embedding system are adjacent, but they are not the same thing.
- `llm.providers` drive chat/completion features
- `embedding.providers` drive vector search enrichment
- org/user budgets affect LLM usage workflows
- search design still needs its own retrieval and operator decisions
Do not assume that enabling editor AI automatically defines a sound search architecture.
## Anti-patterns
- leaving search unspecified and hoping auto-fallback is “good enough”
- enabling vector search without a real provider/runtime plan
- forgetting text indexes for `mode: text`
- enabling enrichment modes without a regeneration story
- mixing editor AI decisions with search decisions until neither is clear
## Verification checklist
After search-related changes, verify all of these:
1. the project has an explicit yes/no search decision
2. server-side embedding providers exist when vector search is configured
3. required text or search indexes exist
4. `?q=` and `?qName=` behavior matches the intended search contract
5. regeneration behavior is defined for enrichment-based modes
## What an LLM should inspect first
When asked to add or review search on this starter, inspect in this order:
1. `tibi-server/docs/04-collections.md`
2. `tibi-server/docs/02-configuration.md`
3. existing collection `search:` config
4. whether the project needs keyword, fuzzy, semantic, or no search
5. operator expectations for regeneration and provider secrets
This prevents over-engineered vector setups and under-specified search behavior.