Prompt Management System Design and Implementation
From Version Control to Production Deployment: Architecture and Engineering Practice for an Enterprise-Grade Prompt Management System | 2026-02
1. Why Prompt Management Is Needed
Once an LLM application moves from prototype to production, a prompt is no longer "just a piece of text": it is part of the core business logic. Without a management system, prompts run into the following problems:
- Version chaos: who changed the prompt, what changed, and how do you roll back a bad change?
- Quality regression: is the new version actually better than the old one? Without a comparison there is no answer.
- Deployment drift: the prompt in development does not match the one in production.
- Collaboration friction: product managers, engineers, and the data team each edit their own copy.
This article designs a complete prompt management system along five dimensions: architecture design, version control, A/B testing, deployment pipeline, and evaluation integration.
2. Architecture Design
2.1 System Architecture Overview
Prompt Management System Architecture

+------------------+        +------------------+
|  Prompt Studio   |        |    Evaluation    |
|  (Web Editor)    |        |     Pipeline     |
+--------+---------+        +--------+---------+
         |                           |
         v                           v
+------------------------------------------+
|            Prompt Registry API           |
|                                          |
|  +----------+ +---------+ +----------+   |
|  | Versions | | Labels  | | Configs  |   |
|  +----------+ +---------+ +----------+   |
|  +----------+ +---------+ +----------+   |
|  | Variants | | Metrics | | Deploys  |   |
|  +----------+ +---------+ +----------+   |
+------------------------------------------+
         |                           |
         v                           v
+------------------+        +------------------+
|    PostgreSQL    |        |   Cache Layer    |
|    (Source of    |        |   (Redis/Edge)   |
|     Truth)       |        |                  |
+------------------+        +------------------+
         |
         v
+------------------------------------------+
|            Application Runtime           |
|                                          |
|  prompt = registry.get("rag-system",     |
|                 label="production")      |
|  compiled = prompt.compile(vars)         |
|  response = llm.generate(compiled)       |
+------------------------------------------+
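Reading the diagram bottom-up, the application never hard-codes prompt text; it resolves a label at request time. A minimal sketch of that runtime path, assuming hypothetical registry and llm client objects with the interfaces shown in the diagram:

# Sketch of the runtime path from the diagram. `registry` and `llm` are
# assumed client objects; their interfaces are illustrative, not a real SDK.
async def answer(query: str) -> str:
    # Resolve the mutable "production" label to an immutable version
    prompt = await registry.get("rag-system", label="production")
    # Fill in template variables at call time
    compiled = prompt.compile({"language": "Chinese", "question": query})
    # Model and sampling config travel with the prompt version, not the code
    return await llm.generate(compiled, **prompt.config)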
2.2 Core Data Models
from datetime import datetime
from enum import Enum
from pydantic import BaseModel

class PromptType(str, Enum):
    TEXT = "text"          # Plain text prompt
    CHAT = "chat"          # Chat messages format
    TEMPLATE = "template"  # With variable placeholders

class PromptVersion(BaseModel):
    """A single immutable version of a prompt."""
    id: str                     # uuid
    prompt_name: str            # e.g., "rag-system-prompt"
    version: int                # Auto-incrementing
    type: PromptType
    content: str | list[dict]   # Text or chat messages
    config: dict                # Model, temperature, etc.
    variables: list[str]        # Template variables
    created_by: str             # Author
    created_at: datetime
    commit_message: str         # Why this change
    parent_version: int | None  # Previous version

class PromptLabel(BaseModel):
    """Mutable pointer to a version (like git tags)."""
    prompt_name: str
    label: str    # "production", "staging", "canary"
    version: int  # Points to a PromptVersion
    updated_at: datetime
    updated_by: str

class PromptMetrics(BaseModel):
    """Evaluation metrics for a version."""
    prompt_name: str
    version: int
    metric_name: str  # "faithfulness", "relevancy", etc.
    value: float
    sample_size: int
    evaluated_at: datetime
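To make the git-tag analogy concrete, here is an illustrative instantiation (all values made up): the version is immutable, while the label is the only thing that moves.

from uuid import uuid4

# Hypothetical example data; in practice these rows live in PostgreSQL.
v3 = PromptVersion(
    id=str(uuid4()), prompt_name="rag-system-prompt", version=3,
    type=PromptType.CHAT,
    content=[{"role": "system", "content": "Answer in {{language}}."}],
    config={"model": "gpt-4o", "temperature": 0.3},
    variables=["language"],
    created_by="maurice", created_at=datetime.utcnow(),
    commit_message="Add language variable", parent_version=2,
)
# Rollback or promotion never edits v3; it only repoints this label.
prod = PromptLabel(
    prompt_name="rag-system-prompt", label="production",
    version=v3.version, updated_at=datetime.utcnow(), updated_by="maurice",
)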
3. Version Control
3.1 Version Control Strategies
| Strategy | Best for | Strengths | Weaknesses |
|---|---|---|---|
| Git file management | Developer teams | Familiar toolchain | Unfriendly to non-technical users |
| Database versioning | Production systems | Dynamic deployment, label mechanism | Requires a dedicated system |
| Prompt Registry | Enterprise scale | Full lifecycle management | High build cost |
| Hybrid (Git + DB) | Recommended | Git for development, DB for production | Needs a sync mechanism (sketched in §3.2 below) |
3.2 Git-based Version Management
# prompts/rag-system-prompt/v3.yaml
name: rag-system-prompt
version: 3
type: chat
config:
  model: gpt-4o
  temperature: 0.3
  max_tokens: 2048
messages:
  - role: system
    content: |
      You are a helpful assistant that answers questions based on the provided context.
      Rules:
      - Only use information from the provided context
      - If the context doesn't contain the answer, say "I don't know"
      - Cite specific sections when possible
      - Answer in {{language}}
variables:
  - language  # Compile-time variable
metadata:
  author: maurice
  created: 2026-02-15
  commit_message: "Add citation requirement and language variable"
  tags: [rag, production-ready]
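For the hybrid (Git + DB) strategy recommended above, a small sync job can push these YAML files into the registry on merge. A minimal sketch, assuming PyYAML is available and reusing the PromptRegistry from §3.3 below; the file layout is the one shown above:

import pathlib
import yaml  # PyYAML, assumed available

async def sync_prompts_from_git(registry: "PromptRegistry",
                                root: str = "prompts") -> None:
    """Register every prompts/<name>/v*.yaml the DB doesn't know yet."""
    for path in sorted(pathlib.Path(root).glob("*/v*.yaml")):
        spec = yaml.safe_load(path.read_text())
        latest = await registry.get_latest_version(spec["name"])
        if latest and latest.version >= spec["version"]:
            continue  # Already synced; keeps the job idempotent
        await registry.create_version(
            name=spec["name"],
            content=spec["messages"],
            config=spec["config"],
            commit_message=spec["metadata"]["commit_message"],
            author=spec["metadata"]["author"],
        )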
3.3 Registry API Implementation
from uuid import uuid4

from fastapi import FastAPI, HTTPException
from typing import Optional

app = FastAPI()

class PromptRegistry:
    """Core prompt registry with version control."""

    async def create_version(
        self, name: str, content: str | list[dict],
        config: dict, commit_message: str, author: str,
    ) -> PromptVersion:
        """Create a new immutable version."""
        current = await self.get_latest_version(name)
        new_version = (current.version + 1) if current else 1
        version = PromptVersion(
            id=str(uuid4()),
            prompt_name=name,
            version=new_version,
            content=content,
            config=config,
            commit_message=commit_message,
            created_by=author,
            parent_version=current.version if current else None,
            # ... other fields
        )
        await self.db.insert(version)
        return version

    async def set_label(
        self, name: str, label: str, version: int, author: str,
    ) -> PromptLabel:
        """Point a label to a specific version (like git tag)."""
        # Verify version exists
        v = await self.get_version(name, version)
        if not v:
            raise HTTPException(404, f"Version {version} not found")
        prompt_label = PromptLabel(
            prompt_name=name, label=label, version=version,
            updated_at=datetime.utcnow(), updated_by=author,
        )
        await self.db.upsert(prompt_label)
        # Invalidate cache
        await self.cache.delete(f"prompt:{name}:{label}")
        return prompt_label

    async def get_prompt(
        self, name: str, label: str = "production",
        version: Optional[int] = None,
    ) -> PromptVersion:
        """Get prompt by label or explicit version."""
        # Explicit versions get their own cache entries; otherwise an
        # explicit-version lookup would collide with the default label's key.
        cache_key = f"prompt:{name}:{version if version else label}"
        # Check cache first
        cached = await self.cache.get(cache_key)
        if cached:
            return PromptVersion.model_validate_json(cached)
        if version:
            result = await self.get_version(name, version)
        else:
            lbl = await self.db.get_label(name, label)
            result = await self.get_version(name, lbl.version)
        # Cache for 5 minutes
        await self.cache.set(cache_key, result.model_dump_json(), ex=300)
        return result

registry = PromptRegistry()
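With labels in place, rollback is just repointing the pointer. For example (inside an async context, assuming version 2 is the known-good target):

# Roll production back to v2: no redeploy, just move the label.
# set_label also invalidates the cached "production" entry, so readers
# pick up the old version within one request.
await registry.set_label("rag-system-prompt", label="production",
                         version=2, author="oncall")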
4. A/B Testing
4.1 A/B Testing Architecture
A/B Testing Flow

              User Request
                   |
                   v
         +--------------------+
         |   Traffic Router   |
         |  (hash(user_id) %  |
         |  100 < threshold?) |
         +----+----------+----+
              |          |
              v          v
         +--------+  +--------+
         | Prompt |  | Prompt |
         |   v3   |  |   v4   |
         |  (90%) |  |  (10%) |
         +--------+  +--------+
              |          |
              v          v
          LLM Call    LLM Call
              |          |
              v          v
         +--------------------+
         | Metrics Collector  |
         | (latency, quality, |
         |  cost, user_score) |
         +--------------------+
                   |
                   v
         +--------------------+
         |    Statistical     |
         |      Analysis      |
         | (significance test)|
         +--------------------+
4.2 A/B Testing Implementation
import hashlib
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ABExperiment:
    name: str
    control_version: int       # e.g., v3
    treatment_version: int     # e.g., v4
    traffic_percentage: float  # 0.0-1.0, fraction routed to treatment
    min_sample_size: int       # Minimum samples before conclusion
    start_date: datetime
    status: str                # "running", "concluded", "aborted"

class ABRouter:
    def __init__(self, registry: PromptRegistry):
        self.registry = registry

    async def get_prompt_for_request(
        self, prompt_name: str, user_id: str,
        experiment: ABExperiment | None = None,
    ) -> tuple[PromptVersion, str]:
        """Returns (prompt, variant) for A/B tracking."""
        if not experiment or experiment.status != "running":
            prompt = await self.registry.get_prompt(prompt_name)
            return prompt, "control"
        # Deterministic assignment based on user_id
        hash_val = int(hashlib.md5(
            f"{experiment.name}:{user_id}".encode()
        ).hexdigest(), 16)
        bucket = (hash_val % 1000) / 1000.0
        if bucket < experiment.traffic_percentage:
            version = experiment.treatment_version
            variant = "treatment"
        else:
            version = experiment.control_version
            variant = "control"
        prompt = await self.registry.get_prompt(
            prompt_name, version=version,
        )
        return prompt, variant
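The diagram's final "Statistical Analysis" step can be sketched with scipy; the per-variant score lists are assumed to come from the metrics collector (a hypothetical store, not shown):

from scipy import stats

def conclude_experiment(
    experiment: ABExperiment,
    control_scores: list[float],
    treatment_scores: list[float],
) -> str:
    """Return 'wait', 'promote', or 'keep' based on collected metrics."""
    if min(len(control_scores), len(treatment_scores)) < experiment.min_sample_size:
        return "wait"  # Not enough data for a conclusion yet
    # Independent samples: users were split between arms, not paired
    _, p_value = stats.ttest_ind(treatment_scores, control_scores)
    treatment_avg = sum(treatment_scores) / len(treatment_scores)
    control_avg = sum(control_scores) / len(control_scores)
    if p_value < 0.05 and treatment_avg > control_avg:
        return "promote"
    return "keep"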
5. Deployment Pipeline
5.1 Prompt CI/CD Flow
Prompt Deployment Pipeline

1. DEVELOP
   Author writes/edits prompt in Prompt Studio
   -> Creates new version (v4)
   -> Label: "draft"

2. EVALUATE
   Automated eval pipeline runs:
   -> Faithfulness score
   -> Relevancy score
   -> Regression test (compare vs production)
   -> Cost estimation
   -> Label: "staging" (if eval passes)

3. CANARY
   Route 5% traffic to staging prompt
   -> Monitor metrics for 1 hour
   -> Compare with production baseline
   -> Label: "canary" (if metrics healthy)

4. PROMOTE
   Route 100% traffic to new version
   -> Label: "production"
   -> Old version labeled: "rollback-target"

5. MONITOR
   Continuous monitoring:
   -> Alert if quality drops > 10%
   -> Auto-rollback if critical threshold breached
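A sketch of how these stages might be driven end to end, reusing the registry labels above and the evaluation gate from §5.2 below; run_canary stands in for the canary monitor and is hypothetical:

async def promote_pipeline(name: str, version: int) -> bool:
    """Walk a new version through staging -> canary -> production."""
    verdict = await evaluate_prompt_version(name, version)  # gate, see §5.2
    if not verdict["pass"]:
        return False
    await registry.set_label(name, "staging", version, author="ci")
    # run_canary is a placeholder: route 5% traffic, watch metrics for 1h
    if not await run_canary(name, version, traffic=0.05, minutes=60):
        return False
    old = await registry.get_prompt(name, label="production")
    await registry.set_label(name, "rollback-target", old.version, author="ci")
    await registry.set_label(name, "production", version, author="ci")
    return True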
5.2 Automated Evaluation Gate
from scipy.stats import ttest_rel

async def evaluate_prompt_version(
    prompt_name: str, version: int,
    eval_dataset: str = "golden-set",
) -> dict:
    """Automated evaluation gate before promotion."""
    prompt = await registry.get_prompt(prompt_name, version=version)
    production = await registry.get_prompt(prompt_name, label="production")
    dataset = await load_dataset(eval_dataset)
    results = {"new": [], "baseline": []}
    for sample in dataset:
        # Run new version
        new_output = await run_prompt(prompt, sample["input"])
        new_score = await evaluate_output(
            new_output, sample["expected"], sample["context"],
        )
        results["new"].append(new_score)
        # Run baseline (production)
        base_output = await run_prompt(production, sample["input"])
        base_score = await evaluate_output(
            base_output, sample["expected"], sample["context"],
        )
        results["baseline"].append(base_score)
    # Statistical comparison (paired: same samples, two prompt versions)
    t_stat, p_value = ttest_rel(results["new"], results["baseline"])
    avg_new = sum(results["new"]) / len(results["new"])
    avg_base = sum(results["baseline"]) / len(results["baseline"])
    verdict = {
        "new_avg": avg_new,
        "baseline_avg": avg_base,
        "improvement": avg_new - avg_base,
        "p_value": p_value,
        "significant": p_value < 0.05,
        "pass": avg_new >= avg_base * 0.95,  # Allow at most 5% regression
    }
    return verdict
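Stage 5's auto-rollback uses the same label mechanism. A minimal watchdog sketch; get_live_quality and send_alert are hypothetical hooks into the metrics and alerting systems, and the thresholds are example values:

async def quality_watchdog(name: str, baseline: float) -> None:
    """Alert on a >10% quality drop; auto-rollback past the critical line."""
    live = await get_live_quality(name)  # hypothetical metrics feed
    if live < baseline * 0.90:
        await send_alert(f"{name}: quality {live:.2f} vs baseline {baseline:.2f}")
    if live < baseline * 0.80:           # critical threshold (example value)
        target = await registry.get_prompt(name, label="rollback-target")
        await registry.set_label(name, "production", target.version,
                                 author="watchdog")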
6. Template Engine
6.1 Variable Substitution
import re
from typing import Any

class PromptCompiler:
    """Compile prompt templates with variable substitution."""

    def compile(
        self, template: str, variables: dict[str, Any],
        strict: bool = True,
    ) -> str:
        """Replace {{variable}} placeholders with values."""
        # Find all variables in template
        required = set(re.findall(r'\{\{(\w+)\}\}', template))
        provided = set(variables.keys())
        if strict:
            missing = required - provided
            if missing:
                raise ValueError(f"Missing variables: {missing}")
        result = template
        for key, value in variables.items():
            result = result.replace(f"{{{{{key}}}}}", str(value))
        return result

    def compile_chat(
        self, messages: list[dict], variables: dict[str, Any],
    ) -> list[dict]:
        """Compile chat format prompts."""
        compiled = []
        for msg in messages:
            compiled.append({
                "role": msg["role"],
                "content": self.compile(msg["content"], variables),
            })
        return compiled

# Usage (inside an async context, since get_prompt is a coroutine)
compiler = PromptCompiler()
prompt = await registry.get_prompt("rag-system", label="production")
compiled = compiler.compile(prompt.content, {
    "language": "Chinese",
    "max_sources": "3",
})
6.2 Conditional Logic
# Advanced: Jinja2-based templates for complex logic
from jinja2 import Environment, BaseLoader

JINJA_ENV = Environment(loader=BaseLoader())

template_str = """
You are a {{ role }} assistant.
{% if context %}
Use the following context to answer:
{{ context }}
{% endif %}
{% if examples %}
Here are some examples:
{% for ex in examples %}
Q: {{ ex.question }}
A: {{ ex.answer }}
{% endfor %}
{% endif %}
Rules:
{% for rule in rules %}
- {{ rule }}
{% endfor %}
"""

template = JINJA_ENV.from_string(template_str)
compiled = template.render(
    role="financial compliance",
    context=retrieved_docs,      # assumed to exist in the calling scope
    examples=few_shot_examples,  # assumed to exist in the calling scope
    rules=["Cite sources", "Be concise", "Use formal tone"],
)
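If any rendered variable can contain user-supplied text, it is safer to render with Jinja2's sandboxed environment; StrictUndefined additionally turns a forgotten variable into a loud error instead of a silently empty prompt section. A small hardening sketch:

from jinja2 import StrictUndefined
from jinja2.sandbox import SandboxedEnvironment

SAFE_ENV = SandboxedEnvironment(undefined=StrictUndefined)
template = SAFE_ENV.from_string(template_str)
# With StrictUndefined every variable must be passed explicitly,
# so optional sections are disabled with None rather than omitted.
compiled = template.render(
    role="financial compliance",
    context=None, examples=None,
    rules=["Cite sources"],
)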
7. Observability Integration
7.1 Integrating Langfuse
from langfuse import Langfuse
from langfuse.decorators import observe
from langfuse.openai import AsyncOpenAI  # drop-in wrapper that links traces

langfuse = Langfuse()
openai = AsyncOpenAI()

@observe()
async def answer_question(query: str, user_id: str) -> str:
    # Fetch prompt from registry (linked to Langfuse)
    prompt = langfuse.get_prompt("rag-system", label="production")
    # Compile with variables
    messages = prompt.compile(context=retrieved_docs, language="zh")
    # Generate (auto-traced)
    response = await openai.chat.completions.create(
        model=prompt.config["model"],
        messages=messages,
        temperature=prompt.config["temperature"],
        langfuse_prompt=prompt,  # Link trace to prompt version
    )
    return response.choices[0].message.content

# In the Langfuse dashboard:
# - See which prompt version was used for each trace
# - Compare quality metrics across versions
# - Track cost per prompt version
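One operational caveat: fetching prompts at runtime puts a network dependency on the request path. A defensive pattern is a baked-in fallback that the application ships with (the fallback text below is illustrative), used only when the registry and any client-side cache are both unavailable:

import logging

# Last-resort prompt compiled into the application artifact.
FALLBACK_MESSAGES = [
    {"role": "system",
     "content": "You are a helpful assistant. Answer in {{language}}."},
]

async def fetch_prompt_or_fallback(name: str) -> list[dict]:
    try:
        prompt = await registry.get_prompt(name, label="production")
        return prompt.content
    except Exception:
        logging.exception("Prompt registry unreachable; using baked-in fallback")
        return FALLBACK_MESSAGES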
8. Best Practices
8.1 Naming Conventions
| Level | Naming pattern | Example |
|---|---|---|
| Project | {project} | customer-support |
| Function | {project}-{function} | customer-support-classifier |
| Variant | {project}-{function}-{variant} | customer-support-classifier-concise |
8.2 Commit Conventions
# Good commit messages
"Add citation requirement for compliance"
"Reduce hallucination by adding explicit constraints"
"Optimize token usage: -30% with same quality"
# Bad commit messages
"Update prompt"
"Fix"
"Try something new"
8.3 Evaluation-Driven Principles
| Principle | Description |
|---|---|
| Build the eval before editing the prompt | Without evaluation there is no direction for optimization |
| Maintain a golden test set | At least 50 annotated samples per prompt |
| Automated gate | A version that fails evaluation must not ship (see the CI test below) |
| Progressive rollout | staging -> canary -> production |
| Always rollbackable | Always keep a label on the previous version |
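The automated-gate principle can be enforced in CI with a test that fails the build on regression. A minimal pytest sketch (requires pytest-asyncio), reusing evaluate_prompt_version from §5.2; the version number is illustrative:

import pytest

@pytest.mark.asyncio
async def test_candidate_passes_eval_gate():
    verdict = await evaluate_prompt_version(
        "rag-system-prompt", version=4, eval_dataset="golden-set",
    )
    # A failing gate fails the build, so the label can never be promoted
    assert verdict["pass"], (
        f"Quality regressed: {verdict['new_avg']:.3f} "
        f"vs baseline {verdict['baseline_avg']:.3f}"
    )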
9. Summary
The core value of a prompt management system is turning prompts from "tacit knowledge" into traceable, evaluable, rollbackable engineering artifacts. A suggested implementation path:
- Phase 1: Git file management + manual evaluation (1-2 weeks)
- Phase 2: Registry API + automated evaluation gate (2-4 weeks)
- Phase 3: A/B testing + progressive rollout + observability integration (4-8 weeks)
Core principle: prompts are code, and they deserve the full engineering treatment that code gets.
Maurice | maurice_wen@proton.me