Tanomude: Design and technology choices

Engineering decisions, and why

Decision	Why
The key-sequence skeleton lives in code; the model fills slots only	Measured, not assumed: free-form generation scored 7.5% exact vs 87.5% for skeleton + slot-fill (96 runs, zero parse failures). In all 96 runs the model never once produced the correct abort sequence on a missing required field. So dates, day counts, branching and required-field guards are code; the model returns two normalized values and two branch decisions.
A custom agent loop, not a framework	Existing agent frameworks were weighed against the safety and verification requirements and declined for the serving path: approval gating, screen-state verification, idempotency and rollback stay in plain code that can be read and tested.
Human approval is the only path to execution	A dangerous failure does not crash; it submits. Planning and execution are two separate calls: a plan stops at `awaiting_approval`, execution exists only inside the approve endpoint, and a decision on a task not awaiting approval is rejected with 409.
Hybrid retrieval: vector + full-text, rank-fused	Queries are Japanese prose, but the load-bearing tokens are exact screen codes and keys (`DEST`, `P-###`). The dense vector channel carries the manuals' prose meaning; the sparse full-text channel (the unstemmed `simple` dictionary) carries the exact tokens, so the hybrid catches both. Reciprocal-rank fusion (RRF) combines the two rankings without having to calibrate two unrelated score scales; k=60 is the conventional RRF default, adopted as-is rather than tuned.
Structure-aware chunking	The manual is split at its headings, one chunk per numbered section with the section number preserved, so a retrieved unit is one complete procedure step, and the retrieval eval can assert expected section numbers instead of fuzzy overlap.
One Postgres (+pgvector) as the whole platform store	Chunks, vectors, full-text indexes, tasks, plans, approvals, the audit log, personal corrections and eval runs all live in one relational database: one backup, one transaction model, and no separate vector service to operate on-premises. Minimizing the components run on-premises is the same direct-control stance that keeps the serving path in plain code.
A second, separate Postgres for the mock AS-400	The legacy stand-in keeps its own database, so the "AS-400" state is genuinely external to the platform: execution is verified by reading back across a real service boundary, not by inspecting the platform's own writes.
Local models, fully on-premises	Ollama serves Gemma (`gemma4:e4b`) and TEI serves `multilingual-e5-large` (1024-dim; the corpus is bilingual, Japanese prose with English technical terms, so the embedder must be multilingual); nothing leaves the building. The embedding service is deliberately pinned to CPU so the 12 GB GPU stays free for the LLM.
Determinism first	Temperature 0, a fixed seed, JSON-forced output; only after a parse failure does a retry step both the seed and the temperature (+0.3 per retry). The published numbers were identical across five consecutive runs because the serving path is deterministic by construction.
Idempotency at the session seam	Every execution opens its screen session with the idempotency key `task:{task_id}`; the target enforces a unique constraint on it, so a duplicate submission yields one application, not two.
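
The rank fusion in the hybrid-retrieval row can be illustrated with a minimal sketch. It assumes each channel returns an ordered list of chunk ids; the function name `rrf_fuse` and the section ids are hypothetical, and only the formula (sum of 1/(k + rank) with k=60) comes from the text above.

```python
from collections import defaultdict

RRF_K = 60  # conventional RRF constant, used as-is per the table above


def rrf_fuse(rankings, k=RRF_K):
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical channel outputs (section ids are made up for illustration).
vector_hits = ["sec-3", "sec-1", "sec-7"]
fulltext_hits = ["sec-1", "sec-4", "sec-3"]
fused = rrf_fuse([vector_hits, fulltext_hits])
```

Only ranks enter the score, which is why the two channels' raw scores never need to share a scale.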

The core loop

backend/coreloop.py · slotfill.py

Planning is a code → LLM → code sandwich: deterministic guards refuse before the model is ever asked, the model only normalizes values and decides branches, and code assembles the actual keystrokes. Execution then drives the screen adapter step by step, asserting state after every action.

Control flow of plan() + execute(), with the constants the code actually uses.

MAX_REPLAN = 2 · ROLLBACK_MAX_NAV = 3 · PURPOSE_MAX = 20 · MAX_PARSE_RETRY = 2 · temperature 0 (+0.3/retry) · seed fixed (steps on retry) · idempotency task:{task_id}
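
A minimal sketch of the code → LLM → code sandwich follows. It is not the contents of `coreloop.py`: the field names, the base seed, the key sequence, and the `call_model(prompt, temperature, seed)` signature are all hypothetical. It also assumes `MAX_PARSE_RETRY` counts retries after the first attempt and that `PURPOSE_MAX` caps the length of the purpose text; the source gives the constants but not their exact semantics.

```python
import json
from typing import Callable

MAX_PARSE_RETRY = 2   # from the constants above (assumed: retries after attempt 0)
PURPOSE_MAX = 20      # from the constants above (assumed: max purpose length)
TEMP_STEP = 0.3       # temperature step per retry, per the text
BASE_SEED = 42        # hypothetical; the text only says the seed is fixed

REQUIRED = ("destination", "start_date", "purpose")  # hypothetical field names
SLOT_KEYS = {"destination_code", "purpose"}          # hypothetical slot names


def plan(request: dict, call_model: Callable[[str, float, int], str]) -> dict:
    # 1) Code: deterministic guard refuses before the model is ever asked.
    missing = [f for f in REQUIRED if not request.get(f)]
    if missing:
        return {"status": "refused", "missing": missing}

    # 2) LLM: slot-fill only. A parse failure steps both seed and temperature.
    slots = None
    prompt = json.dumps(request, ensure_ascii=False)
    for attempt in range(MAX_PARSE_RETRY + 1):
        raw = call_model(prompt, TEMP_STEP * attempt, BASE_SEED + attempt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and SLOT_KEYS <= parsed.keys():
            slots = parsed
            break
    if slots is None:
        return {"status": "refused", "missing": [], "reason": "unparseable output"}

    # 3) Code: the keystroke skeleton is assembled here, never by the model.
    keys = ["DEST", slots["destination_code"], "ENTER",
            request["start_date"], "ENTER",
            slots["purpose"][:PURPOSE_MAX], "ENTER"]
    return {"status": "awaiting_approval", "keys": keys}
```

The plan stops at `awaiting_approval`; nothing in this function executes keystrokes.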

The eval harness

eval_runner.py · retrieval_eval.py · growth_eval.py

The harness does not test a copy of the logic — it drives the same HTTP surface the console uses (plan → approve), injects transient faults by swapping the execute runner, and checks field accuracy by reading the created application back out of the mock AS-400. Results persist to eval tables in the platform store.

Core-loop eval: 24 cases

8 normal · 4 missing-field · 4 wrong-code · 4 transient · 4 duplicate. Each case runs the real plan → approve path and is checked against its expected outcome (submitted / refusal / re-entry (再入力) / needs investigation (要調査) / idempotent).

Metrics: success rate · routing accuracy · field accuracy (trip read-back) · verify pass rate · average steps · recovery rate.
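
The duplicate cases depend on two properties stated in the decisions table: a decision on a task that is not awaiting approval is rejected with 409, and the target enforces uniqueness on `task:{task_id}`. The in-memory sketch below illustrates both under loud assumptions: `Tasks`, `MockTarget`, `Conflict`, and every state name other than `awaiting_approval` are hypothetical stand-ins, not the project's classes.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 response."""
    status_code = 409


class MockTarget:
    """Hypothetical stand-in for the target; session keys are unique."""

    def __init__(self):
        self.applications = {}

    def submit(self, session_key, payload):
        # Unique constraint: a repeated key returns the existing application.
        if session_key in self.applications:
            return self.applications[session_key], False
        self.applications[session_key] = payload
        return payload, True


class Tasks:
    def __init__(self, target):
        self.target = target
        self.state = {}
        self.plans = {}

    def plan(self, task_id, payload):
        # Planning never executes; it parks the task for a human decision.
        self.plans[task_id] = payload
        self.state[task_id] = "awaiting_approval"

    def decide(self, task_id, approve):
        if self.state.get(task_id) != "awaiting_approval":
            raise Conflict(f"task {task_id} is not awaiting approval")
        if not approve:
            self.state[task_id] = "rejected"   # hypothetical state name
            return
        # Execution exists only on this path, keyed for idempotency.
        self.target.submit(f"task:{task_id}", self.plans[task_id])
        self.state[task_id] = "submitted"      # hypothetical state name
```

A second approval of the same task raises `Conflict`, and even a replayed `submit` with the same key leaves exactly one application in the target.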

Retrieval eval: 4 queries

Each query carries the manual sections it must surface; vector, full-text and hybrid results are scored separately, so the fusion's contribution is visible.

Metrics: precision@k · recall@k · precision@expected · MRR, at k=3.
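
A sketch of how these metrics could be computed for one query, assuming `retrieved` is a ranked list of section numbers without duplicates and `expected` is a non-empty set. The definition of precision@expected used here (precision with k set to the number of expected sections) is an assumption, since the source names the metric without defining it; the section numbers are made up.

```python
def precision_at_k(retrieved, expected, k):
    """Share of the top-k slots occupied by expected sections."""
    hits = sum(1 for s in retrieved[:k] if s in expected)
    return hits / k


def recall_at_k(retrieved, expected, k):
    """Share of expected sections that appear in the top k."""
    hits = sum(1 for s in retrieved[:k] if s in expected)
    return hits / len(expected)


def precision_at_expected(retrieved, expected):
    """Assumed definition: precision@k with k = number of expected sections."""
    return precision_at_k(retrieved, expected, len(expected))


def reciprocal_rank(retrieved, expected):
    """1 / rank of the first expected section, or 0.0 if none is retrieved."""
    for rank, section in enumerate(retrieved, start=1):
        if section in expected:
            return 1.0 / rank
    return 0.0


# Hypothetical query expecting a single section, found at rank 1.
retrieved = ["4.2", "4.3", "7.1"]
expected = {"4.2"}
p3 = precision_at_k(retrieved, expected, 3)      # 1/3: two slots cannot be hits
r3 = recall_at_k(retrieved, expected, 3)         # 1.0
pe = precision_at_expected(retrieved, expected)  # 1.0
rr = reciprocal_rank(retrieved, expected)        # 1.0
```

The example shows the fixed-k effect: when fewer than k sections are expected, precision@k is capped below 1.0 even if retrieval is perfect, while precision@expected is not. MRR is the mean of `reciprocal_rank` over all queries.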

Growth eval: 12 cases, A/B

The same plan endpoint runs twice per case: control (no corrections) vs treatment (corrections seeded). Growth Δ = treatment − control on the 8 policy cases; the 4 boundary cases assert that protected slots did not move.

Metrics: growth Δ · boundary respect rate (+ a control sanity check).
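
The A/B arithmetic can be sketched as below. How `growth_eval.py` scores an individual case is not stated in the source, so this assumes a simple pass/fail per arm; the `CaseResult` type and the sample data are hypothetical and are not the published results.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    kind: str                          # "policy" or "boundary"
    control_ok: bool = False           # policy: control arm matched the target
    treatment_ok: bool = False         # policy: treatment arm matched the target
    protected_unchanged: bool = True   # boundary: protected slots did not move


def growth_delta(results):
    """Treatment pass rate minus control pass rate, over policy cases only."""
    policy = [r for r in results if r.kind == "policy"]
    control = sum(r.control_ok for r in policy) / len(policy)
    treatment = sum(r.treatment_ok for r in policy) / len(policy)
    return treatment - control


def boundary_respect_rate(results):
    """Share of boundary cases whose protected slots stayed put."""
    boundary = [r for r in results if r.kind == "boundary"]
    return sum(r.protected_unchanged for r in boundary) / len(boundary)


# Made-up sample: 4 policy cases and 2 boundary cases.
sample = [
    CaseResult("policy", control_ok=True, treatment_ok=True),
    CaseResult("policy", control_ok=False, treatment_ok=True),
    CaseResult("policy", control_ok=False, treatment_ok=True),
    CaseResult("policy", control_ok=False, treatment_ok=False),
    CaseResult("boundary"),
    CaseResult("boundary"),
]
delta = growth_delta(sample)             # 0.75 - 0.25
respect = boundary_respect_rate(sample)
```

Keeping the boundary cases out of the Δ means a correction that moved a protected slot cannot inflate the growth figure; it can only lower the respect rate.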

Metric	Value
Planning: success · routing · field accuracy	1.0
Retrieval: recall@3 · precision@expected · MRR	1.0
Growth Δ: inference-tier policy slots	0.625
Boundary respect: code-enforced	1.0

The numbers that are not 1.0 are published as they are: recovery on transient faults 0.5 (bad input is handed to a person, not retried), verify pass rate 0.667 (a failed verify is a detection event), precision@3 0.5 (a fixed-k artifact; precision@expected on the same runs is 1.0), and an average of 9.56 steps per executed run. All figures were identical across five consecutive runs; CI gates a deterministic, model-free subset on every pull request, and the model-scored runs execute locally. The full ledger is on the status page.

Operations design: intent, not a report

None of the operations tooling described in this section is built or running. There is no deployment, no alert has ever fired, no dashboard exists, and no on-call rotation is in place. It is recorded now so the eventual build has a target it can be checked against; each row below is tagged [built] for what already ships or [design] for what is only designed.

The premise: this system's dangerous failures do not crash — they submit. Operating it is about noticing wrongness early, diagnosing fast, and bounding what a wrong action can cost. Operations is therefore designed as a reader of records the product already writes (steps, decisions, verify outcomes, the audit log) plus one unbuilt instrumentation point: a wrapper logging tokens, cost, latency and retries per LLM call.
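
Since the per-call wrapper is explicitly unbuilt, the following is only a sketch of what such an instrumentation point might record. It assumes the wrapped `call(prompt)` returns a dict with `text`, `prompt_tokens` and `completion_tokens`; `TransientLLMError`, `LLMCallRecord`, `logged_call` and the pricing parameter are hypothetical names introduced for illustration.

```python
import time
from dataclasses import dataclass


class TransientLLMError(Exception):
    """Hypothetical marker for a failure that is worth retrying."""


@dataclass
class LLMCallRecord:
    task_id: str
    latency_s: float
    prompt_tokens: int
    completion_tokens: int
    retries: int
    cost: float


def logged_call(call, task_id, prompt, log, max_retries=2, price_per_1k=0.0):
    """Run call(prompt), retrying transient errors, and append one record."""
    retries = 0
    start = time.perf_counter()
    while True:
        try:
            result = call(prompt)
            break
        except TransientLLMError:
            if retries >= max_retries:
                raise  # sketch limitation: a final failure leaves no record
            retries += 1
    latency = time.perf_counter() - start
    tokens = result["prompt_tokens"] + result["completion_tokens"]
    log.append(LLMCallRecord(
        task_id=task_id,
        latency_s=latency,          # wall time including any retries
        prompt_tokens=result["prompt_tokens"],
        completion_tokens=result["completion_tokens"],
        retries=retries,
        cost=tokens / 1000 * price_per_1k,
    ))
    return result["text"]
```

Records of this shape would be enough to derive the dashboard's LLM calls per task, latency percentiles and retry rates, and to feed the token-cost alert family.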

Layer	Design intent
Alerts	[design] Four families: heartbeat death · verify-failure-rate spike (the silent-malfunction tripwire) · error/retry spike · token-cost surge with a budget cap. Thresholds are deliberately unstated until a measured baseline exists.
Dashboard	[design] One screen, answerable in five seconds: live status, recent successes/failures, verify pass rate, error/retry rates, LLM calls per task, latency p50/p95. It is the alert rules made continuously visible.
Drill-down	[built] The substrate ships today: per-step timeline, intent→key inspector, screen-state rendering, an append-only audit log. [design] What remains: dashboard-to-trace navigation and the aggregation queries.
SLOs & on-call	[design] Signals: verify pass rate, completion without human rescue, replan frequency, plan latency p95. No targets are stated, because the eval figures measure a fixed suite, not an operational baseline. On-call is honestly one person; the design goal is that an alert never arrives naked, because the trace already exists one click deep.
Rollback, two levels	[built] Inside one execution: capped replan, rollback to a safe screen, bad-input short-circuit, idempotency keys, all on the product path today. [design] Above executions: rolling a wrong correction back to its prior version (the versioned, supersedes-chained storage and a quarantine status already ship); the automatic trigger is the unbuilt part.
Growth (育成) hygiene	[built] The blast-radius boundary is code-enforced today (a correction cannot move what the request grounds; measured 1.0). [design] Contamination detection inside the movable territory: scheduled eval re-runs, and a metric regression coinciding with a correction version becomes a flagged, quarantinable event. True unlearning stays unbuilt; quarantine-and-fallback is the containment until it exists.

Sections 01–04 are drawn from the implementation at the current commit, with source modules named per diagram; every figure is a published eval number or a constant the code defines. Section 05 is design intent only; see its standing note. The item-by-item implementation ledger is on the status & roadmap page.
