Tanomude — 頼むで

Why it exists: problems on the floor

Problem

When the core system is legacy, younger staff and newly assigned personnel don't know how to operate it; the operating procedure is tacit knowledge, with no documentation.

→ Solution

A one-line instruction becomes a concrete key-sequence plan, so the operator can file a request without knowing the screens; RAG over the operating manual and each operator's corrections accumulate the veterans' knowledge in the system.

Tech: LLM · RAG (hybrid) · pgvector · multilingual-e5 · personal corrections

Problem

A legacy core system has to be driven one screen at a time, so work slows down for anyone who is not yet proficient.

→ Solution

Instruction → plan → approval → automated key-by-key execution; the manual screen-driving disappears and the operator can focus on the decision.

Tech: LLM · structured output · screen adapter · idempotent execution

Problem

Even when the system itself is complete, if the approval criteria are informal and unwritten, requests such as business-trip applications can be sent back repeatedly.

→ Solution

The plan is shown as a grounded approval card; a reject or revise note persists as a personal correction that reshapes the next plan, so the informal approval bar gradually becomes reusable, explicit knowledge.

Tech: structured approval card · personal corrections (growth) · grounded-slot immunity

Architecture

The stack is standard; the engineering is in the mechanisms built on top of it.

Used: off-the-shelf

LLM (Gemma 4) · RAG · pgvector · multilingual-e5 · FastAPI · React · Docker

Built: engineered

grounded-slot immunity · correction / growth loop · screen-adapter seam · idempotent execution · eval harness

The real serving path: grounded plan → human gate → verified execution, with the growth loop and the eval harness around it.

Components: FastAPI · Ollama with Gemma (temperature 0) · TEI with multilingual-e5 (1024-dim) · Postgres + pgvector · React + TypeScript · fully on-premises

The agent core is a custom loop rather than a framework: existing agent frameworks were weighed against the safety and verification requirements and declined for the serving path, so that approval gating, screen-state verification, idempotency and rollback live in plain code that can be read and tested. Everything runs on-premises; nothing leaves the building.
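As a rough illustration of what "plain code that can be read and tested" might look like for the idempotency part, the sketch below keys each approved plan by a hash of its canonical form and refuses to replay the same plan twice. The names `plan_key`, `ExecutionLedger` and `run_once`, the plan shape, and the in-memory storage are all assumptions made for this example, not the project's actual code.

```python
from __future__ import annotations

import hashlib
import json
from typing import Callable


def plan_key(plan: dict) -> str:
    """Stable key for a plan: SHA-256 of its canonical JSON form."""
    canonical = json.dumps(plan, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExecutionLedger:
    """Hypothetical in-memory record of executed plans.

    A real system would persist this (for example in Postgres) so that
    a restart cannot cause a second submission.
    """

    def __init__(self) -> None:
        self._results: dict[str, str] = {}

    def run_once(self, plan: dict, execute: Callable[[dict], str]) -> tuple[str, bool]:
        """Run execute(plan) unless this exact plan has already run.

        Returns (result, executed_now).
        """
        key = plan_key(plan)
        if key in self._results:
            # Same plan seen before: return the stored result, do not replay.
            return self._results[key], False
        result = execute(plan)
        self._results[key] = result
        return result, True


# Usage: the second call with an equal plan does not reach the executor.
calls: list[dict] = []


def fake_execute(plan: dict) -> str:
    calls.append(plan)
    return "submitted"


ledger = ExecutionLedger()
plan = {"keys": ["TAB", "ENTER"], "slots": {"destination": "Osaka"}}
first = ledger.run_once(plan, fake_execute)
second = ledger.run_once(dict(reversed(list(plan.items()))), fake_execute)
```

Hashing the canonical JSON rather than the object identity means a re-sent or re-ordered copy of the same plan still maps to the same key.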

The core mechanisms

Grounded planning

The model extracts the request's slots — destination, dates, project, purpose — grounded by hybrid retrieval over the ingested operating manual (vector + full-text search, rank-fused) and by the operator's own persisted corrections. Every model response is a schema-validated structured output with retry; free-form text where a schema is required is treated as a defect.
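The text says the vector and full-text results are rank-fused but does not name the method. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption. The function name, the constant `k = 60`, and the section ids are illustrative only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists into one.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    Documents that rank well in several lists rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical hits from the two retrievers over the operating manual.
vector_hits = ["sec-3.2", "sec-1.1", "sec-4.0"]
fulltext_hits = ["sec-1.1", "sec-5.3", "sec-3.2"]
fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
```

RRF uses only ranks, so the cosine distances from the vector index and the text-search scores never need to be put on a common scale.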

Human approval

The plan is presented on a structured approval card — the analysis, the key-by-key plan, and the retrieved grounds it stands on. 承認 (approve), 修正 (revise) and 却下 (reject) are human acts, and approval is the only path to execution.
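A minimal sketch of the gate described above, assuming a card with the three parts named in the text and a single function that decides what happens next. `ApprovalCard`, `Decision` and `next_step` are hypothetical names; the only property the sketch is meant to show is that execution is reachable from approve and from nothing else.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVE = "承認"
    REVISE = "修正"
    REJECT = "却下"


@dataclass(frozen=True)
class ApprovalCard:
    analysis: str               # the model's reading of the request
    key_plan: tuple[str, ...]   # the key-by-key plan
    grounds: tuple[str, ...]    # retrieved passages the plan stands on


def next_step(card: ApprovalCard, decision: Decision, note: str = "") -> dict:
    """Map a human decision to the next action."""
    if decision is Decision.APPROVE:
        return {"action": "execute", "keys": list(card.key_plan)}
    # Revise and reject never execute; the note is kept for later use.
    action = "replan" if decision is Decision.REVISE else "stop"
    return {"action": action, "note": note}


card = ApprovalCard(
    analysis="Business trip to Osaka",
    key_plan=("F2", "TYPE:Osaka", "ENTER"),
    grounds=("manual sec-3.2",),
)
approved = next_step(card, Decision.APPROVE)
revised = next_step(card, Decision.REVISE, note="use the previous project code")
rejected = next_step(card, Decision.REJECT, note="wrong month")
```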

Execute, then verify

An approved plan replays key-by-key through a narrow screen adapter against the legacy target. After each action the screen state is asserted; a mismatch triggers replanning and rollback instead of a silent wrong submit, and bad input data is short-circuited to a person rather than retried blindly.
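The execute-then-verify step can be sketched against a toy screen. Everything here is an assumption made for illustration: the `FakeScreen` class, its key names (`F2`, `F12`, `TYPE:field=value`), and the use of the screen name as the asserted state. The real adapter and the real target will differ; the point is only that each key is followed by an assertion and that a mismatch rolls back instead of continuing.

```python
from dataclasses import dataclass, field


@dataclass
class FakeScreen:
    """Toy stand-in for the legacy screen: a screen name plus a field buffer."""
    name: str = "MENU"
    fields: dict = field(default_factory=dict)

    def send(self, key: str) -> None:
        if key == "F2" and self.name == "MENU":
            self.name = "TRIP_ENTRY"
        elif key.startswith("TYPE:") and self.name == "TRIP_ENTRY":
            field_name, value = key[5:].split("=", 1)
            self.fields[field_name] = value
        elif key == "F12":
            # Cancel: discard unsaved input and return to the menu.
            self.name, self.fields = "MENU", {}
        # Any other key is ignored by this toy screen.


class VerifyMismatch(Exception):
    """Raised when the screen is not in the state the plan expected."""


def execute_verified(screen: FakeScreen, steps: list) -> dict:
    """Replay (key, expected_screen) steps, asserting state after each key."""
    for index, (key, expected) in enumerate(steps):
        screen.send(key)
        seen = screen.name
        if seen != expected:
            screen.send("F12")  # roll back before anything is submitted
            raise VerifyMismatch(f"step {index}: expected {expected}, saw {seen}")
    return dict(screen.fields)


good = execute_verified(
    FakeScreen(),
    [("F2", "TRIP_ENTRY"), ("TYPE:destination=Osaka", "TRIP_ENTRY")],
)
```

In this sketch the caller catches `VerifyMismatch` and decides whether to replan or hand the case to a person.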

Per-operator growth (育成)

A reject or revise with a note persists a personal correction that reshapes that operator's next plan. Corrections can move only the model's inference slots: whatever the request itself grounds is correction-immune, enforced in code rather than left to the model, and a revise aimed at a grounded slot is declined with a notice.
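The rule that grounded slots are correction-immune is enforced in code, and a minimal version of such a guard might look like the following. The function name, the slot names and the notice wording are hypothetical.

```python
def apply_corrections(slots: dict, grounded: set, corrections: dict) -> tuple:
    """Apply corrections to inference slots only.

    slots:       current slot values for the plan
    grounded:    names of slots whose value comes from the request itself
    corrections: slot values proposed by the operator's persisted corrections
    Returns (new_slots, notices); grounded slots are never changed.
    """
    updated = dict(slots)
    notices = []
    for name, value in corrections.items():
        if name in grounded:
            # Declined in code, regardless of what the model would have done.
            notices.append(f"'{name}' is grounded by the request; correction declined")
            continue
        updated[name] = value
    return updated, notices


slots = {"destination": "Osaka", "project": "P-100", "overseas": False}
new_slots, notices = apply_corrections(
    slots,
    grounded={"destination"},
    corrections={"destination": "Kyoto", "project": "P-200"},
)
```

Here the request named Osaka, so the correction toward Kyoto is declined with a notice, while the inferred project code is allowed to move.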

A built-in eval harness

An in-repo eval harness scores planning, retrieval, recovery and the growth loop. It runs locally with the model; CI gates a deterministic, model-free subset on every pull request.

A gated repository

The repository runs under a gated discipline: a required CI verify check (lint, type-check, unit tests, golden-fixture parity and secret scanning), an independent automated review on every pull request, and multi-pass QC — including a five-run deterministic remeasure that re-established every published number at the public release commit.

Measured, not asserted

Every number below comes from running the eval harness locally with the model, is identical across five consecutive full runs, and was re-confirmed by an independent QC pass at the exact commit that went public.

Planning (success · routing · field accuracy): 1.0

Retrieval (recall@3 · precision@expected · MRR): 1.0

Personal-correction growth, Δ (inference-tier policy slots): 0.625

Boundary respect (a correction must not override grounded input): 1.0

The numbers that are not 1.0 are reported as they are. Recovery on transient faults is 0.5 — by design, bad input data is not blindly retried but handed to a person. Verify pass rate is 0.667 — a failed verify is a detection event that triggers rollback and replanning instead of a silent wrong submit. precision@3 is 0.5 — an artifact of fixed k=3 against expected sets smaller than three; precision@expected on the same runs is 1.0. An executed run takes 9.56 steps on average. Growth Δ 0.625 means the movable reuse-previous-project slot moves on four of four policy cases and the overseas policy slot on one of four, against a zero control baseline — the limiter is the model's judgment on the ambiguous slot, not the correction plumbing.
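Two of these figures can be reproduced with a little arithmetic. The sketch below shows why a fixed k = 3 caps precision when the expected set has fewer than three documents, and how four of four plus one of four gives 0.625. The two queries are invented for illustration and are not the harness's fixtures, and pooling the eight policy cases is an assumption about how Δ is aggregated (averaging the two per-slot rates gives the same value).

```python
from fractions import Fraction


def precision_at_k(retrieved: list, expected: set, k: int) -> Fraction:
    """Share of the top-k results that are expected; always divides by k."""
    hits = sum(1 for doc in retrieved[:k] if doc in expected)
    return Fraction(hits, k)


def precision_at_expected(retrieved: list, expected: set) -> Fraction:
    """Same idea, but the cutoff is the size of the expected set."""
    cutoff = len(expected)
    hits = sum(1 for doc in retrieved[:cutoff] if doc in expected)
    return Fraction(hits, cutoff)


# Hypothetical queries: every expected document is retrieved at the top.
queries = [
    (["a", "x", "y"], {"a"}),        # 1 expected: best possible precision@3 is 1/3
    (["b", "c", "z"], {"b", "c"}),   # 2 expected: best possible precision@3 is 2/3
]
p_at_3 = sum(precision_at_k(r, e, 3) for r, e in queries) / len(queries)
p_at_expected = sum(precision_at_expected(r, e) for r, e in queries) / len(queries)

# Growth delta: cases where the slot moved after a correction, minus the
# control baseline where nothing moved.
moved = {"reuse_previous_project": (4, 4), "overseas": (1, 4)}
moved_total = sum(hit for hit, _ in moved.values())
case_total = sum(total for _, total in moved.values())
growth_delta = Fraction(moved_total, case_total) - Fraction(0)
```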

The architecture itself was set by measurement before the product was built. A reliability spike on the target task measured free-form key-sequence generation at 7.5% exact accuracy; keeping the sequence skeleton in code and asking the model only for slot values and branch decisions raised it to 87.5% (96 runs, zero parse failures). The same spike produced the finding that shaped everything after it: across all 96 runs the model never once generated the correct abort sequence when a required field was empty — and slot-filling will faithfully assemble a hallucinated value. Safety is therefore engineered outside the model: code-level required-field guards, schema-enforced refusal, human approval before execution, and screen verification after it.
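The split described above, with the sequence skeleton in code and only slot values coming from the model, can be sketched as a template plus a required-field guard. The key names, the slot names and the skeleton itself are hypothetical; the real sequence for the target system is not given in the text.

```python
REQUIRED = ("destination", "start_date", "end_date", "project")

# The skeleton lives in code; the model never generates these keys.
SKELETON = (
    "F2",
    "TYPE:{destination}", "TAB",
    "TYPE:{start_date}", "TAB",
    "TYPE:{end_date}", "TAB",
    "TYPE:{project}",
    "ENTER",
)


class MissingRequiredField(Exception):
    """Raised by the code-level guard; the model is not asked to abort."""


def build_key_sequence(slots: dict) -> list:
    """Fill the fixed skeleton with slot values, refusing empty required slots."""
    missing = [name for name in REQUIRED if not str(slots.get(name) or "").strip()]
    if missing:
        raise MissingRequiredField(", ".join(missing))
    return [step.format(**slots) for step in SKELETON]


sequence = build_key_sequence({
    "destination": "Osaka",
    "start_date": "2025-04-01",
    "end_date": "2025-04-03",
    "project": "P-100",
    "purpose": "site visit",   # extra slots are ignored by this skeleton
})
```

The guard only catches empty values. A hallucinated but non-empty value would still be assembled faithfully, which is why the text places human approval before execution and screen verification after it.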

A plain instruction → a verified green-screen operation
