ADR-035: Skill ingest (GitHub-hosted SKILL.md sources)

Status: Accepted
Date: 2026-06-27
Authors: Netresearch DTT GmbH

Context

Editors want to reuse the growing ecosystem of Claude Code skills (SKILL.md files with YAML front-matter carrying name and description, plus a markdown body) inside nr-llm. These live on GitHub as a single file, as a whole repository (many SKILL.md files under skills/, .claude/skills/ or <plugin>/skills/), or behind an Anthropic marketplace.json index that points at further repositories.

Fetching attacker-influenced markdown from the public internet and later feeding it into an LLM prompt raises two separate concerns that are easy to conflate:

Server-Side Request Forgery. The existing nr-vault transport (vault->http()) already blocks internal/private/metadata targets. That guard is about where a request may go, not about who owns the target.
Supply-chain origin and integrity. Even a non-SSRF target must be a real GitHub host, and the bytes we store must be the bytes we reviewed — a moving branch ref can change content under us.

This ADR records the decisions for Plan 1a — ingest only. Skills are parsed, materialized and reviewed, but not yet injected into prompts; injection, the MM attach tables, and checksum-verify-on-injection are deferred to Plan 1b.

Decision

Dedicated entities, not extended snippets. Two new Extbase entities — SkillSource (table tx_nrllm_skill_source) and Skill (table tx_nrllm_skill) — model the ingest domain. A skill is a materialized SKILL.md; a source produces N skills. Reusing PromptSnippet (ADR-031: Tagged Prompt Snippet Library) was rejected: snippets are editor-authored fragments, skills are synced remote artifacts with their own lifecycle (sync status, checksum, orphaning).
Ingest / use split. Unit 1 is split at the MM-table seam into Plan 1a (this ADR: sources, fetch, parse, review) and Plan 1b (attach + inject). Each ships fully implemented, no stubs.
SSRF guard ≠ GitHub-origin guard. On top of the nr-vault SSRF guard, GitHubClient enforces an app-level GitHub host allowlist on the initial request URL: scheme = https AND host ∈ {github.com, raw.githubusercontent.com, api.github.com, codeload.github.com}. The transport does not follow redirects (any 3xx is treated as an error), so there is no redirect target to escape the allowlist. A rejected URL raises a typed HostNotAllowedException, never a silent skip.
Fetch by immutable commit SHA + checksum. A source ref (branch/tag) is resolved once to a commit SHA via GET /repos/{o}/{r}/commits/{ref}; that SHA is stored as pinned_sha, and all bodies are fetched at it (raw.githubusercontent.com by SHA, never by branch). A body_checksum (sha256) is computed at materialization and re-verified on injection in Plan 1b (fail-closed).
Disabled-by-default for multi-skill discovery. Every repo and marketplace skill arrives with enabled = false and must be reviewed before use. A single_file source (one explicit admin act) may default to enabled. Re-syncing an enabled skill whose recomputed body_checksum has changed auto-reverts it to disabled and surfaces the diff for re-confirmation.
Namespaced upsert, orphan-disable. identifier is namespaced "{source_uid}:{path}" so identical skill names across sources never collide. Re-sync is upsert-by-(source, identifier); a skill that disappeared upstream is marked orphaned + disabled, never silently dropped.
Admin-only management. Sources and skills live in a new backend submodule, nrllm_skills, registered with access = admin. The two tables are an escalation surface (the body becomes prompt context in 1b) and must never be granted to non-admin backend groups; sync-managed TCA fields (body_checksum, source_sha, raw_frontmatter, support_status, identifier) are read-only, and github_token is never shown in a FormEngine form.
String-backed enums + bounded JSON. SkillSourceType, SyncStatus and SupportStatus are string-backed with values() / isValid() / tryFromString() (the project's Defensive-Enum rule). raw_frontmatter and the reserved allowed_tools JSON are byte- and shape-bounded at parse time even though allowed_tools is ignored in 1a.
Explicit symfony/yaml dependency. Front-matter is parsed with Symfony\Component\Yaml\Yaml; the package is added to composer.json require explicitly rather than relied on transitively.
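
The GitHub-origin guard and the fetch-by-SHA rule above can be sketched as follows. This is a language-neutral illustration in Python, not the extension's PHP code: the helper names assert_github_origin and raw_url_for are hypothetical, the 40-character check assumes SHA-1 commit IDs, and the check is meant to run in addition to the nr-vault SSRF guard, not in place of it.

```python
from urllib.parse import urlsplit

# The four hosts named in the decision; everything else is rejected.
ALLOWED_HOSTS = frozenset({
    "github.com",
    "raw.githubusercontent.com",
    "api.github.com",
    "codeload.github.com",
})


class HostNotAllowedException(ValueError):
    """Typed rejection, mirroring the exception named in this ADR."""


def assert_github_origin(url: str) -> str:
    """Return the URL unchanged if scheme and host pass the allowlist.

    Hypothetical helper: name and signature are illustrative only.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Exact host match: no suffix or substring matching, so
    # "github.com.evil.example" does not pass.
    if parts.scheme != "https" or host not in ALLOWED_HOSTS:
        raise HostNotAllowedException(f"URL not allowed: {url!r}")
    return url


def raw_url_for(owner: str, repo: str, pinned_sha: str, path: str) -> str:
    """Build a raw-content URL addressed by commit SHA, never by branch."""
    # Assumption: full lowercase SHA-1 commit ID (40 hex characters).
    if len(pinned_sha) != 40 or any(c not in "0123456789abcdef" for c in pinned_sha):
        raise ValueError("pinned_sha must be a full 40-character hex commit SHA")
    return assert_github_origin(
        f"https://raw.githubusercontent.com/{owner}/{repo}/{pinned_sha}/{path}"
    )
```

Because the transport does not follow redirects, checking only the initial URL is sufficient in this sketch; a client that did follow redirects would have to re-run the check on every hop.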
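
The re-sync rules above (namespaced upsert, auto-disable on checksum change, orphan-disable) can be sketched as a small in-memory model. This is an illustration in Python under stated assumptions, not the Extbase implementation: the Skill dataclass, the resync function and its arguments are hypothetical, it covers only multi-skill sources (new skills always arrive disabled), and clearing the orphaned flag when a skill reappears upstream is an assumption the ADR does not spell out.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Skill:
    identifier: str          # "{source_uid}:{path}"
    body: str
    body_checksum: str       # sha256 hex digest of the body
    enabled: bool = False    # disabled by default until reviewed
    orphaned: bool = False


def checksum(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def resync(source_uid: int, stored: dict[str, Skill], fetched: dict[str, str]) -> list[str]:
    """Apply upstream state to `stored` in place.

    `fetched` maps a repository path to the SKILL.md body fetched at the
    pinned SHA. Returns the identifiers that were auto-disabled and need
    re-review.
    """
    needs_review: list[str] = []
    seen: set[str] = set()
    for path, body in fetched.items():
        identifier = f"{source_uid}:{path}"
        seen.add(identifier)
        digest = checksum(body)
        skill = stored.get(identifier)
        if skill is None:
            # New skill: materialized, but not enabled.
            stored[identifier] = Skill(identifier, body, digest)
            continue
        skill.orphaned = False  # assumption: reappearing skills are un-orphaned
        if skill.body_checksum != digest:
            if skill.enabled:
                skill.enabled = False
                needs_review.append(identifier)
            skill.body, skill.body_checksum = body, digest
    # Skills missing upstream are kept, but marked orphaned and disabled.
    for identifier, skill in stored.items():
        if identifier not in seen:
            skill.orphaned = True
            skill.enabled = False
    return needs_review
```

Keeping orphaned rows instead of deleting them is what lets a later attach table (Plan 1b) keep a stable reference while still making the upstream deletion visible.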

Consequences

● Admins reuse the GitHub skill ecosystem from inside the backend, with SHA-pinned, checksum-verified, host-allowlisted fetches.
● The SSRF guard and the GitHub-origin allowlist are independent controls, stated and tested separately — neither masks the other.
● Disabled-by-default plus auto-disable-on-change means no remote content silently enters a prompt: every enable is a deliberate admin review, and an upstream change re-opens that review.
● Orphan-disable (never drop) keeps attached skills (Plan 1b) from vanishing under an editor and makes upstream deletions visible.
◐ Two more domain entities and a new submodule increase surface area; the split from PromptSnippet is intentional and documented here and in the administration guide.
◐ On hardened instances the global HTTP/allowed_hosts SSRF list must include the four GitHub hosts, or every sync fails closed — a deliberate, documented prerequisite.
✕ support_status = partial is not a safety signal. It only flags that referenced scripts/assets are not executed (always true in 1a); the prose stays fully untrusted. The injection-time output integrity controls land in Plan 1b.