# Service Lifecycle Guide

This guide explains how to add a brand-new service to the ops-library/ops-control pairing without missing any of the moving parts (roles, documentation, tests, and metadata). Use it whenever you introduce a service such as homeassistant, minio, or any future workload that needs deploy/backup/restore/remove coverage, or an approved backup/restore exception.

Document meta

Last updated: 2026-03-13
Version: 1.5 (bump this when the checklist materially changes)
Feedback: open a GitHub issue in ops-library or mention it in the ops-control stand-up notes so we can track improvements.

## Where This Documentation Lives

- ops-library hosts the reusable, public automation and therefore the canonical documentation for how service roles are structured. Keep this guide here so contributors see it next to the role source and Sphinx docs (`just docs-build` → `just docs-serve`).
- ops-control consumes the collection. After you publish a service in ops-library you wire it into ops-control via metadata/secrets, but the “how to build roles” contract belongs in ops-library.

## Lifecycle Checklist

Follow this sequence for every service.

Example: Adding the “miniflux” service

Slug: `miniflux`
Roles: `miniflux_deploy`, `miniflux_backup`, `miniflux_restore`, `miniflux_remove`, and an optional `miniflux_shared` helper.

`services-metadata.yml` entry:

miniflux:
  description: "RSS reader"
  default_host: macmini
  capabilities: [deploy, backup, restore, remove]
  required_secrets:
    - miniflux_admin_password
    - miniflux_db_password

Service directories:

roles/
  miniflux_deploy/
  miniflux_backup/
  miniflux_restore/
  miniflux_remove/
  miniflux_shared/

Shared role inclusion:

- name: Load shared facts and paths
  ansible.builtin.include_role:
    name: local.ops_library.miniflux_shared

### 1. Define the service slug and metadata

1. Pick a short slug (e.g. `paperless`, `unifi`). This slug becomes the prefix for all ops-library roles (`paperless_deploy`, `paperless_backup`, etc.) and the key in `ops-control/services-metadata.yml`.
2. Update `ops-control/services-metadata.yml`:
   - Describe the service, default host, and capabilities. Capabilities map 1:1 to the default suffixes (`deploy`, `backup`, `restore`, `remove`) and drive the `just` helpers. Add `register` or other custom capabilities only when you provide a role for that action.
   - List `required_secrets` (even if empty) so `just create-secrets` knows which values to prompt for.
   - Specify overrides if a lifecycle uses a non-standard role name (see the `redis` example which uses `redis_install` instead of `<slug>_deploy`).
3. Create an encrypted secrets file in `ops-control/secrets/<env>/<service>.yml` so automation has a home for credentials from day one.
4. Confirm defaults such as `backup_root_prefix` in `services-metadata.yml` only when the service actually keeps a dedicated backup role. Do not assume every new service still uses the older controller-local `~/backups/<service>/` pattern; Echoport is now the primary backup/restore path for many services.

#### Choose the backup/restore disposition first

Before creating `<service>_backup` and `<service>_restore`, decide which public disposition the service should use:

- `primary`: dedicated roles remain the main public operator path
- `exception`: intentionally outside the default Echoport migration path
- `ad-hoc only`: keep callable for break-glass or manual use, but not the default workflow
- `deprecated`: retain only for compatibility while Echoport is the preferred path

For new services, default to the smallest surface that truthfully reflects what is supported:

1. If Echoport is the intended operator path, do not create `<service>_backup` / `<service>_restore` roles just to satisfy naming symmetry.
2. If the service is an approved exception (today the clearest public examples are `mail_*` and `minio_*`), document why the dedicated lifecycle remains outside the normal Echoport path.
3. If no dedicated role exists, do not advertise unsupported `backup` / `restore` capabilities in `services-metadata.yml`.
4. Document the exact operator path in the service spec and runbook, including restore-drill evidence and post-restore verification commands.
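
As an illustration of points 1 and 3, a service whose backups run through Echoport would advertise only the lifecycles it actually ships. The sketch below reuses the field names from the miniflux entry earlier in this guide. The slug `exampleservice` and all of its values are hypothetical, and how the disposition itself is recorded is not shown because this guide does not define a metadata key for it.

```yaml
# Hypothetical services-metadata.yml entry for an Echoport-backed service.
# Field names follow the miniflux example; slug and values are illustrative only.
exampleservice:
  description: "Example workload backed up through Echoport"
  default_host: macmini
  # No backup/restore here: no dedicated roles exist, so none are advertised.
  capabilities: [deploy, remove]
  required_secrets:
    - exampleservice_admin_password
```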

### 2. Create service roles inside ops-library

For each advertised capability add a role under `roles/<service>_<action>/`. Re-use shared snippets carefully:

- use `roles/<service>_shared/` only when you are creating a real shared-defaults or shared-facts surface for sibling lifecycle roles
- use a narrow internal helper role or documented task library when the reuse is implementation-only

Offer both rsync (local dev) and git (production) deployment modes whenever the codebase lives in git. If Echoport is the actual primary backup/restore path, omit the dedicated backup/restore roles and document that flow instead.

- **rsync mode** is ideal for hacking on a role locally (`just deploy-one <service> dev`). It pushes whatever is under `{{ service_repo_path }}`.
- **git mode** is for production/staging, where FastDeploy or CI clones a clean tag/branch, which reduces drift.

#### Deploy role

- Validates every required variable and secret up front with `assert`.
- Owns OS packages, users/groups, cache directories, and systemd units.
- Provides Traefik/UFW wiring, health checks, and idempotent upgrades (rsync and git modes where the service code lives in source control).
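
A minimal sketch of the up-front validation, assuming the two miniflux secrets from the metadata example are passed to the role as variables of the same name. The `CHANGEME` placeholder check and the failure message are illustrative and not taken from an existing role.

```yaml
# Sketch: fail fast before any package, user, or unit is touched.
# Conditions are evaluated in order, so "is defined" guards the checks after it.
- name: Validate required miniflux secrets
  ansible.builtin.assert:
    that:
      - miniflux_admin_password is defined
      - miniflux_admin_password | length > 0
      - miniflux_admin_password != "CHANGEME"
      - miniflux_db_password is defined
      - miniflux_db_password | length > 0
      - miniflux_db_password != "CHANGEME"
    fail_msg: "Set miniflux_admin_password and miniflux_db_password to real values before deploying."
    quiet: true
```

Placing this task first in the role keeps a misconfigured run from leaving a half-installed service behind.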

#### Backup role

- Only create a dedicated backup role when the service disposition is still `primary`, `exception`, or a deliberately retained `ad-hoc only` path.
- Dumps all state (databases, media, config) to `{{ backup_root_prefix }}/<service>/`. `backup_root_prefix` defaults to `/opt/backups` via metadata but can be overridden per host.
- Writes manifest files so restore can verify archives.
- Supports optional local fetch (rsync/scp) triggered by ops-control `just backup <service>`.
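
One way to produce such a manifest, sketched with built-in modules only. The variable `miniflux_backup_archive` (the absolute path of an archive that an earlier task already created) and the manifest layout are assumptions for illustration; existing roles may record more detail, such as per-file listings.

```yaml
- name: Compute the archive checksum
  ansible.builtin.stat:
    path: "{{ miniflux_backup_archive }}"
    checksum_algorithm: sha256
    get_checksum: true
  register: miniflux_backup_archive_stat

- name: Fail early if the archive is missing
  ansible.builtin.assert:
    that:
      - miniflux_backup_archive_stat.stat.exists
    fail_msg: "Expected archive {{ miniflux_backup_archive }} was not created."

# Hypothetical manifest layout: archive name, SHA-256 digest, and size.
- name: Write a manifest next to the archive
  ansible.builtin.copy:
    dest: "{{ miniflux_backup_archive }}.manifest.json"
    mode: "0600"
    content: "{{ manifest | to_nice_json }}"
  vars:
    manifest:
      archive: "{{ miniflux_backup_archive | basename }}"
      sha256: "{{ miniflux_backup_archive_stat.stat.checksum }}"
      size_bytes: "{{ miniflux_backup_archive_stat.stat.size }}"
```

A restore role can then recompute the digest with the same `stat` call and compare it against the manifest before touching any data.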

#### Restore role

- Only create a dedicated restore role when it remains a real supported public surface for that service.
- Validates archives/manifests and performs safety snapshots before destructive actions.
- Restores files/databases and brings services back online.
- Exposes a `restore-check` mode for dry runs when feasible.

##### Restore pilot scaffold

Use the current restore pilot scaffold only when the role actually
matches the existing host-local pattern in this repo. Today that means
`fastdeploy_restore` and `unifi_restore`, which both:

- resolve archives on the target host
- split validation, destructive restore, health verification, and cleanup into
  separate task files
- keep rollback logic inside the role instead of depending on controller-local staging

The shared parts of that scaffold now live in the internal helper role
`local.ops_library.restore_pilot_internal`. The helper covers host-local
archive/snapshot validation plus the shared `block`/`rescue`/`always`
orchestration only. Callers still own service-specific safety backup, restore,
verification, rollback, and cleanup logic.
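
The `block`/`rescue`/`always` shape described above can be sketched as follows. This is not the interface of `restore_pilot_internal`, which this guide does not document. The task-file names are hypothetical and stand in for the service-specific steps that callers own.

```yaml
- name: Restore with rollback on failure
  block:
    - name: Validate the archive on the target host
      ansible.builtin.include_tasks: validate.yml

    - name: Take a safety backup, then run the destructive restore
      ansible.builtin.include_tasks: restore.yml

    - name: Verify service health after the restore
      ansible.builtin.include_tasks: verify.yml
  rescue:
    - name: Roll back to the safety backup
      ansible.builtin.include_tasks: rollback.yml

    # Without this task the rescue would mark the failure as handled.
    - name: Re-raise so the play still reports the failure
      ansible.builtin.fail:
        msg: "Restore failed and was rolled back; inspect the task output above."
  always:
    - name: Remove temporary files
      ansible.builtin.include_tasks: cleanup.yml
```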

If that helper supports `latest` archive exclusions, treat them as soft
preferences rather than hard filters. If all discovered artifacts match the
exclusion list, the helper should fall back to the unfiltered candidates instead
of failing immediately.
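
The soft-preference behavior can be expressed with a plain Jinja2 filter chain. In this sketch `restore_discovered_archives` and `restore_latest_exclusions` are hypothetical list variables, and taking the last item after sorting assumes archive names sort chronologically (for example, through an embedded timestamp).

```yaml
- name: Require at least one discovered archive
  ansible.builtin.assert:
    that:
      - restore_discovered_archives | length > 0
    fail_msg: "No restore archives were discovered on the target host."

# Exclusions narrow the candidates only when something is left afterwards;
# otherwise the unfiltered list is used.
- name: Select the latest archive, treating exclusions as a soft preference
  ansible.builtin.set_fact:
    restore_selected_archive: "{{ (preferred if preferred | length > 0 else discovered) | sort | last }}"
  vars:
    discovered: "{{ restore_discovered_archives }}"
    preferred: "{{ restore_discovered_archives | reject('in', restore_latest_exclusions) | list }}"
```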

Do not force other restore roles into this shape just because they also have multiple task files. The following remain intentionally outside the pilot scaffold:

- controller-fallback variants: `homeassistant_restore`, `paperless_restore`
- object-storage exception: `minio_restore`
- incomplete scaffold: `nyxmon_restore`
- hybrid split restore: `minecraft_java_restore`
- controller-local or mail-adjacent narrow restores: `vaultwarden_restore`, `mail_restore`, `postfixadmin_restore`, `snappymail_restore`

Do not treat that helper as a generic restore framework. Roles whose adoption is deferred, such as those listed above, stay out of scope until they prove the same boundary in code and validation.

#### Remove role

- Deletes services safely with confirmation toggles (`service_confirm_removal`, `service_remove_data`, etc.).
- Removes users, configs, systemd units, and reverse-proxy artifacts as requested.
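
A sketch of the confirmation guard and the data toggle. It assumes the `service_` prefix in the toggle names is replaced by the slug (`miniflux_confirm_removal`, `miniflux_remove_data`) and that `miniflux_home` points at the service's data directory; check an existing remove role for the actual names.

```yaml
- name: Refuse to run without explicit confirmation
  ansible.builtin.assert:
    that:
      - miniflux_confirm_removal | default(false) | bool
    fail_msg: "Set miniflux_confirm_removal=true to remove miniflux."

# Data is kept unless the operator opts in to deleting it.
- name: Delete service data only when requested
  ansible.builtin.file:
    path: "{{ miniflux_home }}"
    state: absent
  when: miniflux_remove_data | default(false) | bool
```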

#### Optional roles

- `*_register` roles wire FastDeploy runners for ongoing maintenance.
- `*_shared` roles should be real shared-defaults or shared-facts surfaces for sibling lifecycle roles.
- Narrow internal helper roles are acceptable when multiple public roles share
  the same implementation detail but the public entrypoints must stay stable.
  Document the helper at the role level and keep the abstraction intentionally
  small.

`minio_shared` is the current example of a documented helper task library that
does not pretend to be a shared-defaults surface.

#### Role dependencies and shared infrastructure

- If a service depends on shared infrastructure (PostgreSQL, Redis, Traefik, firewall rules), document the expectation in the deploy role README and add asserts that verify the dependency (e.g., `postgres_host` reachable, Traefik dynamic config path exists).
- For hard dependencies, include or require the bootstrap roles (`postgres_install`, `redis_install`, etc.) and clearly separate which steps run on the controller vs. the target host.
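
The dependency asserts could look like the sketch below. `postgres_host` comes from the example above; `postgres_port` and `traefik_dynamic_config_dir` are hypothetical variable names. Both checks run on the target host, so they verify reachability from where the service will run.

```yaml
- name: Check that PostgreSQL accepts TCP connections
  ansible.builtin.wait_for:
    host: "{{ postgres_host }}"
    port: "{{ postgres_port | default(5432) }}"
    timeout: 10

- name: Inspect the Traefik dynamic config directory
  ansible.builtin.stat:
    path: "{{ traefik_dynamic_config_dir }}"
  register: traefik_dynamic_config_stat

- name: Require the Traefik dynamic config directory
  ansible.builtin.assert:
    that:
      - traefik_dynamic_config_stat.stat.exists
      - traefik_dynamic_config_stat.stat.isdir
    fail_msg: "Traefik dynamic config directory {{ traefik_dynamic_config_dir }} is missing; deploy Traefik first."
```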

### 3. Document each role

1. Each role requires a README based on `roles/README_TEMPLATE.md`. Include usage examples, variable tables, and recovery instructions.
2. Add Sphinx reference pages under `docs/source/roles/<category>/<role>.md` (copy from an existing role). Categories map to directories already present: `deployment`, `operations`, `removal`, `registration`, `bootstrap`, and `testing`. Update the relevant `index.md` file to include the new document so `just docs-build` picks it up. A missing toctree entry is the usual reason a role page disappears from the locally built docs.
3. Reference this guide wherever it helps (e.g., mention “see the Service Lifecycle Guide for the full checklist” inside role docs).

### 4. Testing requirements

- Integration tests live under `tests/` (playbooks such as `tests/test_<role>.yml`) and rely on `test_roles.py`. Longer-running suites or utilities can sit under `ops_library_testing/` if they need Python helpers.
- Minimum coverage per lifecycle role:
  - **Deploy:** a second run reports no changes (idempotence), the systemd unit is active, and the health endpoint is reachable.
  - **Backup:** archive + manifest created, manifests list expected files, optional fetch succeeds.
  - **Restore:** validates archive, restores files, service healthy afterwards, supports `restore-check` when available.
  - **Remove:** confirmation guard works, data removal toggles respected, no lingering systemd units.
- Restore pilot roles that are candidates for shared extraction must add role-level Molecule coverage before helper extraction starts. Current pilot validation commands:
  - `just molecule-test fastdeploy_restore`
  - `just molecule-test unifi_restore`
- That restore harness should cover archive resolution, validation-only or dry-run behavior where the role truly supports it, rollback behavior, and post-restore health verification.
- When FastDeploy metadata is involved, add a test that validates generated service descriptors (JSON schema, permissions).
- Run focused tests with `just test-role <role>` during development and `just test` before publishing so regressions surface early.
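
A deploy test playbook might check the unit and the health endpoint as sketched below. The file name follows the `tests/test_<role>.yml` pattern; the unit name `miniflux.service`, the port, and the `/healthcheck` path are assumptions about how the role configures the service. The idempotence check (a second run with no changes) is left to the test harness and is not shown.

```yaml
# tests/test_miniflux_deploy.yml (hypothetical file, sketch only)
- name: Verify miniflux deploy
  hosts: all
  become: true
  tasks:
    - name: Apply the deploy role
      ansible.builtin.include_role:
        name: local.ops_library.miniflux_deploy

    - name: Collect service state
      ansible.builtin.service_facts:

    - name: Assert the systemd unit is running
      ansible.builtin.assert:
        that:
          - "'miniflux.service' in ansible_facts.services"
          - ansible_facts.services['miniflux.service'].state == 'running'

    # Retry for a short while because the service may still be starting.
    - name: Wait for the health endpoint
      ansible.builtin.uri:
        url: "http://127.0.0.1:8080/healthcheck"
        status_code: 200
      register: miniflux_health
      retries: 5
      delay: 3
      until: miniflux_health.status == 200
```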

### 5. Wire into ops-control workflows

- Extend `group_vars` when a service needs host-specific defaults that should not be baked into the public role. Typical locations:
  - `group_vars/services/<service>.yml` for service-wide overrides (ports, hostnames, storage paths).
  - `group_vars/hosts/<hostname>.yml` for host-only tweaks (e.g., `backup_root_prefix` on storage boxes).
- Update or create `playbooks/deploy-<service>.yml`, `playbooks/remove-<service>.yml`, etc., if the service needs orchestration beyond a single role (for example, prepping external dependencies).
- Ensure the `just` commands work end-to-end:
  - `just deploy-one <service>`
  - `just backup <service>`
  - `just restore <service> [archive]`
  - `just remove-one <service>`
- If backup/restore is centralized through Echoport or another orchestrator, explicitly document that `just backup <service>` / `just restore <service>` are not the primary operator path for that service and link the runbook commands instead.
- Update ops-control docs/runbooks so operators know about the new service.
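
When a deploy needs orchestration beyond a single role, the ops-control playbook can stay thin. The sketch below assumes miniflux needs PostgreSQL bootstrapped first and targets the `macmini` host from the metadata example. How the `just` recipes select this playbook is not shown here.

```yaml
# playbooks/deploy-miniflux.yml (sketch; adjust hosts and dependencies to the service)
- name: Deploy miniflux
  hosts: macmini
  become: true
  tasks:
    - name: Bootstrap the PostgreSQL dependency first
      ansible.builtin.include_role:
        name: local.ops_library.postgres_install

    - name: Run the public deploy role
      ansible.builtin.include_role:
        name: local.ops_library.miniflux_deploy
```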

### 6. Publish and validate

1. Run `just docs-build` and `just docs-serve` inside ops-library to confirm the new role pages appear. Fix warnings immediately so the docs site stays green.
2. Bump the collection version in `galaxy.yml` following semver:
   - **Patch:** bug fixes or doc-only updates.
   - **Minor:** new roles or backward-compatible features.
   - **Major:** breaking changes (variable rename, behavior change).
   Update `CHANGELOG.md` with a short entry describing the service addition.
3. Build and install the collection in ops-control (`just install-local-library`), then re-run the relevant `just` commands as a dry run on staging/VMs. Announce the release to ops-control users (e.g., Slack #ops or the weekly summary) so they know to reinstall the collection.
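
For a new service, the version bump in step 2 is normally a minor one. The excerpt below is a sketch: the namespace and name are inferred from the `local.ops_library` role prefix used in this guide, and the version numbers are placeholders.

```yaml
# galaxy.yml (excerpt; version numbers are placeholders)
namespace: local
name: ops_library
# New roles are backward compatible: bump the minor version and reset the
# patch level, for example 1.4.2 -> 1.5.0.
version: 1.5.0
```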

## Common pitfalls

- **Missing toctree entries:** role docs will not appear in the `just docs-build` output if you forget to update the relevant `index.md`.
- **No asserts for secrets:** always validate secrets so placeholder values such as `CHANGEME` do not slip into production.
- **Hard-coded paths:** use variables (`backup_root_prefix`, `<service>_home`, `fastdeploy_services_root`, etc.) so roles stay portable.
- **Untested restore flows:** a backup without a tested restore is not complete; always add at least one restore test.
- **ops-control gaps:** ensure every capability exposed in metadata has matching `just` recipes and playbooks before calling the service “done”.
- **Capability drift under dispositions:** if backup/restore is handled by Echoport (or another centralized plane), do not leave stale `backup`/`restore` capabilities in metadata or README language that implies the dedicated roles are still primary.
- **docs-build not run:** skipping the docs build is the fastest way to ship broken navigation. Run it every time.

## Getting This Guide In Front of Contributors

- Link this guide from `CLAUDE.md` (AI context file) so any assistant tasked with “add a new service” reads it before coding.
- Mention it in code reviews and PR templates when services are added.
- Keep it updated whenever you discover a missing step; treat this as the single source of truth for service lifecycle expectations.

By following this checklist, each new service gains full lifecycle automation, documentation that ships with the public collection, and predictable hooks in ops-control. That means fewer “fail → analyze → fix → rerun” cycles and more first-attempt successes when deploying a new workload.