OPS Management

The DoubleZero OPS Management portal is where contributors log and track incidents (unplanned outages) and maintenance (planned work) across the network. All tickets are visible to all contributors.

Portal: https://doublezero.xyz/ops-management

Portal vs Slack

The OPS Management portal and Slack work together. All incidents and maintenance are tracked as tickets, accessible via the portal or the API. Each ticket notifies the right Slack channels automatically and gives every contributor a shared view of what is happening on the network. Slack is where the conversation happens: sharing logs, coordinating with other contributors, and collaborating on active issues.

Tickets are the canonical record, whether created via the portal or the API. Slack threads are not: they don't update ticket status and aren't stored permanently. Always keep the ticket status current, even if the conversation is happening in Slack.

The portal and Slack serve different purposes. Use both, but for the right things.

Use the portal (or API) for...	Use Slack for...
Opening, updating, and closing tickets	Conversation and collaboration on an active issue
Recording status transitions	Sharing logs, screenshots, or starting a call
Assigning or escalating a ticket	Getting eyes on a problem quickly
Setting root cause on close	Coordinating with other contributors

Onboarding

Complete these steps once before using the portal.

1. Set Your Ops Manager Key

Register a Solana wallet pubkey as your Ops Manager key. Supported wallets: Phantom, Solflare, Coinbase Wallet.

doublezero contributor update \
  --ops-manager <OPS_MANAGER_PUBKEY> \
  --pubkey <CONTRIBUTOR_PUBKEY>

2. Connect Your Wallet on the Portal

Navigate to https://doublezero.xyz/ops-management.
Click Connect Your Wallet and select your wallet.
Sign the message to prove ownership of your Ops Manager key.

Once authenticated, the Incident Tracking Table shows.

3. Create API Keys (Optional)

For programmatic access instead of the web form:

Click Manage API Keys on the portal.
Create one or more API keys.
Download the API documentation from this page.

Incidents

An incident is an unplanned service-impacting event.

Severity Levels

Assign severity based on the impact to the DoubleZero network. You can update severity as the situation evolves.

Severity	Impact	Response
`sev1`	Full outage or major control/data plane breakage with no fallback	Drop everything immediately, even outside working hours. Escalate to DoubleZero Foundation immediately.
`sev2`	Partial but substantial impact; degraded service with possible fallback	Treat as urgent. Coordinate actively. Overnight response required for sustained degradation.
`sev3`	Limited or no user-visible impact; potential to escalate if unresolved	Top priority during working hours. Monitor closely. No after-hours escalation required unless impact increases.

Severity examples

Sev1 examples

More than 10% of user traffic blackholed on DoubleZero, no fallback to public internet
More than 80% of user onboarding, connect, or disconnect attempts failing
More than 20% of DZDs reporting interface errors
Controller returning valid but incorrect configs to DZD agents

Sev2 examples

More than 20% of users unable to send/receive traffic over DoubleZero tunnels, but failing back to public internet
0–10% of user traffic blackholed on DoubleZero without fallback
20–80% of new user onboarding, connect, or disconnect attempts failing
More than 20% of config agents failing to apply DZD config
0–20% of DZDs reporting interface errors
Upstream issues causing observability loss (monitoring/alerting down)
Onchain data pipeline down or producing incorrect data
More than 20% of internet latency collection or submission failing
Controller inaccessible by DZD agents
Controller returning invalid configs to DZDs that will not be applied

Sev3 examples

0–20% of users unable to send/receive traffic over DoubleZero tunnels, with fallback to public internet
0–20% of DZDs reporting interface errors
0–20% of DZDs experiencing config agent failures
0–20% of user onboarding, connect, or disconnect attempts failing
More than 20% of internet latency collection or submission failing for a single data provider
0–20% of internet latency collection or submission failing for all data providers
Bugs or tech debt causing alerting noise that cannot be silenced
DIA down or ledger RPC networking issues for 0–20% of devices for several hours
Low-impact issues such as minor bugs, cosmetic errors, or isolated incidents not affecting customer traffic
Small fraction of devices intermittently reporting errors without service disruption

Opening an Incident

Click Create New Record, select Type = Incident on the portal, or submit via the API.

Required:

Field	Description
`title`	Short summary (max 100 characters)
`description`	Detailed explanation (max 500 characters)
`severity`	`sev1`, `sev2`, or `sev3`
`status`	Cannot be set to a terminal state (`resolved`, `closed`) on create
Device and/or Link	At least one required. On the web form, select from a dropdown of your device and link codes. When using the API, pass the corresponding pubkeys as `device_pubkey` and/or `affected_link_pubkey`.

Optional:

Field	Description
`reporter_name` / `reporter_email`	Your contact details
`assignee`	Who is responsible for resolution
`internal_reference`	Your internal ticket ID (e.g. Jira, ServiceNow)
`start_at`	Defaults to creation time; editable

Once created, a notification is posted to the contributor incidents Slack channel with the ticket ID, severity, affected devices/links, and contributor name.

Updating an Incident

As the incident progresses, keep the ticket status current. This is the signal other contributors and DZ use to understand what's being worked on.

Status	When to set it
`open`	Initial state: issue reported, not yet being worked
`acknowledged`	You've seen it and taken ownership
`investigating`	Actively diagnosing: gathering logs, checking metrics
`mitigating`	Root cause known or suspected; applying a fix or workaround
`monitoring`	Fix applied; watching to confirm it holds
`resolved`	Issue confirmed fixed; root cause required
`closed`	Fully complete; no further action; root cause required

open → acknowledged → investigating → mitigating → monitoring → resolved → closed

You can skip statuses if appropriate. For example, jump straight from open to investigating if you immediately start working it. Always use the most accurate status for the current state.

Each status update posts a reply in the original Slack notification thread.

Closing an Incident

To move an incident to resolved or closed, a root cause must be set. You can set root cause at any earlier stage if you already know it; it becomes mandatory at close.

Code	Description
`hardware`	Hardware repair, replacement, or upgrade (SFP, NIC, cable, device)
`software`	Software or firmware fix, update, or restart
`configuration`	Configuration change, fix, or rollback
`capacity`	Congestion, capacity limits, or traffic management
`carrier`	Circuit, wavelength, or cross-connect provider issue
`network_external`	External network issue outside contributor control
`facility`	Datacenter infrastructure issue (power, cooling)
`fiber_cut`	Physical fiber damage repaired
`security`	Security incident mitigated
`human_error`	Operational mistake corrected
`false_positive`	No actual issue found after investigation
`duplicate`	Already tracked in another ticket
`self_resolved`	Issue resolved without intervention
`dz_managed`	Issue with a DoubleZero-managed software component (activator, controller, etc.)

Maintenance

A maintenance record is a planned, time-bounded activity that may affect availability. Create it in advance so other contributors can see and avoid conflicting windows.

Scheduling Maintenance

Click Create New Record > Maintenance on the portal, or submit via the API.

Required:

Field	Description
`title`	Short summary (max 100 characters)
`description`	Detailed explanation (max 500 characters)
`start_at`	Planned start time (UTC)
`end_at`	Planned end time (UTC); must be after `start_at`
Device and/or Link	At least one required. On the web form, select from a dropdown of your device and link codes. When using the API, pass the corresponding pubkeys as `device_pubkey` and/or `affected_link_pubkey`.

Once created, a notification is posted to the contributor maintenance Slack channel with the ticket ID, affected devices/links, planned window, and contributor name.

Managing Maintenance Status

Keep the status current as the window progresses.

Status	When to set it
`planned`	Scheduled, not yet started
`in-progress`	Work has begun
`completed`	Work finished successfully
`closed`	Auto-set 24 hours after `end_at`
`cancelled`	Called off before or during execution

planned → in-progress → completed → closed (auto 24h after end_at)
    ↓          ↓
    └──────────┴──→ cancelled

Permissions and Escalation

What Contributors Can Do

Create and manage tickets for their own devices and links only.
Assign tickets to themselves or escalate to DZ/Malbeclabs.
View all tickets across all contributors.

What DZ/Malbeclabs Admins Can Do

Create tickets for any contributor's devices and links.
Assign or reassign tickets between contributors.
Handle escalations and support requests.

DZX Link Ownership

DZX links connect devices from two different contributors. The A-side contributor (first device in the link name) owns the link and is the only one who can create tickets for it.

Example: For link deviceA:deviceB, the contributor who owns deviceA owns the link.

If the issue is on the Z-side:

A-side contributor creates a ticket for the DZX link.
Assign the ticket to DZ/Malbeclabs.
DZ/Malbeclabs investigates and reassigns to the Z-side contributor if needed.

We recognise this workflow is limited. Z-side contributors currently cannot create tickets for DZX links they don't own, which means coordination has to go through DZ/Malbeclabs. We are working to improve this so that both sides of a DZX link can declare incidents and maintenance independently.