OPS Management
The DoubleZero OPS Management portal is where contributors log and track incidents (unplanned outages) and maintenance (planned work) across the network. All tickets are visible to all contributors.
Portal: https://doublezero.xyz/ops-management
Portal vs Slack
The OPS Management portal and Slack work together. All incidents and maintenance are tracked as tickets, accessible via the portal or the API. Each ticket notifies the right Slack channels automatically and gives every contributor a shared view of what is happening on the network. Slack is where the conversation happens: sharing logs, coordinating with other contributors, and collaborating on active issues.
Tickets are the canonical record, whether created via the portal or the API. Slack threads are not: they don't update ticket status and aren't stored permanently. Always keep the ticket status current, even if the conversation is happening in Slack.
The portal and Slack serve different purposes. Use both, but for the right things.
| Use the portal (or API) for... | Use Slack for... |
|---|---|
| Opening, updating, and closing tickets | Conversation and collaboration on an active issue |
| Recording status transitions | Sharing logs, screenshots, or starting a call |
| Assigning or escalating a ticket | Getting eyes on a problem quickly |
| Setting root cause on close | Coordinating with other contributors |
Onboarding
Complete these steps once before using the portal.
1. Set Your Ops Manager Key
Register a Solana wallet pubkey as your Ops Manager key. Supported wallets: Phantom, Solflare, Coinbase Wallet.
doublezero contributor update \
--ops-manager <OPS_MANAGER_PUBKEY> \
--pubkey <CONTRIBUTOR_PUBKEY>
2. Connect Your Wallet on the Portal
- Navigate to https://doublezero.xyz/ops-management.
- Click Connect Your Wallet and select your wallet.
- Sign the message to prove ownership of your Ops Manager key.
Once you are authenticated, the Incident Tracking Table is displayed.
3. Create API Keys (Optional)
For programmatic access instead of the web form:
- Click Manage API Keys on the portal.
- Create one or more API keys.
- Download the API documentation from this page.
Incidents
An incident is an unplanned service-impacting event.
Severity Levels
Assign severity based on the impact to the DoubleZero network. You can update severity as the situation evolves.
| Severity | Impact | Response |
|---|---|---|
| sev1 | Full outage or major control/data plane breakage with no fallback | Drop everything immediately, even outside working hours. Escalate to DoubleZero Foundation immediately. |
| sev2 | Partial but substantial impact; degraded service with possible fallback | Treat as urgent. Coordinate actively. Overnight response required for sustained degradation. |
| sev3 | Limited or no user-visible impact; potential to escalate if unresolved | Top priority during working hours. Monitor closely. No after-hours escalation required unless impact increases. |
Severity examples
Sev1 examples
- More than 10% of user traffic blackholed on DoubleZero, no fallback to public internet
- More than 80% of user onboarding, connect, or disconnect attempts failing
- More than 20% of DZDs reporting interface errors
- Controller returning valid but incorrect configs to DZD agents
Sev2 examples
- More than 20% of users unable to send/receive traffic over DoubleZero tunnels, but falling back to public internet
- 0–10% of user traffic blackholed on DoubleZero without fallback
- 20–80% of new user onboarding, connect, or disconnect attempts failing
- More than 20% of config agents failing to apply DZD config
- 0–20% of DZDs reporting interface errors
- Upstream issues causing observability loss (monitoring/alerting down)
- Onchain data pipeline down or producing incorrect data
- More than 20% of internet latency collection or submission failing
- Controller inaccessible by DZD agents
- Controller returning invalid configs to DZDs that will not be applied
Sev3 examples
- 0–20% of users unable to send/receive traffic over DoubleZero tunnels, with fallback to public internet
- 0–20% of DZDs reporting interface errors
- 0–20% of DZDs experiencing config agent failures
- 0–20% of user onboarding, connect, or disconnect attempts failing
- More than 20% of internet latency collection or submission failing for a single data provider
- 0–20% of internet latency collection or submission failing for all data providers
- Bugs or tech debt causing alerting noise that cannot be silenced
- DIA down or ledger RPC networking issues for 0–20% of devices for several hours
- Low-impact issues such as minor bugs, cosmetic errors, or isolated incidents not affecting customer traffic
- Small fraction of devices intermittently reporting errors without service disruption
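The percentage thresholds above can be turned into a small triage helper. The following is a minimal sketch for one metric only, the share of failing onboarding, connect, or disconnect attempts. The function name is hypothetical. The handling of exactly 20% is an assumption, because the listed Sev2 and Sev3 ranges overlap at that boundary.

```python
def onboarding_severity(failure_pct: float) -> str:
    """Suggest a severity from the share of failing onboarding/connect/disconnect attempts.

    Assumes an incident is already being declared; it does not decide
    whether a small failure rate warrants a ticket at all.
    """
    if not 0 <= failure_pct <= 100:
        raise ValueError("failure_pct must be between 0 and 100")
    if failure_pct > 80:
        return "sev1"  # "More than 80%" of attempts failing
    # Assumption: exactly 20% is treated as sev2 (the documented ranges overlap there).
    if failure_pct >= 20:
        return "sev2"  # "20-80%" of attempts failing
    return "sev3"      # below 20% of attempts failing


print(onboarding_severity(35))
```

Treat the result as a starting point: severity can be updated as the situation evolves, and other signals in the lists above may justify a higher level.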
Opening an Incident
On the portal, click Create New Record and select Type = Incident, or submit via the API.
Required:
| Field | Description |
|---|---|
| title | Short summary (max 100 characters) |
| description | Detailed explanation (max 500 characters) |
| severity | sev1, sev2, or sev3 |
| status | Cannot be set to a terminal state (resolved, closed) on create |
| Device and/or Link | At least one required. On the web form, select from a dropdown of your device and link codes. When using the API, pass the corresponding pubkeys as device_pubkey and/or affected_link_pubkey. |
Optional:
| Field | Description |
|---|---|
| reporter_name / reporter_email | Your contact details |
| assignee | Who is responsible for resolution |
| internal_reference | Your internal ticket ID (e.g. Jira, ServiceNow) |
| start_at | Defaults to creation time; editable |
Once created, a notification is posted to the contributor incidents Slack channel with the ticket ID, severity, affected devices/links, and contributor name.
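If you submit incidents programmatically, it can help to check the required fields locally before calling the API. The sketch below applies the rules from the tables above to a plain dictionary. The dictionary shape and the function name are assumptions for illustration. The actual request format is described in the API documentation available from the Manage API Keys page.

```python
SEVERITIES = {"sev1", "sev2", "sev3"}
# Non-terminal incident statuses; resolved and closed are not allowed on create.
CREATE_STATUSES = {"open", "acknowledged", "investigating", "mitigating", "monitoring"}


def validate_incident(payload: dict) -> list[str]:
    """Return a list of problems with a new incident; an empty list means it looks valid."""
    errors = []
    title = payload.get("title", "")
    description = payload.get("description", "")
    if not title or len(title) > 100:
        errors.append("title is required (max 100 characters)")
    if not description or len(description) > 500:
        errors.append("description is required (max 500 characters)")
    if payload.get("severity") not in SEVERITIES:
        errors.append("severity must be sev1, sev2, or sev3")
    if payload.get("status") not in CREATE_STATUSES:
        errors.append("status is required and cannot be resolved or closed on create")
    # At least one of the two pubkey fields must be present.
    if not (payload.get("device_pubkey") or payload.get("affected_link_pubkey")):
        errors.append("device_pubkey and/or affected_link_pubkey is required")
    return errors


example = {
    "title": "Uplink interface errors",
    "description": "Rising CRC errors on the uplink interface.",
    "severity": "sev3",
    "status": "open",
    "device_pubkey": "<DEVICE_PUBKEY>",  # placeholder, not a real key
}
print(validate_incident(example))
```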
Updating an Incident
As the incident progresses, keep the ticket status current. This is the signal other contributors and DZ use to understand what's being worked on.
| Status | When to set it |
|---|---|
| open | Initial state: issue reported, not yet being worked |
| acknowledged | You've seen it and taken ownership |
| investigating | Actively diagnosing: gathering logs, checking metrics |
| mitigating | Root cause known or suspected; applying a fix or workaround |
| monitoring | Fix applied; watching to confirm it holds |
| resolved | Issue confirmed fixed; root cause required |
| closed | Fully complete; no further action; root cause required |
open → acknowledged → investigating → mitigating → monitoring → resolved → closed
You can skip statuses if appropriate. For example, jump straight from open to investigating if you start working on it immediately. Always use the status that most accurately reflects the current state.
Each status update posts a reply in the original Slack notification thread.
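The status order above, including skipped steps, can be modelled as positions in a list. This is a minimal sketch with a hypothetical helper name. It only reports whether a change moves forward in the lifecycle. This page does not say whether the portal permits backward moves, so the helper does not enforce anything.

```python
INCIDENT_STATUSES = [
    "open",
    "acknowledged",
    "investigating",
    "mitigating",
    "monitoring",
    "resolved",
    "closed",
]


def is_forward_transition(current: str, new: str) -> bool:
    """True if `new` comes later in the lifecycle than `current` (skipping steps is fine)."""
    # list.index raises ValueError for a status that is not in the lifecycle.
    return INCIDENT_STATUSES.index(new) > INCIDENT_STATUSES.index(current)


print(is_forward_transition("open", "investigating"))
```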
Closing an Incident
To move an incident to resolved or closed, a root cause must be set. You can set root cause at any earlier stage if you already know it; it becomes mandatory at close.
| Code | Description |
|---|---|
| hardware | Hardware repair, replacement, or upgrade (SFP, NIC, cable, device) |
| software | Software or firmware fix, update, or restart |
| configuration | Configuration change, fix, or rollback |
| capacity | Congestion, capacity limits, or traffic management |
| carrier | Circuit, wavelength, or cross-connect provider issue |
| network_external | External network issue outside contributor control |
| facility | Datacenter infrastructure issue (power, cooling) |
| fiber_cut | Physical fiber damage repaired |
| security | Security incident mitigated |
| human_error | Operational mistake corrected |
| false_positive | No actual issue found after investigation |
| duplicate | Already tracked in another ticket |
| self_resolved | Issue resolved without intervention |
| dz_managed | Issue with a DoubleZero-managed software component (activator, controller, etc.) |
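The closing rule can be checked before a status update is sent. The sketch below uses the root cause codes from the table above. The function name and the error strings are hypothetical, and the portal or API may report these conditions differently.

```python
from typing import Optional

ROOT_CAUSE_CODES = {
    "hardware", "software", "configuration", "capacity", "carrier",
    "network_external", "facility", "fiber_cut", "security", "human_error",
    "false_positive", "duplicate", "self_resolved", "dz_managed",
}
ROOT_CAUSE_REQUIRED = {"resolved", "closed"}


def root_cause_error(new_status: str, root_cause: Optional[str]) -> Optional[str]:
    """Return a problem description, or None if the update looks acceptable."""
    if root_cause is not None and root_cause not in ROOT_CAUSE_CODES:
        return f"unknown root cause code: {root_cause}"
    if new_status in ROOT_CAUSE_REQUIRED and root_cause is None:
        return f"root cause is required to move to {new_status}"
    return None  # root cause is optional at earlier stages


print(root_cause_error("resolved", None))
```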
Maintenance
A maintenance record is a planned, time-bounded activity that may affect availability. Create it in advance so other contributors can see and avoid conflicting windows.
Scheduling Maintenance
Click Create New Record > Maintenance on the portal, or submit via the API.
Required:
| Field | Description |
|---|---|
| title | Short summary (max 100 characters) |
| description | Detailed explanation (max 500 characters) |
| start_at | Planned start time (UTC) |
| end_at | Planned end time (UTC); must be after start_at |
| Device and/or Link | At least one required. On the web form, select from a dropdown of your device and link codes. When using the API, pass the corresponding pubkeys as device_pubkey and/or affected_link_pubkey. |
Once created, a notification is posted to the contributor maintenance Slack channel with the ticket ID, affected devices/links, planned window, and contributor name.
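The two window rules, UTC times and end_at after start_at, are easy to check before submitting. This is a minimal sketch that works on Python datetime objects. The function name is hypothetical, and the timestamp format the API expects is not covered here.

```python
from datetime import datetime, timedelta, timezone


def validate_window(start_at: datetime, end_at: datetime) -> list[str]:
    """Return problems with a planned maintenance window; an empty list means it looks valid."""
    errors = []
    for name, value in (("start_at", start_at), ("end_at", end_at)):
        # utcoffset() is None for naive datetimes and non-zero for other time zones.
        if value.utcoffset() != timedelta(0):
            errors.append(f"{name} must be a timezone-aware UTC datetime")
    # Only compare once both values are known to be timezone-aware.
    if not errors and end_at <= start_at:
        errors.append("end_at must be after start_at")
    return errors


start = datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)
print(validate_window(start, start + timedelta(hours=2)))
```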
Managing Maintenance Status
Keep the status current as the window progresses.
| Status | When to set it |
|---|---|
| planned | Scheduled, not yet started |
| in-progress | Work has begun |
| completed | Work finished successfully |
| closed | Auto-set 24 hours after end_at |
| cancelled | Called off before or during execution |
planned → in-progress → completed → closed (auto 24h after end_at)
planned or in-progress → cancelled
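The maintenance lifecycle can be written as a small transition table, together with the time at which a record is closed automatically. This sketch reads the flow above strictly. The helper names are hypothetical, and the portal may accept transitions that are not listed here.

```python
from datetime import datetime, timedelta, timezone

# Transitions taken from the flow above; closed and cancelled are end states.
ALLOWED_TRANSITIONS = {
    "planned": {"in-progress", "cancelled"},
    "in-progress": {"completed", "cancelled"},
    "completed": {"closed"},
    "closed": set(),
    "cancelled": set(),
}


def can_transition(current: str, new: str) -> bool:
    """True if the flow above contains a direct step from `current` to `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def auto_close_time(end_at: datetime) -> datetime:
    """Time at which the record is set to closed: 24 hours after end_at."""
    return end_at + timedelta(hours=24)


window_end = datetime(2025, 1, 1, 4, 0, tzinfo=timezone.utc)
print(can_transition("planned", "cancelled"), auto_close_time(window_end).isoformat())
```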
Permissions and Escalation
What Contributors Can Do
- Create and manage tickets for their own devices and links only.
- Assign tickets to themselves or escalate to DZ/Malbeclabs.
- View all tickets across all contributors.
What DZ/Malbeclabs Admins Can Do
- Create tickets for any contributor's devices and links.
- Assign or reassign tickets between contributors.
- Handle escalations and support requests.
DZX Link Ownership
DZX links connect devices from two different contributors. The A-side contributor (first device in the link name) owns the link and is the only one who can create tickets for it.
Example: For link deviceA:deviceB, the contributor who owns deviceA owns the link.
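The ownership rule can be expressed as a short check. The sketch below assumes link names follow the deviceA:deviceB pattern shown in the example, with a single colon separating the A-side and Z-side device codes. Both function names are hypothetical.

```python
def link_sides(link_name: str) -> tuple[str, str]:
    """Split a DZX link name into (A-side device, Z-side device)."""
    # Assumption: the name has the form "<a_side>:<z_side>".
    a_side, sep, z_side = link_name.partition(":")
    if not sep or not a_side or not z_side:
        raise ValueError(f"unexpected link name: {link_name!r}")
    return a_side, z_side


def can_open_ticket(link_name: str, contributor_devices: set[str]) -> bool:
    """True if the contributor owns the A-side device and therefore the link."""
    a_side, _ = link_sides(link_name)
    return a_side in contributor_devices


print(can_open_ticket("deviceA:deviceB", {"deviceA"}))
```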
If the issue is on the Z-side:
- A-side contributor creates a ticket for the DZX link.
- Assign the ticket to DZ/Malbeclabs.
- DZ/Malbeclabs investigates and reassigns to the Z-side contributor if needed.
We recognise this workflow is limited. Z-side contributors currently cannot create tickets for DZX links they don't own, which means coordination has to go through DZ/Malbeclabs. We are working to improve this so that both sides of a DZX link can declare incidents and maintenance independently.