A Small Grafana Dashboard for Blockchain Node Operations
How to shape a Prometheus and Grafana view around machine health, RPC reachability, sync state, block height, peer count, and per-node filtering.
Once the archive/RPC nodes were running, the next problem was visibility. Logs were enough to tell whether a node had started, but not enough for daily operation.
The dashboard needed to answer a small set of questions:
- Is the machine healthy?
- Is the disk filling up?
- Is the node process reachable through RPC?
- Is it syncing?
- What height does it report?
- What height does the network report?
- How many peers does it have?
- Which node am I looking at?
That last question shaped the dashboard. Every panel had to support filtering by node name.
Terms used here
| Term | Meaning |
|---|---|
| Prometheus | A time-series database that scrapes metrics from targets. |
| Grafana | A dashboard UI that queries Prometheus and renders charts, tables, and status panels. |
| node_exporter | A Prometheus exporter for Linux CPU, memory, disk, filesystem, and network metrics. |
| Scrape target | An endpoint Prometheus calls on a schedule to collect metrics. |
| Label | A key-value tag on a metric series. Labels such as node make dashboard filtering possible. |
| RPC exporter | A small service that calls node RPC methods and exposes the result as Prometheus metrics. |
Two layers of monitoring
The monitor has two jobs: host monitoring and chain monitoring.
Host monitoring comes from node_exporter.
It gives CPU, memory, disk, filesystem, and network data.
This layer works even when the blockchain client is still syncing or its RPC endpoint is not reachable from the monitor.
Chain monitoring needs client metrics and RPC checks. Client metrics differ across execution clients, consensus clients, and Substrate-based nodes. RPC checks make the dashboard more consistent, because every EVM-like node can answer a small set of common calls.
The split is useful during incidents. If CPU, memory, and disk panels are healthy but RPC is down, the host is probably alive and the failure is closer to the node process, binding, firewall, or client state. If the host panels are dead too, the problem sits below the client: the instance, the network, or the scrape path itself.
The metrics that mattered
The dashboard was built around these groups:
| Group | Panels |
|---|---|
| Host | CPU usage, memory usage, filesystem usage, disk I/O, network traffic |
| Prometheus | Target up, scrape failures, scrape duration |
| RPC | RPC up, current block, highest block, syncing flag, peer count |
| Lag | Highest block minus current block |
| Chain-specific | Erigon, Substrate-style, and other client-native metrics where available |
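For the RPC layer, a small exporter is enough. It calls standard JSON-RPC methods such as:
- eth_blockNumber: the latest block number the node has, as a hex string.
- eth_syncing: false when the node considers itself synced, otherwise an object that includes currentBlock and highestBlock.
- net_peerCount: the number of connected peers, as a hex string.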
For Substrate-style nodes, the equivalent checks come from system and chain RPC calls or native metrics. The exact method names differ, but the dashboard goal is the same: current height, target height, sync state, and peers.
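The EVM-side checks can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the exporter I ran: the function names (`call_rpc`, `check_node`, and the helpers) are hypothetical, and it assumes a plain HTTP JSON-RPC endpoint without authentication.

```python
import json
import urllib.request


def rpc_payload(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request body for one method call."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params or []}


def hex_to_int(value):
    """Convert an Ethereum hex quantity such as '0x10' to an int."""
    return int(value, 16)


def parse_syncing(result, current_block):
    """Normalise eth_syncing into (syncing, current, highest).

    eth_syncing returns False when the node considers itself synced,
    otherwise an object that includes currentBlock and highestBlock.
    """
    if result is False:
        return 0, current_block, current_block
    return 1, hex_to_int(result["currentBlock"]), hex_to_int(result["highestBlock"])


def call_rpc(endpoint, method, timeout=5):
    """POST one JSON-RPC call and return its 'result' field."""
    body = json.dumps(rpc_payload(method)).encode()
    request = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["result"]


def check_node(endpoint):
    """Run the three checks; any failure is reported as up = 0."""
    try:
        current = hex_to_int(call_rpc(endpoint, "eth_blockNumber"))
        syncing, current, highest = parse_syncing(call_rpc(endpoint, "eth_syncing"), current)
        peers = hex_to_int(call_rpc(endpoint, "net_peerCount"))
    except (OSError, KeyError, ValueError, TypeError):
        # Connection errors, malformed JSON, and RPC error replies all land here.
        return {"up": 0}
    return {"up": 1, "current": current, "highest": highest,
            "syncing": syncing, "peers": peers, "lag": max(highest - current, 0)}
```

Treating every failure as `up = 0` keeps the dashboard honest: the check reports what the monitor could observe, not what the node might be doing.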
The exporter should expose neutral metrics such as:
node_rpc_up
node_rpc_current_block
node_rpc_highest_block
node_rpc_syncing
node_rpc_peer_count
node_rpc_last_success_timestamp
The prefix is less important than the labels.
At minimum, each series needs node, chain, and endpoint.
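As a rough illustration of what those series look like on the wire, the sketch below renders gauges in the Prometheus text exposition format with the three labels attached. The helper names and the endpoint value are hypothetical; a real exporter would more likely use a Prometheus client library than format strings by hand.

```python
def escape_label(value):
    """Escape backslash, double quote, and newline as the text format requires."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(labels, values):
    """Render gauge samples in the Prometheus text exposition format."""
    label_text = ",".join(f'{key}="{escape_label(val)}"' for key, val in labels.items())
    lines = []
    for name, value in values.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical label values for one node.
    labels = {"node": "astar", "chain": "astar", "endpoint": "http://127.0.0.1:9944"}
    print(render_metrics(labels, {"node_rpc_up": 1, "node_rpc_peer_count": 25}), end="")
```

Because every series carries the same label set, a lag panel can subtract one metric from another without any extra label matching.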
Node name as a first-class label
The dashboard filter depends on the node label.
Without it, panels become a pile of instance addresses and ports.
The Prometheus target should attach stable labels:
labels:
  node: astar
  chain: astar
  role: archive-rpc
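With more than a handful of nodes, writing those labels by hand gets error-prone. One option, assuming the scrape job uses Prometheus file-based service discovery (`file_sd_configs`), is to generate the target file from a small inventory. The helper names and the address below are hypothetical.

```python
import json


def file_sd_entry(address, node, chain, role):
    """One target group in the JSON shape that file-based discovery reads."""
    return {"targets": [address], "labels": {"node": node, "chain": chain, "role": role}}


def render_file_sd(entries):
    """Serialise target groups; Prometheus expects a JSON list at the top level."""
    return json.dumps(entries, indent=2)


if __name__ == "__main__":
    # Hypothetical inventory: one node_exporter target per node.
    inventory = [file_sd_entry("10.0.1.10:9100", "astar", "astar", "archive-rpc")]
    print(render_file_sd(inventory))
```

Either way, every target ends up with the same stable node label.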
Grafana can then build a variable from:
label_values(up, node)
After that, panels can use:
up{node=~"$node"}
or:
node_rpc_current_block{node=~"$node"}
This is small, but it changes the dashboard from a wall of hostnames into an operational tool.
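The same selectors can be checked outside Grafana against the Prometheus HTTP query endpoint (`/api/v1/query`). The sketch below only builds the expressions and the URL; the helper names and the Prometheus address are hypothetical, and it assumes node names contain only letters, digits, underscores, and hyphens so they are safe inside a regex matcher.

```python
import re
import urllib.parse

# Assumption: node names are plain identifiers, so no regex escaping is needed.
NODE_NAME = re.compile(r"^[A-Za-z0-9_-]+$")


def node_selector(metric, nodes):
    """Build a PromQL selector matching any of the given node names."""
    for name in nodes:
        if not NODE_NAME.match(name):
            raise ValueError(f"unsupported node name: {name!r}")
    pattern = "|".join(nodes)
    return f'{metric}{{node=~"{pattern}"}}'


def lag_expr(nodes):
    """Block lag: highest block minus current block for the selected nodes."""
    highest = node_selector("node_rpc_highest_block", nodes)
    current = node_selector("node_rpc_current_block", nodes)
    return f"{highest} - {current}"


def query_url(base_url, expr):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


if __name__ == "__main__":
    print(query_url("http://prometheus:9090", lag_expr(["astar"])))
```

The subtraction works without extra matching clauses because both metrics carry identical label sets for a given node.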
What RPC Up really means
RPC Up = 0 does not always mean the node is down.
It means the monitor could not complete the RPC check.
The common cases are:
| Symptom | Likely area |
|---|---|
| Host metrics up, RPC up is 0 | RPC binding, firewall, proxy, client health, or endpoint path |
| Host metrics down, RPC up is 0 | Instance, network, Prometheus target, or exporter failure |
| RPC up is 1, syncing is 1 | Node is alive but still catching up |
| RPC up is 1, block lag grows | Sync is too slow or stalled |
This matters most for any node whose RPC is bound to 127.0.0.1.
The service can be healthy locally while the monitor cannot reach it directly.
In that case the fix is a deliberate local collection path or proxy, for example an exporter that runs on the node host and is scraped by Prometheus, not a public RPC listener.
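The symptom table above can be written down as a small triage function, which is also a reasonable starting point for alert annotations. This is only a sketch of the table's logic: the function name, the `lag_growing` input, and the final fallback message are my own additions.

```python
def likely_area(host_up, rpc_up, syncing=0, lag_growing=False):
    """Map the dashboard's top-level signals to the area to inspect first."""
    if not rpc_up:
        if host_up:
            return "RPC binding, firewall, proxy, client health, or endpoint path"
        return "instance, network, Prometheus target, or exporter failure"
    if lag_growing:
        return "sync is too slow or stalled"
    if syncing:
        return "node is alive but still catching up"
    # Not a row in the table: nothing at the RPC level looks wrong.
    return "no RPC-level problem detected"


if __name__ == "__main__":
    print(likely_area(host_up=1, rpc_up=0))
```

Checking lag growth before the syncing flag is a deliberate choice here: a node that reports syncing while falling further behind deserves the more urgent reading.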
Template choice
I looked for official and community Grafana dashboards first. The useful pieces were:
- Node Exporter Full for Linux host metrics.
- Erigon-oriented dashboards for execution client internals.
- Substrate dashboards for peer and block metrics.
- Chain-specific dashboards when the client exposes stable Prometheus names.
None of them covered the full need on its own. The fleet had mixed clients, mixed RPC behavior, and a hard requirement for node-name filtering. The final dashboard borrowed the obvious host panels and used a custom RPC layer for the common chain status.
That is a practical compromise. Official dashboards are good for client internals. A fleet dashboard needs a common language across clients.
Live updates without replacing nodes
One operational boundary was important: adding monitoring should not replace node instances. For live adjustments on the monitor host, a remote command path is safer than a broad Terraform apply when the target is only Prometheus config, Grafana provisioning, or exporter files.
The infrastructure code can still own the desired state. But while nodes are syncing, a monitoring-only change should not put unrelated EC2 resources at risk.
The dashboard I wanted
The useful first page is plain:
| Row | Content |
|---|---|
| Fleet health | Prometheus target up and RPC up by node |
| Sync | Syncing flag, current height, highest height, block lag |
| Peers | Peer count by node |
| Host pressure | CPU, memory, filesystem, disk read/write, network |
| Detail | Client-specific panels for the selected node |
The dashboard is less about beauty than response time. When a node is slow, I want to know within the first minute whether the bottleneck is CPU, memory, disk, peers, RPC reachability, or chain sync state.