web3-infra

A Small Grafana Dashboard for Blockchain Node Operations

How to shape a Prometheus and Grafana view around machine health, RPC reachability, sync state, block height, peer count, and per-node filtering.

Jun 23, 2026
Grafana, Prometheus, node-exporter, blockchain, monitoring

Once the archive/RPC nodes were running, the next problem was visibility. Logs were enough to tell whether a node had started, but not enough for daily operation.

The dashboard needed to answer a small set of questions:

  • Is the machine healthy?
  • Is the disk filling up?
  • Is the node process reachable through RPC?
  • Is it syncing?
  • What height does it report?
  • What height does the network report?
  • How many peers does it have?
  • Which node am I looking at?

That last question shaped the dashboard. Every panel had to support filtering by node name.

Terms used here

Term | Meaning
Prometheus | A time-series database that scrapes metrics from targets.
Grafana | A dashboard UI that queries Prometheus and renders charts, tables, and status panels.
node_exporter | A Prometheus exporter for Linux CPU, memory, disk, filesystem, and network metrics.
Scrape target | An endpoint Prometheus calls on a schedule to collect metrics.
Label | A key-value tag on a metric series. Labels such as node make dashboard filtering possible.
RPC exporter | A small service that calls node RPC methods and exposes the result as Prometheus metrics.

Two layers of monitoring

The monitor has two different jobs. The first is host monitoring. The second is chain monitoring.

Host monitoring comes from node_exporter. It gives CPU, memory, disk, filesystem, and network data. This layer works even when the blockchain client is still syncing or its RPC endpoint is not reachable from the monitor.

Chain monitoring needs client metrics and RPC checks. Client metrics differ across execution clients, consensus clients, and Substrate-based nodes. RPC checks make the dashboard more consistent, because every EVM-like node can answer a small set of common calls.

The split is useful during incidents. If CPU, memory, and disk panels are healthy but RPC is down, the host is probably alive and the failure is closer to the node process, binding, firewall, or client state. If the host panels are dead too, the problem is below the client.

The metrics that mattered

The dashboard was built around these groups:

Group | Panels
Host | CPU usage, memory usage, filesystem usage, disk I/O, network traffic
Prometheus | Target up, scrape failures, scrape duration
RPC | RPC up, current block, highest block, syncing flag, peer count
Lag | Highest block minus current block
Chain-specific | Erigon, Substrate-style, and other client-native metrics where available

For the RPC layer, a small exporter is enough. It calls methods such as:

eth_blockNumber
eth_syncing
net_peerCount

For Substrate-style nodes, the equivalent checks come from system and chain RPC calls or native metrics. The exact method names differ, but the dashboard goal is the same: current height, target height, sync state, and peers.
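
As a rough illustration of the EVM-side checks, the sketch below builds the JSON-RPC request bodies and converts the raw results into numbers. It relies on two facts about the standard Ethereum JSON-RPC methods: quantities are hex strings, and eth_syncing returns false when the node is not syncing, otherwise an object with currentBlock and highestBlock. The function names and the choice to treat the local head as the target when the node is not syncing are assumptions made for this example, not part of the original exporter.

```python
import json


def build_request(method: str, request_id: int) -> str:
    """Return a JSON-RPC 2.0 request body for a method that takes no params."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": []}
    )


def parse_results(block_number: str, syncing, peer_count: str) -> dict:
    """Convert raw `result` values from the three calls into plain integers.

    block_number: result of eth_blockNumber, e.g. "0x64"
    syncing:      result of eth_syncing, either False or a dict
    peer_count:   result of net_peerCount, e.g. "0x19"
    """
    current = int(block_number, 16)
    if syncing is False:
        # Assumption for this sketch: a node that is not syncing has no
        # separate target height, so the local head is reported as the target.
        highest = current
        is_syncing = 0
    else:
        current = int(syncing["currentBlock"], 16)
        highest = int(syncing["highestBlock"], 16)
        is_syncing = 1
    return {
        "current_block": current,
        "highest_block": highest,
        "syncing": is_syncing,
        "peer_count": int(peer_count, 16),
        # Lag as defined in the metric groups: highest minus current.
        "lag": max(highest - current, 0),
    }
```

A caller would send build_request("eth_syncing", 1) to the node over HTTP and pass the result field of each response to parse_results.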

The exporter should expose neutral metrics such as:

node_rpc_up
node_rpc_current_block
node_rpc_highest_block
node_rpc_syncing
node_rpc_peer_count
node_rpc_last_success_timestamp

The prefix is less important than the labels. At minimum, each series needs node, chain, and endpoint.
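
To make the label requirement concrete, here is a minimal sketch of how an exporter could render these series in the Prometheus text exposition format, with node, chain, and endpoint attached to every sample. The render_metrics helper and the endpoint value are hypothetical; a real exporter would more likely use a Prometheus client library than format lines by hand.

```python
def render_metrics(labels: dict, values: dict) -> str:
    """Render gauge samples for one node in the Prometheus text format."""

    def escape(value: str) -> str:
        # Label values must escape backslash, double quote, and newline.
        return (
            value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )

    # Sort labels so the output is stable between scrapes.
    label_text = ",".join(
        f'{key}="{escape(str(val))}"' for key, val in sorted(labels.items())
    )
    lines = []
    for name, value in values.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical label values, for illustration only.
    labels = {
        "node": "astar",
        "chain": "astar",
        "endpoint": "http://127.0.0.1:9944",
    }
    print(render_metrics(labels, {"node_rpc_up": 1, "node_rpc_peer_count": 25}))
```

This version renders one node per call. An exporter that serves several nodes from one endpoint would need to emit each TYPE line once per metric name and group the samples under it.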

Node name as a first-class label

The dashboard filter depends on the node label. Without it, panels become a pile of instance addresses and ports.

The Prometheus target should attach a stable label:

labels:
  node: astar
  chain: astar
  role: archive-rpc
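
If targets are generated rather than written by hand, the same labels can be emitted as a file-based service discovery document, which Prometheus reads as a JSON list of objects with targets and labels keys. The sketch below is one possible way to do that. The inventory structure, the address, and the use of file-based discovery at all are assumptions for illustration; the original setup may define targets statically.

```python
import json


def file_sd_entries(nodes: list) -> list:
    """Build Prometheus file_sd entries, one target group per node."""
    return [
        {
            "targets": [node["address"]],
            "labels": {
                "node": node["node"],
                "chain": node["chain"],
                "role": node["role"],
            },
        }
        for node in nodes
    ]


if __name__ == "__main__":
    # Hypothetical inventory; the address is a placeholder.
    inventory = [
        {
            "address": "10.0.0.11:9100",
            "node": "astar",
            "chain": "astar",
            "role": "archive-rpc",
        },
    ]
    print(json.dumps(file_sd_entries(inventory), indent=2))
```

Keeping the node label in one inventory means the host target and the RPC exporter target for the same machine can share an identical label value, which is what the dashboard filter relies on.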

Grafana can then build a variable from:

label_values(up, node)

After that, panels can use:

up{node=~"$node"}

or:

node_rpc_current_block{node=~"$node"}

This is small, but it changes the dashboard from a wall of hostnames into an operational tool.
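
For checking the label outside Grafana, roughly the same list of node names can be fetched from the Prometheus HTTP API, which exposes label values at /api/v1/label/<name>/values and accepts a match[] series selector. The sketch below only builds the URL and a selector string. The base URL is a placeholder, and the selector helper assumes node names contain no regex metacharacters; it approximates, rather than reproduces, how Grafana expands a multi-value variable.

```python
from urllib.parse import urlencode


def label_values_url(base_url: str, label: str, metric: str) -> str:
    """Build the Prometheus API URL listing values of `label` on `metric`."""
    query = urlencode({"match[]": metric})
    return f"{base_url.rstrip('/')}/api/v1/label/{label}/values?{query}"


def node_selector(metric: str, nodes: list) -> str:
    """Build a PromQL selector that matches any of the given node names.

    Assumes the names are plain strings without regex metacharacters.
    """
    pattern = "|".join(nodes)
    return f'{metric}{{node=~"{pattern}"}}'


if __name__ == "__main__":
    # Placeholder Prometheus address, for illustration only.
    print(label_values_url("http://prometheus:9090", "node", "up"))
    print(node_selector("node_rpc_current_block", ["astar"]))
```

If the URL returns an empty list, the node label is missing from the scrape config, and the Grafana variable will be empty for the same reason.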

What RPC Up really means

RPC Up = 0 does not always mean the node is down. It means the monitor could not complete the RPC check.

The common cases are:

Symptom | Likely area
Host metrics up, RPC up is 0 | RPC binding, firewall, proxy, client health, or endpoint path
Host metrics down, RPC up is 0 | Instance, network, Prometheus target, or exporter failure
RPC up is 1, syncing is 1 | Node is alive but still catching up
RPC up is 1, block lag grows | Sync is too slow or stalled
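
The same triage can be written down as a small decision function, which is sometimes useful for alert annotations or runbooks. This is a sketch of the table above, not code from the original setup. The function names, the order in which the checks run, and the simple first-versus-last test for growing lag are all assumptions.

```python
def lag_is_growing(lag_samples: list) -> bool:
    """Return True when the newest lag sample exceeds the oldest one."""
    return len(lag_samples) >= 2 and lag_samples[-1] > lag_samples[0]


def likely_area(
    host_up: bool, rpc_up: bool, syncing: bool = False, lag_growing: bool = False
) -> str:
    """Map the dashboard signals to the likely failure area."""
    if not rpc_up:
        if host_up:
            return "RPC binding, firewall, proxy, client health, or endpoint path"
        return "instance, network, Prometheus target, or exporter failure"
    # RPC answered, so the remaining cases are about sync progress.
    if lag_growing:
        return "sync is too slow or stalled"
    if syncing:
        return "node is alive but still catching up"
    return "no issue indicated by these signals"


if __name__ == "__main__":
    print(likely_area(host_up=True, rpc_up=False))
    print(likely_area(True, True, syncing=True, lag_growing=lag_is_growing([120, 90, 40])))
```

Growing lag is checked before the syncing flag because a node can report syncing for hours while making no progress, and that is the case worth surfacing first.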

This matters for any EVM node whose RPC is bound to 127.0.0.1. The service can be healthy locally while the monitor, running on another host, cannot reach it directly. In that case the fix is a deliberate local collection path or proxy, not a public RPC listener.

Template choice

I looked for official and community Grafana dashboards first. The useful pieces were:

  • Node Exporter Full for Linux host metrics.
  • Erigon-oriented dashboards for execution client internals.
  • Substrate dashboards for peer and block metrics.
  • Chain-specific dashboards when the client exposes stable Prometheus names.

None of them covered the full need on its own. The fleet had mixed clients, mixed RPC behavior, and a hard requirement for node-name filtering. The final dashboard borrowed the obvious host panels and used a custom RPC layer for the common chain status.

That is a practical compromise. Official dashboards are good for client internals. A fleet dashboard needs a common language across clients.

Live updates without replacing nodes

One operational boundary was important: adding monitoring should not replace node instances. For live adjustments on the monitor host, a remote command path is safer than a broad Terraform apply when the target is only Prometheus config, Grafana provisioning, or exporter files.

The infrastructure code can still own the desired state. But while nodes are syncing, a monitoring-only change should not put unrelated EC2 resources at risk.

The dashboard I wanted

The useful first page is plain:

Row | Content
Fleet health | Prometheus target up and RPC up by node
Sync | Syncing flag, current height, highest height, block lag
Peers | Peer count by node
Host pressure | CPU, memory, filesystem, disk read/write, network
Detail | Client-specific panels for the selected node

The dashboard is less about beauty than response time. When a node is slow, I want to know whether the bottleneck is CPU, memory, disk, peers, RPC reachability, or chain sync state within the first minute.