Recovering an Erigon Mainnet archive node stuck at block 25393069
A field note from an Ethereum Mainnet Erigon archive-node incident: v3.5.0 kept failing with a gas-used mismatch at block 25393069, and rebuilding only chaindata from existing snapshots recovered the node.
This incident looked like a one-block problem.
An Ethereum Mainnet Erigon archive node was upgraded to erigontech/erigon:v3.5.0, then repeatedly failed in the execution stage around block 25393069.
It was not a random peer issue or a transient disk spike.
The same block kept producing a gas-used mismatch.
The recovery that worked involved neither skipping the block nor deleting the whole /data directory.
I kept the local snapshots, removed only /data/chaindata, and let Erigon rebuild the execution database from the existing snapshot data.
After that rebuild, the node passed 25393069, committed RPC height 25393999, and continued executing later blocks.
Terms
| Term | Meaning |
|---|---|
| Archive node | A node that keeps historical state so old receipts, traces, and state queries can be served. |
| Erigon | An Ethereum execution client that uses staged sync: headers, bodies, senders, execution, trie work, and indexes advance as separate stages. |
| Datadir | The node data directory. In this incident it was /data, containing snapshots, chaindata, logs, and downloader data. |
| chaindata | Erigon’s main execution database directory. Removing it forces execution state to be rebuilt, but does not remove all snapshots. |
| Snapshot | Prebuilt chain data that lets Erigon avoid replaying everything from genesis in the old linear way. |
| Gas-used mismatch | The node executes a block and computes a gas-used value that differs from the one in the block header. The execution client must reject that block. |
| Staged sync | Erigon’s sync model where Headers, Bodies, Senders, Execution, TxLookup, and Finish can each have their own progress. |
Symptom
The node was running a normal Ethereum Mainnet archive setup:
erigontech/erigon:v3.5.0
--chain=mainnet
--datadir=/data
--prune.mode=archive
The datadir had been upgraded in place from an older version. After the upgrade, Erigon could open the database and see local snapshots, but execution could not pass the same block.
The core error was:
gas used mismatch block=25393069 header=20304193 execution=20137672
[4/6 Execution] rw exit err="invalid block, block=25393069, invalid block, gas used by execution: 20137672, in header: 20304193"
[4/6 Execution] Execution failed err="invalid block, block=25393069 ..."
Cannot update chain head err="updateForkChoice: [4/6 Execution] invalid block, block=25393069 ..."
The sync stages showed the same shape:
Headers 25393067
Bodies 25393067
Senders 25393067
Execution 25393067
Finish 25393067
The node was not dead.
It was consistently failing near 25393069.
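One way to confirm that the failure is pinned to a single block, rather than wandering between blocks, is to count the distinct block numbers in the mismatch lines. The sketch below assumes the log format shown above. The function name mismatch_blocks is hypothetical, and the container name erigon is the one used in the commands later in this note.

```shell
# Hypothetical helper: read Erigon log lines on stdin and print
# "<block> <count>" for every block that appears in a
# "gas used mismatch" line.
mismatch_blocks() {
  grep -o 'gas used mismatch block=[0-9]*' \
    | sed 's/.*block=//' \
    | sort -n \
    | uniq -c \
    | awk '{ print $2, $1 }'
}

# Example (assumes the container is named "erigon"):
#   docker logs erigon 2>&1 | mismatch_blocks
```

A single output line with a growing count suggests a fixed-block failure. Several different block numbers would point toward something less deterministic, such as storage or peer trouble.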
Why I did not try to skip the block
An execution client cannot safely skip a failed block.
If block 25393069 does not execute to the values in its header, every later state root, receipt, and trace has no trustworthy base.
For an archive node, that matters even more because historical state and trace data depend on the earlier execution state.
So the practical options were:
- Prove that the problem was only a bad-block marker, an index, or a local stage issue, then repair that layer.
- Give up the current execution database and rebuild it from reliable snapshots and block data.
The lighter repairs I tried first
The Erigon image includes the integration tool.
First, I confirmed it was available:
docker exec erigon command -v integration
docker exec erigon integration --version
Then I stopped the main node so two processes would not write to /data at the same time:
docker stop erigon
I inspected staged progress:
docker run --rm \
-v /data:/data \
--entrypoint integration \
erigontech/erigon:v3.5.0 \
print_stages --datadir=/data
Execution was at 25393067.
Local block snapshots and database headers/bodies were already past that area.
Then I cleared bad-block markers:
docker run --rm \
-v /data:/data \
--entrypoint integration \
erigontech/erigon:v3.5.0 \
clear_bad_blocks --datadir=/data
That cleared the BadHeaderNumber table.
I also ran senders:
docker run --rm \
-v /data:/data \
--entrypoint integration \
erigontech/erigon:v3.5.0 \
stage_senders --datadir=/data --chain=mainnet --block=25393070
Then I tried rerunning execution after an unwind:
docker run --rm \
-v /data:/data \
--entrypoint integration \
erigontech/erigon:v3.5.0 \
stage_exec --datadir=/data --chain=mainnet --unwind=100 --block=25393070
The important lesson here is that stage_exec is not a quick button.
It can keep writing MDBX data for a long time.
In this incident, Docker block I/O kept growing into the hundreds of GB.
That did not prove the process was hung, but it also did not give a clean recovery inside the operational window.
If the integration container is started with docker run --rm, capture the result carefully:
- the container can disappear immediately after exit;
- docker wait or a log-following command is useful for retaining the final exit code and tail logs.
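A minimal sketch of one way to keep the exit code and the last log lines: start the integration container detached and named, without --rm, then wait on it and remove it afterwards. The wrapper name run_integration, the LOG_DIR variable, and the 200-line tail are assumptions for illustration. The image and mount are the ones used in this incident.

```shell
# Hypothetical wrapper: run an integration subcommand in a named,
# detached container, wait for it to exit, save its log tail, then
# remove the container. Prints the container's exit code on stdout.
run_integration() {
  local name="$1"; shift
  local logdir="${LOG_DIR:-/tmp}"   # assumed location for saved log tails

  docker run -d --name "$name" \
    -v /data:/data \
    --entrypoint integration \
    erigontech/erigon:v3.5.0 "$@" >/dev/null || return 1

  local code
  code="$(docker wait "$name")"     # blocks until exit, prints exit code
  docker logs --tail 200 "$name" > "$logdir/$name.tail.log" 2>&1
  docker rm "$name" >/dev/null
  echo "$code"
}

# Example:
#   run_integration exec-retry stage_exec --datadir=/data --chain=mainnet --block=25393070
```

Because the container is removed only after its logs are saved, a fast failure still leaves an exit code and a log file to inspect.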
Why I switched to removing chaindata
The upstream issue had a report that keeping snapshots and removing chaindata recovered the node [1].
That is not the same as deleting the whole node. The target was only:
/data/chaindata
I did not remove:
/data/snapshots
/data/downloader
/data/logs
With snapshots kept, Erigon can rebuild execution state from local data.
There is still a cost: Execution History, trie work, indexes, and TxLookup need to be rebuilt.
But that cost is much smaller than the cost of wiping the whole datadir.
The reset sequence
First, confirm no Erigon or integration process is still writing the datadir:
docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.Command}}'
ps -eo pid,stat,pcpu,pmem,etime,args | grep -E 'erigon|integration stage_exec' | grep -v grep
Stop the temporary execution container:
docker stop -t 60 <temporary-integration-container>
Confirm stage_exec is gone:
ps -eo args | grep -E 'integration stage_exec' | grep -v grep
Remove only chaindata:
rm -rf /data/chaindata
test ! -e /data/chaindata && echo chaindata_deleted
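Since rm -rf is unforgiving, the removal can be wrapped in a small guard. This is a sketch under the assumption that the datadir layout matches the one in this note (snapshots and chaindata directly under the datadir). The function name remove_chaindata_only is hypothetical, and it does not check for running writers, so the process checks above still apply.

```shell
# Hypothetical helper: remove only <datadir>/chaindata, and only when
# <datadir>/snapshots exists, so the rebuild has local data to use.
remove_chaindata_only() {
  local datadir="${1:?usage: remove_chaindata_only <datadir>}"

  if [ ! -d "$datadir/snapshots" ]; then
    echo "refusing: $datadir/snapshots not found" >&2
    return 1
  fi
  if [ ! -d "$datadir/chaindata" ]; then
    echo "nothing to do: $datadir/chaindata not found" >&2
    return 0
  fi

  rm -rf -- "$datadir/chaindata"
  # Same confirmation as the manual command above.
  [ ! -e "$datadir/chaindata" ] && echo chaindata_deleted
}

# Example:
#   remove_chaindata_only /data
```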
Start Erigon again:
docker start erigon
Right after restart, the sync status RPC (eth_syncing) can briefly report 0x0 for the current block and for each stage.
That is expected because the execution database has just been removed and stages need to advance again.
{
"currentBlock": "0x0",
"stages": [
{ "stage_name": "Execution", "block_number": "0x0" },
{ "stage_name": "TxLookup", "block_number": "0x0" },
{ "stage_name": "Finish", "block_number": "0x0" }
]
}
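Hex heights are awkward to compare by eye, so a small filter that prints each stage with a decimal block number can help. The sketch below is a plain grep and sed parser, not a real JSON parser. It assumes that stage_name appears before block_number in each stage object, as in the output above. The function name print_stage_heights and the RPC address in the example are assumptions.

```shell
# Hypothetical helper: read an eth_syncing response on stdin and print
# "<stage> <decimal block>" for each stage entry.
print_stage_heights() {
  grep -o '"stage_name": *"[^"]*", *"block_number": *"0x[0-9a-fA-F]*"' \
    | sed 's/"stage_name": *"\([^"]*\)", *"block_number": *"\(0x[0-9a-fA-F]*\)"/\1 \2/' \
    | while read -r stage hex; do
        # printf converts the 0x-prefixed value to decimal.
        printf '%s %d\n' "$stage" "$hex"
      done
}

# Example (assumes the HTTP RPC listens on localhost:8545):
#   curl -s -H 'Content-Type: application/json' \
#     -d '{"jsonrpc":"2.0","id":1,"method":"eth_syncing","params":[]}' \
#     http://localhost:8545 | print_stage_heights
```

Running it in a loop during the rebuild shows whether Execution, TxLookup, and Finish are advancing even while eth_blockNumber stays flat.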
What recovery looked like
The first useful sign was that local snapshots were reused:
[1/6 OtterSync] Skipping SyncSnapshots, local preverified. Use snapshots reset to resync
Then Erigon downloaded execution history:
Downloading Execution History progress=30363/36158
During this phase, eth_blockNumber may not move.
That alone is not a failure.
After that, block insertion and execution resumed:
[BlockCollector] Inserting blocks from=25392000 to=25392999
[BlockCollector] Inserting blocks from=25393000 to=25393999
[4/6 Execution] parallel starting from=25392365 to=25393999
The real validation point was crossing the bad block:
[4/6 Execution] parallel executed blk=25392996
[4/6 Execution] parallel executed blk=25393199
25393199 is greater than 25393069.
That meant the rebuild had passed the original failure point.
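The same comparison can be scripted so it does not depend on someone spotting the right log line. This sketch assumes the "parallel executed blk=" log format shown above. The function name passed_failed_block is hypothetical.

```shell
# Hypothetical helper: read Erigon logs on stdin, take the highest
# "parallel executed blk=N" value, and compare it to the failed block.
# Returns 0 if execution is past the failed block, 1 if not yet,
# and 2 if no execution progress lines were found.
passed_failed_block() {
  local failed="$1" highest
  highest="$(grep -o 'parallel executed blk=[0-9]*' \
    | sed 's/.*blk=//' | sort -n | tail -n 1)"

  if [ -z "$highest" ]; then
    echo "no execution progress lines found"
    return 2
  fi
  if [ "$highest" -gt "$failed" ]; then
    echo "passed: executed $highest > failed $failed"
  else
    echo "not yet: executed $highest <= failed $failed"
    return 1
  fi
}

# Example (assumes the container is named "erigon"):
#   docker logs erigon 2>&1 | passed_failed_block 25393069
```

A passing result here only says execution moved beyond the block. The committed RPC height is still the stronger confirmation.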
Later, RPC stage progress was committed:
eth_blockNumber = 0x1837b4f
Execution = 0x1837b4f
TxLookup = 0x1837b4f
Finish = 0x1837b4f
0x1837b4f is 25393999.
At that point, the old gas used mismatch block=25393069 error had not reappeared.
Reusable runbook
When an Erigon archive node repeatedly fails on the same execution block:
- Keep the exact failure log: block number, header gas, execution gas.
- Confirm the actual running image and arguments, not only the Terraform or source-tree expectation.
- Use eth_syncing and integration print_stages to see which stage is stuck.
- Try clear_bad_blocks if the evidence points to a bad-block marker, but do not assume it will recover the node.
- Before running stage_exec, make sure only one process can write the datadir.
- If you reset, remove only /data/chaindata and keep snapshots.
- After restart, watch Execution History, Execution, TxLookup, and Finish, not just eth_blockNumber.
- Validate by checking whether the node passed the original failed block.
The most important safety boundary is simple: do not let the main Erigon process and an integration command write /data at the same time.
What I took from this incident
A fixed-block gas-used mismatch is not something a normal restart explains. It is usually either a client execution-rule problem or a local execution database that no longer works cleanly with the current client.
In this case, v3.5.0 still reproduced the failure at 25393069.
The lighter commands could clear markers and rerun some stages, but they did not give a clean recovery quickly enough.
Removing chaindata let Erigon rebuild from local snapshots and pass the bad block.
The order I would reuse is:
- confirm that the failure is fixed-block;
- check upstream reports for the same signature;
- try low-risk inspection and marker cleanup first;
- if a reset is needed, reset only the layer that needs rebuilding.
This is not an elegant fix. It is a bounded workaround with a clear verification point.