Recovering an Erigon Mainnet archive node stuck at block 25393069

This incident looked like a one-block problem.

An Ethereum Mainnet Erigon archive node was upgraded to erigontech/erigon:v3.5.0, then repeatedly failed in the execution stage around block 25393069. It was not a random peer issue or a transient disk spike. The same block kept producing a gas-used mismatch.

The recovery that worked was neither skipping the block nor deleting the whole /data directory. I kept the local snapshots, removed only /data/chaindata, and let Erigon rebuild the execution database from the existing snapshot data. After that rebuild, the node executed past 25393069, RPC reported a committed height of 25393999, and execution continued on later blocks.

Terms

Term	Meaning
Archive node	A node that keeps historical state so old receipts, traces, and state queries can be served.
Erigon	An Ethereum execution client that uses staged sync: headers, bodies, senders, execution, trie work, and indexes advance as separate stages.
Datadir	The node data directory. In this incident it was `/data`, containing snapshots, `chaindata`, logs, and downloader data.
chaindata	Erigon’s main execution database directory. Removing it forces execution state to be rebuilt, but does not remove all snapshots.
Snapshot	Prebuilt chain data that lets Erigon avoid replaying everything from genesis in the old linear way.
Gas-used mismatch	The node executes a block and computes a gas-used value that differs from the one in the block header. The execution client must reject that block.
Staged sync	Erigon’s sync model where `Headers`, `Bodies`, `Senders`, `Execution`, `TxLookup`, and `Finish` can each have their own progress.

Symptom

The node was running a normal Ethereum Mainnet archive setup:

erigontech/erigon:v3.5.0
--chain=mainnet
--datadir=/data
--prune.mode=archive

The datadir had been upgraded in place from an older version. After the upgrade, Erigon could open the database and see local snapshots, but execution could not pass the same block.

The core error was:

gas used mismatch block=25393069 header=20304193 execution=20137672
[4/6 Execution] rw exit err="invalid block, block=25393069, invalid block, gas used by execution: 20137672, in header: 20304193"
[4/6 Execution] Execution failed err="invalid block, block=25393069 ..."
Cannot update chain head err="updateForkChoice: [4/6 Execution] invalid block, block=25393069 ..."

Stage progress showed the same picture, with every stage stopped just below the failing block:

Headers     25393067
Bodies      25393067
Senders     25393067
Execution   25393067
Finish      25393067

The node was not dead. It was consistently failing near 25393069.
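
To compare the failure signature across restarts or against upstream reports, it helps to reduce the mismatch line to its three numbers. The following is a minimal sketch, assuming the log line has the `block=… header=… execution=…` shape shown above; `parse_gas_mismatch` is a hypothetical helper name, not an Erigon tool.

```shell
# Extract block, header gas, and execution gas from a "gas used mismatch"
# line and print them with the difference (header minus execution).
parse_gas_mismatch() {
  line=$1
  block=$(printf '%s\n' "$line" | sed -n 's/.*block=\([0-9][0-9]*\).*/\1/p')
  header=$(printf '%s\n' "$line" | sed -n 's/.*header=\([0-9][0-9]*\).*/\1/p')
  exec_gas=$(printf '%s\n' "$line" | sed -n 's/.*execution=\([0-9][0-9]*\).*/\1/p')
  if [ -z "$block" ] || [ -z "$header" ] || [ -z "$exec_gas" ]; then
    echo "not a gas-used mismatch line" >&2
    return 1
  fi
  echo "block=$block header=$header execution=$exec_gas delta=$((header - exec_gas))"
}

parse_gas_mismatch 'gas used mismatch block=25393069 header=20304193 execution=20137672'
```

For the line from this incident, the delta is 166521 gas. If the same block and the same delta come back after every restart, the failure is deterministic rather than transient.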

Why I did not try to skip the block

An execution client cannot safely skip a failed block.

If block 25393069 does not execute to the values in its header, every later state root, receipt, and trace is built on an untrustworthy base. For an archive node, that matters even more because historical state and trace data depend on the earlier execution state.

So the practical options were:

Prove that the problem was only a bad-block marker, an index, or a local stage issue, then repair that layer.
Give up the current execution database and rebuild it from reliable snapshots and block data.

The lighter repairs I tried first

The Erigon image includes the integration tool. First, I confirmed it was available:

docker exec erigon sh -c 'command -v integration'
docker exec erigon integration --version

Then I stopped the main node so two processes would not write to /data at the same time:

docker stop erigon

I inspected staged progress:

docker run --rm \
  -v /data:/data \
  --entrypoint integration \
  erigontech/erigon:v3.5.0 \
  print_stages --datadir=/data

Execution was at 25393067. Local block snapshots and database headers/bodies were already past that area.

Then I cleared bad-block markers:

docker run --rm \
  -v /data:/data \
  --entrypoint integration \
  erigontech/erigon:v3.5.0 \
  clear_bad_blocks --datadir=/data

That cleared the BadHeaderNumber table.

I also ran senders:

docker run --rm \
  -v /data:/data \
  --entrypoint integration \
  erigontech/erigon:v3.5.0 \
  stage_senders --datadir=/data --chain=mainnet --block=25393070

Then I tried rerunning execution after an unwind:

docker run --rm \
  -v /data:/data \
  --entrypoint integration \
  erigontech/erigon:v3.5.0 \
  stage_exec --datadir=/data --chain=mainnet --unwind=100 --block=25393070

The important lesson here is that stage_exec is not a quick fix. It can keep writing MDBX data for a long time. In this incident, the block I/O that Docker reported for the container kept growing into the hundreds of GB. That did not prove the process was hung, but it also did not produce a clean recovery within the operational window.

If the integration container is started with docker run --rm, capture the result carefully:

the container, and its logs with it, is removed as soon as it exits;
docker wait or a log-following command helps retain the final exit code and the tail of the logs.
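
One way to keep both the exit code and the tail of the logs is to run the command in the foreground through a small wrapper. This is a sketch under stated assumptions: `run_logged` is a hypothetical helper, and the log path in the usage comment is only an example.

```shell
# run_logged LOGFILE COMMAND [ARGS...]
# Runs the command in the foreground, stores stdout and stderr in LOGFILE,
# prints the last 20 lines, and returns the command's own exit status.
run_logged() {
  log=$1
  shift
  "$@" >"$log" 2>&1
  status=$?
  tail -n 20 "$log"
  echo "exit status: $status (full log: $log)"
  return "$status"
}

# Illustrative use with the stage_exec command from above (not run here):
#   run_logged /tmp/stage_exec.log docker run --rm -v /data:/data \
#     --entrypoint integration erigontech/erigon:v3.5.0 \
#     stage_exec --datadir=/data --chain=mainnet --unwind=100 --block=25393070
```

Because output goes to a file instead of the terminal, progress can be followed from a second shell with `tail -f` on the same log file.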

Why I switched to removing chaindata

A report in the upstream issue said that keeping snapshots and removing chaindata had recovered a node ^[1].

That is not the same as deleting the whole node. The target was only:

/data/chaindata

I did not remove:

/data/snapshots
/data/downloader
/data/logs

With snapshots kept, Erigon can rebuild execution state from local data. There is still a cost: Execution History, trie work, indexes, and TxLookup need to be rebuilt. But that cost is much smaller than wiping the whole datadir and fetching the snapshots again.
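
The keep/remove split can be written as a guard so that a mistyped path cannot take the snapshots with it. This is a minimal sketch, not an Erigon feature: `remove_chaindata_only` is a hypothetical helper, and it assumes the datadir layout described above (a `snapshots` directory next to `chaindata`).

```shell
# remove_chaindata_only DATADIR
# Deletes DATADIR/chaindata and nothing else. Refuses to run when DATADIR is
# empty or DATADIR/snapshots is missing, since the rebuild depends on it.
remove_chaindata_only() {
  datadir=$1
  if [ -z "$datadir" ]; then
    echo "usage: remove_chaindata_only DATADIR" >&2
    return 2
  fi
  if [ ! -d "$datadir/snapshots" ]; then
    echo "refusing: $datadir/snapshots not found" >&2
    return 1
  fi
  if [ ! -d "$datadir/chaindata" ]; then
    echo "nothing to do: $datadir/chaindata not found" >&2
    return 0
  fi
  rm -rf -- "$datadir/chaindata"
  [ ! -e "$datadir/chaindata" ] && echo chaindata_deleted
}

# Illustrative use, only after every Erigon and integration process is stopped:
#   remove_chaindata_only /data
```

The guard does not check for running writers; stopping the containers first remains a separate, manual step.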

The reset sequence

First, confirm no Erigon or integration process is still writing the datadir:

docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.Command}}'
ps -eo pid,stat,pcpu,pmem,etime,args | grep -E 'erigon|integration stage_exec' | grep -v grep

Stop the temporary execution container:

docker stop -t 60 <temporary-integration-container>

Confirm stage_exec is gone:

ps -eo args | grep -E 'integration stage_exec' | grep -v grep

Remove only chaindata:

rm -rf /data/chaindata
test ! -e /data/chaindata && echo chaindata_deleted

Start Erigon again:

docker start erigon

Right after the restart, RPC can briefly report a current block of 0x0. That is expected: the execution database has just been removed, and the stages have to advance again.

{
  "currentBlock": "0x0",
  "stages": [
    { "stage_name": "Execution", "block_number": "0x0" },
    { "stage_name": "TxLookup", "block_number": "0x0" },
    { "stage_name": "Finish", "block_number": "0x0" }
  ]
}

What recovery looked like

The first useful sign was that local snapshots were reused:

[1/6 OtterSync] Skipping SyncSnapshots, local preverified. Use snapshots reset to resync

Then Erigon downloaded execution history:

Downloading Execution History progress=30363/36158

During this phase, eth_blockNumber may not move. That alone is not a failure.
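
While eth_blockNumber is flat, the `progress=` counter is the signal to watch. The following sketch turns such a line into a whole-number percentage; it assumes the `progress=DONE/TOTAL` format shown above, and `history_progress` is a hypothetical helper name.

```shell
# history_progress LINE
# Reads "progress=DONE/TOTAL" from a log line and prints the counters
# together with the percentage, rounded down.
history_progress() {
  pair=$(printf '%s\n' "$1" | sed -n 's|.*progress=\([0-9][0-9]*/[0-9][0-9]*\).*|\1|p')
  [ -n "$pair" ] || return 1
  done_n=${pair%/*}
  total=${pair#*/}
  [ "$total" -gt 0 ] || return 1
  echo "$done_n/$total ($((done_n * 100 / total))%)"
}

history_progress 'Downloading Execution History progress=30363/36158'
```

For the line above this reports 83%. A counter that keeps rising between samples means the node is working even though the RPC height has not moved yet.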

After that, block insertion and execution resumed:

[BlockCollector] Inserting blocks from=25392000 to=25392999
[BlockCollector] Inserting blocks from=25393000 to=25393999
[4/6 Execution] parallel starting from=25392365 to=25393999

The real validation point was crossing the bad block:

[4/6 Execution] parallel executed blk=25392996
[4/6 Execution] parallel executed blk=25393199

25393199 is greater than 25393069. That meant the rebuild had passed the original failure point.
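
The same comparison can be scripted against the node log. This is a sketch that assumes the `parallel executed blk=N` line format shown above; `passed_block` is a hypothetical helper name.

```shell
# passed_block FAILED_BLOCK < logfile
# Scans "parallel executed blk=N" lines on stdin, prints the highest N,
# and succeeds only if that N is greater than FAILED_BLOCK.
passed_block() {
  failed=$1
  highest=$(sed -n 's/.*parallel executed blk=\([0-9][0-9]*\).*/\1/p' | sort -n | tail -n 1)
  if [ -z "$highest" ]; then
    echo "no executed-block lines found" >&2
    return 2
  fi
  echo "highest executed block: $highest"
  [ "$highest" -gt "$failed" ]
}

passed_block 25393069 <<'EOF'
[4/6 Execution] parallel executed blk=25392996
[4/6 Execution] parallel executed blk=25393199
EOF
```

Against a live container, the input could come from `docker logs erigon 2>&1`. A passing check is only half of the validation: the log should also be searched for a new `gas used mismatch` line.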

Later, RPC stage progress was committed:

eth_blockNumber = 0x1837b4f
Execution      = 0x1837b4f
TxLookup       = 0x1837b4f
Finish         = 0x1837b4f

0x1837b4f is 25393999. At that point, the old gas used mismatch block=25393069 error had not reappeared.
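
RPC heights are hex quantities, so the conversion is worth scripting instead of doing by eye. The sketch below converts a hex value and, as an unverified convenience, wraps an eth_blockNumber call. The RPC URL is an assumption (this write-up does not state which address the node's HTTP RPC listens on), and both function names are hypothetical.

```shell
# hex_to_dec HEX: print the decimal value of a 0x-prefixed quantity.
hex_to_dec() {
  printf '%d\n' "$1"
}

# rpc_block_number URL: ask a JSON-RPC endpoint for eth_blockNumber and print
# the height in decimal. URL is whatever address your node's HTTP RPC uses.
rpc_block_number() {
  hex=$(curl -s -X POST -H 'Content-Type: application/json' \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' "$1" |
    sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\(0x[0-9a-fA-F]*\)".*/\1/p')
  [ -n "$hex" ] || return 1
  hex_to_dec "$hex"
}

hex_to_dec 0x1837b4f
```

The last line prints 25393999, which is above the original failure point of 25393069.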

Reusable runbook

When an Erigon archive node repeatedly fails on the same execution block:

Keep the exact failure log: block number, header gas, execution gas.
Confirm the image and arguments that are actually running, not only what Terraform or the source tree says should be running.
Use eth_syncing and integration print_stages to see which stage is stuck.
Try clear_bad_blocks if the evidence points to a bad-block marker, but do not assume it will recover the node.
Before running stage_exec, make sure only one process can write the datadir.
If you reset, remove only /data/chaindata and keep snapshots.
After restart, watch Execution History, Execution, TxLookup, and Finish, not just eth_blockNumber.
Validate by checking whether the node passed the original failed block.

The most important safety boundary is simple: do not let the main Erigon process and an integration command write /data at the same time.
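
For maintenance commands launched by hand, that boundary can be backed by a simple lock. The sketch below uses an atomic `mkdir` as the lock, which is a technique swapped in for illustration and not something Erigon or the integration tool provides. `with_datadir_lock` and the lock path are hypothetical, and the lock only coordinates commands started through this wrapper: it does not stop or detect a running Erigon container.

```shell
# with_datadir_lock LOCKDIR COMMAND [ARGS...]
# mkdir is atomic, so only one caller can create LOCKDIR at a time.
# The lock is released when the command finishes, whatever its exit status.
with_datadir_lock() {
  wdl_lockdir=$1
  shift
  if ! mkdir "$wdl_lockdir" 2>/dev/null; then
    echo "another maintenance command holds $wdl_lockdir" >&2
    return 1
  fi
  "$@"
  wdl_status=$?
  rmdir "$wdl_lockdir"
  return "$wdl_status"
}

# Illustrative use (not run here):
#   with_datadir_lock /var/lock/erigon-datadir.lock docker run --rm -v /data:/data \
#     --entrypoint integration erigontech/erigon:v3.5.0 print_stages --datadir=/data
```

If the wrapper is killed before it can release the lock, the lock directory stays behind and has to be removed by hand after confirming that nothing is still writing.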

What I took from this incident

A gas-used mismatch that recurs at the same block is not something a normal restart will clear. It is usually either a client execution-rule problem or a local execution database that no longer works cleanly with the current client.

In this case, v3.5.0 still reproduced the failure at 25393069. The lighter commands could clear markers and rerun some stages, but they did not give a clean recovery quickly enough. Removing chaindata let Erigon rebuild from local snapshots and pass the bad block.

The order I would reuse is:

confirm that the failure is fixed-block;
check upstream reports for the same signature;
try low-risk inspection and marker cleanup first;
if a reset is needed, reset only the layer that needs rebuilding.

This is not an elegant fix. It is a bounded workaround with a clear verification point.