Fixing a Polygon Erigon Archive Node Stuck at Block 87218600
A field note on diagnosing a deterministic gas-used mismatch at Polygon's Chicago hardfork activation block and fixing it by upgrading 0xPolygon/erigon from v3.5.0 to v3.6.0.
A Polygon Mainnet Erigon archive node stopped making progress at block 87218600.
The first instinct in archive-node incidents is to suspect disk, CPU, memory, peers, or a corrupted snapshot.
In this case the cause was elsewhere: the failure was deterministic and happened exactly at Polygon’s Chicago hardfork activation block.
The fix was to upgrade the running 0xPolygon/erigon build from v3.5.0 to v3.6.0, then restart the container with the same /data directory.
Background
A blockchain node does more than download data: it re-executes every block and checks that its computed result matches the block header accepted by the network. That is why a node can have enough CPU, memory, and disk I/O and still stop at one block if its software does not understand the rules active at that block.
Polygon, like other EVM chains, changes execution rules through hardforks. A hardfork is a scheduled protocol upgrade. At a specific block height, new rules become active: gas pricing may change, precompile behavior may change, or chain-specific system transactions may be handled differently. If the client binary does not include those new rules, it may calculate a different result from the network and reject the block as invalid.
In this incident, 87218600 was the Chicago hardfork activation block for Polygon Mainnet.
The running node was still using an older v3.5.0 build.
That version could replay blocks before the activation point, but it failed when execution reached the new rule boundary.
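As a rough illustration of a rule boundary, the sketch below picks a rule set purely by block height. The helper name rules_at and the two labels are hypothetical; only the activation height 87218600 comes from this post. A binary built before the fork has no branch for the new rules, so it keeps applying the old ones past the boundary.

```shell
# Hypothetical helper: which rule set applies at a given Polygon Mainnet block.
# Only the Chicago activation height below is taken from this post.
CHICAGO_BLOCK=87218600

rules_at() {
    # $1 = block number
    if [ "$1" -ge "$CHICAGO_BLOCK" ]; then
        echo "chicago"
    else
        echo "pre-chicago"
    fi
}

rules_at 87218599   # prints: pre-chicago
rules_at 87218600   # prints: chicago
```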
Terms used in this post
| Term | Plain meaning |
|---|---|
| Archive node | A node that keeps historical state so old blocks, receipts, traces, and contract state can be queried. It needs much more storage than a pruned node. |
| Erigon | An Ethereum client implementation. This node uses Polygon’s Erigon fork, 0xPolygon/erigon, so it can follow Polygon Bor mainnet rules. |
| Bor mainnet | Polygon PoS mainnet’s execution chain. In Erigon, it is selected with --chain=bor-mainnet. |
| Hardfork | A scheduled protocol rule change at a specific block. Every node must run software that knows the new rules before reaching that block. |
| Chicago hardfork | The Polygon hardfork that, according to the 0xPolygon/erigon v3.6.0 release notes, activates at Mainnet block 87218600. |
| Gas used | The total gas the block execution consumed. The node recalculates it and compares it with the value in the block header. |
| StateSync | Polygon-specific state synchronization events. These can affect execution validation around Polygon forks. |
| Heimdall | Polygon’s consensus/checkpoint layer. Erigon can talk to a remote Heimdall endpoint using --bor.heimdall=.... |
| Fork choice | The process of deciding which chain head is valid. If execution says a block is invalid, fork choice cannot advance to it. |
| Datadir | The directory where Erigon stores databases and snapshots. Here it is /data. |
The symptom
The node repeatedly unwound to the last valid tip and crashed with a gas-used mismatch:
gas used mismatch block=87218600 header=79467913 execution=62712118
Execution failed block=87218600
err="invalid block, txnIdx=186, gas used by execution: 62712118, in header: 79467913"
pos sync failed: unexpected bad block at finalized waypoint
The useful clue was the block number: 87218600.
That block is not random. The 0xPolygon/erigon v3.6.0 release notes list Polygon Mainnet Chicago activation at block 87218600 and tell validators, RPC providers, node operators, and infrastructure partners to upgrade before the hardfork activation [1].
The practical interpretation is:
- If the node fails once at a random block, it may be data, peers, or resources.
- If it fails repeatedly at the exact hardfork activation block, the client version becomes the first suspect.
- Deleting data is risky and usually unnecessary until the running binary version has been verified.
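The triage above can be sketched as a small shell filter. It assumes the log line format quoted in this post; the function name triage_gas_mismatch and the KNOWN_FORK_BLOCKS list are illustrative, and only Chicago's height is taken from the source.

```shell
# Space-separated activation heights. Only Chicago (87218600) comes from this
# post; any other entries would have to be added from the release notes.
KNOWN_FORK_BLOCKS="87218600"

# Reads "gas used mismatch" log lines on stdin and prints a short triage summary.
triage_gas_mismatch() {
    sed -n 's/.*gas used mismatch block=\([0-9][0-9]*\) header=\([0-9][0-9]*\) execution=\([0-9][0-9]*\).*/\1 \2 \3/p' |
    while read -r block header execution; do
        # delta = gas in the header minus gas computed by local execution
        echo "block=$block header=$header execution=$execution delta=$(( $header - $execution ))"
        verdict="no known fork at this block: check data, peers, resources"
        for fork in $KNOWN_FORK_BLOCKS; do
            if [ "$block" -eq "$fork" ]; then
                verdict="fork activation block: check client version first"
            fi
        done
        echo "$verdict"
    done
}

echo 'gas used mismatch block=87218600 header=79467913 execution=62712118' | triage_gas_mismatch
# prints:
#   block=87218600 header=79467913 execution=62712118 delta=16755795
#   fork activation block: check client version first
```

On a live host the input would come from something like docker logs polygon-erigon 2>&1, since a container may write its logs to stderr.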
What was ruled out
The node was already running on archive-node class infrastructure:
| Layer | Status |
|---|---|
| Storage | 15 TB-class data volume, upgraded to high-IOPS EBS |
| Memory | 128 GB-class host after earlier tuning |
| P2P | Peers recovered after restart |
| Heimdall | Remote Heimdall scraper kept advancing |
| Snapshot data | Existing /data was readable and execution could replay up to the failing block |
After restart, the node could download and execute blocks again:
GoodPeers eth68=3
inserting fetched blocks start=87214502 end=87216293 blocks=1792
That made a local resource problem less likely. The process was not randomly failing under load; it was replaying to the same hardfork boundary and failing there.
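One way to check that a failure is deterministic rather than load-related is to count which blocks produced the mismatch. This is a minimal sketch, assuming the log format shown earlier; failing_blocks is a hypothetical helper name.

```shell
# Reads container logs on stdin and prints "<count> <block>" for every block
# that produced a gas-used mismatch. A single block with a growing count
# points at a deterministic failure, not at random load problems.
failing_blocks() {
    sed -n 's/.*gas used mismatch block=\([0-9][0-9]*\).*/\1/p' |
        sort -n | uniq -c | awk '{ print $1, $2 }'
}

printf '%s\n' \
    'gas used mismatch block=87218600 header=79467913 execution=62712118' \
    'inserting fetched blocks start=87214502 end=87216293 blocks=1792' \
    'gas used mismatch block=87218600 header=79467913 execution=62712118' |
    failing_blocks
# prints: 2 87218600
```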
The GitHub match
There were already open issues in 0xPolygon/erigon with the same class of failure:
- Issue #143 reports the same gas mismatch pattern on bor-mainnet with v3.5.0-230b11a7, and points at StateSync gas handling after Polygon forks [2].
- Issue #133 reports "pos sync failed: unexpected bad block at finalized waypoint" on the same code line [3].
- Issue #100 shows an earlier version failing with the same "gas used by execution ... in header ..." pattern [4].
Those issues did not prove the final fix by themselves, but they changed the investigation direction. The failure looked like a client/fork-rule mismatch, not a disk or peer problem.
The decisive version check
The container was still running the old build:
Build info git_tag=v3.5.0-dirty git_commit=230b11a713...
After cloning v3.6.0, the source tree existed, but the container had not yet been rebuilt or restarted with the new image.
That mistake matters: having the source on disk does nothing if Docker is still running the old image.
The correct check was:
docker logs --tail 100 polygon-erigon | grep 'Build info'
docker images | grep polygon-erigon
Before the fix, only the local-v3.5.0 image was present.
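The version check can also be scripted so it is harder to misread. The sketch below assumes the Build info line format quoted above; running_tag and tag_is are hypothetical helper names.

```shell
# Reads container logs on stdin and prints the git_tag from the most recent
# "Build info" line.
running_tag() {
    sed -n 's/.*Build info.*git_tag=\([^ ]*\).*/\1/p' | tail -n 1
}

# tag_is EXPECTED ACTUAL: succeeds when ACTUAL is EXPECTED, optionally followed
# by a "-suffix" such as "-dirty".
tag_is() {
    case "$2" in
        "$1" | "$1"-*) return 0 ;;
        *) return 1 ;;
    esac
}

tag="$(echo 'Build info git_tag=v3.5.0-dirty git_commit=230b11a713...' | running_tag)"
if tag_is v3.6.0 "$tag"; then
    echo "running $tag: ok"
else
    echo "running $tag: expected v3.6.0"
fi
# prints: running v3.5.0-dirty: expected v3.6.0
```

In practice the input would be the container log, for example docker logs --tail 100 polygon-erigon 2>&1 piped into running_tag.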
The fix
Build the v3.6.0 image:
docker stop polygon-erigon
rm -rf /opt/erigon-v3.6.0
git clone --branch v3.6.0 --single-branch https://github.com/0xPolygon/erigon.git /opt/erigon-v3.6.0
cd /opt/erigon-v3.6.0
DOCKER_BUILDKIT=1 docker build -t polygon-erigon:local-v3.6.0 .
Then remove the old container and start a new one with the same /data mount:
docker update --restart=no polygon-erigon
docker rm -f polygon-erigon
docker network create erigon-net 2>/dev/null || true
The new container was then started from the rebuilt image, which was the only runtime change:
polygon-erigon:local-v3.6.0
The archive data was not deleted. The node kept using:
--chain=bor-mainnet
--datadir=/data
--prune.mode=archive
--db.size.limit=12TB
--db.pagesize=16KB
--bor.heimdall=https://heimdall-api.polygon.technology
For RPC safety, JSON-RPC and metrics were bound to localhost at the Docker publish layer:
-p 127.0.0.1:8545:8545
-p 127.0.0.1:6060:6060
That kept local validation working without reopening public RPC.
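The effective bindings can be confirmed once the container is up. The sketch below filters the output of docker port, which prints lines of the form 8545/tcp -> 127.0.0.1:8545; non_local_bindings is a hypothetical helper name.

```shell
# Reads `docker port <container>` output on stdin. Prints every published port
# that is not bound to 127.0.0.1 and returns non-zero if any is found.
non_local_bindings() {
    awk '$2 == "->" && $3 !~ /^127\.0\.0\.1:/ { print; bad = 1 }
         END { exit bad + 0 }'
}

printf '%s\n' '8545/tcp -> 127.0.0.1:8545' '6060/tcp -> 127.0.0.1:6060' |
    non_local_bindings && echo "all published ports are localhost-only"
# prints: all published ports are localhost-only
```

Against the running container this would be docker port polygon-erigon piped into non_local_bindings; a line such as 8545/tcp -> 0.0.0.0:8545 would be printed and flagged.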
Validation
The new container showed the expected build:
Build info git_tag=v3.6.0-dirty git_commit=231d67e50b...
Initialised chain configuration ... Chicago: 87218600
At first, the node had a few temporary peer warnings:
can't use any peers to download blocks
No GoodPeers
Those were transient. The node later found peers and resumed block insertion:
GoodPeers eth68=3
inserting fetched blocks start=87214502 end=87216293 blocks=1792
The real validation was crossing the failing block:
blk=87218538
blk=87218647
blk=87218727
blk=87218801
That confirmed v3.6.0 had crossed 87218600.
There was no repeat of:
gas used mismatch block=87218600
unexpected bad block
Execution failed
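Both halves of this validation, progress past the target block and absence of the old errors, can be combined in one filter. This is a sketch under the assumption that progress lines carry a blk=N field as in the excerpts above; crossed_cleanly is a hypothetical helper name.

```shell
# crossed_cleanly TARGET: reads logs on stdin, prints the highest blk=N seen and
# whether any of the old failure messages appeared. Succeeds only if the node
# got past TARGET with no such message.
crossed_cleanly() {
    awk -v target="$1" '
        /gas used mismatch|unexpected bad block|Execution failed/ { failed = 1 }
        match($0, /blk=[0-9]+/) {
            n = substr($0, RSTART + 4, RLENGTH - 4) + 0
            if (n > max) max = n
        }
        END {
            printf "max_blk=%d failed=%d\n", max, failed
            exit !(max > target + 0 && !failed)
        }'
}

printf '%s\n' 'blk=87218538' 'blk=87218647' 'blk=87218727' 'blk=87218801' |
    crossed_cleanly 87218600
# prints: max_blk=87218801 failed=0
```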
Runbook
When a Polygon Erigon node hits this class of failure:
- Check the failing block number.
- Check the running binary version. The source directory alone is not enough.
- Compare the failing block with Polygon hardfork activation blocks.
- Search upstream issues for the exact error pattern.
- Upgrade to the release that contains the hardfork rules.
- Reuse the existing datadir unless upstream explicitly says the database format changed.
- Validate by crossing the exact failing block.
The two most useful commands are:
docker logs --tail 100 polygon-erigon | grep 'Build info'
and:
docker logs --since 15m polygon-erigon | \
grep -E 'gas used mismatch|unexpected bad block|Execution failed|polygon.sync.*crashed'
The first proves what binary is really running. The second proves whether the old failure is still present.
Why this mattered
This incident looked like a performance problem at first because the archive node had already gone through storage and instance tuning. But performance tuning cannot fix a hardfork rule mismatch.
The operational lesson is simple: when a blockchain node fails deterministically at one block, especially a known fork activation block, treat the client version as a first-class suspect.
References
- [1] 0xPolygon/erigon v3.6.0 release notes
- [2] 0xPolygon/erigon issue #143: Block 76879430 - same gas mismatch pattern
- [3] 0xPolygon/erigon issue #133: pos sync failed: unexpected bad block at finalized waypoint
- [4] 0xPolygon/erigon issue #100: gas used by execution mismatch
- [5] Polygon documentation: Erigon archive node