cloud-sre
Why I Moved a Polygon Erigon Archive Node from v3.6.0 to v3.6.1-beta
A field note on moving a Polygon Erigon archive node from the Chicago hardfork release to the v3.6.1 beta after peer and post-hardfork sync symptoms stayed suspicious.
The first upgrade fixed the obvious failure.
v3.5.0 could not cross Polygon Mainnet block 87218600, the Chicago hardfork activation block.
Moving to 0xPolygon/erigon v3.6.0 added the Chicago rules and the node crossed that block.
The second problem was less clean.
After v3.6.0, the node still showed weak peer behavior and repeated post-hardfork sync warnings.
It was not a security group problem: public P2P was reachable, JSON-RPC stayed closed to the internet, and local RPC still worked.
The safer next move was a controlled test of v3.6.1-beta, not deleting /data and starting over.
Terms used here
| Term | Meaning |
|---|---|
| Archive node | A node that keeps historical state so old receipts, traces, and contract state can be queried. It needs much more disk than a pruned node. |
| Hardfork | A scheduled protocol rule change at a specific block. A node must run software that knows the new rules before it reaches that block. |
| Chicago hardfork | The Polygon hardfork that activates on Mainnet at block 87218600, according to the 0xPolygon/erigon v3.6.0 release notes. |
| Peer | Another node connected through P2P. Erigon needs peers to download blocks and exchange network data. |
| GoodPeers | Erigon’s view of peers that can currently serve the data it needs. A low number can be network-related, version-related, or validation-related. |
| Fork choice | The process that decides which chain head is valid. If execution marks a block invalid, fork choice cannot advance to it. |
| Datadir | The directory where Erigon stores its databases and snapshots. On this node it is /data. |
| Beta release | A pre-release build. It can contain important fixes, but it should be tested with a rollback path. |
What v3.6.0 fixed
The previous failure was deterministic:
gas used by execution: 62712118, in header: 79467913, headerNum=87218600
pos sync failed: unexpected bad block at finalized waypoint
The block number was the key.
The v3.6.0 release notes say the release includes the changes required for the Chicago hardfork and list Polygon Mainnet activation at block 87218600 [1].
After rebuilding and starting polygon-erigon:local-v3.6.0, the node printed:
Build info git_tag=v3.6.0-dirty git_commit=231d67e50b...
Initialised chain configuration ... Chicago: 87218600
It then crossed the failing block:
blk=87218538
blk=87218647
blk=87218801
That made v3.6.0 the right fix for the Chicago rule mismatch.
But it did not mean the node was finished catching up.
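One way to make "crossed the block" checkable rather than eyeballed is to pull the highest blk= value out of a log sample and compare it with the activation height. This is a minimal sketch, assuming the blk=<number> field format shown in the log lines above; the helper names highest_blk and crossed_block are hypothetical.

```shell
#!/bin/sh
# Chicago activation block on Polygon Mainnet, per the v3.6.0 release notes.
CHICAGO_BLOCK=87218600

# highest_blk (hypothetical helper): read log lines on stdin and print the
# largest value seen in a "blk=<number>" field, or nothing if none appear.
highest_blk() {
  grep -oE 'blk=[0-9]+' | cut -d= -f2 | sort -n | tail -n 1
}

# crossed_block (hypothetical helper): succeed only if the sample on stdin
# shows a block height strictly greater than the height given as $1.
crossed_block() {
  target="$1"
  highest="$(highest_blk)"
  [ -n "$highest" ] && [ "$highest" -gt "$target" ]
}

# Example use against the running container (not executed here):
#   sudo docker logs --tail 500 polygon-erigon 2>&1 \
#     | crossed_block "$CHICAGO_BLOCK" && echo "past Chicago"
```

Passing this check only says execution moved beyond the fork height. It says nothing about how far the node still is from the chain tip.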
What still looked wrong
After the v3.6.0 upgrade, the node could move forward, but the peer layer still looked weak.
The logs sometimes showed:
[sync] can't use any peers to download blocks, will try again in a bit
[p2p] No GoodPeers
Those messages can be misleading. They do not always mean the EC2 security group or Docker port mapping is wrong. In this case, the public P2P port accepted TCP connections from outside (nc -z without -u tests TCP only, so this check does not cover UDP discovery):
nc -vz <node-public-ip> 30303
Docker was also publishing the right ports:
30303/tcp -> 0.0.0.0:30303
30303/udp -> 0.0.0.0:30303
42069/udp -> 0.0.0.0:42069
8545/tcp -> 127.0.0.1:8545
6060/tcp -> 127.0.0.1:6060
That port layout is intentional. P2P stays public. JSON-RPC and metrics stay local.
The open upstream issue #154 reports the same class of symptom on v3.6.0: No GoodPeers after the upgrade, with can't use any peers to download blocks in the log [2].
That issue by itself does not prove the same failure mode on every node.
It does make one thing clear: if P2P reachability checks pass, the next suspect is the client version and its post-hardfork sync behavior.
Why I did not delete the datadir
Deleting chaindata, heimdall, or polygon-bridge is expensive on an archive node.
It turns a version problem into a long restore problem.
The local checks pointed the other way:
- The node could open /data/chaindata.
- It could download checkpoint ranges.
- It could insert fetched blocks.
- It could enter the Execution stage.
- It had already crossed 87218600.
The better test was to change only the client build and keep the same /data.
That gives a clean comparison: same disk, same snapshot, same instance, same network boundary, newer Erigon.
Why v3.6.1-beta was worth testing
v3.6.1-beta is marked as a pre-release, so it is not the default conservative choice.
The reason to test it here was the changelog.
The release notes describe it as a maintenance release with bug fixes, including:
- execution and P2P fixes around ssTxs encoding and missing hardfork blocks
- post-v3.6.0 backports
- deterministic state sync
- a P2P forkid change for Polygon-specific forks
Those items match the area where the node was still suspicious: post-Chicago execution, peer compatibility, and sync validation [3].
The startup command
The runtime shape stayed the same.
Only the image moved from polygon-erigon:local-v3.6.0 to polygon-erigon:local-v3.6.1-beta.
EXT_IP="$(curl -sS http://169.254.169.254/latest/meta-data/public-ipv4)"
sudo docker run -d \
--name polygon-erigon \
--restart unless-stopped \
--network erigon-net \
--log-driver=json-file \
--log-opt max-size=100m \
--log-opt max-file=5 \
-v /data:/data \
-p 127.0.0.1:8545:8545 \
-p 30303:30303 \
-p 30303:30303/udp \
-p 42069:42069/udp \
-p 127.0.0.1:6060:6060 \
polygon-erigon:local-v3.6.1-beta \
--chain=bor-mainnet \
--datadir=/data \
--prune.mode=archive \
--db.size.limit=12TB \
--db.pagesize=16KB \
--metrics \
--metrics.addr=0.0.0.0 \
--metrics.port=6060 \
--http \
--ws \
--http.addr=0.0.0.0 \
--http.vhosts='*' \
--http.port=8545 \
--http.api=web3,net,eth,trace \
--private.api.addr=127.0.0.1:9090 \
--torrent.port=42069 \
--bor.heimdall=https://heimdall-api.polygon.technology \
--rpc.batch.concurrency=16 \
--rpc.batch.limit=5000 \
--rpc.returndata.limit=5000000 \
--maxpeers=200 \
--log.dir.path=/data/logs \
--log.dir.prefix=erigon \
--log.dir.verbosity=info \
--verbosity=3 \
--nat=extip:${EXT_IP}
The --nat=extip:${EXT_IP} value matters.
If the advertised public IP is stale, peers may not be able to connect back correctly.
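The curl call in the startup command is an unauthenticated IMDSv1 request. On an instance that enforces IMDSv2, that request is rejected and EXT_IP may end up empty or hold an error body. The sketch below shows the IMDSv2 token flow, a loose shape check on the result, and a way to read back the address Erigon actually reported. The helper names are hypothetical, and advertised_ip assumes the "Public IP ip=<addr>" log format shown later in this note.

```shell
#!/bin/sh
# fetch_public_ipv4 (hypothetical helper): IMDSv2 flow. Request a short-lived
# session token, then send it with the metadata request. -f makes curl fail
# on HTTP errors instead of printing the error body.
fetch_public_ipv4() {
  token="$(curl -sSf -X PUT 'http://169.254.169.254/latest/api/token' \
    -H 'X-aws-ec2-metadata-token-ttl-seconds: 60')" || return 1
  curl -sSf -H "X-aws-ec2-metadata-token: ${token}" \
    'http://169.254.169.254/latest/meta-data/public-ipv4'
}

# is_ipv4 (hypothetical helper): loose shape check, four dot-separated groups
# of 1-3 digits. It does not range-check each octet.
is_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# advertised_ip (hypothetical helper): read log lines on stdin and print the
# address from the last "Public IP ip=<addr>" line, if any.
advertised_ip() {
  sed -n 's/.*Public IP ip=\([0-9.]*\).*/\1/p' | tail -n 1
}

# Example use (not executed here):
#   EXT_IP="$(fetch_public_ipv4)" && is_ipv4 "$EXT_IP" || exit 1
#   sudo docker logs polygon-erigon 2>&1 | advertised_ip
```

If the address printed by advertised_ip differs from the instance's current public IP, for example after a stop/start without an Elastic IP, the container is advertising a stale address and should be restarted with a fresh EXT_IP.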
What I checked after the switch
First, confirm the running build:
sudo docker logs --tail 100 polygon-erigon 2>&1 | grep -E 'Build info|Public IP|Starting Erigon'
The expected signs are:
Build info git_tag=v3.6.1-beta-dirty
[torrent] Public IP ip=<node-public-ip>
Starting Erigon on Bor Mainnet...
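The build check can be turned into a pass/fail test so that a stale image is caught before any sync behavior is interpreted. This is a sketch under the assumption that the "Build info git_tag=<tag>" line keeps the format shown above; running_git_tag and tag_matches are hypothetical names. The -dirty suffix is what git describe appends when the build tree had uncommitted changes, so the comparison accepts the tag with or without it.

```shell
#!/bin/sh
# running_git_tag (hypothetical helper): read log lines on stdin and print
# the git_tag value from the last "Build info" line, if any.
running_git_tag() {
  sed -n 's/.*Build info.*git_tag=\([^ ]*\).*/\1/p' | tail -n 1
}

# tag_matches (hypothetical helper): succeed if the observed tag ($1) equals
# the expected tag ($2), optionally followed by "-dirty".
tag_matches() {
  case "$1" in
    "$2"|"$2"-dirty) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use (not executed here):
#   tag="$(sudo docker logs polygon-erigon 2>&1 | running_git_tag)"
#   tag_matches "$tag" "v3.6.1-beta" || echo "unexpected build: $tag" >&2
```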
Then confirm ports:
sudo docker port polygon-erigon
The expected shape is:
8545/tcp -> 127.0.0.1:8545
6060/tcp -> 127.0.0.1:6060
30303/tcp -> 0.0.0.0:30303
30303/udp -> 0.0.0.0:30303
42069/udp -> 0.0.0.0:42069
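The port layout can also be checked mechanically. The sketch below reads docker port output and prints any mapping for the RPC or metrics ports that is bound to something other than 127.0.0.1. The helper name exposed_mappings is hypothetical. It is deliberately strict: an IPv6 loopback binding would also be reported, which is a false positive rather than a miss.

```shell
#!/bin/sh
# exposed_mappings (hypothetical helper): read `docker port` output on stdin.
# $1 is an ERE alternation of container ports, e.g. '8545/tcp|6060/tcp'.
# Prints every mapping for those ports that is NOT bound to 127.0.0.1.
# Exit status is 0 when at least one such mapping exists, non-zero otherwise.
exposed_mappings() {
  grep -E "^($1) -> " | grep -Ev ' -> 127\.0\.0\.1:[0-9]+$'
}

# Example use (not executed here):
#   if sudo docker port polygon-erigon | exposed_mappings '8545/tcp|6060/tcp'; then
#     echo "RPC or metrics is published beyond loopback" >&2
#   fi
```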
Finally, watch sync:
sudo docker logs -f polygon-erigon 2>&1 | \
grep -E 'GoodPeers|bad block|fork choice|Execution|inserting fetched blocks'
The useful post-upgrade log looked like this:
inserting fetched blocks start=87269286 end=87271077 blocks=1792
inserting fetched blocks start=87271078 end=87274917 blocks=3840
update fork choice block=87274917
GoodPeers eth69=3 eth68=3
[4/6 Execution] blk=87269385 blks=52 blk/s=2.6
Peer count was still not high, but the node was moving. That is the important distinction. A low peer count is a watch item. A low peer count plus no block insertion, no execution progress, and repeated bad-block errors is a stop condition.
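The distinction between a watch item and a stop condition can be sketched as a small classifier over a log sample. It assumes the log phrases quoted in this note ("inserting fetched blocks", "Execution] blk=", "bad block"), treats two or more bad-block lines as "repeated" (my threshold, not one stated by Erigon), and does not look at peer counts at all. The name classify_sample is hypothetical.

```shell
#!/bin/sh
# classify_sample (hypothetical helper): read a log sample on stdin and print
#   moving  if any block insertion or Execution stage line is present
#   stop    if there is no such line and "bad block" appears at least twice
#   watch   otherwise (for example, only peer warnings)
classify_sample() {
  awk '
    /inserting fetched blocks/ { inserts++ }
    /Execution\] blk=/         { execs++ }
    /bad block/                { bad++ }
    END {
      if (inserts + execs > 0) print "moving"
      else if (bad >= 2)       print "stop"
      else                     print "watch"
    }'
}

# Example use (not executed here):
#   sudo docker logs --since 10m polygon-erigon 2>&1 | classify_sample
```

A "moving" result only means the stages are emitting lines in that window. Whether the block height is actually advancing still needs two samples taken some minutes apart.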
Rollback boundary
The rollback plan stayed simple:
sudo docker stop polygon-erigon
sudo docker rm polygon-erigon
Then start the same command with:
polygon-erigon:local-v3.6.0
I would roll back if v3.6.1-beta showed any of these:
- repeated fork choice update bad block
- repeated unexpected bad block
- no Execution progress over multiple samples
- no block insertion after P2P reachability was already confirmed
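The "no Execution progress over multiple samples" criterion can be made concrete by recording the Execution stage height twice and comparing the two values. This is a minimal sketch, assuming the "[4/6 Execution] blk=<number>" line format shown earlier; last_exec_blk and exec_advanced are hypothetical names, and the sampling interval is left to the operator.

```shell
#!/bin/sh
# last_exec_blk (hypothetical helper): read log lines on stdin and print the
# blk value from the last Execution stage line, if any.
last_exec_blk() {
  sed -n 's/.*Execution] blk=\([0-9]*\).*/\1/p' | tail -n 1
}

# exec_advanced (hypothetical helper): succeed only if both samples produced
# a height and the later one ($2) is strictly greater than the earlier ($1).
exec_advanced() {
  [ -n "$1" ] && [ -n "$2" ] && [ "$2" -gt "$1" ]
}

# Example use (not executed here):
#   before="$(sudo docker logs --tail 200 polygon-erigon 2>&1 | last_exec_blk)"
#   sleep 300
#   after="$(sudo docker logs --tail 200 polygon-erigon 2>&1 | last_exec_blk)"
#   exec_advanced "$before" "$after" || echo "no Execution progress" >&2
```

An empty sample counts as no progress here, which errs toward investigating rather than assuming the node is healthy.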
In this run, the early signal was better than that. The node kept downloading, inserting, updating fork choice, and executing.
Takeaway
The move from v3.6.0 to v3.6.1-beta was not a blind upgrade.
It was a narrow test after the stable Chicago release fixed one failure but left enough P2P and post-hardfork sync symptoms to justify trying the maintenance beta.
The rules I would reuse are:
- keep /data when the database can still be opened and execution is progressing
- change one variable at a time
- keep public RPC closed
- verify the actual running image, not the source directory on disk
- treat beta builds as reversible operational tests, not as permanent assumptions