When Local RPC Works but Another EC2 Times Out
A field note on debugging a private Erigon JSON-RPC timeout across EC2 instances by separating client failure, node health, socket binding, and security group rules.
The indexer exited after retrying an Ethereum JSON-RPC call.
The error looked like a node problem at first, because the failed call was eth_getLogs over HTTP.
rpc error after retry, exiting
Post "http://<node-private-ip>:8545": dial tcp <node-private-ip>:8545: i/o timeout
The node was already synced.
eth_syncing returned false, and Erigon logs kept printing new "head validated" lines with a low block age.
The failure was between two EC2 instances, not inside the Erigon sync path.
Terms used here
| Term | Meaning |
|---|---|
| JSON-RPC | The HTTP API clients use to query chain data. Erigon commonly exposes it on TCP 8545. |
| Security group | An AWS firewall attached to an instance network interface. It controls allowed inbound and outbound traffic. |
| Source security group | A security group used as the source of an inbound rule. Traffic is allowed from network interfaces that have that source group attached. |
| Socket binding | The address a process listens on, such as 127.0.0.1:8545 for local-only or 0.0.0.0:8545 for all interfaces. |
| Timeout | The client sent a request but did not get a response in time. For private EC2 traffic, this often points to filtering or routing rather than an application error. |
Start with the exact failure
The indexer log had two useful details:
- the destination was the node private address on port 8545
- the error was i/o timeout
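A timeout is different from connection refused.
connection refused usually means the packet reached the host, but nothing was listening on that address and port, so the kernel answered with a reset.
i/o timeout usually means the packet got no answer at all.
That points to something dropping traffic silently: a security group, network ACL, route, or host firewall.
A process bound only to loopback normally shows up as connection refused from a remote client rather than a timeout, but the binding is still worth checking, because it would block the caller even after the network path is fixed.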
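The distinction can be made mechanical. The following is a minimal bash sketch, assuming the client error text contains the same phrases as the indexer log above; `classify_dial_error` and its output labels are hypothetical names chosen for illustration, not part of any tool.

```shell
# Hypothetical helper: map a dial error string to the layer worth checking first.
classify_dial_error() {
  case "$1" in
    *"connection refused"*)
      # The host answered with a reset: the packet arrived, nothing accepted it.
      echo "refused" ;;
    *"i/o timeout"*|*"timed out"*)
      # No answer at all: something on the path dropped the packet silently.
      echo "filtered" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Passing the indexer's error line to this function would print `filtered`, which is the cue to look at the network path before the node process.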
Commands used during the check
These are the commands that shaped the investigation. The identifiers are placeholders.
Find the node and caller instances:
aws ec2 describe-instances \
--profile <aws-profile> \
--region <region> \
--instance-ids <node-instance-id> \
--query 'Reservations[].Instances[].{InstanceId:InstanceId,Name:Tags[?Key==`Name`]|[0].Value,PrivateIp:PrivateIpAddress,SecurityGroups:SecurityGroups[].GroupId,Subnet:SubnetId,Vpc:VpcId}' \
--output json
aws ec2 describe-instances \
--profile <aws-profile> \
--region <region> \
--filters 'Name=private-ip-address,Values=<caller-private-ip>' \
--query 'Reservations[].Instances[].{InstanceId:InstanceId,Name:Tags[?Key==`Name`]|[0].Value,State:State.Name,PrivateIp:PrivateIpAddress,SecurityGroups:SecurityGroups[].GroupId,Subnet:SubnetId,Vpc:VpcId}' \
--output json
Read the security groups:
aws ec2 describe-security-groups \
--profile <aws-profile> \
--region <region> \
--group-ids <node-sg-id> <caller-sg-id> \
--query 'SecurityGroups[].{GroupId:GroupId,Name:GroupName,Ingress:IpPermissions,Egress:IpPermissionsEgress}' \
--output json
Run read-only checks on the node through SSM. send-command starts the remote shell commands and returns a command ID; get-command-invocation then fetches the status and output for that ID:
aws ssm send-command \
--profile <aws-profile> \
--region <region> \
--instance-ids <node-instance-id> \
--document-name AWS-RunShellScript \
--comment 'read-only rpc listen check' \
--parameters commands='<json-array-of-shell-commands>'
aws ssm get-command-invocation \
--profile <aws-profile> \
--region <region> \
--command-id <command-id> \
--instance-id <node-instance-id>
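The `<json-array-of-shell-commands>` placeholder is the fiddly part of the wrapper, because each shell command has to survive JSON quoting. One possible way to build it in bash is sketched below; `json_array` is a hypothetical helper, and it only escapes backslashes and double quotes, which is an assumption that holds for the one-line commands used in this note.

```shell
# Hypothetical helper: turn shell command strings into a JSON array.
# Only backslashes and double quotes are escaped; commands containing
# newlines or other control characters are out of scope for this sketch.
json_array() {
  local out="[" sep="" item
  for item in "$@"; do
    item=${item//\\/\\\\}   # escape backslashes first
    item=${item//\"/\\\"}   # then escape double quotes
    out+="${sep}\"${item}\""
    sep=","
  done
  printf '%s]\n' "$out"
}
```

With `cmds=$(json_array "ss -lntp" "docker ps")`, the value can be passed as `--parameters "{\"commands\":$cmds}"`. Adding `--query 'Command.CommandId' --output text` to send-command captures the ID that get-command-invocation needs.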
The shell commands run on the node were:
ss -lntp | grep -E ':8545|:8546|:8551' || true
curl -sS -m 3 \
-H "Content-Type: application/json" \
--data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
http://127.0.0.1:8545
docker ps --format 'table {{.Names}}\t{{.Ports}}'
Run the caller-side TCP and JSON-RPC check through SSM with the same wrapper pattern. The shell commands run on the caller were:
timeout 5 bash -lc '</dev/tcp/<node-private-ip>/8545' \
&& echo tcp_ok || echo tcp_failed
curl -sS -m 5 \
-H "Content-Type: application/json" \
--data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
http://<node-private-ip>:8545
Add the narrow inbound rule after the checks point to the security group:
aws ec2 authorize-security-group-ingress \
--profile <aws-profile> \
--region <region> \
--group-id <node-sg-id> \
--ip-permissions 'IpProtocol=tcp,FromPort=8545,ToPort=8545,UserIdGroupPairs=[{GroupId=<caller-sg-id>,Description="JSON-RPC from application server"}]'
Then run the caller-side check again and expect tcp_ok plus a normal
eth_blockNumber response.
Check the node before changing networking
The first check was local RPC on the node host.
curl -sS -m 3 \
-H "Content-Type: application/json" \
--data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
http://127.0.0.1:8545
The node returned a block number.
eth_syncing returned false.
The logs were also healthy:
head validated ... age=2s
Timings: Forkchoice Commit ... commit=1s
That removed chain sync from the suspect list. The node process was alive, the local RPC API responded, and Erigon was tracking the current head.
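The eth_syncing call uses the same request shape as eth_blockNumber with a different method name. A small bash sketch of both calls follows; `rpc_payload` and `rpc_call` are hypothetical helper names, and the default URL is assumed to be the node-local endpoint used above.

```shell
# Hypothetical helpers for the two health calls used in this note.
rpc_payload() {
  # $1 = method name, $2 = request id; params stay empty for these calls.
  printf '{"jsonrpc":"2.0","method":"%s","params":[],"id":%s}\n' "$1" "${2:-1}"
}

rpc_call() {
  # $1 = method, $2 = id, $3 = endpoint URL (assumed default: local RPC).
  curl -sS -m 3 \
    -H "Content-Type: application/json" \
    --data "$(rpc_payload "$1" "$2")" \
    "${3:-http://127.0.0.1:8545}"
}

# On the node:
#   rpc_call eth_blockNumber 1
#   rpc_call eth_syncing 2      # "result":false means the node is not syncing
```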
Check how the port is bound
Local success does not guarantee that another machine can connect. The process may listen only on loopback.
ss -lntp | grep -E ':8545|:8546|:8551'
docker ps --format 'table {{.Names}}\t{{.Ports}}'
The useful signal was:
0.0.0.0:8545
[::]:8545
That meant the container had published the RPC port on all host interfaces.
If the output had only shown 127.0.0.1:8545, the fix would have been a node
or container bind setting.
In this case, binding was not the blocker.
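Reading the listen address by eye is fine for one host. For repeated checks, the same decision can be sketched in bash and awk. `bind_scope` and its three output labels are hypothetical, and the parsing assumes `ss -lnt` style output where the state is the first column and the local address is the fourth.

```shell
# Hypothetical helper: read `ss -lnt` style output on stdin and report how
# the given port is bound. Assumes the local address is the fourth column.
bind_scope() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")
      if (parts[n] != port) next
      # Everything before the final ":port" is the listen address.
      host = substr($4, 1, length($4) - length(port) - 1)
      found = 1
      if (host !~ /^127\./ && host != "[::1]") remote = 1
    }
    END {
      if (!found)      print "not-listening"
      else if (remote) print "non-loopback"
      else             print "loopback-only"
    }'
}
```

On the node, `ss -lnt | bind_scope 8545` would report whether a remote caller has any chance of connecting once the network path allows it.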
Test from the caller
The caller host still could not open the TCP connection:
timeout 5 bash -lc '</dev/tcp/<node-private-ip>/8545' \
&& echo tcp_ok || echo tcp_failed
The result was:
tcp_failed
The same host also timed out with a JSON-RPC request:
curl -sS -m 5 \
-H "Content-Type: application/json" \
--data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
http://<node-private-ip>:8545
Now the shape was clear:
| Check | Result |
|---|---|
| Node local RPC | OK |
| Node socket binding | 0.0.0.0:8545 |
| Caller to node TCP | timeout |
| Caller to node JSON-RPC | timeout |
The failure sat between the two hosts.
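The one-line TCP check reports tcp_failed for both a timeout and a refusal. A variant that keeps them apart is sketched below. It relies on GNU `timeout` exiting with status 124 when it has to kill the command; `tcp_probe` and its output labels are hypothetical names.

```shell
# Hypothetical helper: open a TCP connection and say how the attempt ended.
tcp_probe() {
  local host="$1" port="$2" secs="${3:-5}"
  # bash opens /dev/tcp/<host>/<port> as a TCP connection on file descriptor 3.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
  case $? in
    0)   echo tcp_ok ;;
    124) echo tcp_timeout ;;            # timeout(1) killed the attempt: no reply
    *)   echo tcp_refused_or_error ;;   # connect failed fast, e.g. refused
  esac
}
```

From the caller, `tcp_probe <node-private-ip> 8545` printing `tcp_timeout` matches the filtered-path case described here, while `tcp_refused_or_error` would point back at the node's listener.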
Read the security groups as a data path
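Both instances were in the same VPC and subnet, which takes network ACLs and subnet route tables out of the picture, because they act on traffic that crosses a subnet boundary. The caller had outbound access. The node security group allowed P2P and metrics, but not JSON-RPC from the caller.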
That asymmetry is the pattern to expect for private blockchain nodes: P2P may be public, metrics may be allowed from monitoring, and JSON-RPC should stay narrow.
So the right change was not to reuse an open security group or publish 8545 to the internet. It was one inbound rule:
node security group
TCP 8545
source: application server security group
description: JSON-RPC from application server
Using a source security group keeps the rule tied to the caller role rather than a raw IP address. If the application server is replaced and the same security group is attached to the new network interface, the rule still matches.
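To keep the rule shape identical across environments, the `--ip-permissions` value can be generated instead of retyped. This is a small bash sketch that reproduces the shorthand string used earlier; `rpc_ingress_permission` is a hypothetical helper, and the security group ID in the example is made up.

```shell
# Hypothetical helper: build the --ip-permissions shorthand for one TCP port
# opened to one source security group.
rpc_ingress_permission() {
  local port="$1" source_sg="$2" desc="$3"
  printf 'IpProtocol=tcp,FromPort=%s,ToPort=%s,UserIdGroupPairs=[{GroupId=%s,Description="%s"}]\n' \
    "$port" "$port" "$source_sg" "$desc"
}

# Example with a made-up group ID:
#   rpc_ingress_permission 8545 sg-0123456789abcdef0 'JSON-RPC from application server'
```

Passing the result to `aws ec2 authorize-security-group-ingress` together with `--dry-run` is one way to check permissions and syntax before the rule is actually created.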
Verify the fix from the caller
After adding the rule, the same caller-side checks changed:
tcp_ok
The RPC calls also returned normally:
{"jsonrpc":"2.0","id":1,"result":"0x..."}
{"jsonrpc":"2.0","id":2,"result":false}
The first response is the current block number.
The second response is eth_syncing=false.
That combination says the caller can reach the node and the node is already at
the head.
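The block number comes back as a hex quantity, which is awkward to compare against a block explorer. A minimal bash sketch for converting it follows; `block_number_dec` is a hypothetical helper, it assumes a single-line response, and the sample value in the comment is made up.

```shell
# Hypothetical helper: read one JSON-RPC response on stdin and print the
# hex "result" as a decimal number. Assumes a single-line response whose
# result is a 0x-prefixed quantity, as eth_blockNumber returns.
block_number_dec() {
  local hex
  hex=$(sed -n 's/.*"result":"\(0x[0-9a-fA-F][0-9a-fA-F]*\)".*/\1/p')
  [ -n "$hex" ] || return 1        # no hex result found, e.g. an error body
  printf '%d\n' "$hex"             # bash printf accepts 0x-prefixed input
}

# Example with a made-up value:
#   echo '{"jsonrpc":"2.0","id":1,"result":"0x10"}' | block_number_dec
```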
What this avoids
This kind of timeout can waste time because the failing process is an indexer, and the destination is a blockchain node. It is easy to start reading sync logs, disk I/O, or Erigon snapshot messages. Those checks are useful, but they answer a different question.
The sequence that worked was smaller:
- Confirm the RPC error and destination.
- Confirm local RPC on the node.
- Confirm the socket is not loopback-only.
- Confirm the caller cannot open TCP.
- Compare security group rules to the intended data path.
- Add one narrow inbound rule.
- Re-run the same caller-side TCP and RPC checks.
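The ordering of those checks can be written down as a small decision function. This is only a sketch of the reasoning, not a tool: `next_step` and its input labels (`ok`, `loopback-only`, `tcp_ok`) are hypothetical and would be filled in by hand from the results of the earlier checks.

```shell
# Hypothetical helper: given the three observations, name the next place to look.
#   $1 = local RPC result on the node ("ok" or anything else)
#   $2 = socket binding ("loopback-only" or anything else)
#   $3 = caller-side TCP result ("tcp_ok" or anything else)
next_step() {
  local local_rpc="$1" bind="$2" remote_tcp="$3"
  if   [ "$local_rpc" != "ok" ];       then echo "inspect the node process"
  elif [ "$bind" = "loopback-only" ];  then echo "fix the node or container bind setting"
  elif [ "$remote_tcp" != "tcp_ok" ];  then echo "inspect the network path"
  else                                      echo "path is healthy"
  fi
}
```

For the incident in this note, `next_step ok all-interfaces tcp_failed` prints "inspect the network path".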
The operational boundary is simple: private RPC should be reachable from the machines that need it, and unreachable from everywhere else. When local RPC works and remote RPC times out, inspect the network path before touching the node process.