
When Local RPC Works but Another EC2 Times Out

A field note on debugging a private Erigon JSON-RPC timeout across EC2 instances by separating client failure, node health, socket binding, and security group rules.

Jun 23, 2026
AWS, Security Groups, Erigon, JSON-RPC, troubleshooting

The indexer exited after retrying an Ethereum JSON-RPC call. The error looked like a node problem at first, because the failed call was eth_getLogs over HTTP.

rpc error after retry, exiting
Post "http://<node-private-ip>:8545": dial tcp <node-private-ip>:8545: i/o timeout

The node was already synced. eth_syncing returned false, and Erigon logs kept printing new head validated lines with a low block age. The failure was between two EC2 instances, not inside the Erigon sync path.

Terms used here

Term                   Meaning
JSON-RPC               The HTTP API clients use to query chain data. Erigon commonly exposes it on TCP 8545.
Security group         An AWS firewall attached to an instance network interface. It controls allowed inbound and outbound traffic.
Source security group  A security group used as the source of an inbound rule. Traffic is allowed from network interfaces that have that source group attached.
Socket binding         The address a process listens on, such as 127.0.0.1:8545 for local-only or 0.0.0.0:8545 for all interfaces.
Timeout                The client sent a request but did not get a response in time. For private EC2 traffic, this often points to filtering or routing rather than an application error.

Start with the exact failure

The indexer log had two useful details:

  • the destination was the node private address on port 8545
  • the error was i/o timeout

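A timeout is different from connection refused. connection refused usually means the packet reached the host, but nothing was listening on that address and port, so the host answered with a reset. i/o timeout usually means the packet got no answer at all. That points to something dropping traffic silently: a security group, network ACL, route, or host firewall. A process bound only to loopback normally produces connection refused rather than a timeout when the client uses the private IP, but it is cheap to rule out, so the binding check stays in the sequence.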
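The distinction above can be written down as a small triage function. This is a minimal sketch, not something the indexer or Erigon ships: the function name classify_dial_error and its three output tokens are made up for illustration, and the match strings assume Go-style dial errors like the one in the indexer log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a dial error string to the kind of failure it
# most likely represents. The tokens it prints are illustrative only.
classify_dial_error() {
  case "$1" in
    *"connection refused"*)
      # Host answered with a reset: reachable, but no listener on that address/port.
      echo "refused" ;;
    *"i/o timeout"*)
      # No answer at all: look at security group, NACL, route, host firewall.
      echo "timeout" ;;
    *)
      echo "unclassified" ;;
  esac
}
```

Calling it with the logged line, for example classify_dial_error 'dial tcp <node-private-ip>:8545: i/o timeout', prints timeout, which is the cue to look at the network path first.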

Commands used during the check

These are the commands that shaped the investigation. The identifiers are placeholders.

Find the node and caller instances:

aws ec2 describe-instances \
  --profile <aws-profile> \
  --region <region> \
  --instance-ids <node-instance-id> \
  --query 'Reservations[].Instances[].{InstanceId:InstanceId,Name:Tags[?Key==`Name`]|[0].Value,PrivateIp:PrivateIpAddress,SecurityGroups:SecurityGroups[].GroupId,Subnet:SubnetId,Vpc:VpcId}' \
  --output json

aws ec2 describe-instances \
  --profile <aws-profile> \
  --region <region> \
  --filters 'Name=private-ip-address,Values=<caller-private-ip>' \
  --query 'Reservations[].Instances[].{InstanceId:InstanceId,Name:Tags[?Key==`Name`]|[0].Value,State:State.Name,PrivateIp:PrivateIpAddress,SecurityGroups:SecurityGroups[].GroupId,Subnet:SubnetId,Vpc:VpcId}' \
  --output json

Read the security groups:

aws ec2 describe-security-groups \
  --profile <aws-profile> \
  --region <region> \
  --group-ids <node-sg-id> <caller-sg-id> \
  --query 'SecurityGroups[].{GroupId:GroupId,Name:GroupName,Ingress:IpPermissions,Egress:IpPermissionsEgress}' \
  --output json

Run read-only checks on the node through SSM. The wrapper command starts the remote shell command and returns a command ID:

aws ssm send-command \
  --profile <aws-profile> \
  --region <region> \
  --instance-ids <node-instance-id> \
  --document-name AWS-RunShellScript \
  --comment 'read-only rpc listen check' \
  --parameters commands='<json-array-of-shell-commands>'

aws ssm get-command-invocation \
  --profile <aws-profile> \
  --region <region> \
  --command-id <command-id> \
  --instance-id <node-instance-id>
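Quoting shell commands inside the --parameters value is the fiddly part of the wrapper. One way to build it, sketched here under assumptions, is a helper that emits the JSON form {"commands":[...]}, which the AWS CLI also accepts for --parameters. The name ssm_commands_json is hypothetical, and the helper assumes every command is a single line, so only backslashes and double quotes are escaped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a {"commands":[...]} JSON document from shell
# command strings. Assumes single-line commands (no embedded newlines or
# tabs), so only backslashes and double quotes need escaping.
ssm_commands_json() {
  local out="" cmd
  for cmd in "$@"; do
    cmd=${cmd//\\/\\\\}   # escape backslashes first
    cmd=${cmd//\"/\\\"}   # then escape double quotes
    out+="${out:+,}\"${cmd}\""
  done
  printf '{"commands":[%s]}\n' "$out"
}
```

The result can be passed as --parameters "$(ssm_commands_json 'ss -lntp' 'docker ps')" in the send-command call above.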

The shell commands run on the node were:

ss -lntp | grep -E ':8545|:8546|:8551' || true

curl -sS -m 3 \
  -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  http://127.0.0.1:8545

docker ps --format 'table {{.Names}}\t{{.Ports}}'

Run the caller-side TCP and JSON-RPC check through SSM with the same wrapper pattern. The shell commands run on the caller were:

timeout 5 bash -lc '</dev/tcp/<node-private-ip>/8545' \
  && echo tcp_ok || echo tcp_failed

curl -sS -m 5 \
  -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  http://<node-private-ip>:8545

Add the narrow inbound rule after the checks point to the security group:

aws ec2 authorize-security-group-ingress \
  --profile <aws-profile> \
  --region <region> \
  --group-id <node-sg-id> \
  --ip-permissions 'IpProtocol=tcp,FromPort=8545,ToPort=8545,UserIdGroupPairs=[{GroupId=<caller-sg-id>,Description="JSON-RPC from application server"}]'

Then run the caller-side check again and expect tcp_ok plus a normal eth_blockNumber response.

Check the node before changing networking

The first check was local RPC on the node host.

curl -sS -m 3 \
  -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  http://127.0.0.1:8545

The node returned a block number. eth_syncing returned false. The logs were also healthy:

head validated ... age=2s
Timings: Forkchoice Commit ... commit=1s

That removed chain sync from the suspect list. The node process was alive, the local RPC API responded, and Erigon was tracking the current head.

Check how the port is bound

Local success does not guarantee that another machine can connect. The process may listen only on loopback.

ss -lntp | grep -E ':8545|:8546|:8551'
docker ps --format 'table {{.Names}}\t{{.Ports}}'

The useful signal was:

0.0.0.0:8545
[::]:8545

That meant the container had published the RPC port on all host interfaces. If the output had only shown 127.0.0.1:8545, the fix would have been a node or container bind setting. In this case, binding was not the blocker.
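Reading the ss output by eye works for one node. For repeated checks, the same judgment can be sketched as a filter over ss -lnt output. The function name classify_bind and its four result tokens are assumptions for illustration, and the parsing assumes the usual ss column layout where the fourth field is the local address and port.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ss -lnt` output on stdin and report how a TCP
# port is bound. Assumes field 4 is "Local Address:Port".
classify_bind() {
  local port="$1"
  awk -v port="$port" '
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      addr = $4
      sub(":" port "$", "", addr)          # strip the trailing :PORT
      if (addr == "0.0.0.0" || addr == "*" || addr == "[::]") wildcard = 1
      else if (addr ~ /^127\./ || addr == "[::1]")            loopback = 1
      else                                                    specific = 1
    }
    END {
      if (wildcard)      print "all_interfaces"
      else if (specific) print "specific_address"
      else if (loopback) print "loopback_only"
      else               print "not_listening"
    }'
}
```

On the node this would be run as ss -lnt | classify_bind 8545. Only loopback_only or not_listening would have pointed at the node or container configuration.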

Test from the caller

The caller host still could not open the TCP connection:

timeout 5 bash -lc '</dev/tcp/<node-private-ip>/8545' \
  && echo tcp_ok || echo tcp_failed

The result was:

tcp_failed
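The one-liner above folds every failure into tcp_failed. A slightly longer variant, sketched here, keeps the timeout and refused cases apart by looking at the exit status: GNU timeout exits with 124 when it has to kill the probe, while bash exits quickly with a non-zero status when the connection is refused. The function names and result tokens are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: probe a TCP port and keep "no answer" separate from
# "answered with an error".
interpret_probe_exit() {
  case "$1" in
    0)   echo "tcp_ok" ;;
    124) echo "tcp_timeout" ;;           # timeout(1) killed the probe: no reply at all
    *)   echo "tcp_refused_or_error" ;;  # bash failed fast, e.g. connection refused
  esac
}

probe_tcp() {
  local host="$1" port="$2" seconds="${3:-5}" status=0
  timeout "$seconds" bash -c "</dev/tcp/${host}/${port}" 2>/dev/null || status=$?
  interpret_probe_exit "$status"
}
```

Run as probe_tcp <node-private-ip> 8545 on the caller. In this incident the expected answer would have been tcp_timeout, which matches the i/o timeout in the indexer log.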

The same host also timed out with a JSON-RPC request:

curl -sS -m 5 \
  -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  http://<node-private-ip>:8545

Now the shape was clear:

Check                     Result
Node local RPC            OK
Node socket binding       0.0.0.0:8545
Caller to node TCP        timeout
Caller to node JSON-RPC   timeout

The failure sat between the two hosts.

Read the security groups as a data path

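Both instances were in the same VPC and subnet, so no route table hop or network ACL sat between them (network ACLs are evaluated only when traffic crosses a subnet boundary). The caller's security group allowed the outbound traffic. The node security group allowed P2P and metrics, but had no inbound rule for TCP 8545 from the caller.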

That is a useful pattern for private blockchain nodes. P2P may be public, metrics may be allowed from monitoring, and JSON-RPC should stay narrow. The right change was not to reuse an open security group or publish 8545 to the internet. The right change was one inbound rule:

node security group
TCP 8545
source: application server security group
description: JSON-RPC from application server

Using a source security group keeps the rule tied to the caller role rather than a raw IP address. If the application server is replaced and the same security group is attached to the new network interface, the rule still matches.
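Whether the rule exists can also be checked without reading the full JSON. The sketch below is one possible approach, not the commands used during the incident: allowed_source_groups and has_group are hypothetical names, the profile and region are assumed to come from the environment, and the JMESPath filter only matches explicit TCP rules whose range covers the port, so an "all traffic" rule would not show up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the source security groups that a group admits
# on one TCP port. Rules with IpProtocol "-1" (all traffic) have no port
# range and are not matched by this query.
allowed_source_groups() {
  local sg_id="$1" port="$2"
  aws ec2 describe-security-groups \
    --group-ids "$sg_id" \
    --query "SecurityGroups[0].IpPermissions[?IpProtocol=='tcp' && FromPort<=\`${port}\` && ToPort>=\`${port}\`].UserIdGroupPairs[].GroupId" \
    --output text
}

# Hypothetical helper: succeed if the first argument appears among the rest.
has_group() {
  local wanted="$1" id
  shift
  for id in "$@"; do
    [ "$id" = "$wanted" ] && return 0
  done
  return 1
}
```

A usage sketch: has_group <caller-sg-id> $(allowed_source_groups <node-sg-id> 8545) || echo "no 8545 rule for caller". The command substitution is left unquoted on purpose so the tab-separated IDs split into separate arguments.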

Verify the fix from the caller

After adding the rule, the same caller-side checks changed:

tcp_ok

The RPC calls also returned normally:

{"jsonrpc":"2.0","id":1,"result":"0x..."}
{"jsonrpc":"2.0","id":2,"result":false}

The first response is the current block number. The second response is eth_syncing=false. That combination says the caller can reach the node and the node is already at the head.
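The block number comes back as a hex string, which is awkward to compare against a block explorer. A minimal sketch for converting it follows. The function name is hypothetical, and the sed pattern assumes the compact single-line response shape shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pull the hex "result" out of a JSON-RPC response and
# print it in decimal. Returns 1 if the result is not a hex string
# (for example, the eth_syncing response with result false).
rpc_hex_result_to_dec() {
  local hex
  hex=$(printf '%s' "$1" | sed -n 's/.*"result":"\(0x[0-9a-fA-F]\{1,\}\)".*/\1/p')
  [ -n "$hex" ] || return 1
  printf '%d\n' "$hex"   # bash printf accepts 0x-prefixed integers
}
```

Piping the eth_blockNumber response body into this helper gives a height that can be checked against any public head, which confirms both reachability and freshness in one step.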

What this avoids

This kind of timeout can waste time because the failing process is an indexer, and the destination is a blockchain node. It is easy to start reading sync logs, disk I/O, or Erigon snapshot messages. Those checks are useful, but they answer a different question.

The sequence that worked was smaller:

  1. Confirm the RPC error and destination.
  2. Confirm local RPC on the node.
  3. Confirm the socket is not loopback-only.
  4. Confirm the caller cannot open TCP.
  5. Compare security group rules to the intended data path.
  6. Add one narrow inbound rule.
  7. Re-run the same caller-side TCP and RPC checks.
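The ordering of the steps above carries the real lesson, so it can help to write it as a decision function. This is a sketch only: next_step and its input tokens (ok, loopback_only, tcp_ok) are invented for illustration, and the final branch is an assumption about where to look when the network path is clean.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given the three check results, name the next place
# to look. Inputs are illustrative tokens, not output of any real tool.
next_step() {
  local local_rpc="$1" bind="$2" caller_tcp="$3"
  if [ "$local_rpc" != "ok" ]; then
    echo "inspect the node process and sync state"
  elif [ "$bind" = "loopback_only" ]; then
    echo "change the node or container bind setting"
  elif [ "$caller_tcp" != "tcp_ok" ]; then
    echo "compare security group rules to the intended data path"
  else
    # Assumption: with the path open, the remaining suspects are client-side.
    echo "network path is fine; look at the client or the RPC method"
  fi
}
```

For this incident the inputs were ok, all interfaces, and a caller-side timeout, which lands on the security group branch without touching the node.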

The operational boundary is simple: private RPC should be reachable from the machines that need it, and unreachable from everywhere else. When local RPC works and remote RPC times out, inspect the network path before touching the node process.