Troubleshooting an IPFS Cluster

4 May 2022

IPFS Cluster

A pretty cool tool: cluster.ipfs.io. It lets you create clusters of IPFS nodes that are orchestrated together to make sure content stays available and replicated.

Note that it’s not a decentralized solution: there is no incentive for outside nodes to connect to this cluster and replicate its data. It’s designed to be operated in a trusted environment (a single organization, for example), much like Hadoop’s HDFS. I think of it as one “big IPFS node”.

Troubleshooting

We had an issue where the IPFS cluster was misconfigured. Nodes couldn’t join the cluster, and a few GitHub Actions were failing because of this.

Internal Link: [[IPFS Distribution is Failing]]

The cluster was ipfs-websites.collab.ipfscluster.io. At the time of the error, I didn’t know the login / password combination needed to connect to the cluster directly, but the address alone was enough to identify a few errors.

The page collab.ipfscluster.io contains a few instructions about this cluster and other Protocol Labs clusters.

It shares a command to follow the cluster’s pinset:

ipfs-cluster-follow ipfs-websites run --init ipfs-websites.collab.ipfscluster.io

This command reported 0 peers, while a functioning cluster should show a few.

Pushing further, I fetched the cluster configuration and tried to connect to each of its peers:

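# Fetch the cluster configuration through the ipfs.io gateway
curl -sSL https://ipfs.io/ipns/ipfs-websites.collab.ipfscluster.io -o cluster.json

# For each peer listed in the config: ping its host, then try a
# libp2p connection to it from the local IPFS daemon
for peer in $(jq -r '.cluster.peer_addresses[]' cluster.json); do
  echo "$peer"
  # Keep only the hostname from /dns4/<host>/tcp/<port>/...
  addr=$(echo "$peer" | sed -E 's;/dns4/([^/]*)/tcp/([^/]*)/.*;\1;')
  ping -q -c 2 "$addr"
  ipfs swarm connect "$peer"
done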
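The sed expression above only handles /dns4/ addresses, and ping only proves the host is up, not that anything listens on the cluster port. A minimal sketch of two extra checks, assuming peer multiaddrs of the form /<proto>/<host>/tcp/<port>/... (the function names are hypothetical, not part of ipfs or ipfs-cluster, and tcp_open assumes bash plus coreutils’ timeout are installed):

```shell
# Hypothetical helpers (not part of ipfs or ipfs-cluster): parse a peer
# multiaddr of the form /<proto>/<host>/tcp/<port>/... where <proto> is
# dns, dns4, dns6, ip4 or ip6. Non-matching addresses print nothing.
multiaddr_host() {
  printf '%s\n' "$1" |
    sed -nE 's;^/(dns|dns4|dns6|ip4|ip6)/([^/]+)/tcp/([0-9]+)(/.*)?$;\2;p'
}

multiaddr_port() {
  printf '%s\n' "$1" |
    sed -nE 's;^/(dns|dns4|dns6|ip4|ip6)/([^/]+)/tcp/([0-9]+)(/.*)?$;\3;p'
}

# Try to open a TCP connection with bash's /dev/tcp pseudo-device.
# Exit status 0 means something accepted the connection; non-zero means
# it was refused or timed out after 3 seconds. Needs bash and coreutils'
# timeout (GNU/Linux); elsewhere, `nc -z host port` is an alternative.
tcp_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

Inside the loop, for instance: port=$(multiaddr_port "$peer"); tcp_open "$(multiaddr_host "$peer")" "$port" || echo "nothing listening on port $port". Unlike ping, this tests the peer’s libp2p port itself.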

ping succeeded, but the ipfs swarm connect call failed with the following error:

ipfs swarm connect NODE
Error: connect PEERID failure: failed to dial PEERID:
  * [addr] dial addr: connect: connection refused

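Since ping succeeds, the hosts are up, but a refused connection means nothing is listening on that port: the cluster peers are down. On a healthy cluster, the command should instead fail with something like: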

ipfs swarm connect NODE
Error: connect PEERID failure: failed to dial PEERID:
  * [addr] failed to negotiate security protocol: incoming message was too large

This is the error you get when you don’t have the correct password / auth configuration, which is expected here: a peer answers, but the handshake fails.
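The two errors can be told apart mechanically when checking many peers. A minimal sketch, assuming the error text matches the ipfs output shown above (the function name and diagnosis labels are hypothetical, and the exact wording may change between ipfs/libp2p versions):

```shell
# Hypothetical helper: map the error text printed by `ipfs swarm connect`
# to a rough diagnosis, by substring match on the two errors shown above.
diagnose_connect_error() {
  case "$1" in
    *"connection refused"*)
      # The host is up but nothing listens on that port
      echo "peer-down" ;;
    *"failed to negotiate security protocol"*)
      # A peer answered but the handshake failed, e.g. because we
      # lack the cluster's password / auth configuration
      echo "peer-up-auth-required" ;;
    *)
      echo "unknown" ;;
  esac
}
```

In the loop, err=$(ipfs swarm connect "$peer" 2>&1) || diagnose_connect_error "$err" prints one label per failing peer; a cluster where every peer reports peer-down matches the situation described here.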

These were the signs that the cluster was misconfigured and had to be fixed.


Laurent Senta

I wrote software for large distributed systems, web applications, and even robots. These days I focus on making developers, creators, and humans more productive through IPDX.co.