Self-service tooling for debugging IPFS request handling

What is this

End game

A user can hit a Boxo-based HTTP gateway (e.g., the ipfs.io gateway or a localhost Kubo gateway) and, if there is an error, get a link to download an “IPFS request trace”. They can then use locally run or centrally hosted tooling to analyze the “IPFS request trace” and pinpoint the issue.

When using a hosted solution like check.ipfs.network, they can get a link to the analyzed result that can be shared with others to help with troubleshooting or educating.

More specifically this means:

  1. Specifying the HAR (HTTP Archive) equivalent for IPFS so one can share traces of fully resolving an "IPFS request".
  2. Tooling squarely focused on “why can’t I find my CID” or “why is it slow”.
    1. It should be able to process an “IPFS Request Trace” file.
    2. It should be able to query an outside 3rd party (like check.ipfs.network) that can answer what it sees from its perspective to help corroborate where the issue may be.
      1. A supporting tool could be a site that iframes in a bunch of gateways to show what they observe when fetching a CID.
    3. It should break down the steps: are there provider records in the DHT or in IPNI, are there peer records, are the peers dialable, what transports do they support, etc.
  3. Hosted solution at something like check.ipfs.network that can generate analyses and do real-time network probes.
  4. Implementation changes in Kubo and Helia to produce “IPFS Request Traces”.
    1. Should also have an option to bypass any local caches to help with reproducibility.
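
To make the trace idea concrete, here is a minimal sketch of what a trace and its analysis could look like. No trace format has been specified yet, so every name below (`TraceStep`, `RequestTrace`, the step names, the fields) is hypothetical. The sketch also treats every step as required, which a real tool would not do: a DHT miss is fine if IPNI has a provider record.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TraceStep:
    """One stage of resolving a request (all field names are hypothetical)."""
    name: str            # e.g. "provider_lookup"
    ok: bool             # whether the stage succeeded
    duration_ms: float   # wall-clock time spent in the stage
    detail: str = ""     # free-form context, e.g. an error message


@dataclass
class RequestTrace:
    """Hypothetical container for the trace of resolving one CID."""
    cid: str
    steps: List[TraceStep] = field(default_factory=list)


def first_failure(trace: RequestTrace) -> Optional[TraceStep]:
    """Earliest failed step: a starting point for "why can't I find my CID"."""
    for step in trace.steps:
        if not step.ok:
            return step
    return None


def slowest_step(trace: RequestTrace) -> Optional[TraceStep]:
    """Longest-running step: a starting point for "why is it slow"."""
    if not trace.steps:
        return None
    return max(trace.steps, key=lambda s: s.duration_ms)


# Illustrative data only; the CID is a placeholder, not a real one.
example_trace = RequestTrace(
    cid="bafy-placeholder",
    steps=[
        TraceStep("provider_lookup", True, 120.0),
        TraceStep("peer_record_lookup", True, 45.0),
        TraceStep("dial_peer", False, 5000.0, "dial timed out"),
    ],
)

if __name__ == "__main__":
    failed = first_failure(example_trace)
    if failed is not None:
        print(f"first failing step: {failed.name} ({failed.detail})")
```

Keeping the steps as an ordered list mirrors how a HAR file records entries in sequence, and lets the same file answer both the "not found" and the "slow" question.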

Short term (low hanging fruit)

To start getting some wins in this area, there are some easier items we can tackle first.

Update check.ipfs.network with:

  1. IPNI results
  2. Browser-node results (for giving a view from the browser as an extra ipfs node datapoint)
  3. Surfacing common connectivity issues
  4. Documentation/workflow review/overhaul, taking useful learnings/paradigms from https://pl-diagnose.on.fleek.co/
  5. Actively drive people to these tools on ipfs.io gateway error pages (discussed more in )

Why this is a good idea

  1. Gives users more agency to determine how to unblock themselves: “I should get this gateway provider to improve or stop using this gateway”, “I should talk with the content provider”, etc.
  2. Reduces maintainer and community support time, as there should be fewer incoming requests, and those that do come in will include the relevant state/context to debug further.
    1. We don’t have a measurement of how much time this will save, but the “gut feel” is that it will save more than we would estimate.

Notes from Past Discussions

  • There are some existing tools, but they are missing updates around
    • cid.contact (IPNI routing)
    • connectivity of transports - What platforms/environments can my node talk to?
    • They aren’t producing an “IPFS Request Trace” that can be analyzed.
  • We need to actually error (time out) and give a trace of what was happening (HAR file equivalent)
  • Great opportunity to say something like “hey, it looks like you support tcp or quic… those might work but people in browsers won’t be able to fetch that data so please add support”
  • There is some dependence on so a browser can ask other nodes about content availability
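
The transport hint above could be sketched roughly as follows. Which transports count as browser-dialable is an assumption here (secure WebSockets, WebTransport, and WebRTC), and the function names are hypothetical. The parsing is deliberately naive string splitting, not a real multiaddr parser.

```python
from typing import List

# Assumption: multiaddr components a browser can dial. Plain "ws" is left out
# because pages served over HTTPS cannot open insecure WebSocket connections.
BROWSER_DIALABLE = {"wss", "webtransport", "webrtc-direct", "webrtc"}


def components(multiaddr: str) -> List[str]:
    """Split a multiaddr string into its slash-separated parts (names and values)."""
    return [part for part in multiaddr.split("/") if part]


def browser_dialable(multiaddr: str) -> bool:
    """Naive check: does this address include a browser-dialable transport?"""
    parts = components(multiaddr)
    if any(p in BROWSER_DIALABLE for p in parts):
        return True
    # "/tls/ws" (optionally with "/sni/<host>" in between) is the long form of "/wss".
    return "tls" in parts and "ws" in parts


def browser_hint(multiaddrs: List[str]) -> str:
    """Produce a human-readable hint for a peer's advertised addresses."""
    if any(browser_dialable(m) for m in multiaddrs):
        return "At least one address is browser-dialable."
    return ("No browser-dialable address found; "
            "browser clients likely cannot fetch directly from this peer.")


if __name__ == "__main__":
    # Documentation-range IP, for illustration only.
    print(browser_hint(["/ip4/203.0.113.7/tcp/4001",
                        "/ip4/203.0.113.7/udp/4001/quic-v1"]))
```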

Instances where this would have been helpful

  • Meta: We need to find a master tracking issue for this that we can start putting user stories against.
  • 2023-06-15: had trouble diagnosing problems with cid.contact content (https://filecoinproject.slack.com/archives/C03RQFURZLM/p1686843076251129). Tooling should do cid.contact queries. It should expose cache headers.
  • 2023-06-20: Masih has sunk a massive amount of time into keeping Infura’s data discoverable through IPNI for Project Rhea. Stems from Infura’s unwillingness to prioritize updates to how they run IPFS nodes.
    • Juan: i think this can only be truly resolved by making sure Infura knows what the problem is -- gateways reporting content routing errors to the users & devs, so they know what they have to do to make things discoverable
  • 2023-06-21: we really need a way to trace where a response came from so we can analyze and understand the network. This would have been relevant here: https://github.com/protocol/network-measurements/issues/49#issuecomment-1600514045
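
As a rough illustration of the 2023-06-15 item, a tool could query cid.contact and surface the caching headers next to the result. This is a sketch under assumptions: it assumes the indexer answers on the Delegated Routing V1 path `/routing/v1/providers/{cid}`, and the function names and the choice of which headers to report are made up for illustration.

```python
import json
import urllib.request
from typing import Dict, Mapping, Tuple

# Assumption: cid.contact is queried via the Delegated Routing V1 HTTP API path.
INDEXER = "https://cid.contact"

# Assumption: the headers worth surfacing when debugging stale or cached answers.
CACHE_HEADERS = ("cache-control", "age", "etag", "expires", "last-modified")


def providers_url(cid: str, base: str = INDEXER) -> str:
    """Build the provider-lookup URL for a CID."""
    return f"{base.rstrip('/')}/routing/v1/providers/{cid}"


def cache_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Keep only caching-related headers, with lower-cased names."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return {name: lowered[name] for name in CACHE_HEADERS if name in lowered}


def query_providers(cid: str, timeout: float = 10.0) -> Tuple[int, Dict[str, str], dict]:
    """Fetch provider records; returns (status, cache headers, parsed JSON body).

    urlopen raises urllib.error.HTTPError for non-2xx responses (for example
    when no providers are found), so callers should handle that case.
    """
    req = urllib.request.Request(providers_url(cid),
                                 headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read() or b"{}")
        return resp.status, cache_headers(dict(resp.headers.items())), body
```

Reporting the headers alongside the provider list makes it visible when an answer was served from a cache rather than computed fresh, which is the kind of context that was missing in that debugging session.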

Related Items