Update 05-Feb-2021 approx 17:30 UTC
Huge thanks to Wayon, Jag, Gimre and the rest of the team for taking the time to explain the below alongside the investigation work. I’ve tried to summarise the current state as I understand it.
The team are still looking at the issue. The summary of currently known information is below; it is ongoing and subject to change as more is known:
- The issue affects api-broker; exactly why and how is still being confirmed.
- The chain is still operating and Finality is still working. This can be checked on a known working API node: http://18.144.6.168:3000/chain/info
- On affected nodes the api-broker is down, so the REST gateway is not aware of the current state of the chain (MongoDB isn’t updated while the broker is down). REST therefore reports the state as of the moment the broker went down rather than the actual current state of the chain on the peer node.
- The node list site (https://symbolnodes.org/nodes_testnet) relies on REST calls for chain height, so it may have issues reporting the actual height while the api-broker(s) are down.
- The auto-recovery issue present in Bootstrap (#108) means that just restarting isn’t quite enough. That issue was already known and needed to be addressed, and it has obviously now risen in priority.
- Resetting and resynchronising the node does appear to resolve the issue and bring it back online. This is the only approach we know definitely fixes it, but we are still looking for others.
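Since an affected node's REST gateway keeps reporting the height it had when the broker stopped, one rough way to spot it is to compare its `/chain/info` response against the known working node. The snippet below is only a sketch: it assumes the response is JSON with a `height` field (possibly encoded as a string), and the function names and the tolerance of 5 blocks are my own illustrative choices, not anything the team has published.

```python
import json
from urllib.request import urlopen


def fetch_chain_info(base_url, timeout=10):
    """GET <base_url>/chain/info and return the parsed JSON body."""
    with urlopen(f"{base_url.rstrip('/')}/chain/info", timeout=timeout) as resp:
        return json.load(resp)


def blocks_behind(reference_info, candidate_info):
    """Number of blocks the candidate's REST view trails the reference (never negative)."""
    # Assumption: both payloads carry a "height" field; int() handles it
    # whether it arrives as a string or as a number.
    return max(0, int(reference_info["height"]) - int(candidate_info["height"]))


def looks_stale(reference_info, candidate_info, tolerance=5):
    """True if the candidate trails the reference by more than `tolerance` blocks."""
    # The tolerance is an arbitrary illustrative value to absorb normal lag.
    return blocks_behind(reference_info, candidate_info) > tolerance
```

To try it against live nodes, fetch `fetch_chain_info("http://18.144.6.168:3000")` as the reference and pass your own node's REST URL for the candidate. A large, growing gap suggests REST is serving stale data, though it cannot tell you on its own whether the broker is the cause.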
The process appears conceptually to have been something like:
- Api-broker had issues on some nodes (the root cause is still being fully identified).
- Api-broker failed due to the above and stopped.
- Bootstrap auto-recovery doesn’t allow it to restart, and the api-node ends up in a state that cannot be easily recovered.
- Peer node is still functioning normally.
- The issue only affects Dual or API nodes. It just happens that most nodes are dual nodes, and most NGL nodes are dual and voting to simulate Mainnet in terms of SuperNodes.
The work is going to continue today and over the weekend. We are likely to start resetting the NGL nodes in small batches soon, which will obviously take a day or two given the number of nodes involved and the need to avoid disrupting the chain or finality.
Edit: Just noticed a tweet from Jag so linking here as well: https://twitter.com/Jaguar0625/status/1357725263245762560