All,
It has been a week or so since the last full update, so I am collecting things together here to show what is happening.
The short version: a lot, and it is ongoing but showing signs of concluding soon. Testnet is going to have to be reset; that should have no additional impact on dates beyond what is already expected.
Testnet Current State (0.10.0.4 release):
- Forks have been resolved and the network has come back together. It is producing blocks, and projects and exchanges are actively developing against it.
- Testnet finality has stopped and will not be able to restart; this is explained further below.
- Apart from finality, testnet is functioning normally and is usable. It remains behind the mainnet code, which has the deep rollback patch and now multiple other patches from the issue resolution work. Patches are discussed in more detail below.
- We are aware that some exchanges and community projects are using the current Testnet for development, integration, listing work etc. It will remain up and NGL nodes will keep running this version. Community nodes are appreciated if they can be left running, but if you are running multiple nodes it is probably fine to scale back to 1 or 2 and save costs.
Issues Progress
- MongoDB memory usage was part of the problem: it grew too large and, coupled with core-server usage, consumed all resources. An alpha version of Bootstrap has been updated with MongoDB memory limiting, which helps nodes sync while the issues are being resolved; this will be reviewed to see if it is still necessary after the fix below.
- An additional MongoDB setting to manage the cache more aggressively has been tested successfully, with little/no impact on performance, and is being added to Bootstrap for the next public release (an illustrative sketch follows below).
- Core-server memory usage also grew alongside the above issues. RocksDB memory usage is being profiled and reviewed to see whether something similar to the above can be achieved on core-server; the issues below will also help with this.
- ‘Unconfirmed transaction’ issue investigation: recent testing made good progress, and more work is ongoing today and tomorrow. Debugging is progressing and we have managed to reproduce the issue in an internal environment, which is a huge step in the right direction.
- A node stalling/hanging issue has been found as part of the root cause analysis for the unconfirmed transactions. It appears to be unrelated to load but happened to occur in the load test (see below); it has been reproduced and is being debugged this week.
These issues, combined with the deep rollback patch not being present, compounded each other to cause the bigger problem. Work is ongoing on the node hanging, RocksDB and unconfirmed transaction issues, with strong lines of investigation on all three. A bit more time is needed to confirm a concrete plan for these, but they are progressing well.
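For anyone tuning their own node while this lands in Bootstrap, the shape of the MongoDB change is roughly the sketch below. It is illustrative only and not the actual Bootstrap change: the service name, paths and the 2 GB / 3 GB figures are assumptions; the real levers are MongoDB's --wiredTigerCacheSizeGB startup flag plus a container-level memory limit.

```typescript
// Illustrative only: a docker-compose style override that (a) caps the
// WiredTiger cache via MongoDB's --wiredTigerCacheSizeGB flag and
// (b) hard-limits the container's memory. The service name, paths and the
// 2 GB / 3 GB figures are placeholders, not the real Bootstrap defaults.
interface ServiceOverride {
  command: string;   // process command line inside the container
  mem_limit: string; // docker-compose container memory cap
}

const dbOverride: { services: Record<string, ServiceOverride> } = {
  services: {
    db: {
      // Cap the WiredTiger cache so MongoDB stops growing unbounded.
      command: 'mongod --dbpath=/dbdata --bind_ip=db --wiredTigerCacheSizeGB 2',
      // Belt and braces: also cap the container itself.
      mem_limit: '3g',
    },
  },
};

// In practice this would be emitted as YAML and merged into the generated
// docker-compose file; JSON is used here only to keep the sketch dependency-free.
console.log(JSON.stringify(dbOverride, null, 2));
```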
What has been happening since the last update (30/06)
- The core developers and test team have been working exceptionally hard unpicking the findings from testnet, reproducing what was found and applying patches and/or tuning based on the learnings, then repeating the process.
- It is slow and painstaking work because they need to try multiple scenarios on different node types under different loads to consider all angles.
We now have fixes in progress or complete for all but two areas; those two areas are:
- An issue that can cause a node to hang (the issue has been reproduced; root cause analysis is ongoing)
- RocksDB memory optimisations to reduce the memory consumed; work is ongoing
In addition, an entirely new network with 500 nodes has been stood up internally to allow the tests to be rerun on a similar environment. That network:
- Has all the latest patches on it, and has helped in getting to a high degree of confidence about what happened on Testnet.
- Has had a 130tps test run on it, which was passing for 6 hours before we encountered the node hanging/stalling issue again. That issue had previously been proving very hard to reproduce; we can now do so reliably (a sketch of a constant-rate driver of this kind follows below).
There is good reason to believe that, with that issue fixed, the test would have continued passing and degraded gracefully as it maxed out. That issue is the primary focus right now and is unrelated to load (it just happens to be easier to see with more load, but from what we have seen it could also occur at 1tps).
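For context on what a tps-targeted test involves, here is a minimal sketch of a constant-rate driver. It is not the NGL test harness; announceTransaction is a hypothetical stand-in for however transactions are actually signed and announced to a node.

```typescript
// Minimal constant-rate load driver sketch (not the real test harness).
// announceTransaction is a hypothetical placeholder for signing a
// transaction and announcing it to a node's REST gateway.
async function announceTransaction(index: number): Promise<void> {
  // ... sign and announce transaction number `index`; left as a stub here
}

async function runLoad(targetTps: number, durationSeconds: number): Promise<void> {
  const intervalMs = 1000 / targetTps;        // spacing between sends
  const total = targetTps * durationSeconds;  // e.g. 130 tps sustained for the run
  const start = Date.now();

  for (let i = 0; i < total; i++) {
    // Fire-and-forget so a slow announce does not skew the send rate.
    void announceTransaction(i).catch((err) => console.error(`tx ${i} failed`, err));

    // Sleep until the next scheduled send time to hold the target rate.
    const nextSendAt = start + (i + 1) * intervalMs;
    const delay = nextSendAt - Date.now();
    if (delay > 0) {
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Example: the 130tps figure from the internal test, run here for only 60 seconds.
runLoad(130, 60).then(() => console.log('load run complete'));
```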
Testnet – What happened (result of forensics)
Based on having been able to reproduce all the issues and debug them, the below is what is believed with high confidence to have happened on Testnet. It is a layering of a few issues that compound each other:
- A small number of nodes got stuck on unresolvable forks, and they did so with a large number of unconfirmed transactions in them. The nodes that did not hang continued without them and confirmed the transactions. The hung nodes then unhung or restarted and kept trying to sync and broadcast large numbers of unconfirmed transactions to good nodes that had already confirmed them, causing a DDOS-like condition and a memory spike. It also led to various forks being created, which then grew beyond 600 blocks.
- A known issue was present with deep rollbacks (>600 blocks) on Testnet. It was fixed several weeks ago for the mainnet release, but the fix could not be applied to Testnet because it is a breaking change.
- This made it impossible to recover from the forks when they happened.
- This work, and the effectively DDOS-like behaviour, caused memory to spike (on MongoDB and core-server, we think), which caused more nodes to fall over as memory reached its upper limits.
- Unfortunately, the way the voting nodes were scaled meant that this impacted a supermajority of them, and as a result they failed to keep voting. It has been impossible to recover them to a state where they will either vote again or approve a change to the voting node members – essentially they don't exist anymore, but are still in the voting pool. This is why finality has stalled and can now never restart on that Testnet. We have learned some very useful lessons from this (and tested them on the devnet) and have a workaround/fix for the next deployment.
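To make the supermajority point concrete: finality schemes of this kind need well over half of the registered voting weight to agree before blocks can be finalised, and the dead nodes still count towards that registered total. The sketch below assumes a two-thirds style threshold purely for illustration; the actual testnet finalization parameters are not quoted here.

```typescript
// Illustration only: assumes a two-thirds style threshold, which may not
// match the actual finalization parameters used on the Symbol testnet.
const THRESHOLD = 2 / 3;

// Can the voters that are still alive ever reach the threshold, given that
// the dead voters remain registered in the voting pool and count towards
// the total weight?
function canStillFinalize(totalVotingWeight: number, permanentlyLostWeight: number): boolean {
  const remainingWeight = totalVotingWeight - permanentlyLostWeight;
  return remainingWeight > totalVotingWeight * THRESHOLD;
}

// Example: 100 units of registered voting weight.
console.log(canStillFinalize(100, 20)); // true  -> 80 > 66.7, finality can continue
console.log(canStillFinalize(100, 40)); // false -> 60 < 66.7, finality is stuck for good
```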
What needs to be fixed/has been fixed?
- A patch has been created and is in final testing which caps the maximum queue sizes of the unconfirmed transactions (and other caches), rejecting transactions rather than letting the queue build up for later processing. This gives protection from the large queues and effectively caps tps (see the sketch after this list).
- A patch is being created that will more aggressively defend against attempts to sync older transactions, such as the unconfirmed ones in this instance – they will be rejected.
- A setting has been applied to MongoDB that more aggressively manages the size of the WiredTiger cache. This has been tested and found to constrain MongoDB RAM usage far more effectively, with little/no impact on performance.
- The deep rollback patch has been incorporated into a new build so that if forks like the ones in this scenario occur, they can be recovered from.
- We have some RocksDB work to do to ensure its memory management is aggressive enough not to respond with a spike in this type of scenario.
- The node hanging issue, which appears to have been the spark for the bigger fire, has been reproduced and is being resolved.
- There are changes to incorporate into the Bootstrap tools which will apply a better set of defaults to the standard deployment, covering all of the above plus a couple of easier debug modes.
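As a rough illustration of the first two patches above (not the actual catapult implementation), the idea is a hard cap on the unconfirmed transactions cache plus a staleness check on incoming syncs, with anything over the limit or too old rejected outright rather than queued. The class and field names, the 10,000 cap and the 600-block age cut-off below are all assumptions used only for the sketch.

```typescript
// Rough sketch of the two protections described above; not catapult code.
// Cap size and max age are arbitrary placeholder values.
interface UnconfirmedTransaction {
  hash: string;
  deadlineHeight: number; // height after which the tx is considered stale
}

class BoundedUnconfirmedCache {
  private readonly transactions = new Map<string, UnconfirmedTransaction>();

  constructor(
    private readonly maxSize = 10_000,   // hard cap instead of an unbounded queue
    private readonly maxAgeBlocks = 600, // reject syncs of transactions older than this
  ) {}

  /** Returns true if accepted, false if rejected (stale, duplicate, or cache full). */
  add(tx: UnconfirmedTransaction, currentHeight: number): boolean {
    // Reject attempts to (re)sync old transactions, e.g. ones confirmed long ago
    // and rebroadcast by a node recovering from a hang.
    if (currentHeight - tx.deadlineHeight > this.maxAgeBlocks) return false;

    // Reject duplicates outright.
    if (this.transactions.has(tx.hash)) return false;

    // Reject rather than queue once the cap is hit, which also caps effective tps.
    if (this.transactions.size >= this.maxSize) return false;

    this.transactions.set(tx.hash, tx);
    return true;
  }

  remove(hash: string): void {
    this.transactions.delete(hash);
  }
}

// Usage: a flood of re-broadcast old transactions is simply dropped.
const cache = new BoundedUnconfirmedCache();
console.log(cache.add({ hash: 'AB12', deadlineHeight: 100 }, 1_000)); // false: too old
```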
It is expected that this resolution work will take this week and into early next week; further updates will be provided as things progress.
Next Steps
Testnet needs to be reset. This will occur as soon as practical, currently likely during the week commencing 11th Jan, though this is subject to change as testing continues.
This reset is not expected to change the launch date any more than the new release would have done anyway, and the new testnet will need to soak for a minimum of 1 month from release, assuming the testing below passes.
- The new testnet will have 500 nodes and all of the latest patches, including deep rollback, and will be fully reset.
- It will then have stress tests conducted on it by NGL before it is released (we will reset our large devnet for this).
- We will ask NEMTus to perform their testing event and will triage any issues raised; we will coordinate further with Gotoh-san on this as we have more information.
- A confirmed date will be communicated once the testing concludes and the results can be analysed. At this stage it looks like mid to late Feb, but we cannot be more specific and it depends on the results of testing.
The target is to complete all of the above by the end of the week commencing 11th Jan, but there is still work to do. At that point we expect to have a more concrete position/plan on snapshot and launch dates.
It will be a few days before the next update, as the next phases will take some time due to the size of the testing work involved. I will update as soon as there is more to share.