All,
An update on the issues highlighted by the stress test; for context, additional background is available here.
Summary
- Original summary (22/12) here
- @Jaguar0625’s tweet here (23/12)
- Follow-up summary here
The investigation is ongoing but is making progress, and various issues have been fixed. I want to share as much of that as is practical here. A reminder of what led to this:
- Stress testing on private networks had been ongoing since before the Testnet launch and showed no issues
- 18-19 Dec: Planned stress testing to the target of 100 tps took place and was passed
- 19-21 Dec: The load was pushed to 130 tps; the public Testnet behaved differently from the internal networks tested before
- This exposed two issues: MongoDB memory usage and Unconfirmed Transaction Cache management
- Testnet is in an unstable state - not directly because of those issues, but they triggered it (see below)
I will cover each issue separately below. The quick version is that all issues have progressed: some have been resolved and others are ongoing. This week is critical in terms of any impact on launch.
What is written below is entirely transparent and nothing is held back. The situation progresses every day, across multiple time zones, and I will update further as soon as possible.
Unconfirmed Transaction Cache (Ongoing but progressing)
A lot of Failure_Core_Past_Deadline errors (Fixed)
This is a hypothesis but is the most likely scenario. When the MongoDB memory issue below happened, it caused the core server on some nodes to stall due to lack of memory, with various unconfirmed transactions in the cache. These were not handled appropriately and as a result were still “hanging around” when the nodes were restarted, causing the Failure_Core_Past_Deadline errors some people have seen in the logs.
Because of the number of nodes and transactions involved, some nodes had 3-4 million of these messages when trying to sync on Testnet, which, coupled with the MongoDB memory issue, made it hard to sync and caused nodes to crash repeatedly.
The spurious unconfirmed transactions have now been cleared and it is possible to synchronise a node (see the comment below on forks, though). You may still see the occasional one of these errors in the logs, but they are nothing to worry about and should clear over time.
Recovery from Stalled State (Ongoing)
There is an ongoing cache management issue that is still being investigated by the Core Developers and the NGL Test team, who were working on it before Christmas and have continued over the weekend and into this week. This is the primary focus for the Core Devs right now. The high-level version is that when the server enters the state described above, the cache needs to manage itself more appropriately.
This issue could only be seen because multiple other issues combined, which is why it was found only now.
MongoDB Memory Management Issue (Fixed)
When placed under large, concentrated load, the MongoDB component began to consume a large amount of memory, ultimately exhausting it and causing the problems described above.
Symbol Bootstrap has had a preset option available for some time to throttle the memory usage of MongoDB. Previously this was optional and no default was set; in light of the above problems, we are setting the default to cap MongoDB at 50% of total memory.
This is a relatively simple fix for node owners - install the new Symbol Bootstrap version.
npm install symbol-bootstrap@alpha
The full build should be released shortly and I will try to remember to update the above. It is also possible to set the mem_limit option manually if needed (or if you are not using Bootstrap).
Full explanation: https://nem2.slack.com/archives/C9YKR0EUX/p1605110410149200
Make sure that you set the version to 2.4 (for now this will also need to be set manually in Bootstrap)
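For anyone applying the cap by hand, the sketch below shows roughly what the manual change looks like in the docker-compose file that Bootstrap generates in its target directory. The service name db and the 4g value are assumptions for illustration only (4g being roughly 50% of an 8 GB host); check the names and sizes in your own generated file before changing anything.

version: '2.4'          # a 2.x compose file version is needed for mem_limit to apply outside swarm mode
services:
  db:                   # the MongoDB service - confirm the actual service name in your generated file
    mem_limit: 4g       # cap MongoDB at roughly 50% of the host's total memory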
Testnet State (Ongoing but Progressing)
Testnet initially had issues because the stress test overwhelmed some nodes and left them in a faulty state. Those nodes make up the majority of the voting nodes for finality and are the main nodes on Testnet generally, so their state affected the whole network.
Synchronisation Issues (Fixed)
Bringing the nodes back online caused a flood of the Failure_Core_Past_Deadline messages while the issue with the Unconfirmed Transaction Cache above was investigated. The messages appeared to cause MongoDB memory usage to spike (again) during synchronisation. Those messages have now been cleared and the MongoDB memory management fix is in place, so synchronisation is now possible.
Fork Issues (Ongoing but likely close to complete)
A fork has occurred and the Testnet is currently sitting on two main forks, one of which is ~670 blocks ahead of the other, both with finality at 246568. The fork that is furthest ahead looks correct (from chain weight and length). We have moved most NGL nodes onto the correct fork and expect to force the remaining ones over in the next 12-24 hours.
This appears to have happened due to known rollback issues which were mentioned in the 0.10.0.4 release announcement (119 and 120). They have been fixed, but the fix cannot be applied to Public Testnet without a full reset, so it is being tested separately. As a result it will be necessary to resynchronise all nodes on the incorrect fork and force them onto the correct fork. We are testing this process right now and will issue instructions assuming it works correctly; essentially, the process is to switch off node discovery for the initial synchronisation and force the use of known correct nodes.
A good way to see which fork your node is on is to use this site: https://symbolnodes.org/nodes_testnet/ and compare against https://api-01.eu-west-1.0.10.0.x.symboldev.network. If you are ~600 blocks behind, you are on the incorrect fork that should have been rolled back.
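If it helps, you can also compare heights from the command line by querying the REST gateway on your own node and on a known good node. This is a rough sketch only: it assumes REST is reachable over plain HTTP on port 3000 and that your deployed build serves the /chain/height route (other builds expose the height via /chain/info instead), so adjust the host, port and route to match your deployment.

# height of a known good NGL node (host taken from the link above; port and route are assumptions)
curl -s http://api-01.eu-west-1.0.10.0.x.symboldev.network:3000/chain/height
# height of your own node
curl -s http://localhost:3000/chain/height
# if your node reports a height roughly 600+ blocks lower, it is on the fork that needs to be rolled back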
Next Steps
The above hopefully makes sense; I am happy to answer any questions about the bits that don't. In terms of how we move forward from here, the immediate priorities are:
- Get NGL nodes onto the correct fork and publish instructions so we can get Testnet working normally
- Continue investigation and resolution for the Unconfirmed Transaction Cache issues
- Patch Testnet and re-run the regression test(s) and stress tests (this takes 2-3 days following the patch)
Delay Launch or Not
There have been several questions about whether there will be a delay to launch and, if there is, whether the snapshot date will move.
The plan and estimates have always been communicated as contingent on successful testing. To date we have managed to recover from multiple challenges with only minor changes to the date, but we are clearly now in a scenario where a delay is a very real risk.
- The remainder of this week is critical to answering those questions. What is shared above is all the information that is known; as more is known it will be shared. The end of this week is a clear cut-off in terms of decisions. If possible a decision will be made sooner, but it depends on the ongoing work. Clearly the next question, if a delay does occur, is “how long”, which we are unable to answer until the investigation work concludes (the target is later this week); that is the other component of the conversation.
- Whether the snapshot date/block height moves with the launch or stays the same: this has strong opinions on both sides and will be put to a community PoI vote IF a delay occurs.
Any decision to delay is based on the requirement to launch an appropriately tested and robust platform that sets Symbol up for success for years to come. Launching with a known serious issue or forcing a launch when not appropriate will ultimately cause more reputational (and price) damage than a delay. However, if the issues can be resolved reliably and in enough time to allow launch to continue, then a robust and tested platform can be launched.
I will post further updates this week as and when possible.