Testing Update (11-Jan-2021)
Summary
- A new 500-node Testnet has been created, currently internal only
- Patch testing has gone well
- Stress Testing passed: 100 & 150tps
- Stress Testing failed: 400tps, due to a configuration issue; being re-run
- Testnet reset and full release expected shortly (after 400tps pass)
- Memory usage is much improved, even under very heavy load
This update follows on from: Symbol launch issues & Testnet update (06-Jan-2021)
As per @Jaguar0625’s tweet on 08-Jan: https://twitter.com/Jaguar0625/status/1347611656021532675
Patches have been completed and handed to the test team for validation. So far, these look good. The Core Developers and Test teams have also been working very closely on various configuration items to improve rollback handling.
The test team have created a new 500-node network, incorporating the learnings from the previous tests, the new patches and some minor configuration changes. The tests below were run over the weekend and through Monday:
Normal Running Tests - Passed up to 150tps
The following tests have been run over the past few days:
- Automation/regression testing on an internal dev environment
- Stress test on an internal dev environment at 100tps, increased to 150tps
- Stress test on the new 500 node testnet at 150tps for ~12 hours
Summary of the 150tps stress test:
- The 150tps test finished with ~10 million transactions over ~12 hours (see the quick arithmetic sketch below)
- MongoDB stayed at around 2 GB
- The core servers remained in sync and had no memory issues
- A pass here means the network kept functioning normally and no overload occurred
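For context on the headline numbers, sustained rate, duration and total transaction count are related by simple arithmetic. A minimal sketch (Python; the values plugged in are illustrative only, and real runs ramp up and down rather than holding a perfectly constant rate):

```python
# Relate a sustained transaction rate, test duration and total transaction count.
# Illustrative values only; not a restatement of the test data.

SECONDS_PER_HOUR = 3600

def total_transactions(tps, hours):
    """Transaction count produced by a constant rate held for the given duration."""
    return round(tps * hours * SECONDS_PER_HOUR)

def average_tps(total_txs, hours):
    """Average rate implied by a total count over the given duration."""
    return total_txs / (hours * SECONDS_PER_HOUR)

print(total_transactions(150, 12))         # 6,480,000 at a constant 150tps for 12 hours
print(round(average_tps(10_000_000, 12)))  # ~231tps average implied by 10M txs in 12 hours
```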
Prior to these fixes, the network overloaded at 130tps on the public Testnet and did not recover well, so the patches are delivering a clear functional improvement.
These results are from a controlled environment with known node sizes and performance; the tests will be re-run once the new Testnet is made public, to ensure behaviour is the same with community nodes present.
Overload Test - 400tps, failed due to config
A final test was run at 400tps, which passed for ~8-10 hours; the network capped throughput at 150-200tps, which shows the patches are working. However, toward the end of the test (the final ~2 hours) the run encountered issues believed to be configuration related: rollbacks and data/packet sizes meant some nodes fell behind and could not recover. As a result the test failed, but it is being re-run with configuration amendments and is expected to pass later today or tomorrow morning (UTC). The good news is that, until the issues toward the end of the test, memory usage was much improved with the patches and remained constant.
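The "capping" behaviour described above is, in general terms, load shedding: when the offered load exceeds what the network can sustain, excess transactions are throttled rather than being allowed to overload the nodes. As a rough conceptual illustration only (this is a generic token-bucket rate limiter, not Symbol's actual implementation):

```python
import time

class TokenBucket:
    """Generic token-bucket rate limiter: admits work up to a sustained rate
    (plus a small burst allowance) and sheds the excess instead of letting
    it queue without bound."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_admit(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True   # within the cap, admit the transaction
        return False      # over the cap, shed the transaction

# Usage sketch: a burst of 400 transactions arriving at once against a 200tps cap;
# roughly the first 200 are admitted and the rest are shed.
limiter = TokenBucket(rate_per_sec=200.0, burst=200.0)
admitted = sum(limiter.try_admit() for _ in range(400))
print(admitted)  # ~200
```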
The Testnet has also been brought back into sync, which forced usage of the Deep Rollback patches, and configurations have been adjusted in collaboration with the core devs for the next test.
Immediate Next Steps
The current plan is to rerun the 400tps test, which will complete, at the earliest, late on Tuesday (UTC). If it does not pass, an additional cycle may be required; if it does pass, a decision and plan can be made to release publicly and start community testing.
A further patch is being produced today, introducing a more aggressive node-banning approach in certain scenarios; it will be included in the next release and will most likely be included in the test above.
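For readers unfamiliar with the idea, node banning here means temporarily refusing to talk to peers that misbehave (for example, sending malformed data or repeatedly falling out of protocol). A generic sketch of the bookkeeping only, not the actual Core Server change:

```python
import time

class BanList:
    """Generic peer ban-list: record misbehaving peers and ignore them until a
    ban duration expires. Conceptual sketch only; not catapult code."""

    def __init__(self, ban_duration_secs: float):
        self.ban_duration = ban_duration_secs
        self._banned_until = {}  # peer id -> ban expiry (monotonic timestamp)

    def ban(self, peer_id: str) -> None:
        self._banned_until[peer_id] = time.monotonic() + self.ban_duration

    def is_banned(self, peer_id: str) -> bool:
        expiry = self._banned_until.get(peer_id)
        if expiry is None:
            return False
        if time.monotonic() >= expiry:
            del self._banned_until[peer_id]  # ban expired, drop the entry
            return False
        return True

# Usage sketch: ban a peer for 10 minutes after repeated bad responses.
bans = BanList(ban_duration_secs=600)
bans.ban("peer-203.0.113.5")
assert bans.is_banned("peer-203.0.113.5")
```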
Further memory profiling work on the Core Server is being undertaken by the Core Devs and the NGL Test/Dev team to see if additional optimisations are possible; any that are found will be assessed for inclusion.
A further update will be provided as soon as there is more information. We are now nearing the end of the resolution work and the outcome looks positive in terms of resilience and memory usage; fingers crossed for the final testing.