Symbol launch issues & Testnet update (06-Jan-2021)

All,

It has been a week or so since the last full update, so I am collecting things together here to show what is happening.

The short version: a lot is happening, and it is ongoing but showing signs of concluding soon. Testnet is going to have to be reset, but that should have no additional impact on the launch date beyond what is already expected.

Testnet Current State (0.10.0.4 release):

  • Forks have been resolved and the network has come back together. It is producing blocks, and projects and exchanges are actively developing against it.
  • Testnet Finality is stopped and will not be able to restart; this is explained further below.
  • Apart from Finality, Testnet is functioning normally and is usable. It remains behind the mainnet code, which has the deep rollback patch and now multiple others from the issue resolution. Patches are discussed in more detail below.
  • We are aware some exchanges and community projects are using the current Testnet for development, integration, listing work etc. It will remain up and NGL nodes will keep running this version. Community nodes are appreciated if they can be left running, but if you are running multiple it is probably fine to scale back to 1 or 2 and save costs.

Issues Progress

  • MongoDB memory usage was part of the problem: it grew too large and, coupled with core-server usage, consumed all resources. An alpha version of Bootstrap has been updated with MongoDB memory limiting, which helps nodes sync while the issues are being resolved. This will be reviewed to see if it is still necessary after the fix below.
  • An additional MongoDB setting to manage the cache more aggressively has been tested successfully with little/no impact on performance and is being added to Bootstrap for the next public release.
  • Core-server memory usage also grew alongside the above issues. RocksDB memory usage is being profiled and reviewed to see if something similar to the above can be achieved on core-server; the issues below will also help with this.
  • ‘Unconfirmed transaction’ issue investigation: recent testing made good progress and more work is ongoing today and tomorrow. Debugging is progressing and we have managed to reproduce the issue in an internal environment, which is a huge step in the right direction.
  • A node stalling/hanging issue has been found as part of root cause analysis for the unconfirmed transactions. It appears to be unrelated to load but happened to occur in the load test (see below). It has been reproduced and is being debugged this week.

These issues, combined with the absence of the Deep Rollback patch, compounded each other to cause the bigger problem. Work is ongoing on the Node Hanging, RocksDB and Unconfirmed Transaction issues, with strong lines of investigation on all. A bit more time is needed to confirm a concrete plan for these, but they are progressing well.
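
For context on the MongoDB cache point above: MongoDB documents that the WiredTiger internal cache defaults to the larger of 50% of (RAM minus 1 GB) or 256 MB, and that it can be capped with the `--wiredTigerCacheSizeGB` option (or `storage.wiredTiger.engineConfig.cacheSizeGB` in the config file). The sketch below only works through that arithmetic. The 8 GB node size and the 2 GB cap are illustrative assumptions, not the values used in Bootstrap, and the post does not say which exact setting was changed.

```python
from typing import List


def default_wiredtiger_cache_gb(total_ram_gb: float) -> float:
    """MongoDB's documented default: max(50% of (RAM - 1 GB), 0.25 GB)."""
    return max(0.5 * (total_ram_gb - 1.0), 0.25)


def mongod_args(cache_cap_gb: float) -> List[str]:
    """Build a mongod command line with an explicit cache cap.

    --wiredTigerCacheSizeGB is a real mongod option; the cap value passed
    in here is an assumption chosen for illustration.
    """
    return ["mongod", "--wiredTigerCacheSizeGB", str(cache_cap_gb)]


if __name__ == "__main__":
    ram_gb = 8.0   # hypothetical node size
    cap_gb = 2.0   # hypothetical cap
    print(default_wiredtiger_cache_gb(ram_gb))  # 3.5
    print(" ".join(mongod_args(cap_gb)))        # mongod --wiredTigerCacheSizeGB 2.0
```

On a node that also runs core-server, an uncapped default of this size leaves less headroom for RocksDB and the server itself, which is why a tighter cap can help.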

What has been happening since last update (30/06)

  • The core developers and test team have been working exceptionally hard unpicking the findings from testnet, reproducing what was found and applying patches and/or tuning based on the learnings, then repeating the process.
  • It is slow and painstaking work because they need to try multiple scenarios on different node types under different loads to consider all angles.

We now have fixes in progress or complete for all but two areas:

  • An issue that can cause a node to hang (the issue has been reproduced, root cause analysis is ongoing)
  • RocksDB memory optimisations to reduce the memory consumed, work is ongoing

In addition, an entirely new network with 500 nodes has been stood up internally to allow the tests to be rerun on a similar environment. On that network:

  • All the latest patches are applied, which has helped us get to a high degree of confidence about what happened on Testnet.
  • We have run a 130tps test which was passing for 6 hours before we encountered the node hanging/stalling issue again. That issue had previously been very hard to reproduce; we can now reproduce it reliably.

There is good reason to believe that, with that issue fixed, the test would have continued passing and degraded gracefully as it maxed out. That issue is the primary focus right now and is unrelated to load (it just happens to be easier to see with more load, but could also occur at 1tps from what we have seen).

Testnet – What happened (result of forensics)

Based on having been able to reproduce all the issues and debug them, the below is what is believed with high confidence to have happened on Testnet. It is a layering of a few issues that compound each other:

  • A small number of nodes got stuck on unresolvable forks while holding a large number of unconfirmed transactions. The nodes that did not hang continued without them and confirmed the transactions. The hung nodes then unhung or restarted and kept trying to sync and broadcast large numbers of unconfirmed transactions to good nodes that had already confirmed them, causing a DDoS-like condition and a memory spike. It also led to various forks being created, which then grew beyond 600 blocks.

  • A known issue was present with deep rollbacks (>600 blocks) on Testnet, which was fixed several weeks ago for mainnet release, but could not be applied to Testnet due to being a breaking change.

  • This made it impossible to recover from the forks when they happened.

  • This extra work, effectively DDoS-like behaviour, caused memory to spike (on MongoDB and Core-Server, we think), which caused more nodes to fall over as memory reached its upper limits.

  • Unfortunately, the way the voting nodes were scaled meant that this impacted a supermajority of the nodes and as a result they failed to keep voting. It has been impossible to recover them to a state where they will either vote again or approve a change to the voting node membership. Essentially they don’t exist anymore but are still in the voting pool, which is why finality has stalled and can never restart on that Testnet. We have learned some very useful lessons from this (and tested them on the devnet) and have a workaround/fix for the next deployment.
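
To illustrate the last point: a round can only finalize if the voters that are still online hold at least a threshold share of the total voting weight, and lost nodes still count towards that total while they remain in the voting pool. The sketch below is a toy model only. The two-thirds threshold, the node names and the weights are all assumptions for illustration, not the actual Testnet configuration.

```python
from fractions import Fraction

# Assumed threshold for illustration; the real value is not given in this post.
THRESHOLD = Fraction(2, 3)


def can_finalize(weights, online, threshold=THRESHOLD):
    """Return True if online voters hold at least `threshold` of total weight.

    `weights` maps voter name -> voting weight (hypothetical values);
    `online` is the set of voters currently able to vote.
    """
    total = sum(weights.values())
    if total == 0:
        return False
    live = sum(w for name, w in weights.items() if name in online)
    return Fraction(live, total) >= threshold


if __name__ == "__main__":
    weights = {"a": 40, "b": 30, "c": 20, "d": 10}  # hypothetical voting pool
    print(can_finalize(weights, {"a", "b", "c", "d"}))  # True: everyone votes
    print(can_finalize(weights, {"c", "d"}))            # False: 30% is below 2/3
```

In the second case the remaining voters can never reach the threshold on their own, and because changing the voter set itself needs approval from the existing voters, the pool cannot repair itself either.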

What needs to be fixed/has been fixed?

  • A patch has been created and is in final testing which will cap the maximum queue sizes of the unconfirmed transactions (and other caches), rejecting transactions rather than letting the queue build up for later processing. This gives protection from large queues and effectively caps tps.

  • A patch is being created that will more aggressively defend against attempts to sync older transactions, such as the unconfirmed ones in this instance; they will be rejected.

  • A setting has been applied to MongoDB that more aggressively manages the size of the WiredTiger cache. This has been tested and found to constrain MongoDB RAM usage far more effectively, with little/no impact on performance.

  • The deep rollback patch has been incorporated into a new build so that if forks occur, they can be recovered from in the scenario that occurred

  • We have some RocksDB work to do to ensure the memory management is aggressive enough to not respond with a spike in this type of scenario

  • The Node Hanging issue, which appeared to be the spark for the bigger fire, has been reproduced and is being resolved

  • There are changes to incorporate in the bootstrap tools which will apply a better set of defaults into the standard deployment to incorporate all of the above and a couple of easier debug modes
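
The first two patches in the list above (capping queue sizes and rejecting transactions that are already confirmed) can be pictured with a small sketch. Everything in it, including the `UnconfirmedPool` class, `max_size` and the hash strings, is hypothetical and is not how catapult implements it.

```python
from collections import OrderedDict


class UnconfirmedPool:
    """Toy bounded pool of unconfirmed transactions (hypothetical model)."""

    def __init__(self, max_size, confirmed_hashes=None):
        self.max_size = max_size
        self.confirmed = set(confirmed_hashes or [])
        self.pending = OrderedDict()  # tx_hash -> payload, in arrival order

    def offer(self, tx_hash, payload):
        """Try to add a transaction; reject instead of queueing without bound."""
        if tx_hash in self.confirmed:
            return "rejected: already confirmed"
        if tx_hash in self.pending:
            return "rejected: duplicate"
        if len(self.pending) >= self.max_size:
            return "rejected: pool full"
        self.pending[tx_hash] = payload
        return "accepted"

    def confirm(self, tx_hash):
        """Move a transaction from pending to confirmed (e.g. on a new block)."""
        self.pending.pop(tx_hash, None)
        self.confirmed.add(tx_hash)


if __name__ == "__main__":
    pool = UnconfirmedPool(max_size=2, confirmed_hashes={"t0"})
    for h in ("t0", "t1", "t2", "t3", "t1"):
        print(h, pool.offer(h, b""))
```

Rejecting at the door keeps memory bounded: a peer that replays thousands of old transactions costs the node a set lookup each, not a growing queue.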

It is expected that this resolution work will take this week and run into early next week. Further updates will be provided as things progress.

Next Steps

Testnet needs to be reset. This will occur as soon as practical, currently likely during the week commencing 11th Jan, though this is subject to change as testing continues.

This reset is not expected to change the launch date any more than the new release would have done anyway. The new Testnet will need to soak for a minimum of 1 month from release, assuming the testing below passes:

  • The new testnet will have 500 nodes, all of the latest patches, including deep rollback, and be fully reset.
  • It will then have stress tests conducted on it by NGL before it is released (we will reset our large devnet for this)
  • We will ask NEMTus to perform their testing event, triage any issues raised, and coordinate further with Gotoh-san as we have more information
  • A confirmed date will be communicated once the testing concludes and results can be analysed. At this stage it looks like mid-late Feb, but we cannot be more specific as it depends on the results of testing

The target for all of the above is to complete by the end of the week commencing 11th of Jan, but there is still work to do. At that time we expect to have a more concrete position/plan on Snapshot and Launch dates.

It will be a few days before the next update, as the next phases will take some time due to the size of the testing work involved. I will update as soon as there is more to report.

What tests are planned for the future?
If there are any operations that should be tested intensively, we will cooperate.

I think there are things that even individual users can do.

Thank you @GodTanu, I will speak to the team and make sure I give you the correct information.

We have an automated test suite which is run and documented in code. I just need to check the exact manual tests to give you the right information :+1:

For individual users, I think testing everything in the wallet is definitely very useful, especially the more advanced features (aggregates, multi-sig, restrictions), but I will get a better list shortly.

Sweet! Awesome breakdown. Much appreciated as always

@GodTanu, I hope this translates ok, let me know if not:

The tests from right now will be:

  1. Rebuild the internal test net with 500 nodes and reset.
  2. Run stress at 130tps for 12 hours
  3. If that passes, then increase the load a lot (300-1000tps) until it overloads, stop, and check that it recovers
  4. If that passes, release to the community

The above is happening over the next 2-4 days.
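
For a sense of scale, the numbered plan above can be worked through in a few lines. The 100tps ramp step is an assumption for illustration; only the 130tps / 12 hour baseline and the 300-1000tps range come from the plan.

```python
def total_transactions(tps, hours):
    """Number of transactions sent at a constant rate for a number of hours."""
    return tps * hours * 3600


def ramp_steps(start_tps, stop_tps, step_tps):
    """Load levels for the overload phase, inclusive of both ends."""
    return list(range(start_tps, stop_tps + 1, step_tps))


if __name__ == "__main__":
    # Step 2: 130tps sustained for 12 hours.
    print(total_transactions(130, 12))   # 5616000
    # Step 3: ramp from 300 to 1000tps; the step size is hypothetical.
    print(ramp_steps(300, 1000, 100))    # [300, 400, 500, 600, 700, 800, 900, 1000]
```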

After that, we will wait for community nodes to join, followed by some community testing and NEMTus testing, and then run a stress/performance test on the public network to make sure the results are still ok.

The standard testing covers:

  1. The “normal” chain transactions (transfer, aggregate, multi-sig, proofs etc)
  2. Finality and voting. More nodes the better.
  3. Stress and perf testing
  4. Harvesting with different configs (remote, local, delegated)
  5. Monitoring (good to have some community members verify the setup)
  6. Restrictions
  7. Rollbacks of nodes
  8. CLI

There are also some manual tests which are not documented in a way that is easy to communicate just now; they focus on specific patches.

There is a repo for this testing as well which documents the tests in the code:

Any individual testing that the community can do is very useful, and we encourage everyone to test as much as possible after the Testnet is reset. It will make Symbol better for everyone.

By "results of testing", do you mean the tests you are doing this week and at the beginning of next week? Or are you talking about other testing that will happen later? @DaveH

Correct, the testing that is planned for next week (including the NEMTus testing and our subsequent stress test). It won’t be complete at the beginning of the week, more likely the end. See the message above your reply for details of what it includes.
