All,
An update on the issues highlighted by the stress test; for context, additional background is available here.
Summary
- Original summary (22/12) here
- @Jaguar0625’s tweet here (23/12)
- Follow-up summary here
The investigation is ongoing but is making progress, and various issues have been fixed. I want to share as much of that as is practical here. A reminder of what led to this:
- Stress testing on private networks had been ongoing since before the Testnet launch and showed no issues
- 18-19 Dec: Planned stress testing to the target of 100 tps took place and was passed
- 19-21 Dec: The load was pushed to 130 tps; the public Testnet behaved differently from the internal networks tested before
- This exposed two issues: MongoDB memory usage and Unconfirmed Transaction Cache management
- Testnet is in an unstable state - not directly because of those issues, but they triggered it (see below)
I will cover each issue separately below. The quick version is that all issues have progressed: some have been resolved and others are ongoing. This week is critical in terms of any impact on launch.
What is written below is entirely transparent and nothing is held back. The situation progresses every day, across multiple time zones, and I will update further as soon as possible.
Unconfirmed Transaction Cache (Ongoing but progressing)
A lot of Failure_Core_Past_Deadline errors (Fixed)
This is a hypothesis but is the most likely scenario. When the MongoDB memory issue below happened, it caused the core server on some nodes to stall due to lack of memory, with various unconfirmed transactions in the cache. These were not handled appropriately and as a result were still “hanging around” when the nodes were restarted, causing the Failure_Core_Past_Deadline errors some people have seen in the logs.
Because of the number of nodes and transactions involved, some nodes had 3-4 million of these messages when trying to sync on Testnet, which, coupled with the MongoDB memory issue, made it hard to sync and caused nodes to crash repeatedly.
The spurious unconfirmed transactions have now been cleared and it is possible to synchronise a node (see the comment below on forks, though). You may still see the occasional one of these errors in the logs, but they are nothing to worry about and should clear over time.
Recovery from Stalled State (Ongoing)
There is an ongoing cache management issue that is still being investigated by the Core Developers and the NGL Test team, who were working on it before Christmas and have continued over the weekend and into this week. This is the primary focus for the Core Devs right now. The high-level version is that when the server enters the state described above, the cache needs to manage itself more appropriately.
This issue could only be seen because multiple other issues combined, which is why it was found only now.
MongoDB Memory Management Issue (Fixed)
When placed under large, concentrated load, the MongoDB component began to consume a large amount of memory, ultimately exhausting it and causing the problems described above.
Symbol Bootstrap has had a preset option available for some time to throttle the memory usage of MongoDB. Previously this was optional and no default was set; in light of the above problems, we are setting the default to cap MongoDB at 50% of total memory.
This is a relatively simple fix for node owners - install the new Symbol Bootstrap version.
npm install symbol-bootstrap@alpha
The full build should be released shortly and I will try to remember to update the above. It is also possible to set the mem_limit option manually if needed (or if you are not using Bootstrap).
Full explanation: https://nem2.slack.com/archives/C9YKR0EUX/p1605110410149200
Make sure that you set the version to 2.4 (for now this will also need to be set manually in Bootstrap)
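For anyone applying the cap by hand, the sketch below shows roughly what the manual change looks like in the docker-compose file that Bootstrap generates in its target directory. The service name db and the 4g value are assumptions for illustration only (4g being roughly 50% of an 8 GB host); check the names and sizes in your own generated file before changing anything.

version: '2.4'          # a 2.x compose file version is needed for mem_limit to apply outside swarm mode
services:
  db:                   # the MongoDB service - confirm the actual service name in your generated file
    mem_limit: 4g       # cap MongoDB at roughly 50% of the host's total memory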
Testnet State (Ongoing but Progressing)
Testnet initially had issues because the stress test overwhelmed some nodes and left them in a faulty state. Those nodes make up the majority of the voting nodes for finality and are the main nodes on Testnet generally, so their state affected the whole network.
Synchronisation Issues (Fixed)
Bringing the nodes back online caused a flood of the Failure_Core_Past_Deadline messages while the issue with the Unconfirmed Transaction Cache above was investigated. The messages appeared to cause MongoDB memory usage to spike (again) during synchronisation. Those messages have now been cleared and the MongoDB memory management fix is in place, so synchronisation is now possible.
Fork Issues (Ongoing but likely close to complete)
A fork has occurred and the Testnet is currently sitting on two main forks, one of which is ~670 blocks ahead of the other, both with finality at 246568. The fork that is furthest ahead looks correct (from chain weight and length). We have moved most NGL nodes onto the correct fork and expect to force the remaining ones over in the next 12-24 hours.
This appears to have happened due to known rollback issues which were mentioned in the 0.10.0.4 release announcement (119 and 120). They have been fixed, but the fix cannot be applied to Public Testnet without a full reset, so it is being tested separately. As a result it will be necessary to resynchronise all nodes on the incorrect fork and force them onto the correct fork. We are testing this process right now and will issue instructions assuming it works correctly; essentially, the process is to switch off node discovery for the initial synchronisation and force the use of known correct nodes.
A good way to see which fork your node is on is to use this site: https://symbolnodes.org/nodes_testnet/ and compare against https://api-01.eu-west-1.0.10.0.x.symboldev.network. If you are ~600 blocks behind, you are on the incorrect fork that should have been rolled back.
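If it helps, you can also compare heights from the command line by querying the REST gateway on your own node and on a known good node. This is a rough sketch only: it assumes REST is reachable over plain HTTP on port 3000 and that your deployed build serves the /chain/height route (other builds expose the height via /chain/info instead), so adjust the host, port and route to match your deployment.

# height of a known good NGL node (host taken from the link above; port and route are assumptions)
curl -s http://api-01.eu-west-1.0.10.0.x.symboldev.network:3000/chain/height
# height of your own node
curl -s http://localhost:3000/chain/height
# if your node reports a height roughly 600+ blocks lower, it is on the fork that needs to be rolled back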
Next Steps
The above hopefully makes sense; I am happy to answer any questions about the bits that don't. In terms of how we move forward from here, the immediate priorities are:
- Get NGL nodes onto the correct fork and publish instructions so we can get Testnet working normally
- Continue investigation and resolution for the Unconfirmed Transaction Cache issues
- Patch Testnet and re-run the regression test(s) and stress tests (this takes 2-3 days following the patch)
Delay Launch or Not
There have been several questions about whether there will be a delay to launch and, if there is, whether the snapshot date will move.
The plan and estimates have always been communicated as contingent on successful testing. To date we have managed to recover from multiple challenges with only minor changes to the date, but we are clearly now in a scenario where a delay is a very real risk.
- The remainder of this week is critical to answering those questions. What is shared above is all the information that is known; as more is known it will be shared. The end of this week is a clear cut-off in terms of decisions. If possible a decision will be made sooner, but it depends on the ongoing work. Clearly the next question, if a delay does occur, is “how long”, which we are unable to answer until the investigation work concludes (the target is later this week); that is the other component of the conversation.
- Whether the snapshot date/block height moves with the launch or stays the same: this has strong opinions on both sides and will be put to a community PoI vote IF a delay occurs.
Any decision to delay is based on the requirement to launch an appropriately tested and robust platform that sets Symbol up for success for years to come. Launching with a known serious issue or forcing a launch when not appropriate will ultimately cause more reputational (and price) damage than a delay. However, if the issues can be resolved reliably and in enough time to allow launch to continue, then a robust and tested platform can be launched.
I will post further updates this week as and when possible.