500 Nodes + Performance Test

Yes, I caught Jaguar’s tweets in real time and sent him a mention.
The core developers are working hard.

2 Likes

Is there a disconnect between NGL and the core developers in how they perceive things?

No, we are all working together as a single team to work through the issues

Why were the stress tests performed while most of the staff were on vacation in the first place?

Several stress tests have been performed in development environments over the past few months; they did not show issues. The first tests on Testnet also showed no issues up to 100tps. It was the last test(s), meant to go above 100tps, that found the issue(s).

They may have been able to be performed earlier (as per Jag’s tweet), but they would not have been on the final code version, and Testnet needed to be recovered from the upgrade issue; so they would still have had to be retested on the bigger Testnet chain, because it is the place that has a Mainnet-like network and chain.

Often the time estimates are really sweet

Sorry, I don’t understand the term ‘sweet’ in this sentence; I saw it in some of the machine translations on Twitter as well and didn’t understand those either. I will ask someone on our team to interpret for me and reply if possible.

His word “sweet” means “loose” or “lax”.

1 Like

And “sweet” also means “thoughtless”.

1 Like

Are the specifications for Symbol’s tps performance still undecided?
Is it in the process of being tuned?

The below is written with the best information we have at this time, some of it may change as investigation continues.

I hope it helps with some of the conversations. The summary is that we all want the same answers (including the wider community, devs, me and the rest of the management team). It is not that information is known and is not being published; the investigation needs to complete before there is information to communicate, and that takes time. I know it is hard given the stage of the plan, the time of year, the length of time waited already, etc., and I am sorry to have to ask for further patience; however, there is no other option than to ask for it while the investigation work completes.

In terms of availability: I will be mostly online all through the festive period and on most of the normal channels, with the exception of the 25th of Dec when I will be with family. Feel free to ping me on whichever one suits you best, and I hope everyone who is celebrating has an enjoyable Christmas and New Year period.

DaveH


Are the specifications for Symbol’s tps performance still undecided?

The specifications were to meet 100tps, which it passed.

Is it in the process of being tuned?

The 130tps test was to check how it would respond if 100tps was exceeded. It found two problems (copied from Jag’s tweet); some of this is node/network tuning, but the tps target itself is not being tuned/changed.

1. MongoDB usage / configuration => causes large memory usage and crashes in broker and REST

This is possibly config/tuning/throttling, but investigation is ongoing.

2. Unconfirmed transactions => causes large memory usage in server

This is being looked into in the code to see how best to handle the situation; the existing approach catches most, but not quite all, of it. It may also be tuning/config, but investigation is ongoing.
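To make issue 2 a little more concrete: the usual defence is to bound the unconfirmed-transaction cache and evict or reject the cheapest entries instead of letting memory grow without limit. Below is a minimal sketch of that pattern in Python; the names are invented for illustration, and this is not Symbol’s actual implementation, which is what the investigation concerns.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class UnconfirmedTx:
    fee: int                             # eviction priority: lowest fee goes first
    tx_hash: str = field(compare=False)  # excluded from ordering

class BoundedUnconfirmedCache:
    """Hold at most max_size unconfirmed transactions in a min-heap
    keyed on fee, so memory stays bounded under a tps spike."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._heap: list[UnconfirmedTx] = []

    def add(self, tx: UnconfirmedTx) -> bool:
        if len(self._heap) < self.max_size:
            heapq.heappush(self._heap, tx)
            return True
        if tx.fee > self._heap[0].fee:
            heapq.heapreplace(self._heap, tx)  # evict the cheapest entry
            return True
        return False  # cache full and fee too low: reject outright

cache = BoundedUnconfirmedCache(max_size=2)
cache.add(UnconfirmedTx(10, "a"))
cache.add(UnconfirmedTx(30, "b"))
print(cache.add(UnconfirmedTx(5, "c")))   # False: full, fee too low
print(cache.add(UnconfirmedTx(40, "d")))  # True: the fee-10 tx is evicted
```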


A few other questions from Telegram etc. that I will try to wrap up together, paraphrased:


Given the test at 100tps was successful, can we just cap it at 100tps for now?

This is being looked at as an option. If it is done, nodes and the network must be able to handle and recover from numbers above that cap (see issue 2 above). It requires thought and consideration to ensure it can defend against a DoS-type issue if the tps were to spike. Testnet must also be running normally to be able to retest it (see below) if it is capped.

So yes, it is an option, but to get to that option the investigation needs to complete, and it is ongoing.
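For anyone wondering what such a cap usually looks like mechanically, a standard pattern is a token bucket that admits transactions at the target rate and rejects the excess early, before it can consume node memory. A minimal sketch in Python; this shows the generic technique only, not NGL’s plan or Symbol’s actual code:

```python
import time

class TokenBucket:
    """Admit on average `rate` transactions per second, with short
    bursts up to `capacity`; everything above the cap is rejected
    early, before it can consume node memory."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # refill rate, tokens per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_admit(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the cap: drop or defer the transaction

# e.g. cap ingestion at the 100 tps spec, allowing small bursts
throttle = TokenBucket(rate=100.0, capacity=200.0)
```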

Why is Testnet not running normally?

To perform the final stress tests, Testnet had to scale to a Mainnet number of nodes, and the number of voting nodes had to scale too. NGL scaled these as planned, but that meant a super-majority of nodes were run by NGL as a result, which is not a normal Mainnet scenario.

Testing caused the NGL nodes to fail due to the issues above. This left the NGL nodes in a poor state (they did not fail gracefully); those nodes now need to be rectified, which is being worked on in parallel, and updates will be provided as progress occurs. The network is still generating blocks, but not finalising; I have synchronised a dual node overnight from block 0.

As the troubleshooting occurs, it is likely to place load/errors on other nodes, so non-NGL nodes may experience unstable performance in the meantime, which some people are seeing.

On Mainnet this ownership and hosting is decentralised; rather than a single test pool at NGL, a super-majority of distributed nodes would need to fail in a similar way for this to occur. By resolving the problems noted by Jag above, nodes would fail gracefully and recover. So the current Testnet behaviour is a direct result of having to scale Testnet centrally for testing.

We will update as information is available on getting Testnet running normally.

Is launch going to be delayed?

This risk exists, has been known to exist, and has been communicated throughout this plan. Until the above investigation is complete, it is not yet known whether a delay is necessary. The decision to delay will not be taken lightly (obviously), and IF it needs to be taken, information is needed to be able to say for how long. That information is not available until the investigation completes.

Jag has taken the step of asking NemTus to postpone testing until investigation is more advanced for the same reason, thank you @h-gocchi for responding quickly and co-ordinating the change.

There is no intention to force a launch of something that is not ready, nor is there an intention to delay unless it is necessary; investigation by the devs needs to complete before this can be known either way. The teams are working actively (the majority are generally present during the Christmas period).

Estimations being Sweet/Lax/Loose/Thoughtless/Etc

Thanks for the clarifications above on the term @tresto @GodTanu

Estimations are made with the best knowledge present at the time they are made; they are agreed and communicated jointly with the Core Developers, NGL Developers and NGL Exec team prior to any communication.

They have always been communicated as estimates with a level of risk. Risk decreases over time but cannot be removed entirely. They have been adjusted over time as more information has become known, and as more information comes out of this investigation they will be adjusted if necessary, or it will be confirmed that no adjustment is needed.

18 Likes

Thank you, and congratulations on the courage to communicate about these details. This is the kind of transparency that is needed, and most wanted, in situations like this.

8 Likes

Thanks for your report.
I can understand if unforeseen problems arise and the launch date is postponed to resolve them. However, I think it was not a good idea to announce the number of snapshot blocks, etc. when the outlook was not yet definite. This is because a postponement announcement (or anxiety about postponement) will cause a lot of people (mostly the information-poor) to lose money in the market. I think you should take this into consideration a little more.

6 Likes

If the mainnet launch is delayed (I hope it won’t be), then you should leave the snapshot date as it is, on 14th January 2021, because I think that for investors and exchanges that is the most important date, and it should not be changed.

3 Likes

Let’s handle this conversation once we know if it is necessary or not. There are strong opinions on both sides of it, and I think IF we end up in that situation, a PoI vote is the sensible way to decide it.

5 Likes

Hi DaveH!

First of all, thanks for such a detailed report. We appreciate your communication, and I believe that the team is doing their best.
However, one point remains unclear to me. You wrote:

The specifications were to meet 100tps, which it passed.

So technically the testnet works as it should, doesn’t it? It looks a little strange that we’re going to postpone a release which, hmm, works as intended. You may notice that some explanation of the possible postponement was given in your message:

It requires thought and consideration to ensure it can defend from a DoS type issue if the tps were to spike.

But what’s the point of getting 130 tps to defend against DoS? Why is it so important to reach precisely 130 tps to defend against DoS? And does it mean that at 100 tps the network is unstable and has a high risk of DoS? I really don’t understand why 130 tps is much safer than 100; can you explain it?

For people who are not so involved in the development, the whole situation looks like an example of over-perfectionism. Maybe I’m wrong, so I’d like to hear your opinion on these details.

5 Likes

Thanks for the answer.

If you don’t mind, please state the specifications clearly.
I would like you to finalise the product specification.

1. Is the TPS performance specification 100? I want to know the final specification.
2. What is the performance specification including Aggregate TX? Currently, Testnet can include up to 100 Tx.
3. Is the block generation speed 30 seconds? Currently, it is 30 seconds on our testnet.

If undecided, when will this be determined?

6 Likes

I also think it’s important to make the specifications clear and to publicize them so that everyone can understand them.

It may have already been documented somewhere…

I would like to know if there is a specification for chain protection if there are more Tx than can be handled. For example, like bitcoin, they could be pooled somewhere and the creators of blocks (usually in order of highest fees) could incorporate the Tx into the block.

Translated with www.DeepL.com/Translator (free version)

2 Likes

You can reschedule the Symbol launch date. But the date of the snapshot cannot be postponed, otherwise it will be a big failure in front of the crypto community and exchanges.

7 Likes

Thanks @GodTanu, answers below:

1. Is the TPS performance specification 100? I want to know the final specification.

Correct, 100 TPS

  2. What is the performance specification including Aggregate TX? Currently, Testnet can include up to 100 Tx.

100, same as Testnet

  3. Is the block generation speed 30 seconds? Currently, it is 30 seconds on our testnet.

30 sec, same as Testnet.
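As a back-of-envelope consequence of those three numbers: at a sustained 100 TPS with 30-second blocks, a full block carries about 3,000 transactions.

```python
tps = 100          # target transactions per second (spec above)
block_time_s = 30  # block generation time in seconds (spec above)
print(tps * block_time_s)  # 3000 transactions per block at sustained peak
```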

1 Like

Answered above: 500 Nodes + Performance Test

1 Like

Hi @Alex_Sorokin

A couple of points which will hopefully clear it up.

So technically the testnet works as it should, doesn’t it? It looks a little strange that we’re going to postpone a release which, hmm, works as intended

Yes, it achieves the TPS required; however, it did not handle the situation elegantly when that was exceeded (hence why Testnet is running strangely just now), which needs to be investigated before assuming it “works as intended”.

But what’s the point of getting 130 tps to defend against DoS

130 in this context is basically arbitrary (it could be 101, 110, 500, etc.). It would probably have been better said as: when traffic exceeds the target TPS, performance/effectiveness/processing should degrade predictably/gracefully.

Rather than over-perfectionism, it is more a case of ensuring Mainnet can cope with a scenario that exceeds the target transactions per second by appropriately managing the unconfirmed Tx cache. It is not to say it will serve 130 TPS, just that it will handle the situation robustly and reliably.

@kitsutsukick

I would like to know if there is a specification for chain protection if there are more Tx than can be handled. For example, like bitcoin, they could be pooled somewhere and the creators of blocks (usually in order of highest fees) could incorporate the Tx into the block

There is: it is the unconfirmed transaction cache. However, as per Jaguar’s tweet, an issue was identified with it in the stress test, which is being investigated.
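To illustrate the bitcoin analogy from the question: transactions that do not fit in the current block wait in that cache, and block producers typically draw from it in fee order. A minimal illustrative sketch of the selection step (hypothetical names; not Symbol’s actual harvesting code):

```python
def select_for_block(pool: list[dict], max_tx_per_block: int) -> list[dict]:
    """Pick the highest-fee transactions from the unconfirmed pool
    for the next block; the rest keep waiting in the cache."""
    by_fee = sorted(pool, key=lambda tx: tx["fee"], reverse=True)
    return by_fee[:max_tx_per_block]

pool = [{"hash": "a", "fee": 50}, {"hash": "b", "fee": 10}, {"hash": "c", "fee": 75}]
print([tx["hash"] for tx in select_for_block(pool, max_tx_per_block=2)])  # ['c', 'a']
```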

4 Likes

Hi Dave,
I am watching the situation with the launch of Symbol, and it looks very similar to intentional manipulation. You inform the community about the risk of another launch postponement, referring to the fact that the testnet starts to behave incorrectly when the number of transactions per second exceeds the limits you specified.

You yourself wrote that the test network works exactly as planned, but for some reason you start to dramatise and come up with scenarios and non-existent ideals that are one thing today and another tomorrow.

Why don’t you want to follow the example of business giants like Mercedes-Benz, VW, Toyota, Microsoft, Apple, Sony? All of them release their products to the market in a far-from-perfect condition, then refine them over time and sell them as an improved product, and this business model has worked for many years.

And what does the NEM team do? For years they have been trying to achieve the ideal, which is the reason for the postponement of the Catapult/Symbol launch.

With all my love for NEM, I sincerely do not understand why you are not focusing on the strengths the project has and on working correctly, but instead on scenarios that may never arise, and against this background you scare the community with a possible next postponement. XEM had not yet had time to properly rise in price when, after this scary message, it lost 40% of its price. Or maybe it’s just convenient for you to be in a continuous search for the ideal, and launching Symbol is not the goal?

6 Likes

Huh? Isn’t that one zero short? Shouldn’t it be 1000?

Have you thought about updating the product through a phased hard fork?
For some reason you guys seem to be obsessed with making and delivering every detail in the early stages of a product.
With Ethereum, the value increases despite postponement after postponement. This is because it has an excellent way of creating expectations for the future.

7 Likes