Apart from the new features, I also reviewed the performance aspects carefully and made some enhancements for the sake of robustness.
One observation from the real production application was that GC would become a problem sooner or later, and when it comes, it comes badly. I tried using --nouse-idle-notification, which mitigated the problem noticeably, but it might just let the server run longer.
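For reference, that V8 flag is passed to the node process at startup; a minimal invocation might look like the following (server.js is a placeholder for your actual entry script):

```shell
# Disable V8's idle-notification GC runs (a V8 flag on older Node versions);
# server.js stands in for your real entry point.
node --nouse-idle-notification server.js
```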
That is OK, because cluster2 has a built-in mechanism to kill workers that have, for example, served more than 10k connections. This simple criterion is actually quite effective; if you can predict the traffic and tune the number, you can achieve pretty good performance already. But the greed inside me grows, and I'd like to solve it a bit more nicely, that's all.
And you'll see a little more added to the assertions:
The assertOld is simple, just a safety measure: don't let a worker run longer than a couple of days if you share the same concern. The assertBadGC is a simple heuristic that tries to deduce when GC is hurting performance by consuming more resources. TPS measures the performance, while the rest (CPU, memory, GC rate) reflect resource consumption; we want resource consumption to be attributed to higher TPS, not to GC, and that's what this assertion checks.
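To make the idea concrete, here is a minimal sketch of the two assertions; the names mirror the ones above, but the shapes, fields, and thresholds are my own illustrative assumptions, not cluster2's actual API:

```javascript
// assertOld: retire a worker that has been alive longer than maxAgeMs.
// worker.startedAt is an assumed field holding the spawn timestamp.
function assertOld(worker, maxAgeMs) {
  return Date.now() - worker.startedAt > maxAgeMs;
}

// assertBadGC: flag a worker whose resource consumption (CPU, heap, GC rate)
// keeps climbing between two samples while its throughput (TPS) does not.
// The sample objects {tps, cpu, heapUsed, gcRate} are hypothetical.
function assertBadGC(current, previous) {
  var tpsFlat  = current.tps <= previous.tps;
  var moreCpu  = current.cpu > previous.cpu;
  var moreHeap = current.heapUsed > previous.heapUsed;
  var moreGC   = current.gcRate > previous.gcRate;
  // We want resources spent on serving more traffic, not on GC churn:
  // flat throughput plus rising consumption suggests GC is winning.
  return tpsFlat && moreCpu && moreHeap && moreGC;
}
```

How you collect the samples (e.g. process.memoryUsage() for heap, periodic TPS counters) is up to the monitoring layer; the assertion itself only compares two snapshots.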
Well, that gives us some confidence that our workers are monitored closely. But another problem we encountered was that when multiple workers died around the same time, performance degraded, which is easy to understand: the remaining workers must handle the workload in the meantime.
The answer to it, as I tried, is to organize the workers' deaths as follows:
Queue the deaths and process them one at a time, FIFO.
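The queueing idea can be sketched roughly as below; this is an illustrative structure, not cluster2's actual implementation, and replaceFn (spawn a successor, then call back) is an assumed hook:

```javascript
// Serialize worker retirement: condemned workers enter a FIFO queue,
// and only one is replaced at a time, so capacity never drops by more
// than a single worker even if many workers are condemned at once.
function DeathQueue(replaceFn) {
  this.queue = [];            // workers waiting to die, FIFO order
  this.draining = false;      // true while one replacement is in flight
  this.replaceFn = replaceFn; // replaceFn(worker, done): kill worker,
                              // spawn successor, call done() when ready
}

DeathQueue.prototype.push = function (worker) {
  this.queue.push(worker);
  this.drain();
};

DeathQueue.prototype.drain = function () {
  if (this.draining || this.queue.length === 0) return;
  this.draining = true;
  var worker = this.queue.shift();
  var self = this;
  this.replaceFn(worker, function onReplaced() {
    self.draining = false;
    self.drain(); // move on to the next condemned worker, if any
  });
};
```

Whenever an assertion condemns a worker, push it onto the queue instead of killing it directly; the queue guarantees the deaths are spread out rather than clustered.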
A few other performance-related enhancements were added, but the above two are the biggest changes. I'm still testing the heuristics and will update the post after collecting more results.