I’ve been busy with this goal for the last year or so. There were two emerging features I tried to deliver: one is cluster2, and the other is logging.
I’ll explain cluster2 in this post and save logging for later, as logging is in fact more complicated, much trickier in an asynchronous world than you might think. Not to deviate from the main topic, here’s what I’ve been working on: https://github.com/inexplicable/cubejs-cluster2/tree/cluster3
I took over ownership of cluster2 from ql.io and added a couple of features that were needed for operational reasons: mainly heartbeat data to reflect performance metrics, status view & update to alter runtime behavior, GC-related optimizations, etc. All of those were nice, but they gradually added complexity to the cluster2 module, until we decided it was time for a refresh. A refresh, as we defined it, means adding more cool stuff in while, just as importantly, kicking obsolete stuff out.
A few things I discontinued:
- multiple apps, multiple ports
- non-cluster mode
  - it looks appealing to support this, especially for development environments, but later you will complain about the disconnect it causes between environments
  - and the debugging difficulty, the usual argument for it, has been solved quite differently anyway
- worker ecv
A few things I’m adding:
- cluster caching
- debugging support
- a much simplified API based on promises
Debugging is one of the most exciting pieces: it gives developers a well-defined flow of ECV control, worker signaling, node-inspector startup, and a debug view, all out of the box from the new cluster2. Built on the Bootstrap UI and node-inspector ~0.4.0, the experience has already wowed a few of my colleagues.
Cluster caching is a newly proposed feature that solves a practical problem: we’ve run into a dozen cases where workers fetch the same data and waste network bandwidth. Our earlier answer was delegation: let the master handle the data fetching for all workers and broadcast the result to all of them. It worked OK, but caused more problems than we expected: the master could fail, or even worse, become slow. And the master had to contain the data-fetching logic, which belongs in the workers only (that is where your user application lives). After carefully rethinking the problem, we found a cluster-shared cache a better fit.
As a rule of thumb, the data fetching still happens in a worker, but given the atomic-get semantics, only one worker is allowed to fetch the data, while the others simply use the result when it comes back. Check cluster2/cache, cluster2/cache-usr, and cluster2/cache-mgr and you will see how it works. There is still quite some tuning to be done, but the model has proven to solve the data-sharing problem.
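The atomic-get idea can be sketched in a few lines of plain JavaScript. The names below are illustrative, not the real cluster2/cache API: the first caller of `get(key)` runs the loader, and concurrent callers for the same key wait on the same in-flight promise, so the data is fetched exactly once. (Error handling is omitted for brevity.)

```javascript
const store = new Map();    // resolved values
const inflight = new Map(); // pending loads, keyed the same way

function get(key, loader) {
  if (store.has(key)) return Promise.resolve(store.get(key));
  if (inflight.has(key)) return inflight.get(key); // join the in-flight fetch
  const pending = Promise.resolve()
    .then(loader)
    .then((value) => {
      store.set(key, value);
      inflight.delete(key);
      return value;
    });
  inflight.set(key, pending);
  return pending;
}

// Demo: three concurrent gets for the same key trigger one load.
let loads = 0;
const load = () => new Promise((resolve) =>
  setTimeout(() => { loads += 1; resolve('data'); }, 10));

Promise.all([get('k', load), get('k', load), get('k', load)])
  .then((values) => console.log(values.join(','), '/ loads:', loads));
// prints "data,data,data / loads: 1"
```

In the real cluster setting the `store` and `inflight` bookkeeping live with the cache manager rather than in each worker, but the single-fetch guarantee is the same idea.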
It’s now under testing and will be released as 0.6.x, which is a big, big incompatible change, omg. I hope it’s worth it for our users.