I have one question about Fuzed. (Turned out to be a few questions, oops.)
What happens when a Master node dies? (Or, say its EC2 instance kicks the bucket, rare but possible. Or, in the case of a non-EC2 deployment, the hardware just fails.)
My limited understanding of Erlang leads me to think the unhandled death of a Master node is the death of all the nodes in the cluster.
Further, I understand this does not have to be. I remember there being some way for one node to respond to the death of another, handle that death, and let things continue along.
Is anything like this set up in Fuzed? Would this be handled by having multiple master nodes for a cluster, and they watch each other?
Would there be any difference in setting up Fuzed to handle the death of the Master node process and the disappearance of the Master node instance?
When a master node dies, all the worker nodes go into a hibernation state: they ping the master's previous hostname until they can reconnect to that host, and then they re-register their resources with the master's resource_fountain.
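To make the hibernate-and-reconnect cycle concrete, here is a rough, language-neutral model in Python. It is not Fuzed's actual Erlang code (master_beater.erl is the real thing). The `Worker` class, the `ping` callback, and the plain dict standing in for resource_fountain are all made up for illustration.

```python
from typing import Callable, Dict, List


class Worker:
    """Toy model of a worker that hibernates while the master is unreachable."""

    def __init__(self, name: str, resources: List[str], ping: Callable[[str], bool]):
        self.name = name
        self.resources = resources
        self.ping = ping            # hypothetical reachability check, True if host answers
        self.state = "connected"    # assumes the worker starts out already registered

    def tick(self, master_host: str, fountain: Dict[str, List[str]]) -> str:
        """Run one heartbeat against the master's hostname and return the new state."""
        alive = self.ping(master_host)
        if self.state == "connected" and not alive:
            # Master vanished: stop serving and keep polling its old hostname.
            self.state = "hibernating"
        elif self.state == "hibernating" and alive:
            # Master is back, possibly with empty state: register everything again.
            fountain[self.name] = list(self.resources)
            self.state = "connected"
        return self.state


if __name__ == "__main__":
    master_up = {"value": True}
    fountain: Dict[str, List[str]] = {}
    worker = Worker("worker1", ["rails_pool"], lambda host: master_up["value"])

    master_up["value"] = False
    print(worker.tick("master.example", fountain))
    master_up["value"] = True
    print(worker.tick("master.example", fountain))
    print(fountain)
```

The point of the model is that the worker never gives up and never exits; it only flips between two states, and re-registration happens exactly once per reconnect.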
One of the next major features for Fuzed is removing the master as a single point of failure. We're exploring two options currently:
1. When a master dies the cluster can re-elect a new master on one of the machines in the cloud and everyone re-registers their assets. This process assumes that master death is relatively rare, and so minimizes the resources necessary for redundant operation at the cost of a slight time gap in service while the master is re-elected and resources are re-registered to it.
2. Clusters can create numerous masters which all maintain identical state. When one dies, another master will move forward and become the primary master. This approach requires more hardware in the cloud, but even if master faults are common, there is no gap in service.
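As a rough illustration of option 1, the sketch below models re-election followed by re-registration. Neither option is implemented yet, so everything here is hypothetical: the function names, the "lowest node name wins" rule, and the dict used as the registry are assumptions chosen only because they are easy to follow.

```python
from typing import Dict, List, Optional, Tuple


def elect_master(live_nodes: List[str]) -> Optional[str]:
    """Pick a new master from the surviving nodes.

    Illustrative rule (an assumption): the lowest node name wins, so every
    node that sees the same survivor list reaches the same answer.
    """
    return min(live_nodes) if live_nodes else None


def fail_over(
    dead_master: str, resources_by_node: Dict[str, List[str]]
) -> Tuple[Optional[str], Dict[str, List[str]]]:
    """Model option 1: elect a new master, then have every survivor re-register."""
    survivors = {n: r for n, r in resources_by_node.items() if n != dead_master}
    new_master = elect_master(list(survivors))
    # The new master starts with no state; the registry is rebuilt from scratch.
    registry = {node: list(res) for node, res in survivors.items()}
    return new_master, registry


if __name__ == "__main__":
    cluster = {"master": [], "node_b": ["pool2"], "node_a": ["pool1"]}
    print(fail_over("master", cluster))
```

The service gap mentioned in option 1 corresponds to the time between the master dying and `registry` being fully rebuilt. Option 2 avoids that gap by keeping the registry replicated ahead of time.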
As for the difference between process death and machine death: yes, there are differences. If you're interested in how we handle it, please check out master_beater.erl (great name, huh?), which is a gen_fsm that worker nodes use to eagerly reconnect to the master. Also check out fuzed.app and fuzed_supervisor.erl. Erlang provides very good facilities for handling this.