I have one question about Fuzed. (Turned out to be a few questions, oops.) What ...

KirinDave · on June 6, 2008

I have a few answers for you:

When a master node dies, all the worker nodes go into a hibernation state and ping the master's previous hostname until they can reconnect to that host. Once reconnected, they re-register their resources with the master's resource_fountain.
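For illustration only, the hibernate-and-reconnect behaviour described above can be sketched roughly like this. It is not Fuzed's actual Erlang code: the function name, the `ping` and `register` callables, and the resource names are all hypothetical stand-ins, and Python is used purely to keep the sketch short.

```python
import time
from typing import Callable, Iterable, List


def hibernate_until_master_returns(
    ping: Callable[[], bool],
    register: Callable[[str], None],
    resources: Iterable[str],
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Ping the master until it answers, then re-register every resource.

    `ping` and `register` are hypothetical hooks, not Fuzed APIs.
    Returns the number of failed pings seen before the reconnect.
    """
    failed = 0
    while not ping():          # master still unreachable: stay hibernated
        failed += 1
        sleep(interval_s)      # wait before trying the same hostname again
    for resource in resources: # master is back: hand our resources over again
        register(resource)
    return failed


if __name__ == "__main__":
    # Simulated master that is down for two pings and then comes back.
    answers = iter([False, False, True])
    registered: List[str] = []
    misses = hibernate_until_master_returns(
        ping=lambda: next(answers),
        register=registered.append,
        resources=["pool_a", "pool_b"],
        sleep=lambda _seconds: None,  # skip real waiting in the demo
    )
    print(misses, registered)
```

Passing `sleep` in as a parameter is just a convenience so the loop can be exercised without real delays; a real worker would also want some form of backoff or logging.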

One of the next major features for Fuzed is removing the master as a single point of failure. We're currently exploring two options:

1. When a master dies, the cluster can elect a new master on one of the machines in the cloud, and everyone re-registers their assets. This approach assumes that master death is relatively rare, so it minimizes the resources needed for redundant operation at the cost of a brief gap in service while the new master is elected and resources are re-registered with it.

2. Clusters can run several masters that all maintain identical state. When one dies, another master steps forward and becomes the primary. This approach requires more hardware in the cloud, but even if master faults are common, they never cause a gap in service.
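To make the trade-off between the two options concrete, here is a rough sketch under heavy assumptions. Neither piece reflects a design Fuzed has committed to: the "smallest name wins" election rule, the `ReplicatedMasters` class, and the node names are all invented for illustration.

```python
from typing import Dict, List, Set


def elect_master(live_nodes: List[str]) -> str:
    """Option 1: the survivors agree on a new master by a deterministic rule.

    Picking the smallest node name is an illustrative rule only. The new
    master starts with an empty registry, so workers must re-register.
    """
    if not live_nodes:
        raise ValueError("no live nodes to elect from")
    return min(live_nodes)


class ReplicatedMasters:
    """Option 2: every master holds an identical registry; the first is primary."""

    def __init__(self, names: List[str]) -> None:
        if not names:
            raise ValueError("need at least one master")
        self.order: List[str] = list(names)
        self.registry: Dict[str, Set[str]] = {name: set() for name in names}

    @property
    def primary(self) -> str:
        return self.order[0]

    def register(self, resource: str) -> None:
        # Every replica records the resource, so any of them can take over.
        for replica in self.registry.values():
            replica.add(resource)

    def fail(self, name: str) -> None:
        # Drop the dead master; the next one in line becomes primary at once.
        self.order.remove(name)
        del self.registry[name]
        if not self.order:
            raise RuntimeError("all masters are down")

    def resources(self) -> Set[str]:
        return set(self.registry[self.primary])


if __name__ == "__main__":
    print(elect_master(["node_c", "node_b"]))  # new master, empty registry

    masters = ReplicatedMasters(["m1", "m2"])
    masters.register("pool_a")
    masters.fail("m1")
    print(masters.primary, masters.resources())  # standby already has state
```

In the first option the cost shows up after the failure (an election plus re-registration); in the second it is paid up front, on every registration and in extra machines.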

As for the difference between process death and machine death: yes, there are differences. If you're interested in how we handle them, please check out master_beater.erl (great name, huh?), which is a gen_fsm that worker nodes use to eagerly reconnect to the master. Also check out fuzed.app and fuzed_supervisor.erl. Erlang provides very good tools for handling this.