
Very quick rollout is crucial for this kind of service. On top of what you wrote, making rollback the default response when something catastrophically breaks should be institutionalized.
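To make that concrete, here is a minimal sketch of what "rollback by default" could look like at the deploy step. The names (apply, health_ok, the bake window) are hypothetical, not Cloudflare's actual pipeline: the previous artifact stays around, and reverting is automatic the moment the post-deploy health probe goes red, with no human in the loop.

    use std::{thread, time::Duration};

    /// A versioned artifact being pushed out, e.g. a generated config file.
    #[derive(Clone)]
    struct Artifact {
        version: u64,
        body: Vec<u8>,
    }

    /// Stand-in for distributing the artifact to the fleet.
    fn apply(artifact: &Artifact) -> Result<(), String> {
        println!("applying v{} ({} bytes)", artifact.version, artifact.body.len());
        Ok(())
    }

    /// Stand-in for real health signals: error rate, 5xx ratio, crash loops, ...
    fn health_ok() -> bool {
        true
    }

    /// Deploy with rollback as the default failure mode: unless the health
    /// probe stays green for the whole bake window, revert to last known good.
    fn deploy(new: Artifact, last_known_good: Artifact) -> Artifact {
        if apply(&new).is_err() {
            let _ = apply(&last_known_good);
            return last_known_good;
        }
        for _ in 0..30 {
            if !health_ok() {
                eprintln!("health probe failed, rolling back to v{}", last_known_good.version);
                let _ = apply(&last_known_good);
                return last_known_good;
            }
            thread::sleep(Duration::from_secs(1)); // bake window, shortened for the sketch
        }
        new // promoted only after the bake window passed cleanly
    }

    fn main() {
        let good = Artifact { version: 41, body: b"old".to_vec() };
        let candidate = Artifact { version: 42, body: b"new".to_vec() };
        let live = deploy(candidate, good);
        println!("live version: {}", live.version);
    }

The important property is that promotion is what has to be earned; rollback is what happens when nobody intervenes.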

Been there on those calls, begging the people in charge (who perhaps shouldn't have been): "eh, maybe we should attempt a rollback to the last known good state? Because it, you know... worked." But investigating further before making any change always seems to be their preferred course of action. Can't be faulted for being cautious and doing things properly, right? I kid you not - this is their instinct.

If I recall correctly, it took CF 2 hours to roll back the broken changes.

So if I were in charge of Cloudflare (4-5k employees), I'd look at both the processes and the people in charge.

It does seem insane to me that there isn't a process to catch the panic, unwind back to a reasonable place in the call stack, load the last known good configuration, and continue execution as normal. You would go from a global 2-hour outage to a warning on a dashboard that can be investigated in a timely manner, rather than blowing up half the internet.
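In Rust terms (the proxy that panicked is written in Rust, as I understand it), that containment could look roughly like the sketch below. Config, parse_config and reload are hypothetical stand-ins for illustration, not Cloudflare's actual code: wrap the config-load path in std::panic::catch_unwind, and if it blows up, keep serving on the previous good configuration and raise a warning instead of taking the process down.

    use std::panic::{self, AssertUnwindSafe};

    /// Hypothetical stand-in for whatever the proxy derives from a pushed file.
    #[derive(Clone, Debug)]
    struct Config {
        features: Vec<String>,
    }

    /// Stand-in for the parsing/validation step that can panic,
    /// e.g. an unwrap() deep inside the loader.
    fn parse_config(raw: &str) -> Config {
        if raw.contains("too_many_features") {
            panic!("feature count exceeded hard limit");
        }
        Config { features: raw.split(',').map(|s| s.to_string()).collect() }
    }

    /// Try to adopt a new config; on panic, keep the last known good one and
    /// surface a warning instead of taking the whole edge down with it.
    fn reload(raw: &str, last_known_good: &Config) -> Config {
        match panic::catch_unwind(AssertUnwindSafe(|| parse_config(raw))) {
            Ok(new_config) => new_config,
            Err(_) => {
                // In a real system: increment a metric / page someone.
                eprintln!("WARN: config reload panicked; keeping last known good");
                last_known_good.clone()
            }
        }
    }

    fn main() {
        let good = Config { features: vec!["bot_score".into()] };

        // A broken push degrades to a warning instead of an outage.
        let still_good = reload("too_many_features", &good);
        println!("serving with: {:?}", still_good.features);

        // A healthy push is adopted normally.
        let updated = reload("bot_score,new_rule", &good);
        println!("serving with: {:?}", updated.features);
    }

The default panic hook still prints the panic message even when it is caught, which is exactly the kind of signal you'd route to that dashboard.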


