Tuesday, April 17 • 4:30pm - 4:55pm
VM State Management (take 2)

At the Essex design summit I presented an idea to limit the operations on an Instance based on the combination of vm_state and task_state. We tried to make the model quite restrictive to eliminate as many race conditions as possible (for example, rebuilds occurring in the middle of a rebuild). However, having run with this model for some time, we have found it necessary to allow more operations than we at first envisaged, to cope with Instances that get “stuck” in a particular state (for example, if a service is restarted whilst a rebuild is in progress, the Instance can get stuck in “Rebuilding-Spawning”). Only allowing delete in this case (the one operation which always leads to a deterministic end state no matter what it’s paired with) isn’t a great user experience, but allowing other operations (such as a further rebuild) creates more potential race conditions that have to be handled in the code.

I therefore propose moving to a model which can block operations on a combination of vm_state, task_state, and updated_at, leading to three possible rules:

i) Operations which are always allowed
ii) Operations which are always blocked
iii) Operations which are allowed only if now() - instance[updated_at] is greater than some threshold (which can be different for each operation)

So, for example, for an Instance which has the state Active-None:
Start, Unpause, Resume, and Unrescue would always be blocked
Other operations would always be allowed

Whereas for an Instance which has the state Active-Rebooting:
Start, Unpause, Resume, and Unrescue would always be blocked
Terminate, Reboot, and Rebuild would only be allowed if now() - instance[updated_at] is greater than 3 minutes

I believe this consideration of “how long since the last time the state changed” comes very close to the semantics of limiting each Instance to only one concurrent operation, whilst still being safe against failed operations and without having to make major changes to the overall concurrency model.
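The three rule types and the two example states could be sketched roughly as follows. This is an illustration only, not existing Nova code: the names RULES, ALLOW, BLOCK and is_allowed, the lower-case state and operation strings, and the dict-shaped instance are all hypothetical. The abstract does not say what happens to operations it leaves unlisted for Active-Rebooting, so the sketch assumes they are blocked.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical markers for rule types (i) and (ii). Rule type (iii) is
# represented by a timedelta threshold instead of a marker.
ALLOW = "allow"
BLOCK = "block"

# Hypothetical rule table keyed by (vm_state, task_state). Each entry maps an
# operation name to ALLOW, BLOCK, or a timedelta; "default" covers the rest.
RULES = {
    ("active", None): {
        "start": BLOCK, "unpause": BLOCK, "resume": BLOCK, "unrescue": BLOCK,
        "default": ALLOW,
    },
    ("active", "rebooting"): {
        "start": BLOCK, "unpause": BLOCK, "resume": BLOCK, "unrescue": BLOCK,
        "terminate": timedelta(minutes=3),
        "reboot": timedelta(minutes=3),
        "rebuild": timedelta(minutes=3),
        # Assumption: the abstract does not cover other operations here.
        "default": BLOCK,
    },
}


def is_allowed(instance, operation, now=None):
    """Return True if `operation` may start on `instance` at time `now`.

    States missing from RULES raise KeyError; this sketch only covers
    the two example states from the abstract.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    state_rules = RULES[(instance["vm_state"], instance["task_state"])]
    rule = state_rules.get(operation, state_rules["default"])
    if rule == ALLOW:
        return True
    if rule == BLOCK:
        return False
    # Rule type (iii): allowed only once the state has been unchanged for
    # strictly longer than the per-operation threshold.
    return now - instance["updated_at"] > rule
```

With this table, a reboot requested two minutes after an Instance entered Active-Rebooting would be refused, while the same request four minutes in would be accepted on the assumption that the original reboot has failed or stalled.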
I would also like to explore with the community the degree to which these rules should be configurable vs hard-coded (for example, some might see a need to configure the rules so that operations with a billing impact, such as delete, are always allowed). (Session lead is Phil Day)

Tuesday April 17, 2012 4:30pm - 4:55pm PDT
Seacliff AB
