)]}' {"/PATCHSET_LEVEL":[{"author":{"_account_id":7166,"name":"Sylvain Bauza","email":"sbauza@redhat.com","username":"sbauza"},"change_message_id":"efe28416027d2ce23267ac665ee0d425284eabbd","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":2,"id":"89cbbe9e_ea0c57fa","updated":"2025-10-30 11:49:19.000000000","message":"-1 given the comments, but for sure it would be a nice feature. Thanks for wanting to work on it!","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":26250,"name":"Johannes Kulik","email":"johannes.kulik@sap.com","username":"jkulik"},"change_message_id":"66dcacf3825bbf4540a006c0907985181ae7f087","unresolved":true,"context_lines":[],"source_content_type":"","patch_set":2,"id":"7bd98e29_80d73291","updated":"2025-10-30 13:03:38.000000000","message":"Maybe worth noting that there was an attempt to solve this via the threading approach some years ago https://review.opendev.org/c/openstack/nova/+/666245 - and we still run that in production.\n\nIn general, I feel like this spec sends somewhat mixed messages in only talking about currently-building instances and rpc-requests in general. It even has a list of other events, but from how it\u0027s written only wants to achieve finishing instance builds. Could be improved, but no major problem for me.\n\nConsidering some tasks can take quite long and thus shutdown will, too, instances on the nova-compute cannot be managed. In our experience, k8s isn\u0027t too happy if volume-attachments take long - it will retry, but it will also throw errors. 
Would it make sense to allow a new nova-compute service to start up in parallel, using shared locks to prohibit them working on the same instances?","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"}],"specs/2026.1/approved/graceful_nova_compute_shutdown.rst":[{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"1cfd38493b1823fca038c4123bb3ce2cda2bcf4a","unresolved":true,"context_lines":[{"line_number":5,"context_line":"https://blueprints.launchpad.net/nova/+spec/graceful-nova-compute-shutdown"},{"line_number":6,"context_line":""},{"line_number":7,"context_line":"This specification proposes implementing graceful shutdown behavior for the"},{"line_number":8,"context_line":"nova-compute service to ensure that in-queue RPC operations can be completed"},{"line_number":9,"context_line":"before the service terminates."},{"line_number":10,"context_line":""},{"line_number":11,"context_line":"Problem description"}],"source_content_type":"text/x-rst","patch_set":2,"id":"46ee9501_1a792dd2","line":8,"range":{"start_line":8,"start_character":36,"end_line":8,"end_character":59},"updated":"2025-10-30 11:52:56.000000000","message":"in flight maybe. 
Requests that are queued up in the RPC bus are OK to be kept there until the nova-compute service is started up again.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"1cfd38493b1823fca038c4123bb3ce2cda2bcf4a","unresolved":true,"context_lines":[{"line_number":26,"context_line":""},{"line_number":27,"context_line":"* I want to safely shut down nova-compute services during maintenance"},{"line_number":28,"context_line":" without interrupting ongoing instance operations."},{"line_number":29,"context_line":"* I want in-progress instance builds to complete gracefully before the"},{"line_number":30,"context_line":" service stops."},{"line_number":31,"context_line":"* I want to avoid manual cleanup or recovery of failed instance builds."},{"line_number":32,"context_line":"* I need a reliable mechanism to handle large-scale maintenance where"}],"source_content_type":"text/x-rst","patch_set":2,"id":"398d91e5_5c185865","line":29,"range":{"start_line":29,"start_character":2,"end_line":29,"end_character":36},"updated":"2025-10-30 11:52:56.000000000","message":"this is fairly specific. Eventually I want all in-progress VM lifecycle operations to be completed. But if you want to limit the scope of this spec, that is fine with me as well. 
In this case I suggest to state that we are intentionally limiting this spec to make in-progress instance builds completed and the rest of the lifecycle operations will be handled by a later spec.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"ba3cb81ed271cd40ebb714c9305f54765189948f","unresolved":true,"context_lines":[{"line_number":26,"context_line":""},{"line_number":27,"context_line":"* I want to safely shut down nova-compute services during maintenance"},{"line_number":28,"context_line":" without interrupting ongoing instance operations."},{"line_number":29,"context_line":"* I want in-progress instance builds to complete gracefully before the"},{"line_number":30,"context_line":" service stops."},{"line_number":31,"context_line":"* I want to avoid manual cleanup or recovery of failed instance builds."},{"line_number":32,"context_line":"* I need a reliable mechanism to handle large-scale maintenance where"}],"source_content_type":"text/x-rst","patch_set":2,"id":"b41605c5_a9c0aec8","line":29,"range":{"start_line":29,"start_character":2,"end_line":29,"end_character":36},"in_reply_to":"398d91e5_5c185865","updated":"2025-10-30 15:41:17.000000000","message":"Yeah, there are many other operation on instance can be in-progress, stopping them (not handling in graceful shutdown) is same issue as stopping in-progress boot.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":26250,"name":"Johannes Kulik","email":"johannes.kulik@sap.com","username":"jkulik"},"change_message_id":"66dcacf3825bbf4540a006c0907985181ae7f087","unresolved":true,"context_lines":[{"line_number":31,"context_line":"* I want to avoid manual cleanup or recovery of failed instance builds."},{"line_number":32,"context_line":"* I need a reliable mechanism to handle large-scale maintenance 
where"},{"line_number":33,"context_line":" pre-disabling each compute node individually (e.g., across thousands"},{"line_number":34,"context_line":" of hypervisors) is impractical or time-consuming."},{"line_number":35,"context_line":""},{"line_number":36,"context_line":"As a user:"},{"line_number":37,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"2c478e4d_cc5b89ec","line":34,"updated":"2025-10-30 13:03:38.000000000","message":"pre-disabling is also only a solution to prevent new builds from happening. any other lifecycle operation can still happen and leave inconsistencies around","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":7166,"name":"Sylvain Bauza","email":"sbauza@redhat.com","username":"sbauza"},"change_message_id":"efe28416027d2ce23267ac665ee0d425284eabbd","unresolved":true,"context_lines":[{"line_number":48,"context_line":" RPC operations:"},{"line_number":49,"context_line":""},{"line_number":50,"context_line":" * Increment a counter whenever an RPC method (either CAST or CALL) is"},{"line_number":51,"context_line":" invoked within the compute service."},{"line_number":52,"context_line":" * Decrement the counter when the corresponding RPC method completes"},{"line_number":53,"context_line":" execution."},{"line_number":54,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"34c0da54_c5251291","line":51,"updated":"2025-10-30 11:49:19.000000000","message":"given a compute service can have multiple workers, I guess the counter would be a semaphore ?","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"ba3cb81ed271cd40ebb714c9305f54765189948f","unresolved":true,"context_lines":[{"line_number":48,"context_line":" RPC operations:"},{"line_number":49,"context_line":""},{"line_number":50,"context_line":" * Increment a 
counter whenever an RPC method (either CAST or CALL) is"},{"line_number":51,"context_line":" invoked within the compute service."},{"line_number":52,"context_line":" * Decrement the counter when the corresponding RPC method completes"},{"line_number":53,"context_line":" execution."},{"line_number":54,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"3f885a71_1bd55ee3","line":51,"in_reply_to":"16ad8aea_7068b40a","updated":"2025-10-30 15:41:17.000000000","message":"true, but at the same time, without some semaphore or lock, the increment/decrement and the wait on the counter can race, which can lead to a false counter value when deciding to stop compute.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"4f0959ae13eedab985ad489b3e725b947113bb77","unresolved":true,"context_lines":[{"line_number":48,"context_line":" RPC operations:"},{"line_number":49,"context_line":""},{"line_number":50,"context_line":" * Increment a counter whenever an RPC method (either CAST or CALL) is"},{"line_number":51,"context_line":" invoked within the compute service."},{"line_number":52,"context_line":" * Decrement the counter when the corresponding RPC method completes"},{"line_number":53,"context_line":" execution."},{"line_number":54,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"16ad8aea_7068b40a","line":51,"in_reply_to":"34c0da54_c5251291","updated":"2025-10-30 13:49:54.000000000","message":"\u003e given a compute service can have multiple workers\n\nI\u0027m not sure what you mean here... nova-compute is single-process, unlike conductor or scheduler right? 
A semaphore would also have the side-effect of limiting the number of in-progress operations (the size) which may or may not be desirable.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"0ef0ce6e7fe9f75640cd7cc53ab4ebeb1f380389","unresolved":true,"context_lines":[{"line_number":48,"context_line":" RPC operations:"},{"line_number":49,"context_line":""},{"line_number":50,"context_line":" * Increment a counter whenever an RPC method (either CAST or CALL) is"},{"line_number":51,"context_line":" invoked within the compute service."},{"line_number":52,"context_line":" * Decrement the counter when the corresponding RPC method completes"},{"line_number":53,"context_line":" execution."},{"line_number":54,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"dae09de1_e4d29872","line":51,"in_reply_to":"3f885a71_1bd55ee3","updated":"2025-10-31 08:54:24.000000000","message":"I think Sylvain refers to multiple RPC request handler threads / greenlets within a single worker process.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":7166,"name":"Sylvain Bauza","email":"sbauza@redhat.com","username":"sbauza"},"change_message_id":"efe28416027d2ce23267ac665ee0d425284eabbd","unresolved":true,"context_lines":[{"line_number":50,"context_line":" * Increment a counter whenever an RPC method (either CAST or CALL) is"},{"line_number":51,"context_line":" invoked within the compute service."},{"line_number":52,"context_line":" * Decrement the counter when the corresponding RPC method completes"},{"line_number":53,"context_line":" execution."},{"line_number":54,"context_line":""},{"line_number":55,"context_line":"2. 
Enhance the compute service stop procedure to handle graceful"},{"line_number":56,"context_line":" shutdown:"}],"source_content_type":"text/x-rst","patch_set":2,"id":"365bd598_e90da9d7","line":53,"updated":"2025-10-30 11:49:19.000000000","message":"how can we ensure we don\u0027t race in between decrementing the counter ?","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"4f0959ae13eedab985ad489b3e725b947113bb77","unresolved":true,"context_lines":[{"line_number":50,"context_line":" * Increment a counter whenever an RPC method (either CAST or CALL) is"},{"line_number":51,"context_line":" invoked within the compute service."},{"line_number":52,"context_line":" * Decrement the counter when the corresponding RPC method completes"},{"line_number":53,"context_line":" execution."},{"line_number":54,"context_line":""},{"line_number":55,"context_line":"2. Enhance the compute service stop procedure to handle graceful"},{"line_number":56,"context_line":" shutdown:"}],"source_content_type":"text/x-rst","patch_set":2,"id":"996dc98a_5438b11f","line":53,"in_reply_to":"365bd598_e90da9d7","updated":"2025-10-30 13:49:54.000000000","message":"You mean post-GIL removal I assume? 
A lock around increment/decrement functions should be fine.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"1cfd38493b1823fca038c4123bb3ce2cda2bcf4a","unresolved":true,"context_lines":[{"line_number":56,"context_line":" shutdown:"},{"line_number":57,"context_line":""},{"line_number":58,"context_line":" * Upon receiving SIGTERM, the RPC server is stopped to prevent new RPCs"},{"line_number":59,"context_line":" from reaching the shutting-down compute service."},{"line_number":60,"context_line":" * Wait for all in-progress RPCs to complete, either until the counter"},{"line_number":61,"context_line":" reaches zero or the configured timeout has occured."},{"line_number":62,"context_line":" * Once waiting is complete, continue with the normal shutdown sequence"}],"source_content_type":"text/x-rst","patch_set":2,"id":"394299e2_cc78c207","line":59,"updated":"2025-10-30 11:52:56.000000000","message":"This needs to be selective. E.g.\n* during instance creation there are multiple RPC calls back to the conductor to update the DB state of the instance. Those DB related RPC calls need to succeed while new instance creations should not be picked up from the RPC bus.\n\n* during instance creation nova binds the neutron ports and waits for external events from neutron to signal that the port (vif) is plugged successfully. This external event arrives via the RPC bus to the compute. This RPC cast needs to be received by the compute, otherwise the instance creation will fail.\n\n* an in-progress live migration consists of multiple RPC requests from the conductor and between the source and destination compute nodes. 
https://docs.openstack.org/neutron/latest/contributor/internals/live_migration.html#ovs-normal-plug-linux-bridge-macvtap-sr-iov","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"4f0959ae13eedab985ad489b3e725b947113bb77","unresolved":true,"context_lines":[{"line_number":56,"context_line":" shutdown:"},{"line_number":57,"context_line":""},{"line_number":58,"context_line":" * Upon receiving SIGTERM, the RPC server is stopped to prevent new RPCs"},{"line_number":59,"context_line":" from reaching the shutting-down compute service."},{"line_number":60,"context_line":" * Wait for all in-progress RPCs to complete, either until the counter"},{"line_number":61,"context_line":" reaches zero or the configured timeout has occured."},{"line_number":62,"context_line":" * Once waiting is complete, continue with the normal shutdown sequence"}],"source_content_type":"text/x-rst","patch_set":2,"id":"4ea7a390_42cdccdf","line":59,"in_reply_to":"394299e2_cc78c207","updated":"2025-10-30 13:49:54.000000000","message":"Yup, this has always been the reason it\u0027s \"not that easy\". We would need to do something like making the increment-counter function re-queue the message or something like that, which has other problems of course.\n\nThe other option would be to have a global flag for \"trying to shut down\" and then have the last decrement function that hits zero actually terminate the RPC connection. 
Sort of a \"shut down when idle\" situation where we aren\u0027t deterministic about the termination, but the second we have a gap with nothing in process, we shut down.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"ba3cb81ed271cd40ebb714c9305f54765189948f","unresolved":true,"context_lines":[{"line_number":56,"context_line":" shutdown:"},{"line_number":57,"context_line":""},{"line_number":58,"context_line":" * Upon receiving SIGTERM, the RPC server is stopped to prevent new RPCs"},{"line_number":59,"context_line":" from reaching the shutting-down compute service."},{"line_number":60,"context_line":" * Wait for all in-progress RPCs to complete, either until the counter"},{"line_number":61,"context_line":" reaches zero or the configured timeout has occured."},{"line_number":62,"context_line":" * Once waiting is complete, continue with the normal shutdown sequence"}],"source_content_type":"text/x-rst","patch_set":2,"id":"c81ffc8c_62155483","line":59,"in_reply_to":"4ea7a390_42cdccdf","updated":"2025-10-30 15:41:17.000000000","message":"Yeah, we need to make our topics more granular and pick very carefully which ones to stop and which ones to continue for in-progress tasks. For example:\n 1. Keep the reply queue up for responses. This is important so that the conductor can keep responding to compute. This is something we would need to design and implement as new functionality in oslo.messaging. I am thinking that we can do that based on exchange type (direct exchange topics to not shut down?)\n \n 2. No message is dropped from the normal queue. Currently we only have a single topic to do everything (per worker, like compute), and stopping it will drop many messages that arrive as new to that queue but are part of in-progress operations. \n \n 3. 
It might be easy to track operation completion at the sender level (say, the API server when it is rpc.call(), or based on the reply queue), but for cast it is not easy to know whether the in-progress operation as a whole is completed or not.\n \n\"shut down when idle\" is good but we cannot be so optimistic about this situation; there are chances (high or medium, depending on usage of the cloud) that we keep having incoming requests. That way, most of the time we may be doing a shutdown (based on a timeout or so) instead of a graceful shutdown.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"0ef0ce6e7fe9f75640cd7cc53ab4ebeb1f380389","unresolved":true,"context_lines":[{"line_number":56,"context_line":" shutdown:"},{"line_number":57,"context_line":""},{"line_number":58,"context_line":" * Upon receiving SIGTERM, the RPC server is stopped to prevent new RPCs"},{"line_number":59,"context_line":" from reaching the shutting-down compute service."},{"line_number":60,"context_line":" * Wait for all in-progress RPCs to complete, either until the counter"},{"line_number":61,"context_line":" reaches zero or the configured timeout has occured."},{"line_number":62,"context_line":" * Once waiting is complete, continue with the normal shutdown sequence"}],"source_content_type":"text/x-rst","patch_set":2,"id":"8078a4e8_39939d95","line":59,"in_reply_to":"c81ffc8c_62155483","updated":"2025-10-31 08:54:24.000000000","message":"@dms@danplanet.com By \"global\" flag you mean an admin settable flag that signals to nova conductor / API to stop sending new lifecycle operations to a given compute? Like the openstack compute service set --disable but stricter? I think that could be combined with a compute level counter approach to make it safe. 
Still knowing which RPC nova can send to the compute and which should be rejected to the caller needs proper thinking to handle external events.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":7166,"name":"Sylvain Bauza","email":"sbauza@redhat.com","username":"sbauza"},"change_message_id":"efe28416027d2ce23267ac665ee0d425284eabbd","unresolved":true,"context_lines":[{"line_number":58,"context_line":" * Upon receiving SIGTERM, the RPC server is stopped to prevent new RPCs"},{"line_number":59,"context_line":" from reaching the shutting-down compute service."},{"line_number":60,"context_line":" * Wait for all in-progress RPCs to complete, either until the counter"},{"line_number":61,"context_line":" reaches zero or the configured timeout has occured."},{"line_number":62,"context_line":" * Once waiting is complete, continue with the normal shutdown sequence"},{"line_number":63,"context_line":" and resource cleanup."},{"line_number":64,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"5d132a3b_cf31ad4f","line":61,"range":{"start_line":61,"start_character":17,"end_line":61,"end_character":56},"updated":"2025-10-30 11:49:19.000000000","message":"note : there could be an impact with the current eventlet removal efforts, please ensure you use our futures interface for the timer.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"ba3cb81ed271cd40ebb714c9305f54765189948f","unresolved":true,"context_lines":[{"line_number":85,"context_line":" * - Time intensive case"},{"line_number":86,"context_line":" - take snapshot"},{"line_number":87,"context_line":" - CAST snapshot_instance"},{"line_number":88,"context_line":" - long wait for large upload, can be solved by timeout setting"},{"line_number":89,"context_line":" * - Cross-compute 
communication"},{"line_number":90,"context_line":" - migration"},{"line_number":91,"context_line":" - NA"}],"source_content_type":"text/x-rst","patch_set":2,"id":"14ccdb2c_8032e29f","line":88,"range":{"start_line":88,"start_character":7,"end_line":88,"end_character":67},"updated":"2025-10-30 15:41:17.000000000","message":"but that is the point that we want all long running tasks to finish before shutdown.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"4f0959ae13eedab985ad489b3e725b947113bb77","unresolved":true,"context_lines":[{"line_number":91,"context_line":" - NA"},{"line_number":92,"context_line":" - varies case by case"},{"line_number":93,"context_line":""},{"line_number":94,"context_line":"Note on cross compute communication:"},{"line_number":95,"context_line":""},{"line_number":96,"context_line":"* Source compute - check current RPC, if last RPC is to source compute,"},{"line_number":97,"context_line":" it can be solved with counter approach"}],"source_content_type":"text/x-rst","patch_set":2,"id":"f257b446_6c616cc9","line":94,"updated":"2025-10-30 13:49:54.000000000","message":"Note that conductor is also in the mix here. 
Even a single-compute operation will make many trips to and from conductor to do its work, which means we need to handle all of them pretty much in the same way.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":7166,"name":"Sylvain Bauza","email":"sbauza@redhat.com","username":"sbauza"},"change_message_id":"efe28416027d2ce23267ac665ee0d425284eabbd","unresolved":true,"context_lines":[{"line_number":96,"context_line":"* Source compute - check current RPC, if last RPC is to source compute,"},{"line_number":97,"context_line":" it can be solved with counter approach"},{"line_number":98,"context_line":"* Destination compute - edge case when operator_1 chooses compute_B as"},{"line_number":99,"context_line":" destination and operator_2 disables compute_B (very short time window)"},{"line_number":100,"context_line":""},{"line_number":101,"context_line":"Alternatives"},{"line_number":102,"context_line":"------------"}],"source_content_type":"text/x-rst","patch_set":2,"id":"c2357902_d9912ebf","line":99,"updated":"2025-10-30 11:49:19.000000000","message":"I\u0027m afraid that for live-migrations, we should abort and revert it if we have a SIGTERM arriving so it should be a different decorator.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"0ef0ce6e7fe9f75640cd7cc53ab4ebeb1f380389","unresolved":true,"context_lines":[{"line_number":96,"context_line":"* Source compute - check current RPC, if last RPC is to source compute,"},{"line_number":97,"context_line":" it can be solved with counter approach"},{"line_number":98,"context_line":"* Destination compute - edge case when operator_1 chooses compute_B as"},{"line_number":99,"context_line":" destination and operator_2 disables compute_B (very short time 
window)"},{"line_number":100,"context_line":""},{"line_number":101,"context_line":"Alternatives"},{"line_number":102,"context_line":"------------"}],"source_content_type":"text/x-rst","patch_set":2,"id":"d6785909_66e4e2ee","line":99,"in_reply_to":"c2357902_d9912ebf","updated":"2025-10-31 08:54:24.000000000","message":"post copy live migration cannot be safely aborted without losing the running guest as the VM already runs on the target host during post copy without a way back due to dirtied memory on the target host.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"ba3cb81ed271cd40ebb714c9305f54765189948f","unresolved":true,"context_lines":[{"line_number":116,"context_line":" Disadvantages:"},{"line_number":117,"context_line":" * Requires context propagation between API and compute services."},{"line_number":118,"context_line":" * Complex to correlate RPC calls, as different API operations may"},{"line_number":119,"context_line":" trigger the same RPCs in varying sequences."},{"line_number":120,"context_line":""},{"line_number":121,"context_line":"2. Instance task state tracking:"},{"line_number":122,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"8e862461_bb36a56d","line":119,"range":{"start_line":119,"start_character":48,"end_line":119,"end_character":50},"updated":"2025-10-30 15:41:17.000000000","message":"other issue in that when API doing the rpc.cast() where we do not have response from server and API would not know when they are going to be picked form queue and completed. 
Also, operation can involve multiple queue for compute, conductor so basically request is scattered in multiple rpc calls and that is hard to track at API layer.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"ba3cb81ed271cd40ebb714c9305f54765189948f","unresolved":true,"context_lines":[{"line_number":134,"context_line":" Disadvantages:"},{"line_number":135,"context_line":" * Less granularity as specific operation details are lost."},{"line_number":136,"context_line":" * Risk of false positives if task_state is temporarily"},{"line_number":137,"context_line":" inconsistent."},{"line_number":138,"context_line":""},{"line_number":139,"context_line":"3. Threading based approach:"},{"line_number":140,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"178bb5c0_1e0b48b1","line":137,"range":{"start_line":137,"start_character":19,"end_line":137,"end_character":20},"updated":"2025-10-30 15:41:17.000000000","message":"I also thought about this idea (commented in another spec https://review.opendev.org/c/openstack/nova-specs/+/937185/comment/e260e4b3_fef88c0c/). It solve the purpose of tracking the in-progress tasks but does not handle to stop accepting the new requests. I thought of marking compute service with flag where we can stop sending the request to that host but that is something we need to discuss if that will cover:\n 1. stop requesting new request (for example, scheduler to not pick that host + more place from where request directly comes to compute)\n 2. 
it will not impact anything to complete the in-progress request.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":7166,"name":"Sylvain Bauza","email":"sbauza@redhat.com","username":"sbauza"},"change_message_id":"efe28416027d2ce23267ac665ee0d425284eabbd","unresolved":true,"context_lines":[{"line_number":134,"context_line":" Disadvantages:"},{"line_number":135,"context_line":" * Less granularity as specific operation details are lost."},{"line_number":136,"context_line":" * Risk of false positives if task_state is temporarily"},{"line_number":137,"context_line":" inconsistent."},{"line_number":138,"context_line":""},{"line_number":139,"context_line":"3. Threading based approach:"},{"line_number":140,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"5b141a82_168d64ed","line":137,"updated":"2025-10-30 11:49:19.000000000","message":"I\u0027m assuming we could maybe still check the task states","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"ba3cb81ed271cd40ebb714c9305f54765189948f","unresolved":true,"context_lines":[{"line_number":136,"context_line":" * Risk of false positives if task_state is temporarily"},{"line_number":137,"context_line":" inconsistent."},{"line_number":138,"context_line":""},{"line_number":139,"context_line":"3. 
Threading based approach:"},{"line_number":140,"context_line":""},{"line_number":141,"context_line":" * Use a dedicated eventlet.GreenPool for instance spawns."},{"line_number":142,"context_line":" * Manage concurrency via max_concurrent_builds configuration."}],"source_content_type":"text/x-rst","patch_set":2,"id":"11dddc82_ae547872","line":139,"range":{"start_line":139,"start_character":1,"end_line":139,"end_character":28},"updated":"2025-10-30 15:41:17.000000000","message":"I think this is the same approach which another spec proposed - https://review.opendev.org/c/openstack/nova-specs/+/937185/6/specs/2025.1/approved/graceful-compute-shutdown.rst","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"4f0959ae13eedab985ad489b3e725b947113bb77","unresolved":true,"context_lines":[{"line_number":147,"context_line":" Disadvantages:"},{"line_number":148,"context_line":" * Adds complexity in thread management and tracking."},{"line_number":149,"context_line":" * Requires careful synchronization during shutdown handling."},{"line_number":150,"context_line":""},{"line_number":151,"context_line":"Data model impact"},{"line_number":152,"context_line":"-----------------"},{"line_number":153,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"04e33018_3afdff2e","line":150,"updated":"2025-10-30 13:49:54.000000000","message":"To me, the best idea has always been to make oslo.messaging have some way to shut down the main topic (i.e. `compute.$hostname`) and keep the rest open for outbound RPCs and replies. We could then establish a new RPC topic for inter-compute communication that we leave open in that scenario, which live migration and resize would actually use (something like `compute-internal.$hostname`). 
Doing something like this would let new requests stack up in rabbit while replies and inter-compute communications could continue until the operations (using a counter like you describe to track) are completed. I just have a hard time understanding how it will work reliably and deterministically otherwise.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"ba3cb81ed271cd40ebb714c9305f54765189948f","unresolved":true,"context_lines":[{"line_number":147,"context_line":" Disadvantages:"},{"line_number":148,"context_line":" * Adds complexity in thread management and tracking."},{"line_number":149,"context_line":" * Requires careful synchronization during shutdown handling."},{"line_number":150,"context_line":""},{"line_number":151,"context_line":"Data model impact"},{"line_number":152,"context_line":"-----------------"},{"line_number":153,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"ecedb9fe_9e44141f","line":150,"in_reply_to":"04e33018_3afdff2e","updated":"2025-10-30 15:41:17.000000000","message":"Yeah, I agree that the best way to handle it is at the oslo.messaging level, where we can separate the topics more granularly. Currently the AMQP driver listens on two subscriptions[1]: 1. \u0027exchange.topic\u0027 and 2. \u0027exchange.topic.server\u0027. I am checking whether we can add more subscriptions based on different topics or some other info/field (we can add a new field if needed) in the \u0027Target\u0027 object.\n\n[1] \nhttps://github.com/openstack/oslo.messaging/blob/88ec0a4446ce1c2514cd5f8058852d9d5c4e40dc/oslo_messaging/_drivers/amqpdriver.py#L813-L818\n\nI am still doing my research on this approach but have not finished yet. 
But I agree that splitting the topics/listening on the oslo.messaging side will be a better approach.","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"4f0959ae13eedab985ad489b3e725b947113bb77","unresolved":true,"context_lines":[{"line_number":242,"context_line":"Testing"},{"line_number":243,"context_line":"\u003d\u003d\u003d\u003d\u003d\u003d\u003d"},{"line_number":244,"context_line":""},{"line_number":245,"context_line":"* Use existing tests to test the new feature."},{"line_number":246,"context_line":"* Add a test for each of the scenarios, example:"},{"line_number":247,"context_line":" 1. Creates an instance and shuts down the compute"},{"line_number":248,"context_line":" 2. Snapshots instance and shuts down the compute"}],"source_content_type":"text/x-rst","patch_set":2,"id":"a6513017_fa022d3f","line":245,"updated":"2025-10-30 13:49:54.000000000","message":"What does this mean?","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"4f0959ae13eedab985ad489b3e725b947113bb77","unresolved":true,"context_lines":[{"line_number":245,"context_line":"* Use existing tests to test the new feature."},{"line_number":246,"context_line":"* Add a test for each of the scenarios, example:"},{"line_number":247,"context_line":" 1. Creates an instance and shuts down the compute"},{"line_number":248,"context_line":" 2. 
Snapshots instance and shuts down the compute"},{"line_number":249,"context_line":""},{"line_number":250,"context_line":"Documentation Impact"},{"line_number":251,"context_line":"\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d"}],"source_content_type":"text/x-rst","patch_set":2,"id":"d49f4e18_1ce380de","line":248,"updated":"2025-10-30 13:49:54.000000000","message":"Because of the inter-compute communication and issues with shutting down RPC entirely, I\u0027m going to want to see some pretty good tests to be convinced, but I\u0027m not really sure how that\u0027s going to work in a fully synthetic environment...","commit_id":"ddf671d2020f9414fb466c660c70246a3c28aa3c"}]}