)]}' {"/PATCHSET_LEVEL":[{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"7176749036aaffae74764ed31dbd9c902d3ad444","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":2,"id":"54478f7d_da1bdedd","updated":"2026-06-04 10:47:46.000000000","message":"Overall a well written spec with good arguments why the extra complexity is needed. I have one question about RPC compatibility during upgrade. If that is settled I\u0027m happy to approve this.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"}],"specs/2026.2/approved/bidirectional-rpc-liveness-handshake-for-compute-nodes.rst":[{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"7176749036aaffae74764ed31dbd9c902d3ad444","unresolved":true,"context_lines":[{"line_number":12,"context_line":"continue to falsely report the node as healthy, rendering external monitoring"},{"line_number":13,"context_line":"tools blind to such issues."},{"line_number":14,"context_line":""},{"line_number":15,"context_line":"To resolve this, we propose replacing the current database update with a"},{"line_number":16,"context_line":"stateless, asynchronous RPC handshake between the compute node and"},{"line_number":17,"context_line":"nova-conductor. 
By forcing the heartbeat payload to travel through the"},{"line_number":18,"context_line":"node\u0027s main listener queue, this architecture ensures a compute node is only"}],"source_content_type":"text/x-rst","patch_set":2,"id":"bdc32e66_96a3bef2","line":15,"range":{"start_line":15,"start_character":28,"end_line":15,"end_character":37},"updated":"2026-06-04 10:47:46.000000000","message":"enhance / extend maybe?","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"7176749036aaffae74764ed31dbd9c902d3ad444","unresolved":false,"context_lines":[{"line_number":15,"context_line":"To resolve this, we propose replacing the current database update with a"},{"line_number":16,"context_line":"stateless, asynchronous RPC handshake between the compute node and"},{"line_number":17,"context_line":"nova-conductor. By forcing the heartbeat payload to travel through the"},{"line_number":18,"context_line":"node\u0027s main listener queue, this architecture ensures a compute node is only"},{"line_number":19,"context_line":"marked as \"up\" if it is actively capable of processing incoming RPC messages."},{"line_number":20,"context_line":""},{"line_number":21,"context_line":""},{"line_number":22,"context_line":"Problem description"}],"source_content_type":"text/x-rst","patch_set":2,"id":"5b1db15c_b9c609ce","line":19,"range":{"start_line":18,"start_character":28,"end_line":19,"end_character":77},"updated":"2026-06-04 10:47:46.000000000","message":"+1, this is the meat of the change, using the same RPC mechanism to get better detection.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":9708,"name":"Balazs 
Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"7176749036aaffae74764ed31dbd9c902d3ad444","unresolved":false,"context_lines":[{"line_number":45,"context_line":" nova services ended up in weird/inconsistent states (e.g. volume_attachment"},{"line_number":46,"context_line":" problems / weird server states / failing operations / operations ending in"},{"line_number":47,"context_line":" error states) rendering the control plane unstable and unable to recover"},{"line_number":48,"context_line":" itself."},{"line_number":49,"context_line":""},{"line_number":50,"context_line":"When this occurs, the background heartbeat often continues to run perfectly,"},{"line_number":51,"context_line":"falsely reporting the node as \"up\" to the rest of the control plane. The"}],"source_content_type":"text/x-rst","patch_set":2,"id":"1e29a5d6_615634ba","line":48,"updated":"2026-06-04 10:47:46.000000000","message":"Thanks for adding the specific error scenarios. 
This helps us understand why you want to change the current mechanism","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"7176749036aaffae74764ed31dbd9c902d3ad444","unresolved":true,"context_lines":[{"line_number":80,"context_line":"Proposed change"},{"line_number":81,"context_line":"\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d"},{"line_number":82,"context_line":""},{"line_number":83,"context_line":"Todo: graceful"},{"line_number":84,"context_line":""},{"line_number":85,"context_line":""},{"line_number":86,"context_line":"We propose introducing a stateless, self-validating three-way RPC handshake to"}],"source_content_type":"text/x-rst","patch_set":2,"id":"a085117c_07d15ec4","line":83,"updated":"2026-06-04 10:47:46.000000000","message":"Do you mean you want to say something about graceful shutdown here?","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"2bec285b56f0f15b7594228c846a903270609682","unresolved":true,"context_lines":[{"line_number":80,"context_line":"Proposed change"},{"line_number":81,"context_line":"\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d"},{"line_number":82,"context_line":""},{"line_number":83,"context_line":"Todo: graceful"},{"line_number":84,"context_line":""},{"line_number":85,"context_line":""},{"line_number":86,"context_line":"We propose introducing a stateless, self-validating three-way RPC handshake to"}],"source_content_type":"text/x-rst","patch_set":2,"id":"42133e49_a72adb6d","line":83,"in_reply_to":"90062a8d_dab347b7","updated":"2026-06-23 19:29:25.000000000","message":"During shutdown, I think the marking/getting node as not healthy 
or down in this RPC handshake makes sense. Because the regular RPC server is down, the compute is not up for the new operations and is just finishing the in-progress one. That is why I feel doing this handshake via the regular RPC server (\u0027compute\u0027 topic) makes more sense.\n\nIf we want to use the \u0027compute-alt\u0027 RPC server (which is not a bad idea to reduce the overhead on the regular RPC server (if any)) then we can just stop this handshake when shutdown is initiated. When shutdown is initiated then compute knows about it and can stop the periodic updates including this one. I commented about it in the upgrade impact discussion.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":39013,"name":"Philipp Dreesens","email":"philipp.dreesens@digits.schwarz","username":"phildree"},"change_message_id":"ea0d2daf5bb5506d60de0326518e9f6884eb3b48","unresolved":true,"context_lines":[{"line_number":80,"context_line":"Proposed change"},{"line_number":81,"context_line":"\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d"},{"line_number":82,"context_line":""},{"line_number":83,"context_line":"Todo: graceful"},{"line_number":84,"context_line":""},{"line_number":85,"context_line":""},{"line_number":86,"context_line":"We propose introducing a stateless, self-validating three-way RPC handshake to"}],"source_content_type":"text/x-rst","patch_set":2,"id":"90062a8d_dab347b7","line":83,"in_reply_to":"a085117c_07d15ec4","updated":"2026-06-11 13:19:34.000000000","message":"Thx for pointing out my missed todo.\nTo be honest I don\u0027t fully remember but I guess the thought was that when the node is doing a graceful shutdown no new messages are accepted by the regular message queue (compute topic) so the heartbeat has to switch to the alt topic in case heartbeat updates to the db are necessary during this shutdown period at 
all.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"7176749036aaffae74764ed31dbd9c902d3ad444","unresolved":true,"context_lines":[{"line_number":94,"context_line":"1. **The Echo Request (Compute):** The nova-compute ServiceGroup\u0027s threadgroup"},{"line_number":95,"context_line":" is triggered. Instead of directly requesting a DB update, it generates a"},{"line_number":96,"context_line":" local timestamp (``T_comp``) using a monotonic clock (e.g.,"},{"line_number":97,"context_line":" time.monotonic()) to ensure immunity to NTP drift or leap seconds. It then"},{"line_number":98,"context_line":" casts an ``echo_liveness_challenge`` to the conductors. *Payload:*"},{"line_number":99,"context_line":" ``{\"host\": \"compute-01\", \"timestamp\": T_comp}``"},{"line_number":100,"context_line":""},{"line_number":101,"context_line":"2. **The Echo Response (Conductor):** A stateless nova-conductor receives"},{"line_number":102,"context_line":" the request. It does not yet touch the database. It simply prepares a"}],"source_content_type":"text/x-rst","patch_set":2,"id":"d37583a5_11369622","line":99,"range":{"start_line":97,"start_character":69,"end_line":99,"end_character":50},"updated":"2026-06-04 10:47:46.000000000","message":"This needs an conductor RPC change","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":39013,"name":"Philipp Dreesens","email":"philipp.dreesens@digits.schwarz","username":"phildree"},"change_message_id":"ea0d2daf5bb5506d60de0326518e9f6884eb3b48","unresolved":false,"context_lines":[{"line_number":94,"context_line":"1. **The Echo Request (Compute):** The nova-compute ServiceGroup\u0027s threadgroup"},{"line_number":95,"context_line":" is triggered. 
Instead of directly requesting a DB update, it generates a"},{"line_number":96,"context_line":" local timestamp (``T_comp``) using a monotonic clock (e.g.,"},{"line_number":97,"context_line":" time.monotonic()) to ensure immunity to NTP drift or leap seconds. It then"},{"line_number":98,"context_line":" casts an ``echo_liveness_challenge`` to the conductors. *Payload:*"},{"line_number":99,"context_line":" ``{\"host\": \"compute-01\", \"timestamp\": T_comp}``"},{"line_number":100,"context_line":""},{"line_number":101,"context_line":"2. **The Echo Response (Conductor):** A stateless nova-conductor receives"},{"line_number":102,"context_line":" the request. It does not yet touch the database. It simply prepares a"}],"source_content_type":"text/x-rst","patch_set":2,"id":"66746b88_4864944e","line":99,"range":{"start_line":97,"start_character":69,"end_line":99,"end_character":50},"in_reply_to":"d37583a5_11369622","updated":"2026-06-11 13:19:34.000000000","message":"Acknowledged","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"7176749036aaffae74764ed31dbd9c902d3ad444","unresolved":true,"context_lines":[{"line_number":102,"context_line":" the request. It does not yet touch the database. It simply prepares a"},{"line_number":103,"context_line":" targeted RPC client (server\u003decho_liveness_challenge.host) and casts a"},{"line_number":104,"context_line":" ``liveness_challenge_response`` directly to the specific compute node."},{"line_number":105,"context_line":" *Payload:* ``{\"timestamp\": T_comp}``"},{"line_number":106,"context_line":""},{"line_number":107,"context_line":"3. **Validation \u0026 DB Update (Main Listener):** The ``nova-compute`` main RPC"},{"line_number":108,"context_line":" listener thread pulls the response from its queue. 
It compares the returned"}],"source_content_type":"text/x-rst","patch_set":2,"id":"8280e18a_ba7206f4","line":105,"updated":"2026-06-04 10:47:46.000000000","message":"this needs an compute RPC change.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":39013,"name":"Philipp Dreesens","email":"philipp.dreesens@digits.schwarz","username":"phildree"},"change_message_id":"ea0d2daf5bb5506d60de0326518e9f6884eb3b48","unresolved":false,"context_lines":[{"line_number":102,"context_line":" the request. It does not yet touch the database. It simply prepares a"},{"line_number":103,"context_line":" targeted RPC client (server\u003decho_liveness_challenge.host) and casts a"},{"line_number":104,"context_line":" ``liveness_challenge_response`` directly to the specific compute node."},{"line_number":105,"context_line":" *Payload:* ``{\"timestamp\": T_comp}``"},{"line_number":106,"context_line":""},{"line_number":107,"context_line":"3. **Validation \u0026 DB Update (Main Listener):** The ``nova-compute`` main RPC"},{"line_number":108,"context_line":" listener thread pulls the response from its queue. It compares the returned"}],"source_content_type":"text/x-rst","patch_set":2,"id":"120e3728_b14b8ddc","line":105,"in_reply_to":"8280e18a_ba7206f4","updated":"2026-06-11 13:19:34.000000000","message":"Acknowledged","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"7176749036aaffae74764ed31dbd9c902d3ad444","unresolved":false,"context_lines":[{"line_number":179,"context_line":"of normal activity. 
To reduce unnecessary load on the message broker, a future"},{"line_number":180,"context_line":"iteration could optimize this mechanism by only triggering the active handshake"},{"line_number":181,"context_line":"if the compute node has not been processing incoming RPC messages for a"},{"line_number":182,"context_line":"specific duration (probably ``service_down_time`` would make sense here)."},{"line_number":183,"context_line":""},{"line_number":184,"context_line":"Alternatives"},{"line_number":185,"context_line":"------------"}],"source_content_type":"text/x-rst","patch_set":2,"id":"caad8a99_df975008","line":182,"updated":"2026-06-04 10:47:46.000000000","message":"I agree that we can consider this in the future. Also this somewhat relates to the healthcheck effort we paused for a while. So if you are interested to move this further in a future release then I suggest to also look at the spec: https://specs.openstack.org/openstack//nova-specs/specs/2025.1/approved/per-process-healthchecks.html","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"2bec285b56f0f15b7594228c846a903270609682","unresolved":false,"context_lines":[{"line_number":179,"context_line":"of normal activity. 
To reduce unnecessary load on the message broker, a future"},{"line_number":180,"context_line":"iteration could optimize this mechanism by only triggering the active handshake"},{"line_number":181,"context_line":"if the compute node has not been processing incoming RPC messages for a"},{"line_number":182,"context_line":"specific duration (probably ``service_down_time`` would make sense here)."},{"line_number":183,"context_line":""},{"line_number":184,"context_line":"Alternatives"},{"line_number":185,"context_line":"------------"}],"source_content_type":"text/x-rst","patch_set":2,"id":"d48ab194_851fa7a8","line":182,"in_reply_to":"caad8a99_df975008","updated":"2026-06-23 19:29:25.000000000","message":"++, I prefer to do it along with this but as we agreed at the PTG, let\u0027s keep it aside for now and if the healthchecks effort is done before that then we will be good.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"7176749036aaffae74764ed31dbd9c902d3ad444","unresolved":true,"context_lines":[{"line_number":265,"context_line":" completely safe. 
Deployments will utilize the legacy 1-way heartbeat during"},{"line_number":266,"context_line":" the upgrade process."},{"line_number":267,"context_line":"* **RPC Version Bump:** The ``nova-compute`` and ``nova-conductor`` RPC APIs"},{"line_number":268,"context_line":" require a version bump to support the two new \"ping-pong\" methods."},{"line_number":269,"context_line":"* **Rolling Upgrades:** If an operator enables the feature during a"},{"line_number":270,"context_line":" mixed-version window, older compute nodes will lack the new configuration and"},{"line_number":271,"context_line":" will continue sending legacy ``service_update`` messages from their timer"}],"source_content_type":"text/x-rst","patch_set":2,"id":"310edc80_c54e5cc2","line":268,"updated":"2026-06-04 10:47:46.000000000","message":"We need to bump both the conductor and the compute RPC. We can assume that during the upgrade all the conductors are upgraded before the computes are upgraded. So when a new compute comes up it can already utilize the new conductor RPC version to send echo_liveness_challenge. But at that point the conductor might not be able to send a liveness_challenge_response via the compute RPC if there are old computes in the system. @sbauza@redhat.com @dms@danplanet.com Am I correct about this, or does our RPC allow selectively sending the new message to new computes supporting the new version?","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":39013,"name":"Philipp Dreesens","email":"philipp.dreesens@digits.schwarz","username":"phildree"},"change_message_id":"ca862877ebfad67c0483cff3cd6fcc65824ed6fd","unresolved":true,"context_lines":[{"line_number":265,"context_line":" completely safe. 
Deployments will utilize the legacy 1-way heartbeat during"},{"line_number":266,"context_line":" the upgrade process."},{"line_number":267,"context_line":"* **RPC Version Bump:** The ``nova-compute`` and ``nova-conductor`` RPC APIs"},{"line_number":268,"context_line":" require a version bump to support the two new \"ping-pong\" methods."},{"line_number":269,"context_line":"* **Rolling Upgrades:** If an operator enables the feature during a"},{"line_number":270,"context_line":" mixed-version window, older compute nodes will lack the new configuration and"},{"line_number":271,"context_line":" will continue sending legacy ``service_update`` messages from their timer"}],"source_content_type":"text/x-rst","patch_set":2,"id":"f7b01262_ec6f00f7","line":268,"in_reply_to":"20993b21_8dfb6068","updated":"2026-06-12 11:47:02.000000000","message":"I think I understand what you meant now @dms@danplanet.com, thank you for the clarification!\n\nJust to ensure I have the workflow mapped out correctly:\n\n1) Initiation (call): The compute node initiates the handshake by sending the echo_liveness_challenge to the conductor as a call.\n\n2) Conductor Evaluation \u0026 Fallback:\n- If the pin is too old (due to pending compute upgrades or conductors needing a restart for pin recalculation), the conductor hits the RPC pin limitation. Because the compute initiated this via a call, the conductor propagates this information back to the compute node as an error. 
The compute catches this error, knows the new workflow isn\u0027t ready, and gracefully falls back to the regular DB update method.\n- If the pin is updated, the conductor returns the call (acknowledging receipt) and then initiates the liveness_challenge_response back to the compute node as a standard cast, thereby covering the compute node\u0027s \"inbound\" RPC message path.\n\n3) Queue Validation: If the compute node received the successful call return, it now waits for the subsequent cast to arrive on its main compute topic listener queue before officially declaring itself \"up\" and updating the database (otherwise, as stated above, it falls back to the regular DB update method).\n\nAs an additional optimization, this call initiation would only need to be employed until it succeeds for the first time. Once a call returns successfully, the compute node knows the conductor\u0027s RPC pin has been fully updated. For all subsequent heartbeats, the compute node can switch to using a standard cast to initiate the challenge, avoiding the overhead of synchronous calls once the upgrade is confirmed complete.\n\n\nIf I have that right, I think this is a good solution to prevent the failure during rolling upgrades and still completely fulfills the spec\u0027s goal of validating both \"rx/tx\" of a compute node.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"648d1ba667b683e6abd6b89e6f3850c9fbdd9915","unresolved":true,"context_lines":[{"line_number":265,"context_line":" completely safe. 
Deployments will utilize the legacy 1-way heartbeat during"},{"line_number":266,"context_line":" the upgrade process."},{"line_number":267,"context_line":"* **RPC Version Bump:** The ``nova-compute`` and ``nova-conductor`` RPC APIs"},{"line_number":268,"context_line":" require a version bump to support the two new \"ping-pong\" methods."},{"line_number":269,"context_line":"* **Rolling Upgrades:** If an operator enables the feature during a"},{"line_number":270,"context_line":" mixed-version window, older compute nodes will lack the new configuration and"},{"line_number":271,"context_line":" will continue sending legacy ``service_update`` messages from their timer"}],"source_content_type":"text/x-rst","patch_set":2,"id":"ea965d54_6eb40064","line":268,"in_reply_to":"2ffa070c_227e09b0","updated":"2026-06-04 15:35:34.000000000","message":"Since this is all initiated from a compute that is obviously new enough (else it would not have used the new RPC cast to conductor, I think we\u0027re fairly safe here). I think conductor will need to check and honor the compute RPC pin, which means it won\u0027t be able to do the second step if the pin is too low and will have to raise an error. Thus, we\u0027ll have to note that enabling this before all computes are upgraded will be problematic (and thus maybe compute should also refuse to let you set the config until it can tell all computes are upgraded). The problem will be that you have to upgrade all computes *then* restart all conductors so they recalculate their pin, and only then will it work. If you do enable it before that, then computes will silently not be able to report their liveness.\n\nIs there some reason we wouldn\u0027t consider a call-\u003ecall instead of cast-\u003ecast here? 
I think that will achieve the same result of making sure the conductor-\u003ecompute happens through the same queue as normal requests, but with the benefit of computes being informed by conductor that \"I\u0027m not allowed to do this update for you because I can\u0027t make the new call back to you because of the pin.\" That would eliminate the silent \"compute thinks it is using the new path but failing\" mode, and would also allow the compute to have this enabled from the start, and only start using it once the new call to conductor stops reporting that it\u0027s too early.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":7166,"name":"Sylvain Bauza","email":"sbauza@redhat.com","username":"sbauza"},"change_message_id":"7535a146166aa04528552430323cafb78dbf6ac0","unresolved":true,"context_lines":[{"line_number":265,"context_line":" completely safe. Deployments will utilize the legacy 1-way heartbeat during"},{"line_number":266,"context_line":" the upgrade process."},{"line_number":267,"context_line":"* **RPC Version Bump:** The ``nova-compute`` and ``nova-conductor`` RPC APIs"},{"line_number":268,"context_line":" require a version bump to support the two new \"ping-pong\" methods."},{"line_number":269,"context_line":"* **Rolling Upgrades:** If an operator enables the feature during a"},{"line_number":270,"context_line":" mixed-version window, older compute nodes will lack the new configuration and"},{"line_number":271,"context_line":" will continue sending legacy ``service_update`` messages from their timer"}],"source_content_type":"text/x-rst","patch_set":2,"id":"2ffa070c_227e09b0","line":268,"in_reply_to":"310edc80_c54e5cc2","updated":"2026-06-04 13:15:27.000000000","message":"you\u0027re right, nothing in the RPC API selectively checks new computes. 
Either the operator pins a specific RPC API version or, when starting the RPC client, the conductor pins the API version to the oldest version across all the computes.\nhttps://github.com/openstack/nova/blob/278c6e305c3da085fd1c1e338e95ce4d63631272/nova/compute/rpcapi.py#L495\n\nThere are some alternatives (either/or):\n* we can document that the conf option should only be enabled after all computes in the cell are upgraded.\n\n* we could add a conductor-side fallback: in the echo_liveness_challenge handler, the conductor checks can_send_version. If it can\u0027t send the response, it falls back to performing the DB update directly\n\n* we could provide an exception via the existing per-host version checking pattern: the conductor could look up the specific compute\u0027s service version and stop if it is too old.\n\nHonestly, I didn\u0027t have a lot of time to review the whole spec, so maybe I\u0027m wrong about the alternatives.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"2bec285b56f0f15b7594228c846a903270609682","unresolved":true,"context_lines":[{"line_number":265,"context_line":" completely safe. 
Deployments will utilize the legacy 1-way heartbeat during"},{"line_number":266,"context_line":" the upgrade process."},{"line_number":267,"context_line":"* **RPC Version Bump:** The ``nova-compute`` and ``nova-conductor`` RPC APIs"},{"line_number":268,"context_line":" require a version bump to support the two new \"ping-pong\" methods."},{"line_number":269,"context_line":"* **Rolling Upgrades:** If an operator enables the feature during a"},{"line_number":270,"context_line":" mixed-version window, older compute nodes will lack the new configuration and"},{"line_number":271,"context_line":" will continue sending legacy ``service_update`` messages from their timer"}],"source_content_type":"text/x-rst","patch_set":2,"id":"16510531_d5c6101f","line":268,"in_reply_to":"84d267cd_3362e348","updated":"2026-06-23 19:29:25.000000000","message":"\u003e I guess, but I don\u0027t see why you wouldn\u0027t just make it the same all the time. Compute makes a call (instead of a cast) to kick off the probe. If it returns then we do the new thing. The \"overhead of synchronous calls\" (of which compute makes a ton of all the time already) is not more important than \"the overhead of a more complex set of code\" IMHO.\n\nIf we see the overhead on RPC, we can do this handshake via another RPC server (\u0027compute-alt\u0027 topic), and we can easily stop the handshake when shutdown is initiated (when regular RPC server is down).","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"96568f19b9225687987dba92d592daa095cea641","unresolved":true,"context_lines":[{"line_number":265,"context_line":" completely safe. 
Deployments will utilize the legacy 1-way heartbeat during"},{"line_number":266,"context_line":" the upgrade process."},{"line_number":267,"context_line":"* **RPC Version Bump:** The ``nova-compute`` and ``nova-conductor`` RPC APIs"},{"line_number":268,"context_line":" require a version bump to support the two new \"ping-pong\" methods."},{"line_number":269,"context_line":"* **Rolling Upgrades:** If an operator enables the feature during a"},{"line_number":270,"context_line":" mixed-version window, older compute nodes will lack the new configuration and"},{"line_number":271,"context_line":" will continue sending legacy ``service_update`` messages from their timer"}],"source_content_type":"text/x-rst","patch_set":2,"id":"20993b21_8dfb6068","line":268,"in_reply_to":"a577a882_e3406a7e","updated":"2026-06-11 16:40:45.000000000","message":"\u003e Regarding @dms@danplanet.com\u0027s call suggestion: Using a call for the actual handshake defeats the core purpose of this spec, because the return payload uses AMQP\u0027s temporary reply_to queue, completely bypassing the Compute node\u0027s main compute topic listener queue.\n\nI think you\u0027re missing my point. I understand that the reply to a call bypasses the main RPC queue that you\u0027re trying to test. I\u0027m not suggesting that you make a single call from compute to conductor to use that as the \"am I alive\" signal. I\u0027m suggesting you chain two calls together (or a call and a cast). A call to conductor, which then makes the call (or cast) back to the compute. The call from compute to conductor will fail if the RPC pin is too old to allow the new workflow, which will signal to the compute that it needs to fall back. 
If the call returns, we know that the new-style thing was kicked off and we should wait for it to come in via the regular RPC queue before declaring that we\u0027re up.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":39013,"name":"Philipp Dreesens","email":"philipp.dreesens@digits.schwarz","username":"phildree"},"change_message_id":"ea0d2daf5bb5506d60de0326518e9f6884eb3b48","unresolved":true,"context_lines":[{"line_number":265,"context_line":" completely safe. Deployments will utilize the legacy 1-way heartbeat during"},{"line_number":266,"context_line":" the upgrade process."},{"line_number":267,"context_line":"* **RPC Version Bump:** The ``nova-compute`` and ``nova-conductor`` RPC APIs"},{"line_number":268,"context_line":" require a version bump to support the two new \"ping-pong\" methods."},{"line_number":269,"context_line":"* **Rolling Upgrades:** If an operator enables the feature during a"},{"line_number":270,"context_line":" mixed-version window, older compute nodes will lack the new configuration and"},{"line_number":271,"context_line":" will continue sending legacy ``service_update`` messages from their timer"}],"source_content_type":"text/x-rst","patch_set":2,"id":"a577a882_e3406a7e","line":268,"in_reply_to":"ea965d54_6eb40064","updated":"2026-06-11 13:19:34.000000000","message":"Just to make sure I completely understand the upgrade constraint: my initial idea of having updated computes use the new handshake while old computes use the legacy DB update doesn\u0027t work out-of-the-box because the Conductor\u0027s ComputeAPI client pins its global allowed RPC version to the oldest compute node in the cell.\n\nGiven that, it seems to me there are three potential paths forward:\n\n- Per-Host Versioning: Implement per-host RPC version checking on the Conductor so it can selectively send the liveness_challenge_response cast only to updated computes.\n\n- Conductor-Side Fallback: Implement @sbauza@redhat.com\u0027s alternative 
where the Conductor checks its global pin. If it is pinned to an older version and cannot send the return cast, the Conductor falls back to updating the DB directly on the compute\u0027s behalf.\n\n- Strict Documentation: As @dms@danplanet.com pointed out, just documenting \"enable after computes are upgraded\" is insufficient. We would have to explicitly document that operators must upgrade all computes, then restart all conductors to recalculate the pin, and only then enable the config.\n\n\n\nRegarding @dms@danplanet.com\u0027s call suggestion: Using a call for the actual handshake defeats the core purpose of this spec, because the return payload uses AMQP\u0027s temporary reply_to queue, completely bypassing the Compute node\u0027s main compute topic listener queue.\n\nHowever, maybe we could use a call strictly as a fallback? If the compute casts the challenge but doesn\u0027t receive the response in time, it could issue a secondary synchronous call to the conductor to check the cell\u0027s RPC pin status. Because the return value of a call bypasses the Conductor\u0027s outbound RPC version pin, the Conductor could successfully inform the compute of the mismatch. The compute node could then locally revert to the old DB update method until all compute nodes are updated and the cell\u0027s RPC pin supports the heartbeat mechanism.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"43444b090201ad62fcf3c7cb4ddb7627bd3ca1b6","unresolved":true,"context_lines":[{"line_number":265,"context_line":" completely safe. 
Deployments will utilize the legacy 1-way heartbeat during"},{"line_number":266,"context_line":" the upgrade process."},{"line_number":267,"context_line":"* **RPC Version Bump:** The ``nova-compute`` and ``nova-conductor`` RPC APIs"},{"line_number":268,"context_line":" require a version bump to support the two new \"ping-pong\" methods."},{"line_number":269,"context_line":"* **Rolling Upgrades:** If an operator enables the feature during a"},{"line_number":270,"context_line":" mixed-version window, older compute nodes will lack the new configuration and"},{"line_number":271,"context_line":" will continue sending legacy ``service_update`` messages from their timer"}],"source_content_type":"text/x-rst","patch_set":2,"id":"84d267cd_3362e348","line":268,"in_reply_to":"f7b01262_ec6f00f7","updated":"2026-06-12 14:01:48.000000000","message":"\u003e 2) Conductor Evaluation \u0026 Fallback:\n\u003e - If the pin is too old (due to pending compute upgrades or conductors needing a restart for pin recalculation), the conductor hits the RPC pin limitation. Because the compute initiated this via a call, the conductor propagates this information back to the compute node as an error. 
The compute catches this error, knows the new workflow isn\u0027t ready, and gracefully falls back to the regular DB update method.\n\nExactly.\n\n\u003e - If the pin is updated, the conductor returns the call (acknowledging receipt) and then initiates the liveness_challenge_response back to the compute node as a standard cast, thereby covering the compute node\u0027s \"inbound\" RPC message path.\n\nWell, it initiates the cast before returning, and if it returns (instead of errors with the RPC pin failure), then it knows it succeeded.\n\n\u003e 3) Queue Validation: If the compute node received the successful call return, it now waits for the subsequent cast to arrive on its main compute topic listener queue before officially declaring itself \"up\" and updating the database (otherwise, as stated above, it falls back to the regular DB update method).\n\nI\u0027m not sure how much synchronous behavior you\u0027re actually planning versus just writing here for the sake of explanation, but I don\u0027t think the compute node needs to do all this in a single step.\n\n1. Timer fires to update liveness\n2. Compute sets the \"time of last round-trip check\" variable\n3. Compute calls to conductor to initiate the round-trip\n4. If that failed, then update in DB as previous and stop\n5. If that succeeded, then the periodic just exits\n\nseparately:\n\n1. Liveness challenge comes into the compute (via cast from conductor in step 5 above)\n2. Compute compares the timing in the message to the timer set in step 2 above\n3. Compute updates its liveness in the DB if the two are within the acceptable window. If not, it doesn\u0027t and logs a warning.\n\n\u003e As an additional optimization, this call initiation would only need to be employed until it succeeds for the first time. Once a call returns successfully, the compute node knows the conductor\u0027s RPC pin has been fully updated. 
For all subsequent heartbeats, the compute node can switch to using a standard cast to initiate the challenge, avoiding the overhead of synchronous calls once the upgrade is confirmed complete.\n\nI guess, but I don\u0027t see why you wouldn\u0027t just make it the same all the time. Compute makes a call (instead of a cast) to kick off the probe. If it returns then we do the new thing. The \"overhead of synchronous calls\" (of which compute makes a ton of all the time already) is not more important than \"the overhead of a more complex set of code\" IMHO.\n\n\u003e If I have that right, I think this is a good solution to prevent the failure during rolling upgrades and still completely fulfills the spec\u0027s goal of validating both \"rx/tx\" of a compute node.\n\nTo be clear, I\u0027d prefer it just work this way and not be a \"during rolling upgrades\" difference. Having it do this all the time also provides a benefit of some discoverability. If unchecked casts going to a headless service (conductor) are getting dropped, consumed, not acted on, etc and there\u0027s no real indication of that happening other than some computes who never report they\u0027re alive will be super frustrating. Having the computes able to log \"conductor never responded to my liveness probe request\" or \"conductor is failing to send the liveness probe for an unexpected reason\" instead of just silence would be a feature I think. IIRC right now, we just fail with a stack trace in a (proxied) DB update call and while those with tribal knowledge know what it means, it\u0027s very obscure. 
But, at least it\u0027s there because we\u0027re doing it via a call.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"7176749036aaffae74764ed31dbd9c902d3ad444","unresolved":true,"context_lines":[{"line_number":357,"context_line":" assert the node is correctly marked down despite the active timer."},{"line_number":358,"context_line":" * Rolling Upgrades: Verify backward compatibility by running mixed versions"},{"line_number":359,"context_line":" of compute and conductor services."},{"line_number":360,"context_line":""},{"line_number":361,"context_line":"Documentation Impact"},{"line_number":362,"context_line":"\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d\u003d"},{"line_number":363,"context_line":""}],"source_content_type":"text/x-rst","patch_set":2,"id":"adb62e76_3c276e3a","line":360,"updated":"2026-06-04 10:47:46.000000000","message":"Change nova-next CI job configuration to use the new heartbeat method implemented here so we can see it working during tempest as well.","commit_id":"14e7aac061ae6a4062e295d224c1da4215573308"}]}