)]}'
{"nova/virt/libvirt/host.py":[{"author":{"_account_id":11604,"name":"sean mooney","email":"smooney@redhat.com","username":"sean-k-mooney"},"change_message_id":"193c95ee78c45f6fbcc61814030636c8ca515295","unresolved":true,"context_lines":[{"line_number":294,"context_line":"                transition \u003d virtevent.EVENT_LIFECYCLE_MIGRATION_COMPLETED"},{"line_number":295,"context_line":"            else:"},{"line_number":296,"context_line":"                # Failed or some other status we don\u0027t know about, so just"},{"line_number":297,"context_line":"                # opt to report the guest is paused."},{"line_number":298,"context_line":"                transition \u003d virtevent.EVENT_LIFECYCLE_PAUSED"},{"line_number":299,"context_line":"        elif event \u003d\u003d libvirt.VIR_DOMAIN_EVENT_RESUMED:"},{"line_number":300,"context_line":"            transition \u003d virtevent.EVENT_LIFECYCLE_RESUMED"}],"source_content_type":"text/x-python","patch_set":1,"id":"80b6a2a0_591c5cab","line":297,"updated":"2025-11-04 19:54:26.000000000","message":"My concern with this approach is that it may reintroduce\nhttps://bugs.launchpad.net/nova/+bug/1788014","commit_id":"f16ce085b35ee110953923d3d38e386315282394"},{"author":{"_account_id":11604,"name":"sean mooney","email":"smooney@redhat.com","username":"sean-k-mooney"},"change_message_id":"a327c2fddd2fd829c96224c2961cd7a2abf1bf98","unresolved":true,"context_lines":[{"line_number":294,"context_line":"                transition \u003d virtevent.EVENT_LIFECYCLE_MIGRATION_COMPLETED"},{"line_number":295,"context_line":"            else:"},{"line_number":296,"context_line":"                # Failed or some other status we don\u0027t know about, so just"},{"line_number":297,"context_line":"                # opt to report the guest is paused."},{"line_number":298,"context_line":"                transition \u003d virtevent.EVENT_LIFECYCLE_PAUSED"},{"line_number":299,"context_line":"        elif event \u003d\u003d 
libvirt.VIR_DOMAIN_EVENT_RESUMED:"},{"line_number":300,"context_line":"            transition \u003d virtevent.EVENT_LIFECYCLE_RESUMED"}],"source_content_type":"text/x-python","patch_set":1,"id":"d6f2abac_7fd26ff7","line":297,"in_reply_to":"80b6a2a0_591c5cab","updated":"2025-11-05 12:25:18.000000000","message":"So this does work in the happy path as expected.\nIt does break the unit test added by https://review.opendev.org/c/openstack/nova/+/594527\n\n\nThe catch-22 we have is as follows.\n\nAs reported in https://bugs.launchpad.net/nova/+bug/2128665, guest.get_job_info()\n\nwill block while libvirt is cleaning up the VM, but the VM will already be running on the destination.\n\nBut if you\u0027re using transparent hugepages (which is a valid thing to do with VMs in general)\nyou can trigger this kernel bug that is only fixed in 6.13:\nhttps://lore.kernel.org/all/20240626191830.3819324-1-yang@os.amperecomputing.com/\n\nThat severely degrades the performance of memory allocations and deallocations,\n\nso while cleaning up the guest memory should be very quick, it can take tens of seconds.\n\nI am wondering two things:\n\n1. Should we add a workaround flag that allows treating VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED as EVENT_LIFECYCLE_MIGRATION_COMPLETED?\n\n2. If not, I was discussing this with Zhan Zhang and they suggested creating a separate callback just to activate the networking early.\n\nThis would mean splitting post live migration in two: first we would do port activation on\nVIR_DOMAIN_EVENT_SUSPENDED_MIGRATED or VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY,\n\nand then do the rest of post live migration only on VIR_DOMAIN_JOB_COMPLETED.\n\nThat would need a spec, as we would be changing the live migration workflow quite significantly.\n\n\nThe concern with just doing what I have in the PoC patch is that VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED can be sent if the migration errors, and we will then resume on the current host,\n\nso we would need to ensure in that case that we properly fix the port binding in 
Neutron,\nwhich I obviously have not done here.\n\nIf we were to implement that today, we would be adding network downtime in the live migration failure case.\n\nSo basically the workaround flag would allow operators who hit the kernel bug to work around it in the happy path, at the cost of some network downtime if the migration fails and we resume the VM on the current host. That is likely better than downtime when the migration succeeds, as failure should be much, much less frequent.\n\nI would like to know how others feel about this.\n\nShould we have a spec to design the more advanced migration workflow, with network activation split from the rest of post live migration,\n\nor should we treat this as a bug, with a workaround flag to allow operators to opt in to the earlier activation of the port binding?\n\nAlso, important context: this should not be required for ml2/ovn.\nWe modified OVN and ml2/ovn so that the port can be activated at the OVN level via the RARP packet from QEMU. This should not impact ml2/ovs either.\n\nThe original bug is reported for Calico, which does not properly set up networking on the destination host until the port binding is active.\n\nThis is technically not correct: it should configure networking on all hosts with a port binding if the tap device is present and the bind host or migrating_to references the relevant host.\n\nThis could be improved in Calico\u0027s implementation to bring it in line with ml2/ovs and ml2/ovn. We had to explicitly modify ml2/ovn and ml2/odl in the past to do this, as it is commonly missed in third-party ml2 drivers.\n\nAll the Neutron core drivers support this behavior.","commit_id":"f16ce085b35ee110953923d3d38e386315282394"},{"author":{"_account_id":38560,"name":"Zhan Zhang","display_name":"Zhan","email":"zhanz1.ius@proton.me","username":"zhanz1","status":"Software Engineer @ Bloomberg"},"change_message_id":"4052077f74e31b6350d2995e34e1e63132430ac2","unresolved":true,"context_lines":[{"line_number":294,"context_line":"                transition 
\u003d virtevent.EVENT_LIFECYCLE_MIGRATION_COMPLETED"},{"line_number":295,"context_line":"            else:"},{"line_number":296,"context_line":"                # Failed or some other status we don\u0027t know about, so just"},{"line_number":297,"context_line":"                # opt to report the guest is paused."},{"line_number":298,"context_line":"                transition \u003d virtevent.EVENT_LIFECYCLE_PAUSED"},{"line_number":299,"context_line":"        elif event \u003d\u003d libvirt.VIR_DOMAIN_EVENT_RESUMED:"},{"line_number":300,"context_line":"            transition \u003d virtevent.EVENT_LIFECYCLE_RESUMED"}],"source_content_type":"text/x-python","patch_set":1,"id":"af179a16_2637b1dc","line":297,"in_reply_to":"90a26e1c_9bd82c52","updated":"2025-11-07 20:32:19.000000000","message":"Created https://blueprints.launchpad.net/nova/+spec/refine-network-setup-procedure-in-live-migrations for this.","commit_id":"f16ce085b35ee110953923d3d38e386315282394"},{"author":{"_account_id":38560,"name":"Zhan Zhang","display_name":"Zhan","email":"zhanz1.ius@proton.me","username":"zhanz1","status":"Software Engineer @ Bloomberg"},"change_message_id":"39f2f491e09ee570fc46648cb560d30fe708907f","unresolved":true,"context_lines":[{"line_number":294,"context_line":"                transition \u003d virtevent.EVENT_LIFECYCLE_MIGRATION_COMPLETED"},{"line_number":295,"context_line":"            else:"},{"line_number":296,"context_line":"                # Failed or some other status we don\u0027t know about, so just"},{"line_number":297,"context_line":"                # opt to report the guest is paused."},{"line_number":298,"context_line":"                transition \u003d virtevent.EVENT_LIFECYCLE_PAUSED"},{"line_number":299,"context_line":"        elif event \u003d\u003d libvirt.VIR_DOMAIN_EVENT_RESUMED:"},{"line_number":300,"context_line":"            transition \u003d 
virtevent.EVENT_LIFECYCLE_RESUMED"}],"source_content_type":"text/x-python","patch_set":1,"id":"90a26e1c_9bd82c52","line":297,"in_reply_to":"d6f2abac_7fd26ff7","updated":"2025-11-05 14:37:51.000000000","message":"Thanks for looking into this, Sean. A couple of points I want to add:\n\n- There can be other bugs/changes in the future (in libvirt, QEMU, or the kernel) that cause an extra delay between when the VM on the source is paused and when the migration is done, and get_job_info may or may not hang. Yes, it is due to THP/KSM this time, but who knows what it\u0027ll be in the future. So I think it would be better to modify Nova so that it is not affected by these unexpected behaviors, making it more resilient.\n\n- As for whether we should simply treat VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED as EVENT_LIFECYCLE_MIGRATION_COMPLETED or go a step further and extract the networking side of things out of post live migration, I don\u0027t have strong opinions, as long as we agree that we can start the port binding activation (after addressing all of the failure cases) when VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED happens. I\u0027m happy to work on either.\n\n- I\u0027m also thinking of making this configurable so that operators can choose what\u0027s best for their environment.","commit_id":"f16ce085b35ee110953923d3d38e386315282394"}]}
