)]}'
{"/PATCHSET_LEVEL":[{"author":{"_account_id":16688,"name":"Rodolfo Alonso","email":"ralonsoh@redhat.com","username":"rodolfo-alonso-hernandez"},"change_message_id":"b0d2fe85dc52369df421b20e2fb24079167df01c","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":2,"id":"ad1784a9_e4761859","updated":"2022-01-07 11:43:39.000000000","message":"This patch is actually hiding the issue reported in the bug. The metadata proxy won\u0027t be bound to any port, in this case the IPv6 metadata IP address. This problem is being addressed in https://review.opendev.org/c/openstack/neutron/+/820897, which stops the multiple-subnet \"crosstalk\" that leads to the dadfailed status.","commit_id":"287c4780e7b77697baac3c1b0bbc495ff6f2f0ec"},{"author":{"_account_id":16688,"name":"Rodolfo Alonso","email":"ralonsoh@redhat.com","username":"rodolfo-alonso-hernandez"},"change_message_id":"54062e7de03f235769261786d4379703b3904329","unresolved":true,"context_lines":[],"source_content_type":"","patch_set":2,"id":"ac64289f_10c0a2fe","in_reply_to":"65cc8c07_369742fe","updated":"2022-01-07 14:04:34.000000000","message":"I could \"fix\" the namespace connectivity, but that would not fix what is broken here, and the metadata agent still won\u0027t have connectivity. As commented, this patch hides the real issue just to prevent the DHCP namespace from being recreated. NetworkCache is updated according to the status of the DHCP namespace. If we have an error, as reported in the bug, the cache reflects this. 
If the metadata agent is not working, the DHCP namespace is not in a healthy state.\n\n[1] is the proper fix for this error.\n\n[1] https://review.opendev.org/c/openstack/neutron/+/820897","commit_id":"287c4780e7b77697baac3c1b0bbc495ff6f2f0ec"},{"author":{"_account_id":6541,"name":"Kamil Madac","email":"kamil.madac@gmail.com","username":"kamil-madac"},"change_message_id":"dca966af94fb0bdcd8cc0340119b1347ae5df3a6","unresolved":true,"context_lines":[],"source_content_type":"","patch_set":2,"id":"d0ba1f31_3ed5e8c6","in_reply_to":"ac64289f_10c0a2fe","updated":"2022-01-07 20:57:49.000000000","message":"Thanks for the clarification, I get it now. You are fine with the namespace being deleted when an exception occurs and recreated after a restart. From a developer\u0027s point of view that makes total sense. The problem I have is that this is not how it behaves, and I do not see from the code that this was the intention of whoever wrote it (there is no call to delete the namespace after an unsuccessful call of update_isolated_metadata_proxy). The DHCP namespace is not deleted when the exception occurs or when dadfailed happens, but much, much later, possibly months after the dadfailed issue.\n\nLet me explain the real behavior from a cloud operator\u0027s point of view:\n\nTo trigger the bug, we needed to restart the DHCP agent while an existing namespace was already in the dadfailed state. Then you could see the duplicate address exception being raised. Even after the exception was raised, everything worked normally at the IPv4 level (metadata/dhcp/everything), as I wrote, for several months on some networks (it depends on tenant activity). We saw that exception in the logs, but as we do not use IPv6 we did not bother with it, since everything appeared to work well for several months. Then, if a user creates a VM in such a network (probably any modification of a neutron port can trigger this), the DHCP agent silently deletes the whole namespace. There is no message at the INFO logging level. 
Only after turning on the debug level is there a message in the log:\n\n2022-01-01 08:17:52.841 7 DEBUG neutron.agent.linux.dhcp [req-07bf64cd-521a-4b75-b6ca-47d7b13cc9ce - - - - -] Killing dnsmasq for network since all subnets have turned off DHCP: d86b1ff7-1517-41f4-86e8-a07c97b04556 reload_allocations /var/lib/kolla/venv/lib/python3.8/site-packages/neutron/agent/linux/dhcp.py:586\n\nAs you can see, the DHCP agent thinks that all subnets have DHCP disabled, but that is not true. The network has subnets with DHCP enabled.\n\nThen one user after another opened incidents with the support team because VMs became unreachable as their leases expired one by one. Even from the debug logs we were not able to troubleshoot why the namespaces were deleted (the subnets had DHCP enabled and were assigned to the network). We had to add more log messages and found out that NetworkCache was incorrectly populated (missing subnets) and therefore the DHCP agent thought the namespace should be deleted. So the DHCP agent did not delete the namespace because of the duplicate address exception, but several days/weeks/months later, because we had restarted the DHCP agent and NetworkCache was incorrectly populated.\n\nI completely understand that the fix in 820897 will solve that one particular bug with the dadfailed interface, but this change set fixes a bug in the NetworkCache update. The issue addressed in 820897 just exposed that bug.\n\nMaybe it would be good to create a dedicated bug for the behavior I\u0027m describing here, so we can discuss it there and I can attach real logs from production. What do you think? 
Would that be a better way to go?\n\nSorry for such a long response.","commit_id":"287c4780e7b77697baac3c1b0bbc495ff6f2f0ec"},{"author":{"_account_id":6541,"name":"Kamil Madac","email":"kamil.madac@gmail.com","username":"kamil-madac"},"change_message_id":"42e6a9e168b171816e2047e9a9f153bb1ca9284e","unresolved":true,"context_lines":[],"source_content_type":"","patch_set":2,"id":"65cc8c07_369742fe","in_reply_to":"ad1784a9_e4761859","updated":"2022-01-07 12:52:39.000000000","message":"I do not think it hides the issue. The operator can still see the exception in the log as an ERROR, and in this particular case IPv6 metadata won\u0027t work, so it won\u0027t be hidden. But the bug will be confined to a particular DHCP feature, in this case only IPv6 metadata connectivity and nothing else.\n\nWhat it fixes is that the DHCP namespace and connectivity to all existing VMs in such a network won\u0027t be lost if an exception happens before the NetworkCache update, because NetworkCache will always be in a correct state. So here I\u0027m trying to fix incorrect NetworkCache records after a raised exception. This could potentially avoid more bugs, because who knows what else can happen when NetworkCache is incorrectly populated. Deletion of the DHCP namespace may be only one manifestation of such an inconsistency, and I think it is quite a severe one.","commit_id":"287c4780e7b77697baac3c1b0bbc495ff6f2f0ec"}]}
