)]}' {"/PATCHSET_LEVEL":[{"author":{"_account_id":9257,"name":"John Eckersberg","email":"jeckersb@redhat.com","username":"jeckersb"},"change_message_id":"c99e6a87ce287b501622a7029b21876f09c4a9f6","unresolved":true,"context_lines":[],"source_content_type":"","patch_set":1,"id":"1f53948f_4b2ebaf8","updated":"2021-11-09 14:21:45.000000000","message":"Yeah after sleeping on this, I\u0027m thinking that by fixing one race I\u0027ve probably introduced another race. By immediately waking up the eventloop to call _hard_reset, we\u0027ve now got a situation where the eventloop can start cleaning up via _hard_reset, while at the same time the pyngus callback is still running.\n\nI didn\u0027t notice this in my initial testing because I was abusing the fact that the original race is easy to reproduce with nova-api/eventlet/wsgi. But I think because eventlet is being used it was artificially eliminating this second race since the eventloop greenthread couldn\u0027t run until the pyngus callback finished and yielded the eventlet scheduler.\n\nI\u0027ll think about this a bit more.","commit_id":"3128a5b2de62299664939bba8f59ca3b484e8c1f"},{"author":{"_account_id":28522,"name":"Hervé Beraud","email":"herveberaud.pro@gmail.com","username":"hberaud"},"change_message_id":"d4f795705003211808d6379a91525b28d9d9c57a","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":1,"id":"622228e1_918c849f","updated":"2021-11-09 13:08:07.000000000","message":"recheck\n\nHello,\n\nWith these changes sometimes tests timed out locally and on the gates.\nEspecially the `test_amqp_driver` testing module [1].\n\nEach time I run the test with your changes I can see the following traceback in the logs:\n\n```\nTraceback (most recent call last):\n File \"/usr/lib64/python3.8/threading.py\", line 932, in _bootstrap_inner\n self.run()\n File \"/home/hberaud/dev/redhat/upstream/openstack/oslo/oslo.messaging/oslo_messaging/_drivers/amqp1_driver/eventloop.py\", line 348, in run\n self._main_loop()\n File \"/home/hberaud/dev/redhat/upstream/openstack/oslo/oslo.messaging/oslo_messaging/_drivers/amqp1_driver/eventloop.py\", line 392, in _main_loop\n self._requests.process_requests()\n File \"/home/hberaud/dev/redhat/upstream/openstack/oslo/oslo.messaging/oslo_messaging/_drivers/amqp1_driver/eventloop.py\", line 268, in process_requests\n r()\n File \"/home/hberaud/dev/redhat/upstream/openstack/oslo/oslo.messaging/oslo_messaging/_drivers/amqp1_driver/controller.py\", line 1112, in _start_shutdown\n if self._active:\n File \"/home/hberaud/dev/redhat/upstream/openstack/oslo/oslo.messaging/oslo_messaging/_drivers/amqp1_driver/controller.py\", line 1334, in _active\n self._socket_connection.pyngus_conn.active)\nAttributeError: \u0027NoneType\u0027 object has no attribute \u0027active\u0027\n```\n\nThat\u0027s not the case without these changes. I can\u0027t reproduce this error without these changes, so I suppose something needs to be adapted on the testing side too.\n\nThe problem appear within `TestCyrusAuthentication.test_authentication_failure` [1][2] but I suppose the problem the real problem is more or less located in the method `TestCyrusAuthentication._authentication_test` which is called by this test.\n\nIt looks like to a race condition with the active connection.\n\n[1] https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/tests/drivers/test_amqp_driver.py#L879\n[2] .tox/py38/bin/python -m testtools.run oslo_messaging.tests.drivers.test_amqp_driver.TestCyrusAuthentication.test_authentication_failure\n[3] https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/tests/drivers/test_amqp_driver.py#L855","commit_id":"3128a5b2de62299664939bba8f59ca3b484e8c1f"},{"author":{"_account_id":9257,"name":"John Eckersberg","email":"jeckersb@redhat.com","username":"jeckersb"},"change_message_id":"7c1ba65fc175ecea7ec8256d87f2eceb07faef6d","unresolved":true,"context_lines":[],"source_content_type":"","patch_set":2,"id":"9711ce71_eb6f6cb6","updated":"2021-11-09 21:04:24.000000000","message":"Actually I think v2 may be enough to fix this. It looks like the tests were failing because _hard_reset gets called immediately, which in turn ultimately calls eventloop._SocketConnection.reset(), and that method sets self.pyngus_conn \u003d None.\n\nMeanwhile, Controller._active was checking self._socket_connection and self._socket_connection.pyngus_conn.active, but wasn\u0027t considering that pyngus_conn could be None. So I just added an additional check for that case. Running the unit tests a few times locally seems to pass consistently now. We\u0027ll see what Zuul has to say about it.","commit_id":"02a38f507d8f0c377a2ef468e3497b5e897f1b09"},{"author":{"_account_id":9257,"name":"John Eckersberg","email":"jeckersb@redhat.com","username":"jeckersb"},"change_message_id":"969f9853156c8fbf789f3d057c413b1cb59b6a57","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":2,"id":"0dd8ca8f_bd228ee9","updated":"2021-12-10 14:32:19.000000000","message":"Adding Stephen who can hopefully give another +2 here.\n\nFor some context, I have been running this in private CI for a week now. Without this patch, tempest fails 100% of the time due to the race triggering when nova-api reconnects after missing heartbeats (which hopefully will get fixed with https://review.opendev.org/c/openstack/oslo.messaging/+/810510 when I finish sorting it out). With this patch included, the reconnects work properly and tempest has been passing 100% for me.","commit_id":"02a38f507d8f0c377a2ef468e3497b5e897f1b09"},{"author":{"_account_id":9257,"name":"John Eckersberg","email":"jeckersb@redhat.com","username":"jeckersb"},"change_message_id":"3e81e8413641a1b3c786b3b69fa484a3e6e86534","unresolved":true,"context_lines":[],"source_content_type":"","patch_set":2,"id":"61227e1a_1a201b4e","updated":"2021-11-10 14:53:42.000000000","message":"recheck\n\nLooks like everything mostly worked except a single gateway timeout on tempest","commit_id":"02a38f507d8f0c377a2ef468e3497b5e897f1b09"},{"author":{"_account_id":9257,"name":"John Eckersberg","email":"jeckersb@redhat.com","username":"jeckersb"},"change_message_id":"cc99eba2f0b434df12a33821bdb4f0cc890b7792","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":2,"id":"49ed3cea_a47faeff","updated":"2021-11-10 15:44:56.000000000","message":"recheck\n\ni think zuul was down and missed the previous recheck because it never got queued","commit_id":"02a38f507d8f0c377a2ef468e3497b5e897f1b09"},{"author":{"_account_id":28522,"name":"Hervé Beraud","email":"herveberaud.pro@gmail.com","username":"hberaud"},"change_message_id":"272f48f79b6ab6fa7dc9523a88f981118138208f","unresolved":true,"context_lines":[],"source_content_type":"","patch_set":2,"id":"7f287579_96469b04","in_reply_to":"9711ce71_eb6f6cb6","updated":"2021-11-15 12:47:19.000000000","message":"Good catch. The V2 LGTM and tests works as expected locally and on the gates.\nThanks","commit_id":"02a38f507d8f0c377a2ef468e3497b5e897f1b09"}]}