Patchset-level comments
-----------------------

Alan Bishop (ASBishop), Patch Set 1, 2024-12-02 22:14

In addition to the -1 regarding the lock name, I'm struggling to understand the root problem and how this patch would fix it.

The LP bug suggests a client keeps submitting requests to detach a volume because it's taking too long (8 minutes!!!) for the initial request to complete. This is way longer than the typical API and RPC timeouts (default: 60s). As a side question, why are these disconnect requests taking so long?

The LP bug then goes on to state that a lower driver lock in do_remove_volume_from_sg was acquired but never released. How is that happening? And how does adding another lock at a higher level resolve the problem? I don't see any other cinder driver using this sort of lock.

Yian Zong (yianzong), Patch Set 1, 2024-12-03 07:43, in reply to Alan Bishop

Hi Alan,

Thank you for the questions.

Red Hat support provided the following explanation for the multiple detach requests:
"- Natively, OpenShift can reschedule PODs from one Worker node to another one for any reason.
- If something goes wrong during the reschedule phase (e.g. a problem on the target node or something else), the detach from the OCP nodes could be invoked (or retried) multiple times within a short window to guarantee application availability within an acceptable time frame."

This looks like RHOSP behavior by design, and we can't make assumptions about how short the interval between requests is or how long the driver detach operation will take. If the detach finishes faster than the interval, the requests are effectively sequential and the driver can handle that. If the detach takes longer than the interval, as in this case, the requests become concurrent from the driver's point of view. That's why I added the global volume-ID lock for attach/detach: it ensures that, for any given volume, attach and detach operations are always sequential.

The customer repeatedly ran backup/restore of several VMs in parallel. Usually the driver detach operation took less than 30 seconds. In this specific case, one REST call to get masking views took 30 minutes to return, which looks like a temporary PowerMax issue. The 4.1.RH4 driver code has no REST client timeout settings; that is addressed in change 907768, which sets the default client timeout to 30 seconds [1].

The driver uses two PowerMax storage groups. One is the default storage group (not exported); all newly created volumes are added to it. The other is the storage group exported to hosts. Attaching a volume moves it from the default SG to the exported SG; detaching is the reverse. The SGs are shared by all volumes, so the driver takes many SG locks by design to prevent race conditions when operating on multiple volumes in parallel.

In this case, the 1st request detached the volume successfully, moving it from the exported SG to the default SG. The 2nd request started and waited for the lock. It was supposed to do the same thing, except the volume was already in the default SG. It acquired the lock on the default SG and then tried to acquire the same lock again, so a deadlock occurred.

The existing driver locks are designed for operating on multiple volumes concurrently. The global volume-ID lock added in this patch is for multiple attach/detach requests against one volume.

Multiple detach requests for one volume are quite a corner case for a cinder driver. Cinder is designed to prevent race conditions between multiple requests in the API [2] and in the manager [3][4], so it should be impossible for multiple requests to pass through cinder and reach the driver. However, it happened. As a last resort, the patch adds the new lock. BTW, the HPE 3PAR driver has a similar lock [5][6].

[1] https://review.opendev.org/c/openstack/cinder/+/907768
[2] https://opendev.org/openstack/cinder/src/branch/master/cinder/volume/api.py#L792-L804
[3] https://opendev.org/openstack/cinder/src/branch/master/cinder/volume/manager.py#L1487
[4] https://opendev.org/openstack/cinder/src/branch/master/cinder/volume/manager.py#L1398
[5] https://opendev.org/openstack/cinder/src/branch/master/cinder/volume/drivers/hpe/hpe_3par_fc.py#L170
[6] https://opendev.org/openstack/cinder/src/branch/master/cinder/volume/drivers/hpe/hpe_3par_fc.py#L305
Brian Rosmaita (brian-rosmaita), Patch Set 2, 2025-01-15 15:08

As Rajat mentioned above, we discussed this patch at the weekly cinder team meeting. My position is that the change does not look incorrect, and since the dell team has tested it and is convinced that it avoids the race condition, the cinder project team should not hold up this patch.

Rajat Dhasmana (whoami-rajat), Patch Set 2, 2025-01-15 14:46

Based on the discussion in today's cinder meeting [1] (January 15th), the team has reached the conclusion that this solution is acceptable, as it only concerns the dell powermax driver and does not affect the core cinder code.

[1] https://meetings.opendev.org/irclogs/%23openstack-meeting-alt/%23openstack-meeting-alt.2025-01-15.log.html#t2025-01-15T14:26:16

Jon Bernard (jbernard), Patch Set 2, 2025-01-16 13:59

Changes look okay to me. In this week's meeting we discussed the lock being too broad, but decided that it need not hold this patch up at the moment. As this change is isolated to the powermax driver only, I agree we can move forward.
Rajat Dhasmana (whoami-rajat), Patch Set 2, 2025-01-15 14:48

I will still keep my -1 so we have something to look at if we face any issues in the future related to this.

Yian Zong (yianzong), Patch Set 2, 2025-01-17 01:57

Thank you all for the review!

Rajat Dhasmana (whoami-rajat), Patch Set 2, 2024-12-13 09:00

The idea of a global lock around the attach/detach calls keyed on volume.id is reasonable, but the problem here is that the other locks deep in the operation are the main cause of this issue. I feel we should fix the critical section where the locks cause the problem rather than apply a band-aid at the top layer of the operation. As we have already identified in patch [1], the attach/detach operations simultaneously trying to acquire the locks on the storage group are the main source of the problem, so I would rather see a solution that enhances that part of the code. I also don't like the idea of multiple locks stacked together, which surely adds overhead and hampers code readability.

What we probably can do is (see the sketch after this list):
1. Replace the multiple locks with a single lock (or minimal locks) that includes all the variables required for the lock to not overlap between the attach/detach calls.
2. Minimize the code area covered by the lock, since that part of the code will always run sequentially and could cause performance issues in the future.
3. Instead of the volume ID, use a dell-specific LUN identifier in the storage group (probably the device ID?) for the locking.

[1] https://review.opendev.org/c/openstack/cinder/+/884957
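Since cinder's coordination.synchronized accepts several lock-name templates in one decorator (the same form Yian proposes below), suggestion 1 could look roughly like the following sketch; the method signature and lock scope here are illustrative assumptions, not a concrete proposal from the thread:

    from cinder import coordination


    class PowerMaxMaskingSketch(object):  # hypothetical stand-in class

        # One decorator naming every lock the operation needs, instead
        # of three nested critical sections; this keeps the full lock
        # set visible in one place.
        @coordination.synchronized(
                'emc-mv-{parent_name}-{serial_number}',
                'emc-mv-{mv_name}-{serial_number}',
                'emc-sg-{sg_name}-{serial_number}')
        def do_remove_volume_from_sg(self, sg_name, mv_name,
                                     parent_name, serial_number):
            # Keep only the storage-group mutation inside the locked
            # region to minimize the sequential code path (point 2).
            pass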
Yian Zong (yianzong), Patch Set 2, 2025-01-16 16:16

recheck tempest.api.compute.servers.test_server_rescue.ServerStableDeviceRescueTest Request timed out

Jean Pierre Roquesalane (happystacker), Patch Set 2, 2024-12-20 08:30

run-DellEMC PowerMAX CI

Yian Zong (yianzong), Patch Set 2, 2025-01-06 16:20, in reply to Rajat Dhasmana

The problem is that one operation (request) tries to acquire the lock of the default storage group twice.

The call sequence leading to the deadlock is below; the value of {sg_name} is the same as {default_sg_name}:

    @coordination.synchronized("emc-vol-{device_id}")
    remove_and_reset_members()
      -> remove_volume_from_sg()
        -> _cleanup_deletion()
          @coordination.synchronized("emc-sg-{sg_name}-{serial_number}")
          -> do_remove_volume_from_sg(sg_name, serial_number)
            -> _multiple_vols_in_sg()
              -> add_volume_to_default_storage_group()
                @coordination.synchronized("emc-sg-{default_sg_name}-{serial_number}")
                -> _move_vol_to_default_sg(default_sg_name, serial_number)

In this multi-detach case, the 1st request acquired the {device_id} lock and completed the detach, which moved the volume from the exported storage group to the default storage group. The 2nd request then entered the {device_id} critical section, but the volume was already detached (in the default storage group). The {device_id} critical section didn't check whether the volume had been detached by another request, and tried to move the volume from the default storage group to the default storage group, thereby acquiring the lock of the default storage group twice. That is the deadlock.

One possible solution is to modify the {device_id} critical section to check the volume status and return if the volume is already detached. The risk is that the {device_id} section is common to detach/attach/create/delete, and the added check might introduce a regression for the other operations.

Instead, this patch chooses to extend the critical section, because terminate_connection checks whether the volume is detached before deciding whether it is necessary to call remove_and_reset_members. The {volume.id} lock makes operations on the same volume mutually exclusive, but it does not impact concurrent operations on multiple different volumes. In this multi-detach case, once the 1st request acquires the {volume.id} lock and completes the detach, all other requests will find the volume already detached and return directly after entering the {volume.id} section.

The multiple-lock section below is not part of the call sequence leading to the deadlock:

    @coordination.synchronized("emc-mv-{parent_name}-{serial_number}")
    @coordination.synchronized("emc-mv-{mv_name}-{serial_number}")
    @coordination.synchronized("emc-sg-{sg_name}-{serial_number}")
    def do_remove_volume_from_sg(

I agree it's better to refactor it as below to avoid the potential lock-ordering issue; it's just not critical for this bugfix:

    @coordination.synchronized("emc-mv-{parent_name}-{serial_number}",
        "emc-mv-{mv_name}-{serial_number}", "emc-sg-{sg_name}-{serial_number}")

The storage group is a shared resource for all volumes, and it should only be accessed by one request at a time.
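The failure mode Yian describes, re-acquiring a non-reentrant lock inside the same call chain, can be reproduced with nothing but the standard library. Python's threading.Lock behaves like the coordination locks described here in that a second acquire by the same holder simply blocks (a standalone illustration, not driver code):

    import threading

    sg_lock = threading.Lock()  # stands in for the 'emc-sg-...' lock


    def do_remove_volume_from_sg():
        with sg_lock:  # first acquisition ({sg_name})
            add_volume_to_default_storage_group()


    def add_volume_to_default_storage_group():
        # When sg_name == default_sg_name this is the very same lock
        # object: the second acquire never returns and the request
        # hangs, exactly as in the bug.
        with sg_lock:
            pass


    # do_remove_volume_from_sg()  # uncommenting this deadlocks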
Yian Zong (yianzong), Patch Set 2, 2025-01-16 12:30, in reply to Rajat Dhasmana

Understood. But we still need another +2 to merge.

Rajat Dhasmana (whoami-rajat), Patch Set 2, 2025-01-16 12:46, in reply to Yian Zong

I remember Jon Bernard agreed to be the second reviewer; I will ask him, otherwise I will merge this.

File comments: cinder/volume/drivers/dell_emc/powermax/fc.py
-------------------------------------------------------------

Alan Bishop (ASBishop), Patch Set 1, 2024-12-02 22:14, on line 259:

            """
            pass

        @coordination.synchronized('dell-{volume.id}')
        def initialize_connection(self, volume, connector):
            """Initializes the connection and returns connection info.

-1: this is not a good name for the lock, mainly because there are multiple dell_emc drivers.

Look at other drivers and you'll see they use "{self.driver_prefix}", so you could do that assuming you configure self.driver_prefix = "powermax" somewhere.

Yian Zong (yianzong), Patch Set 1, 2024-12-03 10:00, on line 259, in reply to Alan Bishop

Done
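The rename Alan asks for would look roughly like this (a sketch: the thread confirms the '{self.driver_prefix}' template pattern, but exactly where the real driver assigns driver_prefix is an assumption here):

    from cinder import coordination


    class PowerMaxFCDriverSketch(object):  # hypothetical stand-in class

        def __init__(self):
            # Assumed wiring: a driver-specific prefix distinguishes
            # this lock from those of the other dell_emc drivers.
            self.driver_prefix = 'powermax'

        # The template is formatted from the call arguments, including
        # 'self', yielding lock names like 'powermax-<volume uuid>'.
        @coordination.synchronized('{self.driver_prefix}-{volume.id}')
        def initialize_connection(self, volume, connector):
            pass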
