)]}' {"/COMMIT_MSG":[{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"45f4c9db0fe033b86620458c88bd44272a0cecfe","unresolved":true,"context_lines":[{"line_number":6,"context_line":""},{"line_number":7,"context_line":"Reduce concurrency for tempest jobs"},{"line_number":8,"context_line":""},{"line_number":9,"context_line":"Recently many jobs are failing with timeout and one of the suggestion"},{"line_number":10,"context_line":"as per [1] is to reduce the concurrency of the jobs. Current concurrency"},{"line_number":11,"context_line":"is 6, setting it to 3 now."},{"line_number":12,"context_line":""}],"source_content_type":"text/x-gerrit-commit-message","patch_set":2,"id":"5e5a5716_b5921777","line":9,"range":{"start_line":9,"start_character":0,"end_line":9,"end_character":43},"updated":"2023-09-09 01:03:26.000000000","message":"I am not sure how the job timeout is caused by the concurrency. Increasing concurrency reduces timeouts because the tests run in parallel across more workers. But yes, as Dan mentioned, it can cause more OOM issues, and if there are OOM issues then we can reduce it. Otherwise, I think we are not debugging the issue in the right direction and are just decreasing the concurrency to hide the actual issues.","commit_id":"6d8f1248745632da090fde5e1319b70a50526563"}],"/PATCHSET_LEVEL":[{"author":{"_account_id":8122,"name":"Cyril Roelandt","email":"cyril@redhat.com","username":"cyril.roelandt.enovance"},"change_message_id":"40ad1e5bd09f71871ebad7ccb6ed26299ff46ba9","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":1,"id":"aafd5bcd_ba59b9f3","updated":"2023-09-06 16:41:46.000000000","message":"So tempest-integrated-storage-enforce-scope-new-defaults failed with a timeout (due to a kernel panic, it seems). 
Is this unrelated to what we\u0027re trying to fix here?","commit_id":"2cf715d788c7ef4207ff37aac73701ffa32ecb5d"},{"author":{"_account_id":9303,"name":"Abhishek Kekane","email":"akekane@redhat.com","username":"abhishekkekane"},"change_message_id":"cbe0ea4ade3555a39e5e62f5b4aff0d0882ea483","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":1,"id":"3f7c3fd4_3458dc66","updated":"2023-09-07 07:52:44.000000000","message":"recheck\n\ndifferent timeout issue","commit_id":"2cf715d788c7ef4207ff37aac73701ffa32ecb5d"},{"author":{"_account_id":10459,"name":"Luigi Toscano","email":"ltoscano@redhat.com","username":"ltoscano"},"change_message_id":"e9b2016af96de1a2d58e8b0ee689c00b70020bf5","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":1,"id":"eedd7d65_fdb421cf","in_reply_to":"aafd5bcd_ba59b9f3","updated":"2023-09-07 07:54:07.000000000","message":"Yes, unrelated. The job changed here (and probably the commit message should mention the name of the job) is tempest-integrated-storage-import, while the job which failed, tempest-integrated-storage-enforce-scope-new-defaults, derives from tempest-integrated-storage and is not affected by this change.","commit_id":"2cf715d788c7ef4207ff37aac73701ffa32ecb5d"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"45f4c9db0fe033b86620458c88bd44272a0cecfe","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":2,"id":"189a172f_3640c049","updated":"2023-09-09 01:03:26.000000000","message":"Honestly, I disagree with the claim that the timeouts are caused by concurrency and that reducing it will fix the world. 
Reducing it mostly hides the issues instead of debugging and fixing the actual problem.","commit_id":"6d8f1248745632da090fde5e1319b70a50526563"},{"author":{"_account_id":8122,"name":"Cyril Roelandt","email":"cyril@redhat.com","username":"cyril.roelandt.enovance"},"change_message_id":"a0e885ff661750accd2935c69d79289ec58bc833","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":2,"id":"3f4e7b9d_24d6ac1f","updated":"2023-09-07 14:46:45.000000000","message":"Ninja-approving this - we will revert it if this does not fix our issues.","commit_id":"6d8f1248745632da090fde5e1319b70a50526563"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"efba621417bdfd31cb3127b420702d2e7adb2e81","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":2,"id":"d60f827c_48286bb2","updated":"2023-09-08 13:43:10.000000000","message":"The most recent failed run of this had an OOM and a job timeout. Reducing the concurrency usually *helps* with OOMs if pressure is high and makes timeouts *worse*. With only three workers, there is a much better chance that one worker will receive an outsized portion of the tests and thus the whole job reverts to effectively single-threaded, which takes longer and increases the chance for timeouts. In fact the tempest concurrency was *raised* to 6 to reduce timeouts for this exact reason. We lower it for some jobs that have a ton of stuff running and thus higher memory pressure, like the ceph job.\n\nIf you notice the worker profile from the job that timed out on PS2, the first worker was the only one doing anything for about a quarter of the runtime:\n\nhttps://4b3c52423d412e2d9836-485520e0725c5c3e0cb7e09d4d5f1a24.ssl.cf2.rackcdn.com/893855/2/gate/tempest-integrated-storage-import/082a19b/controller/logs/stackviz/index.html#/stdin/timeline\n\nAnd worker 2 was only busy for about 60% of the run. 
That\u0027s the sort of pattern where you get timeouts in the middle of tempest.\n\nCan we revisit some specific numbers (like via opensearch) after this patch has been in for a while to show that it\u0027s improving things and decide if we can/should revert? Some of us have spent a lot of time analyzing these failure patterns and that becomes a lot harder if every job on every project ends up with bespoke tuning knobs. If they\u0027re really needed then fine, but I\u0027m not sure what about the storage import jobs really needs a halving of the concurrency. Is there a specific theory about the reasoning here?","commit_id":"6d8f1248745632da090fde5e1319b70a50526563"},{"author":{"_account_id":9303,"name":"Abhishek Kekane","email":"akekane@redhat.com","username":"abhishekkekane"},"change_message_id":"513600906e49176975590de00e433ca628835af9","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":2,"id":"12e3b980_c4b7ae3d","updated":"2023-09-08 04:42:58.000000000","message":"recheck db related failure","commit_id":"6d8f1248745632da090fde5e1319b70a50526563"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"165a788fe8ec216f73cbbc5fc53d606a8c50c3da","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":2,"id":"3970c3b2_f9f59b1d","in_reply_to":"802d0f38_9de68fcf","updated":"2023-09-08 14:48:06.000000000","message":"Yep, the concurrency was increased not to address general job timeouts, but timeouts in the middle of running (and making progress in) tempest. 
In almost all of those cases, we\u0027ll see that we\u0027ve effectively dropped down to a single worker thread for running the tests because all the others have finished.\n\nFailures like the second one above are not load related - that is a case where the guest crashed which of course ties us up for a long time because it becomes unresponsive, can\u0027t detach the volume (because it\u0027s dead) and of course the test fails during cleanup.\n\nMost of the volume ones I\u0027ve chased down in depth also amount to things that aren\u0027t load related either, because we see those (and have seen those) for years due to things like waiting 60s for LVM locks, etc. Which resources in those other two cases do you think are exhausted? Not memory because there\u0027s no OOM...","commit_id":"6d8f1248745632da090fde5e1319b70a50526563"},{"author":{"_account_id":10459,"name":"Luigi Toscano","email":"ltoscano@redhat.com","username":"ltoscano"},"change_message_id":"f17c89760c2d01d404fbcd88ff76bb4a0742025e","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":2,"id":"802d0f38_9de68fcf","in_reply_to":"d60f827c_48286bb2","updated":"2023-09-08 13:57:39.000000000","message":"The most recent failures of tempest-integrated-storage-import, which started this (see patch 1) are all about the job completing in time and failing on some tests for timeouts on some operations:\nhttps://ff7549a1676b7e0abf4f-a8638dce9de2f2c46b7008346a40a9c5.ssl.cf5.rackcdn.com/889054/2/check/tempest-integrated-storage-import/40c4ed1/testr_results.html\nhttps://d62e47aec6a5f4ffd041-3849b03cfcb04d12316556dd4e593ed3.ssl.cf2.rackcdn.com/889054/2/check/tempest-integrated-storage-import/3c61f6a/testr_results.html\nhttps://7d001b464d1c8a16de46-0757739227409b7b3e2b1ef597883c12.ssl.cf1.rackcdn.com/889053/2/check/tempest-integrated-storage-import/84a6ff6/testr_results.html\n\nSo this is not a case of a general job timeout. 
It is a case of resource reduction.","commit_id":"6d8f1248745632da090fde5e1319b70a50526563"}]}