{"/PATCHSET_LEVEL":[{"author":{"_account_id":26250,"name":"Johannes Kulik","email":"johannes.kulik@sap.com","username":"jkulik"},"change_message_id":"c7844d89c448086bd54d74f502869e5a6f6bc6fd","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":1,"id":"f20b9daa_442a79f4","updated":"2024-07-26 10:38:27.000000000","message":"FYI, this is what we went with in the end: https://github.com/sapcc/placement/pull/5/files\n\nWe mocked away the `independent` for the functional tests, because running with SQLite does a rollback at the end of the independent block, removing the previously-applied deletions.","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"},{"author":{"_account_id":26250,"name":"Johannes Kulik","email":"johannes.kulik@sap.com","username":"jkulik"},"change_message_id":"df422103957d5e103d6c2e6a429318ac5996f4fc","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":1,"id":"6b933166_fd50e539","updated":"2024-07-17 14:51:36.000000000","message":"I was looking into this for the last few hours and I\u0027m wondering about the following: On a retry, we create a new transaction to refresh the RPs in `replace_all()`. Then, in `_check_capacity_exceeded()` we create another new transaction to query the usage again. There\u0027s time between those 2 transactions, so the values we read in the second, usage-retrieving transaction might not belong to the RP generations we read in the first transaction.\n\nWe do query out the RP generations in `_check_capacity_exceeded()` but then do not use them at all and return the RP objects we were provided with initially instead of updating them.\n\nAfaics, this should not be a problem, because any update would let us run into another issue with generations. 
So updating the RPs in the same transaction in which we get the usage would only reduce the possibility of another, pretty unlikely race condition.","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"},{"author":{"_account_id":9542,"name":"Pavlo Shchelokovskyy","email":"pshchelokovskyy@mirantis.com","username":"pshchelo"},"change_message_id":"4a3a652e6b7cc57dad89c25a0c048bf0d5ce62c1","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":1,"id":"c7bb5549_5fcb4518","updated":"2023-10-24 16:31:04.000000000","message":"I will try to locally adapt the functional tests to run with a proper SQL db instead of sqlite (as many other services do even for some unit tests) and see if it changes anything.","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"},{"author":{"_account_id":26250,"name":"Johannes Kulik","email":"johannes.kulik@sap.com","username":"jkulik"},"change_message_id":"16924ca8117d8581156df28f3fd58a2d5abfb6e9","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":1,"id":"5b01c82e_c4d1f3f1","updated":"2024-07-24 14:20:44.000000000","message":"If we read outside of our current transaction, shouldn\u0027t we then also see our current consumer\u0027s amount in case we do a replacement of the consumer\u0027s resources?\n\nAfaics, we previously deleted the allocations of the consumers that we\u0027re trying to update allocations for (first step in `_set_allocations()`). We then checked capacity afterwards. If capacity was available, we added the new allocations for the consumers.\n\nBy doing the capacity check outside of our transaction, we would basically skip the first \"delete the allocations of the consumers that we\u0027re trying to update allocations for\" step. 
Therefore, with this patch the behavior should change: we would report exceeded capacity when we previously wouldn\u0027t.\n\nGiven that, I think rolling back to a higher level and redoing everything would be the better option.\nAlternatively, we could exclude the allocations of our consumer from the query running in the independent transaction.","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"},{"author":{"_account_id":9542,"name":"Pavlo Shchelokovskyy","email":"pshchelokovskyy@mirantis.com","username":"pshchelo"},"change_message_id":"dace8d8e38392bfc8cd9107ec8d13fbcc83d5585","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":1,"id":"272a1259_f7cde9d4","updated":"2023-10-24 16:29:38.000000000","message":"note on grenade failure\n\nclearly has nothing to do with placement.\nfailed because it could not ssh into the instance\n\nthe instance was declared ACTIVE at 11:57:40\nby the time the ssh connection retries were exhausted and the instance console log was collected at 11:58:52, according to said console log the instance had not yet completed booting (it had not reached the login prompt); sshd had just started (next-to-last instance log line).\n\nslow instance/disk/cloud?","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"},{"author":{"_account_id":9542,"name":"Pavlo Shchelokovskyy","email":"pshchelokovskyy@mirantis.com","username":"pshchelo"},"change_message_id":"344d2ad6fa142071bee289a44ec818876bde80f1","unresolved":true,"context_lines":[],"source_content_type":"","patch_set":1,"id":"5328afa1_c49115d6","updated":"2023-11-29 12:17:40.000000000","message":"somehow in our internal testing, this patch started to fail certain scenarios in exactly the same way it should\u0027ve prevented.\n\nthe exact scenario was something like\n1. proper multinode env with 3 controller nodes and 3 computes\n  a. small, possibly related caveat - our envs have `host_subset_size \u003d 3` so essentially the scheduling is random on the nova side.\n2. 
3 compute nodes filled to 2/3 of capacity\n3. live-evacuate nodes one-by-one\n\nwith this patch, 3 out of 3 such runs failed at the last node because it appeared to be overprovisioned, and nothing could be migrated from it.\nw/o this patch, the failures were gone.","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"},{"author":{"_account_id":26250,"name":"Johannes Kulik","email":"johannes.kulik@sap.com","username":"jkulik"},"change_message_id":"df422103957d5e103d6c2e6a429318ac5996f4fc","unresolved":true,"context_lines":[],"source_content_type":"","patch_set":1,"id":"c3f208b9_b734e691","in_reply_to":"5328afa1_c49115d6","updated":"2024-07-17 14:51:36.000000000","message":"Do you have a theory of how this could happen? I\u0027m trying to wrap my head around what you say and it doesn\u0027t make sense in any way. How can it even be happening consistently every time?\nCould it be related to the DBMS being used?","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"},{"author":{"_account_id":26250,"name":"Johannes Kulik","email":"johannes.kulik@sap.com","username":"jkulik"},"change_message_id":"726f984d4c5dbab761d0adabd7e445466d2fa37d","unresolved":true,"context_lines":[],"source_content_type":"","patch_set":1,"id":"c8a86434_07ae64a9","in_reply_to":"c3f208b9_b734e691","updated":"2024-07-23 09:03:08.000000000","message":"Looking into this further, a lot of functional tests fail with this patch and the default sqlite backend. 
Having the `independent` transaction in `_check_capacity_exceeded()` seems to revert the deletion of all allocations for the consumer.\n```\nbefore\n\u003e /src/placement/placement/objects/allocation.py(168)_check_capacity_exceeded()\n-\u003e with db_api.placement_context_manager.reader.independent.using(ctx):\n(Pdb) get_all_by_consumer_id(ctx, allocs[0].consumer.uuid)\n[]\n(Pdb) c\nafter\n\u003e /src/placement/placement/objects/allocation.py(183)_check_capacity_exceeded()\n-\u003e missing_provs \u003d provider_uuids - provs_with_inv\n(Pdb) get_all_by_consumer_id(ctx, allocs[0].consumer.uuid)\n[\u003cplacement.objects.allocation.Allocation object at 0x7f0dea8cce50\u003e, \u003cplacement.objects.allocation.Allocation object at 0x7f0dea8ccdf0\u003e]\n```","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"}],"placement/objects/allocation.py":[{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"b4e1c00fb93401dcbee45e6af91e500c2fdf36b5","unresolved":true,"context_lines":[{"line_number":163,"context_line":"    # database transaction when we reach here. 
We want to get an"},{"line_number":164,"context_line":"    # up-to-date usage value in case a racing request has"},{"line_number":165,"context_line":"    # changed it after we began an outer transaction."},{"line_number":166,"context_line":"    with db_api.placement_context_manager.reader.independent.using(ctx):"},{"line_number":167,"context_line":"        records \u003d ctx.session.execute(sel)"},{"line_number":168,"context_line":"    # Create a map keyed by (rp_uuid, res_class) for the records in the DB"},{"line_number":169,"context_line":"    usage_map \u003d {}"}],"source_content_type":"text/x-python","patch_set":1,"id":"4f038de5_5a754650","line":166,"range":{"start_line":166,"start_character":42,"end_line":166,"end_character":48},"updated":"2023-10-19 18:17:26.000000000","message":"Just a data point, but I tried using `writer` here and it got much closer to working ... but I don\u0027t understand why. With `writer` there are still 13 func test failures, and a lot of them are ones where the test expected a concurrent update failure but didn\u0027t get one. I\u0027m not sure if that should be expected given the change, so it might be a dead end.\n\nAnyway, I wanted to mention that in case it sparks any other ideas about the problem. I\u0027ll continue looking into it because I want to understand what is wrong with this :P","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"da6ef018d7cf121bf3001c66360c10cd0909f4a7","unresolved":false,"context_lines":[{"line_number":163,"context_line":"    # database transaction when we reach here. 
We want to get an"},{"line_number":164,"context_line":"    # up-to-date usage value in case a racing request has"},{"line_number":165,"context_line":"    # changed it after we began an outer transaction."},{"line_number":166,"context_line":"    with db_api.placement_context_manager.reader.independent.using(ctx):"},{"line_number":167,"context_line":"        records \u003d ctx.session.execute(sel)"},{"line_number":168,"context_line":"    # Create a map keyed by (rp_uuid, res_class) for the records in the DB"},{"line_number":169,"context_line":"    usage_map \u003d {}"}],"source_content_type":"text/x-python","patch_set":1,"id":"3215f74e_e7da47de","line":166,"range":{"start_line":166,"start_character":42,"end_line":166,"end_character":48},"in_reply_to":"34bf0bb9_5f026534","updated":"2023-10-21 01:32:45.000000000","message":"OK, apparently I was too tired last night and thought there was a bug in the existing path for updating consumers + setting allocations ... but it turned out to be an error I was making in my testing 😣 When I tested more things today I realized my error and I don\u0027t see any issues anymore.\n\nSorry for all the noise. I\u0027m going to mark this thread as Resolved to hopefully keep it out of the way.","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"b61ab66a15e5103bb516c8af7319cf097226cfc6","unresolved":true,"context_lines":[{"line_number":163,"context_line":"    # database transaction when we reach here. 
We want to get an"},{"line_number":164,"context_line":"    # up-to-date usage value in case a racing request has"},{"line_number":165,"context_line":"    # changed it after we began an outer transaction."},{"line_number":166,"context_line":"    with db_api.placement_context_manager.reader.independent.using(ctx):"},{"line_number":167,"context_line":"        records \u003d ctx.session.execute(sel)"},{"line_number":168,"context_line":"    # Create a map keyed by (rp_uuid, res_class) for the records in the DB"},{"line_number":169,"context_line":"    usage_map \u003d {}"}],"source_content_type":"text/x-python","patch_set":1,"id":"34bf0bb9_5f026534","line":166,"range":{"start_line":166,"start_character":42,"end_line":166,"end_character":48},"in_reply_to":"42826b13_3725c370","updated":"2023-10-20 07:42:27.000000000","message":"Update: I spent more time looking into this and reloading context and I think I may have found a bug in the present code, in the path between updating consumers and setting allocations; AFAICT, based on some local testing the consumer update and allocation setting may _not_ actually be happening in the same database transaction 🫤 If that\u0027s the case, my earlier concern about keeping the consumer update and allocation set in the same transaction would be moot and a fix for bug 2039560 wouldn\u0027t need to take that into account.","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"10dc4768b6691035b533a4db8138d12039724cec","unresolved":true,"context_lines":[{"line_number":163,"context_line":"    # database transaction when we reach here. 
We want to get an"},{"line_number":164,"context_line":"    # up-to-date usage value in case a racing request has"},{"line_number":165,"context_line":"    # changed it after we began an outer transaction."},{"line_number":166,"context_line":"    with db_api.placement_context_manager.reader.independent.using(ctx):"},{"line_number":167,"context_line":"        records \u003d ctx.session.execute(sel)"},{"line_number":168,"context_line":"    # Create a map keyed by (rp_uuid, res_class) for the records in the DB"},{"line_number":169,"context_line":"    usage_map \u003d {}"}],"source_content_type":"text/x-python","patch_set":1,"id":"42826b13_3725c370","line":166,"range":{"start_line":166,"start_character":42,"end_line":166,"end_character":48},"in_reply_to":"4f038de5_5a754650","updated":"2023-10-19 23:32:59.000000000","message":"(later) Please ignore my comment ^. Thinking about this more and remembering issues we\u0027ve seen before in the past with regard to the concurrent update retry, it seems like the \"best\" way to handle it would be to reraise `ResourceProviderConcurrentUpdateDetected` if we are already within a transaction and then have a similar retry logic around the outermost transaction.\n\nI remember thinking about this when I worked on the resource provider generation update and concluding (perhaps erroneously) that checking for a generation conflict in a separate transaction was \"safe\" and wouldn\u0027t expose us to partial updates or other problems.\n\nSo TBH I\u0027m thinking about whether we should just remove the independent transaction I had added for the resource provider generation and do the reraise and handle `ResourceProviderConcurrentUpdateDetected` higher up with retries.\n\nLast time I thought about it I thought it\u0027s kind of ugly but there are two ways that `replace_all()` is called, one where it\u0027s the outermost call and one where it is called within an already started database transaction. 
Given that, I think it\u0027s valid to handle retries in both scenarios, but of course it adds complexity.\n\nStill, at this point I\u0027m wondering if it might be better to go that route, and that we probably should have done it that way the last time we hit this problem of retrying database transactions. Because what is happening now may be a whack-a-mole situation.","commit_id":"30a815c1172236f3b4abf3409d70dedf32cc5541"}]}
