)]}' {"swift/common/manager.py":[{"author":{"_account_id":34930,"name":"Jianjian Huo","email":"jhuo@nvidia.com","username":"jhuo"},"change_message_id":"2ffc6c79136508b21f60d7214d5392875ec687b0","unresolved":true,"context_lines":[{"line_number":855,"context_line":" # background on the command line, I generally just run with"},{"line_number":856,"context_line":" # no-daemon anyway, but this is quieter"},{"line_number":857,"context_line":" try:"},{"line_number":858,"context_line":" proc.wait(30)"},{"line_number":859,"context_line":" except subprocess.TimeoutExpired:"},{"line_number":860,"context_line":" print(\u0027WARNING: killing long running daemon ran once: %r\u0027"},{"line_number":861,"context_line":" % proc.args)"}],"source_content_type":"text/x-python","patch_set":1,"id":"d13f2610_b38054b4","line":858,"updated":"2024-11-08 20:53:13.000000000","message":"some daemons call run_once within run_forever(), will run_once() be killed if one run takes longer than 30 seconds?","commit_id":"29ff44042bfe51cde0f4169d437dcf86f8411210"},{"author":{"_account_id":7847,"name":"Alistair Coles","email":"alistairncoles@gmail.com","username":"acoles"},"change_message_id":"3cee6455acddb2e4558d479e48fdca4d3b060a0f","unresolved":true,"context_lines":[{"line_number":855,"context_line":" # background on the command line, I generally just run with"},{"line_number":856,"context_line":" # no-daemon anyway, but this is quieter"},{"line_number":857,"context_line":" try:"},{"line_number":858,"context_line":" proc.wait(30)"},{"line_number":859,"context_line":" except subprocess.TimeoutExpired:"},{"line_number":860,"context_line":" print(\u0027WARNING: killing long running daemon ran once: %r\u0027"},{"line_number":861,"context_line":" % proc.args)"}],"source_content_type":"text/x-python","patch_set":1,"id":"aaf829d2_2114ad0a","line":858,"in_reply_to":"4b3973f6_3a284c7c","updated":"2024-11-12 10:34:13.000000000","message":"The tests I regularly see hang on feature/mpu seem to be waiting for a relinker process to finish\n\ne.g. ```\nNov 6 17:09:35 np0038989124 object-relinker[5563]: Starting relinker (cleanup\u003dTrue) using 1 workers: 17:09:35 11/06/24 GMT\nNov 6 17:09:35 np0038989124 object-relinker[49302]: [pid\u003d49302, devs\u003dsdb1,sdb5] Processing files for policy gold under /srv/1/node (cleanup\u003dTrue)\n```\n\n(no \"Finished relinker...\")\n\nfrom https://0481dfb82f52810b42aa-4f1778ef2522ea4afa6a2ea8762429d6.ssl.cf1.rackcdn.com/931517/17/check/swift-probetests-centos-9-stream/2bf19aa/storage1.log\n\nhttps://zuul.opendev.org/t/openstack/build/2bf19aa4244241228d48485588e76272\n\nIIUC the relinker is a standalone cli that isn\u0027t run by ``Manager``, so it may be the same underlying issue but I don\u0027t think this change is going to fix it.","commit_id":"29ff44042bfe51cde0f4169d437dcf86f8411210"},{"author":{"_account_id":15343,"name":"Tim Burke","email":"tburke@nvidia.com","username":"tburke"},"change_message_id":"9d61e430527f1d74576b924d3ee09142ac799ae4","unresolved":true,"context_lines":[{"line_number":855,"context_line":" # background on the command line, I generally just run with"},{"line_number":856,"context_line":" # no-daemon anyway, but this is quieter"},{"line_number":857,"context_line":" try:"},{"line_number":858,"context_line":" proc.wait(30)"},{"line_number":859,"context_line":" except subprocess.TimeoutExpired:"},{"line_number":860,"context_line":" print(\u0027WARNING: killing long running daemon ran once: %r\u0027"},{"line_number":861,"context_line":" % proc.args)"}],"source_content_type":"text/x-python","patch_set":1,"id":"c2aa6be3_f91b56bb","line":858,"in_reply_to":"aaf829d2_2114ad0a","updated":"2024-11-12 18:27:48.000000000","message":"Huh. Spurs several questions:\n\n1. Where\u0027s the `post_partition_hook` message? Previous run looks like\n```\nNov 6 17:09:33 np0038989124 object-relinker[5563]: Starting relinker (cleanup\u003dFalse) using 1 workers: 17:09:33 11/06/24 GMT\nNov 6 17:09:33 np0038989124 object-relinker[49271]: [pid\u003d49271, devs\u003dsdb1,sdb5] Processing files for policy gold under /srv/1/node (cleanup\u003dFalse)\nNov 6 17:09:33 np0038989124 object-relinker[49271]: [pid\u003d49271, devs\u003dsdb1,sdb5] Step: relink Device: sdb1 Policy: gold Partitions: 1/1\nNov 6 17:09:33 np0038989124 object-relinker[49271]: [pid\u003d49271, devs\u003dsdb1,sdb5] 1 hash dirs processed (cleanup\u003dFalse) (1 files, 1 linked, 0 removed, 0 errors)\nNov 6 17:09:33 np0038989124 object-relinker[5563]: Finished relinker (cleanup\u003dFalse): 17:09:33 11/06/24 GMT (0:00:00.088019 elapsed)\n```\nThe one that died ends with\n```\nNov 6 17:09:35 np0038989124 object-relinker[5563]: Starting relinker (cleanup\u003dTrue) using 1 workers: 17:09:35 11/06/24 GMT\nNov 6 17:09:35 np0038989124 object-relinker[49302]: [pid\u003d49302, devs\u003dsdb1,sdb5] Processing files for policy gold under /srv/1/node (cleanup\u003dTrue)\n```\nso where\u0027s the `Step: ... Device: ...` line? Makes me doubt that it\u0027s a case of https://github.com/eventlet/eventlet/issues/989 ...\n\n2. Why are we running with one worker? I thought the default was `auto`, which should scale with the number of drives?\n\n3. Why are we bothering to spawn just one worker? Why do we [special case `len(device_list) in (0, 1)`](https://github.com/openstack/swift/blob/2.34.0/swift/cli/relinker.py#L680) to run in the base manager process, but not `workers \u003d\u003d 1`?\n\nDefinitely warrants further investigation -- I think we\u0027re looking at multiple bugs :-/\n\nI think https://review.opendev.org/c/openstack/swift/+/934813 might be better/more targeted for the hang that I *have* been on a node for in CI. I\u0027ll keep rechecking https://review.opendev.org/c/openstack/swift/+/934350 on top of it to try to find more hangs.","commit_id":"29ff44042bfe51cde0f4169d437dcf86f8411210"},{"author":{"_account_id":15343,"name":"Tim Burke","email":"tburke@nvidia.com","username":"tburke"},"change_message_id":"3bb6856f2cc1704d9c202b15b18def9b99a0b166","unresolved":true,"context_lines":[{"line_number":855,"context_line":" # background on the command line, I generally just run with"},{"line_number":856,"context_line":" # no-daemon anyway, but this is quieter"},{"line_number":857,"context_line":" try:"},{"line_number":858,"context_line":" proc.wait(30)"},{"line_number":859,"context_line":" except subprocess.TimeoutExpired:"},{"line_number":860,"context_line":" print(\u0027WARNING: killing long running daemon ran once: %r\u0027"},{"line_number":861,"context_line":" % proc.args)"}],"source_content_type":"text/x-python","patch_set":1,"id":"4b3973f6_3a284c7c","line":858,"in_reply_to":"d13f2610_b38054b4","updated":"2024-11-08 22:58:37.000000000","message":"Good thing to check on! So [`Daemon.run_once`](https://github.com/openstack/swift/blob/2.34.0/swift/common/daemon.py#L54) is separate from [`Manager.once`](https://github.com/openstack/swift/blob/2.34.0/swift/common/manager.py#L320). `Manager` gets used by probe tests to spawn subprocesses which eventually call `run_once`/`run_forever`.\n\nThough `Manager` *also* gets used by `swift-init` -- maybe I should take a little more care to not impact operators, who might be using `swift-init` to run a targeted replication cycle or something. I think we at NVIDIA would probably use `swift-object-replicator` directly rather than `swift-init`, but IDK about what other people might do...\n\nI think we could add some new `wait_timeout` kwarg or something to not impact `swift-init`, but plumbing it into all the probe tests seems annoying -- there\u0027s like ~300 places we call `once` in probe tests...","commit_id":"29ff44042bfe51cde0f4169d437dcf86f8411210"}]}