)]}' {"/COMMIT_MSG":[{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"ccdc054c625459d4ddae25584076c42b3cf8bfdc","unresolved":false,"context_lines":[{"line_number":29,"context_line":"used, as function arguments are needed to compute the lock name."},{"line_number":30,"context_line":"Instead, the \"lock\" mechanism inside oslo.lockutils, which implements a"},{"line_number":31,"context_line":"context manager, is used to gate execution of the business logic."},{"line_number":32,"context_line":""},{"line_number":33,"context_line":"Change-Id: I05cef73aa71c97a1fa0f3dee0c28b2a2b00b97d6"},{"line_number":34,"context_line":"Related-Bug: #1864122"}],"source_content_type":"text/x-gerrit-commit-message","patch_set":1,"id":"1fa4df85_665c874b","line":32,"updated":"2020-02-25 21:09:39.000000000","message":"Your detailed commit message is appreciated, thank you!\n\nThis all makes sense but do you have any anecdotal data that making this change made an impact in your deployment? If so, it would be valuable to mention that in this commit message as well. We often rely on contributor\u0027s testing as it is not so easy for us to gather data for performance related improvements such as this.","commit_id":"c1136ec63c7df564316304e01212b6c1835e8ca5"},{"author":{"_account_id":29100,"name":"Jason Anderson","email":"jasonanderson@uchicago.edu","username":"jasonanderson"},"change_message_id":"b7ebf864d9e429977c5b923b7586343a4362df82","unresolved":false,"context_lines":[{"line_number":29,"context_line":"used, as function arguments are needed to compute the lock name."},{"line_number":30,"context_line":"Instead, the \"lock\" mechanism inside oslo.lockutils, which implements a"},{"line_number":31,"context_line":"context manager, is used to gate execution of the business logic."},{"line_number":32,"context_line":""},{"line_number":33,"context_line":"Change-Id: I05cef73aa71c97a1fa0f3dee0c28b2a2b00b97d6"},{"line_number":34,"context_line":"Related-Bug: #1864122"}],"source_content_type":"text/x-gerrit-commit-message","patch_set":1,"id":"1fa4df85_876c0b37","line":32,"in_reply_to":"1fa4df85_42ccc934","updated":"2020-02-26 21:11:41.000000000","message":"I tested this out and noticed two things:\n\n1. The performance of this resource update task has gotten way better in Train (and potentially Stein). I had originally filed the bug in Rocky and hadn\u0027t tested the original bug condition thoroughly again versus my fix. I had just tested that the fix didn\u0027t regress any other behavior. So this isn\u0027t as big of a deal as it used to be! I only saw the instances getting stuck for maybe 30 seconds to a minute.\n\n2. Fair locks did improve things and from my tests the instance claim was able to obtain the lock more or less as it arrived.\n\nSo, maybe we can abandon this change. The question is, does it make sense to make Nova use fair locks everywhere? Or just in the resource tracker?","commit_id":"c1136ec63c7df564316304e01212b6c1835e8ca5"},{"author":{"_account_id":29100,"name":"Jason Anderson","email":"jasonanderson@uchicago.edu","username":"jasonanderson"},"change_message_id":"8ae8f148be22fd205b0bb9445e19fd86eb0b901c","unresolved":false,"context_lines":[{"line_number":29,"context_line":"used, as function arguments are needed to compute the lock name."},{"line_number":30,"context_line":"Instead, the \"lock\" mechanism inside oslo.lockutils, which implements a"},{"line_number":31,"context_line":"context manager, is used to gate execution of the business logic."},{"line_number":32,"context_line":""},{"line_number":33,"context_line":"Change-Id: I05cef73aa71c97a1fa0f3dee0c28b2a2b00b97d6"},{"line_number":34,"context_line":"Related-Bug: #1864122"}],"source_content_type":"text/x-gerrit-commit-message","patch_set":1,"id":"1fa4df85_e903680c","line":32,"in_reply_to":"1fa4df85_665c874b","updated":"2020-02-25 21:12:54.000000000","message":"@melanie, sure--I have deployed it recently to a ~150 node deployment. I think I need to let it sit for a bit longer to make any claims as to how it impacts things. When I understand more about possible side effects I will update the request; I mostly wanted to get eyes on it early to see if there is any chance of the approach making it upstream, as I know it\u0027s a bit scary!","commit_id":"c1136ec63c7df564316304e01212b6c1835e8ca5"},{"author":{"_account_id":29100,"name":"Jason Anderson","email":"jasonanderson@uchicago.edu","username":"jasonanderson"},"change_message_id":"f2aa659cdd3f52d6fcfc3d3f1c07c4e7a6d3f670","unresolved":false,"context_lines":[{"line_number":29,"context_line":"used, as function arguments are needed to compute the lock name."},{"line_number":30,"context_line":"Instead, the \"lock\" mechanism inside oslo.lockutils, which implements a"},{"line_number":31,"context_line":"context manager, is used to gate execution of the business logic."},{"line_number":32,"context_line":""},{"line_number":33,"context_line":"Change-Id: I05cef73aa71c97a1fa0f3dee0c28b2a2b00b97d6"},{"line_number":34,"context_line":"Related-Bug: #1864122"}],"source_content_type":"text/x-gerrit-commit-message","patch_set":1,"id":"1fa4df85_7f85e6ac","line":32,"in_reply_to":"1fa4df85_69cd385c","updated":"2020-02-26 00:58:50.000000000","message":"@Dan, you raise some good points, and I had to look over both nova\u0027s code and oslo service/lockutils to get a better idea of why I have seen what I\u0027ve seen.\n\nFor context, the behavior I witnessed was that, if I was looking at logs, I would see a message about this periodic task starting. Then, I would see tons of messages slowly come in as all of the hosts were updating. At some point, my instance claim would arrive. This could take 20 minutes to an hour+ and seemed linear w/ the number of nodes; a 300-node cluster had ~2x delay than a ~150 node cluster.\n\nGiven that your sequence example makes total sense, I am wondering how I could see that. I looked through the oslo libraries and nova and noticed two things:\n\n1. By default, lockutils locks are not \"fair\"; when the lock is released, a random thread waiting on the lock is selected to take the lock. It is not a first-in, first-out situation. So, with more things waiting on a lock, there is a relatively low probability a given task will be next to proceed, even if it arrived immediately after. Nova does not override the default \"fair\" kwarg.\n\n2. From what I can tell, locks and greenthreads don\u0027t really mix very well. I can see there is some code to monkey-patch the threading libs such that they call greenthread under the hood, but from what I can tell, that monkey-patch is only applied when running tests. Is it possible that in a greenthread situation, and only one thread (as in nova-compute\u0027s case), lockutils is effectively not doing anything? It\u0027s hard to believe that. But at least in my nova-compute, for Ironic, there is only 1 thread.\n\nSo, I agree that under a sane threading situation, it doesn\u0027t gain much (although the not-fair locks are a bit of an issue). But perhaps this isn\u0027t a sane threading situation?","commit_id":"c1136ec63c7df564316304e01212b6c1835e8ca5"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"ac34cf2204180f16c87301c10f8ec87fedf9d889","unresolved":false,"context_lines":[{"line_number":29,"context_line":"used, as function arguments are needed to compute the lock name."},{"line_number":30,"context_line":"Instead, the \"lock\" mechanism inside oslo.lockutils, which implements a"},{"line_number":31,"context_line":"context manager, is used to gate execution of the business logic."},{"line_number":32,"context_line":""},{"line_number":33,"context_line":"Change-Id: I05cef73aa71c97a1fa0f3dee0c28b2a2b00b97d6"},{"line_number":34,"context_line":"Related-Bug: #1864122"}],"source_content_type":"text/x-gerrit-commit-message","patch_set":1,"id":"1fa4df85_42ccc934","line":32,"in_reply_to":"1fa4df85_7f85e6ac","updated":"2020-02-26 14:38:13.000000000","message":"\u003e 1. By default, lockutils locks are not \"fair\"; when the lock is\n \u003e released, a random thread waiting on the lock is selected to take\n \u003e the lock. It is not a first-in, first-out situation. So, with more\n \u003e things waiting on a lock, there is a relatively low probability a\n \u003e given task will be next to proceed, even if it arrived immediately\n \u003e after. Nova does not override the default \"fair\" kwarg.\n\nHmm, okay, this would explain the behavior you see then. Obviously if the locks aren\u0027t fair then, as you say, the instance claim isn\u0027t going to be handled on the next iteration of the periodic (since it\u0027s the next thing waiting each time).\n\nI know asyncio semaphores are fair, I dunno why ours aren\u0027t by default. If you pass the fair\u003dTrue to @synchronized, does your delay go away?\n\n\n \u003e 2. From what I can tell, locks and greenthreads don\u0027t really mix\n \u003e very well. I can see there is some code to monkey-patch the\n \u003e threading libs such that they call greenthread under the hood, but\n \u003e from what I can tell, that monkey-patch is only applied when\n \u003e running tests. Is it possible that in a greenthread situation, and\n \u003e only one thread (as in nova-compute\u0027s case), lockutils is\n \u003e effectively not doing anything? It\u0027s hard to believe that. But at\n \u003e least in my nova-compute, for Ironic, there is only 1 thread.\n\nNo, we should be fully monkey-patched in nova compute:\n\nhttps://github.com/openstack/nova/blob/master/nova/monkey_patch.py#L60\n\n \u003e So, I agree that under a sane threading situation, it doesn\u0027t gain\n \u003e much (although the not-fair locks are a bit of an issue). But\n \u003e perhaps this isn\u0027t a sane threading situation?\n\nWe\u0027re not (real) threaded at all, as you say, and by design. The greenthreads do use the locks in order to handle thread switching around critical regions, there is just no actual *concurrency*.\n\nBut yes, the unfair-by-default is definitely why my sequence diagram isn\u0027t matching up with your experience. Can you try just setting fair\u003dTrue and testing for us?","commit_id":"c1136ec63c7df564316304e01212b6c1835e8ca5"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"fbb001e03b3775cdb3026df48a24e747b923cc6c","unresolved":false,"context_lines":[{"line_number":29,"context_line":"used, as function arguments are needed to compute the lock name."},{"line_number":30,"context_line":"Instead, the \"lock\" mechanism inside oslo.lockutils, which implements a"},{"line_number":31,"context_line":"context manager, is used to gate execution of the business logic."},{"line_number":32,"context_line":""},{"line_number":33,"context_line":"Change-Id: I05cef73aa71c97a1fa0f3dee0c28b2a2b00b97d6"},{"line_number":34,"context_line":"Related-Bug: #1864122"}],"source_content_type":"text/x-gerrit-commit-message","patch_set":1,"id":"1fa4df85_4a0e0e33","line":32,"in_reply_to":"1fa4df85_876c0b37","updated":"2020-02-27 14:42:52.000000000","message":"I think this is a fairly unique situation, specifically because nova wasn\u0027t designed to have this multiple node per compute host situation -- it was bolted on later. I would think that if you want to mark the RT lock fair in another change, that\u0027d be an easy sell and then someone (you, if you\u0027re interested) could survey the rest. Just quickly thinking about it, I can\u0027t really think of any other places it would matter a whole lot, other than perhaps our build limit semaphores, which form somewhat of a queue. But even still, those are all \"equal\" in importance, so not as much of a \"background task is starving real work\" situation as this one.","commit_id":"c1136ec63c7df564316304e01212b6c1835e8ca5"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"a9d279b16f81147bd629bc99c35274d8e15f5ccf","unresolved":false,"context_lines":[{"line_number":29,"context_line":"used, as function arguments are needed to compute the lock name."},{"line_number":30,"context_line":"Instead, the \"lock\" mechanism inside oslo.lockutils, which implements a"},{"line_number":31,"context_line":"context manager, is used to gate execution of the business logic."},{"line_number":32,"context_line":""},{"line_number":33,"context_line":"Change-Id: I05cef73aa71c97a1fa0f3dee0c28b2a2b00b97d6"},{"line_number":34,"context_line":"Related-Bug: #1864122"}],"source_content_type":"text/x-gerrit-commit-message","patch_set":1,"id":"1fa4df85_69cd385c","line":32,"in_reply_to":"1fa4df85_c99a2cc4","updated":"2020-02-25 21:44:09.000000000","message":"I\u0027m a little dubious about the gains, if any here. I know what you\u0027re going for and this combined with other changes may make some real gains. I think the only thing you accomplish by making this one change is potentially reducing some latency for a boot waiting for the semaphore while the periodic task finishes its update of some other node. So, the following sequence example for the single lock case:\n\n1. Periodic task starts updating node12, under lock\n2. A boot comes in targeted for node7, blocks on lock\n3. Periodic task finishes node12, releases lock, loops around to node13\n4. The boot is waiting in order, so it gets the lock, starts the boot on node7, then releases the lock\n5. The periodic for loop gets granted the lock and goes on to update node13\n\nSo, there\u0027s some latency introduced between 2 and 3, which this would cut out, but since you only have to wait for one node update, it should be a pretty small amount of absolute time. Since we lock and unlock for each node as we roll through the RT nodes, the time to wait is not longer as the number of nodes increases.\n\nIn order to make the periodic actually do more work, we need a threadpool (and conf knob for the size) and a change to how we do the resource updates so that we delegate all the nodes to independent workers in the pool, let them lock on nodename and actually run in parallel, etc.\n\nDepending on the situation, however, that may or may not actually help things. With a thousand ironic nodes on a single compute and a pool of decent size. It just means you\u0027re offering up a lot more resource to spend busy looping. On compute startup, that\u0027s worthwhile to reduce the time it takes to start up, but at steady-state, it\u0027s likely to be worthwhile. And, of course, ironic may or may not be able to actually handle that load faster in parallel. It would be a lot better to let nova shard the computes natively, as it can do, and reduce the burden on each compute service.\n\nSo, anyway, back to this change. In principle I guess this is not necessarily harmful (although I haven\u0027t reviewed in depth to decide if it\u0027s safe to do). However, unless I\u0027m missing some other interaction you\u0027re improving here, I don\u0027t think it gains us much. What am I missing?","commit_id":"c1136ec63c7df564316304e01212b6c1835e8ca5"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"7be5d59072771f30f565b410c93fedbde4140179","unresolved":false,"context_lines":[{"line_number":29,"context_line":"used, as function arguments are needed to compute the lock name."},{"line_number":30,"context_line":"Instead, the \"lock\" mechanism inside oslo.lockutils, which implements a"},{"line_number":31,"context_line":"context manager, is used to gate execution of the business logic."},{"line_number":32,"context_line":""},{"line_number":33,"context_line":"Change-Id: I05cef73aa71c97a1fa0f3dee0c28b2a2b00b97d6"},{"line_number":34,"context_line":"Related-Bug: #1864122"}],"source_content_type":"text/x-gerrit-commit-message","patch_set":1,"id":"1fa4df85_c99a2cc4","line":32,"in_reply_to":"1fa4df85_e903680c","updated":"2020-02-25 21:19:21.000000000","message":"That sounds great Jason, thanks. I am aware of the problem of ironic deployments taking A Very Long Time (tm) to process as update because of the one to many mapping of nova-compute to ironic nodes + this particular lock and have long wondered if locking instead on hypervisor_hostname (ironic node name) would be a good way to solve it, but obviously that never happened.\n\nSo, I do wonder if there was a discussion and good reason we did not implement such a solution in the past, that I missed. I\u0027m adding Dan Smith to this review as I think he\u0027s our best bet for recalling whether there was any past discussion around this upstream and any reason not to go ahead and do this.\n\nI think this proposal makes sense otherwise and if there\u0027s no reason not to go ahead, I\u0027d support making this change.","commit_id":"c1136ec63c7df564316304e01212b6c1835e8ca5"}],"nova/compute/resource_tracker.py":[{"author":{"_account_id":29100,"name":"Jason Anderson","email":"jasonanderson@uchicago.edu","username":"jasonanderson"},"change_message_id":"655724900ca3196bc2ab2aade29cbe611c97a9cf","unresolved":false,"context_lines":[{"line_number":1710,"context_line":" def build_succeeded(self, nodename):"},{"line_number":1711,"context_line":" \"\"\"Resets the failed_builds stats for the given node.\"\"\""},{"line_number":1712,"context_line":" self.stats[nodename].build_succeeded()"},{"line_number":1713,"context_line":""},{"line_number":1714,"context_line":" def claim_pci_devices(self, context, pci_requests, instance):"},{"line_number":1715,"context_line":" \"\"\"Claim instance PCI resources"},{"line_number":1716,"context_line":""}],"source_content_type":"text/x-python","patch_set":1,"id":"1fa4df85_a9f7b0f5","line":1713,"updated":"2020-02-25 21:22:21.000000000","message":"This change I was not sure about, as I don\u0027t use this PCI functionality and it seems new. The problem was, I needed access to the nodename here, but unlike most of the functions, it is not passed in. However, the instance (should) have it. Fortunately, the caller is able to pass this argument, though it kind of makes the abstraction muddier :/","commit_id":"c1136ec63c7df564316304e01212b6c1835e8ca5"}]}