)]}' {"doc/source/developer/specs/nodepool-in-zuul.rst":[{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"6e27d5e88992e0e01de0ce8ba5f12b9b929621f7","unresolved":true,"context_lines":[{"line_number":228,"context_line":""},{"line_number":229,"context_line":"After a buildset reports with ``image-built: true``, Zuul will scan"},{"line_number":230,"context_line":"result data and for each artifact it finds, it will create an entry in"},{"line_number":231,"context_line":"ZooKeeper at `/zuul/images/\u003cimage_name\u003e/\u003cuuid\u003e`. Zuul will know"},{"line_number":232,"context_line":"not to emit any more `image-build` events for that image at this"},{"line_number":233,"context_line":"point."},{"line_number":234,"context_line":""}],"source_content_type":"text/x-rst","patch_set":1,"id":"580adb07_0d71fe2a","line":231,"range":{"start_line":231,"start_character":41,"end_line":231,"end_character":45},"updated":"2024-06-24 20:14:47.000000000","message":"Does switching to a uuid instead of a sequence imply we will have to scan all uuid\u0027s to determine which is oldest when it is time to delete the oldest image?\n\nI think this sort of change also implies some needed updates to status output as previously you can rely on sequence info to make sense of order but that will go away.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":27582,"name":"Simon Westphahl","email":"simon.westphahl@bmw.de","username":"simon.westphahl"},"change_message_id":"d57bcb051c6cd42a32cb62876f41452f382598df","unresolved":false,"context_lines":[{"line_number":228,"context_line":""},{"line_number":229,"context_line":"After a buildset reports with ``image-built: true``, Zuul will scan"},{"line_number":230,"context_line":"result data and for each artifact it finds, it will create an entry in"},{"line_number":231,"context_line":"ZooKeeper at 
`/zuul/images/\u003cimage_name\u003e/\u003cuuid\u003e`. Zuul will know"},{"line_number":232,"context_line":"not to emit any more `image-build` events for that image at this"},{"line_number":233,"context_line":"point."},{"line_number":234,"context_line":""}],"source_content_type":"text/x-rst","patch_set":1,"id":"ab09aeca_4b27447e","line":231,"in_reply_to":"40cda478_e605cec2","updated":"2024-06-25 06:09:41.000000000","message":"Sequence numbers can cause problems with backup and restore as with a \"fresh\" Zookeeper you can\u0027t simply import those sequence numbers. ZK will always try to start counting from 0 and there is no way to adjust the start index for the sequence numbers.\n\nPreviously we just created/deleted znodes to sync up to the required #, but that was very slow.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":1,"name":"James E. Blair","email":"jim@acmegating.com","username":"corvus"},"change_message_id":"9963cafc61cab73bd654acf799207bca3b51487c","unresolved":true,"context_lines":[{"line_number":228,"context_line":""},{"line_number":229,"context_line":"After a buildset reports with ``image-built: true``, Zuul will scan"},{"line_number":230,"context_line":"result data and for each artifact it finds, it will create an entry in"},{"line_number":231,"context_line":"ZooKeeper at `/zuul/images/\u003cimage_name\u003e/\u003cuuid\u003e`. Zuul will know"},{"line_number":232,"context_line":"not to emit any more `image-build` events for that image at this"},{"line_number":233,"context_line":"point."},{"line_number":234,"context_line":""}],"source_content_type":"text/x-rst","patch_set":1,"id":"5ebc39b9_4967570d","line":231,"in_reply_to":"580adb07_0d71fe2a","updated":"2024-06-24 21:04:01.000000000","message":"We actually already sort-of implemented this in current nodepool. 
This line is mostly just catching up to the reality.\n\nCurrent nodepool has an Age column in output for identifying newest/oldest image builds.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"fee12b220f2c2e71202b6a1725c932421cab0d72","unresolved":false,"context_lines":[{"line_number":228,"context_line":""},{"line_number":229,"context_line":"After a buildset reports with ``image-built: true``, Zuul will scan"},{"line_number":230,"context_line":"result data and for each artifact it finds, it will create an entry in"},{"line_number":231,"context_line":"ZooKeeper at `/zuul/images/\u003cimage_name\u003e/\u003cuuid\u003e`. Zuul will know"},{"line_number":232,"context_line":"not to emit any more `image-build` events for that image at this"},{"line_number":233,"context_line":"point."},{"line_number":234,"context_line":""}],"source_content_type":"text/x-rst","patch_set":1,"id":"40cda478_e605cec2","line":231,"in_reply_to":"5ebc39b9_4967570d","updated":"2024-06-24 21:16:07.000000000","message":"Acknowledged","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"6e27d5e88992e0e01de0ce8ba5f12b9b929621f7","unresolved":true,"context_lines":[{"line_number":301,"context_line":"* Parallel processing without explicit coordination: a single launcher might"},{"line_number":302,"context_line":" not be able to fully utilize a provider due to e.g. 
CPU or I/O constraints;"},{"line_number":303,"context_line":" by having multiple launchers processing requests for a provider, we can"},{"line_number":304,"context_line":" better use the available cloud resources."},{"line_number":305,"context_line":""},{"line_number":306,"context_line":"* More intentional request fulfillment: almost no intelligence goes"},{"line_number":307,"context_line":" into selecting which provider will fulfill a given node request; by"}],"source_content_type":"text/x-rst","patch_set":1,"id":"aafdeb55_b25b8b26","line":304,"updated":"2024-06-24 20:14:47.000000000","message":"I\u0027m not sure I understand this new bullet. This has been the case for nodepool for many years. The only limitation was process per provider which is covered in the previous bullet.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"fee12b220f2c2e71202b6a1725c932421cab0d72","unresolved":false,"context_lines":[{"line_number":301,"context_line":"* Parallel processing without explicit coordination: a single launcher might"},{"line_number":302,"context_line":" not be able to fully utilize a provider due to e.g. 
CPU or I/O constraints;"},{"line_number":303,"context_line":" by having multiple launchers processing requests for a provider, we can"},{"line_number":304,"context_line":" better use the available cloud resources."},{"line_number":305,"context_line":""},{"line_number":306,"context_line":"* More intentional request fulfillment: almost no intelligence goes"},{"line_number":307,"context_line":" into selecting which provider will fulfill a given node request; by"}],"source_content_type":"text/x-rst","patch_set":1,"id":"70695f10_a7a83659","line":304,"in_reply_to":"9b5def98_5729be6e","updated":"2024-06-24 21:16:07.000000000","message":"Acknowledged","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":1,"name":"James E. Blair","email":"jim@acmegating.com","username":"corvus"},"change_message_id":"9963cafc61cab73bd654acf799207bca3b51487c","unresolved":true,"context_lines":[{"line_number":301,"context_line":"* Parallel processing without explicit coordination: a single launcher might"},{"line_number":302,"context_line":" not be able to fully utilize a provider due to e.g. CPU or I/O constraints;"},{"line_number":303,"context_line":" by having multiple launchers processing requests for a provider, we can"},{"line_number":304,"context_line":" better use the available cloud resources."},{"line_number":305,"context_line":""},{"line_number":306,"context_line":"* More intentional request fulfillment: almost no intelligence goes"},{"line_number":307,"context_line":" into selecting which provider will fulfill a given node request; by"}],"source_content_type":"text/x-rst","patch_set":1,"id":"9b5def98_5729be6e","line":304,"in_reply_to":"aafdeb55_b25b8b26","updated":"2024-06-24 21:04:01.000000000","message":"Yes, but the unrevised spec has some locking such that effectively only a single launcher would be issuing cloud API calls for a given provider at a time (though any provider could do so, and they could rapidly switch to handle load). 
The revision will allow multiple launchers to issue API calls to a single cloud at the same time.\n\nThis is particularly important for folks that only use a single cloud (or even a single region) as the performance of a single launcher determines their max throughput. But even in opendev which achieves high throughput through parallelization across clouds, I expect us to be able to handle bursting better.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"6e27d5e88992e0e01de0ce8ba5f12b9b929621f7","unresolved":true,"context_lines":[{"line_number":330,"context_line":"Requests and nodes will be considered by a launcher based on a calculated"},{"line_number":331,"context_line":"score. For that we will use Rendezvous/HRW (highest random weight) hashing to"},{"line_number":332,"context_line":"build a priority list of candidate launchers. The launcher with the highest"},{"line_number":333,"context_line":"score will lock and process a request or node."},{"line_number":334,"context_line":""},{"line_number":335,"context_line":"The the hash will consist of the unique launcher indentifiers (e.g. the"},{"line_number":336,"context_line":"hostnames from the component registry) and the UUID of the request or node. The"}],"source_content_type":"text/x-rst","patch_set":1,"id":"4134c53f_00f6d065","line":333,"updated":"2024-06-24 20:14:47.000000000","message":"I think we should link to information on this hashing method as it appears to be complicated enough to warrant some more in depth info.\n\nMore specifically it appears that you can have a weighted variation of this algorithm, something that I think will be good to have here since different providers have built in weights (max-servers/capacity for nodes). 
But there are apparently different approaches with different drawbacks and understanding all of that is important.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":1,"name":"James E. Blair","email":"jim@acmegating.com","username":"corvus"},"change_message_id":"fdaee9665f7f316d00d3eabc6f707348e7bf9b20","unresolved":true,"context_lines":[{"line_number":330,"context_line":"Requests and nodes will be considered by a launcher based on a calculated"},{"line_number":331,"context_line":"score. For that we will use Rendezvous/HRW (highest random weight) hashing to"},{"line_number":332,"context_line":"build a priority list of candidate launchers. The launcher with the highest"},{"line_number":333,"context_line":"score will lock and process a request or node."},{"line_number":334,"context_line":""},{"line_number":335,"context_line":"The the hash will consist of the unique launcher indentifiers (e.g. the"},{"line_number":336,"context_line":"hostnames from the component registry) and the UUID of the request or node. The"}],"source_content_type":"text/x-rst","patch_set":1,"id":"94b6406e_abc06682","line":333,"in_reply_to":"34b8bbc3_4ebc1b32","updated":"2024-06-24 21:46:25.000000000","message":"Good question!\n\nUp on lines 306-318 here on the new side (but in the original spec) it talks about assigning nodes to clouds more intentionally. So basically, in the new world, some random launcher (with no affiliation with a cloud) will look at a request and say \"this request for 2 ubuntu nodes shall be handled by rax-dfw\" presumably because it saw that rax-dfw can handle ubuntu nodes, and also has sufficient capacity to launch them, or whatever other criteria we decide to use. I imagine we will probably want to throw some randomization in there to avoid undesirable concentration.\n\nIn other words, the good news is that the process of cloud assignment of node requests is completely decoupled from any particular python process that\u0027s running. 
The bad news is that this spec (either version) doesn\u0027t go into detail about how to implement all that. But honestly, all we need to do is start with random.choice() to get going, then start improving from there. This spec gives us the framework for that.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"fee12b220f2c2e71202b6a1725c932421cab0d72","unresolved":true,"context_lines":[{"line_number":330,"context_line":"Requests and nodes will be considered by a launcher based on a calculated"},{"line_number":331,"context_line":"score. For that we will use Rendezvous/HRW (highest random weight) hashing to"},{"line_number":332,"context_line":"build a priority list of candidate launchers. The launcher with the highest"},{"line_number":333,"context_line":"score will lock and process a request or node."},{"line_number":334,"context_line":""},{"line_number":335,"context_line":"The the hash will consist of the unique launcher indentifiers (e.g. the"},{"line_number":336,"context_line":"hostnames from the component registry) and the UUID of the request or node. The"}],"source_content_type":"text/x-rst","patch_set":1,"id":"b1e40169_f2195f28","line":333,"in_reply_to":"34b8bbc3_4ebc1b32","updated":"2024-06-24 21:16:07.000000000","message":"What is preventing a single cloud from handling all requests all the time then? The current system relies on zk request lock races which isn\u0027t ideal but does generally prevent a single provider from monopolizing things. This is important for keeping jobs away from relying on a single cloud provider by chance and then discovering that problem at some point in the future when you need all of your capacity.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":1,"name":"James E. 
Blair","email":"jim@acmegating.com","username":"corvus"},"change_message_id":"9963cafc61cab73bd654acf799207bca3b51487c","unresolved":true,"context_lines":[{"line_number":330,"context_line":"Requests and nodes will be considered by a launcher based on a calculated"},{"line_number":331,"context_line":"score. For that we will use Rendezvous/HRW (highest random weight) hashing to"},{"line_number":332,"context_line":"build a priority list of candidate launchers. The launcher with the highest"},{"line_number":333,"context_line":"score will lock and process a request or node."},{"line_number":334,"context_line":""},{"line_number":335,"context_line":"The the hash will consist of the unique launcher indentifiers (e.g. the"},{"line_number":336,"context_line":"hostnames from the component registry) and the UUID of the request or node. The"}],"source_content_type":"text/x-rst","patch_set":1,"id":"34b8bbc3_4ebc1b32","line":333,"in_reply_to":"4134c53f_00f6d065","updated":"2024-06-24 21:04:01.000000000","message":"I agree about the link.\n\nThe weights don\u0027t need to come into play here because the hash will be used for the following tasks:\n* Deciding which launcher will process a request (and assign nodes to it)\n* Deciding which launcher will drive the state machine for a node\n\nNeither of those involve provider capacity, it\u0027s merely deciding which launcher does some arbitrary piece of work.\n\nKeep in mind that with nodepool-in-zuul, this whole process is much more like how Zuul processes requests than nodepool does today. There is normally no linkage* between a launcher and a cloud provider as there is today in nodepool, much in the same way that there is no zuul-scheduler assigned to the gate pipeline and another assigned to check. Any zuul scheduler can start processing any pipeline at any time (it just needs to lock it first). With nodepool-in-zuul, any launcher can start processing any request or any node at any time. 
What this spec revision does is it allows the launchers to decide what requests and nodes to process before locking them. It also allows them to process nodes from the same provider at the same time.\n\n* There may be a linkage between a launcher and a cloud provider, if a deployer decides that they need to locate a particular launcher in a particular cloud region due to network constraints. But that doesn\u0027t change the fact that the decision about what cloud should launch what node is made independently, possibly by a different launcher.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":27582,"name":"Simon Westphahl","email":"simon.westphahl@bmw.de","username":"simon.westphahl"},"change_message_id":"d57bcb051c6cd42a32cb62876f41452f382598df","unresolved":false,"context_lines":[{"line_number":330,"context_line":"Requests and nodes will be considered by a launcher based on a calculated"},{"line_number":331,"context_line":"score. For that we will use Rendezvous/HRW (highest random weight) hashing to"},{"line_number":332,"context_line":"build a priority list of candidate launchers. The launcher with the highest"},{"line_number":333,"context_line":"score will lock and process a request or node."},{"line_number":334,"context_line":""},{"line_number":335,"context_line":"The the hash will consist of the unique launcher indentifiers (e.g. the"},{"line_number":336,"context_line":"hostnames from the component registry) and the UUID of the request or node. 
The"}],"source_content_type":"text/x-rst","patch_set":1,"id":"71e2ba32_84e59f17","line":333,"in_reply_to":"7ecfbb05_c49be334","updated":"2024-06-25 06:09:41.000000000","message":"Added a link to https://en.wikipedia.org/wiki/Rendezvous_hashing","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"ffac99e096997f585a1a6a7f69579daa04a383e4","unresolved":false,"context_lines":[{"line_number":330,"context_line":"Requests and nodes will be considered by a launcher based on a calculated"},{"line_number":331,"context_line":"score. For that we will use Rendezvous/HRW (highest random weight) hashing to"},{"line_number":332,"context_line":"build a priority list of candidate launchers. The launcher with the highest"},{"line_number":333,"context_line":"score will lock and process a request or node."},{"line_number":334,"context_line":""},{"line_number":335,"context_line":"The the hash will consist of the unique launcher indentifiers (e.g. the"},{"line_number":336,"context_line":"hostnames from the component registry) and the UUID of the request or node. The"}],"source_content_type":"text/x-rst","patch_set":1,"id":"7ecfbb05_c49be334","line":333,"in_reply_to":"94b6406e_abc06682","updated":"2024-06-24 21:53:45.000000000","message":"Acknowledged","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"6e27d5e88992e0e01de0ce8ba5f12b9b929621f7","unresolved":true,"context_lines":[{"line_number":333,"context_line":"score will lock and process a request or node."},{"line_number":334,"context_line":""},{"line_number":335,"context_line":"The the hash will consist of the unique launcher indentifiers (e.g. the"},{"line_number":336,"context_line":"hostnames from the component registry) and the UUID of the request or node. 
The"},{"line_number":337,"context_line":"choosen hash function here needs to be fast and doesn\u0027t have to be a"},{"line_number":338,"context_line":"cryptographic hash function (e.g MurmurHash)."},{"line_number":339,"context_line":""}],"source_content_type":"text/x-rst","patch_set":1,"id":"1a278602_bd175c40","line":336,"range":{"start_line":336,"start_character":59,"end_line":336,"end_character":74},"updated":"2024-06-24 20:14:47.000000000","message":"We would probably want to hash on request not node to simplify keeping multinode requests from consuming all the available locks. Basically ensure we\u0027re doing our best to process multiple requests in parallel rather than a single request in parallel.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":1,"name":"James E. Blair","email":"jim@acmegating.com","username":"corvus"},"change_message_id":"9963cafc61cab73bd654acf799207bca3b51487c","unresolved":true,"context_lines":[{"line_number":333,"context_line":"score will lock and process a request or node."},{"line_number":334,"context_line":""},{"line_number":335,"context_line":"The the hash will consist of the unique launcher indentifiers (e.g. the"},{"line_number":336,"context_line":"hostnames from the component registry) and the UUID of the request or node. 
The"},{"line_number":337,"context_line":"choosen hash function here needs to be fast and doesn\u0027t have to be a"},{"line_number":338,"context_line":"cryptographic hash function (e.g MurmurHash)."},{"line_number":339,"context_line":""}],"source_content_type":"text/x-rst","patch_set":1,"id":"e9fcfb06_63f97cfc","line":336,"in_reply_to":"1a278602_bd175c40","updated":"2024-06-24 21:04:01.000000000","message":"I think the \"or\" in that sentence was due to the fact that there are two queues we\u0027re hashing here:\n\n* There is a queue of nodeset requests; they will be hashed to decide which launcher should handle each.\n* There is a pile of nodes to be built; they will be hashed to decide which launcher will drive their state machines.\n\nI don\u0027t think multinode requests will have a particular advantage here. The original request will be processed in order; then once there are multiple node records, the launchers should end up locking all of them. If there are 5 launchers with 5 node requests, each will process 1. If there are 5000 node requests, each should end up processing 1000. That\u0027s fairly similar to the behavior today, except that once you exceed the capacity of the providers today, there\u0027s no more work for that nodepool-launcher to do, so you could end up with a nodepool-launcher handling 500 nodes and another handling 1500. But a zuul-launcher will be able to evenly distribute them.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"ffac99e096997f585a1a6a7f69579daa04a383e4","unresolved":false,"context_lines":[{"line_number":333,"context_line":"score will lock and process a request or node."},{"line_number":334,"context_line":""},{"line_number":335,"context_line":"The the hash will consist of the unique launcher indentifiers (e.g. 
the"},{"line_number":336,"context_line":"hostnames from the component registry) and the UUID of the request or node. The"},{"line_number":337,"context_line":"choosen hash function here needs to be fast and doesn\u0027t have to be a"},{"line_number":338,"context_line":"cryptographic hash function (e.g MurmurHash)."},{"line_number":339,"context_line":""}],"source_content_type":"text/x-rst","patch_set":1,"id":"b9198842_2b49c89b","line":336,"in_reply_to":"8a2cc2d2_bbcbfd5f","updated":"2024-06-24 21:53:45.000000000","message":"Ah ok for some reason in my head I had that we were introducing something like a launch per cpu. But now rereading it I see that we are not. We\u0027re just doing our best to use more cpu time by spreading things out more generically.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"fee12b220f2c2e71202b6a1725c932421cab0d72","unresolved":true,"context_lines":[{"line_number":333,"context_line":"score will lock and process a request or node."},{"line_number":334,"context_line":""},{"line_number":335,"context_line":"The the hash will consist of the unique launcher indentifiers (e.g. the"},{"line_number":336,"context_line":"hostnames from the component registry) and the UUID of the request or node. The"},{"line_number":337,"context_line":"choosen hash function here needs to be fast and doesn\u0027t have to be a"},{"line_number":338,"context_line":"cryptographic hash function (e.g MurmurHash)."},{"line_number":339,"context_line":""}],"source_content_type":"text/x-rst","patch_set":1,"id":"c27a66fb_22bcc097","line":336,"in_reply_to":"e9fcfb06_63f97cfc","updated":"2024-06-24 21:16:07.000000000","message":"I think my main concern here is that we can effectively serialize requests for each build depending on the number of launchers that are available. 
With the old setup as long as you ran more than one launcher you could start building nodes for multiple builds at once. The upside to this is that builds needing fewer resources can get started more quickly because they can bypass bottlenecks when larger nodesets are needed for other jobs.\n\nI like this behavior because it helps incentivize developers to avoid just always using all the resources all the time.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":1,"name":"James E. Blair","email":"jim@acmegating.com","username":"corvus"},"change_message_id":"fdaee9665f7f316d00d3eabc6f707348e7bf9b20","unresolved":true,"context_lines":[{"line_number":333,"context_line":"score will lock and process a request or node."},{"line_number":334,"context_line":""},{"line_number":335,"context_line":"The the hash will consist of the unique launcher indentifiers (e.g. the"},{"line_number":336,"context_line":"hostnames from the component registry) and the UUID of the request or node. 
The"},{"line_number":337,"context_line":"choosen hash function here needs to be fast and doesn\u0027t have to be a"},{"line_number":338,"context_line":"cryptographic hash function (e.g MurmurHash)."},{"line_number":339,"context_line":""}],"source_content_type":"text/x-rst","patch_set":1,"id":"8a2cc2d2_bbcbfd5f","line":336,"in_reply_to":"e9fcfb06_63f97cfc","updated":"2024-06-24 21:46:25.000000000","message":"We may be missing each other on this one still.\n\nI think the thing to keep in mind is that in all three cases, all nodes (up to the limit of quota) are being built simultaneously:\n* Today a single nodepool-launcher process can drive hundreds or thousands of node building state machines in parallel (from many different requests).\n* In the original spec, a zuul-launcher would do exactly the same thing, except that they would be decoupled from the zuul-launcher that originally converted the request into building nodes.\n* In the revised spec, the same is true, but the method of deciding which launcher would drive that state machine is math instead of a zk lock race.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"6e27d5e88992e0e01de0ce8ba5f12b9b929621f7","unresolved":true,"context_lines":[{"line_number":364,"context_line":"The individual provider loop will:"},{"line_number":365,"context_line":""},{"line_number":366,"context_line":"* Iterate over every matching node (highest score) assigned to that provider in"},{"line_number":367,"context_line":" a `building` state"},{"line_number":368,"context_line":""},{"line_number":369,"context_line":" * If the node is locked by another launcher, continue with the next one"},{"line_number":370,"context_line":" * Lock the node (if not already 
locked)"}],"source_content_type":"text/x-rst","patch_set":1,"id":"5e5906ba_95234e89","line":367,"range":{"start_line":367,"start_character":4,"end_line":367,"end_character":14},"updated":"2024-06-24 20:14:47.000000000","message":"Should we have a new state to represent the pre building needs a provider to actually build the node state? I think for human representation of things in status output that distinction is likely to be helpful.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":27582,"name":"Simon Westphahl","email":"simon.westphahl@bmw.de","username":"simon.westphahl"},"change_message_id":"d57bcb051c6cd42a32cb62876f41452f382598df","unresolved":false,"context_lines":[{"line_number":364,"context_line":"The individual provider loop will:"},{"line_number":365,"context_line":""},{"line_number":366,"context_line":"* Iterate over every matching node (highest score) assigned to that provider in"},{"line_number":367,"context_line":" a `building` state"},{"line_number":368,"context_line":""},{"line_number":369,"context_line":" * If the node is locked by another launcher, continue with the next one"},{"line_number":370,"context_line":" * Lock the node (if not already locked)"}],"source_content_type":"text/x-rst","patch_set":1,"id":"2a1655ea_da3196cd","line":367,"range":{"start_line":367,"start_character":4,"end_line":367,"end_character":14},"in_reply_to":"5e5906ba_95234e89","updated":"2024-06-25 06:09:41.000000000","message":"Yep, I think that makes sense.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"6e27d5e88992e0e01de0ce8ba5f12b9b929621f7","unresolved":true,"context_lines":[{"line_number":367,"context_line":" a `building` state"},{"line_number":368,"context_line":""},{"line_number":369,"context_line":" * If the node is locked by another launcher, continue with the next 
one"},{"line_number":370,"context_line":" * Lock the node (if not already locked)"},{"line_number":371,"context_line":" * Drive the state machine"},{"line_number":372,"context_line":" * If success, update request"},{"line_number":373,"context_line":" * If failure, determine if it\u0027s a temporary or permanent failure"}],"source_content_type":"text/x-rst","patch_set":1,"id":"6a433d5a_94955758","line":370,"updated":"2024-06-24 20:14:47.000000000","message":"Re the above comment flip the state to building once the lock is held and just before we start to drive the state machine in the next step below.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":27582,"name":"Simon Westphahl","email":"simon.westphahl@bmw.de","username":"simon.westphahl"},"change_message_id":"d57bcb051c6cd42a32cb62876f41452f382598df","unresolved":false,"context_lines":[{"line_number":367,"context_line":" a `building` state"},{"line_number":368,"context_line":""},{"line_number":369,"context_line":" * If the node is locked by another launcher, continue with the next one"},{"line_number":370,"context_line":" * Lock the node (if not already locked)"},{"line_number":371,"context_line":" * Drive the state machine"},{"line_number":372,"context_line":" * If success, update request"},{"line_number":373,"context_line":" * If failure, determine if it\u0027s a temporary or permanent failure"}],"source_content_type":"text/x-rst","patch_set":1,"id":"4e170fa6_d7273899","line":370,"in_reply_to":"6a433d5a_94955758","updated":"2024-06-25 06:09:41.000000000","message":"Acknowledged","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":4146,"name":"Clark Boylan","email":"cboylan@sapwetik.org","username":"cboylan"},"change_message_id":"6e27d5e88992e0e01de0ce8ba5f12b9b929621f7","unresolved":true,"context_lines":[{"line_number":423,"context_line":""},{"line_number":424,"context_line":"Instead we need to handle rate-limits and API throttling in the 
respective"},{"line_number":425,"context_line":"drivers and adjust the request rate dynamically based on e.g. API response"},{"line_number":426,"context_line":"headers or errors."},{"line_number":427,"context_line":""},{"line_number":428,"context_line":""},{"line_number":429,"context_line":"Configuration"}],"source_content_type":"text/x-rst","patch_set":1,"id":"ff03cdcd_43170634","line":426,"updated":"2024-06-24 20:14:47.000000000","message":"In the case of OpenStack clouds I don\u0027t think there is a consistent standard for these sorts of errors/rate limits. I don\u0027t think openstack proper implements rate limits in its APIs so these errors come from proxies in front of the API services. The problem set is probably small enough to not be a huge problem, but may lead to a bunch of special casing over time.","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"},{"author":{"_account_id":1,"name":"James E. Blair","email":"jim@acmegating.com","username":"corvus"},"change_message_id":"9963cafc61cab73bd654acf799207bca3b51487c","unresolved":true,"context_lines":[{"line_number":423,"context_line":""},{"line_number":424,"context_line":"Instead we need to handle rate-limits and API throttling in the respective"},{"line_number":425,"context_line":"drivers and adjust the request rate dynamically based on e.g. API response"},{"line_number":426,"context_line":"headers or errors."},{"line_number":427,"context_line":""},{"line_number":428,"context_line":""},{"line_number":429,"context_line":"Configuration"}],"source_content_type":"text/x-rst","patch_set":1,"id":"c9f248f5_ff3cec94","line":426,"in_reply_to":"ff03cdcd_43170634","updated":"2024-06-24 21:04:01.000000000","message":"Good point. So we may need to design this to handle several different sources/manifestations of errors (whether it comes from headers or 4xx response text, etc).","commit_id":"e93bc2f53e7cf9276ab6e172b2e871881981f50a"}]}