)]}'
{"/COMMIT_MSG":[{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"745d46c6d245a5b6c9e8c618b8ac37b1244bf70c","unresolved":false,"context_lines":[{"line_number":27,"context_line":"  one conductor and one compute service running. If so then return"},{"line_number":28,"context_line":"  Healthy otherwise Unhealthy"},{"line_number":29,"context_line":""},{"line_number":30,"context_line":"Response will return the dict of API|cell\u0027s DB|MQ|services with their status."},{"line_number":31,"context_line":""},{"line_number":32,"context_line":"Flow digrame and detailed response example"},{"line_number":33,"context_line":"- https://etherpad.opendev.org/p/nova-healthchecks"}],"source_content_type":"text/x-gerrit-commit-message","patch_set":1,"id":"ff570b3c_e0832455","line":30,"updated":"2020-05-28 22:51:16.000000000","message":"Regarding the response, I think ideally we\u0027d want to have this be versioned (maybe similar to how we do the metadata API, simple datetime based version?) in case we need to evolve the response content in the future. 
Do you think versioning would be difficult to add?","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"745d46c6d245a5b6c9e8c618b8ac37b1244bf70c","unresolved":false,"context_lines":[{"line_number":32,"context_line":"Flow digrame and detailed response example"},{"line_number":33,"context_line":"- https://etherpad.opendev.org/p/nova-healthchecks"},{"line_number":34,"context_line":""},{"line_number":35,"context_line":"TODO: Auth part for various plugins"},{"line_number":36,"context_line":"TODO: extend the service checks"},{"line_number":37,"context_line":""},{"line_number":38,"context_line":"[1] https://docs.openstack.org/oslo.middleware/latest/reference/healthcheck_plugins.html#available-plugins"}],"source_content_type":"text/x-gerrit-commit-message","patch_set":1,"id":"ff570b3c_957b8cbc","line":35,"updated":"2020-05-28 22:51:16.000000000","message":"I think we are *not* going to want to require any auth for health checks. 
Instead, dansmith had an idea on a way we could lazily populate a healthcheck-cache during normal auth\u0027ed requests from users, and then check only the cache when /healthcheck is hit (that way no slow performance for healthcheck queries).\n\nI\u0027ll ask Dan to come take a look at this patch and maybe we can work out how he could add caching to what you\u0027ve got here.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"6df6aecdafe160c6b929bce8ec9fe90a95db9859","unresolved":false,"context_lines":[{"line_number":32,"context_line":"Flow digrame and detailed response example"},{"line_number":33,"context_line":"- https://etherpad.opendev.org/p/nova-healthchecks"},{"line_number":34,"context_line":""},{"line_number":35,"context_line":"TODO: Auth part for various plugins"},{"line_number":36,"context_line":"TODO: extend the service checks"},{"line_number":37,"context_line":""},{"line_number":38,"context_line":"[1] https://docs.openstack.org/oslo.middleware/latest/reference/healthcheck_plugins.html#available-plugins"}],"source_content_type":"text/x-gerrit-commit-message","patch_set":1,"id":"ff570b3c_87f077cb","line":35,"in_reply_to":"ff570b3c_957b8cbc","updated":"2020-05-29 15:23:10.000000000","message":"that will be great. 
I will wait for Dan\u0027s idea.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"e7a53b62b981918065853dba666aeca54812e236","unresolved":false,"context_lines":[{"line_number":14,"context_line":"or the last successful request is older than [default]/service_down_time"},{"line_number":15,"context_line":"then the healthcheck return NOK."},{"line_number":16,"context_line":""},{"line_number":17,"context_line":"Please note that the above behavior this healthcheck cannot be used from"},{"line_number":18,"context_line":"haproxy as a freshly restarted nova-api service would report it health"},{"line_number":19,"context_line":"as NOT OK until the first API request is forwarded to it. But haproxy"},{"line_number":20,"context_line":"would not forward any request to this nova-api as its health is NOT OK."},{"line_number":21,"context_line":"Also if an API worker would not get API request for a while then its"},{"line_number":22,"context_line":"health would switch to NOT OK and therefore haproxy will never send any"},{"line_number":23,"context_line":"new request to this worker."},{"line_number":24,"context_line":""},{"line_number":25,"context_line":"In the healthcheck middleware a healthcheck result is a boolean so we"},{"line_number":26,"context_line":"cannot model that the health of a service is unknown."}],"source_content_type":"text/x-gerrit-commit-message","patch_set":2,"id":"9f560f44_1045ba15","line":23,"range":{"start_line":17,"start_character":0,"end_line":23,"end_character":27},"updated":"2020-10-07 10:03:03.000000000","message":"One way out of this is to add a periodic task to the api worker that uses the RPC (and the DB) so that we always have traffic. 
However a periodic task might be problematic from wsgi deployment perspective.","commit_id":"3316b816306122b6e40b08d4e5e60f9e6a8d1a9a"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"565a684dc59b7753f36816f1d782691a51c72ebb","unresolved":false,"context_lines":[{"line_number":14,"context_line":"or the last successful request is older than [default]/service_down_time"},{"line_number":15,"context_line":"then the healthcheck return NOK."},{"line_number":16,"context_line":""},{"line_number":17,"context_line":"Please note that the above behavior this healthcheck cannot be used from"},{"line_number":18,"context_line":"haproxy as a freshly restarted nova-api service would report it health"},{"line_number":19,"context_line":"as NOT OK until the first API request is forwarded to it. But haproxy"},{"line_number":20,"context_line":"would not forward any request to this nova-api as its health is NOT OK."},{"line_number":21,"context_line":"Also if an API worker would not get API request for a while then its"},{"line_number":22,"context_line":"health would switch to NOT OK and therefore haproxy will never send any"},{"line_number":23,"context_line":"new request to this worker."},{"line_number":24,"context_line":""},{"line_number":25,"context_line":"In the healthcheck middleware a healthcheck result is a boolean so we"},{"line_number":26,"context_line":"cannot model that the health of a service is unknown."}],"source_content_type":"text/x-gerrit-commit-message","patch_set":2,"id":"7f6b1bfe_2acba71e","line":23,"range":{"start_line":17,"start_character":0,"end_line":23,"end_character":27},"in_reply_to":"7f6b1bfe_aa469720","updated":"2020-10-13 15:35:52.000000000","message":"\u003e So to summarize the way forward:\n \u003e * Look into oslo_messaging to get info about bus connectedness to\n \u003e avoid only getting MessageTimeout _after_ the connectivity to the\n \u003e bus is restored.\n\nThis isn\u0027t what 
I\u0027m suggesting. Doing this would change our (and presumably other projects\u0027) behavior a bit by not retrying the connect/send forever, which probably isn\u0027t desirable. What I meant was some sort of thing we can check directly from the healthcheck. Something like \"hey o.msg, how many connection pool workers have active connections?\". If it\u0027s more than zero, then we are likely good. Changing the retry-connect-forever behavior just so we can record timeouts ourselves would also work, but it\u0027s a lot more change.","commit_id":"3316b816306122b6e40b08d4e5e60f9e6a8d1a9a"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"404e40546bb3beb8ff95bb017165de186a18701d","unresolved":false,"context_lines":[{"line_number":14,"context_line":"or the last successful request is older than [default]/service_down_time"},{"line_number":15,"context_line":"then the healthcheck return NOK."},{"line_number":16,"context_line":""},{"line_number":17,"context_line":"Please note that the above behavior this healthcheck cannot be used from"},{"line_number":18,"context_line":"haproxy as a freshly restarted nova-api service would report it health"},{"line_number":19,"context_line":"as NOT OK until the first API request is forwarded to it. 
But haproxy"},{"line_number":20,"context_line":"would not forward any request to this nova-api as its health is NOT OK."},{"line_number":21,"context_line":"Also if an API worker would not get API request for a while then its"},{"line_number":22,"context_line":"health would switch to NOT OK and therefore haproxy will never send any"},{"line_number":23,"context_line":"new request to this worker."},{"line_number":24,"context_line":""},{"line_number":25,"context_line":"In the healthcheck middleware a healthcheck result is a boolean so we"},{"line_number":26,"context_line":"cannot model that the health of a service is unknown."}],"source_content_type":"text/x-gerrit-commit-message","patch_set":2,"id":"9f560f44_f00da1e7","line":23,"range":{"start_line":17,"start_character":0,"end_line":23,"end_character":27},"in_reply_to":"9f560f44_1045ba15","updated":"2020-10-07 15:15:16.000000000","message":"Yeah, we don\u0027t have periodics for api workers now, and I definitely don\u0027t want to add them, especially to generate traffic on an already busy message queue. This is why we don\u0027t have service record updates from api workers either. Using CERN as an example, every api worker process would need to check connection to \u003e70 rabbit brokers periodically, which is ... 
quite icky.","commit_id":"3316b816306122b6e40b08d4e5e60f9e6a8d1a9a"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"1da2ba752df431489bfddf519f615c9d6509a440","unresolved":false,"context_lines":[{"line_number":14,"context_line":"or the last successful request is older than [default]/service_down_time"},{"line_number":15,"context_line":"then the healthcheck return NOK."},{"line_number":16,"context_line":""},{"line_number":17,"context_line":"Please note that the above behavior this healthcheck cannot be used from"},{"line_number":18,"context_line":"haproxy as a freshly restarted nova-api service would report it health"},{"line_number":19,"context_line":"as NOT OK until the first API request is forwarded to it. But haproxy"},{"line_number":20,"context_line":"would not forward any request to this nova-api as its health is NOT OK."},{"line_number":21,"context_line":"Also if an API worker would not get API request for a while then its"},{"line_number":22,"context_line":"health would switch to NOT OK and therefore haproxy will never send any"},{"line_number":23,"context_line":"new request to this worker."},{"line_number":24,"context_line":""},{"line_number":25,"context_line":"In the healthcheck middleware a healthcheck result is a boolean so we"},{"line_number":26,"context_line":"cannot model that the health of a service is unknown."}],"source_content_type":"text/x-gerrit-commit-message","patch_set":2,"id":"7f6b1bfe_aa469720","line":23,"range":{"start_line":17,"start_character":0,"end_line":23,"end_character":27},"in_reply_to":"9f560f44_22302b07","updated":"2020-10-13 15:23:47.000000000","message":"So to summarize the way forward:\n* Look into oslo_messaging to get info about bus connectedness to avoid only getting MessageTimeout _after_ the connectivity to the bus is restored.\n* Record MessagingTimeout as fault, but only report unhealthy status after a configurable amount of 
successive faults.\n* Investigate how to detect DB faults in a generic way.\n* Make the DB health and the message bus health two separate checks to allow configuring nova to report healthy status if the message bus is down but the DB is up. I.e. still serve DB-only requests.","commit_id":"3316b816306122b6e40b08d4e5e60f9e6a8d1a9a"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"ed8398b86f4f39e2aa67e0023a0ad37deaea176a","unresolved":false,"context_lines":[{"line_number":14,"context_line":"or the last successful request is older than [default]/service_down_time"},{"line_number":15,"context_line":"then the healthcheck return NOK."},{"line_number":16,"context_line":""},{"line_number":17,"context_line":"Please note that the above behavior this healthcheck cannot be used from"},{"line_number":18,"context_line":"haproxy as a freshly restarted nova-api service would report it health"},{"line_number":19,"context_line":"as NOT OK until the first API request is forwarded to it. 
But haproxy"},{"line_number":20,"context_line":"would not forward any request to this nova-api as its health is NOT OK."},{"line_number":21,"context_line":"Also if an API worker would not get API request for a while then its"},{"line_number":22,"context_line":"health would switch to NOT OK and therefore haproxy will never send any"},{"line_number":23,"context_line":"new request to this worker."},{"line_number":24,"context_line":""},{"line_number":25,"context_line":"In the healthcheck middleware a healthcheck result is a boolean so we"},{"line_number":26,"context_line":"cannot model that the health of a service is unknown."}],"source_content_type":"text/x-gerrit-commit-message","patch_set":2,"id":"9f560f44_f1bd63d7","line":23,"range":{"start_line":17,"start_character":0,"end_line":23,"end_character":27},"in_reply_to":"9f560f44_4e410e63","updated":"2020-10-12 13:47:33.000000000","message":"Just to be clear, I *do* recognize your finding there. This is why we said above that it sounds like we might need an *external* indication of connected-ness from oslo.messaging to implement that in a useful way (which may or may not exist today). Also, your point about never being able to recognize a failure is only true when we can\u0027t _connect_ to the bus. If we can, and we have something fail like we make a call to conductor and it never replies, then we *will* get the usual timeout error and could use that to factor into our decision that \"Hmm, maybe we\u0027re on the wrong bus\". 
But again, calling that a total failure would impact our ability to service DB-only requests.","commit_id":"3316b816306122b6e40b08d4e5e60f9e6a8d1a9a"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"3453e2bb9d4052d6a49cd8f29bc8a3c4300bef16","unresolved":false,"context_lines":[{"line_number":14,"context_line":"or the last successful request is older than [default]/service_down_time"},{"line_number":15,"context_line":"then the healthcheck return NOK."},{"line_number":16,"context_line":""},{"line_number":17,"context_line":"Please note that the above behavior this healthcheck cannot be used from"},{"line_number":18,"context_line":"haproxy as a freshly restarted nova-api service would report it health"},{"line_number":19,"context_line":"as NOT OK until the first API request is forwarded to it. But haproxy"},{"line_number":20,"context_line":"would not forward any request to this nova-api as its health is NOT OK."},{"line_number":21,"context_line":"Also if an API worker would not get API request for a while then its"},{"line_number":22,"context_line":"health would switch to NOT OK and therefore haproxy will never send any"},{"line_number":23,"context_line":"new request to this worker."},{"line_number":24,"context_line":""},{"line_number":25,"context_line":"In the healthcheck middleware a healthcheck result is a boolean so we"},{"line_number":26,"context_line":"cannot model that the health of a service is unknown."}],"source_content_type":"text/x-gerrit-commit-message","patch_set":2,"id":"9f560f44_982543b1","line":23,"range":{"start_line":17,"start_character":0,"end_line":23,"end_character":27},"in_reply_to":"9f560f44_56ed5322","updated":"2020-10-09 16:02:36.000000000","message":"+1 to keeping this simpler and defaulting to OK and avoiding periodic tasks. 
A bit more detail below:\n\n+1 to returning OK if we don\u0027t know anything and FAIL only if we know we have failed a recent attempt to contact a cell DB or MQ. I\u0027m thinking maybe we could also have a threshold for FAIL like 3 failures in a row before switching to FAIL. And any success resets the FAIL flag to OK.\n\nSoft -1 on connection-to-rabbit status from o.msg (if it exists) because of the complexity for operators in a maintenance scenario as described by Dan.\n\n+1 to the reset to OK if the last FAIL was sufficiently in the past (time delta of elapsed time since flag set).","commit_id":"3316b816306122b6e40b08d4e5e60f9e6a8d1a9a"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"d60006eed915ecbc2637d913c2bc4a6c9f02672d","unresolved":false,"context_lines":[{"line_number":14,"context_line":"or the last successful request is older than [default]/service_down_time"},{"line_number":15,"context_line":"then the healthcheck return NOK."},{"line_number":16,"context_line":""},{"line_number":17,"context_line":"Please note that the above behavior this healthcheck cannot be used from"},{"line_number":18,"context_line":"haproxy as a freshly restarted nova-api service would report it health"},{"line_number":19,"context_line":"as NOT OK until the first API request is forwarded to it. 
But haproxy"},{"line_number":20,"context_line":"would not forward any request to this nova-api as its health is NOT OK."},{"line_number":21,"context_line":"Also if an API worker would not get API request for a while then its"},{"line_number":22,"context_line":"health would switch to NOT OK and therefore haproxy will never send any"},{"line_number":23,"context_line":"new request to this worker."},{"line_number":24,"context_line":""},{"line_number":25,"context_line":"In the healthcheck middleware a healthcheck result is a boolean so we"},{"line_number":26,"context_line":"cannot model that the health of a service is unknown."}],"source_content_type":"text/x-gerrit-commit-message","patch_set":2,"id":"9f560f44_4e410e63","line":23,"range":{"start_line":17,"start_character":0,"end_line":23,"end_character":27},"in_reply_to":"9f560f44_982543b1","updated":"2020-10-12 13:20:16.000000000","message":"If we don\u0027t want to generate our own healthcheck traffic then I think we should never send Unknown as a health check result. This just externalizes the decision about our health to the caller, haproxy. If we know that the only way to figure out our health is to get some normal traffic then just report OK and it will result in such traffic so we can get out of the unknown situation. (We can state in the response message that we are not sure about the health but the \u0027available\u0027 boolean [1] should be set to true).\n\nThen the question goes back to when should we report that we are unavailable. \n\nMy original thought was that if there is a recent unsuccessful request then we should report that we are unavailable. (And as far as I see this is the agreement above as well). For the message bus case we cannot implement that. As I commented before:\n\n\"\nI looked into the RPC part of this a bit today. I simply killed the rabbit process to trigger an error for a POST servers/{serverid}/remote-consoles REST request. 
The message bus connection does happen at the call() method call, but that call does not raise an exception back if the connection fails. Oslo messaging retries the call forever (at least in devstack with default config). Eventually the http client timeouts but the server side doesn\u0027t. I saw the same for cast() method too. \n\nWhen I started the rabbit then the nova RPC call() returned an oslo_messaging.exceptions.MessagingTimeout exception to the nova code. \n\" // PS1 Oct 2.\n\nBased on this behavior we cannot record failed RPC messages as the failure only reported back to nova code _after_ the message bus is recovered. \n\nWe can only record the successful RPC messages but from that information we can only deduce OK, or Unknown state (and we represent Unknown with OK). \n\n\n[1] https://review.opendev.org/#/c/731396/2/nova/api/openstack/healthchecks/api_service.py@55","commit_id":"3316b816306122b6e40b08d4e5e60f9e6a8d1a9a"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"73790b5a0a155150796dce967cb4bde4dbbc204f","unresolved":false,"context_lines":[{"line_number":14,"context_line":"or the last successful request is older than [default]/service_down_time"},{"line_number":15,"context_line":"then the healthcheck return NOK."},{"line_number":16,"context_line":""},{"line_number":17,"context_line":"Please note that the above behavior this healthcheck cannot be used from"},{"line_number":18,"context_line":"haproxy as a freshly restarted nova-api service would report it health"},{"line_number":19,"context_line":"as NOT OK until the first API request is forwarded to it. 
But haproxy"},{"line_number":20,"context_line":"would not forward any request to this nova-api as its health is NOT OK."},{"line_number":21,"context_line":"Also if an API worker would not get API request for a while then its"},{"line_number":22,"context_line":"health would switch to NOT OK and therefore haproxy will never send any"},{"line_number":23,"context_line":"new request to this worker."},{"line_number":24,"context_line":""},{"line_number":25,"context_line":"In the healthcheck middleware a healthcheck result is a boolean so we"},{"line_number":26,"context_line":"cannot model that the health of a service is unknown."}],"source_content_type":"text/x-gerrit-commit-message","patch_set":2,"id":"9f560f44_56ed5322","line":23,"range":{"start_line":17,"start_character":0,"end_line":23,"end_character":27},"in_reply_to":"9f560f44_98d8baf0","updated":"2020-10-08 14:51:48.000000000","message":"\u003e is not it deadlock? haproxy would not send any request until\n \u003e healthcheck return OK and healthcheck does not return OK until the\n \u003e first request (successful request).\n\nYeah, I think a few things are getting confused here, so let me try to lay out some logic that I think makes the most sense:\n\nIf we don\u0027t know anything, we should probably return OK until we know otherwise, so that we don\u0027t seal off our ability to use the API to recover from some failure or take actions during an upgrade. The API can still be useful for recovering from a failure scenario even if rabbit is disconnected, if we\u0027re doing something like updating aggregates or flavors during an upgrade or maintenance window. 
This activity doesn\u0027t require access to the message bus.\n\nIf we can get some sort of connection-to-rabbit status from o.msg, then maybe that would be good to query and report synchronously in the healthcheck, but it would preclude the above and require an operator to disable that, force the worker to look healthy, or start up a separate API worker for maintenance work.\n\nI think the only time we want to really return a failed healthcheck for mq-related issues is if we have failed to make an RPC call recently, incidental to a real request, then we can return FAIL in that case to indicate that maybe we\u0027re connected to rabbit, but the wrong rabbit, etc. Since this should be \"recent\" it means it should time out and start to claim \"I assume healthy\" after some timeout has passed, which makes it auto-heal, instead of requiring some api-resident polling loop to go back to healthy.","commit_id":"3316b816306122b6e40b08d4e5e60f9e6a8d1a9a"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"d28df7ee4b048c932a20330d0f9f0e74a0245f48","unresolved":false,"context_lines":[{"line_number":14,"context_line":"or the last successful request is older than [default]/service_down_time"},{"line_number":15,"context_line":"then the healthcheck return NOK."},{"line_number":16,"context_line":""},{"line_number":17,"context_line":"Please note that the above behavior this healthcheck cannot be used from"},{"line_number":18,"context_line":"haproxy as a freshly restarted nova-api service would report it health"},{"line_number":19,"context_line":"as NOT OK until the first API request is forwarded to it. 
But haproxy"},{"line_number":20,"context_line":"would not forward any request to this nova-api as its health is NOT OK."},{"line_number":21,"context_line":"Also if an API worker would not get API request for a while then its"},{"line_number":22,"context_line":"health would switch to NOT OK and therefore haproxy will never send any"},{"line_number":23,"context_line":"new request to this worker."},{"line_number":24,"context_line":""},{"line_number":25,"context_line":"In the healthcheck middleware a healthcheck result is a boolean so we"},{"line_number":26,"context_line":"cannot model that the health of a service is unknown."}],"source_content_type":"text/x-gerrit-commit-message","patch_set":2,"id":"9f560f44_98d8baf0","line":23,"range":{"start_line":17,"start_character":0,"end_line":23,"end_character":27},"in_reply_to":"9f560f44_f00da1e7","updated":"2020-10-08 14:38:12.000000000","message":"is not it deadlock? haproxy would not send any request until healthcheck return OK and healthcheck does not return OK until the first request (successful request).\n\nwe can convert these NOT OK to Unknown status which ends up healthcheck will only be beneficial to know when things are up (safe to call) but not otherwise. \n\nThis could have ok for first or after-interval request where haproxy can try sending the request in Unknown status but if it does not return the failure status at all then what is the benefit of healthcheck? haproxy or any other LB end up either sending requests all the time even API worker is down which is no improvement from the no healthcheck situation. 
or not sending the request for Unknown status which is not good as In Unknown status things can be up.","commit_id":"3316b816306122b6e40b08d4e5e60f9e6a8d1a9a"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"eecfec249302ea02d5ce2fbd8cc3e5fa05cc9e57","unresolved":false,"context_lines":[{"line_number":14,"context_line":"or the last successful request is older than [default]/service_down_time"},{"line_number":15,"context_line":"then the healthcheck return NOK."},{"line_number":16,"context_line":""},{"line_number":17,"context_line":"Please note that the above behavior this healthcheck cannot be used from"},{"line_number":18,"context_line":"haproxy as a freshly restarted nova-api service would report it health"},{"line_number":19,"context_line":"as NOT OK until the first API request is forwarded to it. But haproxy"},{"line_number":20,"context_line":"would not forward any request to this nova-api as its health is NOT OK."},{"line_number":21,"context_line":"Also if an API worker would not get API request for a while then its"},{"line_number":22,"context_line":"health would switch to NOT OK and therefore haproxy will never send any"},{"line_number":23,"context_line":"new request to this worker."},{"line_number":24,"context_line":""},{"line_number":25,"context_line":"In the healthcheck middleware a healthcheck result is a boolean so we"},{"line_number":26,"context_line":"cannot model that the health of a service is unknown."}],"source_content_type":"text/x-gerrit-commit-message","patch_set":2,"id":"9f560f44_22302b07","line":23,"range":{"start_line":17,"start_character":0,"end_line":23,"end_character":27},"in_reply_to":"9f560f44_f1bd63d7","updated":"2020-10-12 16:44:49.000000000","message":"FWIW I had in mind the last scenario Dan describes here, where an RPC call/cast fails with a oslo_messaging.exceptions.MessagingTimeout, that\u0027s the typical kind of RPC failure I have 
seen. This can happen in an environment where there are \"intermittent\" issues like this but not a total failure, so that\u0027s why I was thinking about a counter for how many times something Bad happens before we call it NOK. Same thing with DBConnectionError and similar.","commit_id":"3316b816306122b6e40b08d4e5e60f9e6a8d1a9a"}],"etc/nova/api-paste.ini":[{"author":{"_account_id":8864,"name":"Artom Lifshitz","email":"notartom@gmail.com","username":"artom"},"change_message_id":"9c5e4c53737b056781f49bffb289d0e0a34fcc1f","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_ec3685b4","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"updated":"2020-05-28 18:37:46.000000000","message":"This is going to expose way more details than is safe for the oslo backend.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"2f63df28519725bf333873b3ad41f96880ae736f","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, 
Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_2762643f","line":112,"updated":"2020-10-02 15:54:32.000000000","message":"nit: trailing whitespace","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"e86fb2eb40731a6c9ca31bba546c734d0e0f06a3","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_a97f1764","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_018ee577","updated":"2020-06-02 23:40:35.000000000","message":"I\u0027ve sent email to a small group of operators on Friday, pointing them to this patch for comment. No replies/comments yet unfortunately.\n\nI agree that the /healthcheck intended for use with load balancers needs to be fast response and unauth\u0027ed. The full system checks are likely useful for operators but I think either way they couldn\u0027t/shouldn\u0027t be part of this lightweight URL meant for load balancers. Idea: I wonder if they could live at /healthcheck/full maybe and use auth and versioning? 
I\u0027m not sure if that fits into the rules of how REST should be organized.\n\nBased on this, IMHO, it sounds like the best way forward here would be to:\n\n(1) Prune down this patch to be only a NovaAPIHealthCheck plugin that uses a CELL_LAST_SUCCESS type mechanism to cache healthy/not healthy and return only OK or NOT OK.\n\n(2) Move the nova system health checks into separate plugin(s) that are intended to live at different endpoint(s) and use versioning if their format is intended to be expanded in the future","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"5c8ef9db01188fd34f68e9a600a49fb12bad4cfa","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_9036ec02","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_1d3b1ac8","updated":"2020-05-29 20:38:29.000000000","message":"Thanks for confirming that gmann.\n\nBased on this, I\u0027d say we should set detailed\u003dFalse and I think that would actually work well in conjunction with the idea of scaling back this PoC to be a health check only of an API worker as described in Dan\u0027s comment.\n\nWe could simplify things a lot, have only one plugin that accesses CELL_LAST_SUCCESS unauth\u0027ed and returns OK or NOT OK without any more 
details.\n\nI think the thing that gives me pause is the idea of adding another endpoint to do a full system check vs using this one to do it. I want to ask penick for his opinion from an operator perspective. I\u0027ll ping mnaser and belmoreira for their thoughts too on what they expect from /healthcheck and whether a full system check is best as a separate endpoint or same endpoint.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"6df6aecdafe160c6b929bce8ec9fe90a95db9859","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_1d3b1ac8","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_4c27a613","updated":"2020-05-29 15:23:10.000000000","message":"it seems it does expose the platform info, I ran the API and got the same[1].\n\nNova plugins details are ok as we are not exposing any specific details but only a dict of DB, cell, MQ is up or not. 
also services I am appending with name as int, not with the actual service id etc.\n\nIf we can fix on the oslo middleware side not to expose the platform things then I am in favour of making detailed as True by default, else False.\n\n[1] L74 - https://etherpad.opendev.org/p/nova-healthchecks","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"1044ac24d4c5e2c6c745d169fe91ee751f482e5b","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_018ee577","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_8cfcf772","updated":"2020-06-02 15:02:24.000000000","message":"This healthcheck isn\u0027t intended to be beneficial to the user. It\u0027s not really even intended to be useful to the administrator. It\u0027s intended to be useful to the loadbalancer or k8s system to know whether this one thing is dead or not. 
By combining the simple \"is this one service worker healthy or not\" with \"is the whole nova system healthy, including rabbit and mariadb\" things that are looking at the result of this will make bad decisions, such as restarting all API workers continually until it hits some threshold, all because rabbit has become wedged.\n\nAs I said in this review as well as in the discussion in the PTG, we can make statements about the health of the larger system by looking at all the pieces, but that is (a) not what this is for, and (b) currently possible with the existing admin-focused APIs we already expose. If we mix the two purposes, we get a healthcheck API that does not serve the need of a loadbalancer/k8s, which is a quick, unauthenticated check of the health of this one service that can be called with relatively high frequency and low impact.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"3364bd2102a3de28ef0e8bfcc705c1951f9eaa1b","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_d3718eaa","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_9036ec02","updated":"2020-05-29 20:43:31.000000000","message":"\u003e I think the thing that gives me pause is the idea of adding another\n \u003e endpoint 
to do a full system check vs using this one to do it. I\n \u003e want to ask penick for his opinion from an operator perspective.\n \u003e I\u0027ll ping mnaser and belmoreira for their thoughts too on what they\n \u003e expect from /healthcheck and whether a full system check is best as\n \u003e a separate endpoint or same endpoint.\n\nRe-reading what I wrote, I\u0027m realizing maybe what we could do is have two plugins: one that is API worker check only and one that is a full system check. That way operators can choose which one they want via api-paste.ini. Note that the NovaSystemHealthCheck would be best to be versioned via datetime (try to keep it simple?) because what if operators ask for more checks and details in the future.\n\nAnd then the NovaAPIHealthCheck does the CELL_LAST_SUCCESS thing and returns only OK/NOT OK and is not versioned.\n\nJust brainstorming here.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"916c755276838a0b01fcdfb228c1d4bd26de292e","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_b3d21a5b","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_9036ec02","updated":"2020-05-29 20:52:10.000000000","message":"\u003e I think the thing that gives me pause is the idea of adding another\n \u003e 
endpoint to do a full system check vs using this one to do it. I\n \u003e want to ask penick for his opinion from an operator perspective.\n \u003e I\u0027ll ping mnaser and belmoreira for their thoughts too on what they\n \u003e expect from /healthcheck and whether a full system check is best as\n \u003e a separate endpoint or same endpoint.\n\nYeah, I want to hear from them about the worker-vs-system thing.\n\nHowever, almost either way, I want to point out that an external script can already do pretty much everything that is done in these more complex plugins without an extra endpoint I think.\n\nI can already check service status, which necessarily goes across all the cells (efficiently in parallel with timeouts I would point out). So if the health check returns just cached per-cell MQ and DB connectivity info, and I\u0027m able to survey the services actively, are we adding much by an extra endpoint? Unless there\u0027s something missing, let them choose what they want to survey and how often by letting it be external. 
We can write them a tool as a template if that\u0027s helpful, but otherwise slimming the healthcheck down to just the health of the API worker seems like the right approach to me, at least for now.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"7444fddc88f133776c0b73dbf52e7bfd58cc7188","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_a0dacc13","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_95f8ac49","updated":"2020-05-28 23:39:30.000000000","message":"(later) OK I think I don\u0027t 100% understand whether setting detailed\u003dTrue is a problem. Does it expose platform details with the \u0027backends\u0027 set above or no? 
If not, then I\u0027d think it\u0027s up to the operator to ensure they don\u0027t use detailed\u003dTrue with certain \u0027backends\u0027 and we could include a warning in the code comment to help with that.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8313,"name":"Lajos Katona","display_name":"lajoskatona","email":"katonalala@gmail.com","username":"elajkat","status":"Ericsson Software Technology"},"change_message_id":"e3470e7bef548a883b1bd95f20268e1f1439b31a","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_4c27a613","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_a0dacc13","updated":"2020-05-29 12:56:32.000000000","message":"+1\nAs I see by default it can be False and document it, and give the option to the operator/deployer if they need this chatty output or not.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"cb4a99b8599ca37f06ca8d78f6784dc557e9fc71","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, 
Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_47b13f8f","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_a0dacc13","updated":"2020-05-29 14:34:44.000000000","message":"When I looked at it previously I think there was way too much stuff enabled with detailed\u003dTrue even just in the base case (i.e. no backends). I think we probably at least want a comment above this that detailed\u003dTrue will expose a lot of internal details that they probably don\u0027t want. If a deployer is only running this behind a load balancer and the outside world can\u0027t even hit the healthcheck, then maybe that\u0027s okay, but it\u0027s a risk thing otherwise.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"3c5b501a3da45ec884c45bcf4ab57005872a155b","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d 
/etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_969ccb6a","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_a97f1764","updated":"2020-10-08 13:45:10.000000000","message":"I agree on making API worker","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"47db6bd1102ca3780bbe17984646613bb8d0f4ad","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_621ff55a","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_b3d21a5b","updated":"2020-06-01 15:23:10.000000000","message":"yeah, I agree on starting the API worker checks for sure. \n\nSystem check plugins we can provide with a cache mechanism so that people can enable them on an on-demand basis, 
like openstack-helm or k8s liveness probes can use that directly.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":26458,"name":"Brin Zhang","email":"zhangbailin@inspur.com","username":"zhangbailin"},"change_message_id":"8c4d57170c639106211224a7ffe133edbf5d1ab5","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_8cfcf772","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_d3718eaa","updated":"2020-06-02 01:50:41.000000000","message":"\u003e And then the NovaAPIHealthCheck does the CELL_LAST_SUCCESS thing\n \u003e and returns only OK/NOT OK and is not versioned.\n \u003e \n \u003e Just brainstorming here.\n\nIf we just return OK or NOT OK, AFAICT, from the user’s perspective, it doesn’t make sense to me. As a user, I would like to know what happened to my platform. When I perform an operation and it fails or cannot be accessed (I am a user, not a developer, do not ask me to use a shell or CLI), at this time obtaining the nova DB/MQ/service information through the API may be a good choice. 
If it is because we do not want the sensitive information to be displayed to the user, I think we can default it to False and give the admin the authority to set \"detailed \u003d True\".","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"745d46c6d245a5b6c9e8c618b8ac37b1244bf70c","unresolved":false,"context_lines":[{"line_number":109,"context_line":"# as per your need for example: HAProxy can enable only disable_by_file"},{"line_number":110,"context_line":"# by replacing the below line with backends \u003d disable_by_file"},{"line_number":111,"context_line":"backends \u003d disable_by_file, Nova_DB_healthcheck, Nova_MQ_healthcheck, Nova_services_healthcheck"},{"line_number":112,"context_line":"detailed \u003d True "},{"line_number":113,"context_line":"disable_by_file_path \u003d /etc/nova/healthcheck_disable"}],"source_content_type":"text/x-properties","patch_set":1,"id":"ff570b3c_95f8ac49","line":112,"range":{"start_line":112,"start_character":11,"end_line":112,"end_character":15},"in_reply_to":"ff570b3c_ec3685b4","updated":"2020-05-28 22:51:16.000000000","message":"+1, we can\u0027t enable this. 
It exposes too much information, for example:\n\n\"platform\": \"Linux-4.2.0-27-generic-x86_64-with-Ubuntu-14.04-trusty\",\n\n[1] https://docs.openstack.org/oslo.middleware/latest/reference/healthcheck_plugins.html#oslo_middleware.healthcheck.Healthcheck","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"}],"nova/api/openstack/healthchecks/api_service.py":[{"author":{"_account_id":9708,"name":"Balazs Gibizer","display_name":"gibi","email":"gibizer@gmail.com","username":"gibi"},"change_message_id":"820a14cc5b8ba0518bbcbcb04ba6562abdf399d4","unresolved":false,"context_lines":[{"line_number":58,"context_line":"                        \"the last service restart\")"},{"line_number":59,"context_line":""},{"line_number":60,"context_line":"        rpc_delta \u003d timeutils.utcnow() - rpc.LAST_SUCCESSFUL_RPC"},{"line_number":61,"context_line":"        print(rpc.LAST_SUCCESSFUL_RPC)"},{"line_number":62,"context_line":"        print(rpc_delta)"},{"line_number":63,"context_line":"        # TODO(gibi): we might want a separate option defining the timeout for"},{"line_number":64,"context_line":"        # rpc health"},{"line_number":65,"context_line":"        if rpc_delta \u003c datetime.timedelta(seconds\u003dCONF.service_down_time):"}],"source_content_type":"text/x-python","patch_set":2,"id":"9f560f44_ad49198f","line":62,"range":{"start_line":61,"start_character":0,"end_line":62,"end_character":24},"updated":"2020-10-07 09:38:58.000000000","message":"drop this","commit_id":"3316b816306122b6e40b08d4e5e60f9e6a8d1a9a"}],"nova/api/openstack/healthchecks/db_health.py":[{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"cb4a99b8599ca37f06ca8d78f6784dc557e9fc71","unresolved":false,"context_lines":[{"line_number":41,"context_line":"    def _check_cell_db(self, context):"},{"line_number":42,"context_line":"        try:"},{"line_number":43,"context_line":"            engine \u003d 
session.get_engine(context)"},{"line_number":44,"context_line":"            engine.connect()"},{"line_number":45,"context_line":"            # TODO(gmann)? we can try to access some table if needed"},{"line_number":46,"context_line":"            # conn.execute(\"select * from aggregates\")"},{"line_number":47,"context_line":"            engine.dispose()"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_62171147","line":44,"updated":"2020-05-29 14:34:44.000000000","message":"So, imagine a (real) case where I have four remote sites with compute nodes over a VPN. They sometimes come and go because they\u0027re remote and their connection is not perfect or I take one of the intermediate networks down for maintenance which causes packets to drop into a hole. This will take however long to time out, and stack sequentially if there are more than one. How long is the healthcheck timeout of the thing looking to nuke this API worker if it doesn\u0027t respond? Since all of my API workers will fall into the same trap, what will happen? Isn\u0027t the point of the healthcheck really to signal that this API worker is healthy? The status of the remote cells doesn\u0027t really mean anything about this API worker, yet there\u0027s a good chance that we\u0027ll signify that those health statuses are fatally linked.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"34b8ca322f5808fb8504468ae9688750e75766af","unresolved":false,"context_lines":[{"line_number":41,"context_line":"    def _check_cell_db(self, context):"},{"line_number":42,"context_line":"        try:"},{"line_number":43,"context_line":"            engine \u003d session.get_engine(context)"},{"line_number":44,"context_line":"            engine.connect()"},{"line_number":45,"context_line":"            # TODO(gmann)? 
we can try to access some table if needed"},{"line_number":46,"context_line":"            # conn.execute(\"select * from aggregates\")"},{"line_number":47,"context_line":"            engine.dispose()"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_eacdb730","line":44,"in_reply_to":"ff570b3c_62171147","updated":"2020-05-29 17:39:30.000000000","message":"that is a good point. I was expecting timeout things on the oslo middleware side but it does not seem to have them. But this is one thing to add.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"cb4a99b8599ca37f06ca8d78f6784dc557e9fc71","unresolved":false,"context_lines":[{"line_number":43,"context_line":"            engine \u003d session.get_engine(context)"},{"line_number":44,"context_line":"            engine.connect()"},{"line_number":45,"context_line":"            # TODO(gmann)? we can try to access some table if needed"},{"line_number":46,"context_line":"            # conn.execute(\"select * from aggregates\")"},{"line_number":47,"context_line":"            engine.dispose()"},{"line_number":48,"context_line":"            return True"},{"line_number":49,"context_line":"        except Exception as ex:"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_e224c1b1","line":46,"updated":"2020-05-29 14:34:44.000000000","message":"IMHO, this again allows an attacker to do too much to load the system. 
And if we do, we\u0027d need a query that is very fast and limited in size.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"34b8ca322f5808fb8504468ae9688750e75766af","unresolved":false,"context_lines":[{"line_number":43,"context_line":"            engine \u003d session.get_engine(context)"},{"line_number":44,"context_line":"            engine.connect()"},{"line_number":45,"context_line":"            # TODO(gmann)? we can try to access some table if needed"},{"line_number":46,"context_line":"            # conn.execute(\"select * from aggregates\")"},{"line_number":47,"context_line":"            engine.dispose()"},{"line_number":48,"context_line":"            return True"},{"line_number":49,"context_line":"        except Exception as ex:"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_dd86c213","line":46,"in_reply_to":"ff570b3c_e224c1b1","updated":"2020-05-29 17:39:30.000000000","message":"yeah, I agree. 
I think no query is ok here and just return based on connection established.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"cb4a99b8599ca37f06ca8d78f6784dc557e9fc71","unresolved":false,"context_lines":[{"line_number":64,"context_line":"        cell0_up \u003d False"},{"line_number":65,"context_line":"        reason \u003d \"DB Healthy: Nova API + cell databases are UP\""},{"line_number":66,"context_line":"        try:"},{"line_number":67,"context_line":"            cells \u003d cell_mapping_obj.CellMappingList.get_all(admin_ctxt)"},{"line_number":68,"context_line":"        except db_exc.DBError:"},{"line_number":69,"context_line":"            LOG.warning(\u0027Cannot get cells info from API DB.\u0027)"},{"line_number":70,"context_line":"            return pluginbase.HealthcheckResult("}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_47765fda","line":67,"updated":"2020-05-29 14:34:44.000000000","message":"We cache these in all the API workers so we don\u0027t have to look them up each time. I definitely don\u0027t think we should make this *more* heavyweight. It does mean that adding a new cell will go unnoticed for some amount of time, but that\u0027s the same for the whole of the API, and adding cells is exceptionally rare (an order of magnitude or two less than adding nodes). 
People already have to HUP their APIs to notice.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"34b8ca322f5808fb8504468ae9688750e75766af","unresolved":false,"context_lines":[{"line_number":64,"context_line":"        cell0_up \u003d False"},{"line_number":65,"context_line":"        reason \u003d \"DB Healthy: Nova API + cell databases are UP\""},{"line_number":66,"context_line":"        try:"},{"line_number":67,"context_line":"            cells \u003d cell_mapping_obj.CellMappingList.get_all(admin_ctxt)"},{"line_number":68,"context_line":"        except db_exc.DBError:"},{"line_number":69,"context_line":"            LOG.warning(\u0027Cannot get cells info from API DB.\u0027)"},{"line_number":70,"context_line":"            return pluginbase.HealthcheckResult("}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_1df4ba86","line":67,"in_reply_to":"ff570b3c_47765fda","updated":"2020-05-29 17:39:30.000000000","message":"thanks, I did not know about that cells list cache. 
I can use load_cells() here and new cell addition is ok as you mentioned that that is rare and once it shows up in the API then checking health is fine","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"cb4a99b8599ca37f06ca8d78f6784dc557e9fc71","unresolved":false,"context_lines":[{"line_number":71,"context_line":"                available\u003dFalse,"},{"line_number":72,"context_line":"                reason\u003d\u0027db Unhealthy: Nova API DB down\u0027)"},{"line_number":73,"context_line":"        results[\u0027DB\u0027][\u0027API\u0027] \u003d \u0027up\u0027"},{"line_number":74,"context_line":"        for cell in cells:"},{"line_number":75,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":76,"context_line":"                if self._check_cell_db(ctxt):"},{"line_number":77,"context_line":"                    results[\u0027DB\u0027][cell.name] \u003d \u0027up\u0027"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_c719cf88","line":74,"updated":"2020-05-29 14:34:44.000000000","message":"In the case of CERN, this will hit 70 databases across multiple WAN links and I think across two international borders. Personally I think this is too heavy to be enabled by default and hit-able by people without any credentials. If I hammer this API, I can directly hammer the databases. 
I imagine I can exhaust the DB worker thread pool with no credentials and DoS the rest of the API, right?","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"34b8ca322f5808fb8504468ae9688750e75766af","unresolved":false,"context_lines":[{"line_number":71,"context_line":"                available\u003dFalse,"},{"line_number":72,"context_line":"                reason\u003d\u0027db Unhealthy: Nova API DB down\u0027)"},{"line_number":73,"context_line":"        results[\u0027DB\u0027][\u0027API\u0027] \u003d \u0027up\u0027"},{"line_number":74,"context_line":"        for cell in cells:"},{"line_number":75,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":76,"context_line":"                if self._check_cell_db(ctxt):"},{"line_number":77,"context_line":"                    results[\u0027DB\u0027][cell.name] \u003d \u0027up\u0027"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_38970c93","line":74,"in_reply_to":"ff570b3c_c719cf88","updated":"2020-05-29 17:39:30.000000000","message":"yeah it is heavy thing. 
What we can do is disable it by default and if anyone wants to check all cells then enable it via plugins.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"}],"nova/api/openstack/healthchecks/mq_health.py":[{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"cb4a99b8599ca37f06ca8d78f6784dc557e9fc71","unresolved":false,"context_lines":[{"line_number":44,"context_line":"        results \u003d {\u0027MQ\u0027: {}}"},{"line_number":45,"context_line":"        at_least_one_cell_up \u003d False"},{"line_number":46,"context_line":"        try:"},{"line_number":47,"context_line":"            mappings \u003d cell_mapping_obj.CellMappingList.get_all(admin_ctxt)"},{"line_number":48,"context_line":"            # filter out the cell0 as there is no rabbitmq running there."},{"line_number":49,"context_line":"            cell_mappings \u003d [cell for cell in mappings"},{"line_number":50,"context_line":"                             if not cell.is_cell0()"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_223b59c8","line":47,"updated":"2020-05-29 14:34:44.000000000","message":"Same concern about listing these each time.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"cb4a99b8599ca37f06ca8d78f6784dc557e9fc71","unresolved":false,"context_lines":[{"line_number":69,"context_line":"                reason\u003d\u0027MQ Unealthy: Nova API MQ down\u0027,"},{"line_number":70,"context_line":"                details\u003dresults)"},{"line_number":71,"context_line":"        # check cells MQ"},{"line_number":72,"context_line":"        for cell in cells:"},{"line_number":73,"context_line":"            results[\u0027MQ\u0027][cell.name] \u003d \u0027down\u0027"},{"line_number":74,"context_line":"            try:"},{"line_number":75,"context_line":"                transport \u003d 
msg_transport.get_transport(CONF,"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_c24c1d78","line":72,"updated":"2020-05-29 14:34:44.000000000","message":"Again, same concern about hitting 70 cells with new connections, in sequence and timeouts.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"34b8ca322f5808fb8504468ae9688750e75766af","unresolved":false,"context_lines":[{"line_number":69,"context_line":"                reason\u003d\u0027MQ Unealthy: Nova API MQ down\u0027,"},{"line_number":70,"context_line":"                details\u003dresults)"},{"line_number":71,"context_line":"        # check cells MQ"},{"line_number":72,"context_line":"        for cell in cells:"},{"line_number":73,"context_line":"            results[\u0027MQ\u0027][cell.name] \u003d \u0027down\u0027"},{"line_number":74,"context_line":"            try:"},{"line_number":75,"context_line":"                transport \u003d msg_transport.get_transport(CONF,"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_38dc2c73","line":72,"in_reply_to":"ff570b3c_c24c1d78","updated":"2020-05-29 17:39:30.000000000","message":"yeah, I can divide the plugins based on cell vs API worker instead of DB, MQ and services. 
Explaining it in next file comment","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"}],"nova/api/openstack/healthchecks/services_health.py":[{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"cb4a99b8599ca37f06ca8d78f6784dc557e9fc71","unresolved":false,"context_lines":[{"line_number":45,"context_line":"        results \u003d {\u0027services\u0027: {}}"},{"line_number":46,"context_line":"        at_least_one_cell_up \u003d False"},{"line_number":47,"context_line":"        try:"},{"line_number":48,"context_line":"            mappings \u003d cell_mapping_obj.CellMappingList.get_all(admin_ctxt)"},{"line_number":49,"context_line":"            # filter out the cell0 as there is no service running there."},{"line_number":50,"context_line":"            cells \u003d [cell for cell in mappings"},{"line_number":51,"context_line":"                             if not cell.is_cell0()]"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_c2753d9a","line":48,"updated":"2020-05-29 14:34:44.000000000","message":"Same concern.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"cb4a99b8599ca37f06ca8d78f6784dc557e9fc71","unresolved":false,"context_lines":[{"line_number":63,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":64,"context_line":"                services_per_cell \u003d objects.ServiceList.get_all("},{"line_number":65,"context_line":"                    ctxt, disabled\u003dFalse)"},{"line_number":66,"context_line":"                for service in services_per_cell:"},{"line_number":67,"context_line":"                    i +\u003d 1"},{"line_number":68,"context_line":"                    # Service status cache can be helpful here at least"},{"line_number":69,"context_line":"                    # for 
CONF.service_down_time"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_e27201ad","line":66,"updated":"2020-05-29 14:34:44.000000000","message":"In the CERN case, you\u0027ve already iterated 70 cells twice for the previous two checks, and are now going to iterate \u003e10k nodes here. On each unauthenticated request to /healthcheck. That seems extreme to me.\n\nAlso, again, I don\u0027t think the healthcheck is really supposed to report on the health of the entire system, right? Isn\u0027t the point of the healthcheck to tell the loadbalancer or k8s that this one API endpoint is not healthy and should be restarted or excluded?","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"6a8b214f852b692f88cda25e62a05fd184aa56f6","unresolved":false,"context_lines":[{"line_number":63,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":64,"context_line":"                services_per_cell \u003d objects.ServiceList.get_all("},{"line_number":65,"context_line":"                    ctxt, disabled\u003dFalse)"},{"line_number":66,"context_line":"                for service in services_per_cell:"},{"line_number":67,"context_line":"                    i +\u003d 1"},{"line_number":68,"context_line":"                    # Service status cache can be helpful here at least"},{"line_number":69,"context_line":"                    # for CONF.service_down_time"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_8d309558","line":66,"in_reply_to":"ff570b3c_18fe8837","updated":"2020-05-29 18:32:53.000000000","message":"\u003e I agree on those points, For auth things, I am waiting for your\n \u003e idea of making it with auth from cache or something.\n\nI don\u0027t have anything super specific, other than that general concept of updating stats generically on authorized requests and 
returning them in the health checks. Here\u0027s one concrete thing we could try:\n\nIn context.target_cell() we yield the targeted context for the cell-targeted work we\u0027re going to do. We could change that bare yield to something like:\n\n     yield cctx\n     CELL_LAST_SUCCESS[cell_mapping.uuid] \u003d time.time()\n\nThat way nearly everything we do that involves talking to cells will go through that, and only update the timestamp of the last success with that cell if we don\u0027t raise in the child scope of the yield. Perhaps we should also record whether or not the last attempt succeeded or failed due to an MQ or DB error. Then in the health check, we can synthesize some health going based on what we know about the last thing we tried to do. If there are no requests at all for a long time, we could report success when indeed things are unhealthy, but as soon as a request comes in and fails, we\u0027ll report unhealthy and the actor can take action.\n\n \u003e For other parts, my idea is to give the health check for complete\n \u003e system also. But enabling that by default is really heavy things as\n \u003e you mentioned.\n\nWell, again, I\u0027d like to hear from people that they really expect this to be a systemic health report and not just that of the API worker. I\u0027d rather implement systemic health somewhere else, where we can just require auth and keep these lightweight unauth\u0027d checks purely to the health of the individual worker being checked. 
In past discussions about this the plan was to even implement a tiny http server on nova-compute so that we could check the health of it from something like k8s or whatever and that would definitely fit with my understanding about how these are to work.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8864,"name":"Artom Lifshitz","email":"notartom@gmail.com","username":"artom"},"change_message_id":"fcf92bc62ad8e7b38a89b5dbcfc2e5dc6b97d670","unresolved":false,"context_lines":[{"line_number":63,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":64,"context_line":"                services_per_cell \u003d objects.ServiceList.get_all("},{"line_number":65,"context_line":"                    ctxt, disabled\u003dFalse)"},{"line_number":66,"context_line":"                for service in services_per_cell:"},{"line_number":67,"context_line":"                    i +\u003d 1"},{"line_number":68,"context_line":"                    # Service status cache can be helpful here at least"},{"line_number":69,"context_line":"                    # for CONF.service_down_time"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_2a896fc9","line":66,"in_reply_to":"ff570b3c_18fe8837","updated":"2020-05-29 17:44:22.000000000","message":"\u003e If we combine the three cell plugins into one then we can save the multiple iterates over cell. The only disadvantage is people would not enable those separately like if I want to enable DB health checks only then it would not be possible.\n\nCan that be made configurable? So we\u0027d iterate only once, and depending on what checks are enabled in the config, verify those things for each cell? 
Or does middleware not have that kind of config flexibility?","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"6a8b214f852b692f88cda25e62a05fd184aa56f6","unresolved":false,"context_lines":[{"line_number":63,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":64,"context_line":"                services_per_cell \u003d objects.ServiceList.get_all("},{"line_number":65,"context_line":"                    ctxt, disabled\u003dFalse)"},{"line_number":66,"context_line":"                for service in services_per_cell:"},{"line_number":67,"context_line":"                    i +\u003d 1"},{"line_number":68,"context_line":"                    # Service status cache can be helpful here at least"},{"line_number":69,"context_line":"                    # for CONF.service_down_time"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_6d4541b4","line":66,"in_reply_to":"ff570b3c_2a896fc9","updated":"2020-05-29 18:32:53.000000000","message":"\u003e Can that be made configurable? So we\u0027d iterate only once, and\n \u003e depending on what checks are enabled in the config, verify those\n \u003e things for each cell? 
Or does middleware not have that kind of\n \u003e config flexibility?\n\nWe don\u0027t need configurability in the middleware, as we provide the code and can check our own config.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"47db6bd1102ca3780bbe17984646613bb8d0f4ad","unresolved":false,"context_lines":[{"line_number":63,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":64,"context_line":"                services_per_cell \u003d objects.ServiceList.get_all("},{"line_number":65,"context_line":"                    ctxt, disabled\u003dFalse)"},{"line_number":66,"context_line":"                for service in services_per_cell:"},{"line_number":67,"context_line":"                    i +\u003d 1"},{"line_number":68,"context_line":"                    # Service status cache can be helpful here at least"},{"line_number":69,"context_line":"                    # for CONF.service_down_time"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_5f38ceec","line":66,"in_reply_to":"ff570b3c_307720d7","updated":"2020-06-01 15:23:10.000000000","message":"thanks Dan, CELL_LAST_SUCCESS is really good idea and may be we can introduce it with configurable time to know how old that cache info is and refresh in healthchecks if needed.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4393,"name":"Dan Smith","email":"dms@danplanet.com","username":"danms"},"change_message_id":"6caa63e5938faadd2a8c99d1f689982fa2d5c8cf","unresolved":false,"context_lines":[{"line_number":63,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":64,"context_line":"                services_per_cell \u003d objects.ServiceList.get_all("},{"line_number":65,"context_line":"                    ctxt, 
disabled\u003dFalse)"},{"line_number":66,"context_line":"                for service in services_per_cell:"},{"line_number":67,"context_line":"                    i +\u003d 1"},{"line_number":68,"context_line":"                    # Service status cache can be helpful here at least"},{"line_number":69,"context_line":"                    # for CONF.service_down_time"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_627b757a","line":66,"in_reply_to":"ff570b3c_5f38ceec","updated":"2020-06-01 15:31:35.000000000","message":"Not, IMHO. If no requests are coming in to refresh the cache, then there really isn\u0027t much point in saying that the service isn\u0027t healthy enough to perform it\u0027s job. As soon as a request comes in that can\u0027t be satisfied due to something being unreachable, the health check would start to fail. I don\u0027t really see any reason not to just let that push the healthcheck results.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"a9a71a336176b81797ab48811363eba56d3ce91e","unresolved":false,"context_lines":[{"line_number":63,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":64,"context_line":"                services_per_cell \u003d objects.ServiceList.get_all("},{"line_number":65,"context_line":"                    ctxt, disabled\u003dFalse)"},{"line_number":66,"context_line":"                for service in services_per_cell:"},{"line_number":67,"context_line":"                    i +\u003d 1"},{"line_number":68,"context_line":"                    # Service status cache can be helpful here at least"},{"line_number":69,"context_line":"                    # for 
CONF.service_down_time"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_6281b5fb","line":66,"in_reply_to":"ff570b3c_627b757a","updated":"2020-06-01 15:49:41.000000000","message":"ok my concern was healthchecks saying it is healthy but when request comes and fail then healthchecks result become false info but yeah if no request for long time or no request started then we can say system is just ideal and healthchecks also not needed?","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":4690,"name":"melanie witt","display_name":"melwitt","email":"melwittt@gmail.com","username":"melwitt"},"change_message_id":"5c8ef9db01188fd34f68e9a600a49fb12bad4cfa","unresolved":false,"context_lines":[{"line_number":63,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":64,"context_line":"                services_per_cell \u003d objects.ServiceList.get_all("},{"line_number":65,"context_line":"                    ctxt, disabled\u003dFalse)"},{"line_number":66,"context_line":"                for service in services_per_cell:"},{"line_number":67,"context_line":"                    i +\u003d 1"},{"line_number":68,"context_line":"                    # Service status cache can be helpful here at least"},{"line_number":69,"context_line":"                    # for CONF.service_down_time"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_307720d7","line":66,"in_reply_to":"ff570b3c_8d309558","updated":"2020-05-29 20:38:29.000000000","message":"+1 to Dan\u0027s points here about the health check being an API worker health check. I have the same interpretation and I like the idea of the CELL_LAST_SUCCESS.\n\nThe initial proposal was an extremely simple \"reply 200\" that was simply telling that wsgi was working, which is the same thing you get when you visit /. 
An API worker can\u0027t fulfill requests if it can\u0027t talk to the DB and MQ, so those were identified as things to include in the health check.\n\nAnd we can do that using Dan\u0027s suggestion of building a DB/MQ health cache in the background that piggybacks on auth\u0027ed requests by end users. That way, when the load balancer calls /healthcheck without auth, all we have to do is check the cache and return OK or NOT OK.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":26458,"name":"Brin Zhang","email":"zhangbailin@inspur.com","username":"zhangbailin"},"change_message_id":"b51fbb661afe4d6a9285483cb3c82e53832ef092","unresolved":false,"context_lines":[{"line_number":63,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":64,"context_line":"                services_per_cell \u003d objects.ServiceList.get_all("},{"line_number":65,"context_line":"                    ctxt, disabled\u003dFalse)"},{"line_number":66,"context_line":"                for service in services_per_cell:"},{"line_number":67,"context_line":"                    i +\u003d 1"},{"line_number":68,"context_line":"                    # Service status cache can be helpful here at least"},{"line_number":69,"context_line":"                    # for CONF.service_down_time"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_6cd243d3","line":66,"in_reply_to":"ff570b3c_8d309558","updated":"2020-06-02 02:33:36.000000000","message":"\u003e Well, again, I\u0027d like to hear from people that they really expect\n \u003e this to be a systemic health report and not just that of the API\n \u003e worker. I\u0027d rather implement systemic health somewhere else, where\n \u003e we can just require auth and keep these lightweight unauth\u0027d checks\n \u003e purely to the health of the individual worker being checked. 
\n\nIf we can keep the systemic health report function, that will be good for use, AFAIK, most administrators want to receive the alarm information automatically reported by the system, so that they can monitor the status of the platform in real time. It would be very friendly if we could keep such an excuse.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"},{"author":{"_account_id":8556,"name":"Ghanshyam Maan","display_name":"Ghanshyam Maan","email":"gmaan.os14@gmail.com","username":"ghanshyam"},"change_message_id":"34b8ca322f5808fb8504468ae9688750e75766af","unresolved":false,"context_lines":[{"line_number":63,"context_line":"            with nova_context.target_cell(admin_ctxt, cell) as ctxt:"},{"line_number":64,"context_line":"                services_per_cell \u003d objects.ServiceList.get_all("},{"line_number":65,"context_line":"                    ctxt, disabled\u003dFalse)"},{"line_number":66,"context_line":"                for service in services_per_cell:"},{"line_number":67,"context_line":"                    i +\u003d 1"},{"line_number":68,"context_line":"                    # Service status cache can be helpful here at least"},{"line_number":69,"context_line":"                    # for CONF.service_down_time"}],"source_content_type":"text/x-python","patch_set":1,"id":"ff570b3c_18fe8837","line":66,"in_reply_to":"ff570b3c_e27201ad","updated":"2020-05-29 17:39:30.000000000","message":"I agree on those points, For auth things, I am waiting for your idea of making it with auth from cache or something.\n\nFor other parts, my idea is to give the health check for complete system also. But enabling that by default is really heavy things as you mentioned.\n\nLet me refactor it and divide into API vs cell based plugins instead of DB/MQ and services. What I can do is:\n\n1. default healthchecks: very lightweight with only check API worker only. It can be used for the load balancer.\n\n2. 
provide new plugins for cells, which can be disabled by default and people can enable those if they need complete health of the system. it can be used for complete system health checks usage.\n   - cell_db_healthchecks\n   - cell_MQ_healthchecks\n   - cell_service_healthchecks\nIf we combine the three cell plugins into one then we can save the multiple iterates over cell. The only disadvantage is people would not enable those separately like if I want to enable DB health checks only then it would not be possible.","commit_id":"501f3a1dc1d16c10ce8753945649da8ffafd4cff"}]}
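Dan's target_cell() suggestion above can be sketched as follows. This is an illustrative stand-in, not Nova's actual code: the names `target_cell`, `CELL_LAST_SUCCESS`, and `cell_is_healthy` and their signatures are simplified for the example (Nova's real target_cell() also swaps the cell's DB connection and MQ transport onto the context; this sketch models only the success bookkeeping added after the bare yield).

```python
import contextlib
import time

# Hypothetical module-level cache, per the suggestion above:
# cell UUID -> wall-clock time of the last successful cell-targeted operation.
CELL_LAST_SUCCESS = {}


@contextlib.contextmanager
def target_cell(context, cell_uuid):
    # Real code would attach the cell's DB/MQ info to the context here.
    yield context
    # Reached only if the body of the 'with' block did not raise, so the
    # timestamp records the last *successful* interaction with the cell.
    CELL_LAST_SUCCESS[cell_uuid] = time.time()


def cell_is_healthy(cell_uuid, max_age=60.0):
    # An unauthenticated /healthcheck hit can consult the cache instead of
    # opening new DB/MQ connections per cell on every request.
    last = CELL_LAST_SUCCESS.get(cell_uuid)
    return last is not None and (time.time() - last) <= max_age
```

A failed request inside the `with` block leaves the timestamp untouched, so a subsequent health check reflects the failure — the "piggyback on auth'ed requests" behavior discussed above, with no extra connections made by the health check itself.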
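Artom's single-iteration question, combined with Dan's point that the plugin code can consult Nova's own config, could look roughly like this sketch. All names here are hypothetical; real code would read oslo.config options rather than take a plain set of enabled check names.

```python
def run_cell_checks(cells, enabled, check_fns):
    """Iterate the cell list once, running only the enabled checks.

    cells:     iterable of cell identifiers
    enabled:   set of check names the operator turned on, e.g. {'db', 'mq'}
    check_fns: mapping of check name -> callable(cell) -> status string
    """
    results = {}
    for cell in cells:
        # One pass over the cells; each enabled check runs against the
        # same cell before moving on, so enabling db+mq+services no
        # longer costs three separate iterations over all cells.
        results[cell] = {name: check_fns[name](cell)
                         for name in enabled if name in check_fns}
    return results
```

This keeps the "enable only DB checks" flexibility Ghanshyam wanted, without the triple iteration over cells that splitting into three plugins would cause.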
