    Matt Darcy
    @ikonia
    the container complains ==> failed to parse /consul/config/config.json: invalid character ',' after top-level value
    I don’t see an invalid placement of a comma, especially when the same config is working on non-container deployments running 1.10 (and now 1.10.1)
    what am I missing? it looks like a JSON error rather than an actual consul error, but I’ve not touched the configs between versions
    (sorry should put that config in a gist to make it easier to read)
    Matt Darcy
    @ikonia
    got to be a user error I’m just not seeing
    Shantanu Gadgil
    @shantanugadgil
    @ikonia could you try cat file.json | jq . to see if that passes.
    I do see that the ending } is missing.
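
    (A minimal illustration, not Matt’s actual config: a stray comma after the closing brace of the top-level object leaves a second “value” behind, which is what both error messages complain about, and jq catches it the same way consul does.)

        $ cat config.json
        {"datacenter": "dc1", "server": true},
        {"ports": {"http": 8500}}

        $ jq . config.json              # fails with: parse error: Expected value before ','
        $ consul agent -config-file=config.json
        ==> failed to parse config.json: invalid character ',' after top-level value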
    Matt Darcy
    @ikonia
    I can, thank you
    odd, @shantanugadgil parse error: Expected value before ',' at line 1, column 224
    I don’t see where a value is missing, and I don’t get how this has ‘broken’ after no change
    (also never used jq before - so thanks for that)
    Matt Darcy
    @ikonia
    sorted, thank you, it was the ordering, jq was the saviour, great tip, thank you
    need to research if/when/how that file got changed
    Shantanu Gadgil
    @shantanugadgil
    @ikonia glad to help! 👍
    gc-ss
    @gc-ss

    Interested in hearing your thoughts

    https://discuss.hashicorp.com/t/how-to-use-prepared-query-in-retry-join/26943

    Apart from the way a prepared query can be accessed over DNS, is there a more direct way of putting an expression in retry_join so that it uses a prepared query, in the same style that "provider" expressions work in retry_join?

    I want to try using a prepared query to self-discover consul peers in retry_join (assuming the consul server resolving the prepared query knows about said peers).
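
    (For reference, the two styles being compared; the tags and query name below are placeholders, and the second form is only a sketch of the indirect DNS route mentioned above, not a confirmed retry_join feature.)

        # cloud auto-join "provider" expression, as supported today
        retry_join = ["provider=aws tag_key=consul-role tag_value=server"]

        # indirect route via the prepared query's DNS interface
        # (assumes a prepared query named "consul-servers" and that the host can resolve *.query.consul)
        retry_join = ["consul-servers.query.consul"]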

    sgtang
    @sgtang
    Hi all, we had Consul Connect working with Vault as an external CA for a few weeks, up until this weekend: on Saturday our proxies stopped working, displaying 'Connection Refused' after the initial request. We can't get the minimal Nomad Consul Connect "countdash" example working, which was fine previously. We haven't had any configuration changes, the Vault CA periodic token is still valid, and it seems like certs are still being generated and assigned to new proxies. We also can't find anything immediately obvious in metrics or logs, so we're wondering if anyone has experienced something similar? We are on Consul 1.9.6, Nomad 1.0.1 and Envoy 1.16.4
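
    (A few hedged starting points rather than a known fix; the endpoints are the standard agent HTTP API, and "count-api" is just the countdash service name used as a placeholder.)

        # what CA provider/config and trusted roots the cluster currently sees
        consul connect ca get-config
        curl http://127.0.0.1:8500/v1/agent/connect/ca/roots

        # the leaf cert the agent would hand to a sidecar for a given service
        curl http://127.0.0.1:8500/v1/agent/connect/ca/leaf/count-api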
    Matt Darcy
    @ikonia
    I’ve just rebuilt my consul test cluster: a basic 3-node quorum set with around 20 members. One of the members is a Synology NAS box with a consul 1.10.1 container running as a cluster member (it’s an interesting member to have). The 3 consul raft servers are all showing an error from the IP address of the NAS box running the consul member container
    Jul 21 10:05:35 jake consul[7781]: 2021-07-21T10:05:35.282Z [ERROR] agent.server.rpc: unrecognized RPC byte: byte=71 conn=from=10.11.216.64:52022
    Jul 21 10:07:12 nog consul[5613]: 2021-07-21T10:07:12.173Z [ERROR] agent.server.rpc: unrecognized RPC byte: byte=71 conn=from=10.11.216.64:41866
    (from two different nodes - jake and nog)
    I’ve stopped the consul container on the NAS box, and the errors persist
    I can’t get a clear picture in my head of what’s happening here, the logs show me that the connection is coming from 10.11.216.64 (the NAS box) - how is this happening with the consul agent not running?
    secondly, I don’t understand what the RPC error actually is, as an RPC error this generic could mean many things
    just for completion the cluster leader is also getting the same error
    Jul 21 10:10:53 wesley consul[15814]: 2021-07-21T10:10:53.570Z [ERROR] agent.server.rpc: unrecognized RPC byte: byte=71 conn=from=10.11.216.64:52784
    Matt Darcy
    @ikonia
    on the NAS I can even do a ‘consul leave’ to gracefully leave the cluster, and the 3 cluster raft servers are still getting the RPC error in their logs
    Matt Darcy
    @ikonia
    ahhhhh I see the problem
    it’s prometheus running on the same node, querying the consul servers on the wrong port, consul is expecting RPC and prometheus is scraping with a standard TCP request, which is why it shows up as unrecognised RPC
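
    (Side note on the symptom: byte=71 is ASCII 'G', i.e. the start of an HTTP GET arriving on the server RPC port 8300. Below is a sketch of the corresponding Prometheus fix, using Consul's default HTTP port rather than anything confirmed from Matt's setup.)

        scrape_configs:
          - job_name: 'consul'
            metrics_path: '/v1/agent/metrics'
            params:
              format: ['prometheus']
            static_configs:
              # scrape the HTTP API (8500), not the server RPC port (8300);
              # the agents also need telemetry { prometheus_retention_time = "60s" } in their consul config
              - targets: ['jake:8500', 'nog:8500', 'wesley:8500']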
    sgtang
    @sgtang
    Hi all, we've been testing out consul connect ca set-config to rotate between Vault CA endpoints gracefully. The issue is that while existing proxies work fine during the rotation process, new proxies can't seem to reference the new CA bundle until the Consul leader is restarted and an election is forced. Restarting the leader immediately after setting the config causes old proxies to break for a few minutes, however, so this isn't an option. Has anyone dealt with this before? We are on Consul 1.9.6, Envoy 1.16.4.
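
    (For context, the shape of the rotation step being described; the address, token and PKI mount paths below are placeholders, not sgtang's real values.)

        $ cat vault-ca.json
        {
          "Provider": "vault",
          "Config": {
            "Address": "https://vault.example.com:8200",
            "Token": "<periodic token>",
            "RootPKIPath": "connect-root-v2",
            "IntermediatePKIPath": "connect-intermediate-v2"
          }
        }

        $ consul connect ca set-config -config-file=vault-ca.json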
    David
    @david-antiteum
    Hi all, after upgrading a Linux server from U16 to U18 (without touching consul) I'm getting the error: Node name XXX is reserved by node YYY. I have tried a number of things (leave, force-leave, using the API to deregister and then register with the new ID..) without result. Is there any way to set the new ID in the cluster? Anything else I could try? We are using consul 1.8.4. Thanks!
    Matt Darcy
    @ikonia
    what’s u16/u18 ?
    David
    @david-antiteum
    Ubuntu 16 -> Ubuntu 18
    Matt Darcy
    @ikonia
    does the node ID file show the same node ID as the conflict?
    (did you not want to move to Ubuntu 20.04?)
    David
    @david-antiteum
    where is the node-id file?
    We cannot upgrade to Ubuntu 20 :(
    Matt Darcy
    @ikonia
    in your datadir there is a file node-id
    David
    @david-antiteum
    Yes, same ID. The node-id file has the new ID.
    Would stopping consul and editing node-id with the old value solve the problem?
    David
    @david-antiteum
    Well, I did just that and now the problem is gone :)
    Thanks a lot @ikonia, although I’m not sure if this was the correct way to fix the issue
    Matt Darcy
    @ikonia
    I wouldn’t put money on it being the correct way…
    but that file seems to cause lots of problems if the instance changes in some way or is replaced with the same hostname
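
    (The workaround David used, as a sketch; the data_dir path is a common default, adjust it to your config, and as Matt says this may not be the sanctioned fix.)

        systemctl stop consul
        cat /opt/consul/node-id                              # shows the new, conflicting ID
        echo -n '<node ID the name is reserved by>' > /opt/consul/node-id
        systemctl start consul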
    Roi Ezra
    @ezraroi
    hi all, so we are running consul at heavy scale (around 12K nodes in a cluster). We are seeing something very weird: we have nodes that exist in the nodes catalog but not in consul members (serf). This is also reflected in the consul UI, as those nodes appear not to have a serf health check but still show up in the UI. Of course the consul agent is running on those hosts and from the agent's perspective all is fine, although they are not listed in the members of the cluster. Any help would be great, we have been banging our heads against the wall with this for a long time
    Pierre Souchay
    @pierresouchay
    You probably have to deregister those nodes using the catalog deregister call. This kind of issue usually arises when the cluster is too loaded and/or breaks.
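
    (The call Pierre is referring to, with a placeholder node name and datacenter; it goes to any agent's HTTP API.)

        curl --request PUT \
             --data '{"Datacenter": "dc1", "Node": "stale-node-01"}' \
             http://127.0.0.1:8500/v1/catalog/deregister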
    gc-ss
    @gc-ss
    @ezraroi What are the CPU/RAM/pressure metrics of the consul servers?
    Roi Ezra
    @ezraroi
    (image: CPU load graphs for the consul server hosts)
    Thanks for your replies.
    @pierresouchay, our cluster is not that loaded from a CPU or memory perspective. Also, calling the catalog deregister is problematic as those nodes are actually running and they think they are part of the cluster. The only valid solution we have found was restarting the node itself, but at our scale checking for this is not trivial. It feels like something is broken when syncing the serf members and the catalog if we end up in such cases. @gc-ss attaching the CPU load on the server hosts