    Matt Darcy
    @ikonia
    I wouldn’t put money on it being the correct way...
    but that file seems to cause lots of problems if the instance changes in some way or is replaced with the same hostname
    Roi Ezra
    @ezraroi
    hi all, so we are running Consul at heavy scale (around 12K nodes in a cluster). We see something very weird. We have nodes that exist in the nodes catalog but not in the consul members (serf). This is also reflected in the Consul UI: those nodes appear in the UI but have no serf health check. Of course the consul agent is running on those hosts and from the agent's perspective all is fine, although it is not listed in the members of the cluster. Any help would be great, we have been banging our heads against the wall with this for a long time
    Pierre Souchay
    @pierresouchay
    You probably have to deregister those nodes using the catalog deregister call. This kind of issue usually arises when the cluster is too loaded and/or breaks.
    gc-ss
    @gc-ss
    @ezraroi What are the CPU/RAM/pressure metrics of the consul servers?
    Roi Ezra
    @ezraroi
    image.png
    Thanks for your replies.
    @pierresouchay, our cluster is not that loaded from a CPU or memory perspective. Also, calling the catalog deregister is problematic as those nodes are actually running and they think they are part of the cluster. The only valid solution we have found was restarting the node itself, but at our scale checking for this is not trivial. It feels like something is broken in the sync between the serf members and the catalog if we end up in such cases. @gc-ss Attaching CPU load on the server hosts
    Pierre Souchay
    @pierresouchay
    Deregistering the node would not be an issue, the node would register again when anti-entropy triggers (I would say around 10/15 minutes on such a big cluster). If it does not, it means something is broken on the node itself, so restarting the node looks like the only valid option. Nothing weird in those nodes' logs?
    Pierre Souchay
    @pierresouchay
    @ezraroi this could also be an ACL issue if you changed something (but you should have something in the logs in such a case)
    Roi Ezra
    @ezraroi
    @pierresouchay thanks. We are not using ACLs. The anti-entropy should fix the issue even if I don't deregister the node, right? That does not happen either
    Pierre Souchay
    @pierresouchay
    @ezraroi So, probably the agents without health info are broken for some reason... Do you have some logs on those agents? Try requesting the logs in debug mode using the consul monitor command on one of those agents
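To find the affected nodes at this scale, one option is to diff the catalog against the serf member list. The sketch below assumes the `/v1/catalog/nodes` and `/v1/agent/members` HTTP endpoints are reachable on a local agent at `http://127.0.0.1:8500` (an assumed address) and that no ACL token is needed, as stated above. The function names are made up for illustration.

```python
import json
import urllib.request


def fetch_json(base_url, path):
    """GET a Consul HTTP API path and decode the JSON body."""
    with urllib.request.urlopen(base_url + path) as resp:
        return json.load(resp)


def catalog_only_nodes(catalog_nodes, serf_members):
    """Return the sorted node names found in the catalog but not in serf.

    catalog_nodes: decoded response of /v1/catalog/nodes (objects with "Node").
    serf_members:  decoded response of /v1/agent/members (objects with "Name").
    """
    catalog_names = {entry["Node"] for entry in catalog_nodes}
    member_names = {member["Name"] for member in serf_members}
    return sorted(catalog_names - member_names)


def report(base_url="http://127.0.0.1:8500"):
    """Print every node that is registered in the catalog but unknown to serf."""
    nodes = fetch_json(base_url, "/v1/catalog/nodes")
    members = fetch_json(base_url, "/v1/agent/members")
    for name in catalog_only_nodes(nodes, members):
        print(name)
```

Calling `report()` against a live agent lists the candidates. Whether to deregister or restart them is still the judgment call discussed above.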
    John Spencer
    @johnnyplaydrums
    Hey folks - does anyone know how the hexadecimal value at the beginning of the service name that's returned from a dig SRV is generated? Is there any way to know this value from a service running in consul (via env var, or some deterministic method)? For example when you dig for an SRV record, you're given a dns name like ac1f8802.addr.dc1.consul. Where does that ac1f8802 come from, and is it possible to know that from within a service running in consul?
    Michael Aldridge
    @the-maldridge
    @johnnyplaydrums that's an IP address
    172.31.136.2
    John Spencer
    @johnnyplaydrums
    How can I go from IP address to the value? @the-maldridge
    Like if I'm a service and want to know that hostname, it looks like I can derive it from <hex_value>.addr.<dc>.<tld>. Is there an easy way to go from IP address -> hex value?
    gc-ss
    @gc-ss
    Math. You can try it out at: https://www.browserling.com/tools/ip-to-hex
    John Spencer
    @johnnyplaydrums
    I love Math.
    Thank you sir
    🙏
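The conversion is just the four IPv4 octets written as two hex digits each. A minimal sketch, where the function names and the `dc1` / `consul` defaults are assumptions taken from the example hostname above:

```python
import ipaddress


def ipv4_to_addr_label(ip):
    """Hex-encode the four octets of an IPv4 address, two digits per octet."""
    return ipaddress.IPv4Address(ip).packed.hex()


def addr_hostname(ip, datacenter="dc1", domain="consul"):
    """Build a <hex_value>.addr.<dc>.<tld> name for an IPv4 address."""
    return f"{ipv4_to_addr_label(ip)}.addr.{datacenter}.{domain}"


print(addr_hostname("172.31.136.2"))  # ac1f8802.addr.dc1.consul
```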
    Willi Schönborn
    @whiskeysierra
    I didn't find anything in the docs, so I'm asking here. Does the transparent proxy support Consul's own DNS as well, instead of Kubernetes DNS? We're running multiple clusters, so Kubernetes DNS won't do any good for us. But we do have routeable pod IPs, which means two pods from different clusters can talk to one another.
    Spencer Owen
    @spuder
    Is there an automated way to upgrade consul_intentions < 1.9 to the new consul_config_entry syntax with terraform? I have hundreds of consul intentions and changing all these by hand is going to take forever.
    Example
    # This was correct in version 2.10.0
    resource "consul_intention" "database" {
      source_name      = "api"
      destination_name = "db"
      action           = "allow"
    }
    
    # This is now the correct configuration starting version 2.11.0
    resource "consul_config_entry" "database" {
      name = "db"
      kind = "service-intentions"
    
      config_json = jsonencode({
        Sources = [{
          Action     = "allow"
          Name       = "api"
          Precedence = 9
          Type       = "consul"
        }]
      })
    }
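One way to avoid hand-editing is to script the regrouping: each old intention is a (source, destination, action) triple, and the new syntax wants one entry per destination with a list of sources. The sketch below only shows that grouping step. It assumes the triples have already been extracted from the old resources, copies `Precedence = 9` and `Type = "consul"` from the example above (which may only fit exact service names), and uses a made-up function name.

```python
import json


def group_intentions(intentions):
    """Group (source, destination, action) triples by destination.

    Returns {destination: {"Sources": [...]}}, where each value has the
    shape passed to config_json in the example above.
    """
    entries = {}
    for source, destination, action in intentions:
        entry = entries.setdefault(destination, {"Sources": []})
        entry["Sources"].append({
            "Action": action,
            "Name": source,
            "Precedence": 9,   # copied from the example; an assumption
            "Type": "consul",  # copied from the example
        })
    return entries


old = [("api", "db", "allow"), ("web", "db", "deny"), ("web", "api", "allow")]
for destination, config in group_intentions(old).items():
    print(destination, json.dumps(config))
```

The output still has to be written into `consul_config_entry` resources and the Terraform state reconciled, which this sketch does not attempt.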
    10 replies
    johnny101
    @johnny101:matrix.org
    When running consul connect in Nomad with an envoy sidecar, the consul agent and envoy sidecar container stderr logs show the gRPC permission-related errors below. Anyone familiar with this or how to debug it?
    # From consul agent on the host (log level is trace):
    agent.envoy.xds: Incremental xDS v3: xdsVersion=v3 direction=request protobuf="{ "typeUrl": "type.googleapis.com/envoy.config.cluster.v3.Cluster"
    agent.envoy.xds: subscribing to type: xdsVersion=v3 typeUrl=type.googleapis.com/envoy.config.cluster.v3.Cluster
    agent.envoy.xds: watching proxy, pending initial proxycfg snapshot for xDS: service_id=_nomad-task-6227f408-bee9-77fa-529f-924164f42b80-group-api-count-api-9001-sidecar-proxy xdsVersion=v3
    agent.envoy.xds: Got initial config snapshot: service_id=_nomad-task-6227f408-bee9-77fa-529f-924164f42b80-group-api-count-api-9001-sidecar-proxy xdsVersion=v3
    agent.envoy: Error handling ADS delta stream: xdsVersion=v3 error="rpc error: code = PermissionDenied desc = permission denied"
    
    # From envoy stderr in the envoy sidecar container (log level is trace):
    DeltaAggregatedResources gRPC config stream closed: 7, permission denied
    gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed
    gRPC update for type.googleapis.com/envoy.config.listener.v3.Listener failed
    1 reply
    Daniel Hix
    @ADustyOldMuffin
    getting IO timeouts on consul snapshot restore, any ideas? the port 8300 is open and I can hit it from the leader container
    3 replies
    SBeard
    @etacalpha
    Has anyone set up a mongodb atlas connection via terminating gateway?
    12 replies
    Michael Aldridge
    @the-maldridge
    @blake is there a recommendation anywhere for how to distribute certificates to consul servers when running immutably?
    6 replies
    Gaurav Shankar
    @gauravshankarcan_gitlab
    having an issue: "agent.server.memberlist.wan: memberlist: Failed to resolve consul-consul-server-1.dc1/2605::::::8302: lookup 2605:::::::8302: no such host". The issue is that there are no brackets around the IPv6 address, like [2605:::]8302. How do I introduce this in the WAN lookup? The environment is an OpenShift IPv6 cluster
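For reference, the host:port form that the resolver needs wraps an IPv6 literal in brackets before the port is appended. The sketch below only illustrates that format. It is not a Consul setting, the function name is made up, and `2001:db8::1` is a documentation address standing in for the elided one above.

```python
import ipaddress


def join_host_port(host, port):
    """Join host and port, bracketing the host when it is an IPv6 literal."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_v6 = False  # not an IP literal, for example a DNS name
    return f"[{host}]:{port}" if is_v6 else f"{host}:{port}"


print(join_host_port("2001:db8::1", 8302))  # [2001:db8::1]:8302
print(join_host_port("10.0.0.5", 8302))     # 10.0.0.5:8302
```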
    1 reply
    kkbe
    @kkbe

    hello. I have a working mesh gateway with wan federation. from both datacenters I can curl /v1/catalog/services?dc=<other-dc> and see the services running there and "consul members -wan" shows servers in both dcs
    however, services themselves (e.g. the socat example) cannot connect between the DCs
    The only errors I see in the consul logs are on the secondary DC where there are lots of warnings:
    Err :connection error: desc = "transport: Error while dialing dial tcp <internal ip of server in primary dc>:8300: i/o timeout"

    I outlined the issue here https://discuss.hashicorp.com/t/unable-to-connect-services-between-datacenters-despite-working-mesh-gateways/28721
    I would really appreciate any help as I'm completely stuck

    1 reply
    ryan-omni3
    @ryan-omni3
    Hi. What Java library are developers using now for accessing consul? consul-client has not had a release in a while.
    Matt Darcy
    @ikonia
    my home test lab (all running consul 1.10.1) is having an odd problem with one node - it seems to never truly join the cluster properly. I’ve just done a force-remove and then a join which didn’t error; however, the node is filled with errors / problems, and I cannot understand the reasoning for the behaviour of this node. The status of the test cluster is as follows.
    Node                  Address             Status  Type    Build   Protocol  DC          Segment
    jake.no-dns.co.uk     10.11.216.234:8301  alive   server  1.10.1  2         bathstable  <all>
    nog.no-dns.co.uk      10.11.216.182:8301  alive   server  1.10.1  2         bathstable  <all>
    wesley.no-dns.co.uk   10.11.216.81:8301   alive   server  1.10.1  2         bathstable  <all>
    anton.no-dns.co.uk    10.11.216.165:8301  alive   client  1.10.1  2         bathstable  <default>
    archer.no-dns.co.uk   127.0.0.1:8301      alive   client  1.10.1  2         bathstable  <default>
    c8test2.no-dns.co.uk  10.11.216.207:8301  alive   client  1.10.1  2         bathstable  <default>
    dukat.no-dns.co.uk    10.11.216.194:8301  alive   client  1.10.1  2         bathstable  <default>
    garak.no-dns.co.uk    10.11.216.160:8301  alive   client  1.10.1  2         bathstable  <default>
    janeway.no-dns.co.uk  10.11.216.116:8301  alive   client  1.10.1  2         bathstable  <default>
    jarvis.no-dns.co.uk   10.11.216.4:8301    alive   client  1.10.1  2         bathstable  <default>
    lcars.no-dns.co.uk    10.11.216.2:8301    alive   client  1.10.1  2         bathstable  <default>
    lemon.no-dns.co.uk    10.11.216.5:8301    alive   client  1.9.6   2         bathstable  <default>
    paris.no-dns.co.uk    10.11.216.64:8301   alive   client  1.10.1  2         bathstable  <default>
    riker.no-dns.co.uk    10.11.216.6:8301    alive   client  1.10.1  2         bathstable  <default>
    ro.no-dns.co.uk       10.11.216.78:8301   alive   client  1.10.1  2         bathstable  <default>
    router.no-dns.co.uk   10.11.216.1:8301    alive   client  1.10.1  2         bathstable  <default>
    tpol.no-dns.co.uk     10.11.216.192:8301  alive   client  1.10.1  2         bathstable  <default>
    my first concern, which I can find no reference to, is why archer.no-dns.co.uk is being referenced on 127.0.0.1 rather than its true IP address like all the other nodes. config.json on all the nodes, including archer, displays the IP linked to the FQDN
    Aug 27 17:22:07 archer consul[1726]: 2021-08-27T17:22:07.419Z [WARN] agent.client.memberlist.lan: memberlist: Was able to connect to ro.no-dns.co.uk but other probes failed, network may be misconfigured
    Aug 27 17:22:08 archer consul[1726]: 2021-08-27T17:22:08.419Z [WARN] agent.client.memberlist.lan: memberlist: Was able to connect to riker.no-dns.co.uk but other probes failed, network may be misconfigured
    Aug 27 17:22:08 archer consul[1726]: 2021-08-27T17:22:08.969Z [WARN] agent.client.memberlist.lan: memberlist: Refuting a suspect message (from: archer.no-dns.co.uk)
    Aug 27 17:22:09 archer consul[1726]: 2021-08-27T17:22:09.420Z [WARN] agent.client.memberlist.lan: memberlist: Was able to connect to router.no-dns.co.uk but other probes failed, network may be misconfigured
    Aug 27 17:22:10 archer consul[1726]: 2021-08-27T17:22:10.421Z [WARN] agent.client.memberlist.lan: memberlist: Was able to connect to dukat.no-dns.co.uk but other probes failed, network may be misconfigured
    that’s my second concern: the node archer cannot talk to any other node (it can at a network level: it can ping, nc connect on the appropriate ports, as well as telnet to the right ports) and the same from other nodes, they can all talk to it
    my only assumption is that there is some sort of consul network transport problem, as at a network level the connectivity is there
    Matt Darcy
    @ikonia
    that error message normally appears to mean there is no network/firewall connectivity, but I’ve tested this and it is 100% reachable between other nodes
    I’ve no idea why this cluster is being so odd with one node
    one of the cluster leaders has these errors in the consul log - which again makes no sense to me as it suggests the node archer.no-dns.co.uk is not a member of the cluster
    2021-08-27T17:26:16.642Z [WARN] agent.server.memberlist.lan: memberlist: Got ping for unexpected node 'archer.no-dns.co.uk' from=127.0.0.1:8301
    2021-08-27T17:26:17.144Z [WARN] agent.server.memberlist.lan: memberlist: Got ping for unexpected node archer.no-dns.co.uk from=127.0.0.1:56934
    2021-08-27T17:26:17.144Z [ERROR] agent.server.memberlist.lan: memberlist: Failed fallback ping: EOF
    Matt Darcy
    @ikonia
    is there a way to rule out any possible network comms issue? The node that’s failing is a Raspberry Pi, with no software firewall, running on a switch port connected to all the other nodes in that list with one or two minor exceptions, so for most of the nodes there is nothing in between the nodes
    the only thing I can think of is that the dead message is from an earlier time / failure of some sort, and because the message is being refuted it’s stopping it fully joining the cluster, but that doesn’t explain why it’s the only node being referenced on 127.0.0.1
    pablo platt
    @pablopla_twitter
    Is it possible to connect a server in a remote datacenter to the service mesh without federation and a gateway?
    federation will force me to add another Consul cluster and the gateway will be another point of failure
    the remote data center has only a single server and adding a Consul and gateway just for one server is too much overhead
    6 replies
    josuemotte
    @josuemotte
    hello, after a yum update on my environment (pushed by automation) I'm unable to recover my consul cluster. The only errors I'm getting are the following:
    {"@level":"info","@message":"Request cancelled","@module":"agent.http","@timestamp":"2021-08-30T20:11:12.216779Z","error":"No cluster leader","from":"127.0.0.1:52082","method":"GET","url":"/v1/operator/raft/configuration"}
    and
    {"@level":"error","@message":"failed to make requestVote RPC","@module":"agent.server.raft","@timestamp":"2021-08-30T20:11:10.433318Z","error":"EOF","target":{"Suffrage":0,"ID":"bas25486-e913-f023-8493-91a46cab6f0a","Address":"10.10.1.4:8300"}}
    3 replies
    Shai Ben-Naphtali
    @shai
    What is the variable type of discovery_max_stale? Is it a boolean, a string, or an int?
    The docs don't mention this AFAIK
    1 reply
    Shai Ben-Naphtali
    @shai
    Found it in agent/config/config.go
        DiscoveryMaxStale                *string                  `json:"discovery_max_stale" hcl:"discovery_max_stale" mapstructure:"discovery_max_stale"`