    Matt Darcy
    @ikonia
    any suggestions on why this member is flipping in and out of the cluster?
    all the member nodes are configured via the same puppet class and parameters, so I know there are no typos between the configs
    one interesting thing which I had no idea about until now, as I’m reading the logs: after 2 unsuccessful communications from the broken node, I see a 3rd failed attempt on what appears to be a random port
    Jun 21 07:53:55 odo consul[29811]: 2021-06-21T07:53:55.616Z [WARN] agent.client.memberlist.lan: memberlist: Got ping for unexpected node archer.no-dns.co.uk from=127.0.0.1:33718
    Matt Darcy
    @ikonia
    I get what the error messages are saying, mostly, and it makes sense: the broken node is failing the health checks so it drops out of the cluster, and the working node is reporting that it’s getting requests from a non-member of the cluster. But I don’t understand why one node out of 22 is failing the health checks and flip/flopping its status
    the broken node has nothing wrong with it: resources free on cpu/ram/disk, not under load, no sign of the network flapping in any way, etc. From the OS and consul config point of view, everything is fine
    the only other point of interest is that occasionally I get the warning Jun 21 08:00:48 archer consul[1720]: 2021-06-21T08:00:48.500Z [WARN] agent.client.memberlist.lan: memberlist: Refuting a suspect message (from: archer.no-dns.co.uk)
    from what I’ve read this is because the node left the cluster in an unhealthy state - I’m currently looking at how to clean this up to remove confusion
    Matt Darcy
    @ikonia
    the other thing to note, which I have no idea how it is happening: if I do a ‘consul members’ on the nodes, all the nodes are talking using their network interface IP address (eg: on this test LAN, 10.11.216.x)
    however the single broken node shows 127.0.0.1:8301
    eg: odo.no-dns.co.uk 10.11.216.91:8301 alive client 1.9.6 2 bathstable <default>
    paris.no-dns.co.uk 10.11.216.64:8301 alive client 1.9.6 2 bathstable <default>
    picard.no-dns.co.uk 10.11.216.151:8301 alive client 1.9.6 2 bathstable <default>
    archer.no-dns.co.uk 127.0.0.1:8301 failed client 1.9.6 2 bathstable <default>
    why would the ‘bad’ node be referencing 127.0.0.1
    Shantanu Gadgil
    @shantanugadgil

    @ikonia my gut feel (aka psychic debugging :grinning: ) looking at what you have described is "duplicate node" or "duplicate node id".

    Also, the agent going to "127.0.0.1" has happened for me in the past, when the agent lost contact with servers for too long.
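
    A rough sketch of how you could check both hunches from the CLI (the data_dir path below is just an example, use whatever your puppet class sets):

    # list every node the catalog knows about, including its node ID,
    # so a duplicate or stale ID should stand out
    consul catalog nodes -detailed
    # compare against the ID the suspect agent generated locally
    cat /opt/consul/node-id
    # and confirm which address that agent is advertising to the LAN pool
    consul members | grep archer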

    Matt Darcy
    @ikonia
    love some gut feels, I hadn’t looked at duplicates as an idea, thanks
    Shantanu Gadgil
    @shantanugadgil
    the way I debugged this was (rough commands sketched below):
    • shut down the node ...
    • restart Consul servers (so as to cleanup "left" nodes)
    • then wait and watch if the node with the same id comes back
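    Roughly (assuming a systemd-managed agent; substitute your own unit and node names):
    # on the suspect node: stop the agent
    sudo systemctl stop consul
    # on each server: restart, so "left"/failed members get cleaned up
    sudo systemctl restart consul
    # then wait and watch whether a member with the same node ID comes back
    watch -n 5 'consul members'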
    Matt Darcy
    @ikonia
    I’ve sort of done that (for different reasons)
    I’m wondering if I should do a force leave on the node and then see if it rejoins
    Shantanu Gadgil
    @shantanugadgil
    force leave would be worth a try ...
    else "rm -rf" the data dir for consul and reboot the node
    do you have any config magic to set machine node ids? or do you let them be auto generated?
    Matt Darcy
    @ikonia
    auto
    Shantanu Gadgil
    @shantanugadgil
    :+1:
    dhinakar2707
    @dhinakar2707

    Hi Team,

    I was trying to create the TF state in GCP with version 0.15.4, but the terraform validate command is throwing the error below.

    Terraform has been successfully initialized!
    $ terraform validate -var-file=environment/${CI_ENVIRONMENT_NAME}/variables.tfvars

    │ Error: Failed to parse command-line flags

    │ flag provided but not defined: -var-file

    For more help on using this command, run:
    terraform validate -help

    Cleaning up file based variables

    Can someone please help me? Is there something I'm missing, or has something changed in version 0.15.4?

    Shantanu Gadgil
    @shantanugadgil
    what is the output of terraform validate -help ?
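    Your error output already says what changed: terraform validate no longer accepts -var-file, and validation doesn't need variable values anyway. A rough sketch of how I'd split the CI step (the var-file path is copied from your snippet):
    # validate without variable values
    terraform validate
    # keep the var-file for plan/apply, where the values actually matter
    terraform plan -var-file=environment/${CI_ENVIRONMENT_NAME}/variables.tfvars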
    btw: this is the Consul lobby :smiley:
    Michael Aldridge
    @the-maldridge
    @blake as you gaze into your crystal ball, should I wait to do a base image upgrade for another week in the hopes of consul 1.10 becoming GA by then?
    Blake Covarrubias
    @blake
    @the-maldridge We're planning to release Consul 1.10 tomorrow.
    Michael Aldridge
    @the-maldridge
    fantastic, I'll hold off until tomorrow
    Blake Covarrubias
    @blake
    @the-maldridge Consul 1.10 is out. :-)
    Michael Aldridge
    @the-maldridge
    Thanks!
    I'm actually in the middle of a Vault rollout; images were built 4 minutes after the binaries were up on Docker Hub.
    cyan-singularity
    @cyan-singularity
    Has anyone used consul connect with horizontally scaling databases like Cassandra? If each Cassandra node is registered as an instance in consul, and a client calls localhost:connect_port, would this single connection work?
    6 replies
    Yoan Blanc
    @greut
    Hey, playing with gRPC checks, we are scratching our heads as it seems to require grpc.health.v1.Health, ignoring the value we are giving it.
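    For reference, a check of the shape we mean (host, port and names are made up); as far as we can tell, the part after the slash only ends up as the service name inside the standard grpc.health.v1.Health request:
    # /etc/consul.d/web-grpc-check.json -- then run: consul reload
    {
      "check": {
        "name": "web-grpc",
        "grpc": "127.0.0.1:9090/web",
        "grpc_use_tls": false,
        "interval": "10s"
      }
    }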
    11 replies
    greg-hunt1
    @greg-hunt1
    I am getting a lot of metrics read errors in my Consul on K8s with Connect and ACLs enabled. What is the best way to handle this? Should I set up Prometheus to run inside the mesh?
    6 replies
    raunak2004
    @raunak2004
    How do I validate whether IPv6 is configured correctly for the Consul cluster? I do see a tagged address with the IPv6 address in the nodes call, but does that mean RPC is also ready to communicate over the same address?
    Alex Henning Johannessen
    @ahjohannessen

    I noticed that I sometimes get this in consul logs on consul servers and clients:

    "2021-06-30T08:10:26.559Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {192.168.20.43:8300 0 hp-03.als <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 192.168.20.43:8300: operation was canceled\". Reconnecting..."

    Any suggestion as to what this could be? Otherwise the cluster seems healthy and all nodes are healthy. I use consul 1.10 on all nodes.

    My config for a server node looks like this:
    {
        "acl": {
            "default_policy": "deny",
            "down_policy": "extend-cache",
            "enable_token_persistence": true,
            "enabled": true,
            "token_ttl": "30s",
            "tokens": {
                "agent": "<redacted>",
                "master": "<redacted>",
                "replication": "<redacted>"
            }
        },
        "addresses": {
            "dns": "127.0.0.1",
            "grpc": "127.0.0.1",
            "http": "127.0.0.1",
            "https": "127.0.0.1"
        },
        "advertise_addr": "192.168.20.41",
        "advertise_addr_wan": "192.168.20.41",
        "auto_encrypt": {
            "allow_tls": true
        },
        "bind_addr": "192.168.20.41",
        "bootstrap": false,
        "bootstrap_expect": 3,
        "ca_file": "/etc/consul/certs/consul-agent-ca.pem",
        "cert_file": "/etc/consul/certs/server.pem",
        "client_addr": "127.0.0.1",
        "connect": {
            "enabled": true
        },
        "data_dir": "/data/consul",
        "datacenter": "als",
        "disable_update_check": false,
        "domain": "consul",
        "enable_local_script_checks": false,
        "enable_script_checks": false,
        "encrypt": "<redacted>",
        "encrypt_verify_incoming": true,
        "encrypt_verify_outgoing": true,
        "key_file": "/etc/consul/certs/server-key.pem",
        "log_file": "/var/log/consul/consul.log",
        "log_level": "INFO",
        "log_rotate_bytes": 0,
        "log_rotate_duration": "24h",
        "log_rotate_max_files": 14,
        "performance": {
            "leave_drain_time": "5s",
            "raft_multiplier": 1,
            "rpc_hold_timeout": "7s"
        },
        "ports": {
            "dns": 53,
            "grpc": 8502,
            "http": 8500,
            "https": -1,
            "serf_lan": 8301,
            "serf_wan": 8302,
            "server": 8300
        },
        "primary_datacenter": "als",
        "raft_protocol": 3,
        "recursors": [
            "8.8.8.8",
            "8.8.4.4"
        ],
        "retry_interval": "30s",
        "retry_interval_wan": "30s",
        "retry_join": [
            "192.168.20.41",
            "192.168.20.42",
            "192.168.20.43"
        ],
        "retry_max": 0,
        "retry_max_wan": 0,
        "server": true,
        "tls_min_version": "tls12",
        "tls_prefer_server_cipher_suites": false,
        "translate_wan_addrs": false,
        "ui": false,
        "verify_incoming": true,
        "verify_incoming_https": false,
        "verify_incoming_rpc": false,
        "verify_outgoing": true,
        "verify_server_hostname": true
    Shai Ben-Naphtali
    @shai
    If I have the CheckID, is there a way I can "read" it and see what it's configured to?
    I'm asking because I have a CheckID error with "CheckID XYZ... does not have associated TTL" in the log
    but AFAIK, I've got a timeout set on all my checks
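    I guess I can at least dump the agent's registered checks to see how that check is actually defined (agent address assumed local). From what I read, the "does not have associated TTL" error shows up when a TTL-style update (pass/warn/fail) hits a check that wasn't defined with a ttl, and an http/tcp timeout isn't the same thing as a TTL:
    # list every check the local agent knows about, with its ID, name, type and status
    curl -s http://127.0.0.1:8500/v1/agent/checks | jq .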
    Robert Goldsmith
    @far-blue
    hi all :) I'm seeing some very weird behaviour from my consul cluster where http-based health checks are failing with timeouts. It starts happening after a few minutes of a service being registered and won't stop. Seems to be happening with consul 1.9.6 and 1.10.0. Other kinds of checks seem ok. Anyone got any ideas?
    Shantanu Gadgil
    @shantanugadgil
    @blake @angrycub @anyone_else could you check this and help with suggestions, if any:
    https://discuss.hashicorp.com/t/generic-locker-to-give-identity-to-machines-in-an-aws-asg/26289
    Deni
    @deni64k
    Hi there, I am trying to set up a simple sidecar proxy with Envoy. I've got node-exporter and an HTTP service (name=web) on separate machines. When I start both sidecars, I check with curl whether I can reach node-exporter over the proxy, but Envoy doesn't follow the upstream in connect.sidecar_service. Instead it forwards all HTTP requests to the actual web service.
    When I set up the same configuration on the same machine, it works fine.
    What am I doing wrong?
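    For reference, roughly the shape of what I'm registering on the web machine (names and ports simplified):
    # web.json -- registered with: consul services register web.json
    {
      "service": {
        "name": "web",
        "port": 8080,
        "connect": {
          "sidecar_service": {
            "proxy": {
              "upstreams": [
                {
                  "destination_name": "node-exporter",
                  "local_bind_port": 9101
                }
              ]
            }
          }
        }
      }
    }
    # then, on the web machine, I curl localhost:9101 expecting to reach node-exporter through the mesh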
    Blake Covarrubias
    @blake
    @deni64k Which version of Consul are you using?
    Willi Schönborn
    @whiskeysierra
    With consul connect, is there a way to know which service made a request to my service? TLS is terminated, so I don't get the certificate, but I also don't see any headers being injected which would tell me. Is there any configuration that would allow me to enable e.g. Envoy's XFCC header?
    1 reply
    Shai Ben-Naphtali
    @shai
    Why are there no logs about ACLs? Not even when using -log-level=trace; I'm using v1.8.10
    gc-ss
    @gc-ss
    Does anyone here use HCP Consul as the remote backend for their local (laptop/desktop/raspi) terraform runs?
    Blake Covarrubias
    @blake
    @gc-ss I’m using my local Consul cluster for TF state storage. HCP Consul should work the same. Do you have a particular question or issue about it?
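    The backend block is just Terraform's standard Consul backend, roughly like this (address, scheme and path are placeholders):
    # backend.tf -- then run: terraform init
    terraform {
      backend "consul" {
        address = "127.0.0.1:8500"
        scheme  = "http"
        path    = "terraform/state/my-project"
      }
    }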
    3 replies