Fixing Ansible error: "SSH Error: data could not be sent to remote host "127.0.0.1". Make sure this host can be reached over ssh"

I was building VM images for Google Cloud with Packer and provisioning them with Ansible. Everything had been working in the morning, but in the afternoon one computer stopped working after I upgraded Ansible with Homebrew. I had a really tough time figuring out why Ansible and Packer ran fine on one computer and not on the other. I was getting the following error:

googlecompute: fatal: [default]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"127.0.0.1\". Make sure this host can be reached over ssh", "unreachable": true}

Investigating the error, I found it could indicate any number of problems, because Ansible masks the underlying SSH error.

After checking the versions of Python, Ansible, and Packer on the two computers, I found one difference. On the computer that wasn’t working, ansible --version listed a config file:

$ ansible --version
ansible 2.3.2.0
  config file = /Users/<user>/.ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.13 (default, Dec 18 2016, 07:03:39) [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]

Opening that file:

$ cat ~/.ansible.cfg
[ssh_connection]
pipelining = True

I moved that file to another location so it wouldn’t be picked up by Ansible.
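For reference, the move was nothing fancier than the following (the .bak name is arbitrary; anywhere outside Ansible’s config search path works):

$ mv ~/.ansible.cfg ~/.ansible.cfg.bak

Rerunning Packer then produced a different error: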

googlecompute: fatal: [default]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added '[127.0.0.1]:62684' (RSA) to the list of known hosts.\r\nReceived disconnect from 127.0.0.1 port 62684:2: too many authentication failures\r\nAuthentication failed.\r\n", "unreachable": true}

I then used the extra_arguments option for the Ansible provisioner to pass [ "-vvvv" ] to Ansible (a sketch of the provisioner block is below), ran this on both computers, and diffed the output. On the working computer, the dynamically generated key was successfully offered after one other local key. On the failing computer, many SSH keys were being tried before the dynamic key, and I was getting locked out before it was ever reached.
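Passing extra arguments through Packer’s Ansible provisioner looks something like this in the template’s provisioners section (playbook.yml is a placeholder for your own playbook):

{
  "type": "ansible",
  "playbook_file": "./playbook.yml",
  "extra_arguments": [ "-vvvv" ]
}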

SSH servers only allow you to attempt to authenticate a certain number of times (six by default). All of your loaded keys will be tried before the dynamically generated key provided to Ansible. If you have too many SSH keys loaded in your ssh-agent, the Ansible provisioner may fail authentication.
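You can see how many identities your agent will offer with:

$ ssh-add -l

Each key the client offers and the server rejects counts as one authentication attempt, so if that list is longer than the server’s MaxAuthTries (six unless its sshd_config says otherwise), the key you actually need may never get a turn.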

Running ssh-add -D unloaded all of the keys from my ssh-agent, which meant the dynamic key Packer was generating was offered first.
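Note that this only removes identities from the running agent; the key files on disk are untouched and can be loaded again later:

$ ssh-add -D
$ ssh-add ~/.ssh/id_rsa    # re-add a key later if needed (path is just an example)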

I hope this is helpful to someone else, and saves you from hours of debugging!

Postscript

I was very confused by seeing that my computer was trying to connect to 127.0.0.1, instead of the Google Cloud Platform VM. My best guess is that Packer/Google Cloud SDK proxies the connection from my computer to the VM.

Postscript 2

A reader writes:

A generally more robust way of handling this is setting IdentitiesOnly yes in the corresponding ssh_config.

This will make ssh only present the identity specified for the connection, while still allowing passphrase caching for many identities in the agent.
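As a sketch, a per-host entry in ~/.ssh/config might look like this (the host pattern and key path are placeholders):

Host example.com
    IdentityFile ~/.ssh/id_rsa_example
    IdentitiesOnly yes

With IdentitiesOnly yes, ssh offers only the identities configured for that host (or passed on the command line), rather than every key the agent happens to hold.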