TL;DR: After upgrading an AKS cluster with Virtual Nodes, you need to disable and re-enable the Virtual Nodes add-on to restore communication within the cluster.

Microsoft has done an excellent job implementing Kubernetes through Azure Kubernetes Service (AKS). Version upgrades are well managed: upgraded nodes are added automatically so workloads can transition with zero downtime. The process is simple and generally seamless, but the upgrade to 1.15.5 caused a networking failure with Virtual Nodes.

After the upgrade, Virtual Nodes immediately lost communication within the cluster via private DNS names. Traffic failed in both directions: from standard nodes to virtual nodes and from virtual nodes to standard nodes. The virtual nodes still had full internet access and could reach the standard cluster nodes via public addresses.

I started under the assumption that this was a DNS issue. I opened a Bash shell in a virtual node pod that was running Ubuntu:

kubectl exec -it {POD NAME} -n {NAMESPACE} -- /bin/bash

Once connected, I tested DNS using a simple ping:

ping svc1.namespace1.svc.cluster.local

It was unable to resolve the host. Next, I installed dnsutils and tried nslookup:

apt-get update && apt-get install -y dnsutils
nslookup svc1.namespace1.svc.cluster.local

It failed to resolve. Then I queried the DNS server directly (10.0.0.10 was the cluster DNS server, which is the default on AKS):

nslookup svc1.namespace1.svc.cluster.local 10.0.0.10

It was unable to communicate with the DNS server. It then occurred to me that this was a routing issue, not a DNS issue. To validate the theory, I attempted to connect to the private IP addresses of other pods in the cluster, and every attempt failed.
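For anyone repeating this check, here is a minimal sketch of a reachability test that can be run from inside the pod. The function name check_tcp is my own invention for illustration, and it assumes the pod has bash and the coreutils timeout command available (both are normally present in an Ubuntu image). It uses bash's built-in /dev/tcp, so nothing extra needs to be installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any tool): report whether a TCP connection
# to host:port can be opened within 3 seconds.
check_tcp() {
    local host="$1" port="$2"
    # bash treats /dev/tcp/HOST/PORT as a TCP socket; opening it attempts a connection.
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "reachable: ${host}:${port}"
    else
        echo "unreachable: ${host}:${port}"
        return 1
    fi
}
```

For example, check_tcp 10.0.0.10 53 probes the cluster DNS address from this post over TCP, and the same call with another pod's private IP and port tests pod-to-pod routing. If IP addresses are unreachable as well as names, the problem is probably routing rather than DNS.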

Solution:

After performing an AKS upgrade, disable and then re-enable the Virtual Nodes add-on. Among other things, this installs the current version of the aci-connector-linux deployment.
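To see which connector image is running before and after cycling the add-on, something like the following may help. This is a sketch, not a verified procedure: the function name is hypothetical, and it assumes the aci-connector-linux deployment lives in the kube-system namespace, which may differ on your cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the container image(s) used by the
# aci-connector-linux deployment.
# Assumption: the deployment is in the kube-system namespace.
show_aci_connector_image() {
    kubectl get deployment aci-connector-linux -n kube-system \
        -o jsonpath='{.spec.template.spec.containers[*].image}'
}
```

Running show_aci_connector_image once before disabling the add-on and once after re-enabling it should show whether the image actually changed.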

To disable the add-on via the Azure CLI:

az aks disable-addons \
    --resource-group {RESOURCE GROUP NAME} \
    --name {CLUSTER NAME} \
    --addons virtual-node \
    --subscription {SUBSCRIPTION ID}

Then, re-enable it:

az aks enable-addons \
    --resource-group {RESOURCE GROUP NAME} \
    --name {CLUSTER NAME} \
    --addons virtual-node \
    --subnet-name {Virtual Node Subnet Name} \
    --subscription {SUBSCRIPTION ID}
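If you expect to do this after every upgrade, the two commands can be wrapped in a small function. The sketch below uses the same az flags as the commands above. The function name and argument order are my own, and it assumes the Azure CLI is installed and already logged in.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: cycle the Virtual Nodes add-on off and back on.
# Arguments: resource group, cluster name, virtual node subnet name, subscription ID.
reset_virtual_node_addon() {
    local rg="$1" cluster="$2" subnet="$3" sub="$4"

    # Stop here if disabling fails, so the enable step never runs on a
    # cluster in an unknown state.
    az aks disable-addons \
        --resource-group "$rg" \
        --name "$cluster" \
        --addons virtual-node \
        --subscription "$sub" || return 1

    az aks enable-addons \
        --resource-group "$rg" \
        --name "$cluster" \
        --addons virtual-node \
        --subnet-name "$subnet" \
        --subscription "$sub"
}
```

Call it as reset_virtual_node_addon "{RESOURCE GROUP NAME}" "{CLUSTER NAME}" "{Virtual Node Subnet Name}" "{SUBSCRIPTION ID}", substituting your own values.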

Problem solved!

Have you experienced a similar problem?  I'd love to hear your thoughts and solutions.