Monday, August 21, 2017

Distributed SNAT - examining alternatives

Source NAT (SNAT) is a basic cloud network functionality that allows traffic from the private network to go out to the Internet. 

At the time of writing this blog, SNAT still has no agreed-upon mainstream distributed solution for OpenStack Neutron.  While Distributed Virtual Router (DVR) provides a distributed and decentralized solution to local connectivity for floating IP and simplified east-west VM communication, SNAT still needs to be deployed at a network node and remains a traffic bottleneck.

Figure 1

Figure 1 above has two deployed tenants, 'green' and 'orange'. SNAT is performed centrally at the network node.

Possible solutions

There are a number of proposed solutions to decentralize SNAT, each with its own benefits and drawbacks. I am going to dive into two possible solutions that have recently been implemented in the Dragonflow SDN controller.

1. SNAT per {tenant, router} pair

The most straightforward solution is to perform SNAT at the router instance on each compute node.

However, while a DVR deployment can easily copy the router's internal subnet address across compute nodes, the router's IP address on the external network cannot follow this scheme. Such a deployment consumes an extra external address per {tenant, router} pair on every compute node.

Figure 2

Maximum address consumption equals:
   [# of compute nodes] x [# of tenants]

This problem may be somewhat mitigated by allocating the external IP lazily: only when Nova schedules the first VM of a given tenant onto a compute node. In figure 2, the external address for the 'orange' tenant was allocated only when its VM2 was deployed.
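The lazy allocation idea can be sketched in a few lines of Python. This is only an illustration, not the proof-of-concept code: the class name, the node and tenant labels, and the address pool (taken from a documentation range) are all hypothetical.

```python
from ipaddress import IPv4Address


class LazyExternalIpAllocator:
    """Assigns one external IP per (compute node, tenant) pair on first use.

    Hypothetical helper for illustration; not part of Neutron or Dragonflow.
    """

    def __init__(self, pool):
        self._free = list(pool)   # external addresses not yet handed out
        self._assigned = {}       # (node, tenant) -> external IP

    def on_vm_deployed(self, node, tenant):
        """Called when a tenant VM lands on a node; allocates only if needed."""
        key = (node, tenant)
        if key not in self._assigned:
            if not self._free:
                raise RuntimeError("external address pool exhausted")
            self._assigned[key] = self._free.pop(0)
        return self._assigned[key]

    def consumed(self):
        """Number of external addresses currently in use."""
        return len(self._assigned)


# Made-up pool of eight addresses from the 203.0.113.0/24 documentation range.
pool = [IPv4Address("203.0.113.10") + i for i in range(8)]
allocator = LazyExternalIpAllocator(pool)

first_green = allocator.on_vm_deployed("compute-1", "green")
allocator.on_vm_deployed("compute-1", "orange")
allocator.on_vm_deployed("compute-2", "green")
# A second 'green' VM on compute-1 reuses the address allocated earlier.
second_green = allocator.on_vm_deployed("compute-1", "green")
```

With two compute nodes and two tenants the worst case is four addresses, but only three are consumed here because no 'orange' VM has been scheduled on compute-2 yet.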

This model may be appealing for cloud deployments that have a very large pool of external addresses, or deployments going through an additional NAT beyond the cloud edge. 

However, the Neutron database would need to track all additional gateway router ports and external addresses would need to be assigned implicitly from a separate external address pool. 

A proof of concept implementation for this SNAT model, based on the Neutron stable/mitaka branch, can be found here and is being discussed in this post. This implementation makes a few assumptions that would need to be removed in a future enhancement round, such as:
  • Explicit external address allocation per {router, compute node} pair, which requires a client API modification, instead of automated allocation.

2. SNAT per compute node

The second SNAT model we discuss reduces the number of external addresses to a single one per compute node and significantly reduces network resource consumption, while improving latency and bandwidth of Internet-bound traffic.
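To get a feel for the difference, here is a small worked comparison of the worst-case external address consumption of the two models. The deployment size is made up, and one router per tenant is assumed.

```python
def addresses_per_tenant_router(compute_nodes, tenants):
    """Worst case for model 1: every tenant has a VM on every compute node."""
    return compute_nodes * tenants


def addresses_per_compute_node(compute_nodes):
    """Model 2: one shared external address per compute node."""
    return compute_nodes


# Hypothetical deployment size, chosen only for illustration.
nodes, tenants = 100, 50
model_1 = addresses_per_tenant_router(nodes, tenants)   # 100 * 50
model_2 = addresses_per_compute_node(nodes)             # 100
print(f"model 1: {model_1} addresses, model 2: {model_2} addresses")
```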

This model has at least one caveat: when several tenant VMs go out to the Internet via the same IP, one tenant abusing an external service (e.g. gmail, fb) may cause blacklisting of the shared external IP, thus affecting the other tenants who share this IP.

(This problem can be somewhat mitigated by only allowing this capability for "trusted" tenants, while leaving "untrusted" tenants to go via the SNAT node using their pre-assigned external IP.)

Figure 3

In figure 3 we can see that the SNAT rule applied to both tenants masquerades multiple tenant VMs behind a single external address. For returning traffic, reverse NAT restores the tenant IP/MAC information and ensures packets return to their rightful owners.

In order to simplify the visualization of flow, figure 3 shows two routing entities.  In reality this translation could (and probably would) be performed by different routing tables within the same routing function.
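The masquerade and reverse-NAT behaviour can be modelled as a pair of lookup tables. The sketch below is a toy model in plain Python, not how OVS connection tracking is implemented: it ignores the destination address, connection timeouts and port exhaustion, and all names and addresses are hypothetical. It does show why the tenant has to be part of the key: two tenants may use the same private IP and port.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Origin:
    """Identity of the sender before NAT; restored on the way back."""
    tenant: str
    mac: str
    ip: str
    port: int


class ChassisSnat:
    """Toy per-compute-node SNAT: many tenant VMs behind one external IP."""

    def __init__(self, external_ip, first_port=20000):
        self.external_ip = external_ip
        self._ports = count(first_port)   # next free external source port
        self._forward = {}                # Origin -> external source port
        self._reverse = {}                # external source port -> Origin

    def outbound(self, origin):
        """Translate an outgoing connection to (external IP, external port)."""
        if origin not in self._forward:
            ext_port = next(self._ports)
            self._forward[origin] = ext_port
            self._reverse[ext_port] = origin
        return self.external_ip, self._forward[origin]

    def inbound(self, dst_port):
        """Reverse NAT: find the original sender, or None if untracked."""
        return self._reverse.get(dst_port)


snat = ChassisSnat("172.24.4.100")
# Both tenants use the same private IP and source port on purpose.
green = Origin("green", "fa:16:3e:00:00:01", "10.0.0.5", 40000)
orange = Origin("orange", "fa:16:3e:00:00:02", "10.0.0.5", 40000)

green_ext = snat.outbound(green)
orange_ext = snat.outbound(orange)
reply_owner = snat.inbound(orange_ext[1])
```

Even though 'green' and 'orange' collide on private IP and port, each gets its own external source port, so the reply is delivered to the right tenant.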

Dragonflow implementation

The SNAT per compute node model was proposed and implemented in quite an elegant manner within the Dragonflow service plugin of Neutron.

For those of you who are not familiar with its design, Dragonflow runs a tiny local controller on every compute node that manages the local OVS and feeds off a central database and a pub/sub mechanism.  The basic control flow is shown in figure 4 below.

Dragonflow uses OVS (Open vSwitch) as its dataplane implementation, and controls its bridges using OpenFlow. The OVS bridges replace the native Linux forwarding stack.

Figure 4

This is what happens when Nova requests Neutron to allocate a network port for a VM:
  1. The Neutron server writes the new port information to its database and passes the port allocation request to the ML2 plugin.
  2. The Dragonflow ML2 driver writes the newly created Neutron port information into its separate Dragonflow database (not the Neutron DB).
  3. The Dragonflow ML2 driver then publishes a port update to the relevant compute nodes, where Dragonflow local controllers are running, using its own pub/sub mechanism.
  4. The Dragonflow local controller running on the compute node where the VM is scheduled for creation fetches the port information from the Dragonflow database or the published port update event, and passes it to the Dragonflow applications.
  5. Every application that is registered for this specific event (local neutron port created) may insert/update OVS flows. For instance, the L2 application adds an OVS flow to detect the port's MAC address, and marks the packet to be sent to the relevant port attached to the OVS bridge.
  6. The new SNAT application installs a flow that uses OVS's NAT and connection tracking component to NAT, unNAT, and track the NATed connection. These features are available starting from OVS version 2.6.
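Steps 3 to 6 above boil down to an event dispatch: a port update arrives, and every registered application gets a chance to contribute flows. The sketch below mimics that pattern in plain Python. It does not use Dragonflow's actual classes or API; the class, the event name, the port fields and the flow strings (which are descriptive text, not real OpenFlow syntax) are all assumptions made for illustration.

```python
from collections import defaultdict


class LocalController:
    """Stand-in for a per-compute-node controller (hypothetical, simplified)."""

    def __init__(self, chassis):
        self.chassis = chassis
        self._handlers = defaultdict(list)   # event name -> list of app callbacks
        self.flows = []                      # stands in for flows pushed to OVS

    def register(self, event, handler):
        """An application subscribes to an event it cares about."""
        self._handlers[event].append(handler)

    def on_port_update(self, port):
        """Handle a published port update (step 4)."""
        # Only ports bound to this compute node raise the local event.
        if port["chassis"] != self.chassis:
            return
        for handler in self._handlers["local_port_created"]:
            self.flows.extend(handler(port))


def l2_app(port):
    """Step 5: forward packets for the port's MAC to the port."""
    return [f"match dl_dst={port['mac']} -> output to {port['name']}"]


def snat_app(port):
    """Step 6: NAT and track Internet-bound connections from the port."""
    return [f"match in_port={port['name']}, ip -> conntrack commit with source NAT"]


controller = LocalController("compute-1")
controller.register("local_port_created", l2_app)
controller.register("local_port_created", snat_app)

controller.on_port_update(
    {"name": "tap-vm1", "mac": "fa:16:3e:00:00:01", "chassis": "compute-1"})
# A port scheduled on another node is ignored by this controller.
controller.on_port_update(
    {"name": "tap-vm2", "mac": "fa:16:3e:00:00:02", "chassis": "compute-2"})

for flow in controller.flows:
    print(flow)
```

The point of the pattern is that adding SNAT did not require touching the L2 logic: the new application simply registers for the same event and installs its own flows.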

SNAT application configuration

When the new SNAT application is enabled, Dragonflow's configuration has to be modified to reflect the host's 'external IP', i.e. the masquerading address of NATed traffic.

Figure 5

In figure 5 we see the minimal configuration required to enable the SNAT application in Dragonflow:
  • Add ChassisSNATApp to the Dragonflow application list (apps_list)
  • Configure the proper external_host_ip address
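As a rough illustration of the two settings above, the snippet below parses a sample configuration and checks that both are present. Only the names apps_list, ChassisSNATApp and external_host_ip come from this post; the section names [df] and [df_snat_app], the L2App entry and the address are assumptions, and the exact format of apps_list entries may differ between Dragonflow versions, so check the configuration reference for your release.

```python
import configparser
import ipaddress

# Sample configuration text. Section names and values are assumptions.
SAMPLE = """
[df]
apps_list = L2App,ChassisSNATApp

[df_snat_app]
external_host_ip = 172.24.4.100
"""


def check_snat_config(text):
    """Return the configured external host IP, or raise if SNAT is not enabled."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    apps = [app.strip() for app in parser.get("df", "apps_list").split(",")]
    if not any(app.endswith("ChassisSNATApp") for app in apps):
        raise ValueError("ChassisSNATApp is missing from apps_list")

    # ip_address() raises ValueError if the value is not a valid IP address.
    return ipaddress.ip_address(parser.get("df_snat_app", "external_host_ip"))


external_ip = check_snat_config(SAMPLE)
print(f"SNAT enabled, masquerading behind {external_ip}")
```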

Questions and comments are welcome.
