Sunday, September 10, 2017

Kubernetes container services at scale with Dragonflow SDN Controller

The cloud native ecosystem is getting very popular, but VM-based workloads are not going away. Enabling developers to connect VMs and containers and run hybrid workloads means shorter time to market, a more stable production environment and the ability to leverage the maturity of the VM ecosystem.

Dragonflow is a distributed, modular and extendable SDN controller that can connect cloud network instances (VMs, containers and Bare Metal servers) at scale. Kuryr allows you to use Neutron networking to connect the containers on your OpenStack cloud. Combining them lets you use the same networking solution for all workloads.

In this post I will briefly cover both Dragonflow and Kuryr, explain how Kubernetes cluster networking is supported by Dragonflow, and provide details about various Kubernetes cluster deployment options.

Introduction

Dragonflow Controller in a nutshell

Dragonflow adopts a distributed approach to solve the scaling issues of large deployments. With Dragonflow the load is distributed to the compute nodes, each running a local controller. Dragonflow manages the network services for the OpenStack compute nodes by distributing network topology and policies to the compute nodes, where they are translated into OpenFlow rules and programmed into the Open vSwitch datapath.
Network services are implemented as Applications in the local controller.
OpenStack can use Dragonflow as its network provider through the Modular Layer 2 (ML2) Plugin.

Kuryr

Project Kuryr uses OpenStack Neutron to provide networking for containers. With kuryr-kubernetes, the Kuryr project enables native Neutron-based networking for Kubernetes.
Kuryr provides a solution for hybrid workloads, enabling Bare Metal servers, Virtual Machines and containers to share the same Neutron network or to choose different routable network segments.


Kubernetes - Dragonflow Integration

To leverage the Dragonflow SDN Controller as the Kubernetes network provider, we use Kuryr to act as the Container Network Interface (CNI) layer for Dragonflow.


Diagram 1: Dragonflow-Kubernetes integration


The Kuryr Controller watches the K8s API for Kubernetes events and translates them into Neutron models. Dragonflow translates Neutron model changes into a network topology that is stored in the distributed DB, and propagates network policies to its local controllers, which apply the changes to the Open vSwitch pipeline.
The Kuryr CNI driver binds Kubernetes pods on worker nodes to Dragonflow logical ports, ensuring the requested level of isolation.
As you can see in the diagram above, there is no kube-proxy component. Kubernetes services are implemented with the help of Neutron load balancers: the Kuryr Controller translates a Kubernetes Service into a Load Balancer, a Listener and a Pool, and Service endpoints are mapped to members in the Pool. See the following diagram:
Diagram 2: Kubernetes service translation

Currently either Octavia or HAProxy can be used as the Neutron LBaaSv2 provider. In the Queens release, Dragonflow will provide a native LBaaS implementation, as drafted in the following specification.
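For example, exposing a simple Deployment as a Service should show up on the Neutron side roughly as follows (names are illustrative, and the neutron LBaaSv2 CLI commands assume the LBaaS extension is available):

kubectl run demo --image=nginx --replicas=2
kubectl expose deployment demo --port=80 --target-port=80
kubectl get svc demo                  # note the Service's ClusterIP

neutron lbaas-loadbalancer-list       # a load balancer created for the Service VIP
neutron lbaas-listener-list           # a listener for port 80
neutron lbaas-pool-list               # a pool whose members are the pod endpoints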

Deployment Scenarios

With Kuryr-Kubernetes it is possible to run both OpenStack VMs and Kubernetes Pods on the same network provided by Dragonflow, if your workloads require it, or to use different network segments and, for example, route between them. Below you can see the details of various scenarios, including devstack recipes.

Bare Metal deployment

A Kubernetes cluster can be deployed on Bare Metal servers. Logically there are 3 different types of servers.

OS Controller hosts - run the required control services, such as the Neutron server, Keystone and the Dragonflow Northbound Database. Of course, these can be distributed across a number of servers.

K8s Master hosts - run the components that provide the cluster’s control plane. The Kuryr-Controller is part of the cluster control plane.

K8s Worker nodes - run the components needed on every node to maintain running pods and provide the Kubernetes runtime environment.

Kuryr-CNI is invoked by the Kubelet. It binds Pods to the Open vSwitch bridge that is managed by the Dragonflow Controller.


If you want to try a Bare Metal deployment with devstack, you should enable the Neutron, Keystone, Dragonflow and Kuryr components. You can use a local.conf along these lines:
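A minimal local.conf sketch, assuming the dragonflow and kuryr-kubernetes devstack plugins (the service names and options below are illustrative and may differ between releases), could look like this:

[[local|localrc]]
ADMIN_PASSWORD=password
DATABASE_PASSWORD=password
RABBIT_PASSWORD=password
SERVICE_PASSWORD=password
SERVICE_TOKEN=password

# Dragonflow as the Neutron backend
enable_plugin dragonflow https://github.com/openstack/dragonflow
enable_service q-svc
enable_service df-controller
disable_service q-agt
disable_service q-dhcp
disable_service q-l3
Q_ENABLE_DRAGONFLOW_LOCAL_CONTROLLER=True

# Kuryr and the Kubernetes control plane (service names as exposed by the
# kuryr-kubernetes devstack plugin)
enable_plugin kuryr-kubernetes https://github.com/openstack/kuryr-kubernetes
enable_service kuryr-kubernetes
enable_service kubernetes-api
enable_service kubernetes-controller-manager
enable_service kubernetes-scheduler
enable_service kubelet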


Nested (Containers in VMs) deployment

Another deployment option is nested-VLAN, where containers are created inside OpenStack VMs by using the Trunk ports support. The undercloud OpenStack environment has all the components needed to create VMs (e.g., Glance, Nova, Neutron, Keystone, ...), as well as the needed Dragonflow configuration, such as enabling trunk support, which the VM needs so that the containers running inside it can use the undercloud networking. The overcloud deployment inside the VM contains the Kuryr components along with the Kubernetes control plane components.


If you want to try the nested-VLAN deployment with devstack, you can use the Dragonflow Kuryr Bare Metal config with the following changes:
  1. Do not enable the kuryr-kubernetes plugin or the Kuryr-related services, as they will be installed inside the VM.
  2. The Nova and Glance components need to be enabled to be able to create the VM where we will install the overcloud.
  3. The Dragonflow Trunk service plugin needs to be enabled to ensure Trunk ports support.
Then create a Trunk and spawn the overcloud VM on the Trunk port.
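For illustration, with the OpenStack client this could look roughly as follows (network, image and flavor names are placeholders):

openstack port create --network private trunk-parent
openstack network trunk create --parent-port trunk-parent trunk0
openstack server create --image fedora-25 --flavor m1.large \
    --nic port-id=trunk-parent overcloud-vm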
Install the overcloud, following the instructions listed here.


Hybrid environment

A hybrid environment enables diverse use cases where containers, regardless of whether they are deployed on Bare Metal or inside Virtual Machines, are on the same Neutron network as other co-located VMs.
To bring up such an environment with devstack, just follow the instructions in the nested deployment section.

Testing the cluster
Once the environment is ready, we can test that network connectivity works among Kubernetes pods and services. You can check the cluster configuration according to this default configuration guide. You can run a simple example application and verify the connectivity and the configuration reflected in the Neutron and Dragonflow data models. Just follow the instructions to try the sample kuryr-kubernetes application.
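As a quick smoke test, assuming a demo Deployment and Service like the one shown earlier (pod names and IPs will differ):

kubectl get pods -o wide                        # pod IPs are allocated from the Neutron subnet
openstack port list | grep <pod-ip>             # each pod is backed by a Neutron port
kubectl exec -it <pod-name> -- curl http://<service-cluster-ip>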



Monday, August 21, 2017

Openstack-Vagrant - Bringing Vagrant, Ansible, and Devstack together to deploy for developers

Introduction

OpenStack developers in general, and Dragonflow developers in particular, find themselves in need of setting up many OpenStack deployments (for testing, troubleshooting, developing, and what not). Every change requires testing on a 'real' environment.

Doing this manually is impossible. This is a task that must be automated. If a patch is ready, setting up a server to test it should take seconds.

This is where Openstack-Vagrant comes in.

More Details

In essence, Openstack-Vagrant (https://github.com/omeranson/openstack-vagrant) is a Vagrantfile (read: Vagrant configuration file) that sets up a virtual machine, configures it and installs all the necessary dependencies (using Ansible), and then runs devstack.

In effect, Openstack-Vagrant allows you to create a new OpenStack deployment by simply updating a configuration file, and running vagrant up.

Vagrant

Vagrant (https://www.vagrantup.com/) allows you to easily manage your virtual machines. They can be deployed on many hosts (e.g. your personal PC, or several lab servers), with many backends (e.g. libvirt, or virtual box), with many distributions (e.g. Ubuntu, Fedora). I am sticking to Linux here, because that's what's relevant to our deployment.

Vagrant also lets you automatically provision your virtual machines, using e.g. shell scripts or Ansible.

Ansible

Ansible (https://www.ansible.com/) allows you to easily provision your remote devices. It was selected for OpenStack-Vagrant for two main reasons:
  1. It is agent-less. No prior installation is needed.
  2. It works over SSH - out of the box for Linux cloud images.
Like many provisioning tools, Ansible is idempotent - you state the outcome (e.g. file exists, package installed) rather than the action. This way the same playbook (Ansible's list of tasks) can be replayed safely in case of errors along the way.
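For example, an ad-hoc invocation of the package module states an outcome rather than an action; running it a second time, when the package is already installed, simply reports that nothing changed (the inventory name is a placeholder):

ansible all -i inventory.ini --become -m package -a "name=git state=present"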

Devstack

Every developer in OpenStack should know devstack (https://docs.openstack.org/devstack/latest/). That's how testing setups are deployed.

Really In-Depth

Let's review how to set-up an OpenStack and Dragonflow deployment on a single server using Openstack-Vagrant.

  1. Grab a local.conf file. The Dragonflow project has some with healthy defaults (https://github.com/openstack/dragonflow/tree/master/doc/source/single-node-conf). At the time of writing, redis and etcd are gated. I recommend etcd, since it is now an OpenStack base service (https://github.com/openstack/dragonflow/blob/master/doc/source/single-node-conf/etcd_local_controller.conf).
    • wget https://raw.githubusercontent.com/openstack/dragonflow/master/doc/source/single-node-conf/etcd_local_controller.conf
  2. Create a configuration for your new virtual machine. A basic example exists in the project's repository (https://github.com/omeranson/openstack-vagrant/blob/master/directory.conf.yml).
    • machines: 
        - name: one
          hypervisor:
            name: localhost
            username: root
          memory: 8192
          vcpus: 1
          box: "fedora/25-cloud-base"
          local_conf_file: etcd_local_controller.conf
      
  3. Run vagrant up <machine name>
    • vagrant up one
  4. Go drink coffee. You have an hour. 
  5. Once Ansible finishes its thing, you can log into the virtual machine with vagrant ssh, or with vagrant ssh -p -- -l stack to log in directly as the stack user. Once logged in as the stack user, the devstack progress is available in a tmux session.
    • vagrant ssh -p -- -l stack
    • tmux attach

How Can It Be Better? 

There are many ways we can still improve Openstack-Vagrant. Here are some thoughts that come to mind:
  1. A simple CLI interface that creates the configuration file and fires up the virtual machine.
  2. Use templates to make the local.conf file more customisable.

Conclusion

With Openstack-Vagrant, it is much easier to create new devstack deployments. A deployment can be fired off in under a minute, and it will automatically boot the virtual machine, update it, install any necessary software, and run devstack.

Policy based routing with SFC in Dragonflow


One of the coolest new Pike release features in Dragonflow is support for Service Function Chaining. In this post I'll give a short intro on the topic and share some details on how we implemented it, and what it's good for.

A quick intro

A network service function is a resource that (as the name suggests) provides a service, which could be an IDS, a Firewall, or even a cache server (or anything else that works on the network data path).

In a traditional network environment, packets are forwarded according to their destination, i.e. when Server A wants to send a packet to Server B, it puts Server B's address in the packet's destination field. That way, all the switches between the servers know how to forward the packet correctly.




Now consider that you want to steer this traffic through an IDS and a Firewall. Server A will still put Server B as the destination for its packets.  
One popular way to accomplish this is to place A and B within different subnets. This will allow us to use a router to route the packets through our IDS and firewall.




However, such an approach complicates the network quite a bit, requiring every two communicating servers to be placed in separate subnets, and causing all their traffic to go through routers (slower and more expensive).  Moreover, all packets will be routed the same way, even if you only want to apply the IDS to HTTP traffic.  This headache scales with the number of servers, and it can quickly become a configuration hell.


In the SDN world we should be able to do better


SFC introduces the concept of service chains. 

There are two aspects to a service chain:


Classification

What traffic should be served by the chain.  For example, outbound TCP connections with destination port 80 from subnet 10.0.0.0/16


Service path

Which service functions (and in what order) should be applied to the packet.  For example, a firewall, then IDS, then local HTTP cache

With this in mind, a forwarding element is enhanced to handle service function chains, so everything can be deployed in a more intuitive way:





SFC in OpenStack

OpenStack Neutron supports SFC through the networking-sfc extension. This extension provides a vendor-neutral API for defining service function chains. 

The basic model is composed of 4 object types:

PortChain

Represents the whole service function chain.  It is composed of FlowClassifiers and PortPairGroups, where the former specify the subset of traffic to which this port chain applies, and the latter specify what service functions need to be applied.

FlowClassifier

Specifies what packets should enter the specific chain.  The classification is done by matching against the packet's fields.  Some of the fields that can be specified are:

  • Source/destination logical ports
  • IP (or IPv6) source and destination CIDRs
  • Protocol types and port numbers
  • L7 URLs

PortPairGroup

Represents a step in the service function chain.  The model aggregates all port pairs that can be used to perform this specific step.

PortPair

Represents a service function instance.  It specifies what port we need to forward our packet into to apply the service and what port the resulting packet will emerge at.


TCP/80 egress traffic of Server A will go through the port chain above. The blue arrow shows a possible path of classified traffic, the red one the path of unclassified traffic.
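As a rough illustration, a chain like the one above could be defined with the networking-sfc CLI along these lines (port names are placeholders, and option names may vary across networking-sfc client versions):

# the firewall instance: packets enter via fw-in and come back out via fw-out
openstack sfc port pair create --ingress fw-in --egress fw-out fw-pp
openstack sfc port pair group create --port-pair fw-pp fw-ppg

# classify outbound TCP/80 traffic from Server A's logical port
openstack sfc flow classifier create \
    --protocol tcp --destination-port 80:80 \
    --logical-source-port server-a-port fc-http

# tie classification and path together into a chain
openstack sfc port chain create --port-pair-group fw-ppg --flow-classifier fc-http pc1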


What goes on the wire

We solved all our issues a few paragraphs above by adding a mysterious SFC forwarder element.  How does it make sure that packets traverse the correct path? 
Usually, packets that need to be serviced by a service function chain are encapsulated and a service header is added to the packet:




The service header is used to store information needed to steer the packet along the service chain (usually, what chain is performed, and how far along the chain are we). Two of the trending choices for service protocols are MPLS and NSH.

With this metadata on the packet, the forwarder can easily decide where the packet should be sent next. The service functions themselves will receive the packet with the service header and operate on the encapsulated packet.


A packet being classified at the SFC forwarder. If the service function supports service headers, the packet is sent in encapsulated form (right). If the service function does not support service headers, a proxy must be used (left).

The left side of the figure above depicts a service-protocol-unaware function, i.e. a function that expects ingress packets without any encapsulation. Dragonflow's forwarding element will act as a proxy when the function is service unaware.

Dragonflow drivers

In the Pike release we added SFC drivers to Dragonflow; the drivers implement the classification and forwarding elements. The initial version supports:
  • MPLS service chains (the only protocol currently supported by the networking-sfc API)
  • both MPLS-aware and MPLS-unaware service functions

In Dragonflow, we manage our own integration bridge to provide various services in a distributed manner. We implemented service function chaining in the same distributed way: each Dragonflow controller is a fully capable SFC forwarding element, so a packet does not need to travel elsewhere, unless the service function itself is not present on the current node.

Take SFC for a spin

The easiest way to get a working environment with Dragonflow + SFC is to deploy it with devstack. This is the local.conf I used to deploy it:


[[local|localrc]]
DATABASE_PASSWORD=password
RABBIT_PASSWORD=password
SERVICE_PASSWORD=password
SERVICE_TOKEN=password
ADMIN_PASSWORD=password

enable_plugin dragonflow https://github.com/openstack/dragonflow
enable_service q-svc
enable_service df-controller
enable_service df-redis
enable_service df-redis-server
enable_service df-metadata

disable_service n-net
disable_service q-l3
disable_service df-l3-agent
disable_service q-agt
disable_service q-dhcp

Q_ENABLE_DRAGONFLOW_LOCAL_CONTROLLER=True
DF_SELECTIVE_TOPO_DIST=False
DF_REDIS_PUBSUB=True
Q_USE_PROVIDERNET_FOR_PUBLIC=True
Q_FLOATING_ALLOCATION_POOL=start=172.24.4.10,end=172.24.4.200
PUBLIC_NETWORK_NAME=public
PUBLIC_NETWORK_GATEWAY=172.24.4.1

ENABLED_SERVICES+=,heat,h-api,h-api-cfn,h-api-cw,h-eng

ENABLE_DF_SFC=True
enable_plugin networking-sfc git://git.openstack.org/openstack/networking-sfc

IMAGE_URL_SITE="http://download.fedoraproject.org"
IMAGE_URL_PATH="/pub/fedora/linux/releases/25/CloudImages/x86_64/images/"
IMAGE_URL_FILE="Fedora-Cloud-Base-25-1.3.x86_64.qcow2"
IMAGE_URLS+=","$IMAGE_URL_SITE$IMAGE_URL_PATH$IMAGE_URL_FILE

Distributed SNAT - examining alternatives

Source NAT (SNAT) is a basic cloud network functionality that allows traffic from the private network to go out to the Internet. 

At the time of writing this blog, SNAT still has no agreed-upon mainstream distributed solution for OpenStack Neutron.  While Distributed Virtual Router (DVR) provides a distributed and decentralized solution to local connectivity for floating IP and simplified east-west VM communication, SNAT still needs to be deployed at a network node and remains a traffic bottleneck.


Figure 1

Figure 1 above has two deployed tenants, 'green' and 'orange'. SNAT is performed centrally at the network node.


Possible solutions

There are a number of proposed solutions to decentralize SNAT. Each solution has its own benefits and drawbacks. I am going to dive into two possible solutions that have been recently implemented in Dragonflow SDN Controller.


1. SNAT per {tenant, router} pair

The most straightforward solution is to perform SNAT at the compute node's router instance.

However, while a DVR deployment can easily copy the internal subnet router address across compute nodes, the router's IP address on the external network cannot follow this scheme. Such a deployment will consume an extra external address per {tenant, router} pair.


Figure 2

Maximum address consumption equals:
   [# of compute nodes] x [# of tenants]

This problem may be somewhat mitigated by allocating the external IP lazily - only when the first VM of the requested tenant is deployed on a compute node selected by the Nova scheduler.  In figure 2, external address 172.24.4.2 was allocated only when VM2 of the 'orange' tenant was deployed.

This model may be appealing for cloud deployments that have a very large pool of external addresses, or deployments going through an additional NAT beyond the cloud edge. 

However, the Neutron database would need to track all additional gateway router ports and external addresses would need to be assigned implicitly from a separate external address pool. 

A proof-of-concept implementation of this SNAT model, based on the Neutron stable/mitaka branch, can be found here and is discussed in this post. This implementation makes a few assumptions that would need to be removed in a future enhancement round, such as:
  • Explicit external address allocation per {router, compute node} pair, which requires a client API modification, instead of automated allocation.

2. SNAT per compute node

The second SNAT model we discuss reduces the number of external addresses to a single one per compute node, which significantly optimizes network resource consumption while improving latency and bandwidth of Internet-bound traffic.

This model has at least one caveat - when several tenant VMs go out to the Internet via the same IP, one tenant abusing an external service (e.g. Gmail, Facebook) may cause blacklisting of the shared external IP, thus affecting the other tenants who share this IP.

(This problem can be somewhat mitigated by only allowing this capability for "trusted" tenants, while leaving "untrusted" tenants to go via the SNAT node using their pre-assigned external IP).

Figure 3

In figure 3 we can see that the SNAT rule implemented for both tenants masquerades multiple tenant VMs behind a single external address. On the returning traffic, reverse NAT restores the tenant IP/MAC information and ensures packets return to their rightful owners.

In order to simplify the visualization of flow, figure 3 shows two routing entities.  In reality this translation could (and probably would) be performed by different routing tables within the same routing function.


Dragonflow implementation

The SNAT per compute node model was proposed and implemented in a quite elegant manner within the Dragonflow service plugin of Neutron.

For those of you who are not familiar with its design, Dragonflow runs a tiny local controller on every compute node, which manages the local OVS and feeds off of a central database and a pub/sub mechanism.  The basic control flow is shown in figure 4 below.

Dragonflow uses OVS (Open vSwitch) as its dataplane implementation, and controls its bridges using OpenFlow. The OVS bridges replace the native Linux forwarding stack.
Figure 4
This is what happens when Nova requests Neutron to allocate a network port for a VM:
  1. Neutron server writes the new port information in its database and passes the port allocation request to the ML2 Plugin
  2. The Dragonflow ML2 Driver writes the newly-created Neutron port information into its separate Dragonflow database (not the Neutron DB)
  3. The Dragonflow ML2 Driver then publishes a port update to the relevant compute nodes, where Dragonflow local controllers are running, using its own pub/sub mechanism 
  4. The Dragonflow local controller running on the compute node where the VM is scheduled for creation fetches the port information from the Dragonflow database or the published port update event, and passes it to the Dragonflow applications
  5. Every application that is registered for this specific event (local neutron port created) may insert/update OVS flows. For instance, the L2 application adds an OVS flow to detect the port's MAC address, and marks the packet to be sent to the relevant port attached to the OVS bridge.
  6. The new SNAT application installs a flow that uses OVS's NAT and connection tracking component to NAT, unNAT, and track the NATed connection. These features are available starting from OVS version 2.6.
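To make item 6 concrete, here is a hedged sketch of what such conntrack-NAT flows look like in OVS (simplified, using numeric ports and OVS 2.6+ syntax; this is not the actual Dragonflow pipeline):

# outbound VM traffic (port 1): commit the connection and masquerade it
# behind the compute node's external address, then send it out port 2
ovs-ofctl add-flow br-int \
    "in_port=1,ip,actions=ct(commit,zone=1,nat(src=172.24.4.100)),2"

# returning traffic (port 2): run it through conntrack so the NAT is
# reversed, then re-enter the pipeline at table 0
ovs-ofctl add-flow br-int \
    "in_port=2,ip,ct_state=-trk,actions=ct(table=0,zone=1,nat)"
ovs-ofctl add-flow br-int \
    "in_port=2,ip,ct_state=+trk,ct_zone=1,actions=1"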

SNAT application configuration

When the new SNAT application is enabled, Dragonflow's configuration has to be modified to reflect the host's 'external IP', i.e. the masquerading address of NATed traffic.

Figure 5

In figure 5 we see the minimal configuration required to enable the SNAT application in Dragonflow:
  • Add ChassisSNATApp to Dragonflow application list (apps_list)
  • Configure proper external_host_ip address
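A sketch of the corresponding configuration fragment (the exact application entry name, option names and configuration section may differ between Dragonflow releases):

[df]
# append the SNAT application to the applications already configured
apps_list = <existing apps>,ChassisSNATApp
# the address used to masquerade NATed traffic leaving this compute node
external_host_ip = 172.24.4.2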


Questions and comments are welcome.
