If you are ready to learn cloud networking, AWS networking, or BGP, then this BGP tutorial is for you.
Would you like to learn about cloud networking and BGP? If so, then you are in the right place! I started my career as a network engineer and I have 25 years of experience in networking, and I love BGP, I use it everywhere. In fact, I have spent over 10,000 hours working on BGP, so it is near and dear to my heart.
Introduction to BGP
Our topic today is BGP, which is one of those critical skills for cloud architects. BGP should be part of the network training for everyone working in the cloud computing community because it is the primary routing protocol used in the cloud. We will review BGP, and then we will review why we use BGP, and then we will move on to how BGP works; lastly, we will cover the various ways of tuning BGP. In this first part, we are going to go over what it is and how it works, and towards the end, we are going to go over how to tune BGP for traffic engineering, and that is where the fun begins, but we must get through this to be able to understand that.
What is BGP? BGP is a routing protocol. Routers are network devices, and they have lots of interfaces. Routers have interfaces going in all different directions, and what happens is the router is basically a computer with a bunch of network cards in it. The router builds a map of the network which tells the routers, “To get to this destination, use this interface. To get to this destination, use this interface. To get here, go down there,” and that’s what the router does, it directs your traffic through the network. It makes your traffic go from point A to point B, routers are critical.
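To make that map a little more concrete, here is a minimal Python sketch of a routing table lookup. The prefixes and interface names are made up for illustration, and it assumes the usual rule that the most specific (longest) matching prefix wins. A real router does this in hardware with far more routes.

```python
import ipaddress

# Hypothetical routing table: destination prefix -> outgoing interface.
ROUTES = {
    "10.0.0.0/8": "eth0",
    "10.1.0.0/16": "eth1",
    "0.0.0.0/0": "eth2",  # default route, matches everything
}

def lookup(destination: str) -> str:
    """Return the interface for the most specific (longest) matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [
        ipaddress.ip_network(prefix)
        for prefix in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[str(best)]

print(lookup("10.1.2.3"))  # eth1
print(lookup("10.9.9.9"))  # eth0
print(lookup("8.8.8.8"))   # eth2
```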
Now routers need to know how to get your traffic from point A to point B. Imagine this, in the old days, you had a map that only had one path to the destination, and you just took it. That would be what we would call a static route. A static route is where you manually tell the router how to get to a location. Now, there’s a problem with that. If I only know one way to go somewhere, and the road is blocked, I can’t get there, so that’s no good. So static routing means manually telling a router how to get from point A to point B.
Let’s say I am going to my friend’s house for dinner, let’s use this example to talk about a dynamic routing protocol. In this example, I am going to drive to my friend’s house with GPS (of course in reality I probably would not need to). Along the way one of the roads is blocked and my GPS says, “Rerouting,” and it directs me to take a side road. Then it tells me to take another side road, and it reroutes around the traffic, and I get to my destination. This is essentially what routers do. When we talk about routing protocols, they are basically a way for routers to dynamically learn how to get to every destination and build a map of the network. A routing protocol is like your GPS and is basically a dynamic way for routers to learn what we call network layer reachability information or networks.
These routing protocols help the routers build a map of the network in a dynamic manner, and they can reroute. If you had a static route from one location to another, and the location went down, you are stuck. By adding a dynamic routing protocol you improve availability and performance of your systems. There are two kinds of routing protocols. There are interior gateway routing protocols, and there are exterior gateway routing protocols. Interior gateway routing protocols are used inside of an organization, say, Go Cloud Architects. I have OSPF running on my internal network, which is my interior gateway protocol.
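Here is a toy Python sketch of the static versus dynamic idea, using the driving analogy. The road names are hypothetical stand-ins for network links, and the function is only an illustration of "pick a path that still works", not how any routing protocol is actually implemented.

```python
def pick_path(candidate_paths, failed_links):
    """Return the first candidate path that avoids every failed link, or None."""
    for path in candidate_paths:
        if not any(link in failed_links for link in path):
            return path
    return None

# Hypothetical road names standing in for network links.
static_routes = [["main-st"]]                               # one manually configured path
dynamic_routes = [["main-st"], ["side-rd-1", "side-rd-2"]]  # learned alternatives
blocked = {"main-st"}

print(pick_path(static_routes, blocked))   # None
print(pick_path(dynamic_routes, blocked))  # ['side-rd-1', 'side-rd-2']
```

With only the static route, a blocked road leaves you stuck. With a second learned path, you get the "Rerouting" behavior.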
As an example, I want to connect to AWS, and I am talking about AWS BGP. They have an interior gateway protocol running in their networks too. It is going to be OSPF or Intermediate System to Intermediate System (IS-IS). Service providers have turned on label switching or some tag switching, some form of MPLS, and they have probably turned on some RSVP signaling, and they are also running something called IBGP. But that is for their internal network. They are going to connect to external organizations using BGP, because BGP is an exterior gateway protocol. When organizations connect to other organizations, they use BGP because BGP is designed to connect external organizations. Now, let’s talk a little bit about the BGP protocol.
BGP uses TCP port 179
At the basic level, understand that BGP uses TCP port 179. A TCP connection is established, which means it is unicast. Almost every other routing protocol is multicast, where a router sends a “hello” to every other router; and establishes an adjacency on their own, but not BGP. BGP is unicast, in fact you need to tell BGP who you want it to connect to, unlike the other routing protocols. BGP is unicast, TCP port 179. Why do I keep reiterating TCP port 179? Because, if you have a firewall and BGP is traversing through that firewall, and you do not allow TCP port 179; your BGP session will never be established, and if it is, it will get torn down pretty darn quick.
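As a rough sketch of the firewall point, here is a small Python check that walks a rule list top-down and tells you whether TCP port 179 gets through. The rule format is hypothetical and invented for this example; it is not the syntax of any real firewall.

```python
BGP_PORT = 179

# Hypothetical firewall rules, evaluated top-down: (action, protocol, port).
# A port of None means "any port".
RULES = [
    ("allow", "tcp", 443),
    ("allow", "tcp", 22),
    ("deny", "tcp", None),
]

def is_permitted(rules, protocol, port):
    """Return True if the first matching rule allows the traffic (default deny)."""
    for action, rule_proto, rule_port in rules:
        if rule_proto == protocol and rule_port in (None, port):
            return action == "allow"
    return False

print(is_permitted(RULES, "tcp", BGP_PORT))  # False: BGP would never establish

# Adding an explicit allow for TCP 179 ahead of the deny fixes it.
fixed_rules = [("allow", "tcp", BGP_PORT)] + RULES
print(is_permitted(fixed_rules, "tcp", BGP_PORT))  # True
```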
So, you need to know BGP, TCP port 179. If you are studying for any AWS certification, particularly the AWS Advanced Networking, you may see that as an exam question. Organizations use exterior gateway protocols because they are scalable. BGP is used for the following type of scenario. When I connect to the internet with a big internet-facing router, and I connect to 10 different internet service providers (which is the kind of thing I did my whole career), if each internet service provider is giving me three-quarters of a million routes, that router and its routing information base are going to hold seven-and-a-half million routes.
Think about that, that is how scalable this protocol is, to hold that much information. That is why organizations use this. Now pretend you are AWS or Google using BGP, and you have hundreds of thousands of clients. You need something that is going to scale to that degree. Understand that is why these organizations use BGP. To understand BGP, traffic engineer and do all the cool things we need to do, we need to talk about the messages between them. Now, we are going to get a little bit in the network engineering weeds here because we must. My apologies, I generally love to talk about leadership things, but we have got to be deep tech here, so we are going to do it.
I am going to put on my tech hat, and I will be using it for the day. Let’s review the messages that are exchanged between BGP peers. BGP peers exchange messages, and that is how they know to update the map; you know, like your GPS. There are four messages we will talk about. We are going to talk about the open message, the keep alive message, the update message, and the notification.
Realistically speaking, BGP router one and BGP router two establish a TCP connection and send an open message. In this open message is a negotiation over version number. For example, BGP version four versus version three, they talk about the autonomous system number, and every organization using BGP must have their own autonomous system number that identifies the organization.
For example, I will have one, and AWS will have another, and I will give them my autonomous system number when I connect to them, and they will give me their autonomous system number when they connect to me. It is part of the neighbor relationship. There is something called the hold time, which is how long a router will wait without hearing a keep alive (or an update) from its neighbor before it decides the neighbor is down, and then something called the BGP Identifier, which is the sending router’s ID, a 32-bit value that is usually one of its own IP addresses. Now we are going to talk about keep alives, and a keep alive is a message that’s sent in between two BGP neighbors. It’s a health check. It keeps the TCP connection open.
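If you want to see what those open message fields look like on the wire, here is a minimal Python sketch that packs them using the layout from the BGP-4 specification (RFC 4271). It sends no optional parameters, and the hold time of 90 seconds and the router ID of 10.1.2.3 are assumptions for the example. Treat it as an illustration of the fields, not a working BGP speaker.

```python
import ipaddress
import struct

BGP_VERSION = 4
MSG_OPEN = 1  # message type code for OPEN

def build_open(my_as: int, hold_time: int, bgp_id: str) -> bytes:
    """Build a BGP OPEN message with no optional parameters."""
    body = struct.pack(
        "!BHH4sB",
        BGP_VERSION,                           # version being negotiated
        my_as,                                 # 2-byte autonomous system number
        hold_time,                             # hold time in seconds
        ipaddress.IPv4Address(bgp_id).packed,  # BGP Identifier (router ID)
        0,                                     # optional parameters length
    )
    marker = b"\xff" * 16                      # 16-byte marker, all ones
    length = len(marker) + 2 + 1 + len(body)   # header is 19 bytes
    return marker + struct.pack("!HB", length, MSG_OPEN) + body

msg = build_open(my_as=64523, hold_time=90, bgp_id="10.1.2.3")
print(len(msg))   # 29
print(msg.hex())
```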
Keep Alive Message
Let’s take this out of BGP to a place that most cloud architects are more familiar with, things like load balancers. A load balancer basically takes things in and it splits the load among multiple servers, and the load balancer says to the servers, “Are you there?” And the servers say, “I’m here.” As long as the servers respond, “I’m here,” they stay and they’re used, but if the load balancer reaches out to a server, “Are you there?” And nobody responds, the load balancer determines it’s not there, it gets marked as unhealthy and gets removed from the rotation. BGP does the same thing and it’s done it for the last 30 years.
That is basically the whole point of the keep alive, “Are you there?” And after a number of keep alives are missed, the router gets removed from the neighbor-adjacency, the session is torn down, and all the routes learned are gone because they assume if you don’t respond, that neighbor’s dead, and that’s why it works just like a health check. Now let’s talk about the next kind of message, which is called an update message.
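Here is a tiny Python sketch of that health check logic. The 90 second hold time is an assumed value for illustration (vendors ship different defaults), and the function name is made up.

```python
def session_alive(last_heard: float, now: float, hold_time: float = 90.0) -> bool:
    """Return True while the hold timer has not expired.

    last_heard is when we last received a keep alive (or update) from the neighbor.
    """
    return (now - last_heard) < hold_time

# Neighbor last answered at t=100 seconds.
print(session_alive(last_heard=100.0, now=160.0))  # True: still within the hold time
print(session_alive(last_heard=100.0, now=191.0))  # False: tear the session down
```

Once this returns False, the session is torn down and every route learned from that neighbor is withdrawn.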
So, you have a BGP neighbor established, and you have learned a new route that you have to tell your neighbor about. That is what the update message is. It is basically saying, “I’ve learned a new route,” or “I’ve learned a route went away.” That is the update message.
An update message basically gives you the prefix or the subnet that you are trying to reach, or your aggregate route, or whatever you are trying to reach. It gives you path attributes, things like your next hop, your origin, or your AS path. So that is realistically what is in an update message. The last message we are going to talk about, and then thankfully we will be done with messages, is something called a notification message.
Basically, if something goes wrong, you get a notification message. What are these messages, one more time? A TCP session is established, and an open message is sent between the neighbors, which basically negotiates your parameters. Then a keep alive is a health check, just think of it that way, “Are you there?” Then an update message is like, “I’ve got a new route, and here’s how I learned it.” Or, “I lost a route because the neighbor went down.”
All those things are part of it, and a notification message is, “Hey, something’s wrong. This is not a good thing. The BGP session must be closed.”
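To tie the four messages together, here is a short Python sketch that reads the common BGP message header and tells you which message you are looking at. The type codes and the 19-byte header layout come from the BGP-4 specification (RFC 4271); the function and class names are my own for this example.

```python
import struct
from enum import IntEnum

class MessageType(IntEnum):
    OPEN = 1
    UPDATE = 2
    NOTIFICATION = 3
    KEEPALIVE = 4

def parse_header(data: bytes):
    """Return (length, MessageType) from the 19-byte BGP message header."""
    if len(data) < 19 or data[:16] != b"\xff" * 16:
        raise ValueError("not a valid BGP header")
    length, msg_type = struct.unpack("!HB", data[16:19])
    return length, MessageType(msg_type)

# A keep alive is just the header with nothing after it.
keepalive = b"\xff" * 16 + struct.pack("!HB", 19, MessageType.KEEPALIVE)
length, kind = parse_header(keepalive)
print(length, kind.name)  # 19 KEEPALIVE
```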
BGP Finite State Machine
The next part of our deep, technical discussion on BGP is how BGP forms a neighbor relationship. Now I know this is deep, but this is particularly important in cloud network training and let’s face it, if you want to work in networking and cloud computing, you just need to know BGP. We will have more fun in the later parts.
Source: Cisco Press, https://www.ciscopress.com/articles/article.asp?p=2756480&seqNum=4
When the BGP forms a neighbor relationship we have a couple of states. We have an idle state, a connect state, an active state, an open sent, an open confirmed state, and an established state. I am going to try and get through these as quickly and simply as possible. But the reason we are doing this, is when you establish a BGP neighbor relationship and sometimes things do not work, you must troubleshoot. These are the states that you are going to be looking at when troubleshooting. This is part of the BGP Finite State Machine. If you know this, you will be able to fix it, and if you do not, you will not be able to troubleshoot it.
Idle and Connect States
So idle state. When a BGP router comes online it is idle. Meaning it is waiting for a TCP session to come up, and if anything were to go wrong, it will revert to the idle state. So, if it’s been up for a while and now it is idle state, something went wrong. After it goes from idle state, it goes to connect state, and realistically speaking this is where the TCP connection is forming, and an open message gets sent to the neighbor. Now if all goes right, this is great, what happens is it transitions to open sent and the next phase is what is supposed to happen.
Active and Open Sent States
Now, if anything were to go wrong when you are trying to establish this connection, it is going to transition to something called active state. When it is in active state it will attempt to retry the connection. If it does not make it, it is going to go back to idle state. So, connect state is either going to go well and transition to open sent, or it is not going to go well, and it is going to transition to active state. Idle state is where you begin, connect state is where you go next, and hopefully things go well. Again, if everything worked, the router transitions into open sent, which basically says the open message has been sent, and it is waiting to hear an open message from its neighbor.
Open Confirm State
Also, during this time, if everything goes well and the neighbor’s open message arrives, a keep alive message is sent. Now if things don’t go well, we’ve got a problem, but let’s assume they do. We are going to enter the state of open confirm, and this is where you have received the open message, you have sent your keep alive, and you are waiting for a keep alive back from your neighbor. Of course if anything goes wrong it’s going to transition back to idle. Now we have reached the good part, established state.
When you are established, you know your BGP neighbor relationship is good. Messages are going to be exchanged, and this is good, this is party time. Now if the TCP session gets torn down it is going to transition back to idle. Now you know basically the BGP message types, and we talked about the BGP finite state machine, we have done some network skills training. Now we must talk about some BGP attributes.
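Here is a simplified Python sketch of the finite state machine we just walked through, written as a transition table. The event names are hypothetical labels I made up, and the real state machine in RFC 4271 has many more events and timers, so treat this as a memory aid for troubleshooting rather than the full specification.

```python
# (current state, event) -> next state. Event names are illustrative only.
TRANSITIONS = {
    ("Idle", "start"): "Connect",
    ("Connect", "tcp_ok"): "OpenSent",
    ("Connect", "tcp_fail"): "Active",
    ("Active", "tcp_ok"): "OpenSent",
    ("Active", "tcp_fail"): "Idle",
    ("OpenSent", "open_received"): "OpenConfirm",
    ("OpenConfirm", "keepalive_received"): "Established",
    ("Established", "keepalive_received"): "Established",
    ("Established", "update_received"): "Established",
}

def next_state(state, event):
    """Return the next state; anything unexpected drops the session back to Idle."""
    return TRANSITIONS.get((state, event), "Idle")

def run(events, state="Idle"):
    """Feed a sequence of events through the state machine and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state

print(run(["start", "tcp_ok", "open_received", "keepalive_received"]))  # Established
print(run(["start", "tcp_fail", "tcp_fail"]))                           # Idle
```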
BGP attributes are basically things or characteristics about our route. One of the characteristics is going to be the origin, which describes how the route first got into BGP. If the route was originated from inside the autonomous system, typically by advertising a network that the router already knows about, that’s one type, called IGP. If it was learned from the old Exterior Gateway Protocol (EGP), the predecessor of BGP, that’s another type. And if it got into BGP some other way, that’s the third.
What happens is BGP is going to prioritize the origin based upon what it sees. An origin of IGP is going to be preferred. An origin of EGP is second preferred, although you will rarely see it today because the EGP protocol is obsolete. Now, if the route got into BGP via something else, maybe someone redistributed a connected or static route somewhere and you are not sure, or it does not have complete information, that is going to be called incomplete, and this is the last choice.
Routers will prefer routes with an origin of IGP first, then EGP, and then incomplete. Now when we are looking at BGP, it is a path vector routing protocol. This means it’s going to tell you, for example, that a route traversed AS 65001, 65002, and 65003 before getting to me, AS 6534 (making up AS numbers in my head as I’m preparing this). I will see which autonomous systems the route went through, so that is going to tell me how many AS hops. Generally speaking, routers prefer the path with the least number of autonomous system hops, because it feels closer. Understand that AS path refers to how you learned the route; which autonomous systems did it come from? The shortest AS path is preferred as a rule.
The next thing that we will talk about is the next hop attribute, which is where you are sending your traffic. To reach this destination, go to router 10.1.2.3. That is my next hop IP address, my neighbor adjacency or whatever you would like to call it. This is critically important, because if the next hop is not reachable, even if you learn the route via BGP, it will not be placed in your routing table. Which means you will not be able to reach it. It is super important that the next hop is reachable. And when you are troubleshooting BGP, and you see the route in BGP, but it is not in your routing table, chances are your next hop is not reachable. This happens all the time.
Another attribute to mention is weight. Weight was a Cisco proprietary means to influence your outbound traffic for forever. But it is now supported by AWS and some other organizations as well. Weight is another attribute to manipulate your traffic. For routers that support the weight attribute, the higher the weight the more preferred the route.
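As a quick sketch of that troubleshooting check, here is some Python that only keeps the BGP routes whose next hop can be resolved through the routing table. The route dictionaries and the sample prefixes are assumptions for the example, not output from a real router.

```python
import ipaddress

def next_hop_reachable(next_hop, routing_table):
    """Return True if some prefix in the routing table covers the next hop."""
    addr = ipaddress.ip_address(next_hop)
    return any(addr in ipaddress.ip_network(prefix) for prefix in routing_table)

def installable(bgp_routes, routing_table):
    """Keep only the BGP routes whose next hop is reachable."""
    return [r for r in bgp_routes if next_hop_reachable(r["next_hop"], routing_table)]

# Hypothetical data: we only know how to reach 10.1.2.0/24.
routing_table = ["10.1.2.0/24"]
bgp_routes = [
    {"prefix": "172.16.0.0/16", "next_hop": "10.1.2.3"},
    {"prefix": "172.17.0.0/16", "next_hop": "192.0.2.1"},  # next hop not reachable
]

print([r["prefix"] for r in installable(bgp_routes, routing_table)])  # ['172.16.0.0/16']
```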
BGP Path Selection
The last thing we are going to talk about before we get to the tuning fun is how BGP selects a path. The BGP path selection process is as follows. BGP prefers the path with the largest weight. If the weights are equal, prefer the route with the highest local preference; that’s the next thing in the algorithm. Now if the local preferences are the same, prefer the route that was originated locally on the router. If neither route was originated locally, then prefer the route with the shortest AS path.
And if the AS path lengths are the same, now prefer the path with the lowest origin code. This is where we talk about IGP being better than EGP, which is better than incomplete. Now if the origin codes are the same, then prefer the route with the lowest MED, or multi-exit discriminator. And if the MEDs are the same, then prefer an EBGP route over an IBGP route. Now at this point we’re getting far in the process and BGP just has to find some way to choose a preferred path, so we’re going to get into some funny factors now.
If the routes are still equal, prefer the route with the shortest path to the BGP next hop. This is determined by the lowest IGP metric, meaning how far it is to get to the next hop IP address. And if the routes are still equal, prefer the route that you learned first, meaning the oldest one. This improves stability and keeps things from flapping up and down. And then now we are getting into funny business: if all the routes are still equal, prefer the route that you learned from the router with the lowest IP address.
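The whole selection process can be sketched in Python as one sort key, compared field by field in order. The field names and default values here are my own assumptions for illustration, and this is simplified: for example, real routers normally compare MED only between routes from the same neighboring autonomous system, and vendors differ in the final tie breakers.

```python
import ipaddress
from dataclasses import dataclass

# Lower rank is better: IGP beats EGP beats incomplete.
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}

@dataclass
class Route:
    name: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: tuple = ()
    origin: str = "igp"
    med: int = 0
    ebgp: bool = True
    igp_metric: int = 0
    age_seconds: int = 0
    neighbor_ip: str = "0.0.0.0"

def preference_key(r: Route):
    """Smaller tuples win; 'higher is better' fields are negated."""
    return (
        -r.weight,                                 # 1. largest weight
        -r.local_pref,                             # 2. highest local preference
        not r.locally_originated,                  # 3. locally originated first
        len(r.as_path),                            # 4. shortest AS path
        ORIGIN_RANK[r.origin],                     # 5. lowest origin code
        r.med,                                     # 6. lowest MED
        not r.ebgp,                                # 7. EBGP over IBGP
        r.igp_metric,                              # 8. closest next hop
        -r.age_seconds,                            # 9. oldest route
        int(ipaddress.ip_address(r.neighbor_ip)),  # 10. lowest neighbor IP address
    )

def best_path(routes):
    """Return the single most preferred route."""
    return min(routes, key=preference_key)

top = Route("top", local_pref=200, as_path=(64523, 64523))
bottom = Route("bottom", local_pref=100, as_path=(64523,))
print(best_path([top, bottom]).name)  # top: local preference is checked before AS path
```

Notice that the top route wins even with the longer AS path, because local preference comes earlier in the list. That ordering is exactly what we lean on when tuning.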
Halfway Point, get ready for the fun part
So, we have covered a lot so far and all of this is networking for cloud computing. Whether it is AWS BGP, or GCP BGP, these are critical cloud architect skills, and I view network skills training for the cloud architect to be some of the most important things to know. Why? Because the cloud is a virtualized network in a data center, so you must know networking. If you want to be a cloud infrastructure architect, this becomes incredibly important.
To recap, we have covered what BGP is, why BGP is used, and how BGP works. We reviewed the protocol, the finite state machine, the BGP message types, and how BGP selects a path. In this next part, we are going to talk about optimizing BGP. But we are going to do a couple of quick refreshers to make sure that everybody still remembers the previous material.
Key Items to Remember
Remember that BGP is a TCP based path vector protocol. This means it sets up a TCP session and it lets you know the path or the number of autonomous systems that are traversed. BGP uses TCP Port 179. It is important to remember that because you need to make sure that the firewalls permit TCP Port 179. BGP is used due to its scalability and tunability, and that is what we are going to talk about next. BGP is always used to connect to external organizations. Because when you are on the cloud, for example, with AWS BGP, you are connecting to AWS first. They are not inside of your organization. They will therefore be a different autonomous system, and that is what we use in BGP. Because we are connecting to external organizations.
Let’s rehash the path selection process. Then we get into the fun of tuning BGP based upon the tenets of this path selection process. What is the path selection process? First and foremost, choose the route with the largest weight. And if the weights are equal, choose the route with the largest local preference. And then, choose the route that was originated on the router, meaning it originated here versus being learned somewhere else. Then choose the path with the shortest number of autonomous system hops, and we will show you how you can tune that. Then choose the path with the lowest ORIGIN code, meaning IGP versus EGP versus incomplete. Then choose the path with the lowest MED, and we’ll show you how we do that. And then, it goes to things like choosing an EBGP route over an IBGP route, because it is typically preferable. And then we get into some older things where we are talking about the neighbor with the lowest metric and lowest IP address. That is where we are going to stop here.
You must have something that can win the election to provide the best path, and that is why we have all these metrics. Now, we are going to talk about the things that we are typically going to be tuning to improve BGP performance. Now, let’s say you have two connections to AWS. If you have a direct connection, theoretically their direct connections are highly available. And what they mean by that is you are connecting to a virtual router and an availability zone, but this is great.
What happens if your direct connection fails, then you have a problem. Or if your router fails, you have another problem. So, when you have real high availability needs, you are going to use two direct connections and then VPN backup, and you are going to be running BGP on both direct connections. Now, you could block one path and only use one, which they typically teach on basic networking things like in the AWS advanced networking curriculum. But if you are going to be a really good cloud architect, and you’re used to doing design, you should be able to load-share across multiple links, bring better performance, better systems and not have any type of out of order packets. We are going to teach you how you do that.
Now, BGP routing is bi-directional. What do I mean by bi-directional? I must tell my upstream neighbors about me, and they must tell me about them. If they do not tell me about them, I will not be able to reach them. And even if they can reach me, my traffic cannot go back to them. With BGP, we are always talking about bi-directional routing. Now, in the examples we are going to use, we are only going to show it from one side. Why are we going to show it from one side? We want to make it as simple as possible. We must start somewhere. So, let’s take a particular example.
Leaking Specific Route
The easiest way to load-share across multiple links is to do it with routing outside of BGP. What do I mean by this? If you look at the diagram that we show you below, note that we have two links. You will see that the organization’s data center is using the CIDR range of 172.16.0.0/15. Now realistically speaking, that is multiple sub-nets that we can place on here. Let’s say for example, we have two routers in our data center, and on the top router, we tell the cloud provider, “To reach the 172.16.0.0/16 sub-net, take the top link.” And then, what if we tell the cloud provider on the bottom link, “To reach 172.17.0.0/16, use the bottom link”? What have we done? We created a specific route on the top and a specific route on the bottom. And guess what, the top link will be used to reach 172.16.0.0/16, and the bottom link will be used to reach 172.17.0.0/16, because we leaked those routes.
Everything will work perfectly until we lose a router, or until we lose a link. Then, whichever sub-net was only advertised on the failed link will not be reachable. What do you do when you create an environment like this? You do the following: As you can see in this diagram, what I do is I send one specific route on each link, and I send a summary route for the CIDR range on both links. And what will happen is the CIDR range is not going to be used while both links are up, because all the traffic is going to be using the more specific routes. And if a specific route goes away, traffic will fall back to the CIDR range. The point here is that on the top link, we have a specific one, and on the bottom link, we have a specific one, and on both we have an aggregate or summary address, which provides reachability for everything.
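Here is a small Python sketch of that design from the cloud provider’s point of view. It assumes the usual longest-prefix-match rule, one specific /16 per link, and the /15 summary on both; the link names "top" and "bottom" just mirror the description.

```python
import ipaddress

# What each link advertises: one specific route plus the summary.
ADVERTISED = {
    "top": ["172.16.0.0/16", "172.16.0.0/15"],
    "bottom": ["172.17.0.0/16", "172.16.0.0/15"],
}

def choose_link(destination, links_up):
    """Pick the live link advertising the longest prefix that covers destination."""
    addr = ipaddress.ip_address(destination)
    candidates = []
    for link in links_up:
        for prefix in ADVERTISED[link]:
            net = ipaddress.ip_network(prefix)
            if addr in net:
                candidates.append((net.prefixlen, link))
    if not candidates:
        return None
    return max(candidates)[1]

print(choose_link("172.16.5.1", ["top", "bottom"]))  # top
print(choose_link("172.17.5.1", ["top", "bottom"]))  # bottom
print(choose_link("172.16.5.1", ["bottom"]))         # bottom, via the /15 summary
```

With both links up, each /16 pulls its own traffic. When the top link is gone, the summary on the bottom link keeps 172.16.0.0/16 reachable.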
So, look carefully, 172.16.0.0 and 172.17.0.0, both /16s, can be summarized into 172.16.0.0/15. By sending that summary on both links, reachability is always going to work. If the top link goes away, the bottom link will still have its more specific route and the summary route, so everything is reachable. This is typically the simplest and most elegant way of using BGP to load-share across redundant links.
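You can double check the summarization with Python’s standard ipaddress module. This is just a sanity check on the arithmetic, not anything a router runs.

```python
import ipaddress

specifics = [
    ipaddress.ip_network("172.16.0.0/16"),
    ipaddress.ip_network("172.17.0.0/16"),
]

# collapse_addresses merges adjacent networks into the smallest covering set.
summary = list(ipaddress.collapse_addresses(specifics))
print(summary)  # [IPv4Network('172.16.0.0/15')]

# Both specifics sit inside the summary.
print(all(net.subnet_of(summary[0]) for net in specifics))  # True
```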
Now, the next simplest thing is to adjust the weight (reference graphic below). What does this really mean? If you are on the receiving end, as you take in routes, you can increase the weight. And if you increase the weight for a certain sub-net, that will become the primary path. So, what do we do here? On the AWS side, we increase the weight on the top link for its more specific sub-net, and we increase the weight on the bottom link for its more specific sub-net. And on the summary routes, we kept the weight even. Therefore, we prioritized 172.16.0.0/16 traffic going on the top link, and 172.17.0.0/16 traffic going on the bottom link, as you can see in this diagram. And if either link or router were to go away, no big deal; the summary route or the aggregate route, or the CIDR range will take care of your network layer reachability information.
Now let’s say we do not want to use the weight and want to do it another way. We could take these incoming routes, match them, and then raise the local preference on one side for a route that we want to take and do the same thing for the other route. So, look at what we do in this diagram below. In this diagram, we take our primary sub-net on the top link, the 172.16.0.0/16 that we want to use, and we change the local preference; we increase it to 200. And we do the same thing on the bottom link for the 172.17.0.0/16; we raise the local preference for that specific sub-net. We keep the standard local preference for everybody else. And that way, more specific sub-nets are going to be used, and when one of those links goes, your traffic will take the aggregate route; simple, elegant, effective.
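Here is a minimal Python sketch of the weight approach. It assumes both /16s are received on both links so there is something to compare, and the weight values of 200 and 100 are made up; only their relative order matters.

```python
# Routes as received on the cloud side: (prefix, link, weight).
# Weight values are illustrative assumptions.
RECEIVED = [
    ("172.16.0.0/16", "top", 200),
    ("172.16.0.0/16", "bottom", 100),
    ("172.17.0.0/16", "top", 100),
    ("172.17.0.0/16", "bottom", 200),
    ("172.16.0.0/15", "top", 100),
    ("172.16.0.0/15", "bottom", 100),
]

def preferred_link(prefix, routes):
    """Return the link whose route for prefix carries the highest weight."""
    candidates = [(weight, link) for p, link, weight in routes if p == prefix]
    return max(candidates)[1] if candidates else None

print(preferred_link("172.16.0.0/16", RECEIVED))  # top
print(preferred_link("172.17.0.0/16", RECEIVED))  # bottom

# If the top link fails, its routes disappear and the bottom link takes over.
after_failure = [r for r in RECEIVED if r[1] != "top"]
print(preferred_link("172.16.0.0/16", after_failure))  # bottom
```

For the summary route the weights are even, so a real router would fall through to the later steps of path selection to break the tie.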
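The local preference policy can be sketched in Python as a small "match and set" function, similar in spirit to a route policy on a router. The 200 comes from the example above; the default of 100 for everything else is an assumption about what the standard local preference is on your platform.

```python
DEFAULT_LOCAL_PREF = 100  # assumed standard value
RAISED_LOCAL_PREF = 200

# Which specific sub-net we want to prefer on each link.
PREFERRED = {"top": "172.16.0.0/16", "bottom": "172.17.0.0/16"}

def local_pref_for(prefix, link):
    """Raise local preference for the sub-net we want on this link, default otherwise."""
    if PREFERRED.get(link) == prefix:
        return RAISED_LOCAL_PREF
    return DEFAULT_LOCAL_PREF

def best_link(prefix, links=("top", "bottom")):
    """Return the link with the highest local preference for this prefix."""
    return max(links, key=lambda link: local_pref_for(prefix, link))

print(best_link("172.16.0.0/16"))                    # top
print(best_link("172.17.0.0/16"))                    # bottom
print(best_link("172.16.0.0/16", links=("bottom",)))  # bottom, once the top link is gone
```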
Prepending AS Path
I mentioned BGP as a path vector protocol. What happens is an autonomous system number is added to the path every time a route traverses an internet service provider, or an autonomous system along the way, or a customer organization. For example, let’s say we have a simple BGP connection between our data center and the cloud. And, for example, our data center’s autonomous system is 64523. Let’s say that is our autonomous system number. When the cloud is going to receive it, they are only going to receive one hop, the autonomous system number 64523, and they are going to receive that on both links, top and bottom. But what if we wanted to make one route look ugly and another route look good? We can do that. We would take the route that we choose not to use, and we would make it look less attractive by prepending, or adding another autonomous system number to the AS path.
What if we just prepend it, if our AS is 64523, and we add in or prepend another 64523, as you can see, we did in the graphic below? Now, that link looks extra hard to reach, for example. That is why we do these kinds of things. We make one look really good, and we make another link look really bad. Then we are always in a position where we can provide packets getting in the correct order, getting to their destination and load-sharing.
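Here is what prepending looks like as a quick Python sketch, using the example autonomous system number 64523. The helper name is made up, and the comparison only looks at AS path length, which assumes weight and local preference are equal on both links.

```python
LOCAL_AS = 64523

def prepend(as_path, asn, times=1):
    """Return a new AS path with asn added to the front 'times' times."""
    return [asn] * times + list(as_path)

top_path = [LOCAL_AS]                         # advertised as-is on the top link
bottom_path = prepend([LOCAL_AS], LOCAL_AS)   # made to look farther away

print(bottom_path)  # [64523, 64523]

# The cloud side prefers the shorter AS path.
paths = {"top": top_path, "bottom": bottom_path}
print(min(paths, key=lambda link: len(paths[link])))  # top
```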
The last way that we are going to show you how to do this is by manipulating the MED or Multi Exit Discriminator. Realistically speaking, we can lower the MED and by lowering the MED, we can make a path preferred. As you can see on this diagram below, we reduced the MED for the primary route we want to pass, 172.16.0.0/16. And the summary route, 172.16.0.0/15, we kept the same, at a MED of a hundred. By doing that, the 172.16.0.0/16 will be preferred on the top link. Now on the bottom link, we took the 172.17.0.0/16 and we reduced the MED. So, that is going to be the preferred link. And on the aggregate route or the other side, we have not changed anything. By doing this, we made it very clear, preferred path, backup path.
BGP is a marvelous routing protocol. People use BGP because it is tunable. You can easily engineer traffic, just like we showed you how to do it. We showed you a few ways to do it. We showed you how to do it by leaking a more specific route. We showed you how to tune BGP by manipulating the weight. We showed you how to tune BGP by changing the local preference. We showed you how to change BGP, or optimize its routing, by prepending an autonomous system path, and we showed you how to do it by changing the MED. We have covered a lot about AWS BGP and BGP routing protocol in general, as well as BGP for cloud computing and all kinds of cloud networking concepts in the article.
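A minimal Python sketch of the MED approach follows. The baseline MED of 100 comes from the example, the reduced value of 50 is an assumption, and it assumes both /16s are advertised on both links so the cloud side has two MEDs to compare.

```python
# MED advertised on each link for each prefix. Lower is preferred.
MED = {
    "top": {"172.16.0.0/16": 50, "172.17.0.0/16": 100, "172.16.0.0/15": 100},
    "bottom": {"172.16.0.0/16": 100, "172.17.0.0/16": 50, "172.16.0.0/15": 100},
}

def preferred_link(prefix, links=("top", "bottom")):
    """Return the link advertising the lowest MED for this prefix."""
    return min(links, key=lambda link: MED[link][prefix])

print(preferred_link("172.16.0.0/16"))  # top
print(preferred_link("172.17.0.0/16"))  # bottom
```

Keep in mind that MED sits fairly late in the selection process, so it only decides things when weight, local preference, AS path length, and origin are already equal.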