Proc. of the International Conference on Computer & Communication Engineering 2014 (ICCCE 2014), 23-25 September 2014, Kuala Lumpur, Malaysia
Multiple Routing Configurations for Datacenter Disaster Recovery: Applicability and Challenges

Mohamed A. El-Serafy1, El-Sayed A. El-Badawy1,3,4, Moustafa H. Aly2,3, and Ibrahim A. Ghaleb1,4

1Electrical Engineering Department, Faculty of Engineering, Alexandria University, Alexandria, Egypt
2Arab Academy for Science, Technology and Maritime Transport, Alexandria, Egypt
3Member of the Optical Society of America (OSA)
4Member IEEE
Abstract: Cloud services based on datacenter networks are becoming very important. Datacenters host computing and storage resources, and these resources are served to customers through a network of datacenters. Protecting such a network from disasters like earthquakes, hurricanes and terrorist attacks is crucial. In this paper, we propose the use of the multiple routing configurations (MRC) IP fast reroute recovery process for datacenter disaster recovery. We demonstrate how this recovery scheme can be applied. We also discuss the impact of the MRC recovery process on the post-failure load distribution over network links and how this impact can be minimized. We propose a manual technique to minimize the impact of MRC and compare it to an automatic technique called modified MRC. The advantages and disadvantages of both techniques are outlined. We conclude that a newer technique is needed that overcomes these disadvantages. The new technique should not manipulate link weights to achieve good load distribution.

Keywords: Datacenter networks; disaster recovery; network resilience; multiple routing configurations; IP fast reroute; multi topology routing; network protection.

I. Introduction

The role of the Internet is continuously increasing, and many technical, commercial and business transactions are carried out by millions of users who exploit a set of network applications. An increasing fraction of computing and storage is migrating to a planetary cloud of datacenters. Cloud computing
has emerged as a popular computing environment that provides computing power and storage space for small to large enterprises at minimum cost. The cloud consists of geographically distributed mega datacenters connected by a high-capacity network. In such a network, different services are replicated over multiple datacenters, so that a user request can be served by any datacenter that supports the specified service. To meet the demands set by the high volume of traffic between datacenters, optical networks are ideally suited, given their high bandwidth and low-latency characteristics.

Traditionally, network protection against single-link failures is ensured by providing a backup path to the same destination (i.e., datacenter), which is link-disjoint to the primary path. This scheme has been refined by the introduction of a backup datacenter, thus adding protection against failures of a single datacenter. However, this protection scheme fails to protect against disasters located in an area that contains both primary and backup resources (either network links or datacenters). Multiple failures generally occur due to natural disasters such as earthquakes, hurricanes, tsunamis and tornadoes, or human-made disasters such as weapons of mass destruction (WMD) and electromagnetic pulse
(EMP) attacks. Such disasters affect specific geographic areas; as a result, a set of collocated nodes and links go down simultaneously. For example, the 2011 earthquake and tsunami in Japan and the 2008 Sichuan earthquake in China caused massive damage to telecom networks in large geographic areas. These events indicate that it is crucial to study disaster protection mechanisms for communication networks.

From a networking point of view, disasters like the ones just mentioned have two key characteristics. First, a large number of nodes can be down at the same time. This makes many traditional protection mechanisms unsuitable, since they are often designed to protect against single failures, and most methods focus on link failures only. Second, the failing nodes are geographically near each other, giving poor connectivity in the disaster area.

Given the reduced availability and the increased need for communications near and in the disaster area, a disaster recovery scheme should aim at treating the affected area as isolated from the rest of the network. Since the affected area of the network must be considered unreliable, other parts of the network should not depend on this area for routing or traffic forwarding. At the same time, the remaining resources in the affected area, and between the affected area and the rest of the network, are likely to be scarce and put under heavy pressure. These communication resources should therefore be available for intra-area traffic and traffic originating or terminating in the area, not for transit traffic.

In such networks, path protection against network failures is a must. A link or a node failure is typically followed by a period of routing instability. During this period, packets may be dropped due to invalid routes, and this has an adverse effect on real-time applications. Traditionally, such disruptions have lasted for periods of at least several seconds.
They have been shown to occur frequently and are often triggered by external routing protocols. Recent advances in routers have reduced this interval to under a second for carefully configured networks using link state interior gateway protocols (IGPs). However, new Internet services that are classified as real-time applications are very sensitive to periods of traffic loss.

Addressing these issues is difficult because the distributed nature of the network imposes an intrinsic limit on the minimum convergence time that can be achieved. However, there is an alternative approach, which is to compute backup routes that allow the failure to be repaired locally by the router detecting the failure, without the immediate need to inform other routers of the failure. In this case, the disruption time can be limited to the short time taken to detect the adjacent failure and invoke the backup routes. This approach is called IP fast reroute. Link state IGPs use a reconvergence process that is reactive and global, whereas IP fast reroute is proactive and local in nature.

One of the IP fast reroute approaches is MRC. MRC was built on the concept of resilient routing layers (RRL). Previously, RRL was proposed as a disaster recovery scheme to protect the network against large-scale disasters. In this paper, we propose MRC as a proactive disaster recovery scheme that can be applied to datacenter networks. The impact of the MRC recovery process on the post-failure load distribution over network links, and how this impact can be minimized, is also discussed. We propose a manual technique to minimize the impact of MRC and compare it to an automatic technique called modified MRC. The advantages and disadvantages of both techniques are outlined. We conclude that a newer technique is needed that overcomes these disadvantages. The new technique should not manipulate link weights to achieve good load distribution.

The rest of this paper is organized as follows. In Section II, the idea behind MRC is outlined. In Section III, MRC for datacenter disaster recovery is presented. In Section IV, two techniques used to achieve good load distribution of traffic across network links after a failure while using MRC are compared. Finally, findings are concluded in Section V.
II. IP Fast Reroute using MRC

MRC is a proactive and local IP fast reroute scheme which allows recovery in the range of milliseconds. MRC allows packet forwarding to continue over preconfigured alternative next hops immediately after the detection of the failure. Using MRC as a first line of defense against network failures, the normal IP re-convergence process can be put on hold. This process is then initiated only as a consequence of non-transient failures. MRC guarantees recovery from any single link or node failure, which constitutes a large majority of the failures experienced in a network.

The main idea of MRC is to use the network graph and the associated link weights to produce a small set of backup network configurations. The link weights in these backup configurations are manipulated so that the node that detects the failure can safely forward the incoming packets towards the destination. This is achieved for every link and node failure, regardless of whether the failure is a link or a node failure. MRC assumes that the network uses shortest path routing, such as open shortest path first (OSPF), and destination-based hop-by-hop forwarding.

In a configuration that is resistant to the failure of a node n or link l, link weights are assigned so that traffic routed according to this configuration is never routed through node n or over link l. In MRC, node n and link l are called isolated in a configuration when no traffic is routed through n or l according to this configuration.

After generating the backup configurations, a standard link state routing protocol like OSPF is run over the generated configurations to construct a loop-free, configuration-specific forwarding table to all destinations in the network. When a packet is forwarded according to a configuration, it is forwarded using the forwarding table calculated based on that configuration. Figure 1 shows the packet forwarding process from a node's perspective.
Fig. 1. Flow chart of a node's packet forwarding process using MRC.

When a router detects that a neighbor can no longer be reached through one of its interfaces, it does not immediately inform the rest of the network about the connectivity failure. Instead, packets that would normally be forwarded over the failed interface are marked as belonging to a backup configuration and forwarded on an alternative interface towards their destination. The packets must be marked with a configuration identifier so that the routers along the path know which configuration to use.
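The forwarding decision described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the flow chart of Fig. 1 itself: the names `Packet`, `forward`, `tables` and `backup_for`, the dictionary layout of the forwarding tables, and the choice of `0` as the identifier of the failure-free configuration are all hypothetical. A packet that has already been switched to a backup configuration and meets a second failure is simply reported as undeliverable, since MRC targets single failures.

```python
from dataclasses import dataclass

NORMAL = 0  # assumed identifier of the failure-free configuration C0


@dataclass
class Packet:
    dst: str
    config_id: int = NORMAL  # configuration identifier carried in the packet


def forward(packet, tables, backup_for, failed_neighbors):
    """Pick the next hop for `packet`, switching configuration on failure.

    tables[c][dst] : next hop toward dst in configuration c (hypothetical layout)
    backup_for[n]  : backup configuration in which neighbor n is isolated
    Returns the next hop, or None if the packet cannot be forwarded.
    """
    next_hop = tables[packet.config_id][packet.dst]
    if next_hop not in failed_neighbors:
        return next_hop  # no failure on this hop: keep the current configuration
    if packet.config_id != NORMAL:
        return None  # already rerouted once; a second failure is out of scope
    # Mark the packet with the backup configuration that isolates the failed
    # neighbor, then look up the alternative next hop in that configuration.
    packet.config_id = backup_for[next_hop]
    alternative = tables[packet.config_id][packet.dst]
    return None if alternative in failed_neighbors else alternative


# Hypothetical router with two neighbors B and C; B is isolated in configuration 1.
tables = {0: {"D": "B"}, 1: {"D": "C"}}
backup_for = {"B": 1}
```

With neighbor B reachable, a packet for D leaves via B in configuration 0. With B failed, the same call marks the packet with configuration 1 and returns C, and downstream routers would use the identifier to select the matching table.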
MRC does not affect the failure-free original routing. When there is no failure, all packets are forwarded according to the original configuration, where all link weights are normal. Upon detection of a failure, only traffic reaching the failure will switch configuration. All other traffic is forwarded according to the original configuration as normal. A detailed explanation of the MRC algorithm can be found in the MRC papers by Kvalbein et al. listed in the references.

III. MRC for Datacenter Disaster Recovery

Once a node running MRC detects a failure towards the final destination, there exists a backup configuration that will forward the traffic to its final destination on a path that avoids the failed element. Disastrous events often strike one or more particular areas. These areas can be states, cities, campuses or buildings. An area that has been struck by a disaster will experience a decrease in available communication resources. In contrast, the need for communication to and from that area may increase. Therefore, communications that are transiting this area should find other routes.

To accomplish the isolation of whole areas and not only single nodes, it is suggested to consider localized nodes as one area (i.e., a node represents an area with respect to the recovery scheme). Figure 2 shows the European COST239 network. Each area consists of several intra-area nodes which offer different connections to other nodes in other areas.
We apply the MRC recovery algorithm to the above COST239 network to demonstrate the applicability of the algorithm to protect against disastrous events that affect datacenter networks. The following is considered in our simulation model:
· The European COST239 network, denoted by C0.
· Number of MRC backup configurations n = 5 (C1, C2, C3, C4, C5).
· OSPF is the IGP routing protocol.

The output of the algorithm is five backup configurations. OSPF is used as the IGP to build the forwarding table for each backup configuration. The five backup configurations are shown below. The isolated areas are denoted by dotted circles. The isolated links are denoted by dotted lines and are given very high weights. No transit traffic will pass through the isolated areas or isolated links.

Figure 3 shows the first backup configuration, where area 1 and area 7 are isolated along with their links. Considering a disaster event striking area 1, some intra-area nodes may survive and hence still offer connectivity to other areas. MRC guarantees that the traffic originating or terminating in area 1 will still be routed from or to that area, while the traffic transiting area 1 will be rerouted around the affected area. This is accomplished by routing the affected traffic according to the backup configuration C1. Traffic not originally passing through area 1 will still be routed according to the full topology C0.
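The effect of giving an isolated area's links very high weights can be illustrated with a small shortest-path sketch. The five-node topology below is hypothetical (it is not the COST239 network), the weight value `10_000` is an arbitrary stand-in for "very high", and the helper names `shortest_path` and `isolate` are introduced only for this example. The sketch also raises the weight on every link of the isolated area, which is a simplification of how MRC assigns weights.

```python
import heapq


def shortest_path(weights, src, dst):
    """Dijkstra over an undirected, connected graph given as {(u, v): weight}."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == dst:
            break
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # Walk the predecessor chain back from dst to src.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def isolate(weights, area, high=10_000):
    """Return a backup configuration in which every link of `area` is very costly."""
    return {e: (high if area in e else w) for e, w in weights.items()}


# Hypothetical full topology C0 and a backup configuration C1 isolating area B.
C0 = {("A", "B"): 1, ("B", "C"): 1, ("A", "D"): 1, ("D", "E"): 1, ("E", "C"): 1}
C1 = isolate(C0, "B")

transit_normal = shortest_path(C0, "A", "C")   # passes through B
transit_backup = shortest_path(C1, "A", "C")   # goes around B
to_isolated = shortest_path(C1, "A", "B")      # B is still reachable as a destination
```

In this toy case the transit path from A to C moves from A-B-C to A-D-E-C under C1, while traffic destined to B still reaches it over the direct (high-weight) link, which mirrors the behavior described for area 1 and configuration C1.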
Fig. 2. The COST239 network topology.

In this section, we propose applying the MRC recovery algorithm to a network of datacenters. Assuming that every area contains a datacenter, the MRC recovery scheme guarantees that every area and link is isolated in at least one backup configuration. The isolated area or link in a backup configuration will not carry any transit traffic, only the traffic destined to or originating from the isolated area.
Fig. 3. Backup configuration C1 generated using the MRC algorithm.

Figure 4 shows the backup configuration C2, where area 3 and area 8 are isolated.

Fig. 4. Backup configuration C2 generated using the MRC algorithm.

Figure 5 shows the backup configuration C3, where area 2 and area 6 are isolated. Figure 6 shows the backup configuration C4, where area 5, area 10 and area 11 are isolated. Figure 7 shows the backup configuration C5, where area 4 and area 9 are isolated.

Fig. 5. Backup configuration C3 - MRC algorithm.

Fig. 6. Backup configuration C4 - MRC algorithm.

Fig. 7. Backup configuration C5 - MRC algorithm.

Following the assumption that every area has a datacenter, MRC succeeds in offering service protection, as the failure of a single datacenter in any area does not cause the disappearance of a specific service from the whole network. By using the backup configuration that has the failed datacenter isolated, user requests can still be served by any datacenter that supports the specified service.
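The isolation sets stated for C1 to C5 can be checked mechanically against the requirement that every area is isolated in at least one backup configuration. The sketch below uses only the area numbers given in the text; the assumption that the areas are numbered 1 to 11, and the names `ISOLATED`, `uncovered_areas` and `backup_config_for`, are ours. Isolated links are not checked, because the text does not list them.

```python
# Areas isolated in each backup configuration, as stated in the text.
ISOLATED = {
    "C1": {1, 7},
    "C2": {3, 8},
    "C3": {2, 6},
    "C4": {5, 10, 11},
    "C5": {4, 9},
}

ALL_AREAS = set(range(1, 12))  # assumption: areas are numbered 1..11


def uncovered_areas(isolated, all_areas):
    """Return the areas that are not isolated in any backup configuration."""
    covered = set().union(*isolated.values())
    return all_areas - covered


def backup_config_for(area, isolated):
    """Return the name of a backup configuration in which `area` is isolated."""
    for name, areas in isolated.items():
        if area in areas:
            return name
    raise ValueError(f"area {area} is not isolated in any configuration")


missing = uncovered_areas(ISOLATED, ALL_AREAS)
```

Under these assumptions `missing` is empty, and `backup_config_for(1, ISOLATED)` selects C1, the configuration used in the area 1 example.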
IV. Impact of MRC on Post Failure Link Load Distribution

Once an area running MRC detects a failure towards the final destination, there exists a backup configuration that will forward the traffic to its final destination on a path that avoids the failed area. This shifting of traffic to alternate links after a failure can lead to congestion and packet loss in parts of the network. In this section, we discuss how to minimize the impact of the MRC recovery process on the post-failure load distribution over network links. Two techniques used to achieve a good load distribution across links after a failure are compared. The first technique, which we propose, uses manual link weight manipulation with MRC. The second technique is the modified MRC proposed by Kvalbein et al.

A. Manual link weight manipulation for post failure load distribution with MRC

Our simulation model, built using the OPNET Modeler software, shows the impact of the MRC recovery process on the link load. The link load is shown to increase after a single area failure. To solve this problem and to achieve good load distribution across links, we propose the use of manual link weight manipulation on the MRC backup configurations. Considering the case of an area 1 failure with full mesh traffic between all areas, area 2 will use backup configuration C1 to continue forwarding traffic to area 10. This will cause congestion on some network links. To decrease the load on the congested links, manual link weight manipulation is used to achieve a better load distribution for the traffic. Figure 8 shows the effect of using this approach on the loads of some links.
Fig. 8. Link utilization for A2-A9 and A10-A11 links.

As a drawback of applying manual link weight manipulation, some other links in the network experience severely high link utilization. Figure 9 shows this drawback for three links.

Fig. 9. Link utilization for A9-A2, A10-A9 & A2-A5 links.

In conclusion, using manual link weight manipulation to achieve good load distribution in the network has the following advantages:
· Simple - no complex algorithm is needed.
· Easy to deploy in small networks.
· Can be used to achieve good load distribution for a selected number of links.
Its disadvantages are as follows:
· Manual - hard to deploy in large networks.
· Cannot be used to achieve global load distribution for all network links at the same time.

B. Modified MRC for post failure load distribution

In this section, we describe how Kvalbein et al. addressed this problem. They proposed an approach for minimizing the impact of the MRC recovery process on the post-failure load distribution, presenting an algorithm that creates the MRC backup configurations in a way that takes the traffic distribution into account. They referred to this new algorithm as the modified MRC. They presented a heuristic aimed at finding a set of link weights for each backup configuration that distributes the load well in the network after any single link failure. In Figure 10, the standard MRC and the modified MRC are directly compared. The modified MRC often manages to route traffic over less utilized links after the failure of a heavily loaded link.
Fig. 10. Load on all unidirectional links after failure using standard MRC and modified MRC.

The modified MRC algorithm has the following advantages:
· Automatic - no link weight manipulation is needed after the discovery of a failure.
· Can be used to achieve good load distribution across all links.
Its disadvantages are as follows:
· Complex - the complexity of the algorithm is affected by the size of the network topology G under consideration.
· A predefined demand matrix D must be defined. If the matrix is changed, the algorithm has to be run again.
· Cannot be used to protect against all link failures. The algorithm is designed to protect against the failure of critical links only, by distributing the traffic that was carried by these links over other available links.
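The trade-off discussed in this section, where changing one link weight relieves a congested link but pushes load onto others, can be reproduced on a toy example. The sketch below is not the OPNET model or the modified MRC heuristic. It assumes a hypothetical four-node topology, a hypothetical demand matrix, single shortest-path routing without equal-cost splitting, and helper names (`route`, `link_loads`) introduced only for illustration.

```python
import heapq


def route(weights, src, dst):
    """Shortest path (Dijkstra) on an undirected, connected graph {(u, v): weight}."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == dst:
            break
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def link_loads(weights, demands):
    """Sum the demand volume carried by each link under shortest-path routing."""
    loads = {tuple(sorted(e)): 0 for e in weights}
    for (src, dst), volume in demands.items():
        path = route(weights, src, dst)
        for u, v in zip(path, path[1:]):
            loads[tuple(sorted((u, v)))] += volume
    return loads


# Hypothetical backup configuration and demand matrix (arbitrary traffic units).
weights = {("A", "B"): 1, ("B", "D"): 1, ("A", "C"): 1, ("C", "D"): 2}
demands = {("A", "D"): 10, ("B", "D"): 5}

loads_before = link_loads(weights, demands)

# Manual manipulation: raise the weight of the most loaded link B-D.
tuned = dict(weights)
tuned[("B", "D")] = 3
loads_after = link_loads(tuned, demands)
```

In this example the load on link B-D drops from 15 to 5 units, while links A-C and C-D, previously idle, each pick up 10 units. The choice of which weight to raise was made by hand for one link, which reflects why the manual approach is hard to scale to all links of a large network.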
V. Conclusion

In this paper, we proposed the use of the MRC IP fast reroute recovery process for datacenter disaster recovery. We demonstrated the applicability of the MRC algorithm to a network of datacenters. Using MRC as a proactive IP fast reroute scheme, no traffic will be dropped. The traffic that is destined to the failed datacenter will be immediately rerouted to a backup datacenter in another area, away from the area that suffered a disaster. Such a scheme suits modern applications that are replicated over distributed datacenters.

We also discussed the impact of the MRC recovery process on the post-failure load distribution over network links and how this impact can be minimized. We proposed a manual technique to minimize the impact of MRC and compared it to an automatic technique called modified MRC. The advantages and disadvantages of both techniques were outlined. We conclude that a newer technique is needed that overcomes these disadvantages. The new technique should not manipulate link weights to achieve good load distribution. Currently, we are developing this new technique, which achieves unequal weight load balancing with no need to change link weights. To achieve this, the OSPF protocol is tweaked to be able to do unequal weight load balancing with no change in link weights. This new technique will overcome all previously mentioned disadvantages while achieving good load distribution across network links.

REFERENCES

[1] M. Marchese, R. Surlinelli and S. Zappatore, "Monitoring unauthorized Internet accesses through a honeypot system," International Journal of Communication Systems, vol. 24, issue 1, pp. 75-93, 2011.
[2] A. Vahdat, L. Hong, Z. Xiaoxue and C. Johnson, "The emerging optical data center," Optical Fiber Communication Conference and Exposition and the National Fiber Optic Engineers Conference, OFC/NFOEC 2011, pp. 1-3, Los Angeles, California, USA, 6-10 March 2011.
[3] B. Rimal and E. Choi, "A service-oriented taxonomical spectrum, cloudy challenges and opportunities of cloud computing," International Journal of Communication Systems, vol. 25, issue 6, pp. 796-819, 2012.
[4] V. Vusirikala, C. Lam, P. Schultz and B. Koley, "Drivers and applications of optical technologies for Internet Data Center networks," Optical Fiber Communication Conference and Exposition and the National Fiber Optic Engineers Conference, OFC/NFOEC 2011, pp. 13, Los Angeles, California, USA, 6-10 March 2011.
[5] M. Habib, M. Tornatore, M. De Leenheer, F. Dikbiyik and B. Mukherjee, "A disaster-resilient multi-content optical datacenter network architecture," Proceedings of the 13th International Conference on Transparent Optical Networks, ICTON 2011, pp. 1-4, Stockholm, Sweden, 26-30 June 2011.
[6] C. Develder, B. Dhoedt, B. Mukherjee and P. Demeester, "On dimensioning optical grids and the impact of scheduling," Journal of Photonic Network Communications, vol. 17, issue 3, pp. 255-265, June 2009.
[7] S. Neumayer, G. Zussman, R. Cohen and E. Modiano, "Assessing the Vulnerability of the Fiber Infrastructure to Disasters," Proceedings of the 28th IEEE International Conference on Computer Communications, INFOCOM 2009, pp. 1566-1574, Rio de Janeiro, Brazil, 19-25 April 2009.
[8] A. F. Hansen, A. Kvalbein, T. Cicic and S. Gjessing, "Resilient routing layers for network disaster planning," Proceedings of the 4th International Conference on Networking, ICN 2005, pp. 1097-1105, Reunion Island, France, 17-21 April 2005.
[9] C. Boutremans, G. Iannaccone and C. Diot, "Impact of link failures on VoIP performance," Proceedings of the 12th International Workshop on Network and Operating Systems Support for Digital Audio and Video, NOSSDAV 2002, pp. 63-71, Miami, Florida, USA, 12-14 May 2002.
[10] D. Watson, F. Jahanian and C. Labovitz, "Experiences with monitoring OSPF on a regional service provider network," Proceedings of the 23rd International Conference on Distributed Computing Systems, ICDCS 2003, pp. 204-213, Providence, Rhode Island, USA, 19-22 May 2003.
[11] P. Francois, C. Filsfils, J. Evans and O. Bonaventure, "Achieving subsecond IGP convergence in large IP networks," ACM SIGCOMM Computer Communication Review, vol. 35, pp. 35-44, July 2005.
[12] M. Shand and S. Bryant, "IP Fast Reroute Framework," IETF Internet draft, draft-ietf-rtgwg-ipfrr-framework-13.txt, 23 October 2009.
[13] A. Kvalbein, A. F. Hansen, T. Cicic, S. Gjessing and O. Lysne, "Fast IP network recovery using multiple routing configurations," Proceedings of the 25th IEEE International Conference on Computer Communications, INFOCOM 2006, pp. 1-11, Barcelona, Catalunya, Spain, 19-23 April 2006.
[14] A. F. Hansen, A. Kvalbein, T. Cicic, S. Gjessing and O. Lysne, "Resilient routing layers for recovery in packet networks," Proceedings of the International Conference on Dependable Systems and Networks, DSN 2005, pp. 238-247, Yokohama, Japan, 28 June - 1 July 2005.
[15] A. Kvalbein, A. F. Hansen, T. Cicic, S. Gjessing and O. Lysne, "Fast recovery from link failures using resilient routing layers," Proceedings of the 10th IEEE Symposium on Computers and Communications, ISCC 2005, pp. 554-560, Murcia, Cartagena, Spain, 27-30 June 2005.
[16] A. Markopoulou, G. Iannaccone, S. Bhattacharyya, C. Chen-Nee and C. Diot, "Characterization of failures in an IP backbone," Proceedings of the 23rd Annual Joint Conference of the IEEE Computer and Communications Societies, INFOCOM 2004, vol. 4, pp. 2307-2317, Hong Kong, China, 7-11 March 2004.
[17] M. O'Mahony, "Results from the COST 239 project. Ultra-High Capacity Optical Transmission Networks," 22nd European Conference on Optical Communication, ECOC 1996, vol. 2, pp. 11-18, Oslo, Norway, 19 September 1996.
[18] I. Sundar, B. Supratik, N. Taft and C. Diot, "An approach to alleviate link overload as observed on an IP backbone," Proceedings of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies, INFOCOM 2003, vol. 1, pp. 406-416, San Francisco, California, USA, 30 March - 3 April 2003.
[19] A. Kvalbein, T. Cicic and S. Gjessing, "Post-failure routing performance with multiple routing configurations," Proceedings of the 26th IEEE International Conference on Computer Communications, INFOCOM 2007, pp. 98-106, Anchorage, Alaska, USA, 6-12 May 2007.
[20] A. Kvalbein, A. F. Hansen, T. Cicic, S. Gjessing and O. Lysne, "Multiple routing configurations for fast IP network recovery," IEEE/ACM Transactions on Networking, vol. 17, pp. 473-486, April 2009.