KubeHA

KubeHALogo

KubeHA provides high availability(99.999% and 99.9999%) to mission and business critical applications running in cloud native environment like Kubernetes. The applications can be either of type stateless and statefulset.

Stateless application: If a pod crashes, then there is no outage in stateless applications, but certainly there is service degradation. This degradation can last till a new replacement pod is started, which can take few seconds depending upon the type/size of a pod. Similarly, if a node(hosting many pods) crashes, then again there is a major service degradation. This degradation can last from 30 seconds to few minutes depending upon when node failure is detected by Kubernetes and pods are restarted again on the same or another node.

Statefulset application: If a master/active/primary pod crashes, then there is an outage in statefulset applications. This outage can last till the pod is restarted, which can take few seconds depending upon the type/size of a pod. Similarly, if a node(hosting master/active/primary pod) crashes, then again there is a major service outage. This outage can last from 30 seconds to few minutes depending upon when node failure is detected by Kubernetes and the node is restarted again. Node failure detection in any managed service provide(MSP like AWS, GKE, Azure, etc.) is more than 30 seconds and node recovery takes a few minutes.

For mission critical applications(esp. defence, 5G, etc.), one second of outage means a lot. Mission critical applications need to maintain high availability of 99.9999%(approx. 30 seconds of outage in a year). Business critical applications need to maintain high availability of 99.999%(carrier grade requirement, approx. 5 minutes of outage in a year).

In case of stateful business critical application, more than 1 node failure(recovery takes 2-3 minutes) in the year or at an average  more than 2 pod failure(if takes 2-3 seconds) per week will violate SLA of 99.999% service availability. Mission critical application can’t tolerate node failure at all, one node failure will violate SLA of 99.9999% service availability. More than one pod failure per month will violate SLA of 99.9999% service availability.

Application Type

Typical SLA

Approx. Recovery Time

SLA Violation

Statefulset business critical application

5 minutes i.e. 300 seconds of outage(Service Availability 99.999% )

 

Pod Recovery: 2.5 seconds

Node Recovery: 2.5 minutes

3 pods failure/week

OR

2 nodes failure/week

Statefulset mission critical(defence, 5G, etc.) application

30 seconds of outage(Service Availability 99.9999%)

2 pods failure/month

OR

1 node failure/year

 

In case of stateless business/mission critical application, service degradation will happen depending upon the scale of failure. Multi node failures can create outage in minutes.

Current application monitoring and alerting solutions require manual intervention for restoring the degraded service leading to longer recovery time and poor customer experience.

KubeHA has solved the problems of (pod/node)fault detection and the recovery time in native Kubernetes by implementing a combination of KubeHA side car containers and KubeHA instance.

KubeHA Dashboard:
KubeHA Dashboard is a web-based user interface (UI) designed especially for monitoring and controlling multiple Kubernetes clusters. The dashboard can connect to multiple Kubernetes clusters.

 

dashboardmulticluster

 

 

The main advantage of KubeHA dashboard is making user’s integration effort with KubeHA to zero. The user can upload application’s yaml file, clicks ‘Monitor my deployment’, the dashboard automatically integrates user’s yaml file with KubeHA side car container’s yaml file and launches the application’s pod with KubeHA side car container. KubeHA instance gets started automatically and it starts monitoring the application’s pod using KubeHA side car container.

 

Dashboard-podspod kill

 

 

The dashboard can display all workloads and nodes running in any cluster. It also includes features that can help user to control and modify workloads. The dashboard helps in simulating few failures(pod failures, node failures, etc.) without/with KubeHA and displays the outages that these failures create in the cluster.

KubeHA Test Results:

  1. 1. Stateless Application:
    1. 1.1. Pod Failure:

Comparison of Service Outage for Single Pod (Stateless) failure: With KubeHA Vs Without KubeHA

 

 

MTTD (in Seconds)

MTTR(in Seconds)

Total Outage(in Seconds)

Average service outage without KubeHA

0.0537

3.2236

3.2773

Average  service outage with KubeHA

0.0452

0.0384

0.0836

Comparative Service Outage (Without KubeHA /With KubeHA)

1.188 times faster

83.94 times faster

39.20 times less outage

 

  1. 1.2. Node Failure

Comparison of Service Outage for Single Node failure (Stateless Pods): With KubeHA Vs Without KubeHA

 

MTTD(in Seconds)

MTTR(in Seconds)

Total Outage(in Seconds)

Average service outage without KubeHA

37.2

2.858

40.58

Average  service outage with KubeHA

5.374

0.187

5.561

Comparative Service Outage (Without KubeHA /With KubeHA)

6.92 times faster

15.28 times faster

7.29 times less outage

 

  1. 2. Statefulset Application:
    1. 2.1. Master/Active/Primary Pod Failure:

Comparison of Service Outage for Pod (StatefulSet) failure: With KubeHA Vs Without KubeHA

 

MTTD (in Seconds)

MTTR(in Seconds)

Total Outage(in Seconds)

Average service outage without KubeHA

0.0268

1.4348

1.4616

Average  service outage with KubeHA

0.005

0.4014

0.4064

Comparative Service Outage (Without KubeHA /With KubeHA)

5.36 times faster

3.57 times faster

3.59 times less outage

 

 2.2. Node(hosting Master/Active/Primary Pod) Failure:

Comparison of Service Outage for Pod (StatefulSet) because of Node failure: With KubeHA Vs Without KubeHA

 

 

MTTD(in Seconds)

MTTR(in Seconds)

Total Outage(in Seconds)

Average service outage without KubeHA

36.7023

121.081

157.7833

Average  service outage with KubeHA

3.4177

0.781

4.1987

Comparative Service Outage (Without KubeHA /With KubeHA)

10.738 times faster

155 times faster

37.579 times less outage