
RCA for Cluster 9 (Australia) service outage - February 7th 2019

On February 8th, 2019 AEST (February 7th PST), bLoyal experienced a brief outage of the Tier1 Loyalty Engine Service in the Australia East region for Service Cluster 9.

We strive for 100% uptime with local and geographic failover capabilities and treat any outage very seriously. This article provides the root cause analysis (RCA) for this outage.

Impact Time

Date: February 8th, 2019 (AEST)

Impact Time: 2/8/2019 at 9:35am (AEST), 2/7/2019 at 2:33pm (PST)

Total Downtime: 2 minutes

Impacted Services 

This outage affected the loyalty engine in the primary region and would have delayed customer lookups and promotion calculations.

Monitoring

Our monitoring caught the issue; however, alerts were set to fire only after 5 minutes of consecutive downtime, so a 2-minute outage fell below the alert threshold.

Root Cause

The Azure Cloud Service for loyaltyengine9.bloyal.com was set to the default recycle interval of 1740 minutes (29 hours). When the service recycled, it triggered a failover event with a brief 2-minute downtime. Although the geo-redundant instance took over, a 2-minute downtime is not acceptable for this service unless a legitimate disaster recovery scenario forces a geo failover.
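
The effect of a 29-hour interval can be illustrated with a short calculation. The following is a minimal sketch, not bLoyal's tooling: the start timestamp and the `recycle_times` helper are hypothetical and chosen only for illustration. It shows that a 1740-minute interval moves each recycle 5 hours later in the day than the previous one, so recycles do not stay in a fixed maintenance window.

```python
from datetime import datetime, timedelta

RECYCLE_INTERVAL = timedelta(minutes=1740)  # default interval cited in this RCA


def recycle_times(first_start, count):
    """Return the next `count` recycle times after `first_start`."""
    return [first_start + RECYCLE_INTERVAL * n for n in range(1, count + 1)]


# Hypothetical service start time, for illustration only.
start = datetime(2019, 2, 1, 0, 0)
schedule = recycle_times(start, 5)
for t in schedule:
    print(t.strftime("%Y-%m-%d %H:%M"))
# 2019-02-02 05:00
# 2019-02-03 10:00
# 2019-02-04 15:00
# 2019-02-05 20:00
# 2019-02-07 01:00
```

With this assumed start time, the recycle drifts from early morning into the afternoon and evening within a few days.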

Mitigation/Fix

Service Fix: We configured the service not to auto-recycle and will reapply this setting whenever the VM is updated by Azure.

Monitoring Fix: We set the alert threshold for our Tier1 services to 2 minutes so that on-call personnel are notified sooner than under the previous 5-minute threshold.
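
The difference between the two thresholds can be sketched as follows. This is an illustration only, not the actual monitoring system: it assumes one health probe per minute, and the `alert_fires` function and the probe history are hypothetical.

```python
def alert_fires(probe_results, threshold_minutes):
    """Return True if any run of consecutive failed probes reaches the threshold.

    Assumes one health probe per minute; True means the probe succeeded.
    """
    consecutive_failures = 0
    for healthy in probe_results:
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= threshold_minutes:
            return True
    return False


# Hypothetical probe history: healthy, then 2 failed minutes, then healthy.
outage = [True, True, False, False, True, True]

print(alert_fires(outage, threshold_minutes=5))  # False
print(alert_fires(outage, threshold_minutes=2))  # True
```

Under these assumptions, a 2-minute outage passes unnoticed with a 5-minute threshold but raises an alert with a 2-minute threshold.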

 

 
