The Challenge of VoIP System Failures Not Addressed by Most High Availability Designs

Hardware or software can fail at any time and cause a system failure, and such failures cannot be eliminated entirely. When a VoIP based system fails, on-going calls are lost. High availability (HA) or redundant systems cannot address this unless they can restore an on-going call without either end-point re-initiating it. Most high availability systems for Session Initiation Protocol (SIP) based VoIP calls, and their redundancy setups, immediately replace the failed component or sub-system so the system can continue to be used. That is good enough for many situations, but it may not be adequate for mission critical applications when the HA system cannot restore on-going calls.

Imagine a scenario where an outside caller initiates a call and it reaches the demarcation point of the contact center installation. This could be a premise based contact center or a Cloud setup offering virtual contact center services. Once the call setup reaches the intended peer and the conversation starts, your system, whether Cloud based or on-premise, could experience a failure. When the system detects the failure, its high availability and redundant setup will kick in and the system will be ready for future calls, but what happens to the on-going calls? They just die. This is the normal operating mode of traditional high availability systems, including most high availability solutions offered for Asterisk. The issue becomes more critical for large contact centers using automatic call distribution (ACD), which carry significant traffic at any given time.

With contact center ACD, going beyond traditional high availability is extremely important. Having the capability to keep calls alive through call survival is critical. It allows the user to continue the phone conversation without re-initiating the call. This is a level of sophistication in redundancy that goes beyond bringing replacement software and hardware components into action. It introduces the intelligence required to preserve all on-going calls, which is essential for mission critical systems.

SIP Registration Timeout Settings for High Availability

When setting up a highly available telephony system, most people worry about the back end and ensure it functions as they expect and require. However, one highly visible user issue I have seen is a misconfiguration of the registration timeouts on the connected SIP phones. When these are set very high, the phone may not notice that a service has moved (via IP/DNS/etc. changes) due to an HA switchover, and it can potentially miss incoming calls until it does. Typically an outgoing call attempt will work, or at the very least cause a registration attempt to the new server the service has moved to.

For example, take a look at Aastra: the defaults in a few models I’ve seen are half an hour for a failed registration.

[Screenshot: Aastra default configuration, registration failed retry timer]

If the failed registration timeout is half an hour and the phone attempts to re-register and fails, the phone will show an error or appear unregistered for the next half hour. This can happen when the registration arrives as a box is failing, or when a failover has happened and the configuration is being written or updated as part of the switchover process. A more reasonable set of values is shown in the following.

[Screenshot: Aastra configuration with lowered registration failed retry and timeout retry timers]

In this one I’ve lowered the registration failed retry and also the timeout retry timers. These make the SIP phone resolve a registration issue more quickly by retrying more often than the defaults. They could be lower depending on the situation.
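To put rough numbers on the effect of the retry timer, here is a minimal sketch. It assumes the simplest case, where each failed registration makes the phone wait one full retry timer before trying again; the function name and the 60 second value are illustrative assumptions, not Aastra recommendations.

```python
def worst_case_unregistered_seconds(failed_retry_timer: int, failed_attempts: int = 1) -> int:
    """Seconds a phone stays unregistered if `failed_attempts` registrations
    fail in a row and each failure waits one full retry timer (assumed model)."""
    if failed_retry_timer <= 0:
        raise ValueError("failed_retry_timer must be positive")
    if failed_attempts < 0:
        raise ValueError("failed_attempts cannot be negative")
    return failed_retry_timer * failed_attempts


# Half-hour default mentioned in the text, one failed attempt during failover.
default_gap = worst_case_unregistered_seconds(1800)

# Hypothetical lowered timer of 60 seconds, same single failure.
tuned_gap = worst_case_unregistered_seconds(60)

print(f"default: up to {default_gap // 60} minutes unregistered")
print(f"tuned:   up to {tuned_gap // 60} minute unregistered")
```

Under this model a single badly timed registration costs up to half an hour of missed incoming calls with the default, and about a minute with the lowered value.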

One precaution before everyone sets these very low. These settings need care when the SIP phone is off-site and there are protections in place, for example Fail2Ban, to block brute force attacks. In cases where the SIP phone is an app on a mobile device, the failed registration timeout should be set high enough not to trigger a lockout of a valid device. If the devices are in-house or their IPs can be whitelisted, the values can be lower without worry.
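The trade-off with Fail2Ban can be estimated with a little arithmetic. The sketch below assumes a jail that bans an address once `maxretry` failures are seen within `findtime` seconds (both are standard Fail2Ban jail options), that every retry from the phone is logged as a failure the jail matches, and, conservatively, that a failure landing exactly on the window boundary still counts. The function names and example numbers are illustrative only.

```python
def failures_in_window(retry_interval: int, findtime: int) -> int:
    """Failures a phone retrying every `retry_interval` seconds can log
    inside one `findtime` window, counting the first failure at t=0."""
    if retry_interval <= 0 or findtime <= 0:
        raise ValueError("retry_interval and findtime must be positive")
    return findtime // retry_interval + 1


def would_be_banned(retry_interval: int, maxretry: int, findtime: int) -> bool:
    """True if steady retries at this interval reach `maxretry` within `findtime`."""
    return failures_in_window(retry_interval, findtime) >= maxretry


def min_safe_retry_interval(maxretry: int, findtime: int) -> int:
    """Smallest whole-second retry interval that never reaches `maxretry`
    failures inside a single `findtime` window."""
    if maxretry < 2:
        raise ValueError("with maxretry < 2 any single failure triggers a ban")
    return findtime // (maxretry - 1) + 1


# Hypothetical jail: 5 failures within 600 seconds triggers a ban.
safe_interval = min_safe_retry_interval(maxretry=5, findtime=600)
print(f"retry no more often than every {safe_interval} seconds")
print("30 s retries banned?", would_be_banned(30, maxretry=5, findtime=600))
```

For off-site or mobile phones, the retry timer would be kept at or above the computed interval; for whitelisted in-house phones the check does not apply.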

Rightsizing your Telephony System from the Beginning

A number of common questions come up when purchasing telephony systems. One of the most important, because it affects the cost of the system, is the expected usage in terms of number of users, ports and active calls. Knowing current needs is relatively straightforward, especially for an existing business. However, the real issue is that businesses are not static, and you want your new ACD/PBX system to be able to grow with your organization.

Obviously one can throw money at the issue up front and spec out the system for the projected size required in the future. However the better option is to choose a solution that can scale as your business grows without the need to replace the full system.

We find some systems have a large upfront hardware cost and limit usage via licensing, so later expansion is done by purchasing new licenses to unlock hardware already paid for. The issue here is that the initial investment is large, and compromises are often made to fit within the current budget which limit future growth of the system.

Two better alternatives are:

  1. Purchasing a system which meets your needs for the near term and which can scale properly to meet the demands of the business well into the future. The Q-Suite telephony platform accomplishes this with the built-in ability to add Asterisk servers and web servers to an existing installation as the business grows. Recently the option of monthly licensing has been offered to save even further on initial costs.
  2. Using a Cloud based ACD/PBX system. This lets a provider worry about hardware upgrades, trunks, etc. and allows growth in smaller increments. Look at the hosted provider VitalVox, where your role is only to configure and manage the users, campaigns, queues, and other features of your ACD/PBX system.
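As a rough illustration of sizing for growth rather than for today only, the sketch below projects peak concurrent calls forward and converts that into a server count. Every number in it is a placeholder assumption: the per-server call capacity depends heavily on codecs, recording, and hardware, and the growth rate is a guess you would replace with your own forecast.

```python
import math


def projected_peak_calls(current_peak: int, annual_growth: float, years: int) -> int:
    """Peak concurrent calls after compounding `annual_growth` for `years`."""
    if current_peak < 0 or years < 0:
        raise ValueError("current_peak and years cannot be negative")
    return math.ceil(current_peak * (1 + annual_growth) ** years)


def servers_needed(peak_calls: int, calls_per_server: int, spare_servers: int = 1) -> int:
    """Telephony servers required for `peak_calls`, plus spares for redundancy."""
    if calls_per_server <= 0:
        raise ValueError("calls_per_server must be positive")
    return math.ceil(peak_calls / calls_per_server) + spare_servers


# Hypothetical business: 120 concurrent calls today, 25% growth per year,
# and an assumed capacity of 200 concurrent calls per server.
today = servers_needed(120, calls_per_server=200)
future_peak = projected_peak_calls(120, annual_growth=0.25, years=3)
in_three_years = servers_needed(future_peak, calls_per_server=200)

print(f"servers today: {today}, projected peak: {future_peak}, servers in 3 years: {in_three_years}")
```

With a platform that scales by adding servers, only the difference between the two counts has to be purchased later, rather than paying for the future size up front.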

The Differences in Call Survival and Call Recovery

While investigating High Availability (HA) in CTI and PBX systems you will often find mention of Call Recovery. Another term you will run into is Call Survival, which is often incorrectly used interchangeably with Call Recovery. The two are different approaches to solving the same problem: a failure which would interrupt the calls on a system.

With Call Survival, when a failure happens the caller and callee do not have to take any action to continue their call, as it survives the failure. At a high level this is done by reacting to the failure quickly and re-routing the audio path around it.

With Call Recovery, what happens after a failure depends on the system. Sometimes the caller needs to redial the callee; in other systems the redial is automated, but the callee still has to answer the new call.

From a user perspective the better option is Call Survival, as users may only experience a momentary interruption in their audio while the path is rerouted around the failure, instead of having to re-initiate a call to recover it.
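The difference between the two approaches can be shown with a toy model. This is only a conceptual sketch, not how Q-Suite or any real platform implements it: a call is reduced to a list of node names carrying the audio, and all names (`Call`, `asterisk-a`, `asterisk-b`) are made up for illustration.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Call:
    caller: str
    callee: str
    media_path: Tuple[str, ...]       # hypothetical nodes carrying the audio
    needs_user_action: bool = False   # must someone redial or re-answer?


def reroute(path: Tuple[str, ...], failed: str, standby: str) -> Tuple[str, ...]:
    """Replace the failed node with the standby node in the audio path."""
    return tuple(standby if node == failed else node for node in path)


def call_survival(call: Call, failed: str, standby: str) -> Call:
    """Same call continues; only the audio path changes."""
    return replace(call, media_path=reroute(call.media_path, failed, standby))


def call_recovery(call: Call, failed: str, standby: str) -> Call:
    """Original call is lost; a new call is set up and must be answered again."""
    return Call(call.caller, call.callee,
                reroute(call.media_path, failed, standby),
                needs_user_action=True)


original = Call("caller", "agent", ("trunk", "asterisk-a", "agent-phone"))
survived = call_survival(original, failed="asterisk-a", standby="asterisk-b")
recovered = call_recovery(original, failed="asterisk-a", standby="asterisk-b")

print("survival needs user action:", survived.needs_user_action)
print("recovery needs user action:", recovered.needs_user_action)
```

Both results end up on the standby node; the only difference in the model is whether a person had to do something to get there, which is the distinction that matters to the user.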

The Q-Suite platform supports Call Survival with the help of the Overseer Watchdog providing HA for other services in addition to being one part of the Call Survival solution.

Essentials of a Cloud Contact Center Service Platform

Gartner makes available for public distribution a write-up called “Magic Quadrant for Contact Center Infrastructure”. Aside from the market analysis and evaluation of some of the contact center solution providers, the narrative, dated June 18, 2013, also provides insight into how the market perceives the technology. Under “Market Overview” it states that more and more solutions are now shipped as contact center software which can be run on properly configured commercial off-the-shelf servers.

Now this may be a recent phenomenon for closed, proprietary solutions, but it has always been the case with technology platforms driven by open source software, particularly the Asterisk based contact center solutions. The Linux operating system, Asterisk telephony, the MySQL database, the Apache web server, and several of the dominant web browsers are open source, and they form the backbone technology stack of the contact center solution. They are, to a great degree, hardware vendor independent. There are many Cloud platform vendors, like Rackspace and Amazon, who offer the computing power and the backbone technology as a service for deploying a cloud based contact center solution.

Cloud contact center service providers require contact center software that leverages a comparable technology stack and offers the essential functionality required for a contact center operation. Multi-tenant capability with a multi-channel ACD is a must for every Cloud based service provider. Skills based routing, a dialer, and a robust API for CRM integration are some of the other key elements required.

Distributed computing has been around for years. The unprecedented growth of network infrastructure and the evolution of Internet Protocol (IP) based technologies have made it possible to take distributed computing capabilities to new heights. Cloud platforms assemble the technology backbone and the contact center software and make them accessible through IP networks. The shift to a service on demand model is one of the significant advantages offered by Cloud based contact center solutions.

Audio alerts triggered by real-time contact center ACD activities

Automatic Call Distributors (ACD) control and manage the work-flow of a contact center. A multi-channel contact center ACD offers skills based routing and queue prioritization for phone calls, emails, and web channels. The real-time queue metrics are a good indicator of contact center activity. Even with workforce management (WFM) software predictions, it is not always possible to staff adequately for sudden spurts in call volume. Organizations should have procedures in place to handle such events. One option is having supervisory staff and supplementary employees participate in call handling when required.

Key queue metrics like the total number of calls waiting in a given queue, the wait time, the abandon rate, and the overall service level provide a measure of real-time call center activity. Good contact center software will allow call center managers to set conditions based on the queue metrics to trigger audio alerts. Different audible alerts can be set, each specific to a particular queue metric.
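A threshold-based alert check of this kind can be sketched in a few lines. This is a generic illustration and not the Q-Suite API: the metric names, thresholds, and sound file names are all assumptions, and actually playing the sound is left out.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List

# Supported comparisons between a live metric value and its threshold.
COMPARATORS = {">=": operator.ge, "<=": operator.le}


@dataclass(frozen=True)
class AlertRule:
    metric: str        # key in the queue metrics snapshot
    comparator: str    # ">=" or "<="
    threshold: float
    sound: str         # audio file to play when the rule fires


def triggered_sounds(rules: List[AlertRule], metrics: Dict[str, float]) -> List[str]:
    """Return the sounds of all rules whose condition holds for this snapshot."""
    sounds = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not reported for this queue; skip the rule
        if COMPARATORS[rule.comparator](value, rule.threshold):
            sounds.append(rule.sound)
    return sounds


# Hypothetical rules, one distinct sound per queue metric.
RULES = [
    AlertRule("calls_waiting", ">=", 10, "calls-waiting.wav"),
    AlertRule("longest_wait_seconds", ">=", 120, "long-wait.wav"),
    AlertRule("abandon_rate_percent", ">=", 5.0, "abandon-rate.wav"),
    AlertRule("service_level_percent", "<=", 80.0, "service-level.wav"),
]

snapshot = {
    "calls_waiting": 12,
    "longest_wait_seconds": 45,
    "abandon_rate_percent": 2.5,
    "service_level_percent": 78.0,
}

alerts = triggered_sounds(RULES, snapshot)
print(alerts)
```

Note that the service level rule fires when the value drops below its threshold, while the other three fire when the value rises above it, which is why each rule carries its own comparator.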

Q-Suite for Asterisk is a powerful contact center ACD offering such a feature as part of its call center software. It is multi-tenant software for setting up Cloud based, fault tolerant, High Availability (HA) contact center solutions. Its audible notifications can be triggered by setting conditions on queue parameters that are monitored as part of the real-time contact center reporting. These notifications allow contact center floor operations to initiate the procedures put in place for handling sudden spurts in call volume.

Visual call flow designer for Asterisk based contact centers

Call flow represents the service work flow offered to an incoming call through the voice portal of a contact center solution. A clean and effective call flow improves efficiency and enhances customer service. Incorporating Interactive Voice Response (IVR) within a call flow provides an opportunity to integrate information from the back-end systems and enable more self service options. The IVR also introduces call flow automation to reduce wait time and empower the customer. Call routing decisions are made from the IVR based on the customer's interactive responses.

A visual designer is a drag and drop tool-set for visual call flow modeling, with graphical icons representing the IVR and contact center ACD functions. In Asterisk based contact center solutions, the call flow designer will include a powerful set of telephony and scripting functions intrinsic to Asterisk dialplans. Multi-tenant contact center solutions in cloud deployments scale over multiple Asterisk servers. The contact center software controlling the Asterisk cluster through its call center ACD will deploy the output of the visual call flow designer across all the Asterisk servers. The capabilities of the call center ACD determine the extent of sophistication available in a particular call center solution.
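To make the idea of "designer output deployed to every server" concrete, here is a minimal sketch that turns a very small IVR menu description into Asterisk dialplan text and pairs it with each server name. The flow format, the function names, and the server names are assumptions for illustration; a real designer produces far richer dialplans, and the actual transfer to the servers is not shown.

```python
from typing import Dict, List


def render_dialplan(flow: dict) -> str:
    """Render a one-level IVR menu as Asterisk dialplan (extensions.conf) text."""
    lines = [
        f"[{flow['context']}]",
        "exten => s,1,Answer()",
        f" same => n,Background({flow['prompt']})",   # play menu, accept digits
        f" same => n,WaitExten({flow['timeout']})",   # wait for the caller's choice
    ]
    # One extension per menu digit, each sending the caller to a queue.
    for digit, queue in sorted(flow["options"].items()):
        lines.append(f"exten => {digit},1,Queue({queue})")
    return "\n".join(lines) + "\n"


def deploy(flow: dict, servers: List[str]) -> Dict[str, str]:
    """Return the same rendered dialplan for every server in the cluster."""
    text = render_dialplan(flow)
    return {server: text for server in servers}


# Hypothetical designer output: a main menu with two choices.
main_menu = {
    "context": "ivr-main",
    "prompt": "main-menu",
    "timeout": 5,
    "options": {"1": "sales", "2": "support"},
}

dialplan = render_dialplan(main_menu)
deployed = deploy(main_menu, ["asterisk-1", "asterisk-2"])
print(dialplan)
```

Generating the text once and handing the identical result to every server is what keeps the call flow consistent across the cluster, regardless of which Asterisk server a call lands on.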

The expanding user base of Asterisk in contact centers, driven by its fast evolution, lower cost of acquisition, and superior telephony, has generated great demand for a powerful, dynamic, visual modeling tool for building call flows and IVRs. Contact center software like Q-Suite comes with an intuitive graphical dialplan builder to create and deploy powerful IVR driven call flow applications.