Mean-time-to-repair (MTTR) is a critical metric for anyone responsible for ensuring continuous and stable service. When applied to complex network operations, it can also be difficult to determine.
The following nine-step approach to determining MTTR was defined and implemented as part of a pilot deployment of a service health management solution at Charter Communications. This new MTTR approach includes not only the time to repair, but also the component costs associated with the steps necessary to ensure service quality and desired throughput in an increasingly complicated operations environment. By understanding the time and cost of each specific activity associated with resolving service degradations, cable operators can identify ways to automate these steps using embedded network intelligence. Such automation is critical to delivering consistently reliable services and reducing MTTR and costs in today’s complex environment.
Service assurance operations involve a repeatable set of processes, either manual or automated, for detecting, qualifying, prioritizing, and then acting upon plant anomalies or potential service degradations. By building intelligence into the network to streamline the now-accurate MTTR process described in this article, the operator realized an average reduction of 50 percent in MTTR for plant outages and a 75 percent reduction of resources required for identification, location and repair of network-related plant degradations.
Step 1 of the process (problem “symptom” detection) is the trigger. This initial reason to suspect a problem often manifests itself as a cluster of customer calls—a reactive indicator. The first n-1 callers—and those that do not call—have lost or degraded service during this time. (See Figure 1.)
In addition to reacting to customer calls, many operations support systems (OSSs) employ periodic polling of DOCSIS management information base (MIB) variables to serve as the trigger. To achieve less than 2 minutes of problem-determination latency, we would have to poll all data values from all modems in our system every 2 minutes. This approach is not practical and creates a massive scaling problem.
As a result, most DOCSIS polling systems back off to a reasonable polling level—typically every 10 minutes for modem status polling for outage detection and every 8 hours for DOCSIS RF and Internet protocol (IP) performance metrics for detecting plant problems. This polling schedule, however, causes extended detection times because of polling lags, or it may miss degradations that do not persist longer than the polling cycle (e.g., a two-hour ingress event). This is the mean-time-to-detect (MTTD) component of MTTR.
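The polling trade-off above can be sketched as follows. This is a minimal illustration only; the function names and the uniform-arrival assumption are ours, not part of any deployed system:

```python
def expected_detection_delay(poll_interval_min: float) -> float:
    """With periodic polling, a problem starts at a uniformly random
    point in the cycle, so on average it is detected half an interval
    after onset (any per-poll processing time is ignored here)."""
    return poll_interval_min / 2.0

def can_be_missed(event_duration_min: float, poll_interval_min: float) -> bool:
    """An event shorter than the polling interval can fall entirely
    between two successive polls and never be observed at all."""
    return event_duration_min < poll_interval_min
```

Under these assumptions, 10-minute status polling adds roughly 5 minutes of average detection delay, and an 8-hour RF-metric cycle can miss a two-hour ingress event outright.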
Step 2 (determine persistency of problem) is a simple persistency test to make sure the problem is not transient or the result of a false, intermittent symptom. This step involves a trade-off: ensuring the “quality” of a true problem indicator vs. added response delay. This is the mean-time-to-persist (MTTP) component of MTTR.
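A persistency test of this kind might look like the following sketch; the threshold semantics and the three-consecutive-polls rule are illustrative assumptions:

```python
def is_persistent(samples, threshold, min_consecutive=3):
    """Declare a real problem only if the symptomatic value stays above
    threshold for min_consecutive consecutive polls, filtering out
    transient spikes at the cost of added response delay (MTTP)."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

Raising `min_consecutive` improves the "quality" of the problem indicator but directly lengthens the MTTP component.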
During Step 3 (problem validation), a real problem is validated. Validation is typically accomplished through correlation with other system data, often by doing a real-time poll of secondary symptomatic data from an independent system from the initial trigger. For example, if modems suddenly go off-line, we must determine if it is a real plant outage or a residential power outage. A network operations center (NOC) or field technician may interrogate the multimedia terminal adapters (MTAs) in that region to see if they are still communicating. Another method would be to check the power supply monitors to see if they are on battery backup to validate a residential power outage.
For noise or excessive attenuation, unless there are independent monitoring systems to correlate with DOCSIS values, a technician may need to be dispatched to begin the painful and time-consuming search and hunt. Here, a major trade-off evaluation is made between the cost of operations and service quality. We must determine how to qualify and prioritize one potential RF condition over others. During this time, real problems are persisting, degrading or interrupting services. This is the mean-time-to-validate (MTTV) component of MTTR.
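The correlation logic of Step 3 can be sketched as a simple triage function. The inputs and return strings are hypothetical; a real system would pull MTA reachability and power-supply battery status from live monitoring feeds:

```python
def classify_offline_event(mtas_responding: bool,
                           power_supply_on_battery: bool) -> str:
    """Hypothetical triage for a cluster of off-line modems:
    power supplies on battery backup point to a commercial-power
    outage, while responsive MTAs suggest the plant path is intact."""
    if power_supply_on_battery:
        return "residential power outage"
    if mtas_responding:
        return "likely CPE or provisioning issue, not a plant outage"
    return "probable plant outage: dispatch for validation"
```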
At Step 4 (determine problem impact/severity), we recognize that not all problems are created equal. An outage is a loss of service, which is typically more severe than a service degradation. But a severe degradation of some sensitive services, such as voice over Internet protocol (VoIP) telephony, could be viewed as a service outage. A service-affecting attenuation could measure only a 3 dB loss or a 10 dB loss or more. A noise condition could be moderate white noise or severe, high-power impulse noise.
Each problem, depending on its characteristics, may affect different services in different ways. For example, a 1 percent upstream packet loss may go unnoticed for high-speed data transmissions, but would be extremely disruptive to a phone call or a live video stream. A downstream noise condition may cause video artifacts or slow Internet video streams. Given the wide variations in problem types, we need to assign severity and service impact values as a relative measure against other validated problems to aid in problem ranking and prioritization.
Figure 1: Key service assurance operations steps
Figure 2: Comprehensive nine-step MTTR
In Step 5 (determine problem scope), we investigate problem scope as a qualifier for prioritization. Problem scope refers to the breadth of the problem, such as the number of customers affected, over what geographic region, with which services—even residential vs. commercial. While this is a straightforward concept, it is sometimes hard to determine readily. DOCSIS value variance or customer calls typically do not cover the entire scope of the problem. Many customers do not call in to report an experienced problem, or they are unaware at the time, and one RF problem area can mask or be coincidental to another. This is the mean-time-to-severity/scope (MTTS) component of MTTR.
Now, we have created a priority qualifier on all of the current known and validated problems. Naturally, we will focus our attention on repairing the highest priority ones first.
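One way to sketch such a ranking follows, assuming a simple severity-times-scope weighting; the weighting itself is our illustrative choice, not an operator formula:

```python
from dataclasses import dataclass

@dataclass
class Problem:
    name: str
    severity: int       # relative severity, e.g. 1 (minor) to 5 (outage)
    customers_hit: int  # scope: number of subscribers affected

def prioritize(problems):
    """Rank validated problems for repair, highest priority first,
    using an assumed severity-times-scope score."""
    return sorted(problems, key=lambda p: p.severity * p.customers_hit,
                  reverse=True)
```

With this weighting, a node outage affecting 400 subscribers outranks a moderate ingress condition on the same node, which in turn outranks a broken tap serving 10 homes.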
Steps 6 and 7
In Step 6 (determine problem root-cause) and Step 7 (determine exact location), we must determine the root cause of the problem and isolate its precise location. For example, is the problem at amplifier GR01-A5’s power supply located at the corner of Grant Road and Garden Lane, or is it a fiber cut 50 feet off node FA, or a broken tap port 110 feet off the Aux1 leg of amp HR02-A1? Depending on the integrity of the data in Steps 1-5, isolating the location may be a quick or a very long process. Any inaccuracy in Steps 1-5 will lead to false or ambiguous cause determinations and locations. It is equally important to understand both what is not working and where, and what IS working and where, and how the two are topologically related in the network (plant). For example, modems can go off-line because of dynamic host configuration protocol (DHCP) or headend issues, not outside-plant problems. Unfortunately, in many instances, this step is mostly a manual, unsynchronized process. This is the mean-time-to-cause-and-location (MTTL) component of MTTR.
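The "what is working vs. what is not, and how they are topologically related" reasoning can be sketched as finding the nearest common upstream element of the off-line devices. The topology map and element names below are hypothetical:

```python
# Hypothetical plant topology: each element maps to its upstream parent.
PARENT = {
    "modem1": "tap110", "modem2": "tap110", "modem3": "amp_HR02-A1",
    "tap110": "amp_HR02-A1", "amp_HR02-A1": "node_FA", "node_FA": None,
}

def path_to_root(element):
    """Walk from a device up through the plant to the node."""
    path = []
    while element is not None:
        path.append(element)
        element = PARENT.get(element)
    return path

def suspect_element(offline):
    """The nearest element upstream of ALL off-line devices is the
    first shared entry on their paths to the node -- a candidate
    fault location (devices NOT in the list exonerate their branches)."""
    paths = [path_to_root(e) for e in offline]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    for element in paths[0]:
        if element in common:
            return element
```

For example, if both modems behind tap110 are off-line but modem3 is healthy, the tap (not the amplifier or the node) becomes the prime suspect.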
In Step 8 (repair problem), we fix the problem. Hopefully, there are systemic remedies or replacement systems or parts available and access to the affected areas is allowed. If the root-cause cannot be corrected, there are typically two options: compensate or contain. We may be able to compensate for the original problem by adding a blocking filter or changing a pad, until we can actually return to repair the original source of the problem when access or spares are available. Ideally though, the problem would be addressed by some automated compensation. Forward error correction (FEC) is an example of an automated way to compensate for initially destroyed packets. Containing the problem would include manually or remotely disconnecting the entire leg. For example, if the noise were so severe as to bring down the service of all customers on a given node, it is better to disrupt 10 customers than 400. We will use a lowercase “r” for this mean-time-to-repair (MTTr) to represent the actual repair time component in our overall MTTR.
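The containment trade-off (disrupt 10 customers rather than 400) can be expressed as a small decision sketch; the function and its inputs are illustrative only:

```python
def should_isolate_leg(leg_customers: int, node_customers: int,
                       leg_fault_takes_down_node: bool) -> bool:
    """Containment choice: if severe noise on one leg would otherwise
    bring down the whole node, disconnecting the smaller leg preserves
    service for the majority until a proper repair can be made."""
    return leg_fault_takes_down_node and leg_customers < node_customers
```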
The final step, Step 9 (verify service restoration), is necessary to confirm that the repair actually restored the service level. If the root-cause and location were accurate (which the initial determination may not have been) and the repair was performed correctly, the service should be restored to nominal, pre-problem levels. This final step provides confirmed closure of the problem by validating that the symptomatic measures are restored or service performance measures are up to optimal levels. Operations policy regarding this final step determines whether the mean-time-to-close (MTTC) component of MTTR is included.
By summing up the component times described earlier, we arrive at a comprehensive MTTR, as depicted in the following equation (see Figure 2):
MTTR = MTTD + MTTP + MTTV + MTTS + MTTL + MTTr (+ MTTC)
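As a sketch, the summation, with the severity/scope term MTTS from Steps 4-5 and MTTC included only per operations policy, is:

```python
def total_mttr(mttd, mttp, mttv, mtts, mttl, mttr_repair, mttc=None):
    """Sum the component times (e.g., in minutes); MTTC is added only
    when operations policy counts Step 9 closure in the overall MTTR."""
    total = mttd + mttp + mttv + mtts + mttl + mttr_repair
    return total + mttc if mttc is not None else total
```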
Prioritization was discussed as a result of Steps 4 and 5 (scope and severity). In reality, prioritization is required at all phases where a potential dispatch of a technician or any other limited resource is required. Over the entire process flow, depending on where manual intervention is required, prioritization would be necessary.
Depending on the accuracy and level of integrity of the service health management system, or the nature of the problem, prioritization could be necessary as early as problem validation (Step 3). Therefore, the summed MTTR would vary widely based on where manual operations were required, ranging from 60-120 minutes for outages to 1-48 hours (or longer) for RF-related plant problems.
Turning this into an economic equation, we arrive at a mean-cost-to-repair (MCTR), as depicted in the following equation:
MCTR = MCTD + MCTP + MCTV + MCTS + MCTL + MCTr (+ MCTC)
MCTR would represent the total cost in operations labor dollars (for example, call center, NOC, field ops). This cost would not include cost of nominal operations or average cost of test equipment. For a business case, this cost is ultimately measured against the cost (or value) of lost customers, customer revenue (ARPU), or obtaining new subscribers (RGUs). Not to be overlooked, this total cost could also be measured against itself in an effort to reduce the total cost of operations.
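A minimal cost sketch follows; the hourly labor rates and group names are hypothetical values chosen only to illustrate the per-step accounting:

```python
# Hypothetical hourly labor rates per operations group (dollars/hour).
RATES = {"call_center": 30.0, "noc": 55.0, "field_ops": 75.0}

def step_cost(hours_by_group):
    """Labor cost of one step: hours logged by each group times its rate."""
    return sum(RATES[group] * hours for group, hours in hours_by_group.items())

def total_mctr(steps):
    """MCTR as the sum of per-step labor costs (MCTD through MCTr)."""
    return sum(step_cost(s) for s in steps)
```

For example, one NOC hour of validation plus two field-ops hours of repair would total $205 under these assumed rates.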
In an ideal world, we would dispatch no earlier than after Step 7, with Steps 1-6 fully automated, so that MTTR = MTTr. Better yet, the dispatch would occur while an automated compensation or containment system was invoked.
Opportunities to cut MTTR and MCTR
Many challenges exist in moving toward the goal of cost-effective, continuously managed service and network health—where MTTR approaches MTTr. The pilot deployment of a network health management system gave Charter Communications a view into the potential of embedded network intelligence to achieve the industry’s operations efficiency and network availability goals. In terms of the new MTTR, the system provided:
1. Scalable, continuous behavior awareness
2. Auto-generation and maintenance of a topological relationship of information
3. Real-time data synchronicity across multiple layers
4. Automation of all of Steps 1-7 of the operations process (and Step 9)
5. Automated compensation and containment (parts of Step 8)
It will take systems of this caliber, something more than just network monitoring, to deliver consistently reliable services and reduce MTTR and costs, all imperatives in today’s cost-sensitive and competitive environment.
Keith Hayes is VP network operations and engineering services for Charter Communications, and Richard Berthold is CTO of Proxilliant Systems. Reach them at Keith.Hayes@chartercom.com and firstname.lastname@example.org, respectively.