Now that we’ve finally reached the Year of VoIP (voice over Internet protocol), our installation and operations teams are beginning to realize that although voice and data both involve packets traveling across networks, voice requires different tests to verify that telephony is working. Voice quality measurement, or how closely the received waveform of a conversation matches the waveform at the sending end, is quickly gaining status as a metric for telephony offerings. Along with that status comes the need to understand the correlation between various voice quality measurement models and the limitations of automated voice quality testing.

How we got here
Cable has been offering carrier-grade telephony service for some time now, so you might ask why voice quality was not identified as a critical parameter from the start. The answer lies in the technology used to deliver the service. Although voice quality has always been important, the need for voice quality testing is far greater in a packet-switched network than in a circuit-switched network. Because a circuit-switched network ensures an end-to-end path for the duration of a telephone call, latency is almost always well within the 300 msec round-trip limit for end-to-end signal travel time, and jitter is a nonissue because all voice information takes the same route. Signals can still be degraded by physical impairments such as cable faults, but in general, these trouble conditions can be detected by operations systems that monitor the electrical characteristics of connections.

A number of models have been developed to measure voice quality, but there are only two ways to rate it: listening quality and conversational quality. Listening quality refers to the quality experienced by listening to one side of a conversation only, while conversational quality is the overall quality experience of a two-way telephone call, specifically including the effect of echo. Currently, most voice quality measurements fall into the listening quality category.

Various models
As background for understanding how voice quality testing should be used, let’s first look at a summary of some key listening quality models and how they relate to each other. Given the prevalence of models developed by PsyTechnics, I will focus on that company’s products for much of the discussion. Readers who would like a more thorough discussion of the individual models and alternate vendor implementations are referred to the April 2003 issue of this column, which can be found in the archives at www.ct-magazine.com/archives/ct/0403/0403_telephony.html.

Mean opinion score (MOS) is both the gold standard and the grandfather of voice quality testing. Its methodology, which goes back to Bell Laboratories testing of network equipment, consists of assembling a panel of human listeners who rate the quality of several hundred speech samples from 1 to 5, with 5 indicating best quality.

The International Telecommunication Union (ITU) E-Model is a design tool that predicts the average voice quality of calls processed by a network, based upon mathematical estimates of the effects of delay, jitter, packet loss and codec performance. It rates a network from 0 to 100, with 100 indicating best quality.

Perceptual speech quality measure (PSQM), perceptual analysis measurement system (PAMS), perceptual evaluation of speech quality (PESQ), and PESQ-listening quality (PESQ-LQ) are variations of models that predict MOS scores by comparing a voice file that has been processed by the network under test against a clean reference file. Tests using these models are intrusive because they require a dedicated test call rather than actual conversations. The PESQ model provides scores from -0.5 to 4.5, with 4.5 indicating best quality, while PESQ-LQ scores range from 1 to 5, the same as MOS. PsyTechnics, one of the developers of PESQ, claims a correlation of better than 90 percent between PESQ and MOS.
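Because the E-Model’s 0-to-100 rating (its "R factor") sits on a different scale than MOS, it is routinely translated into an estimated MOS for comparison with the other models. A minimal sketch of that translation, using the R-to-MOS conversion formula published in ITU-T G.107:

```python
def r_to_mos(r: float) -> float:
    """Convert an E-Model rating factor R (0-100) to an estimated MOS,
    using the conversion formula from ITU-T G.107."""
    if r < 0:
        return 1.0   # unusably degraded network: MOS floor
    if r > 100:
        return 4.5   # MOS ceiling for narrowband telephony
    # MOS = 1 + 0.035*R + R*(R - 60)*(100 - R)*7e-6
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6
```

For example, an R factor of 93.2 (a typical upper bound for an uncompressed G.711 call) maps to a MOS of about 4.41, while R = 50 maps to only about 2.6 — consistent with the 4.5 ceiling of the PESQ scale mentioned below.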
ITU-T P.563 and PsyTechnics PsyVoIP are nonintrusive models that predict an MOS score based upon live traffic. These models analyze Real-time Transport Protocol (RTP) streams for source and destination addresses, sequence numbers and jitter profile, and predict the impact of the IP bearer on the MOS value with an 80 to 90 percent correlation.

The catch
Despite such high levels of predicted correlation, you cannot expect exactly repeatable results or the ability to predict the score of an alternate model. For example, if an individual test has a 3.8 MOS score, the PESQ score should agree to within about 10 percent, or roughly the range of 3.42 to 4.18. This may not always be true, however, for reasons relating to the particular reference file being used and the MOS testing methodology. A true MOS test is subjective by nature, and large numbers of individual listening tests by different people are required to make a MOS score statistically valid as a measurement. Also, differences in language, gender of the speaker, or even file content have been found to influence both MOS and PESQ. Even with the same reference file, the correlation between MOS and any of the other models for a given test may not be within the predicted value, since the prediction is based upon several hundred samples. As for variations between MOS and the other models, it pays to remember that unless the MOS scores are from tests using human subjects, they are derived from other network parameters and are therefore also based upon predicted correlations derived from large samples.

The most valid application of automated voice quality testing is therefore between network aggregation points, such as gateway to gateway, with a large number of test calls to simulate behavior under actual network traffic volumes. Applied this way, a network quality number can be established as a metric for other tests. For example, MOS values can be determined for individual calls, in conjunction with a calibration file for network aggregation points. When individual test scores for network endpoints are observed, they can be compared against network averages to determine possible faults, such as a malfunctioning codec in a multimedia terminal adapter (MTA), or trends, such as poor routes.

Statistical variation aside, there are other limitations to automated voice quality testing.
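The endpoint-versus-network-average comparison just described amounts to a simple statistical screen: flag any endpoint whose score falls well below the network baseline. A sketch in Python, with illustrative names only (no particular test platform’s API is implied):

```python
def flag_suspect_endpoints(scores, network_mean, network_std, k=2.0):
    """Return endpoints (e.g. MTAs) whose predicted MOS falls more than
    k standard deviations below the network-wide average -- candidates
    for faults such as a malfunctioning codec, or trends such as poor
    routes.

    scores: dict mapping endpoint id -> predicted MOS for that endpoint.
    """
    cutoff = network_mean - k * network_std
    return sorted(ep for ep, mos in scores.items() if mos < cutoff)
```

If the gateway-to-gateway baseline is a 4.0 MOS with a standard deviation of 0.2, an MTA scoring 3.2 would be flagged for supplemental troubleshooting while one scoring 3.9 would not.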
Justin Vandenland, PsyTechnics marketing director, pointed out that voice quality testing cannot predict how well pure tones will pass through a network. "The algorithms are based upon vocoding, or pattern sampling, which involves several frequencies rather than a pure tone." In fact, many test files are essentially a collection of gibberish that generates the patterns needed to test network equipment in as short a time as possible.

Also, even a voice quality test based upon actual conversation will not catch problems with equipment or wiring on the subscriber side of the MTA. Acterna applications engineer Gary Meyer points out that a defective telephone mouthpiece, for example, will adversely affect call quality but would do nothing to impact a voice quality test. "The algorithms accurately analyze how closely a received voice frequency sample matches a sample on the sending side, but just like computers, ‘garbage in, garbage out’ holds true."

What it all means
The bottom line on voice quality testing is that it is a macro tool for determining an objective number (metric) that indicates overall network quality or the performance of a network element over a large number of tests. Individual line tests can be useful in pointing out some line quality issues, but problems on the subscriber side of the MTA will only be found with supplemental troubleshooting.

Justin J. Junkus is president of KnowledgeLink and telephony editor for Communications Technology. Reach him at firstname.lastname@example.org.