
State of the Art Speech Processing for Broadcasting

Martin Wolters
Cutting Edge, Cleveland, Ohio

ABSTRACT
Many algorithms for processing speech have been developed over the past few decades including compression, automatic gain control, de-essing and equalization. Today, equipment is available providing only a subset of these functions (compressor, de-esser) or providing a combination of many functions (microphone processor). Sometimes, the same devices are used by engineers in recording and broadcasting, although there are different objectives in each application and different considerations have to be taken into account. Historically, limitations of analog equipment and limited budgets often led to workarounds and very inefficient use of available algorithms. This paper explores state-of-the-art processing of speech signals in a broadcast environment. The advantages of digital processing are described taking into account the interaction between speech processing and commonly used audio processing. Finally, different ways of integrating a digital microphone processor into a broadcast studio are illustrated.

INTRODUCTION
Creating the "sound of the station" has become an important issue in the broadcast industry over the past decades. A number of factors make the aesthetics of sound a key point in a station's format and success. These include, for example, increased competition, improved quality of alternative transmission systems (e.g. cable, DAB), the high quality of new receivers and stereo systems (even for car radios) and the higher expectations of listeners for good sound quality. Using purpose-built audio processing equipment -- usually inserted at the very end of the audio chain -- is a common technique to create the specific "sound of the station". Most of the time, this audio processing is optimized to improve the sound of the radio station's music format -- obviously a very important, sometimes the most important part of the program. Since the music within a format and therefore the sound of the different songs within a program tends to be quite consistent, one can find that the application of certain processing parameters suffices to establish a station's on-air sound. In this case the raw material fed into a sound processor consists of more or less carefully produced recordings with a certain standard of quality with regard to leveling and equalization.

But announcers', talents' and DJs' voices are also an essential component of most formats. Much of this raw material is produced live, and very often there is no way to maintain the same standard that you can find in the above mentioned recordings. A specific processing of speech becomes necessary and is part of most modern broadcasting facilities. Nevertheless, it seems that the development of speech processing specifically for broadcasting has been neglected during the past decades. The result is little knowledge about how to use digital signal processing most effectively for such applications.
Based on knowledge about broadcasting and sound processing, combined with new scientific approaches about the properties of speech signals and the utilization of digital signal processing, new investigations toward the development of microphone processing products have recently been made. Some of the results are presented in this paper.

ALGORITHMS AND FUNCTIONS
The algorithms used in processing speech are automatic gain control (AGC), equalization (EQ), dynamic range control (DRC), de-essing, phase rotation (PR) and reverberation. Each of these algorithms has a specific task and the order in which these functions are arranged should not be arbitrary. Figure 1 shows an optimized signal path.

Each function will be discussed in the following paragraphs focusing on the specific requirements in a broadcast studio, the advantages of digital signal processing and the benefits of combining these functions into a single unit.



Figure 1

Automatic Gain Control (AGC)
One important issue in processing speech signals is level control. In a recording studio, the sound engineer usually takes care of the correct gain settings. The necessary gain is dependent on the room, the choice and position of the microphone and, of course, the person's voice. In a broadcast environment, the same room and the same microphone are used most of the time, so one could adjust the gain by taking into account these two factors. But the person's voice and the person's position might change. This is why an AGC is necessary. From an engineer's point of view, an AGC is a gain controller with slow attack and release times. Another point of view is to consider the AGC as a replacement for the sound engineer. This latter concept might be more helpful, because one can visualize some important issues:

1. A good sound engineer carefully monitors the input level to make sure that it is nominally at 0 dB (the reference level). A compressor does a similar task; however it changes the level more frequently and more quickly.

2. A sound engineer can "detect" if a person is speaking or not and tries to maintain the desired level of operation when the person is speaking. When the person is not speaking, the sound engineer "freezes" the last gain setting. Therefore, the AGC must take into account the operation of a noise gate which detects voice activity.

3. A sound engineer not only watches the input level, but also watches the compressor's activity. Adjusting the parameters of a compressor and adjusting the input gain are not independent and, therefore, the AGC and the compressor should interact.

The AGC should be one of the first stages within the signal path. Only a high-pass filter with a very low cutoff frequency (often referred to as a "rumble filter") should be placed before this algorithm. This high-pass filter reduces a possible DC offset introduced by the analog input circuit and filters unwanted noises such as hum, low frequency disturbance from touching the microphone (stand), etc. These signals would otherwise affect the operation of an AGC.
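
The gated behavior described in items 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular product: the frame-based RMS measurement, the function name `agc_gains` and all parameter values (`target_rms`, `gate_rms`, `rate`, `max_gain`) are assumptions chosen for readability. A simple level threshold stands in for the voice activity detector, and the interaction with the compressor from item 3 is left out.

```python
import math

def agc_gains(frames, target_rms=0.2, gate_rms=0.01, rate=0.05, max_gain=8.0):
    """Return one slowly varying gain value per frame of samples.

    All parameter values are illustrative assumptions.
    """
    gain = 1.0
    gains = []
    for frame in frames:
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= gate_rms:
            # Voice assumed present: move slowly toward the gain that
            # would bring this frame to the reference level.
            desired = min(target_rms / rms, max_gain)
            gain += rate * (desired - gain)
        # Otherwise the gate is closed and the last gain is "frozen".
        gains.append(gain)
    return gains

# Example: 200 frames of steady speech-like level, then 50 silent frames.
speech = [[0.1] * 480 for _ in range(200)]
silence = [[0.0] * 480 for _ in range(50)]
gains = agc_gains(speech + silence)
```

The small `rate` plays the role of the slow attack and release time; during the silent frames the gain stays at its last value instead of drifting upward and amplifying background noise.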

Dynamic Range Control (DRC)
In a superficial view, AGC and DRC appear similar in some ways. This is the reason that compressors -- one part of DRC -- are sometimes used as AGCs by adjusting the threshold very low so that the compressor provides an almost constant output level, independent of the input level. The result is very poor gain control, since none of the above mentioned issues are taken into account.

There are three new issues addressed by DRC:

1. DRC is used to "optimally use the full amplitude range of a recording system".[1] Unfortunately (from a sound engineer's point of view) there are a few high level peaks in speech which reduce the available headroom of a recording. These peaks do not increase the perceptual loudness of a signal because this is affected more by an average value.[2] Hence, reducing the peaks does not decrease the loudness, but increases the available headroom and allows additional gain, resulting in an overall increase in loudness. This is sometimes referred to as peak control and is a more technical aspect of DRC, especially compression/limiting. Carefully chosen parameters lead to inaudible compression, up to a certain amount of gain reduction.

2. Beyond this certain level of gain reduction, compression becomes audible. Fortunately, this "sound" imparted by a compressor -- the increased density of the speech signal -- can be considered pleasant and is sometimes used to create a specific "sound". This is the more art-related aspect of compression; the compressor as a tool for creating the "sound of the station".

3. DRC consists of more than just compression/limiting. A second, lower threshold can be utilized to further reduce all signals below this value. This is called an expander and, if the ratio of the reduction is almost infinity, signals below that threshold are muted and the function would be referred to as a "noise gate". The idea is that signals below a certain threshold are generally non-speech signals (e.g. background noise, paper shuffling and so forth) and should be reduced. This is especially important during interviews with studio guests or in the case of multiple announcers where a person's microphone is open but that person is not speaking. New research in the field of speech detection (e.g. for applications like mobile phones) has led to "intelligent" noise gate algorithms.[3] Rather than just monitoring the energy of a signal, these algorithms utilize zero crossing rate and analysis in the frequency domain to determine if a valid speech signal or a disturbing background noise is present. Digital audio processing allows the implementation of some of these ideas into a microphone processor, resulting in a more accurate noise gate.
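
The static behavior of compressor, expander and noise gate, together with the zero crossing rate used by the more recent gate algorithms, can be illustrated as follows. This is a sketch under stated assumptions: the function names, thresholds and ratios are hypothetical, attack and release smoothing is omitted, and the zero crossing rate is shown only as a feature, not as a complete speech detector.

```python
import math

def drc_gain_db(level_db, comp_threshold_db=-20.0, comp_ratio=4.0,
                exp_threshold_db=-50.0, exp_ratio=2.0):
    """Static gain in dB for an input level in dB (values are assumptions)."""
    if level_db > comp_threshold_db:
        # Compressor: above the threshold the output level rises
        # only 1/comp_ratio as fast as the input level.
        out_db = comp_threshold_db + (level_db - comp_threshold_db) / comp_ratio
    elif level_db < exp_threshold_db:
        # Expander: below the lower threshold the output level falls
        # exp_ratio times as fast; an infinite ratio gives a noise gate.
        out_db = exp_threshold_db + (level_db - exp_threshold_db) * exp_ratio
    else:
        out_db = level_db
    return out_db - level_db

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# A level 10 dB above the compressor threshold, with a ratio of 4.
peak_gain = drc_gain_db(-10.0)
# The same expander region with an infinite ratio acts as a gate.
gate_gain = drc_gain_db(-60.0, exp_ratio=math.inf)
```

A gate that combines the level with a feature such as `zero_crossing_rate` can tell a quiet voiced sound from broadband noise of the same energy, which a pure level threshold cannot.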

De-essing
In the past DRC -- especially compression -- and de-essing were integrated. De-essing was an extension to a compressor. Since research during the last two years resulted in new information about the properties of sibilants and led to development of new algorithms based on psychoacoustic evaluations, the connection between de-essing and DRC needs to be re-evaluated. An overview of the algorithms and concepts used in the past and an overview about sibilants and the problems in recorded speech introduced by these sounds can be found in a research study from 1998 [4].

To summarize the new information, one should distinguish between detection and reduction of sibilants. Investigations on speech recordings in four different languages showed that a very good and reliable detector for unpleasant sibilants is the psychoacoustic quantity sharpness.[5] Figure 2 shows the mean and standard deviation of sharpness calculated for 50 test sentences and for 141 sibilants within these test sentences which were marked as disturbing by at least three of four test persons (experts from the recording industry). A value of 1.2 acum can be utilized to safely detect unpleasant sounding sibilants. Based on a frequency analysis related to the human hearing system, sharpness can be calculated in today's digital signal processors (DSPs).[6]
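
A detector built on the 1.2 acum threshold might look like the following sketch. It assumes that the specific loudness over the critical-band rate (in Bark) has already been computed by a hearing-related frequency analysis, which is not shown. The weighting function and its constants are a commonly used approximation of Zwicker's sharpness model and are an assumption here, not taken from this paper; the function names are hypothetical.

```python
import math

def sharpness_acum(specific_loudness, dz=0.1):
    """Approximate sharpness in acum.

    specific_loudness[i] is assumed to be the specific loudness at
    z = (i + 1) * dz Bark. The weighting g(z) is an assumed approximation.
    """
    num = 0.0
    den = 0.0
    for i, n in enumerate(specific_loudness):
        z = (i + 1) * dz
        # High critical bands are weighted progressively more strongly.
        g = 1.0 if z <= 16.0 else 0.066 * math.exp(0.171 * z)
        num += n * g * z * dz
        den += n * dz
    return 0.11 * num / den if den > 0.0 else 0.0

def is_unpleasant_sibilant(specific_loudness, threshold_acum=1.2):
    """Apply the detection threshold of 1.2 acum given in the text."""
    return sharpness_acum(specific_loudness) > threshold_acum

# Two artificial loudness patterns on a 24 Bark axis (240 bins of 0.1 Bark):
# one concentrated near 8.5 Bark, one concentrated near 20 Bark.
low_band = [1.0 if 80 <= i < 90 else 0.0 for i in range(240)]
high_band = [1.0 if 195 <= i < 205 else 0.0 for i in range(240)]
```

With these artificial patterns only the high-band example exceeds the threshold, which mirrors the observation that sibilants carry their energy in the upper critical bands.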



Figure 2

A very effective algorithm for the reduction of sibilants without many artifacts can be implemented using a combination of spectral subtraction with a time varying band-pass filter and broadband compression. Time varying means the band-pass adapts to the spectral properties of a specific sibilant. In addition the use of a small amount of broadband compression reduces the so called lisp-effect.[4]


Equalization

There are three different types of filters used in audio and speech processing:

1. High-pass/low-pass filters: As already mentioned, a high-pass filter with a very low cutoff frequency can be used to reduce a possible DC offset, low frequency hum, and background noise. Similarly, low-pass filters can eliminate high frequency noise. In general, these filters are used to limit the audio spectrum. They are less important in controlling the "sound of a station".

2. Shelving filters: These filters are used to weight (boost or cut) certain frequencies, in particular high frequencies above the cutoff frequency and low frequencies below the cutoff frequency respectively. One can create a specific sound of a station using these filters. But it may not be necessary to carefully adjust the parameters for each individual person. A more general approach (maybe separate for male and female announcers) can lead to a successful, good sounding timbre.

3. Peak filters: These filters allow very detailed changes within the frequency spectrum. They also allow changes at any desired frequency. Fully parametric peak filters provide control of the center frequency, Q-factor and gain. Used with a low Q-factor, peak filters can be used as a general cut or boost of the midrange, similar to the effect of shelving filters on high and low frequencies. Peak filters can be used for more detailed changes as well by utilizing a high Q-factor. But this kind of adjustment needs to be made for each individual speaker.




Figure 3

Figure 3 shows examples of the described audio filters. The graphic depicts second order low- and high-pass filters, four different shelving filters, peak filters with low and high Q-factors, and a combination of a high-pass, a low-shelving, a peak and a high-shelving filter. This combination simulates a possible EQ-chain in a microphone processor.
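
As an illustration of a fully parametric peak filter, the sketch below computes second-order (biquad) coefficients from center frequency, Q-factor and gain, and evaluates the resulting magnitude response. The coefficient formulas follow the widely used "Audio EQ Cookbook" form; this is an assumption, since the paper does not state which filter design was used, and the function names are hypothetical.

```python
import cmath
import math

def peak_filter(fs, f0, q, gain_db):
    """Return normalized biquad coefficients (b, a) of a peak filter."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b = [1.0 + alpha * a_lin, -2.0 * math.cos(w0), 1.0 - alpha * a_lin]
    a = [1.0 + alpha / a_lin, -2.0 * math.cos(w0), 1.0 - alpha / a_lin]
    # Normalize so that the leading feedback coefficient is 1.
    return [x / a[0] for x in b], [x / a[0] for x in a]

def magnitude_db(b, a, fs, f):
    """Magnitude response in dB of the biquad at frequency f (Hz)."""
    z1 = cmath.exp(-2j * math.pi * f / fs)  # z^-1 on the unit circle
    h = (b[0] + b[1] * z1 + b[2] * z1 * z1) / (a[0] + a[1] * z1 + a[2] * z1 * z1)
    return 20.0 * math.log10(abs(h))

# A 6 dB boost at 3 kHz with a low Q-factor, at a 48 kHz sampling rate.
b, a = peak_filter(48000.0, 3000.0, 1.0, 6.0)
center_gain = magnitude_db(b, a, 48000.0, 3000.0)
```

Raising `q` narrows the affected band around `f0` while the gain at the center frequency stays at `gain_db`, which is the difference between the low-Q and high-Q curves described for Figure 3.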

These filters, in general, are probably the best known sound processing tools. Rather than review the fundamentals of filters, there are two properties of digital signal processing related to the implementation of filters that will be discussed:

1. Without explaining the reasons and effects in detail, one should know that if DSPs with fixed point arithmetic are used, the quality of filters with low cutoff frequencies can be inferior. Even if this is a problem of fixed point arithmetic in general, there are good sounding, low noise algorithms available. These problems are less prevalent in a DSP with a floating point arithmetic, but even this approach may yield poor filter performance. This means a digital microphone processor that uses floating point arithmetic does not necessarily sound better than a unit that uses fixed point arithmetic; the best way to test such units is to tune them to low cutoff frequencies.

2. The number of algorithms that can be used at the same time within a digital processor is limited by the computational power of the DSP used. This means, for example, that the number of filters that can be used at the same time is limited. Traditionally, three filters have been a reasonable number for a broadcast microphone processor. However, there are no restrictions on the types of filters. Since there is no drawback, a digital microphone processor allows one to use all of the above mentioned types of filters in any combination. Assuming there are three filters available, the following combinations could be useful: a) three peak filters (this combination might need careful adjustments on a per person basis); b) an adjustable high-pass filter followed by peak and/or shelving filters; c) a low shelving, a peak and a high shelving filter or any other combination.
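
One of the fixed point effects mentioned in item 1 can be made visible with a small experiment. The sketch below rounds the feedback coefficients of a second-order high-pass filter to a 16-bit grid and measures how much the term 1 + a1 + a2 (the filter denominator at zero frequency) changes. The direct-form structure, the 14 fractional bits, the cookbook-style coefficient formulas and all names are assumptions for illustration; coefficient rounding is only one of several causes, and well-designed fixed point structures avoid much of this error.

```python
import math

def highpass_denominator(fs, fc, q=math.sqrt(0.5)):
    """Normalized feedback coefficients (a1, a2) of a second-order high-pass."""
    w0 = 2.0 * math.pi * fc / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    return -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0

def quantize(value, frac_bits=14):
    """Round a coefficient to a fixed point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale

def dc_term_error(fs, fc, frac_bits=14):
    """Relative error of 1 + a1 + a2 caused by coefficient rounding."""
    a1, a2 = highpass_denominator(fs, fc)
    exact = 1.0 + a1 + a2
    rounded = 1.0 + quantize(a1, frac_bits) + quantize(a2, frac_bits)
    return abs(rounded - exact) / exact

# Compare a rumble filter at 20 Hz with a filter at 2 kHz (48 kHz sampling).
error_low = dc_term_error(48000.0, 20.0)
error_mid = dc_term_error(48000.0, 2000.0)
```

For the very low cutoff frequency the exact term is smaller than one quantization step, so rounding changes it by a large relative amount, whereas at 2 kHz the same rounding is negligible. This is why listening tests at low cutoff frequencies are a revealing check.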

Artificial Reverberation
Although digital signal processing makes high quality reverberation possible, there are still huge differences in the quality of artificial reverberation. This depends significantly on the computational power available -- more than any other function described in this paper -- and therefore directly impacts the price of a unit. In broadcasting, where artificial reverberation is infrequently used, the highest quality products are not required. For example, a detailed adjustment of reverb parameters -- such as the kind of surface, size of a room or absorption of higher frequencies -- might not be necessary.

However, for broadcasters desiring artificial reverberation, there are two significant advantages in integrating artificial reverberation into a broadcast microphone processor:

1. A specific microphone preset (e.g. the personal preset of an announcer) would contain all parameters, including the settings for reverberation. Anticipating a later discussion in this paper, it should be mentioned that it is very important to be able to restore the settings of all parameters in a quick and easy way. It does not seem very practical to store parameters for a separate reverb processor within a microphone processor. Integrating microphone processing into a broadcast facility includes controlling reverb parameters, which can be accomplished more easily by a built-in reverberation algorithm.

2. The combination of de-essing and reverberation increases sound quality. Some sound engineers in recording studios realized that the unpleasant sound of sibilants in recorded speech is significantly increased by artificial reverberation. They discovered that the problem could be mitigated by the use of two different de-essing units: One that controls the sibilants of the main signal and a second one that controls the sibilants of the signal used by the reverb processor. Integrating a de-esser and a reverb processor into a single microphone processor allows the use of this idea without increasing the cost of the unit. Since the detector for sibilants has to be implemented only once, a specific, advanced reduction of sibilants in the signal used by the reverb algorithm does not require much more computational power.

Phase Rotation (PR)
A function unique to microphone processing for broadcasting is phase rotation. It was invented a couple of decades ago to minimize artifacts created by general sound processing in broadcast facilities, especially during clipping. The reason for these artifacts is the asymmetric nature of some human voices.



Figure 4

Figure 4 shows, in the upper left corner, the waveform of a typical asymmetric voice. Although the average over time of this signal is zero (meaning that there is no DC offset), one can see that the peak values above zero are much smaller than the peak values below zero. A clipper limits a signal to an absolute value. The dashed lines in Figure 4 indicate a possible clip threshold. Clipping would affect the two halves of the signal differently. In such a case, clipping produces a more disturbing sound than clipping of a symmetric signal.

The reason for the asymmetry of a voice signal can be found by observing the relation in time of the different formants of a specific phoneme. The two bottom plots on the left of Figure 4 show the major frequency components resulting in the asymmetric waveform. This relation in time is formed by the human vocal tract which can be modeled as an acoustical system of tubes with different lengths and sizes. The dimensions of these tubes are different for different individuals and different phonemes. This can cause an "unfavorable" phase of the frequency components resulting in an asymmetric signal.



Figure 5

Changing the phase of these signals more or less randomly (with an "all-pass" filter) is called "phase rotation" and results in reestablishing a symmetric signal. The right side of Figure 4 shows the processed signal and the changed relation in time of the two formants. Figure 5 clarifies the effect of an all-pass filter on the phase. One can see that the amplitudes of the signal are not affected. These changes of phase are usually not audible, except in the case of very transient signals. The best solution is to implement a phase rotator in a microphone processor and adjust this function for each person individually. In this way, the music programming is not affected and the phase rotator is only used when desired.
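
The defining property of an all-pass filter, unchanged amplitude with frequency-dependent phase, can be checked with a short sketch. A cascade of first-order sections is used here purely for illustration; the number of sections and the coefficient values are assumptions and are not taken from any actual phase rotator design.

```python
import cmath
import math

def allpass_response(c, w):
    """Frequency response of H(z) = (c + z^-1) / (1 + c z^-1) at w rad/sample."""
    z1 = cmath.exp(-1j * w)
    return (c + z1) / (1.0 + c * z1)

def allpass_filter(samples, c):
    """First-order all-pass: y[n] = c*x[n] + x[n-1] - c*y[n-1], with |c| < 1."""
    out = []
    x_prev = 0.0
    y_prev = 0.0
    for x in samples:
        y = c * x + x_prev - c * y_prev
        out.append(y)
        x_prev, y_prev = x, y
    return out

def phase_rotator(samples, coefficients=(-0.9, -0.8, -0.7, -0.6)):
    """Cascade several sections; the coefficients are illustrative only."""
    for c in coefficients:
        samples = allpass_filter(samples, c)
    return samples

# The magnitude is 1 at every frequency, while the phase varies with frequency.
h_low = allpass_response(-0.8, 0.1)
h_high = allpass_response(-0.8, 2.5)
impulse_response = allpass_filter([1.0] + [0.0] * 999, -0.8)
```

Because each section only shifts the phase, harmonics of a voice keep their amplitudes but change their relation in time, which is what redistributes the peaks between the positive and negative half of the waveform.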

The challenge for the sound engineer is how to determine whether to use a phase rotator for a specific person and, if so, how much phase rotation is necessary. One could simply trust one's ears. In that case, limiting the voice using a clipper can aid in adjustment. This might not be very accurate, but in the end the ear is the most important detector. One could add an oscilloscope to visualize the signal, making asymmetric voices easier to analyze. The most accurate and elegant method would be an indicator within the microphone processor. By measuring the peak-to-average level of the positive and negative signal values and comparing these values, a simple but highly effective indicator would help the sound engineer to adjust phase rotation for a specific person.

USER INTERFACE AND INTEGRATION OF A MICROPHONE PROCESSOR IN A BROADCAST FACILITY

Requirements
There are some important requirements on how to integrate a microphone processor in a broadcast facility which affect the user interface of such a device and which are different from requirements in a recording studio. Besides the differing algorithms and functions described in the first part of this paper, the requirements of the user interface are an important reason to design specific microphone processors for broadcast facilities:

  • There are generally several on-air and production studios within a broadcast facility. Once the parameters are adjusted for a specific person, it should be possible to use these settings in every studio.

  • There is often no technician available. Selecting the correct preset must be very simple so non-technical persons can perform that task.

  • Radio stations take their sound very seriously. In most cases the talent should not have access to change parameters capriciously.

  • The unit should assist a technician in troubleshooting. Live broadcast requires reliability and, in case of technical problems, a quick way to detect and fix problems.

  • A microphone processor should be able to be integrated in an on-air scheduler. That way the selection of correct presets can be automated.

  • The microphone processor can be inserted as an effects processor into a mixing console or can be used as a microphone preamplifier as the first component within the audio chain. In the case of a digital studio, the AES/EBU outputs should be able to be synchronized.

An Elegant Solution
Based on the premise that most radio stations are already equipped with a computer network, the following system was designed:

1. The microphone processor itself has a very easy-to-use user interface. The simplest design is appropriate -- meaning that the user can only change the preset of the unit but no other parameter. The user chooses from a list that is sorted by preset number, preset name or the most recently used presets, allowing a convenient and fast way to find a specific preset.

2. In the case where a fixed preset is required (e.g. guest microphone), the preset can be locked.

3. There are level meters and status LEDs to assist in case of technical problems.

4. A headphone jack allows monitoring without additional hardware (e.g. at a workstation) and assists during troubleshooting.

5. Parameters and presets can be edited using remote software running on a computer. A more sophisticated user interface on this remote application assists the sound engineer when adjusting parameters much better than a necessarily smaller display on the front panel of the unit. The remote software can use different physical connections to the microphone processor such as TCP/IP networks, RS232 ports or other serial connections.

6. In a broadcast facility with more than one microphone processor, the units are connected to the network. A preset management system integrated into the remote application allows for easy distribution of a new or changed preset to each unit. Bigger radio networks can administer microphone processors in different studios from a single place. A security system allows only certain people to change presets and protects the units against unauthorized tampering.



Figure 6

Figure 6 gives an example of how the different units are connected to control parameters and presets. Whereas a computer in the production studio might primarily be used to adjust parameters for a specific person, another computer (e.g. in the office of a station engineer) could run an application for the preset management and other administrative tasks.
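
The division of roles between front panel, remote software and preset management can be modeled in a few lines. Everything in this sketch is hypothetical (the class names `Preset` and `MicProcessor`, the `distribute` function, the parameter name and the user names) and it ignores the network transport entirely; it only illustrates preset selection, locking and authorized distribution as listed above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Preset:
    number: int
    name: str
    parameters: dict

@dataclass
class MicProcessor:
    unit_name: str
    presets: dict = field(default_factory=dict)
    active: Optional[int] = None
    locked: bool = False

    def select(self, number):
        """Front panel action: change the preset unless the unit is locked."""
        if self.locked or number not in self.presets:
            return False
        self.active = number
        return True

def distribute(preset, units, user, authorized_users):
    """Copy a new or changed preset to every unit on the network."""
    if user not in authorized_users:
        raise PermissionError(f"{user} may not change presets")
    for unit in units:
        unit.presets[preset.number] = preset

# One preset edited in the production studio and sent to both studios.
units = [MicProcessor("on-air"), MicProcessor("production")]
morning = Preset(1, "Morning host", {"agc_target_db": 0.0})
distribute(morning, units, "engineer", {"engineer"})
units[0].select(1)
units[1].locked = True  # e.g. a guest microphone with a fixed preset
```

Keeping parameter editing out of `MicProcessor.select` reflects the requirement that the talent can recall a preset but cannot change the station's sound.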

CONCLUSION
The algorithms and functions used in state of the art speech processing for broadcasting were summarized. Where digital signal processing can improve these functions, the necessary technical information was given. The benefits of combining all speech processing in a single unit were listed. A summary of the properties of speech signals was added where they explain the goals and reasons of a specific processing function. An overview of the requirements for the integration of a microphone processor into a broadcast facility led to a new approach for a specialized user interface.

References

[1] U. Zoelzer: Digital Audio Signal Processing, John Wiley & Sons Ltd, Chichester, 1997

[2] E. Zwicker: Psychoacoustics, Springer Verlag, Berlin, 1990

[3] R.J. Santiago: A Noise Robust Method for Detection of Endpoints of Speech Utterances, Master Thesis, Marquette University, 1997

[4] M. Wolters: The Acoustical Properties of Sibilance and New Basic Approaches for De-essing Recorded Speech, paper at the 20th Tonmeistertagung, Karlsruhe, Nov. 1998

[5] M. Sapp, M. Wolters, J. Becker-Schweitzer: Reducing Sibilants in Recorded Speech Using Psychoacoustic Models, paper at the ICA/ASA-Meeting, Seattle, 1998

[6] M. Wolters, M. Sapp, J. Becker-Schweitzer: Adaptive Algorithm for Detecting and Reducing Sibilants in Recorded Speech, 104th Convention of the AES, Amsterdam 1998, Preprint 4677
