|
Adaptive
Algorithm For Detecting And Reducing Sibilants In Recorded
Speech
Martin Wolters*, Markus Sapp,
Jörg Becker-Schweitzer Institute of Communication Engineering
(IENT), Aachen University of Technology (RWTH), Germany
(*now with Telos Systems, Cleveland, USA)
Abstract
Different concepts for reducing sibilants in recorded speech
have been developed, and have been successfully implemented as
'De-Esser' hardware. Most of them are working in the analog
domain, only a few implementations using adaptive mechanisms.
Therefore a digital algorithm is introduced which basically
uses psychoacoustic and physical units to detect sibilancy and
to adapt a time-variant bandpass filter to perform the de-esser
operation.
Overview
Two examples which demonstrate the need for an adaptive
DeEsser algorithm are the broadcast studio and the auditorium.
In both situations there may be several different announcers
but no possibility to change parameters between each case by
hand, either because there is no technician available or the
needed time is missing. An adaptive algorithm might be helpful
in more sophisticated situations as well, e.g. mastering,
where the influence on the signal should be as small as
possible and where the algorithm should be adjusted to the
current signal very precisely.
To cl assify the different types
of sibilants new investigations on these sounds in German,
English, Spanish and French have been made. Connecting with
phonetic studies, statistical evaluations of a listening test
led to a detailed description of the common and differing
properties in the time- and frequency-domain. [Wol97]
To find an adaptive algorithm which
considers these properties, a reliable method is needed for
detecting the different kinds of sibilants. In [Sap98] it is
shown that the psychoacoustic unit 'sharpness' is well
correlated with the appearance of these sounds (see figure 1)
and that the critical-band intensities - an intermediate step
in calculating sharpness - can be utilised to get information
about the spectral properties of each sibilant.
This paper focuses on an effective
real-time implementation of an adaptive DeEsser algorithm on
digital signal processors. Because of the complexity in
calculating sharpness a two-step-detector is introduced. The
first step uses zero-crossing rate and energy of the signals
to roughly detect possible sibilants. But these physical units
do not provide information about the spectral properties of
the specific sibilant and do not represent the strength of
unpleasant sibilants very well. Therefore the detected sounds
are analyzed in more detail within the second step by
calculating sharpness. To further reduce the complexity
simplifications in computing sharpness are presented.
Finally sibilancy is reduced by
filtering the disturbing part of the signal using cascaded
linear-phase highpass- and lowpass filters and subtracting the
filtered part. An equation for calculating a weighting factor
and usable filter concepts are submitted.
Calculating Sharpness
Sharpness S is a weighted first moment of the
critical-band rate distribution of specific loudness N'.
[Zwi90]
(1)
where
(2)
N is called the loudness of a
signal. Sharpness is calibrated so that a narrow-band noise
centered at 1 kHz produce a sharpness of 1 acum.
Aures modified equation (1) and described four
steps for calculating sharpness.[Aur84] First the
critical-band intensity has to be computed. This measure
represents the energy of a signal within a critical-band
(equal to 1 Bark), which is a frequency mapping related to the
human hearing-system. A transformation from the frequency
scale to the critical-band rate scale is
(3)
Aures suggested an algorithm based on
a 1024-point-FFT at a sample-rate of 30.72 kHz leading to 48
values from 0 to 24 Bark with an interval of 0.5 Bark. Taking
the transmission factor a0 (representing the
transmission between freefield and our hearing system) into
account and considering a level-depended influence of
neighboured critical-band intensities by adding slope
excitations, excitation E(z) can be calculated in a
second step. The third step calculates the specific loudness
(4)
where
threshold
factor representing just-noticeable differences
excitation corresponding to the reference intensity
excitation
at threshold in quiet
reference
specific loudness, which is used to calibrate the equation,
so that a 1-kHz tone with a level 40dB produce exactly 1sone
as total loudness.
Aures' modified equation for
calculating sharpness is
(5)where
(6)
and
(7)
Simplifications in calculating
sharpness of speech signals on a DSP
Calculating sharpness on a DSP using the algorithm suggested
by Aures is very inefficient. Besides it is impossible to
measure the absolute level of the speech signals in the
assigned equipment. Therefore the following simplifications
are used.
As suggested by Zwicker [Zwi90] the
slope excitations can be neglected in a first approximation
of the specific loudness for speech signals. This is because
of speech signals have a broad spectrum and slope
excitations lead to significantly different excitations only
for narrow spectrums.
The exponent 0.23 in equation (4) is
debatable in technical literature and sometimes higher
values are suggested. Therefore the exponent 0.25 (doubled
square root) is used because of easier computation.
The terms 1-s(z) and -1
in equation (4) are neglected because they only have an
effect in the presence of weak signals which will not be
processed by the DeEsser algorithm.
Even though Aures showed that his
equation for calculating sharpness is more accurate
especially for signals with different loudness, equation (1)
is used in the presented algorithm. Otherwise Aures' formula
for the weighting factor g(z) (6) is still applied.
Instead of a 1024-point-FFT, a
256-point-FFT is used for calculating the critical-band
intensities at 44.1kHz and 48 kHz sampling rates. At higher
bands the smaller frequency resolution does not have any
influence especially because of the broad spectrum of speech
signals. Only a few lower bands cannot be computed because
of missing spectral lines. For these bands a simple
interpolation can be used. Additionally, the functions a0
and g(z) are almost constant or linear in the lower
frequency-region. Therefore detailed information about the
distribution of energy within this region is not necessary.
Taking these simplifications into
account calculating sharpness is reduced to the following
steps:
256-point-FFT
Computation of critical-band
intensity I(z) for 48 overlapping critical bands by
summing up the spectral lines and interpolating the missing
values.
48 multiplication's and 48 doubled
square roots for calculating the specific loudness
(8)
where a0 can be calculated using
the equation (derived from [Ter79])
(9)
48 multiplication's, 96 summation's
and one division for calculating sharpness.
(10)
The mean error produced by this
algorithm can be reduced by adjusting the factor c. Comparing
the values calculated using this simplified algorithm with the
values calculated by the original Aures-algorithm leads to a
mean error of 0 % and a standard deviation of 5.8 % for 50
test sentences (in four different languages, spoken by native
speakers). This accuracy is sufficient for detecting
sibilants. A side effect of these simplifications is that the
absolute level of the speech signals has not been taken into
account in calculating sharpness.
Detecting Sibilants
Even though the simplified algorithm for calculating sharpness
can be implemented on a DSP much more efficient than Aures'
algorithm it cannot be computed for each sample on today's
standard DSP-chips. On the contrary calculation of this value
for non-overlapping blocks is possible in real-time but leads
to an inaccurate detection of the beginning of a sibilant.
Therefore two easily computable physical units can be utilized
for a rough detection of sibilance:
zero-crossing rate Z and
short-time energy E.
Rabiner summarizes the coherence of
speech signals with these units and specifies the following
equations for computing.[Rab78]
(11)
where
(12)and
(13)and
(14)where
(15)
He suggests a highpass filter starting
at 60 Hz before calculating the zero-crossing rate (to
minimize the error introduced by an offset) and a final
lowpass filter (to flatten the output). A possible sibilant is
detected if zero-crossing rate and short-time energy exceed
certain thresholds. Observing the energy prevents false
detection in case of silence whereas zero-crossing rates
higher than 10-20% indicate the presence of sibilants.
If this algorithm notices a possible
unpleasant sibilant, sharpness can be calculated leading to a
more sophisticated two-step detector. (Figure 2) This detector
does not increase the introduced time delay but decreases the
basic computational load while improving the detection of a
sibilant's beginning.
Reducing Sibilants
A basic algorithm for reducing sibilants is suggested in
[Sap98]. The disturbing part of the signal is extracted using
cascaded linear-phase highpass and lowpass filters, scaled by
a weighting factor and subtracted from the correctly delayed
input signal. The already computed critical band intensity
represents aurally adequate information about the spectral
properties of the sibilant. The frequencies nearest to the
peak where the critical-band intensity falls below a certain
percentage of the peak, can be taken as cut-off-frequencies
for the highpass and lowpass filters. Percentages between 20%
and 40% lead to good results, with smaller values increasing
the bandwidth of the reduction. (Figure 3)
The choice of cut-off-frequencies
should be restricted to the part of the critical-band
intensity greater than 15 Bark. On the one hand this prevents
instabilities in case of transients or voiced phonemes. On the
other hand statistical evaluations on sibilants in four
different languages showed that the low cut-off-frequencies
will be in the region of 15 to 22 Bark and the high
cut-off-frequencies will be in the region of 19 to 24 Bark so
that the search-domain can be limited. (Figure 4)
The constraint of linear-phase filters
reduces the number of usable filter implementations.
Linear-phase IIR filters as introduced by Azizi in [Azi97a]
and [Azi97b] as well as linear-phase FIR filters can be used,
however considering signal delay introduced by the filtering
process might be easier in case of linear-phase FIR filter
implementations. Storing coefficients in memory is efficient
because only 26 different coefficient-sets are necessary. The
filters do not have to be adjusted during a sibilant as proven
in [Sap98].
Filter parameters like order,
stop-band attenuation, number of taps and steepness depend on
the algorithm implementation and the hardware capabilities.
Increasing stop-band attenuation (i.e., decreasing the
influence on other signals than the disturbing part of the
sibilant) increases the number of taps and the filter order as
well. While flat flanks decrease the artifacts produced by
subtracting the filtered part from the original signal, steep
flanks minimize the influence to other signal components.
Bandwidth and center-frequency of the
disturbing part of the sibilant as well as the calculated
sharpness should be taken into account by computing a
weighting-factor for the subtraction. One might use the
following equation for calculating this factor:
(16)
where
AS, AM, AD
:weighting the influence of sharpness, center-frequency
and bandwidth
S :sharpness
BM :mid-frequency of
cascaded filters
BZ :'reference
mid-frequency'
BD :bandwidth of cascaded
filters
G :weighting the strength of
de-essing
A time varying weighting-factor can be
used for smoothly fading the DeEsser-algorithm in and out.
Zero-crossing rate and short-time energy can be successfully
utilized as criteria for the end of a sibilant by detecting
when one of these units falls short of the threshold.
There are a lot of possibilities to
provide experts with meaningful parameters. The thresholds of
sharpness, zero-crossing rate and energy for detecting
sibilants can be combined to create one threshold for the
DeEsser. The percentages for finding the characteristic
frequencies can be utilized to adjust the bandwidth of the
filter as already mentioned and the parameter G in
equation (16) is similar to the well known 'ratio knob'.
Conclusions
Major simplifications in calculating sharpness of speech
signals were presented. Connected with zero-crossing rate and
short-time energy this led to a two-step detector for
sibilants which can easily be implemented on today's low-cost
standard digital signal processors and which can be combined
with an adaptive algorithm for reducing sibilants. These
algorithms can be configured as a standalone unit without any
parameter as well as a more sophisticated system including the
well known parameters threshold, bandwidth and ratio. Several
specialists like recording engineers and developers of
professional audio equipment performed tests with these
algorithms and suggested refinements of parameters.
REFERENCES
|
[Aur84]
|
Wilhelm Aures: Berechnungsverfahren
für den Wohlklang beliebiger Schallsignale, ein
Beitrag zur Gehörbezogenen Schallanalyse,
Dissertation am Lehrstuhl für Elektroakustik der
Technischen Universität München, 1984
|
|
[Azi97a]
|
S.A. Azizi: Realization of
Linear Phase Sound Processing Filters Using Zero Phase
IIR Filters, Proc. 102nd AES
Convention, Munich, Preprint 4506, March 1997
|
|
[Azi97b]
|
S.A. Azizi: Performance
Analysis of Linear Phase Audio Filters based on the
Zero Phase Filtering Concept, Proc. 103rd
AES Convention, New York, Preprint 4535, September
1997
|
|
[Rab78]
|
Lawrence R. Rabiner, Ronald W.
Schafer: Digital Processing of speech signals,
Prentice-Hall, New Jersey, 1978
|
|
[Sap98]
|
Markus Sapp, Martin Wolters, Jörg
Becker-Schweitzer: Reducing sibilants in recorded
speech using psychoacoustic models, Paper accepted
for presentation on ICA/ASA-Meeting, Seattle, 1998
|
|
[Ter79]
|
E. Terhardt: Calculating
virtual pitch, Hearing Research, Vol. 1, 155-182,
1979
|
|
[Wol97]
|
Martin Wolters: Entwicklung
von Algorithmen zur Detektion und Unterdrückung von
Zischlauten, diploma thesis at the RWTH Aachen ,
1997
|
|
[Zwi90]
|
E. Zwicker: Psychoacoustics,
Springer-Verlag, Berlin, 1990
|
Figure 1. Correlation between
sibilance and sharpness. 50 test sentences in German, English,
Spanish and French were spoken by native speakers. The
sibilants were marked as disturbing by at least three of four
experts in a listening test. [Sap98]
Figures
Figure 2. Two-step detector and
algorithm for reducing sibilants.
Figure 3. Typical example for
using critical-band intensity to determine cut-off
frequencies. Percentages between 20% and 40% lead to good
results, with smaller values increasing the bandwidth of the
reduction.
Figure 4. Detected cut-off
frequencies for 141 sibilants in four different languages,
spoken by male and female native speakers.
Top
|