Towards the use of sub-band processing in automatic speaker recognition

  • Robert Finan

    Student thesis: Doctoral Thesis


    Automatic speaker recognition uses a person’s voice as a means of identifying them. It has many practical applications, particularly in the area of security systems. The thesis investigates the application of neural networks (both classifying and predicting) to the problem, as well as offering a new alternative to the current wide-band approach to speaker recognition.

    As part of the work, a new method of ranking a speaker’s impostors is presented. It involves creating a vector quantisation (VQ) model for each potential impostor and testing this impostor model against the genuine speaker’s VQ model. The ranking of these model vs. model scores is indicative of the final ranking of the impostors, as determined by the test results. Previous methods of ranking impostors required testing the genuine speaker model against every impostor’s training utterances. This requires considerably more computation than the new method, which needs only one test per impostor.
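
The model vs. model ranking can be illustrated with a minimal Python sketch. The abstract does not say how two VQ models are compared, so the distance used here (the mean distance from each impostor codeword to its nearest speaker codeword) is an assumption, and the names `codebook_distance` and `rank_impostors` are hypothetical.

```python
import math

def codebook_distance(impostor_cb, speaker_cb):
    """Mean distance from each impostor codeword to its nearest speaker codeword.

    Assumed measure for illustration; the thesis may define the
    model vs. model score differently.
    """
    total = 0.0
    for codeword in impostor_cb:
        total += min(math.dist(codeword, s) for s in speaker_cb)
    return total / len(impostor_cb)

def rank_impostors(speaker_cb, impostor_cbs):
    """Return impostor names ordered from closest to farthest model."""
    scores = {name: codebook_distance(cb, speaker_cb)
              for name, cb in impostor_cbs.items()}
    return sorted(scores, key=scores.get)

# Toy two-dimensional codebooks (made-up values, not thesis data).
speaker = [(0.0, 0.0), (1.0, 1.0)]
impostors = {
    "a": [(5.0, 5.0)],
    "b": [(0.0, 0.1)],
    "c": [(2.0, 2.0)],
}
ranking = rank_impostors(speaker, impostors)
```

Each impostor is scored once against the speaker model, which reflects the single test per impostor described above.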

    Impostor ranking has potential applications in score normalisation for models that do not use discriminative training. Such models include predictive neural networks (PNN) and vector quantisation, both of which produce a score that is the distance of the test utterance from the model. These scores may be normalised by comparing them to a distance from an anti-speaker model. Using the model vs. model impostor ranking, a cohort of impostors may be selected for each speaker to represent the anti-speaker model. A normalisation method, based on cohorts of these ranked impostors, markedly improved the verification error rates of distance models for both text-dependent and text-independent conditions.
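
A rough sketch of cohort-based normalisation follows. It assumes the anti-speaker distance is the mean of the cohort distances and that normalisation is a simple difference; the abstract does not specify either choice, and the function names and the threshold are hypothetical.

```python
def select_cohort(ranked_impostors, size):
    """Take the closest-ranked impostors as the speaker's cohort."""
    return ranked_impostors[:size]

def normalised_distance(speaker_dist, cohort_dists):
    """Speaker distance minus the mean cohort (anti-speaker) distance.

    Assumed form of normalisation; a ratio would be another option.
    """
    anti_speaker_dist = sum(cohort_dists) / len(cohort_dists)
    return speaker_dist - anti_speaker_dist

def accept(speaker_dist, cohort_dists, threshold=0.0):
    """Accept the claim when the utterance is closer to the speaker than to the cohort."""
    return normalised_distance(speaker_dist, cohort_dists) < threshold

cohort = select_cohort(["b", "c", "a"], 2)
genuine_like = accept(1.0, [2.0, 4.0])   # closer to the claimed speaker
impostor_like = accept(5.0, [2.0, 4.0])  # closer to the cohort
```

A negative normalised distance means the test utterance lies nearer the claimed speaker's model than the cohort average.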

    Further reductions in error rates may be achieved through the use of information complementary to the linear prediction cepstrum coefficients (LPCC). When the scores from a recogniser based on the LP residual are used in conjunction with those from an LPCC recogniser, the identification error rate drops.
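
One simple way to use two sets of scores together is a weighted sum per speaker. This is only a sketch under that assumption: the abstract does not state how the LPCC and LP residual scores are combined, and `fuse_and_identify` and its `weight` parameter are hypothetical. The scores are treated as distances on comparable scales.

```python
def fuse_and_identify(lpcc_scores, residual_scores, weight=0.5):
    """Fuse two distance scores per speaker and pick the smallest fused distance."""
    fused = {
        spk: weight * lpcc_scores[spk] + (1.0 - weight) * residual_scores[spk]
        for spk in lpcc_scores
    }
    best = min(fused, key=fused.get)
    return best, fused

# Made-up distances: LPCC alone would choose "a"; the residual evidence favours "b".
lpcc = {"a": 1.0, "b": 1.2}
residual = {"a": 3.0, "b": 1.0}
best, fused = fuse_and_identify(lpcc, residual)
```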

    This concept of combining results from recognisers which focus on complementary areas of the speech signal is further developed in an approach known as sub-band processing. The sub-band processing implemented for this work is new to the field of automatic speaker recognition. It focuses on different regions of the speech signal by splitting it into 16 sub-bands based on the mel-scale, each with its own dedicated recogniser. The individual sub-bands emphasise the spectral properties of the band-limited frequency ranges, and this is reflected in the make-up of the cepstral coefficients for each sub-band. This provides a more detailed model of the speaker than the wide-band approach, which uses a single model to cover all frequencies. The final score for a test utterance is determined by combining the scores from the different sub-bands. The sub-band processing approach made significant improvements on the error rates of the wide-band system.
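
The band layout can be sketched as follows. The sketch assumes the common mel formula m = 2595 log10(1 + f/700), contiguous non-overlapping bands, a 4000 Hz upper limit, and score combination by averaging; none of these details are given in the abstract, and the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Common mel-scale approximation (assumed; the thesis may use another)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_bands=16, f_low=0.0, f_high=4000.0):
    """Split [f_low, f_high] into bands of equal width on the mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / n_bands
    edges = [mel_to_hz(m_low + i * step) for i in range(n_bands + 1)]
    return list(zip(edges[:-1], edges[1:]))

def combine_subband_scores(scores):
    """Average the per-band recogniser scores (assumed combination rule)."""
    return sum(scores) / len(scores)

bands = mel_band_edges()
```

With this layout the low-frequency bands are narrow and the high-frequency bands wide, so each dedicated recogniser sees a different band-limited region of the spectrum.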
    Date of Award: Jul 1998
    Original language: English
