CMP-6026A Audio-visual processing

School of Computing Sciences
MODULE: CMP-6026A Audio-visual processing
DATE SET: Monday Week 6
RETURN DATE: Friday Week 12
SET BY: Prof Richard Harvey
CHECKED BY: Dr Ben Milner
This assignment is to design, implement and evaluate speaker-dependent visual-only and audio-visual
speech recognition systems. This coursework builds on the first coursework and should use the same
vocabulary of 20 names.
This assignment has three parts:
1. Visual speech recogniser – you will construct a lip-reading system (i.e. a visual speech recogniser)
using the vocabulary of names you developed in the first coursework of this module (see footnote 1).
Since you have already developed the architecture of a speech recogniser based around HTK, this new
task should amount to extracting features from a video and using those features instead of the audio
features.
2. Audio-visual speech recogniser – you will now construct an audio-visual speech recogniser. You
will need a set of audio features and a set of video features. The video features will be at a lower
sampling rate than the audio so you will need to upsample the video features. There are then two
ways to combine the features: early integration or late integration. In early integration the audio and
video features are joined together (to form a longer vector that is sometimes called a supervector).
These supervectors are then used to train and test a new HMM-based classifier. In late integration
you build two separate classifiers for the audio and video streams and you build some logic that is
able to pick the probable name given the list of possible outputs from each classifier.
3. Presentation – you will prepare a 15-minute presentation for Wednesday Week 12. Your
presentation must provide a brief introduction (bear in mind that other speakers may say similar
things), a description of the data collection, the design and justification of the visual feature
extraction, the design of the visual-only and audio-visual speech recognisers (making sure you
highlight methods of feature fusion), your evaluation and your conclusions. The presentation must
be delivered jointly.
1 The 20 names are: Alex, Alice, Brett, Ebtesam, George, Hao, Jack, James, Julien, Kevin, Luzhang,
Matt, Matthew, Max, Mazvydas, Peter, Rory, Ruari, Sevkan, Thomas
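The late-integration logic described in part 2 could be sketched as a weighted combination of the scores from the two classifiers. This is only one possible scheme, not a prescribed method. It assumes each classifier can report a log-likelihood for every candidate name; the function name `late_fusion` and the parameter `audio_weight` are hypothetical.

```python
def late_fusion(audio_scores, video_scores, audio_weight=0.7):
    """Pick the name with the highest weighted sum of per-stream log-likelihoods.

    audio_scores, video_scores: dicts mapping a name to the log-likelihood
    reported by the audio-only and visual-only classifier respectively.
    audio_weight: hypothetical tuning parameter in [0, 1]; the video stream
    receives the remaining weight (1 - audio_weight).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    names = set(audio_scores) & set(video_scores)
    if not names:
        raise ValueError("the two classifiers share no candidate names")

    def combined(name):
        return (audio_weight * audio_scores[name]
                + (1.0 - audio_weight) * video_scores[name])

    # Sorting first makes the result deterministic when two names tie.
    return max(sorted(names), key=combined)
```

Lowering `audio_weight` as the acoustic noise increases is one way to let the visual stream dominate when the audio becomes unreliable.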
To develop these systems you will need to consider the following aspects:
Data capture
If you recorded simultaneous audio-visual sequences for the first coursework then you can simply reuse
the acoustic features and labels from those exercises. Otherwise, you will need to re-record and label
instances of the student names so that you have visual features and acoustic features for audio-visual
recognition – refer to the exercise sheet for coursework one for the required names.
Feature extraction
Working in pairs (use the same partner from the previous coursework) you should first consider what
visual features you will use. Previously you were required to use MFCCs as these are the standard
features used for acoustic speech recognition. However, for visual speech (lip-reading) there is no real
agreement as to what form the visual speech features should take. These could be image-based features
(DCT is common, but equally PCA-based features could be used). Image-based features provide an
implicit visual speech feature by coding the image containing the mouth; they do not code the mouth
directly. Alternatively, the features might use higher-level knowledge and provide an explicit visual
speech feature by measuring properties of the position/shape of the speech articulators directly (e.g.
mouth width and height), or they might include both shape and appearance information.
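As an illustration of an image-based feature, the sketch below computes a 2-D DCT of a greyscale mouth image and keeps the lowest-frequency coefficients. It assumes the mouth region has already been cropped and is supplied as a list of rows of pixel intensities. The function names and the ordering of coefficients by u + v are illustrative choices, and a pure-Python DCT is only practical for small images.

```python
import math


def dct_basis(size):
    """Orthonormal DCT-II basis: basis[k][i] = s(k) * cos(pi * (2i + 1) * k / (2 * size))."""
    basis = []
    for k in range(size):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / size)
        basis.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * size))
                      for i in range(size)])
    return basis


def dct2(image):
    """2-D DCT-II of a rectangular greyscale image given as a list of rows."""
    rows, cols = len(image), len(image[0])
    row_basis, col_basis = dct_basis(rows), dct_basis(cols)
    # The DCT is separable: transform along each row, then along each column.
    temp = [[sum(image[r][c] * col_basis[v][c] for c in range(cols))
             for v in range(cols)] for r in range(rows)]
    return [[sum(row_basis[u][r] * temp[r][v] for r in range(rows))
             for v in range(cols)] for u in range(rows)]


def dct_features(image, count):
    """Keep the `count` lowest-frequency coefficients, ordered by u + v."""
    coeffs = dct2(image)
    order = sorted(((u, v) for u in range(len(coeffs))
                    for v in range(len(coeffs[0]))),
                   key=lambda uv: (uv[0] + uv[1], uv[0]))
    return [coeffs[u][v] for u, v in order[:count]]
```

Calling `dct_features(mouth_image, count)` once per video frame gives one feature vector per frame.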
Typically, both visual-only and audio-visual speech recognisers perform better if both shape and
appearance information are included. However, there is increased complexity in extracting shape as
specific feature points must be identified in each and every video frame. Furthermore, for lip-reading
to be practical there is a strong need to extract simple features. Marks will be awarded based on the
effectiveness of the feature extraction used and the completeness of the testing (e.g. comparing different
types of feature).
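A shape-based feature can be very simple once the feature points are known. The sketch below assumes four lip landmarks (left corner, right corner, top and bottom of the lips) have already been located in a frame as (x, y) pixel coordinates. Locating them reliably is the hard part and is not shown. The function name and the choice of width, height and their ratio are illustrative.

```python
import math


def mouth_shape_features(left, right, top, bottom):
    """Return [width, height, height / width] from four (x, y) lip landmarks."""
    width = math.dist(left, right)    # distance between the mouth corners
    height = math.dist(top, bottom)   # distance between upper and lower lip
    if width == 0:
        raise ValueError("mouth corners coincide; landmarks are degenerate")
    # The ratio does not change with the speaker's distance from the camera.
    return [width, height, height / width]
```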
Together with your partner you will need to decide on the number of visual feature dimensions that you
will use for training and testing your recognisers. You might consider which of the features are
perceptually significant (e.g. by visualising the features), or you might determine the optimal number
of features empirically using some form of objective measure of recogniser performance.
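Choosing the dimensionality empirically amounts to a sweep over candidate sizes. In the sketch below, `evaluate` is a hypothetical callback that you would supply yourself: it should train and test a recogniser with the given number of features and return its accuracy. The `tolerance` parameter is also an assumption, included so that a smaller feature set can be preferred when it performs almost as well.

```python
def choose_dimension(candidates, evaluate, tolerance=0.0):
    """Return (dimension, scores) for the smallest dimension near the best score.

    candidates: iterable of feature dimensions to try.
    evaluate:   callback mapping a dimension to a recogniser accuracy.
    tolerance:  how far below the best accuracy a smaller dimension may fall.
    """
    scores = {dim: evaluate(dim) for dim in candidates}
    if not scores:
        raise ValueError("no candidate dimensions supplied")
    best = max(scores.values())
    chosen = min(dim for dim, score in scores.items() if score >= best - tolerance)
    return chosen, scores
```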
Visual-only speech recognition
After extracting your visual features in MATLAB and writing to file in the appropriate HTK format,
you should measure the baseline performance of a speech recogniser using these visual features. That
is, you should build a visual-only speech recogniser trained and tested using only the visual features.
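If you extract features outside MATLAB, the HTK parameter file can be written by hand. The sketch below assumes the standard HTK layout: a 12-byte big-endian header (number of frames, frame period in 100 ns units, bytes per frame, parameter kind) followed by 32-bit floats, with the USER parameter kind (code 9) for user-defined features. The function names are illustrative.

```python
import struct

HTK_USER = 9  # HTK parameter-kind code for user-defined features


def htk_bytes(frames, frame_period_s):
    """Serialise a list of equal-length feature vectors as an HTK parameter file."""
    if not frames:
        raise ValueError("at least one frame is required")
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same dimension")
    samp_period = int(round(frame_period_s * 1e7))  # HTK counts in 100 ns units
    # Header: nSamples (int32), sampPeriod (int32), sampSize (int16), parmKind (int16).
    header = struct.pack(">iihh", len(frames), samp_period, 4 * dim, HTK_USER)
    body = b"".join(struct.pack(">%df" % dim, *frame) for frame in frames)
    return header + body


def write_htk(path, frames, frame_period_s):
    """Write the serialised features to `path`."""
    with open(path, "wb") as handle:
        handle.write(htk_bytes(frames, frame_period_s))
```

For video recorded at 25 frames per second (an assumed rate), the frame period would be 0.04 s.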
You may reuse many of the scripts developed for coursework one, but be aware that important files
(such as the HMM prototype) will need to be updated to account for the differences in the features.
You do not need to consider the performance of this recogniser as a function of noise since the visual
features from your video are not affected by acoustic noise.
Audio-visual speech recognition
Next you should integrate the acoustic and the visual information to build audio-visual speech
recognisers. To do this you will need to write out an additional set of files that contain both audio and
visual features. The data-rate for these two modalities is likely to be different, so to write the features
to the same data file for HTK, one or other of the features will need to be re-sampled. It is customary
to up-sample the video data to the acoustic data-rate rather than down-sample the acoustic data to match
the visual. This ensures that none of the original information is lost. You then need to decide if you will
consider early or late integration of the acoustic and the visual information and implement the
recognisers accordingly. You might also consider comparing the performance of your features for both
early and late integration to determine if one approach might be better than the other. You should
consider the performance of your audio-visual recogniser as a function of the acoustic noise (use the
same noise files as used in the previous coursework). You can then report the effective gain (or
otherwise) that arises from incorporating the visual information into your acoustic-only recogniser.
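Up-sampling and early integration can be sketched as below. Linear interpolation is only one possible up-sampling method. The sketch assumes the audio and video streams cover the same time span, so the video sequence is simply stretched to the number of audio frames. The function names are illustrative.

```python
def upsample_linear(frames, target_len):
    """Linearly interpolate a sequence of feature vectors to target_len frames."""
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one input frame and one output frame")
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for t in range(target_len):
        # Position of output frame t on the input time axis (end points align).
        pos = t * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(frames[lo], frames[hi])])
    return out


def early_fusion(audio_frames, video_frames):
    """Concatenate each audio vector with the time-aligned, up-sampled video vector."""
    video_up = upsample_linear(video_frames, len(audio_frames))
    return [list(a) + v for a, v in zip(audio_frames, video_up)]
```

The fused vectors can then be written to HTK files and used to train and test a recogniser in the same way as the single-stream features.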
This assignment has no written work – the sole form of our assessment of you will be via the
presentation. Note that an important part of the marking structure is our assessment of your credibility
and professionalism and it is worth discussing with us how this is best established.
Delivery of the assessment will be through a 15-minute oral presentation followed by 5 minutes for
questions. The presentation sessions will take the form of a mini-conference in Week 12. We will advise
on location and timings nearer to the date.
You will need to use audio/visual recording equipment/software, MATLAB, HTK, SFS, as used in the
lab classes. The use of MATLAB is not mandatory – it’s just a good idea.
Marking scheme
Marks will be allocated as follows:
• Background and introduction (10%)
• Visual feature extraction (35%)
• Evaluation of classifiers (35%)
• Quality of visual materials and presentation skills (10%)
• Question answering (10%)