Abstract
Relating linguistic information conveyed through visual and audio data is a crucial
aspect of audio-visual speech recognition (AVSR), with applications in audio-visual
correspondence tasks such as those addressed by AVE-Net and SyncNet. The technique described
in this paper uses feature disentanglement to handle these tasks simultaneously. By developing
cross-modal shared-representation learning methods, the model transforms visual or auditory
linguistic features into modality-independent representations. Correspondence tasks of the kind
performed by AVE-Net and SyncNet can then be carried out on these derived linguistic
representations. Furthermore, audio and visual outputs can be synthesized conditioned on the
desired subject identity and linguistic content. We conduct comprehensive experiments on
recognition and synthesis tasks separately and show that the proposed solution successfully
handles both audio-visual learning problems. The system achieves 91.5% accuracy on enhanced
video with 5 input frames, rising to 99.03% with 15 frames, outperforming previous methods.
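The abstract does not specify the disentanglement networks themselves; as a rough illustration only, the sketch below shows one common way to learn modality-independent linguistic embeddings: an audio encoder and a visual encoder trained with a symmetric contrastive objective in the spirit of SyncNet-style correspondence learning. All class and function names here are hypothetical, not the paper's.

```python
# Minimal sketch (assumed architecture, not the paper's): align audio and
# visual linguistic features in a shared, modality-independent embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Maps a log-mel spectrogram clip to a unit-norm linguistic embedding."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):  # x: (B, 1, mel_bins, time)
        return F.normalize(self.net(x), dim=-1)

class VisualEncoder(nn.Module):
    """Maps a stack of mouth-region frames to a unit-norm linguistic embedding."""
    def __init__(self, frames: int = 5, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(frames, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):  # x: (B, frames, H, W)
        return F.normalize(self.net(x), dim=-1)

def contrastive_alignment_loss(a, v, temperature: float = 0.07):
    """Symmetric InfoNCE: matching audio/visual pairs attract, others repel."""
    logits = a @ v.t() / temperature            # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: a batch of 8 paired 5-frame mouth crops and spectrogram excerpts.
audio = torch.randn(8, 1, 80, 20)
video = torch.randn(8, 5, 64, 64)
loss = contrastive_alignment_loss(AudioEncoder()(audio), VisualEncoder()(video))
loss.backward()
```

Once such embeddings are aligned, correspondence tasks of the AVE-Net/SyncNet kind reduce to nearest-neighbour matching in the shared space, which is consistent with the abstract's claim that the derived linguistic representations support those tasks directly.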
Keywords
AVE-Net
AVSR
CNNs
Deep learning
SyncNet