Video-to-audio models...?

Or what about video-to-closed-captions...?