A Practical Multimodal Fusion System With Uncertainty Modeling for Robust Visual and Affective Applications