Leveraging Multi-Modality and Enhanced Temporal Networks for Robust Violence Detection