Text this: Multimodal anomaly detection in complex environments using video and audio fusion