Video Temporal Grounding with Multi-Model Collaborative Learning