DI-VTR: Dual inter-modal interaction model for video-text retrieval