Visual Automatic Localization Method Based on Multi-level Video Transformer