Text this: Text-Guided Visual Representation Optimization for Sensor-Acquired Video Temporal Grounding