Audio-visual event localization with dual temporal-aware scene understanding and image-text knowledge bridging