Text this: Open-Vocabulary Action Localization With Iterative Visual Prompting