Action Recognition by Joint Spatial-Temporal Motion Feature