Efficient spatio-temporal modeling for sign language recognition using CNN and RNN architectures