MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention