Thangka image captioning model with Salient Attention and Local Interaction Aggregator