Vision Foundation Model Guided Multimodal Fusion Network for Remote Sensing Semantic Segmentation