Construction of a multi-modal digital human education platform based on GAN and vision transformer