Text Embedding Augmentation Based on Retraining With Pseudo-Labeled Adversarial Embedding


Bibliographic Details
Main Authors: Myeongsup Kim, Pilsung Kang
Format: Article
Language: English
Published: IEEE, 2022-01-01
Series: IEEE Access
Subjects: Text embedding augmentation; adversarial training; pseudo-label; generating; retraining
Online Access:https://ieeexplore.ieee.org/document/9680703/
author Myeongsup Kim
Pilsung Kang
collection DOAJ
description Pre-trained language models (LMs) have been shown to achieve outstanding performance on various natural language processing tasks; however, these models contain a very large number of parameters in order to handle large-scale text corpora during pre-training, and thus they risk overfitting when fine-tuned on small task-oriented datasets. In this paper, we propose a text embedding augmentation method to prevent such overfitting. The proposed method augments a text embedding by generating an adversarial embedding that is not identical to the original input embedding but maintains its characteristics, using PGD-based adversarial training on the input text data. A pseudo-label identical to the label of the input text is then assigned to the adversarial embedding, and a separate LM is retrained using the resulting (adversarial embedding, pseudo-label) pair as its input embedding and label. Experimental results on several text classification benchmark datasets demonstrate that the proposed method effectively prevents the overfitting that commonly occurs when adapting a large-scale pre-trained LM to a specific task.
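The description outlines two steps: (1) generate an adversarial embedding via PGD that stays close to the original input embedding, and (2) assign it the original label as a pseudo-label and retrain a separate LM on the resulting pair. The following is a minimal PyTorch sketch of that procedure, not the authors' implementation; embed_fn, classifier, epsilon, alpha, and num_steps are illustrative placeholders rather than values from the paper.

    import torch
    import torch.nn.functional as F

    def pgd_adversarial_embedding(embed_fn, classifier, input_ids, labels,
                                  epsilon=1e-2, alpha=1e-3, num_steps=5):
        # Original input embedding; detached so PGD perturbs only delta.
        emb = embed_fn(input_ids).detach()
        delta = torch.zeros_like(emb, requires_grad=True)

        for _ in range(num_steps):
            logits = classifier(emb + delta)
            loss = F.cross_entropy(logits, labels)
            # Gradient of the loss w.r.t. the perturbation only.
            grad, = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                # Ascend the loss along the normalized gradient direction.
                delta += alpha * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
                # Project back onto the L2 ball of radius epsilon, so the
                # adversarial embedding differs from the original input
                # embedding but retains its characteristics.
                norm = delta.norm(dim=-1, keepdim=True)
                delta *= torch.clamp(norm, max=epsilon) / (norm + 1e-12)

        adv_emb = (emb + delta).detach()
        pseudo_labels = labels.clone()  # pseudo-label = label of the input text
        return adv_emb, pseudo_labels

The returned pair would then serve as the (input embedding, label) training example for a separate LM, e.g. loss = F.cross_entropy(separate_lm(adv_emb), pseudo_labels), where separate_lm is likewise a placeholder name.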
format Article
id doaj-art-17f5658c59bf40d7866dfff5dcb81b25
institution OA Journals
issn 2169-3536
language English
publishDate 2022-01-01
publisher IEEE
series IEEE Access
spelling IEEE Access, vol. 10, pp. 8363–8376, 2022-01-01. DOI: 10.1109/ACCESS.2022.3142843; IEEE document 9680703.
Myeongsup Kim (https://orcid.org/0000-0002-2495-9094) and Pilsung Kang (https://orcid.org/0000-0001-7663-3937), School of Industrial & Management Engineering, Korea University, Seoul, Republic of Korea.
topic Text embedding augmentation
adversarial training
pseudo-label
generating
retraining
url https://ieeexplore.ieee.org/document/9680703/