Learning the Style via Mixed SN-Grams: An Evaluation in Authorship Attribution
| Main Authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | AI |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2673-2688/6/5/104 |
| Summary: | This study addresses the problem of authorship attribution with a novel method for modeling writing style based on parsing subtrees of dependency trees. The method exploits the syntactic information of sentences using mixed syntactic *n*-grams (mixed sn-grams). It comprises an algorithm that generates mixed sn-grams by integrating words, POS tags, and dependency relation tags. The mixed sn-grams serve as style markers that feed machine learning methods such as an SVM. A comparative analysis evaluated the proposed mixed sn-grams against homogeneous sn-grams on the PAN-CLEF 2012 and CCAT50 datasets. On PAN-CLEF 2012, mixed sn-grams outperformed homogeneous sn-grams, showing their potential to model a writing style. On CCAT50, training with mixed sn-grams likewise improved accuracy over homogeneous sn-grams, with the POS-Word category giving the best result. The results suggest that mixed sn-grams are effective stylistic markers for building a reliable writing style model that machine learning algorithms can learn. |
|---|---|
| ISSN: | 2673-2688 |
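
The summary describes generating mixed sn-grams by combining words, POS tags, and dependency relation tags. The sketch below is one possible reading of that idea, not the article's algorithm: it assumes an sn-gram is a head-to-dependent path in the dependency tree, and that a "POS-Word" mix labels the first node with its POS tag and the remaining nodes with their words. The `Token` type, the toy parse, and the function names are all hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    dep: str
    head: int  # index of the head token; -1 marks the root


# Hand-annotated toy parse of "The cat chased a mouse" (labels are assumptions).
SENTENCE = [
    Token("The", "DET", "det", 1),
    Token("cat", "NOUN", "nsubj", 2),
    Token("chased", "VERB", "root", -1),
    Token("a", "DET", "det", 4),
    Token("mouse", "NOUN", "obj", 2),
]


def children(tokens):
    """Map each token index to the indices of its direct dependents."""
    kids = {i: [] for i in range(len(tokens))}
    for i, tok in enumerate(tokens):
        if tok.head >= 0:
            kids[tok.head].append(i)
    return kids


def paths(tokens, n):
    """Yield every head-to-dependent path made of exactly n token indices."""
    kids = children(tokens)

    def extend(path):
        if len(path) == n:
            yield tuple(path)
            return
        for child in kids[path[-1]]:
            yield from extend(path + [child])

    for start in range(len(tokens)):
        yield from extend([start])


def mixed_sn_grams(tokens, n, head_type, tail_type):
    """Label the first node of each path with head_type, the rest with tail_type."""
    grams = []
    for path in paths(tokens, n):
        labels = [getattr(tokens[path[0]], head_type)]
        labels += [getattr(tokens[i], tail_type) for i in path[1:]]
        grams.append(tuple(labels))
    return grams


# Count POS-Word bigrams; such counts could serve as a feature vector.
features = Counter(mixed_sn_grams(SENTENCE, 2, "pos", "word"))
```

Passing the same field for `head_type` and `tail_type` (for example `"word"` twice) gives a homogeneous sn-gram under the same assumptions, which is the baseline the summary compares against. The resulting counts could then be vectorized per document and passed to a classifier such as an SVM.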