Grammar or Crammer? The Role of Morphology in Distinguishing Orthographically Similar but Semantically Unrelated Words
We show that n-gram-based distributional models fail to distinguish unrelated words due to the noise in semantic spaces. This issue remains hidden in conventional benchmarks but becomes more pronounced when orthographic similarity is high. To highlight this problem, we introduce OSimUnr, a dataset o...
| Main Authors: | Gokhan Ercan, Olcay Taner Yildiz |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
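| Subjects: | Derivational morphology; distributional semantic modeling; language resource; morphological segmentation; orthographic similarity; word-relatedness |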
| Online Access: | https://ieeexplore.ieee.org/document/10947740/ |
| _version_ | 1850179614590107648 |
|---|---|
| author | Gokhan Ercan; Olcay Taner Yildiz |
| author_facet | Gokhan Ercan; Olcay Taner Yildiz |
| author_sort | Gokhan Ercan |
| collection | DOAJ |
| description | We show that n-gram-based distributional models fail to distinguish unrelated words due to the noise in semantic spaces. This issue remains hidden in conventional benchmarks but becomes more pronounced when orthographic similarity is high. To highlight this problem, we introduce OSimUnr, a dataset of nearly one million English and Turkish word-pairs that are orthographically similar but semantically unrelated (e.g., grammar – crammer). These pairs are generated through a graph-based WordNet approach and morphological resources. We define two evaluation tasks, unrelatedness identification and relatedness classification, to test semantic models. Our experiments reveal that FastText, with default n-gram segmentation, performs poorly (below 5% accuracy) in identifying unrelated words. However, morphological segmentation overcomes this issue, boosting accuracy to 68% (English) and 71% (Turkish) without compromising performance on standard benchmarks (RareWords, MTurk771, MEN, AnlamVer). Furthermore, our results suggest that even state-of-the-art LLMs, including Llama 3.3 and GPT-4o-mini, may exhibit noise in their semantic spaces, particularly in highly synthetic languages such as Turkish. To ensure dataset quality, we leverage WordNet, MorphoLex, and NLTK, covering fully derivational morphology supporting atomic roots (e.g., ‘-co_here+ance+y’ for ‘coherency’), with 405 affixes in Turkish and 467 in English. |
| format | Article |
| id | doaj-art-48817d447b0148adb5ff9f13fb6d0db1 |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-48817d447b0148adb5ff9f13fb6d0db1; 2025-08-20T02:18:27Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; vol. 13; pp. 64412-64458; 10.1109/ACCESS.2025.3557086; 10947740; Grammar or Crammer? The Role of Morphology in Distinguishing Orthographically Similar but Semantically Unrelated Words; Gokhan Ercan (https://orcid.org/0000-0002-2782-8217), Department of Computer Engineering, Işık University, İstanbul, Türkiye; Olcay Taner Yildiz (https://orcid.org/0000-0001-5838-4615), Department of Computer Engineering, Özyeğin University, İstanbul, Türkiye; https://ieeexplore.ieee.org/document/10947740/; Derivational morphology; distributional semantic modeling; language resource; morphological segmentation; orthographic similarity; word-relatedness |
| spellingShingle | Gokhan Ercan; Olcay Taner Yildiz; Grammar or Crammer? The Role of Morphology in Distinguishing Orthographically Similar but Semantically Unrelated Words; IEEE Access; Derivational morphology; distributional semantic modeling; language resource; morphological segmentation; orthographic similarity; word-relatedness |
| title | Grammar or Crammer? The Role of Morphology in Distinguishing Orthographically Similar but Semantically Unrelated Words |
| title_sort | grammar or crammer the role of morphology in distinguishing orthographically similar but semantically unrelated words |
| topic | Derivational morphology; distributional semantic modeling; language resource; morphological segmentation; orthographic similarity; word-relatedness |
| url | https://ieeexplore.ieee.org/document/10947740/ |
| work_keys_str_mv | AT gokhanercan grammarorcrammertheroleofmorphologyindistinguishingorthographicallysimilarbutsemanticallyunrelatedwords AT olcaytaneryildiz grammarorcrammertheroleofmorphologyindistinguishingorthographicallysimilarbutsemanticallyunrelatedwords |
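
The contrast the description draws between n-gram and morphological segmentation can be illustrated with a minimal sketch. It assumes FastText-style character n-grams of length 3 to 6 over a word wrapped in `<` and `>` boundary markers, which are the library's documented defaults. The helper names (`char_ngrams`, `HYPOTHETICAL_MORPHEMES`) and the hand-written morpheme splits are illustrative assumptions only; they are not taken from the article, from OSimUnr, or from MorphoLex.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of '<word>' (FastText-style subword units)."""
    marked = f"<{word}>"  # boundary markers, as FastText adds around each word
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


# Hypothetical, hand-written morpheme splits used only for illustration;
# a real pipeline would look these up in a morphological resource.
HYPOTHETICAL_MORPHEMES = {
    "grammar": {"grammar"},
    "crammer": {"cram", "er"},
}

ngrams_a = char_ngrams("grammar")
ngrams_b = char_ngrams("crammer")
shared_ngrams = ngrams_a & ngrams_b

shared_morphemes = HYPOTHETICAL_MORPHEMES["grammar"] & HYPOTHETICAL_MORPHEMES["crammer"]

print(sorted(shared_ngrams))     # → ['amm', 'ram', 'ramm']
print(sorted(shared_morphemes))  # → []
```

Under these assumptions the two unrelated words share three meaningless character n-grams, each of which would contribute the same subword vector to both word representations, while the morpheme-level units share nothing.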