Grammar or Crammer? The Role of Morphology in Distinguishing Orthographically Similar but Semantically Unrelated Words

We show that n-gram-based distributional models fail to distinguish unrelated words due to the noise in semantic spaces. This issue remains hidden in conventional benchmarks but becomes more pronounced when orthographic similarity is high. To highlight this problem, we introduce OSimUnr, a dataset o...

Bibliographic Details
Main Authors: Gokhan Ercan, Olcay Taner Yildiz
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
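Subjects: Derivational morphology; distributional semantic modeling; language resource; morphological segmentation; orthographic similarity; word-relatedness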
Online Access: https://ieeexplore.ieee.org/document/10947740/
author Gokhan Ercan
Olcay Taner Yildiz
collection DOAJ
description We show that n-gram-based distributional models fail to distinguish unrelated words due to the noise in semantic spaces. This issue remains hidden in conventional benchmarks but becomes more pronounced when orthographic similarity is high. To highlight this problem, we introduce OSimUnr, a dataset of nearly one million English and Turkish word-pairs that are orthographically similar but semantically unrelated (e.g., grammar – crammer). These pairs are generated through a graph-based WordNet approach and morphological resources. We define two evaluation tasks, unrelatedness identification and relatedness classification, to test semantic models. Our experiments reveal that FastText, with default n-gram segmentation, performs poorly (below 5% accuracy) in identifying unrelated words. However, morphological segmentation overcomes this issue, boosting accuracy to 68% (English) and 71% (Turkish) without compromising performance on standard benchmarks (RareWords, MTurk771, MEN, AnlamVer). Furthermore, our results suggest that even state-of-the-art LLMs, including Llama 3.3 and GPT-4o-mini, may exhibit noise in their semantic spaces, particularly in highly synthetic languages such as Turkish. To ensure dataset quality, we leverage WordNet, MorphoLex, and NLTK, covering fully derivational morphology supporting atomic roots (e.g., ‘-co_here+ance+y’ for ‘coherency’), with 405 affixes in Turkish and 467 in English.
format Article
id doaj-art-48817d447b0148adb5ff9f13fb6d0db1
institution OA Journals
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-48817d447b0148adb5ff9f13fb6d0db1 2025-08-20T02:18:27Z eng
IEEE, IEEE Access, ISSN 2169-3536, 2025-01-01, vol. 13, pp. 64412-64458
DOI 10.1109/ACCESS.2025.3557086, IEEE document 10947740
Grammar or Crammer? The Role of Morphology in Distinguishing Orthographically Similar but Semantically Unrelated Words
Gokhan Ercan (https://orcid.org/0000-0002-2782-8217), Department of Computer Engineering, Işık University, İstanbul, Türkiye
Olcay Taner Yildiz (https://orcid.org/0000-0001-5838-4615), Department of Computer Engineering, Özyeğin University, İstanbul, Türkiye
title Grammar or Crammer? The Role of Morphology in Distinguishing Orthographically Similar but Semantically Unrelated Words
topic Derivational morphology
distributional semantic modeling
language resource
morphological segmentation
orthographic similarity
word-relatedness
url https://ieeexplore.ieee.org/document/10947740/