iSentenizer-μ: Multilingual Sentence Boundary Detection Model
Sentence boundary detection (SBD) systems are normally quite sensitive to the genre of the data on which they are trained. Genre here usually refers to shifts in text topic and to new language domains. Although new detection models can be retrained for different languages or new text genre...
| Main Authors: | Derek F. Wong, Lidia S. Chao, Xiaodong Zeng |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Wiley, 2014-01-01 |
| Series: | The Scientific World Journal |
| Online Access: | http://dx.doi.org/10.1155/2014/196574 |
| _version_ | 1849690128441671680 |
|---|---|
| author | Derek F. Wong; Lidia S. Chao; Xiaodong Zeng |
| author_facet | Derek F. Wong; Lidia S. Chao; Xiaodong Zeng |
| author_sort | Derek F. Wong |
| collection | DOAJ |
| description | Sentence boundary detection (SBD) systems are normally quite sensitive to the genre of the data on which they are trained. Genre here usually refers to shifts in text topic and to new language domains. Although a new detection model can be retrained for a different language or a new text genre, the previous model has to be thrown away and the creation process restarted from scratch. In this paper, we present a multilingual sentence boundary detection system (iSentenizer-μ) for Danish, German, English, Spanish, Dutch, French, Italian, Portuguese, Greek, Finnish, and Swedish. The proposed system is able to detect sentence boundaries in a mixture of text genres and languages with high accuracy. We employ the i+Learning algorithm, an incremental tree learning architecture, to construct the system. Under the incremental learning framework, iSentenizer-μ adapts to text on different topics and in Roman-alphabet languages by merging new data into the existing model, learning the new knowledge incrementally by revision instead of retraining. The system has been extensively evaluated on different languages and text genres and compared against two state-of-the-art SBD systems, Punkt and MaxEnt. The experimental results show that the proposed system outperforms the other systems on all datasets. |
| format | Article |
| id | doaj-art-ec30fc29082c4e629ac181291534d00b |
| institution | DOAJ |
| issn | 2356-6140 1537-744X |
| language | English |
| publishDate | 2014-01-01 |
| publisher | Wiley |
| record_format | Article |
| series | The Scientific World Journal |
| spelling | doaj-art-ec30fc29082c4e629ac181291534d00b; 2025-08-20T03:21:24Z; eng; Wiley; The Scientific World Journal; 2356-6140; 1537-744X; 2014-01-01; 2014; 10.1155/2014/196574; 196574; iSentenizer-μ: Multilingual Sentence Boundary Detection Model; Derek F. Wong, Lidia S. Chao, Xiaodong Zeng (all: NLP2CT Laboratory, Department of Computer and Information Science, University of Macau, Macau); http://dx.doi.org/10.1155/2014/196574 |
| spellingShingle | Derek F. Wong; Lidia S. Chao; Xiaodong Zeng; iSentenizer-μ: Multilingual Sentence Boundary Detection Model; The Scientific World Journal |
| title | iSentenizer-μ: Multilingual Sentence Boundary Detection Model |
| title_sort | isentenizer μ multilingual sentence boundary detection model |
| url | http://dx.doi.org/10.1155/2014/196574 |
| work_keys_str_mv | AT derekfwong isentenizermmultilingualsentenceboundarydetectionmodel AT lidiaschao isentenizermmultilingualsentenceboundarydetectionmodel AT xiaodongzeng isentenizermmultilingualsentenceboundarydetectionmodel |
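
The description's central idea, revising an existing model with new data instead of retraining from scratch, can be illustrated with a minimal sketch. This is not the i+Learning algorithm or the iSentenizer-μ implementation, whose details the record does not give; it swaps in a toy count-based classifier. The class name `IncrementalBoundaryModel`, its methods, and the sample tokens are all hypothetical.

```python
from collections import defaultdict


class IncrementalBoundaryModel:
    """Toy stand-in (NOT i+Learning): counts, per token, how often a
    following period did or did not end a sentence."""

    def __init__(self):
        # counts[token] = [times_not_boundary, times_boundary]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, examples):
        """Merge new labelled (token, is_boundary) pairs into the
        existing counts; earlier knowledge is kept, not discarded."""
        for token, is_boundary in examples:
            self.counts[token.lower()][int(is_boundary)] += 1

    def is_boundary(self, token):
        """Predict whether a period after `token` ends a sentence.
        Unseen tokens default to 'boundary' (an assumption of this sketch)."""
        not_b, b = self.counts.get(token.lower(), (0, 1))
        return b >= not_b


# Initial data: hypothetical English examples.
model = IncrementalBoundaryModel()
model.update([("Dr", False), ("Dr", False), ("home", True)])

# Later, hypothetical German examples are merged into the same model.
model.update([("bzw", False), ("Haus", True)])

print(model.is_boundary("Dr"))    # English knowledge is still present
print(model.is_boundary("bzw"))   # newly merged German knowledge
```

The second `update` call only adds to the stored counts, so the English evidence survives when German data arrives. A tree-based incremental learner would revise its structure instead, but the contrast with full retraining is the same.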