Innovative Approaches to Coreference Resolution in Sindhi: Addressing Morphological Complexity and Linguistic Nuances
Volume 5 Issue 2 2024 | |
---|---|
Author(s): |
Saira Baby Farooqui*, Shah Abdul Latif University, Khairpur, Pakistan, sairafarooqui@sau.edu.pk; Noor Ahmed Shaikh, Shah Abdul Latif University, Khairpur, Pakistan, noor_salu@yahoo.com; Samina Rajper, Shah Abdul Latif University, Khairpur, Pakistan, samina.rajper@gmail.com |
Abstract | This study introduces a tailored coreference resolution framework for the Sindhi language, addressing challenges specific to Sindhi’s linguistic structure, such as gender agreement, postpositions, and complex pronominal forms. The research aims to narrow the gap in natural language processing (NLP) resources for Sindhi, an under-resourced Indo-Aryan language. It uses a multi-step framework comprising tokenization, acronym expansion, short vowel restoration, and part-of-speech (POS) tagging, followed by a coreference resolution mechanism adapted to Sindhi syntax and morphology. On a curated corpus annotated with Sindhi-specific features, the framework achieved an F1-score of 80%, outperforming baseline and general-purpose coreference models adapted from English. This performance is attributed to the integration of linguistic rules and socio-pragmatic factors (e.g., honorifics and gender constancy), which are crucial for accurate coreference linking in Sindhi. The novelty of the framework lies in its combination of rule-based techniques with machine learning methods, an approach that can be extended to other low-resource languages with similar linguistic characteristics. The framework advances coreference resolution for Sindhi and can improve the accuracy of NLP applications such as information extraction, sentiment analysis, and machine translation. Future work will focus on expanding the dataset, refining Sindhi-specific embeddings, and evaluating the model in practical applications, paving the way for further developments in NLP for under-resourced languages. |
Keywords | Sindhi Language Processing, NLP, Coreference Resolution, Machine Learning |
Year | 2024 |
Volume | 5 |
Issue | 2 |
Type | Research paper, manuscript, article |
Recognized by | Higher Education Commission of Pakistan (HEC) |
Category | |
Journal Name | ILMA Journal of Technology & Software Management |
Publisher Name | ILMA University |
Jel Classification | -- |
DOI | - |
ISSN no (E, Electronic) | 2790-590X |
ISSN no (P, Print) | 2709-2240 |
Country | Pakistan |
City | Karachi |
Institution Type | University |
Journal Type | Open Access |
Manuscript Processing | Blind Peer Reviewed |
Format | |
Paper Link | https://ijtsm.ilmauniversity.edu.pk/arc/Vol5/i2/pdf3.pdf |
Page | 26-34 |
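
The abstract describes rule-based linking that relies on gender agreement and honorifics. As a rough illustration only, and not the authors' implementation, the following Python sketch shows how such an agreement filter might select an antecedent for a pronoun. The `Mention` class, its fields, the `agrees` and `resolve` functions, and the rule that an honorific referent may take plural agreement are all assumptions made for this example; the mentions use English glosses rather than real Sindhi data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Mention:
    """Hypothetical mention record; a real system would fill it from POS tags."""
    text: str
    index: int               # token position in the document
    gender: str              # "m" or "f"
    number: str              # "sg" or "pl"
    honorific: bool = False  # referent is addressed with respect
    is_pronoun: bool = False


def agrees(pronoun: Mention, candidate: Mention) -> bool:
    """Return True if the candidate is a compatible antecedent for the pronoun."""
    if pronoun.gender != candidate.gender:
        return False
    # Assumption: an honorific singular referent may be picked up by a
    # plural pronoun, so number is not checked for honorific candidates.
    if candidate.honorific:
        return True
    return pronoun.number == candidate.number


def resolve(mentions: List[Mention]) -> Dict[int, Optional[int]]:
    """Link each pronoun to the nearest preceding non-pronoun that agrees."""
    links: Dict[int, Optional[int]] = {}
    for i, mention in enumerate(mentions):
        if not mention.is_pronoun:
            continue
        links[mention.index] = None
        # Search backwards so that the closest compatible mention wins.
        for candidate in reversed(mentions[:i]):
            if not candidate.is_pronoun and agrees(mention, candidate):
                links[mention.index] = candidate.index
                break
    return links


if __name__ == "__main__":
    doc = [
        Mention("Ali", 0, "m", "sg"),
        Mention("Sara", 2, "f", "sg"),
        Mention("she", 5, "f", "sg", is_pronoun=True),
        Mention("he", 8, "m", "sg", is_pronoun=True),
    ]
    print(resolve(doc))
```

A nearest-antecedent heuristic like this is only a baseline; the framework described in the abstract combines such rules with machine learning methods, which this sketch does not attempt to reproduce.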