Innovative Approaches to Coreference Resolution in Sindhi: Addressing Morphological Complexity and Linguistic Nuances

Volume 5 Issue 2 2024

Author(s):

Saira Baby Farooqui* Shah Abdul Latif University, Khairpur, Pakistan, sairafarooqui@sau.edu.pk

Noor Ahmed Shaikh Shah Abdul Latif University, Khairpur, Pakistan, noor_salu@yahoo.com

Samina Rajper Shah Abdul Latif University, Khairpur, Pakistan, samina.rajper@gmail.com

Abstract This study introduces a coreference resolution framework tailored to the Sindhi language, addressing challenges specific to Sindhi’s linguistic structure, such as gender agreement, postpositions, and complex pronominal forms. The research aims to narrow the natural language processing (NLP) resource gap for Sindhi, an under-resourced Indo-Aryan language. The framework is a multi-step pipeline comprising tokenization, acronym expansion, short vowel restoration, and part-of-speech (POS) tagging, followed by a coreference resolution mechanism adapted to Sindhi syntax and morphology. On a curated corpus annotated with Sindhi-specific features, the framework achieved an F1-score of 80%, outperforming baseline and general-purpose coreference models adapted from English. This performance is attributed to the integration of linguistic rules and socio-pragmatic factors (e.g., honorifics and gender constancy), which are crucial for accurate coreference linking in Sindhi. The novelty of the framework lies in its combination of rule-based techniques with machine learning methods, an approach that can be extended to other low-resource languages with similar linguistic characteristics. The framework is a step forward in coreference resolution for Sindhi and can improve the accuracy of NLP applications such as information extraction, sentiment analysis, and machine translation. Future work will focus on expanding the dataset, refining Sindhi-specific embeddings, and evaluating the model in practical applications, supporting further NLP development for under-resourced languages.
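The stage order named in the abstract (tokenization, acronym expansion, short vowel restoration, POS tagging, then coreference linking) can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicons, the English stand-in words, the identity placeholder for vowel restoration, and the nearest-gender-match linking rule are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    pos: str = "X"                    # part-of-speech tag
    gender: Optional[str] = None      # "m", "f", or None when unknown
    antecedent: Optional[int] = None  # index of the linked earlier token


# Hypothetical toy lexicons with English stand-ins; a real system would
# rely on Sindhi-specific resources instead.
ACRONYMS = {"dr": "doctor"}
LEXICON = {
    "doctor": ("NOUN", None),
    "sara": ("PROPN", "f"),
    "ali": ("PROPN", "m"),
    "she": ("PRON", "f"),
    "he": ("PRON", "m"),
}


def tokenize(text: str) -> list[Token]:
    # Whitespace tokenization, lowercased so lexicon lookups match.
    return [Token(word) for word in text.lower().split()]


def expand_acronyms(tokens: list[Token]) -> list[Token]:
    for tok in tokens:
        tok.text = ACRONYMS.get(tok.text, tok.text)
    return tokens


def restore_short_vowels(tokens: list[Token]) -> list[Token]:
    # Placeholder (assumption): returns tokens unchanged. Restoring
    # omitted short vowels needs language-specific rules or a model.
    return tokens


def pos_tag(tokens: list[Token]) -> list[Token]:
    # Lexicon lookup; unknown words keep the tag "X" and no gender.
    for tok in tokens:
        tok.pos, tok.gender = LEXICON.get(tok.text, ("X", None))
    return tokens


def resolve_coreference(tokens: list[Token]) -> list[Token]:
    # Toy rule: link each pronoun to the nearest preceding noun or
    # proper noun whose gender matches the pronoun's gender.
    for i, tok in enumerate(tokens):
        if tok.pos != "PRON":
            continue
        for j in range(i - 1, -1, -1):
            cand = tokens[j]
            if cand.pos in ("NOUN", "PROPN") and cand.gender == tok.gender:
                tok.antecedent = j
                break
    return tokens


STAGES = [expand_acronyms, restore_short_vowels, pos_tag, resolve_coreference]


def run_pipeline(text: str) -> list[Token]:
    tokens = tokenize(text)
    for stage in STAGES:
        tokens = stage(tokens)
    return tokens
```

Calling `run_pipeline("sara met ali and she thanked dr ali")` links the pronoun at index 4 to the token at index 0, because the nearer candidate fails the gender check. Keeping each stage as a separate function mirrors the process-driven design the abstract describes and lets any single stage be replaced by a learned component.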
Keywords Sindhi Language Processing, NLP, Coreference Resolution, Machine Learning
Year 2024
Volume 5
Issue 2
Type Research paper, manuscript, article
Recognized by Higher Education Commission of Pakistan, HEC
Category
Journal Name ILMA Journal of Technology & Software Management
Publisher Name ILMA University
Jel Classification --
DOI -
ISSN no (E, Electronic) 2790-590X
ISSN no (P, Print) 2709-2240
Country Pakistan
City Karachi
Institution Type University
Journal Type Open Access
Manuscript Processing Blind Peer Reviewed
Format PDF
Paper Link https://ijtsm.ilmauniversity.edu.pk/arc/Vol5/i2/pdf3.pdf
Page 26-34