A Grammatically Annotated Corpus for Sana’ani Arabic Dialect
Abstract
In this paper, we introduce a new resource for the Sana’ani Arabic dialect: a grammatically tagged corpus of social media texts, developed primarily as training data for a Sana’ani Arabic part-of-speech (POS) tagger. The corpus consists of 7,295 tokenized sentences, averaging about 15 tokens per sentence, for a total of 112,517 tokens and 15,940 types. It is manually annotated using a modified version of the Bies tagset, which covers 24 tags. The manual annotation is grammatical rather than morphological: it ignores morphological inflections and concentrates on syntactic features, using context to identify the part of speech of each token.
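As an illustration of how the summary figures above (sentences, tokens, types, average sentence length) relate to one another, the following minimal sketch computes them for a tagged corpus. The input format (a list of sentences, each a list of token/tag pairs), the function name corpus_stats, and the toy data are assumptions made for this example only; they do not describe the corpus's actual file format or tooling.

```python
from collections import Counter


def corpus_stats(sentences):
    """Compute basic statistics for a POS-tagged corpus.

    `sentences` is assumed to be a list of sentences, each a list of
    (token, tag) pairs. This layout is hypothetical, not the corpus format.
    """
    tokens = [tok for sent in sentences for tok, _tag in sent]
    tag_counts = Counter(tag for sent in sentences for _tok, tag in sent)
    n_sent = len(sentences)
    return {
        "sentences": n_sent,
        "tokens": len(tokens),            # running words
        "types": len(set(tokens)),        # distinct word forms
        "avg_tokens_per_sentence": len(tokens) / n_sent if n_sent else 0.0,
        "tags": tag_counts,               # frequency of each POS tag
    }


# Toy data with placeholder tokens and placeholder tags (not from the corpus).
toy = [
    [("w1", "NN"), ("w2", "VBD"), ("w1", "NN")],
    [("w3", "IN"), ("w2", "VBD")],
]
stats = corpus_stats(toy)
```

Applied to the reported totals, the same ratio gives 112,517 / 7,295 ≈ 15.4 tokens per sentence, which is consistent with the stated average of about 15.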