A Grammatically Annotated Corpus for Sana’ani Arabic Dialect

Authors

  • Sabah Al-Shehabi
  • Mohammed Sharaf-Addin

Abstract

In this paper, we introduce a new resource for the Sana’ani Arabic dialect: a grammatically tagged corpus of social media texts, developed primarily as training data for a Sana’ani Arabic part-of-speech (POS) tagger. The corpus consists of 7,295 tokenized sentences averaging 15 tokens each, for a total of 112,517 tokens and 15,940 types. It is manually annotated using a modified version of the Bies tagset, which covers 24 tags. The annotation is grammatical rather than morphological: it ignores morphological inflections and concentrates on syntactic features, using context to identify the part of speech of each token.
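
The figures above (sentences, tokens, types, average sentence length) are standard corpus statistics. The following is a minimal sketch of how such counts could be computed, assuming the corpus is loaded as a list of sentences where each sentence is a list of (token, tag) pairs. The function name `corpus_stats`, the in-memory layout, and the toy tokens and tags are illustrative assumptions, not the authors' format or tooling.

```python
from collections import Counter


def corpus_stats(sentences):
    """Compute basic statistics for a POS-tagged corpus.

    `sentences` is assumed to be a list of sentences, each a list of
    (token, tag) pairs. This layout is hypothetical, not the paper's format.
    """
    n_sent = len(sentences)
    tokens = [tok for sent in sentences for tok, _ in sent]
    tag_counts = Counter(tag for sent in sentences for _, tag in sent)
    return {
        "sentences": n_sent,
        "tokens": len(tokens),            # running word count
        "types": len(set(tokens)),        # distinct word forms
        "avg_tokens_per_sentence": len(tokens) / n_sent if n_sent else 0.0,
        "distinct_tags": len(tag_counts),
    }


# Toy data with placeholder tokens and tags (not taken from the corpus).
toy = [
    [("w1", "NN"), ("w2", "VB"), ("w1", "NN")],
    [("w3", "IN"), ("w2", "NN")],
]
stats = corpus_stats(toy)
print(stats)
```

Applied to the reported totals, 112,517 tokens over 7,295 sentences gives roughly 15.4 tokens per sentence, consistent with the stated average of 15.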

Published

2020-03-27

Section

Articles