BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews

Published in Findings of the Association for Computational Linguistics: ACL 2023

Citation (IEEE format): M. Kabir, O. B. Mahfuz, S. R. Raiyan, H. Mahmud, and M. K. Hasan, “BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews,” arXiv preprint arXiv:2305.06595, 2023.


@misc{kabir2023banglabook,
    title={BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews},
    author={Mohsinul Kabir and Obayed Bin Mahfuz and Syed Rifat Raiyan and Hasan Mahmud and Md Kamrul Hasan},
    year={2023},
    eprint={2305.06595},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Authors: Mohsinul Kabir†, Obayed Bin Mahfuz†, Syed Rifat Raiyan†, Hasan Mahmud, and Md Kamrul Hasan.

Abstract: The analysis of consumer sentiment, as expressed through reviews, can provide a wealth of insight regarding the quality of a product. While the study of sentiment analysis has been widely explored in many popular languages, relatively less attention has been given to the Bangla language, mostly due to a lack of relevant data and cross-domain adaptability. To address this limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral. We provide a detailed statistical analysis of the dataset and employ a range of machine learning models to establish baselines including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial performance advantage of pretrained models over models that rely on manually crafted features, emphasizing the necessity for additional training resources in this domain. Additionally, we conduct an in-depth error analysis by examining sentiment unigrams, which may provide insight into common classification errors in under-resourced languages like Bangla.
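
As a rough illustration of the unigram-based error analysis mentioned in the abstract, the sketch below counts how often each token appears under each sentiment label. It is not the authors' code. The function names (`unigram_counts_by_label`, `top_unigrams`), the whitespace tokenization, and the four toy reviews are all assumptions made for this example, and none of the text is drawn from BanglaBook.

```python
from collections import Counter, defaultdict


def unigram_counts_by_label(samples):
    """Count whitespace-separated tokens for each sentiment label.

    `samples` is an iterable of (text, label) pairs. Splitting on
    whitespace is a simplifying assumption, not the paper's tokenizer.
    """
    counts = defaultdict(Counter)
    for text, label in samples:
        counts[label].update(text.split())
    return counts


def top_unigrams(counts, label, k=3):
    """Return the k most frequent tokens seen under `label`."""
    return [token for token, _ in counts[label].most_common(k)]


# Toy, invented reviews used only to exercise the functions
# (hypothetical data, not taken from the BanglaBook dataset).
toy_samples = [
    ("বইটি খুব ভালো", "positive"),
    ("খুব ভালো গল্প", "positive"),
    ("বইটি খারাপ", "negative"),
    ("গল্প মোটামুটি", "neutral"),
]

counts = unigram_counts_by_label(toy_samples)
print(top_unigrams(counts, "positive", k=2))
```

Comparing the most frequent tokens across labels in this way can surface words that occur under more than one sentiment, which is one plausible source of the classification errors the abstract refers to.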