Ethical Consideration in Short Message Service Dataset Corpus Creation

Author(s)

Izang Aaron, Erihri O. Jonathan, Ajayi S. Wumi

Pages: 24-30 | DOI: 10.5281/zenodo.4305665

Volume 9 - November 2020 (11)

Abstract

Since the Short Message Service (SMS) was introduced in 1993, its popularity has continued to soar, and SMS communication now constitutes a major segment of telecommunication. This popularity and extensive usage have drawn many researchers to the potential of harvesting data and metadata from SMS corpora for linguistic, diachronic, normalization, and sociolinguistic studies, and for validating and comparing classifiers in SMS spam filters. However, freely available datasets containing this type of information for research purposes are difficult to obtain. This is mostly due to the confidentiality of SMS: users want to reveal as little of the contents of their phones as possible. This work examines the techniques adopted in creating SMS corpora and the ethical considerations involved in protecting users' interests and privacy. A critical review of existing work in the field was carried out to ascertain the ethical practices adopted. It found that, in order to create an SMS corpus successfully, the main consideration is the requirement to protect the rights and interests of the message donors and of any other person mentioned in the text messages, without altering the original text, so that sufficient metadata can be gathered. Participant consent, data anonymization, and safe storage of participants' information are the basic ethical considerations identified in this work for successful SMS corpus creation.

Keywords

Corpus, Metadata, Linguistic, Normalization, Sociolinguistic, SMS, Spam filter
