SMS Fraud Detection
RESEARCH SCOPE
The aim of this project was to develop a dataset of SMS messages, both normal and fraudulent, for use in building machine learning models that classify fraudulent SMS messages. The research was conducted by the Kuyesera AI lab. We report on the creation of the dataset and on preliminary machine learning experiments conducted on it.
DATA COLLECTION
Data collection was conducted using two methods: an online survey and face-to-face questionnaires.
- A structured online survey built with Google Forms was distributed through various digital platforms at MUBAS via the Students Union president. Posters displayed on the MUBAS campus called for participation. A total of 102 people participated in the online survey.
- Face-to-face data collection was conducted with individuals selected through random sampling techniques around the campus. A total of 86 students and members of staff participated in the face-to-face data collection.
KEY FINDINGS
- Awareness of fraudulent SMSs: In the online survey, 96.1% of respondents reported being aware that fraudulent SMSs exist and 3.9% reported being unaware. In the face-to-face data collection, all 86 participants (100%) reported being aware of fraudulent SMSs.
- Prevalence of Fraudulent SMSs: In the online survey, 2% of respondents reported receiving more than 10 fraudulent SMS messages a month, 16.4% reported receiving between 5 and 10 a month, and 81.4% reported encountering such messages occasionally. In the face-to-face interviews, participants reported varying degrees of exposure, with 12% reporting frequent encounters and 88% reporting rare occurrences.
- Types of Fraudulent SMS Scams: Both the online survey and the face-to-face data collection highlighted common types of fraudulent SMS scams, including phishing attempts, prize/sweepstakes scams, fake investment opportunities, and impersonation scams. Participants in both groups described similar experiences, indicating that the types of scams encountered are consistent across the two samples.
- Response strategies: In the online survey, 72% of respondents stated that they ignored fraudulent SMS messages and 28% deleted them immediately. In the face-to-face interviews, participants mentioned diverse response strategies, including ignoring and deleting the messages, reporting them to the relevant authorities, and engaging with the sender to gather more information.
- Impact of fraudulent SMSs: Both data collection methods highlighted the potential financial and emotional impact of falling victim to fraudulent SMS scams, with respondents expressing concerns about identity theft, financial losses, and personal data compromise. 90.2% said that they or a friend had fallen victim to such a scam, while 9.8% said they had not.
DATASET
Some participants in the study consented to share sample SMS data and submitted it. From these submissions we developed a dataset of 15,299 SMS messages. We categorized the messages as FRAUD, NORMAL, or SPAM, of which 1,370 are fraudulent, 1,826 are spam, and 12,033 are normal. We are using this dataset to develop a machine learning algorithm for SMS classification that will help detect potentially fraudulent SMSs.
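The class balance implied by these counts can be worked out with a few lines of Python. This is only an illustrative sketch: the variable names are our own, and the shares are taken relative to the sum of the three class counts listed above.

```python
# Class counts as listed in the DATASET section above.
counts = {"FRAUD": 1370, "SPAM": 1826, "NORMAL": 12033}

# Shares are taken relative to the sum of the three listed counts.
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Share of the largest class: the accuracy a classifier would reach
# by always predicting that class.
majority_label = max(counts, key=counts.get)
majority_share = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Because normal messages make up the large majority of the dataset, the majority-class share is a useful reference point when reading any accuracy figure computed on all three classes.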
EXPERIMENTS PERFORMED
We used the dataset to experiment with machine learning algorithms to build a classifier. The dataset consists of the text of each SMS and other features such as the phone number that sent the SMS, the network, and the day of the month when the message was received. We manually annotated the dataset, classifying each message as spam, normal, or fraud. We also translated a subset of the dataset fully into English. We carved out subsets based on selected features. We then developed predictors based on ML algorithms and compared their performance.
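As a rough illustration of what combining message text with other features, and carving out subsets of those features, might look like, consider the sketch below. It is not the project's actual pipeline: the record layout, the field names (`text`, `sender`, `network`, `day_of_month`, `label`), the sample values, and the helper functions are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical record layout; field names and values are illustrative
# assumptions, not the project's real schema or data.
SAMPLE = {
    "text": "Congratulations! You have won a prize, send your PIN",
    "sender": "SENDER_1",
    "network": "NetworkA",
    "day_of_month": 10,
    "label": "FRAUD",
}

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the message and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def featurize(record, metadata_fields=()):
    """Build a sparse feature dict from the text plus selected metadata fields."""
    features = Counter(f"tok={t}" for t in tokenize(record["text"]))
    for field in metadata_fields:
        # Each selected metadata value becomes one indicator feature.
        features[f"{field}={record[field]}"] = 1
    return dict(features)


# Text-only subset versus a subset that also includes two metadata fields.
text_only = featurize(SAMPLE)
with_metadata = featurize(SAMPLE, metadata_fields=("network", "day_of_month"))
```

Feature dictionaries of this kind can then be passed to whichever learning algorithm is being compared, so the same messages can be evaluated with and without a given feature.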
EXPERIMENTS' RESULTS
Our experiments show that machine learning algorithms can effectively classify Chichewa SMS messages as fraudulent or non-fraudulent. The Random Forest algorithm achieved a classification accuracy of 95%. Testing the Chichewa-based model on new SMS messages showed that it could identify almost all fraudulent Chichewa messages, even with a relatively small dataset. Translating the dataset into English improved accuracy to 99%, while machine-translated versions achieved an accuracy of 91%. The Logistic Regression model performed best overall. Performance differences between models built on human-translated datasets from different translators were minimal, suggesting either that the translations were similar or that further study is needed. Extracting less similar messages for training did not significantly improve accuracy. Using all SMS features improved model performance, while restricting them reduced it.
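For readers unfamiliar with how such figures are obtained, the sketch below shows how accuracy and per-class recall can be computed from true and predicted labels. The recall of the fraud class corresponds to the share of fraudulent messages a model identifies. The function names are our own, and the labels are toy values for illustration only, not results from the study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall_per_class(y_true, y_pred):
    """For each true class, the fraction of its messages predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels for illustration only; these are not results from the study.
y_true = ["NORMAL", "NORMAL", "NORMAL", "NORMAL", "SPAM", "SPAM", "FRAUD", "FRAUD"]
y_pred = ["NORMAL", "NORMAL", "NORMAL", "NORMAL", "SPAM", "NORMAL", "FRAUD", "NORMAL"]

acc = accuracy(y_true, y_pred)
rec = recall_per_class(y_true, y_pred)
print(f"accuracy: {acc:.2f}")
for label, value in rec.items():
    print(f"recall[{label}]: {value:.2f}")
```

Reporting per-class recall alongside accuracy is one way to check that a high overall score is not driven mainly by the most common class.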
SMS FRAUD AWARENESS DAY
The event was held in the ODeL building at the Malawi University of Business and Applied Sciences (MUBAS) on 10th May 2024 from 14:00 to 16:00. The event aimed to raise awareness of the existence of SMS fraud, of how people can identify fraudulent SMS messages to protect themselves, and of how machine learning can be used to help combat SMS fraud in Malawi. On the role of machine learning in combating SMS fraud, we highlighted recent findings from the research project titled “SMS Fraud Detection Using Machine Learning in Malawi”. The research investigated the potential of using machine learning to classify Chichewa SMS messages as fraudulent or non-fraudulent. In attendance were MUBAS students who were involved in the SMS data collection, members of staff, the MSU Director of Publicity and Publications, and invited guests from TNM and Inq.