ScamBuster

I recently participated in this hackathon: So you think you can hack?

I collaborated with two sophomore students from Georgia Tech on this project. My initial idea was a bit different: a spam-call buster that interactively checks incoming calls and blocks them if they are spam. However, we later decided that a better idea would be a Scam Buster, which identifies scam and spam SMS messages and alerts the user, saving them from potential loss.

It may seem improbable, but many people lose money to scam messages, for example by clicking on the links they contain. I was surprised to learn that in 2020 alone, text scams cost Americans 86 million dollars.

Check out its website here. You can download and check the app out yourself!

What I learnt during this hackathon:

The importance of teamwork and communication between teammates.
The TensorFlow and PyTorch frameworks.
Implementing language models in an Android application.
Methods of building an Android application.
Converting PyTorch models to other formats, such as TorchScript, ONNX, and TFLite, and the problems that come with these conversions.

My part in the project:

We divided the project into three parts, one for each teammate. My job was to find a suitable model, train it to our requirements, and help integrate it into the application.

Finding the appropriate dataset:

I ran into the first issue right away: the dataset was too small. My teammates and I all searched for a good, large dataset. Unfortunately, all we found was an old dataset dating back to 2016. We then looked for projects or scientific papers that used a bigger dataset, and we found one. One of the papers linked to a GitHub repo containing 50k spam/ham messages in both Chinese and English. I downloaded this dataset and took a closer look, but the messages seemed to be incomplete. Many of them were unintelligible; even I couldn’t understand them. So, I decided to stick with the original dataset.

It contained about 5.5k messages, of which only around 700 were scam messages and around 4800 were ham messages. This imbalance would hurt the training of the model, so I rebalanced the data. I oversampled the scam messages with a random oversampler, leaving the ham messages unchanged, which resulted in a final dataset of 4825 scam and 4825 ham messages. With the classes balanced, the model would no longer be biased towards ham during training.
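To make the oversampling step concrete, here is a minimal sketch of random oversampling using only the Python standard library. It is not the exact code from the project: the function name `random_oversample` and the toy messages are made up for illustration, and the real dataset would be loaded from a file instead.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group the messages by their label.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest one.
    target = max(len(rows) for rows in by_label.values())

    out_texts, out_labels = [], []
    for label, rows in by_label.items():
        # Sample with replacement to fill the gap; the original rows are kept.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for text in rows + extra:
            out_texts.append(text)
            out_labels.append(label)

    # Shuffle so the duplicated rows are not grouped together.
    order = list(range(len(out_texts)))
    rng.shuffle(order)
    return [out_texts[i] for i in order], [out_labels[i] for i in order]


# Toy data standing in for the real SMS dataset (hypothetical messages).
texts = ["see you at 5", "lunch?", "call me later", "ok thanks",
         "WIN a FREE prize, click now"]
labels = ["ham", "ham", "ham", "ham", "scam"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```

One design point worth keeping in mind with this approach: oversampling is usually applied only to the training split, after the test split has been set aside, so that duplicated scam messages cannot end up in both.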

Deciding the model to use for the task:

For this classification task, I decided to start from a pretrained model, so that it would already have a basic understanding of English. This would make up for the small number of scam messages we had. Out of many candidates, and with helpful suggestions from ChatGPT, I decided to try several variations of pretrained BERT models. There were multiple reasons for this choice:

BERT (Bidirectional Encoder Representations from Transformers) models are strong, flexible language models that come in several sizes, so they can run on different kinds of hardware.
BERT models have been pretrained on a large text corpus.
Fine-tuning them is a form of transfer learning: the model reuses what it learnt during pretraining, which gives better results on a small dataset.

Fine-tuning the model:

Before fine-tuning our model, we first had to decide whether to host it online or integrate it into the Android application. Originally, I trained both MobileBERT and a larger BERT model (hidden size of 768). Then, I tested both of them. Their results were quite close to each other, so I decided to keep MobileBERT and integrate it directly into the app. However, after multiple unsatisfactory attempts, I decided to host the model online instead.

Once we had decided on hosting it online, I went with the larger model, since it had a much better understanding of English and was trained on a much larger corpus. We hosted it on HuggingFace. It reached an accuracy of 99.8% on the test dataset.

Hosting it online:

The models were large. Even MobileBERT was around 95 MB, and the larger one was over 400 MB. It would be inconvenient for users to download an app of that size. So, I decided to host the model on HuggingFace. This way, the app wouldn’t take up much space. There were actually multiple reasons for hosting it online, but this was one of the biggest benefits.

HuggingFace allows us to use Hosted Inference APIs to call the model. I integrated this API into my Android Studio project to request information from the model and receive responses.
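The app made this call from Android, but the shape of the request is easier to show in a few lines of Python. The sketch below is an illustration under assumptions, not the project’s code: the URL pattern is the one the hosted Inference API has used for model endpoints, and the model id, the token, and the `LABEL_0`/`LABEL_1` names in the sample response are placeholders.

```python
import json
import urllib.request

# Assumed URL pattern for a hosted model; the model id is filled in later.
API_URL_TEMPLATE = "https://api-inference.huggingface.co/models/{model_id}"


def build_request(model_id, token, message):
    """Build the POST request that sends one SMS text to the hosted model."""
    payload = json.dumps({"inputs": message}).encode("utf-8")
    return urllib.request.Request(
        API_URL_TEMPLATE.format(model_id=model_id),
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def top_label(response_json):
    """Return the (label, score) pair with the highest score.

    Text-classification responses are a list of {"label", "score"} dicts,
    sometimes wrapped in one extra list; both shapes are handled here.
    """
    candidates = response_json
    if candidates and isinstance(candidates[0], list):
        candidates = candidates[0]
    best = max(candidates, key=lambda item: item["score"])
    return best["label"], best["score"]


def classify(model_id, token, message, timeout=30):
    """Send the message to the API and return the most likely label."""
    request = build_request(model_id, token, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return top_label(json.load(response))


# Offline check of the parsing step with a made-up response.
sample = [[{"label": "LABEL_1", "score": 0.98},
           {"label": "LABEL_0", "score": 0.02}]]
print(top_label(sample))
```

Calling `classify("your-username/your-model", "your-token", "some SMS text")` would perform the real network request; which label means scam depends on how the model was configured when it was uploaded.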

Enhancing the UI of the Android Application:

My colleague, Bratee, made the original UI for the Android application. Once I had hosted the BERT model online, I added some final touches to our app: a user-friendly send-receive text box and some styling for the message bubbles.

I made a video on how to use ScamBuster! Watch this 30-second short.

Teammates in my project:

I had the privilege of working alongside two exceptional individuals who played vital roles in our hackathon project. Allow me to introduce my brilliant teammates:

Bratee Podder, Sophomore student & researcher at Georgia Tech. Responsible for the Android App development with Java classes, as well as the final presentation.
Lily Chrisholm, Sophomore student at Georgia Tech. Responsible for the integration of the Model with the Android App.
