Detecting Illegal Activities on Instagram using Computer Vision
Abstract
Instagram has gained considerable momentum and a sharp increase in users. It is becoming a platform that businesses and individuals monetize, which has also made it a target for illegal activities. This project aimed to answer the following question:
Given the data currently available on illegal activities on Instagram, is it possible to detect illegal posts in real time?
The approach was as follows. Images from public Instagram hashtags were scraped using “insta-loader”, a public Python library for scraping Instagram. About three thousand images were then manually divided into two groups, legal and illegal posts. The project then explored four computer vision models on this dataset: Convolutional Neural Network, Visual Geometry Group, Residual Networks and DenseNet. The findings confirmed that DenseNet delivers the best outcome, with an accuracy of 96% in classifying posts as legal or illegal in real time.
Machine Learning Methodology
The machine learning process in this project is documented according to the machine learning methodology described by Matthew Mayo, which ensures that no critical step is left undocumented.
1. Data Collection
The dataset was collected with “insta-loader”, a project for scraping public Instagram data. It is a convenient tool because it offers a good variety of useful options, such as changing the file name. This option was used to name each file after the Instagram account that posted it, which eases the process of reporting cybercrime accounts to the entity in charge. The collection script was configured to return images only and not to retrieve any videos. For posts with multiple images, only the first image of the post was downloaded. A total of 6,000 images were collected from two hashtags: UAE and Dubai. Comments and hashtags were not made part of the dataset, in order to lower the processing power required.
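The selection rules described above (images only, first image of a multi-image post, file named after the account) can be sketched in a few lines. This is a minimal illustration and not the project's actual script: the `Post` and `MediaItem` records, the `select_image` helper, the example URLs and the `username_shortcode.jpg` naming pattern are all hypothetical, and none of them is part of the “insta-loader” API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MediaItem:
    """Hypothetical record for one image or video inside a post."""
    url: str
    is_video: bool


@dataclass
class Post:
    """Hypothetical record for a scraped post (comments and hashtags are not kept)."""
    username: str
    shortcode: str
    media: List[MediaItem]


def select_image(post: Post) -> Optional[Tuple[str, str]]:
    """Return (filename, url) for the first image in the post, or None if it has no image."""
    for item in post.media:
        if not item.is_video:
            # Assumed naming pattern: the account name comes first so that
            # a flagged image can be traced back to the account that posted it.
            filename = f"{post.username}_{post.shortcode}.jpg"
            return filename, item.url
    return None


posts = [
    Post("shop_a", "abc1", [MediaItem("https://example.com/1.jpg", False),
                            MediaItem("https://example.com/2.jpg", False)]),
    Post("user_b", "abc2", [MediaItem("https://example.com/3.mp4", True)]),
]

# Keep one image per post and skip posts that contain only video.
selected = [s for s in map(select_image, posts) if s is not None]
print(selected)
```

Keeping the account name in the file name means no separate lookup table is needed when an image is later flagged as illegal.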
2. Data Preparation
After the dataset was collected, the images were manually divided into two groups: posts that are legal and posts that are illegal. Legal posts were much more common than illegal posts. The illegal set summed up to 1,324 posts, so a comparable number of legal posts was kept and the extra legal images were removed to bring the two classes closer to balance. A total of 3,165 images were chosen to represent the dataset, divided into four groups:
Class | Number of Images |
Training - Legal (90%) | 1,674 |
Training - Illegal (90%) | 1,187 |
Testing - Legal (10%) | 167 |
Testing - Illegal (10%) | 137 |
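The counts in the table can be cross-checked with a little arithmetic. The sketch below takes the 137-image row to be the illegal test set (an assumption, since the illegal total of 1,324 stated above equals 1,187 + 137); the dictionary layout and variable names are illustrative only.

```python
# Image counts per split and class, as reported in the table above.
counts = {
    ("train", "legal"): 1674,
    ("train", "illegal"): 1187,
    ("test", "legal"): 167,
    ("test", "illegal"): 137,  # assumed to be the illegal test set
}

total = sum(counts.values())
legal_total = counts[("train", "legal")] + counts[("test", "legal")]
illegal_total = counts[("train", "illegal")] + counts[("test", "illegal")]
test_total = counts[("test", "legal")] + counts[("test", "illegal")]
test_share = test_total / total

print(total)                 # 3165
print(legal_total)           # 1841
print(illegal_total)         # 1324
print(round(test_share, 3))  # 0.096
```

The totals agree with the 3,165 images and 1,324 illegal posts stated in the text, and the test share is close to the nominal 10%.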
Next, a data augmentation step was added ahead of the machine learning model, with the goal of generating more images to balance the classes.
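The idea of augmenting the smaller class until it matches the larger one can be sketched as follows. The source does not say which transformations were used, so the horizontal flip here is an assumption. The toy list-of-rows image format and the function names are likewise illustrative; a real pipeline would operate on image arrays.

```python
import random
from typing import List

Image = List[List[int]]  # toy grayscale image: a list of pixel rows


def horizontal_flip(image: Image) -> Image:
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def augment_to_balance(minority: List[Image], target_count: int, seed: int = 0) -> List[Image]:
    """Add flipped copies of randomly chosen minority images until target_count is reached."""
    if not minority:
        raise ValueError("minority class must contain at least one image")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    augmented = list(minority)
    while len(augmented) < target_count:
        augmented.append(horizontal_flip(rng.choice(minority)))
    return augmented


# Toy example: two "illegal" images grown to match five "legal" ones.
illegal = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [0, 0, 0]]]
balanced = augment_to_balance(illegal, target_count=5)
print(len(balanced))
```

With the training counts from the table, the same call would grow the 1,187 illegal training images to 1,674. In practice several transformations (rotation, zoom, brightness shifts) are usually combined so that the added images are not exact mirror copies.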
3. Choosing The Models
Four different computer vision models were chosen:
Convolutional Neural Network - CNN
Visual Geometry Group - VGG
Residual Networks - ResNet
DenseNet
The results of the models were then compared to choose the model best suited to the available processing power and data.
4. Training the Models
In the training phase, each of the four models was trained for thirty iterations with a learning rate of 0.0001. All models were trained on the same CPU (Intel i9-9900K).
The training phase took 16 hours in total. Figure (1) displays the completion of the training for all four models.
Figure (1) - Final Iteration for all models
5. Evaluating
Accuracy was used to compare and evaluate the models against each other. It represents how well a model classified the images as legal or illegal, and its value ranges from zero (lowest) to one (highest). A model with an accuracy of 1 categorizes every image without any mistake, which corresponds to an accuracy rate of 100%. The accuracy of each model was obtained by running it on the unseen test set. Figure (2) shows the accuracy of the models.
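Accuracy as described here is the number of correctly classified test images divided by the total number of test images. A minimal sketch follows; the `accuracy` helper and the example labels are illustrative and are not taken from the project's code or data.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels, between 0 and 1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for four unseen images: three of four predictions are right.
true_labels = ["legal", "illegal", "legal", "illegal"]
predicted = ["legal", "illegal", "illegal", "illegal"]
print(accuracy(true_labels, predicted))  # 0.75
```

Because the two test classes are of similar size (167 and 137 images), plain accuracy is a reasonable summary here; on a heavily imbalanced test set, per-class precision and recall would be more informative.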
Figure (2) - Models Accuracy
The DenseNet model returned the highest accuracy, at 98%.
Discussion
Past papers focused on identifying one illegal activity at a time, so their models were trained on data that were similar to one another. This paper, in contrast, identified four illegal activities, which made the dataset very diverse. This difference in scope made achieving a high accuracy challenging, but the challenge was overcome in three ways. First, in the collection process, only relevant and necessary data were collected; leaving out unnecessary data allowed for a bigger and more accurate dataset. Second, in the data preparation process, the data were divided accordingly, and data augmentation was used to balance the classes. Finally, in choosing the models, the newest models most relevant to the dataset were chosen. Together, these points allowed the accuracy rate to reach 98%.
The project confirmed that it is possible to identify posts as illegal in real time with an accuracy rate of 98%. The collected data also showed which cybercrimes are most common in the United Arab Emirates. The most frequent illegal activity on Instagram was unlicensed medicine selling (38%), followed by black magic (31%), investment scams (26%), and drug and weed selling (5%).