Machine Learning Combats Spam Comments

16 Apr 2020, 14:42 PM

In this day and age, online sectors such as eCommerce, travel, and media rely heavily on direct and regular customer feedback. This feedback lets businesses provide quality services, understand customers’ pain points, identify gaps in their service infrastructure, and provide prompt resolutions.

Feedback via comments is a user-focused approach: it gives consumers a window to voice their likes and dislikes, builds engagement, and lets the business engage proactively with readers to draw out opinions.

But feedback channels, like most technologies, can prove detrimental because they open the door to misuse, fraud, and attacks. Often, comments turn out to be spam, cyber-bullying, or attempts to drive unrelated and unwarranted discussions. This kind of abuse can mar the experience of genuine consumers.

With thousands of individuals discussing, debating, and deliberating in the comment section of the above-mentioned businesses every day, monitoring and filtering out negative and unwarranted comments that hijack the context is a considerable challenge.

Bypassing conventional spam identifiers:

The ‘stop word’ method, which identifies spam comments based on a predefined set of keywords, is turning out to be primitive. People can bypass it by replacing letters in words with special characters and symbols. Keyword identification is also blind to context: a word can mean one thing in isolation and something quite different in a wider context.
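To make both weaknesses concrete, here is a minimal sketch of a keyword filter. The blocklist and the sample comments are invented for illustration and are not taken from any real system.

```python
import re

# Hypothetical blocklist; a real system would maintain its own keyword set.
BLOCKED_KEYWORDS = {"free", "winner", "casino"}

def keyword_filter(comment: str) -> bool:
    """Return True if any blocked keyword appears as a whole word."""
    words = re.findall(r"[a-z]+", comment.lower())
    return any(word in BLOCKED_KEYWORDS for word in words)

plain = "You are a winner, claim your free prize"
obfuscated = "You are a w1nner, claim your fr3e prize"  # digits replace letters
benign = "Feel free to disagree with me"                # harmless use of "free"

print(keyword_filter(plain))       # spam caught by the keyword list
print(keyword_filter(obfuscated))  # same spam, but no keyword matches
print(keyword_filter(benign))      # genuine comment flagged out of context
```

The obfuscated comment slips through because no token matches the list any more, while the benign comment is flagged purely because it contains a listed word.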

Simply put, it is not possible to block comment spam at web scale through manual identification. The loss is not just in time and cost; the outcome also depends on the moderator’s bias. Nor can manual moderation keep up with interactions in real time.

Tonic for Spam problems:

Companies need to counter these attackers by investing in modern infrastructure that combines two things in measured proportion: proactive audience participation in flagging unwarranted comments, and an automated process that learns from existing moderation decisions to stop bad actors and keeps refining its own identification mechanism to combat spammers.

Many businesses therefore now have concrete guidelines that urge users to refrain from unjustifiable engagement. They also encourage users to assist the moderation team by identifying comments that hurt, malign, politicize, or push an unwelcome agenda among commenters.

Pre-screen spam:

Machine learning (ML) consists of two phases: training and inference. Before inference, every new comment has to pass a pre-screening process. A static pre-processing set-up sets the tone for spam identification.

Certain steps need to be taken, and we have listed them for you:

Checking for duplicate comments – Some users post the same comment repeatedly to gain extra visibility, which distracts genuine readers. Such duplicate comments need to be blocked.
Phone number, email, or URL check – Miscreants leave contact details such as phone numbers, email IDs, or links in the comment section to bait unsuspecting readers, who could then be vulnerable to attacks. Comments containing such information are therefore removed.
Copy-paste comments – Copying and pasting the same text is a common way to gain visibility in the comments section. These comments are identified and removed.
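The pre-screening checks above could be sketched roughly as follows. This is an illustration under stated assumptions, not the article's actual implementation: the regular expressions are deliberately simple, the `PreScreener` class and its rejection reasons are hypothetical names, and duplicate and copy-paste detection are both approximated by hashing a normalized copy of the text.

```python
import hashlib
import re
from typing import Optional, Set

# Illustrative patterns only; production-grade detection needs broader rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# 8 to 15 digits, optionally separated by single spaces or hyphens.
PHONE_RE = re.compile(r"\+?\d(?:[\s-]?\d){7,14}")

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations hash alike."""
    return " ".join(text.lower().split())

class PreScreener:
    """Hypothetical static pre-screen that runs before the ML model."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()  # digests of comments already accepted

    def check(self, comment: str) -> Optional[str]:
        """Return a rejection reason, or None if the comment passes."""
        if (URL_RE.search(comment) or EMAIL_RE.search(comment)
                or PHONE_RE.search(comment)):
            return "contact-info"
        digest = hashlib.sha256(normalize(comment).encode("utf-8")).hexdigest()
        if digest in self._seen:
            return "duplicate"
        self._seen.add(digest)
        return None

screener = PreScreener()
for text in ["Great article, thanks!",
             "great   article, THANKS!",
             "Call me at +91 98765 43210"]:
    print(repr(text), "->", screener.check(text))
```

An exact hash only catches copies that differ in case or spacing; catching lightly edited copies would need a fuzzy similarity measure, which is outside this sketch.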

Building a strong data-inventory:

A Convolutional Neural Network (CNN), like any other machine learning model, relies on a predefined data set for training and development purposes.

Data can be collected from reliable sources such as:

Audience contributions – These are one of the more reliable inputs for identifying malicious content in the comments. Content flagged as malicious by end-users can be studied over a week and tagged. Non-malicious content is studied over 3 days before being deemed safe. Both then become inputs to the model training process.
Data from Kaggle – Kaggle is the world’s largest ML and data science community, and its datasets are held in high regard for building a predefined data feed for the CNN system. Spam datasets are available on Kaggle across various verticals such as email, SMS, etc.
Manual filtering – The business’s online back office continually builds a data set of malicious content. This is required for frequent re-training, which prevents the model from going stale.
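As a rough illustration of how audience flags might be turned into training labels, the sketch below applies the one-week and 3-day observation windows mentioned above. The `Comment` fields and the `training_label` function are hypothetical, and the sketch assumes that any flagged comment is tagged as spam once its window closes; a real pipeline would likely add a review step before tagging.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

FLAGGED_REVIEW_WINDOW = timedelta(days=7)  # flagged content: studied for a week
SAFE_REVIEW_WINDOW = timedelta(days=3)     # unflagged content: studied for 3 days

@dataclass
class Comment:
    text: str
    posted_at: datetime
    flag_count: int = 0  # number of end-user flags (hypothetical field)

def training_label(comment: Comment, now: datetime) -> Optional[str]:
    """Return "spam", "ham", or None while the observation window is open."""
    age = now - comment.posted_at
    if comment.flag_count > 0:
        # Assumption: flagged comments become spam examples after the window.
        return "spam" if age >= FLAGGED_REVIEW_WINDOW else None
    return "ham" if age >= SAFE_REVIEW_WINDOW else None

now = datetime(2020, 4, 16)
flagged = Comment("buy cheap masks", now - timedelta(days=8), flag_count=2)
fresh = Comment("nice write-up", now - timedelta(days=1))
print(training_label(flagged, now), training_label(fresh, now))
```

Comments that return `None` are simply held back until their window has elapsed, so they never enter the training set with a premature label.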

Identifying Context:

Structuring a spam identification system merely on matching pre-identified words could produce skewed results and low accuracy. The meaning of a word is determined primarily by its context, that is, the words and phrases surrounding it.

The sequence in which words appear is very important; changing the order of words in a sentence can alter its meaning beyond repair. The spam detection system should take this into account to achieve any worthwhile precision.
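A small sketch shows why word order matters. The two sample sentences are invented, and word pairs (bigrams) are used here only as a simple stand-in for the order-aware features a sequence model such as a CNN over word embeddings would learn.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Word counts only; the order of words is discarded."""
    return Counter(text.lower().split())

def bigrams(text: str) -> list:
    """Adjacent word pairs, which preserve local word order."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

a = "this product is not good it is bad"
b = "this product is not bad it is good"

# Same words, same counts: a pure word-matching view cannot tell them apart.
print(bag_of_words(a) == bag_of_words(b))
# Different word pairs: ("not", "good") appears only in the first sentence.
print(("not", "good") in bigrams(a), ("not", "good") in bigrams(b))
```

The two sentences carry opposite meanings, yet they are identical to any system that only counts words; only a representation that keeps the sequence can separate them.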

Comments

Susie McCrea

I learned something new in this article. Thanks so much

4 years ago
Stevon Lewis

Wow!

4 years ago
Jasmine Anron

Thanks for sharing such great information with us. Also, here an nice mask machine share for you. Taking care ourselves.mask machine

3 years ago
