According to a Silicon Valley-based startup, there is a “big data solution” for one of the Internet’s most common afflictions.
Impermium, a Silicon Valley-based, venture-backed company that specializes in removing spam, has launched a new tool that scans tens of millions of comments across the web. It spots highly offensive language and automatically removes it.
According to the company’s chief executive, Mark Risher, a tool like this is the first of its kind as algorithms have not traditionally been able to handle “internet speak” (coooooool, skilz, pr0n, and so on).
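To illustrate why “internet speak” trips up simple word matching, here is a minimal sketch of one common preprocessing idea: normalizing tokens before comparing them to a word list. It is not Impermium's method, which the company has not described; the substitution table and the `normalize` function are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical digit/symbol-to-letter table (an assumption for illustration,
# not taken from Impermium or Kaggle).
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

def normalize(token: str) -> str:
    """Lowercase a token, undo simple character swaps, and shorten stretched letters."""
    lowered = token.lower().translate(LEET_MAP)
    # Collapse any run of three or more identical characters down to two,
    # so "coooooool" becomes "cool".
    return re.sub(r"(.)\1{2,}", r"\1\1", lowered)

if __name__ == "__main__":
    for word in ["coooooool", "pr0n", "skilz"]:
        print(word, "->", normalize(word))
```

Note that “skilz” passes through unchanged: respellings like this are not character swaps, which is one reason rule tables alone fall short and a learned model is attractive.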
It took several months and the brainpower of leading data scientists to develop the technology. Spammers have evolved to become more sophisticated. “These are people that are actively trying to avoid getting caught,” said Risher in a phone interview with VentureBeat.
“Trolls and hackers and social spammers are changing their techniques to avoid detection,” he said.
As Facebook and Twitter reach their ascendancy, finding a way to remove spam has become more urgent. Recent data from security company Barracuda Labs sheds light on the extent of the problem. The study found that one in four Facebook users has received a virus or malware, often through something posted to their public wall.
It’s a more profound problem for small businesses — the appearance of hate speech on their site is a reputation killer.
Rather than use an internal data services team, Impermium turned to Kaggle, a startup that hosts data-driven competitions.
From around the world, statisticians competed to build a tool that combines machine learning and natural language processing to root out malicious commentary. The winner received $7,000 — their solution was the most accurate with a false positive rate of less than one percent.
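For readers unfamiliar with the metric, a false positive rate is conventionally the share of harmless comments that a classifier wrongly flags. The sketch below shows that standard definition; the article does not say exactly how the competition scored entries, so the function name and the toy labels are assumptions for illustration only.

```python
def false_positive_rate(labels, predictions):
    """Fraction of truly harmless items that were flagged as insults.

    labels:      True if the comment really is an insult
    predictions: True if the classifier flagged the comment
    """
    pairs = list(zip(labels, predictions))
    false_positives = sum(1 for actual, flagged in pairs if not actual and flagged)
    true_negatives = sum(1 for actual, flagged in pairs if not actual and not flagged)
    harmless_total = false_positives + true_negatives
    if harmless_total == 0:
        raise ValueError("no harmless comments in the sample")
    return false_positives / harmless_total

if __name__ == "__main__":
    # Invented toy data: four harmless comments, one of them wrongly flagged.
    labels = [False, False, False, False, True]
    predictions = [False, True, False, False, True]
    print(false_positive_rate(labels, predictions))
```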
After posting the competition, the company received 154 submissions from contenders, and made some interesting discoveries about the nature of hate speech. In the words of Kaggle’s CEO, Anthony Goldbloom:
The key takeaways
- North Dakota is the most rambunctious state (6.4 percent of traffic is insulting) and Maine is the least (only 3.7 percent of traffic was considered spam or hate speech).
- People on the internet are more malicious than you might expect (“there was no difficulty finding enough insults”).
- Text with the word “mom” is more likely to constitute an insult.
- The word f*#k is a surprisingly poor indicator that text contains an insult. It is as often “f*#k yeah” as it is “f*#k you”.
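The last takeaway, that a single word can be a weak signal while the words around it are a strong one, can be sketched by comparing single words with word pairs. The comments and labels below are invented for illustration and are not from the Kaggle data set; `insult_rate_by_ngram` is a hypothetical helper, not the winning solution.

```python
from collections import Counter

def insult_rate_by_ngram(comments, n):
    """Map each n-word sequence to the fraction of comments containing it that are insults."""
    seen = Counter()
    insulting = Counter()
    for text, is_insult in comments:
        tokens = text.lower().split()
        # Use a set so each sequence counts at most once per comment.
        ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        for gram in ngrams:
            seen[gram] += 1
            if is_insult:
                insulting[gram] += 1
    return {gram: insulting[gram] / seen[gram] for gram in seen}

# Invented toy comments (text, is_insult); not real competition data.
TOY_COMMENTS = [
    ("f*#k yeah we won", False),
    ("f*#k yeah", False),
    ("f*#k you", True),
    ("f*#k you and your blog", True),
]

if __name__ == "__main__":
    single = insult_rate_by_ngram(TOY_COMMENTS, 1)
    pairs = insult_rate_by_ngram(TOY_COMMENTS, 2)
    print(single[("f*#k",)])          # the word alone
    print(pairs[("f*#k", "yeah")])    # the word plus its neighbor
    print(pairs[("f*#k", "you")])
```

In this toy sample the word alone splits evenly between insults and non-insults, while each word pair points clearly one way.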
According to Risher, building the tool was more technically challenging than expected, mainly because it takes a trained human eye to detect hate speech. For instance, the algorithm might flag a user for quoting lyrics from a popular rap song. Alternatively, it may fail to detect a malicious comment that is veiled in sarcasm.
“People are continuing to find new ways to insult each other,” said Risher. The self-described “spam Czar” left his job at Yahoo to focus on social media sites. In his former career, he was responsible for mitigating email spam.
Launching this week, the new tool, called “Intelligent Content Protection,” is already used by content-heavy sites like WordPress, Disqus and Livefyre. Pricing is flexible, but it’s about $2,000 to $3,000 per month, significantly less than the cost of a human editor.
Impermium launched in 2011, and has received funding from Charles River Ventures, Accel Partners, and others.