Folks, we are approaching a Mega information clutter in the near future. There will be trillions of Web pages. People will have petabytes (quadrillions of bytes) of information on their local computers, and it will look like the biggest mess ever piled up in the history of human civilization. A big portion of this information will be junk: irrelevant, accidental, and of poor quality.

It is like we are building cities without any sewer, gutter, garbage collection, or sanitation systems.

Junk email filters and antivirus programs, meanwhile, actually encourage more information pollution by giving us a false sense of orderliness. It is analogous to recycling aluminum cans. Recycling makes consumers feel good about purchasing more of them, and we’re seeing a boom in “clean technology.” There is no “information cleaning” technology, however. Never has been, and never will be.

Our last hope is the “finding” technologies - namely, the search engines. The mess will still be there, but with better search engines we might be able to steer around the garbage. The critical question, therefore, is how much better search engines must get to save the day.

To answer this question in general terms, we can postulate that a search engine must understand what is going on, which requires algorithms capable of understanding content. This is what experts in the search field have called Semantic Search Technology (SST). I like the abbreviation SST; it sounds like a wake-up whisper! Some reading material related to SST can be found at http://www.ontologicalsemantics.com. Such an engine would not merely understand the meaning of words and sentences, but would also assess the content’s relevancy, accuracy and so on. Right now, it is an ideal. More in a sec.

This is to be distinguished from the so-called “Semantic Web,” which refers to technology that allows us to understand the meaning of content. It is somewhat less ambitious. It is widely heralded today. But what about information pollution? Belief in the Semantic Web assumes that people posting articles on the Web will be nice and educated enough to follow the “mysteriously unified” standards of semantic structuring. And it only applies to Web documents, not to those on your hard drive. Obviously, this view is too optimistic and idealistic. The Semantic Web is a document formatting idea, whereas SST is a computer algorithm that does not rely on people’s best behavior while generating new documents.

So let’s stay with SST.

When SST is implemented properly and content detection capability reaches satisfactory levels, search engines will be able to sift through massive data and pull out relevant answers on a consistent basis. To the end user, the garbage (or unrelated content) will be invisible during the search. And once it is invisible, information pollution will remain a dormant disease. Accordingly, mastering SST seems to be the only straight line to tackle the information pollution problem, or better put - to steer around it.

Although there is a substantial amount of academic research, SST in the application field is still in its infancy. There is a good chance, however, of seeing significant advances in Web search in the next couple of years, as I mentioned in my earlier blog posting.

Information pollution thrives in sub-optimal environments. It is either created out of carelessness, or produced by people who have a certain expectation of return, banking on the leakages in this sub-optimal environment (see Dr. Jakob Nielsen’s earlier postings). Junk email, for example, has a tiny return, yet it is still a viable advertising strategy because some people still read it. But if you erase that tiny return with new technology, then the junk producers will eventually go out of business or lose interest.

Similarly, if SST makes it impossible for bad-quality information to rank high in search engines, then the bad-quality information producers will eventually get discouraged. This, however, requires a period of consistent results from a proven SST implementation, industry-wide.

The challenges ahead are sizable but can be met step by step. Among them, the most important is to have the vision to build search engines from the ground up, inventing new infrastructures and middleware to sustain scalable semantic computations. A patch job over the existing technologies will most likely suffer from their inherent limitations.

The other positive development to reduce information pollution is the natural evolution from “push” to “pull”. Before on-line advertising was invented, all promotions were in “push” mode. Marketers carpet-bombed us with messages, hoping some would reach the right people. After all, if someone sends you an email about “balding in aging men” and you fit this definition, then it may not be junk email for you. Thus, destination accuracy is also what makes junk a success.

With on-line advertising, you are exposed to ads according to your query, or the page content. This is the “pull” mode. You may still get an ad related to “balding in aging men” for your query of “bald eagles” today. But this can be fixed via SST in the short run.
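To see how a purely keyword-driven matcher could make this kind of mistake, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how any real ad system works: the function names, the four-letter "stem", and the list of ads are all hypothetical.

```python
def crude_stem(word: str, length: int = 4) -> str:
    """Reduce a word to its first `length` letters (a deliberately naive stemmer)."""
    return word.lower()[:length]


def keyword_match(query: str, ad_keywords: str) -> bool:
    """Return True if the query and the ad share at least one crude stem."""
    query_stems = {crude_stem(w) for w in query.split()}
    ad_stems = {crude_stem(w) for w in ad_keywords.split()}
    return bool(query_stems & ad_stems)


# Hypothetical ads, identified by the keywords an advertiser might have bought.
ads = ["balding in aging men", "eagle watching tours"]

query = "bald eagles"
matched = [ad for ad in ads if keyword_match(query, ad)]
print(matched)  # ['balding in aging men', 'eagle watching tours']
```

Both ads match, because surface-level stems cannot tell a bird from hair loss. Separating the two takes some understanding of what the query is about, which is the job this article assigns to SST.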

Once the promoters, spammers, and no-credibility content are taken out of the pollution equation by SST, semantic search may save the day. But the subject of information pollution actually goes deeper than that. Consider a credible content source that says “There are WMD in Iraq”… How do you filter that?

3 Trackbacks

  1. HELM, WHM/cPanel, Windows, Linux and SEO Blog » Blog Archive » SearchCap: The Day In Search, August 7, 2007 said:

    [...] Information pollution: Can semantic search save the day?, VentureBeat [...]

  2. Stalkk.ed Bookmarks of August 9, 2007 [del.icio.us] | Stalkk.ed said:

    [...] Information pollution: Can semantic search save the day? - The proliferation of information on the Web and the comparison with building a city. Petabytes of information on everyone’s computers. Can semantic search save us from information that is “junk”, irrelevant, accidental, etc.? [...]

  3. hakia Blog » Blog Archive » Information Pollution: Can Semantic Search Save the Day? said:

    [...] This article first appeared in Venturebeat on August 6, [...]

4 Comments

  1. Ken Ewell said:

    What an informative article. I like the SST moniker too!

    One thing about the semantic web though: Information polluters would have to go the extra step of creating a namespace and rdf tuples for their pages; not so with SST. So it is likely that information pollution would be less of a problem for semantic web applications.

    Data cleansing is not a new problem and this is probably why closely engineered and monitored content, though costly, is not such a bad idea. Where do you think the saying “garbage in, garbage out” came from?

    One inherent problem is that search engines present links that are used to drive traffic to “the right place”. People that want to scam you will fight to be that right place. The scam is to appear real, authoritative and accurate. Just like you said in the end, Riza, how can you filter “There are WMD in Iraq”?

  2. John Nagle said:

    This has been tried before. Cycorp has been claiming that they’d have something like this working in two years since about 1990. What they have so far can be tried at “http://game.cyc.com”. You won’t be impressed.

    We’ve discovered that kicking the junk off the web is a solvable problem. Try our “http://www.sitetruth.com”. This is an automated due diligence system tied to a search engine. It tries to find out who’s behind a web site, and lowers the search positioning based on that info. All those “domaining” sites, link farms, referrer pages, and similar junk just drop out. Remember “on the Internet, no one knows if you’re a dog?” We don’t let the dogs in.

  3. nick said:

    Nice piece. I agree with many of your points.

    “Mega information clutter” really changes the data management game in many deep and meaningful ways.

    Round these parts, we call the clutter “data superabundance”…

  4. Courtney Benson said:

    Riza -

    Sounds like you’re on to a much-needed service offering. When will it be ready for prime time?

