Folks, we are approaching a mega information clutter in the near future. There will be trillions of Web pages. People will have petabytes (quadrillions of bytes) of information on their local computers, and it will look like the biggest mess ever piled up in the history of human civilization. A big portion of this information is junk: irrelevant, accidental, and of bad quality.

It is as if we are building cities without any sewers, gutters, garbage collection, or sanitation systems.

Junk email filters and antivirus programs, meanwhile, actually encourage more information pollution by giving us a false sense of orderliness. It is analogous to recycling aluminum cans: recycling makes consumers feel good about purchasing more of them, and we are seeing a boom in “clean technology.” There is no “information cleaning” technology, however. There never has been, and there never will be.

Our last hope is the “finding” technologies – namely, the search engines. The mess will still be there, but with better search engines we might be able to steer around the garbage. Therefore, the critical question is: how much better must the search engines get to save the day?

To answer this question in general principles, we can postulate that a search engine must understand what is going on, which requires algorithms capable of understanding content. This is what experts in the search field have called Semantic Search Technology (SST). I like the abbreviation SST; it sounds like a wake-up whisper! Such a technology would not merely understand the meaning of words and sentences, but would also assess their relevancy, accuracy, and so on. Right now, it is an ideal. More in a sec.

This is to be distinguished from the so-called “Semantic Web,” which refers to technology that embeds the meaning of content into the documents themselves. It is somewhat less ambitious, and it is widely heralded today. But what about information pollution? Belief in the Semantic Web assumes that people posting articles on the Web will be nice and educated enough to follow the “mysteriously unified” standards of semantic structuring. And it only applies to Web documents, not to those on your hard drive. Obviously, this view is too optimistic and idealistic. The Semantic Web is a document formatting idea, whereas SST is a computer algorithm that does not rely on people’s best behavior while they generate new documents.

So let’s stay with SST.

When SST is implemented properly and content detection capability reaches satisfactory levels, search engines will be able to sift through massive data and pull out relevant answers on a consistent basis. To the end user, the garbage (or unrelated content) will be invisible during the search. And once it is invisible, information pollution becomes a dormant disease. Accordingly, mastering SST seems to be the most direct line of attack on the information pollution problem, or better put, a way to steer around it.
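The idea of garbage becoming “invisible” can be sketched in a few lines. The following is a toy illustration, not a real SST implementation: the hand-written vectors and the threshold are pure assumptions standing in for whatever semantic representation a real engine would learn, and only documents that clear a similarity threshold against the query are ever shown to the user.

```python
import math

# Hypothetical "semantic" vectors for a few documents; in a real system
# these would come from a learned model, not hand-written numbers.
DOCS = {
    "relevant article":  [0.9, 0.1, 0.2],
    "borderline page":   [0.5, 0.4, 0.3],
    "garbage/spam page": [0.05, 0.9, 0.8],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_filter(query_vec, docs, threshold=0.5):
    """Return only documents similar enough to the query; everything
    below the threshold stays invisible to the end user."""
    return [name for name, vec in docs.items()
            if cosine(query_vec, vec) >= threshold]

query = [1.0, 0.1, 0.1]   # hypothetical query embedding
print(semantic_filter(query, DOCS))
```

The spam page never appears in the result list; the pollution is still on the Web, but the user steers around it.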

Although there is a substantial amount of academic research, the current state of SST in the application field is in its infancy. There is a good chance, however, of seeing significant advances in Web search in the next couple of years, as I mentioned in an earlier blog posting.

Information pollution thrives in sub-optimal environments. It is either created out of carelessness, or produced by people who have a certain expectation of return, banking on the leakages in this sub-optimal environment (see Dr. Jakob Nielsen’s earlier postings). Junk email, for example, has a tiny return, yet it is still a viable advertising strategy because some people still read it. But if you erase that tiny return with new technology, then the junk producers will eventually go out of business or lose interest.
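The “tiny return” argument is just break-even arithmetic. The numbers below are hypothetical, chosen only to illustrate the shape of the economics: as long as the expected revenue per message exceeds the near-zero cost of sending it, spam stays viable.

```python
# All figures are assumptions for illustration, not measured data.
cost_per_message = 0.00001   # assumed cost of sending one junk email
revenue_per_sale = 20.0      # assumed payoff per converted reader

def breakeven_response_rate(cost, revenue):
    """Fraction of recipients who must respond for spam to break even."""
    return cost / revenue

rate = breakeven_response_rate(cost_per_message, revenue_per_sale)
# With these assumed numbers, a response from well under one in a
# million recipients is enough to keep the spammer in business.
print(f"break-even response rate: {rate:.7%}")
```

New technology that pushes the real response rate below this break-even line is what drives the junk producers out.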

Similarly, if SST makes it impossible for bad-quality information to rank high in search engines, then the producers of bad-quality information will eventually get discouraged. This, however, requires a period of consistency with a proven SST implementation industry-wide.

The challenges ahead are sizable, but they can be met step by step. The most important among them is having the vision to build search engines from the ground up, inventing new infrastructures and middleware to sustain scalable semantic computations. A patch job over existing technologies will most likely suffer from their inherent limitations.

The other positive development for reducing information pollution is the natural evolution from “push” to “pull.” Before on-line advertising was invented, all promotions were in “push” mode: marketers carpet-bombed us with messages. After all, if someone sends you an email about “balding in aging men” and you fit that description, then it may not be junk email to you. Thus, destination accuracy is also what determines whether junk succeeds.

With on-line advertising, you are exposed to ads according to your query or the page content. This is the “pull” mode. You may still get an ad about “balding in aging men” for your query of “bald eagles” today, but this can be fixed via SST in the short run.
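The “bald eagles” mismatch above makes a nice toy example. In the sketch below, assumed and hand-written for illustration, a naive keyword matcher pairs the query with the ad because the two share a word stem, while a crude word-sense check, standing in for SST, rejects the pairing because the texts share no common sense.

```python
# Hand-written sense tags; a real SST system would use a learned or
# curated lexicon, not this toy dictionary.
SENSES = {
    "bald": {"animals", "hair"},   # ambiguous on its own
    "eagles": {"animals"},
    "balding": {"hair"},
    "aging": {"hair", "health"},
    "men": {"hair", "health"},
}

def keyword_match(query: str, ad: str) -> bool:
    """Old-style matching: any shared 4-letter word stem counts as a hit."""
    stems = lambda text: {w[:4] for w in text.lower().split()}
    return bool(stems(query) & stems(ad))

def text_senses(text: str) -> set:
    """Intersect the possible senses of the known words in a text."""
    sets = [SENSES[w] for w in text.lower().split() if w in SENSES]
    return set.intersection(*sets) if sets else set()

def semantic_match(query: str, ad: str) -> bool:
    """Toy SST: query and ad must share at least one word sense."""
    return bool(text_senses(query) & text_senses(ad))

q, ad = "bald eagles", "balding in aging men"
print(keyword_match(q, ad))    # stem overlap on "bald" pairs them
print(semantic_match(q, ad))   # sense check rejects the pairing
```

A query like “balding cure” would still match the ad under the sense check, which is exactly the “pull” behavior the paragraph describes.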

Once the promoters, the spammers, and the no-credibility content are taken out of the pollution equation by SST, semantic search may save the day. But the subject of information pollution actually goes deeper than that. Take a credible content source that says “There are WMD in Iraq”… How do you filter that?