Wednesday, January 17, 2007

When Google turns against you...

Google began as a research project in January, 1996 by two students in Stanford University, California. Larry Page and Sergey Brin (the two students) believed they had a better idea for searching the Internet than the existing ones which ranked web sites based on the number of times a search term appeared on each page (so the more keywords you had in your home page, the higher rank you got - obviously inaccurate and exploitable). The Google Search Engine actually analyzed the relationships between websites (how and how many sites linked to another site). It went public in 1997.

Today, the year is 2007. Almost, ten years have passed and Google is estimated to hold 70-78% of the Internet market. It has become a synonym for web searches and is even, unofficially, a verb ("Google 'this' to know more"). It's one, big, central point (hence "point of failure") which searches and indexes the entire World Wide Web. Some call it the "front page of the Internet". People would kill for their sites to appear on the very first results for a keyword. And on the other hand, "if you site is unreachable by Google then it doesn't exist". This really changed the Internet in the late decade but what about now? There truth is that various problems are coming up.

Google can be fooled or (worse) manipulated.

A couple of days ago I read an interesting article on Javalobby which talked about how their forums got spammed and how the Google flagged them and removed them from its results. The guy who posted the story on Slashdot titled it "When Your Site Ceases to Exist". But, let's take things from the top. Javalobby maintains a forum with parts of it where unregistered (and anonymous?) users may post. That's a BIG mistake. It's also very naive (I'll expand on this another time). As a result, spammers exploited that and filled them with about 50.000 messages advertising pills, porn and gambling. Very soon Google's filters picked that up and considered their site yet another "bad apple", burying it down into the ground. How did that affect the site? According to the article's author, they lost about 10.000 visits a day which came from search results references. That's a tragedy for someone who has invested time and money so that users can find his site when looking for certain things. Of course he didn't have a contract or any other agreement with Google. He just trusted it to do the job right.

Yesterday, I read another interesting story on GNUCITIZEN. According to this, their site was down a couple of days due to technical difficulties, displaying an automated error page (Wordpress default error). It's the same error page any site would show if using the same software they did. So the author discovered that Google correlated his site with others showing the same error because it thought they displayed the same content. Technically they did. It was the same HTML page so, based on absolute deterministic logic, they should belong in the same group. The problem is they don't and, worse, even after the problem was fixed and the normal home page was back on, Google kept grouping the site with irrelevant ones. The author goes even further thinking this from a security point of view. He says that if an attacker sets up a couple of web sites (with minimal cost) displaying the same error page from above, plus some Pay per Click advertisements, Google will "work" for him, grouping them with other legitimate sites, which may hold a very high rank in certain keywords. So if an average user types those keywords he'll a list of search results containing the legitimate sites, followed by the attacker's. As a result the attacker with little time and cost will have managed to steal those keywords in his favor and have malicious content showing up in the first page of search results. Imagine porn advertisements in the same results page as the link to "" when someone searches for "IBM". And Google will have been his accomplice.

So... what do we have here? It has been clearly demonstrated that Google is a single point (of failure) which can be fooled and manipulated. Nowadays if Google can't find you, you don't really exist. Does it have to be that way?

The answer is NO and it comes from the world of peer-to-peer (P2P) systems.

A lot of people know and use digg. It's a website with no actual content but which allows any user to post a link about another website and a short description. Then, other users who find that link interesting, place a positive vote on that post (not on the user). If they don't, they place a negative vote. And when some third user visits digg, he is presented with a portal containing the latest of everything on the Internet (and links to them). There's a threshold (can be customized per user) on what you see. You get posts that collected a large number of positive votes and miss all the others. You can also vote positively or negatively on one or more posts, shaping that way its ranking. The result is that you read stuff that's interesting for most people and skip the rest.

And what about It's a website with no content of its own (like digg) where users keep records of their personal bookmarks. It works this way: you come across a site you want to bookmark, you place it on and tag it (that is characterize it using keywords). Then, you or everybody can select a tag and find all listings under it. For example you find a funny link about computer games. You post it and tag it using the keywords "funny", "computer" and "games". After a month or so you may select "funny" to see all links that you considered to be funny and so can everybody else.

Isn't that searching the Internet? The difference is that it is users who do the job and not some stupid bot. This isn't P2P on a system level but on a human one. It's not that there are a lot of computers searching and exchanging results and listings but it's humans who do that. If I see something worth looking to, I'll post it so that my (Internet) friends can see it. Then, each of them will promote it (by voting it or tagging it) so that his (Internet) friends can see it and so on. Nobody waits a single source for information.

In both cases from above crowdsourcing is applied. That is both digg and are empty vessels, filled with user activity from all over the globe. Pages are commented and ranked using democratic votes. If one posts something bad, he gets all negative votes and it's buried under the good posts. If one makes a bad comment on a good post, the comment itself is buried by negative votes on the comment and not on the post. It's all very amazing.

OK it's not like "I want to know about Company ABC" but it's not like YET. Imagine a similar site as big as Google where people registered and tagged everything. Then there would be an "ABC" tag and not just that. There would be comments on the company, it's products, it's site, etc. How about that?

Isn't that a couple of generations ahead from looking at a huge list with nothing but a couple of URLs (Google Style)?

Now that we are talking about it there is another, more obvious, P2P model for searching. A distributed Search Engine. Each user's browser contains a small "agent" which looks at a small piece of the Internet, indexing sites. Then, it communicates with it's (network) neighbors (browsers) informing them of its findings. They do the same thing. So when the user (human) types something in that search engine (the interface of which resides in his computer), it knows where to find it or, a least, who to ask. Then it gets multiple results and even mirrors of the source along with ranks and tags like the ones I've just talked about.

When you are looking for something, wouldn't you prefer an option some guy you trust has recommended? Well, that's what I am talking about! I'm talking about the future.

No more "Google can't find me, therefore I don't exist". We won't have to care about that. If you exist you will be found. Or maybe the saying will be shaped into this: "I'm not interesting enough, therefore I don't exist". Haha we are talking about the ultimate democracy. Which government or agency will be able to suppress such system?

When you are trying to censor or manipulate content that the entire planet reviews and comments on, the battle is lost before it even begins.

Of course there a lot of security issues here. Will someone be able to poison such system by deploying thousands or millions of user-imitating bots? Will someone be able to run a DDOS attack by manipulating the system (Slashdot Effect or Digg Effect)? Will phenomenons like "psychology of the crowd" or "gossiping" take over? These (and a lot more) are factors we should really take into account but decentralizing information databases and adopting a more open model when it comes to content management and distribution is certainly the way of the future. I'll get back to the security portion of this some time.

To sum up, the Google algorithm in its current form may not able to handle well the Internet of today and certainly won't be able to do so in the future. Like two students from California revised in 1997 the way we searched the Internet, maybe it's time again to make the next step forward into something as radical and advanced as Google was for that time.

No comments: