10 November 2014 - Artadys

8.13 TrustRank (10:05)

So now we will start talking about how to combat web spam, and the method we will talk about is called TrustRank. As we mentioned before, there are two types of web spam: term spam and link spam. Term spam you can think of as very similar to email spam filtering: basically we would like to identify the terms, the words, that correspond to spam, and for every web page automatically decide whether it contains a set of those words and, if so, filter the web page out. Combating link spam, however, is more expensive, or more tricky. The idea here would be to detect or blacklist a set of graph structures on the web that look like spam farms. The problem is that this leads to another war: if we had such a filter that tried to identify structures that look like spam farms, then spammers would try to hide themselves, we would have to update the blacklist, and everything would go back and forth. So we need a more universal solution, and that more universal solution is called TrustRank. In its basic idea, TrustRank is simply topic-specific PageRank with the teleport set being a trusted set of web pages.

So when I say a trusted set of web pages, I mean web pages that we trust and that are not likely to be hacked or turned into spam farms; for example, .edu domains, or similar domains of non-US schools, would be one way to identify a good set of trusted web pages. So let's think about this a bit more and see what the idea is. The idea behind TrustRank is the notion of approximate isolation: it is rare for a good page to point to a bad page. It is very rare that a good, legitimate page would point to spam. So the idea is that we select a set of seed pages from the web that we trust to be good, and then we compute some kind of similarity, or importance, or proximity of all the other web pages on the web to this trusted set; if some other web page is legitimate, then it will be close to the trusted set of web pages. Of course, what we are depending on here is that we have some oracle, some human, to identify a good, trustworthy set of web pages, and of course the idea is that web spam pages are not in this set.

This is, of course, a very expensive task, because a human would have to go through and do this labeling, so in some sense we want to make the set of seed trusted web pages as small as possible. The idea then is the following: basically what we will be doing is trust propagation. We have a subset of good seed pages that we identify as good, we call them trusted pages, and then we perform, as I mentioned before, topic-sensitive PageRank with the teleport set being the trusted pages. What this really means is that we are propagating trust across the links of the graph, and the trust can take a value between 0 and 1, the same way as the PageRank score. A small sketch of this propagation is given below.
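As a concrete illustration, here is a minimal sketch of that trust propagation in Python. It is not code from the lecture: the graph format (a dict of out-neighbor lists), the seed set, and the parameter values are all illustrative assumptions; the point is only that the teleport distribution is nonzero on the trusted pages and zero everywhere else.

```python
# Minimal TrustRank sketch: topic-specific PageRank whose teleport set is the
# trusted seed pages. Assumes every node appears as a key in `graph`
# (dead ends have an empty out-neighbor list).

def trust_rank(graph, trusted, beta=0.85, iters=50):
    """graph: dict node -> list of out-neighbors; trusted: set of seed nodes."""
    nodes = list(graph)
    # Teleport distribution: uniform over the trusted set only.
    teleport = {v: (1.0 / len(trusted) if v in trusted else 0.0) for v in nodes}
    r = dict(teleport)                            # start the trust at the seed pages
    for _ in range(iters):
        r_new = {v: 0.0 for v in nodes}
        for v in nodes:
            out = graph[v]
            if out:
                share = beta * r[v] / len(out)    # trust splitting over out-links
                for w in out:
                    r_new[w] += share
            else:                                 # dead end: send its mass back
                for w in nodes:                   # to the trusted teleport set
                    r_new[w] += beta * r[v] * teleport[w]
        leaked = 1.0 - sum(r_new.values())        # the (1 - beta) teleport mass
        for v in nodes:
            r_new[v] += leaked * teleport[v]      # teleport only to trusted pages
        r = r_new
    return r
```

Note how the factor beta and the division by the number of out-links in this sketch correspond exactly to the trust attenuation and trust splitting that are discussed below.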

One possible way to identify spam would then be the following: we take the set of trusted web pages, we compute the personalized or topic-specific PageRank score with the teleport set being the set of trusted web pages, and we look at this score for every page on the graph; if the score of some page on the web graph is smaller than some given threshold, then we go and cut that page away from our data set and we say that it corresponds to spam. The problem with this method, though, is that we cut away all the web pages that have a low PageRank score with respect to the trusted set, and a web page can have a low score either because it is spam or simply because it is new and has just been born. So the question is: why is this a good idea, and can we later improve on it? Let's first talk about why this idea of a personalized PageRank score from a given trusted set of web pages is a good way to detect link spam. The first notion is trust attenuation, where the degree of trust conferred by a trusted page decreases with the distance from that page in the graph; the farther away a given page is from the trusted set of pages, the lower the trust it receives will be.

Another important notion that also works in our favor is trust splitting: the larger the number of out-links from a page, the less scrutiny the page author gives to each out-link. In a sense, what this means is that if a page has a lot of trust but also has a lot of out-links, then this trust gets split into small chunks and distributed over the target pages, and this is exactly what happens in topic-specific PageRank with the trusted set of web pages being the teleport set. So now let's quickly discuss how, in practice, we go about picking a seed set. In picking a seed set we have two somewhat conflicting considerations. On one hand, we would want the seed set to be as small as possible, because human labeling of web pages as trusted or not is very expensive and we want to use as little human time as possible. On the other hand, we would like the seed set to be as big as possible, because we want to cover all the good pages on the web; ideally we would like to put every non-spam page into our seed set, but that would mean the seed set is huge. So the question is how we balance out these two competing or conflicting goals.

The idea is the following. Imagine we want to select a seed set of k pages. One idea would be, for example, to pick the top k pages on the web according to PageRank, the hope being that this top small fraction of web pages really are the truly important pages on the web, and we label those as the seed set. Another idea, which I briefly mentioned before, is to use a set of trusted domains whose membership is controlled by some trusted organization, for example the .edu, .mil, or .gov domains. These are domains where not just anyone can register, so we would trust all the pages in these domains to be good pages that do not spam and do not point to other spammy pages. One way to create a trusted set of web pages would then simply be to take all the web pages at these domains; a small sketch of both seed-selection strategies is given below.
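As a small illustration of these two strategies, here is a hedged sketch in the same Python style as before; the suffix list, the helper names, and the assumption that an ordinary PageRank score dictionary is already available are all illustrative, not from the lecture.

```python
# Two illustrative ways to pick a candidate trusted seed set.
from urllib.parse import urlparse

TRUSTED_SUFFIXES = (".edu", ".gov", ".mil")      # controlled-membership domains

def seeds_by_pagerank(pagerank, k):
    """Strategy 1: take the top-k pages by ordinary PageRank and have a
    human verify that they are trustworthy."""
    ranked = sorted(pagerank.items(), key=lambda kv: kv[1], reverse=True)
    return [url for url, _ in ranked[:k]]

def seeds_by_domain(urls):
    """Strategy 2: trust every page hosted on a controlled-membership domain."""
    trusted = []
    for u in urls:
        host = urlparse(u).hostname
        if host and host.endswith(TRUSTED_SUFFIXES):
            trusted.append(u)
    return trusted
```

In practice one could also combine the two, for example by taking the top PageRank pages restricted to the controlled domains, but that is a design choice rather than something prescribed by the method.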

What is also interesting now is that we can actually take our initial idea of identifying web spam and extend it a bit, by creating the notion of spam mass. The idea is the following: we will again use TrustRank as a model, start with a set of good, trusted pages, and propagate the trust, but now we will flip our reasoning and use two complementary views. Our goal will be to ask what fraction of a page's PageRank comes from spam pages. So what we would like to do is not to ask how close a page is to the trusted part of the web, but rather to estimate what fraction of the PageRank score of a given page comes from spam. In practice we do not know the answer to this question, so we need to estimate it. The way we can think about this is the following: we have the web page we are interested in, shown here as a red circle, we have a trusted set of web pages, and we have the full web, and our goal is to estimate what fraction of the PageRank score of our red node comes from spam pages. We can proceed as follows: we first compute r_p, the PageRank of our node p, and then we also compute r_p^+, the PageRank of the same node where the teleport is done only to the trusted set of web pages.

Now we can say that the spam mass of a given web page is based on the difference r_p minus r_p^+. What does this mean? Basically, as we said, we compute the PageRank score of the page using simple PageRank, where the teleportation is uniform over all pages, and we also compute the PageRank score of the page where we always teleport to the trusted set of web pages. If a web page really is spam, then this difference will be high: there are a lot of other web pages on the web that we do not trust that are boosting the importance of that page. This way we can define the notion of the spam mass of a page, which is simply the ratio of the amount of PageRank the page receives from the untrusted part of the web, r_p minus r_p^+, to its overall PageRank score r_p. We can then go and proclaim all the pages that have a high spam mass fraction as spam and remove them from our web corpus. This idea is better than the first solution to our problem, because here the question of whether a page is spam or not does not depend on the absolute value of its PageRank score; rather, it depends on

the relative value: we compare how much of the PageRank score of a page comes from the trusted part of the web and how much of its PageRank comes from the untrusted part of the web, and the ratio between these two quantities tells us how spammy a web page is. A small sketch of this computation is given below.
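Putting the two PageRank computations together, a minimal sketch of the spam-mass idea might look as follows. It reuses the hypothetical trust_rank function from the earlier sketch; the plain PageRank score is obtained here by calling it with every node marked as trusted, which reduces the teleport to the uniform one, and the threshold value is purely illustrative.

```python
def spam_mass(graph, trusted, threshold=0.5):
    """Relative spam mass (r_p - r_p^+) / r_p; pages above the threshold
    are flagged as likely spam. `threshold` is an illustrative choice."""
    r = trust_rank(graph, trusted=set(graph))    # uniform teleport = plain PageRank
    r_plus = trust_rank(graph, trusted)          # teleport only to the trusted set
    flagged = {}
    for p in graph:
        if r[p] > 0:
            mass = (r[p] - r_plus[p]) / r[p]     # fraction of rank from untrusted web
            flagged[p] = mass > threshold
    return flagged
```

Note that for pages close to the trusted set the quantity r_p^+ can exceed r_p, so the spam mass can even be negative; such pages are of course not flagged.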
