Search Engines vs. SEO Spam: Statistical Methods

Posted: March 3, 2010 – 6:44 pm

High placement in search engine results is critical to the success of any online business. Pages that appear higher in the results for queries relevant to a site’s business get more targeted traffic. To gain this competitive advantage, Internet companies employ various SEO techniques to optimize the factors that search engines use to rank results.

In the best case, SEO specialists create relevant, well-structured, keyword-rich pages that not only please a search engine crawler but also have value to the human visitor. Unfortunately, this strategic approach takes months to produce tangible results, so many search engine optimizers resort to so-called “black-hat” SEO.

‘Black Hat’ SEO and Search Engine Spam

The oldest and simplest “black-hat” SEO strategy is to add a variety of popular keywords to web pages so that they rank high for popular queries. This behavior is easily detected, since such pages generally include unrelated keywords and lack topical focus. With the introduction of term vector analysis, search engines became immune to this sort of manipulation. “Black-hat” SEO then went one step further and created so-called “doorway” pages: tightly focused pages consisting of a bunch of keywords relevant to a single topic. Thanks to their keyword density such pages can rank high in search results, but they are never seen by human visitors, who are redirected to the page intended to receive the traffic.

Another trend is the abuse of link-popularity-based ranking algorithms such as PageRank with the help of dynamically generated pages. Each such page receives the minimum guaranteed PageRank, and the small endorsements from thousands of these pages add up to a sizeable PageRank for the target page. Search engines constantly improve their algorithms to minimize the effect of “black-hat” SEO techniques, but SEOs persistently respond with new, more sophisticated and technically advanced tricks, so the process resembles an arms race.

“Black-hat” SEO is responsible for an immense amount of search engine spam: pages and links created solely to mislead search engines and boost rankings for client web sites. To weed out web spam, search engines can use statistical methods that compute distributions for a variety of page properties. The outlier values in these distributions can be associated with web spam. The ability to identify web spam is extremely valuable to search engines, not just because it lets them exclude spam pages from their indices but also because the identified pages can be used to train more sophisticated machine learning algorithms capable of battling web spam with higher precision.

Using Statistics to Detect Search Engine Spam

An example of the application of statistical methods to web spam detection is presented in the paper “Spam, Damn Spam, and Statistics” by Dennis Fetterly, Mark Manasse and Marc Najork of Microsoft Research [1]. They used two sets of pages downloaded from the Internet. The first set was crawled repeatedly from November 2002 to February 2003 and consisted of 150 million URLs. For each page the researchers recorded the HTTP status, the time of download, the document length, the number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were found, or 8.1% of the sample, with a margin of error of 1.95% at 95% confidence.

The second set was crawled between July and September 2002 and comprised 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: the URL and the URLs of outgoing links; for HTTP redirects, the source and the target URL. A sample of 535 pages was inspected manually, and 37 of them (6.9%) were identified as spam.
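The 1.95% figure for the first sample can be reproduced with the usual normal approximation for a sample proportion. The sketch below assumes that this is how the interval was computed; the function name is made up for illustration.

```python
import math

def proportion_margin(successes, n, z=1.96):
    """Return (sample proportion, half-width of the normal-approximation interval).

    z=1.96 corresponds to 95% confidence. Assumption: the sample is a simple
    random sample and large enough for the normal approximation to hold.
    """
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin

# 61 spam pages found among 751 manually inspected pages (Set 1 sample).
p, margin = proportion_margin(61, 751)
print(f"spam share: {p:.1%} +/- {margin:.2%}")
```

With 61 spam pages out of 751 this gives a share of about 8.1% and a half-width of about 1.95%, matching the numbers quoted for the sample.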

The research concentrates on studying the following properties of web pages:
– URL properties, including length and percentage of non-alphabetical characters (dashes, digits, dots etc.).
– Host name resolutions.
– Linkage properties.
– Content properties.
– Content evolution properties.
– Clustering properties.

URL Properties

Search engine optimizers often use large numbers of automatically generated pages to channel their small individual PageRank to a single target page. Since the pages are machine generated, we can expect their URLs to look different from those created by humans. The assumption is that these URLs are longer and include more non-alphabetical characters such as dashes, slashes or digits. When searching for spam pages we should consider the host component only, not the entire URL down to the page name.

A manual inspection of the 100 longest host names revealed that 80 of them belonged to adult sites and 11 to financial and credit-related sites. Therefore, to produce a spam identification rule, the length property has to be combined with the percentage of non-alphabetical characters. In the given set, 0.173% of URLs are at least 45 characters long and contain at least 6 dots, 5 dashes or 10 digits, and the vast majority of these pages appear to be spam. Changing the threshold values changes both the number of pages flagged as spam and the number of false positives.
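The rule above can be sketched as a small predicate. The thresholds come from the text; the function name, the parameter names and the sample host names are hypothetical, and the rule is applied to the host component only.

```python
def looks_machine_generated(host, min_length=45, min_dots=6,
                            min_dashes=5, min_digits=10):
    """Flag a host name that is long AND rich in dots, dashes or digits."""
    if len(host) < min_length:
        return False
    dots = host.count(".")
    dashes = host.count("-")
    digits = sum(ch.isdigit() for ch in host)
    # Any one of the three character counts reaching its threshold is enough.
    return dots >= min_dots or dashes >= min_dashes or digits >= min_digits

# Hypothetical host names, for illustration only.
print(looks_machine_generated("cheap-pills-best-price-buy-now-online-store.example.com"))
print(looks_machine_generated("www.example.com"))
```

Loosening or tightening the keyword arguments is the code equivalent of moving the thresholds: more pages get flagged, at the price of more false positives.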

Host Name Resolutions

One can notice that Google, given a query q, tends to rank a page higher if the host component of the page’s URL contains keywords from q. To exploit this, search engine optimizers stuff pages with URLs containing popular keywords and key phrases and set up DNS servers to resolve these host names to a single IP address. SEOs typically generate a large number of host names in order to rank for a wide variety of popular queries.

This behavior is also relatively easy to detect by observing the number of host names that resolve to a single IP address. In our set, 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs to 2 host names. There are also some extreme cases with hundreds of thousands of host names mapped to a single IP, and the record-breaking IP is referred to by 8,967,154 host names.

To flag pages as spam, a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in Set 2 are served from IP addresses referred to by 10,000 or more host names, and manual inspection of a sample showed that, with very few exceptions, they were spam. A lower threshold (1,000 name resolutions, or 7.08% of the pages in the set) produces an unacceptable number of false positives.
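A minimal sketch of this check, assuming the crawl yields (host name, IP address) pairs; the input format and the function names are assumptions made for illustration, not something the paper prescribes.

```python
from collections import defaultdict

def hosts_per_ip(resolutions):
    """Group distinct host names by the IP address they resolve to."""
    by_ip = defaultdict(set)
    for host, ip in resolutions:
        by_ip[ip].add(host)
    return by_ip

def crowded_ips(resolutions, threshold=10_000):
    """Return the IPs that at least `threshold` distinct host names resolve to."""
    return {ip for ip, hosts in hosts_per_ip(resolutions).items()
            if len(hosts) >= threshold}
```

Pages served from any IP in the returned set would then be flagged as spam candidates. With the threshold lowered to 1,000 the same code flags far more pages, which is where the false positives mentioned above come from.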

Linkage Properties

The Web, consisting of interlinked pages, has the structure of a graph. In graph terminology, the number of outgoing links of a page is its out-degree, while its in-degree is the number of links pointing to the page. By analyzing out-degree and in-degree values it is also possible to detect spam pages, which show up as outliers in the corresponding distributions.

In our set, for example, there are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected. Overall, 0.05% of the pages in Set 2 have out-degrees that are at least three times more common than the Zipfian distribution suggests, and according to a manual inspection of a cross section, almost all of them are spam.

The distribution of in-degrees is calculated similarly. For example, 369,457 pages have an in-degree of 1001, while according to the trend only 2,000 such pages are expected. Overall, 0.19% of the pages in Set 2 have in-degrees that are at least three times more common than the Zipfian distribution would suggest, and the majority of them are spam.
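One way to sketch the degree test is to fit a straight line to the degree histogram in log-log space (a Zipfian trend) and flag every degree that occurs at least three times more often than the fit predicts. The least-squares fit and the function names are assumptions made for illustration; the source does not say how the trend was estimated.

```python
import math

def fit_power_law(histogram):
    """Least-squares fit of log(count) = a + b * log(degree)."""
    points = [(math.log(d), math.log(c))
              for d, c in histogram.items() if d > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def degree_outliers(histogram, factor=3.0):
    """Map each over-represented degree to (observed count, expected count)."""
    a, b = fit_power_law(histogram)
    flagged = {}
    for degree, count in histogram.items():
        if degree <= 0 or count <= 0:
            continue  # log-log fit is undefined for zero values
        expected = math.exp(a + b * math.log(degree))
        if count >= factor * expected:
            flagged[degree] = (count, expected)
    return flagged
```

Here `histogram` maps a degree to the number of pages with that degree, and the same function serves for in-degrees and out-degrees. Large outliers pull an ordinary least-squares fit toward themselves, so a production version would probably want a more robust estimate of the trend.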

Content Properties

Despite the recent measures taken by search engines to diminish the effect of keyword stuffing, the technique is still used by some SEOs, who generate pages filled with meaningless keywords to promote their AdSense pages. Quite often such pages are based on a single template and even have the same number of words, which makes them especially easy to detect using statistical methods.

For Set 1 the number of non-markup words in each page was recorded, so we can plot the variance of the word count across the pages downloaded from a given host name. The variance is plotted on the x-axis and the word count on the y-axis; both axes use a logarithmic scale. Points on the left side of the graph, marked in blue, represent cases where at least 10 pages from a given host have the same word count. There are 944 such hosts (0.21% of the pages in Set 1). A random sample of 200 of these pages was examined manually: 35% were spam, 3.5% contained no text and 41.5% were soft errors (pages with a message indicating that the resource is not currently available, despite the HTTP status code 200 “OK”).
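The word-count check can be sketched as follows. The sketch reads the criterion as "a host with at least 10 downloaded pages whose word counts have zero variance", which is an interpretation of the text above; the input format and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance

def uniform_word_count_hosts(pages, min_pages=10):
    """Return hosts with >= min_pages pages that all share one word count.

    `pages` is an iterable of (host name, non-markup word count) pairs.
    """
    counts = defaultdict(list)
    for host, word_count in pages:
        counts[host].append(word_count)
    return {host for host, wc in counts.items()
            if len(wc) >= min_pages and pvariance(wc) == 0}
```

As the manual sample shows, a hit is only a candidate: a large share of such hosts turned out to serve soft errors or empty pages rather than spam.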

Content Evolution

The natural evolution of content on the Web is slow. Over a period of a week, 65% of all pages do not change at all, while only 0.8% change completely. In contrast, many spam pages are generated in response to an HTTP request independently of the requested URL and change completely on every download. By looking at extreme cases of content mutation, search engines are therefore able to detect web spam.

The outliers represent IPs serving pages that change completely every week. Set 1 contains 367 such servers with 1,409,353 pages. A manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, 2 were soft errors and 1 was an adult page counted as a false positive.
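A rough sketch of this test, assuming each download of a page is reduced to a set of content features (for example, hashed word sequences) and that "changes completely" means two successive downloads share no features. Both the representation and the function names are assumptions; the paper's change vector may be defined differently.

```python
def change_fraction(old_features, new_features):
    """Share of features that differ between two downloads of one page."""
    old, new = set(old_features), set(new_features)
    union = old | new
    if not union:
        return 0.0  # two empty pages count as unchanged
    return 1.0 - len(old & new) / len(union)

def always_changing_ips(downloads_by_ip, min_pages=1):
    """Return IPs for which every observed page changed completely.

    `downloads_by_ip` maps an IP to a list of (old_features, new_features)
    pairs, one pair per page downloaded twice.
    """
    flagged = set()
    for ip, pairs in downloads_by_ip.items():
        if len(pairs) >= min_pages and all(
                change_fraction(old, new) == 1.0 for old, new in pairs):
            flagged.add(ip)
    return flagged
```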

Clustering Properties

Automatically generated spam pages tend to look very similar. In fact, as noted above, most of them are based on the same template and have only minor differences (such as varying keywords inserted into the template). Pages with such properties can be detected by applying cluster analysis to our samples.

To form clusters of similar pages, the “shingling” algorithm described by Broder et al. [2] is used. Figure 7 shows the distribution of cluster sizes for near-duplicate pages in Set 1. The horizontal axis shows the size of the cluster (the number of pages in the near-equivalence class), and the vertical axis shows how many such clusters Set 1 contains.

The outliers fall into two groups. The first group did not contain any spam pages; pages in this group relate more to the duplicated content issue. The second group, in contrast, is populated predominantly by spam documents. Of the 20 largest clusters, 15 were spam, containing 2,080,112 pages (1.38% of all pages in Set 1).
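The core idea of shingling can be sketched in a few lines: break each page into overlapping runs of w consecutive words (shingles) and measure the overlap of the two shingle sets. This simplified version compares full shingle sets directly, which would not scale to millions of pages; the shingle width and the sample sentences are arbitrary choices for illustration.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles of a text (lower-cased)."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(text_a, text_b, w=4):
    """Jaccard overlap of the two shingle sets, between 0.0 and 1.0."""
    a, b = shingles(text_a, w), shingles(text_b, w)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two hypothetical pages built from one template with a swapped keyword.
page_a = "buy cheap widgets online today with free shipping and great prices for everyone"
page_b = "buy cheap gadgets online today with free shipping and great prices for everyone"
print(resemblance(page_a, page_b))
```

Pages whose resemblance exceeds a chosen cut-off would be placed in the same cluster. Template-generated pages that differ by a single keyword keep most of their shingles in common, which is why they end up in large clusters.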

To Sum Up

The methods described above are examples of a fairly simple statistical approach to spam detection. Real-life algorithms are much more sophisticated and are based on machine learning technologies, which allow search engines to detect and battle spam with relatively high efficiency at an acceptable rate of false positives. Applying spam detection techniques enables search engines to produce more relevant results and ensures fairer competition based on the quality of web resources rather than on technical tricks.

References:

1. Dennis Fetterly, Mark Manasse, Marc Najork. “Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages” (2004). Microsoft Research.

2. A. Broder, S. Glassman, M. Manasse, and G. Zweig. “Syntactic Clustering of the Web”. In 6th International World Wide Web Conference, April 1997.

Oleg Ishenko



The Eye Catching Variety in Romance Writing

Posted: February 24, 2010 – 10:26 am

Romance writing has always been a popular genre of literature because its subject is very close to people’s hearts. It has always been undertaken by people who feel a passion for setting the stage and directing the drama, all in writing. It is a field with very many players, and the results have been very impressive. Many writers have been able to win over even reluctant readers with juicy narration that can only be described as captivating. The bar was set very high by writers who are long gone. Their creative tales of past romance are covered with an erotic innocence that will not fade away soon. Writers like William Shakespeare were known for their artistic seventh sense as they ventured into a world of fiction and reality combined. There is so much history to look at when it comes to writing. For any writing to progress, it is vital to look at history in order to judge how far we have come. We have come a long way, and the good news is that romance gets better with age.

Contemporary romance writing has taken center stage. It is becoming the specialty of all age groups and genders, and this diversity has only brought progress to the world of romance fiction. Some writers use a historical edge to bring out lifelike characters who make a stronger impression. Writers like Virginia Henley have done this perfectly; she is known for setting her romances against a historical theme and influence. If you are the kind of person who enjoys this kind of realism, this romance writing is for you. Romance writing comes in many subgenres. As I have just mentioned, historical romance is still very popular today. We also have the modern or contemporary genres, which are becoming more and more dynamic. Erotic romance is another form of writing, which mainly focuses on the strong sexual attraction between the main characters. Many readers really appreciate this kind of writing because it is considered more practical.

Paranormal romance and science fiction romance are other categories of the genre. They feature extraordinary elements intertwined with a romantic story. In the modern world, people are becoming more and more interested in the paranormal, and these twists have definitely brought romance writing to another level. There are very many categories to choose from, and you can be assured of amazing reads. Get to know the kind of writing you are most interested in. This way, you will be able to cut to the chase and know the kind of story you want to read. For more insight or guidance on reading such writing, you can join a book club. If you do not have time to meet, you can join an online book club, which will provide the support and guidelines you need to start reading.

Francis Githinji



The Santa Brand: How Does Santa Stack Up Against The Pillsbury Dough Boy?

Posted: January 30, 2010 – 7:45 pm

An Entrepreneur’s Guide to Getting Noticed in a Noisy Marketplace

My daughter, the one I affectionately call Daughter Number 2, recently challenged herself to participate in a high school Debate Tournament, following in her mother’s footsteps. The topic? Be It Resolved that Santa Claus is a Dangerous Concept Which Should be Abolished. So, 6 AM, the morning of the debate, I’m surfing the net for stories of bank robberies and kidnappings by men in Santa suits. It didn’t take long before I got sidetracked onto something even better: a bunch of articles on The Santa Brand. (Let the kid do her own research!)

Gotta admit, it never occurred to me before, but Mr. Claus fits most of the criteria I set out in my upcoming book “Step Into The Spotlight! -’Cause ALL Business is Show Business!” (Publication Date: April 2008), criteria for developing a dynamic business persona using showbiz techniques.

In show business, actors, directors and playwrights spend a lot of time on character development. In business, we call this building a brand. A business persona, just like a character in a play, needs a unique look (white beard, rosy cheeks, an enlarged perimeter), a unique costume (Red Suit, much better for branding than Banker Blue), a unique name (Santa Claus), a clearly defined personality (Jollier than the Jolly Green Giant), a strong philosophy (You gotta be nice, not naughty) and the guy’s gotta know his lines and stick to the script (“Ho, Ho, Ho!”).

Santa does all that. And the guy’s consistent. You never see him in a blue Hawaiian shirt, even if he’s hanging out at the Honolulu Hilton in December. Try leaving your scarf or gloves or umbrella at a Chamber of Commerce Networking Breakfast. Would everyone immediately know to whom it belonged? They would if you forgot your red velvet hat with a dangling white pom-pom!

The Pillsbury Dough Boy, The Maytag Repairman and The Man from Glad also each have a consistent look and OK, the Dough Boy is irresistible. But none of these characters has the emotional connection with their audience that Santa has. And it doesn’t matter how many times you’ve seen Santa’s show, you’ll be sitting in the front row again next December. The Maytag and Glad guys stand for dependability, but Santa’s not only dependable, he stands for hope as well. Ask any kid on December 24.

Speaking of kids, why is it that we let our kids sit on the laps of strange men in department stores? Why is it that year after year, chubby red suited guys get away with “naughty” deeds like robbing banks and kidnapping kids? Why? Because Santa is such a strong brand that not only kids, but adults, lower their guard and trust the guy. We even leave the guy milk and cookies by the fireplace and encourage him to break into the house when we’re all asleep. Even the Grinch Who Stole Christmas eventually succumbed to his charm as did the journalist who wrote “Yes, Virginia, there is a Santa Claus”. What does he stand for? Goodness and kindness and “pull out your wallet”.

Santa even knows how to work publicity. Many would disagree, but my philosophy has always been that it’s hard to burst onto the scene if you’ve been hanging around on stage all along! Santa doesn’t try to get ink 365 days a year. He lets Cupid have Valentine’s Day, lets the chicks and bunnies arm wrestle over Easter, leaves Thanksgiving to the turkeys and only then, bursts onto the scene after the stuffing’s been stuffed away.

But we’re talking business. You’re probably thinking, “Yeah Tsufit, but can the guy make money?” Yah Man! Actors are always asking their director “What’s my motivation?” and the classic joke answer is “To get paid”. Santa knows how to bring in the bucks as well as the next guy, better even. But there’s one question nobody seems to be asking. Who’s he making money for?

The major downside of the Santa Brand is that, unlike the Pillsbury Dough Boy or the Man from Glad or the Maytag Repairman, Santa will work for anyone. (You’d never catch the Maytag guy hawking computers on the side.)

I recently snuck out of a marketing seminar to visit the Coca Cola Museum in Atlanta and learned that although the Claus-ster’s been around for ages, Coke gave the guy his current look, Coca Cola Red suit and all, way back in the 1930s and put him to work selling The Real Thing. But just as Kleenex became just another tissue and Zipper became just another fastener, Generic Red Suit Santa started raking it in for anyone who wanted a piece of the action.

It’s nice that he lends his name to charity and stands on street corners pulling in bowls of dollars for the Salvation Army and unwrapped new toys for unfortunate kids. But that’s where I’d draw the line if he were my brand. In Showbiz, unique characters are the show’s best currency. If the character of Ugly Betty started showing up on Grey’s Anatomy and The Gilmore Girls and Desperate Housewives, it wouldn’t be long before she’d lose her draw.

The lesson here? Develop a clear living breathing persona for your business, but make sure it’s your brand, one that has a unique look, philosophy and connection with the crowd so people will pull out their wallets for you too. Before you know it, you’ll be rolling in more dough than the Doughboy!


TSUFIT



Copyright © 2008 How to Draw Characters. All rights reserved.