Data Mining, Quant, Statistics, Computer Science: Jobs, Resumes, Directory

Click Fraud 2007: New Types of Attacks, New Detection Strategies

Types of “weird” clicks

In the past, most undesirable clicks could be identified through simple features such as anonymous proxy IP addresses in remote countries, velocity spikes, abnormal traffic volumes in some segments such as high-paying keywords, abnormal query-to-click ratios, or unsophisticated homemade robots.
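
As an illustration of one of these simple features, a velocity spike check can be sketched as follows. This is a minimal sketch and not a description of any production system: the function name `velocity_flags`, the 60-second window and the threshold of 5 clicks are assumptions chosen for the example.

```python
from collections import defaultdict, deque


def velocity_flags(clicks, window_seconds=60, max_clicks=5):
    """Return the set of IPs with more than max_clicks clicks in a sliding window.

    clicks: iterable of (timestamp_seconds, ip) pairs, sorted by timestamp.
    window_seconds and max_clicks are illustrative values, not tuned thresholds.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in clicks:
        window = recent[ip]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while window and ts - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_clicks:
            flagged.add(ip)
    return flagged
```

A check like this only catches crude robots. An attack that limits itself to a few clicks per IP per day, like those described below, stays far under any such threshold.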

All the parties in the PPC industry have gained experience, and this has changed the overall picture. Savvy advertisers selling to domestic clients receive far fewer international clicks, and the use of out-of-country anonymous proxies is declining on 1st tier networks. Instead, we see new trends emerging:

Search distribution partners associated with 1st tier search engine networks deliver real traffic but substantially boost their revenue with artificial clicks. These clicks used to come from paid humans, but we have recently noticed an increase in botnet usage, as botnets are easier to coordinate. For instance, a scheme may require each IP to click twice a day, targeting a different advertiser each day; this is easier to achieve with botnets. Some of the most widespread botnets target IE but not Firefox.

Search distribution partners associated with 2nd tier search engine networks are getting smarter too. We have seen a noticeable increase in transparent proxy usage to perpetrate click fraud. In its simplest form, the fraud scheme is very crude, essentially generating massive amounts of clicks through AOL accounts. More refined fraud schemes in this category involve botnets or human beings “sitting” on Comcast or AOL IP addresses, generating only a few clicks against each advertiser each day.

3rd tier search engines are still affected by low-grade click fraud. In some instances, fraud removal has been over-conservative and has cost the search engine revenue, because the click fraud detection algorithm generates too many false positives.

The content network is becoming a favorite place for fraudsters. 1st tier networks now allow advertisers to bid differently for the search network and the content network. However, one big gain for advertisers will come from using adequate click fraud detection tools that quickly identify the bad apples in a content network, drop them, and transfer the money to good affiliates in the content network. Those who drop the content network altogether, or do not set the right bid in the content network, will lose out.

Breakdown of "weird" clicks

Regardless of the source, any claim pinpointing a specific level of fraud in the industry is, by definition, arbitrary and therefore indefensible.

The proportion of undesirable clicks varies greatly depending on the search network (1st tier, 2nd tier, 3rd tier) and click origin (search, content network or contextual advertising). While some industries were more impacted in the past because of aggressive advertising practices, more segments and verticals are impacted now as PPC advertising is increasingly used in many industries.

Content network fraud frequently reaches levels above 50%, mostly from affiliates using unsophisticated types of fraud. Here, by fraud, we mean fraud that can easily be identified and quantified, not gray clicks that come from traffic buckets with fraud or unquantifiable quality issues. We have also uncovered smarter content network fraudsters who use a network of affiliates. Foreign traffic for advertisers not interested in it is now significantly down on Google, to less than 1%. Yet identifying the country of origin in real time is not as simple as it might seem, so these non-domestic clicks still exist. The situation outside Google is not as good. Even on Google, advertisers accepting foreign traffic face significantly higher levels of fraud, whether they are non-US advertisers or US advertisers reaching outside the US, as some of the click fraud activity has moved abroad.

Small advertisers on Google’s search network can experience fraud levels above 50%, as some botnets are designed to hit an advertiser no more than a few times a day. The botnets that we have seen so far are not yet able to discriminate between a small and a big advertiser. Another type of advertiser that fraudsters prey on is the company that generates revenue from leads sent downstream. The scheme is as follows: a fraudster generates bogus queries on Google, then bogus clicks, and then bogus conversions on the target web site. Both Google and the victim (the advertiser) benefit from this and might not even realize that their traffic has problems, until the clients downstream start to complain about the “bogus conversions”. These bogus conversions are actually fraudulent clicks for the client downstream. Thus, even traffic that converts well can be bad, just as non-converting traffic is not necessarily bad (e.g. it may be due to a bad landing page). One botnet that we have investigated was targeting 50% of all advertisers, generating 1% of all clicks (search network, 1st tier search engine) for a mid-size advertiser, and as much as 50% of the conversions (all bogus).

It is sometimes difficult to identify whether a click fraud case is due to a botnet or some other scheme. For instance, it is impossible to distinguish between an AOL botnet, a fraudster using 50 legitimate AOL accounts together with a good homemade robot, and spoofed AOL IP addresses. However, all of these cases show patterns that make them detectable as fraud.

In our experience, 1st tier search networks (if you exclude content network and international clicks) generate 5 to 10% of entirely unbillable clicks. Not all of these clicks are fraudulent: some come from “friendly” robots that somehow were not filtered out by the search engine, sometimes because they are using newly assigned IP addresses. We estimate that after excluding this 5-10% of garbage, another 30-50% of the clicks have issues and should be charged at a discounted price; essentially, these clicks have lower odds of converting across a large pool of advertisers. While all search engines have some price discounting strategy based on click quality (Google calls it “smart pricing”), we have found so far that these strategies are very simplistic and produce a highly bumpy price distribution for any given bid, keyword and advertiser combination.
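
To make these percentages concrete, the breakdown can be worked through for a hypothetical batch of clicks. This is only a sketch: the function name `click_breakdown` is invented for the example, and it assumes that the 30-50% applies to the clicks remaining after the unbillable ones are removed, which is one possible reading of the estimate above.

```python
def click_breakdown(total_clicks, unbillable_rate, discounted_rate):
    """Split a batch of clicks into unbillable, discounted and full-price counts.

    Assumption: discounted_rate applies to the clicks that remain after the
    unbillable clicks have been excluded.
    """
    unbillable = total_clicks * unbillable_rate
    remaining = total_clicks - unbillable
    discounted = remaining * discounted_rate
    full_price = remaining - discounted
    return unbillable, discounted, full_price


# Low and high ends of the ranges quoted above, for 1,000 clicks.
low = click_breakdown(1000, 0.05, 0.30)
high = click_breakdown(1000, 0.10, 0.50)
```

Under this reading, out of 1,000 clicks the low end gives about 50 unbillable, 285 discounted and 665 full-price clicks, and the high end gives about 100 unbillable, 450 discounted and 450 full-price clicks.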

Metrics used to detect fraudulent activity

We cannot divulge the details of our proprietary, patent-pending solutions, but IP address, user agent, time and time zone, velocity, geographic location (including country detection in real time), query, actions performed by the agent (agent behavior) and data reported by search engines are part of the mix. More sophisticated metrics are now routinely used, such as proxy category, network topology metrics, multiple and generic conversion metrics tracked at the session level, full reconciliation between advertiser and search engine data, pre-scored traffic segments or data from external advertisers and search engines when processing a client, and even third party data and data collected through various honeypots, test campaigns and design of experiments. One example of a more advanced metric is checking whether a particular user agent (say, some version of Firefox) is behaving as it should in terms of HTTP requests. This type of browser-based metric has actually led to the development of new intellectual property (patent pending).

Compound metrics are routinely used, such as click-to-conversion ratios (abbreviated CTR here, not to be confused with click-through rate). Actually, a correct CTR is not easy to compute: we have to match each conversion with a click (not a simple task with dynamic IP addresses, or with session cookies that do not work when the agent has cookies turned off), and then assess whether the conversion is genuine or not.
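
The matching step can be sketched as follows, under simplifying assumptions. The sketch assumes that a reliable session identifier is available for both clicks and conversions, which is exactly what is often missing in practice; the function name, the tuple layout and the 30-minute matching window are hypothetical. It does not attempt the second step, which is deciding whether a matched conversion is genuine.

```python
def click_to_conversion_ratio(clicks, conversions, max_delay_seconds=1800):
    """Return clicks per matched conversion, or None if nothing matches.

    clicks, conversions: lists of (session_id, timestamp_seconds) pairs.
    A conversion is matched when its session has a click and the conversion
    occurs within max_delay_seconds after the first click of that session.
    """
    first_click = {}
    for session_id, ts in clicks:
        if session_id not in first_click or ts < first_click[session_id]:
            first_click[session_id] = ts

    converted_sessions = set()
    for session_id, ts in conversions:
        click_ts = first_click.get(session_id)
        if click_ts is not None and 0 <= ts - click_ts <= max_delay_seconds:
            converted_sessions.add(session_id)

    if not converted_sessions:
        return None
    return len(clicks) / len(converted_sessions)
```

Conversions whose session has no recorded click are left unmatched here instead of being guessed at, since forcing a match (for example by IP address alone) is unreliable with dynamic IP addresses.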

When possible, we try to avoid clear GIFs or JavaScript to monitor traffic, as they will miss many fraud cases, particularly in impression or query logs. So, while the choice of metrics is critical, the methodology and the data collected are very important as well. It is also important to help advertising clients set up their servers properly to avoid missing clicks, and to create a proper, comprehensive tagging system, usually through the search engine API.

We have been able to detect many types of fraud in a particular type of click log that did not contain user agent information (not even a user agent ID), nor keyword, nor bid. Large clients might not have the capacity to store many fields, and thus developing algorithms that can work on limited data is critical.

Possibly the most important factor is the ability to work with a large number of generic rules and a few anti-rules, check how these rules interact, and develop a sound statistical system to weight the various rules properly, in a robust way. Failing to do so will result in false positives and inaccuracies. Our system is based on the most advanced statistical tools ever used in any type of scoring system, merging multiple classifiers and efficiently processing large data sets without the drawbacks of explicitly computing decision trees or full PLS logistic regression models, yet taking advantage of these methodologies indirectly, in a more efficient way.
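
The general idea of combining weighted rules and anti-rules into a score can be illustrated with a deliberately naive sketch. It is not the proprietary system described above: the additive scoring, the rule predicates, the field names and the weights are all assumptions made up for the example, and a real system would also have to model how the rules interact.

```python
def score_click(click, rules, anti_rules):
    """Add the weight of every rule that fires, subtract every anti-rule that fires.

    rules, anti_rules: lists of (predicate, weight) pairs, where predicate is a
    function taking the click (a dict) and returning True or False.
    """
    score = 0.0
    for predicate, weight in rules:
        if predicate(click):
            score += weight
    for predicate, weight in anti_rules:
        if predicate(click):
            score -= weight
    return score


# Hypothetical rules: each one raises suspicion when it fires.
RULES = [
    (lambda c: c.get("seconds_to_conversion", 1e9) < 10, 2.0),
    (lambda c: c.get("via_transparent_proxy", False), 1.5),
]
# Hypothetical anti-rule: evidence of legitimacy that lowers the score.
ANTI_RULES = [
    (lambda c: c.get("returning_customer", False), 3.0),
]
```

For example, a click with a 4-second conversion through a transparent proxy scores 3.5 with these made-up weights, while the same click from a returning customer scores 0.5.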

Example of bad clicks: the spiralup botnet

Many fraud cases have been discussed above. Here we provide specific details about a widespread botnet still operating. As many as 50% of all advertisers might be hit, albeit with a low frequency. It is connected with a particular search distribution partner on the largest search engine network. We will call it spiralup, although its real name is different.

Their traffic has been growing exponentially over the last few years, according to Alexa (see graph below). Note that Alexa cannot always discriminate between real and fake traffic. Some software (AlexaBooster) can be purchased to artificially inflate Alexa rankings.
Their traffic shows two sharp dips, in early 2006 and 2007 (see graph below).
Back in 2006, the browser distribution was different, with more Firefox, possibly indicating a network of humans paid to click in poor countries.
In 2007, the browser distribution shifted, with more Internet Explorer, as they use a botnet programmed specifically for Internet Explorer but not for other browsers.
They are growing by constantly adding new advertisers (victims) to their target list, but rarely generate more than 3 clicks per day per advertiser. Newly infected computers are assigned to advertisers newly added to their list.
Advertisers accepting clicks from foreign countries, and small advertisers, are hit hardest.
Part of their traffic is real, part of it is bogus and generated by botnets (clicking agents attached to viruses), and part of it comes from human beings paid to click according to some pre-specified schedule.
They use a very large pool of IP addresses (they must have infected MANY computers), although the pool definitely has an international flavor and some IP blocks tend to be overused. Foreign transparent proxies are also over-represented, probably because the scheme is more difficult to unearth if it hides behind proxies. In addition, more automated clicking can be done from behind a proxy before being detected.
Their name is associated with spyware. Spyware might have been, or might still be, their primary source of income, but they are now definitely into click fraud.
They do not have statisticians working for them, or if they do, they must be lousy ones, and that is how we caught them. Their traffic patterns are associated with unrealistic variances: they cannot simulate meaningful variances, in particular in the number of clicks found in some traffic buckets. In addition, the clicking activity over time and the click-to-conversion ratios are too regular. They are also generating bogus conversions: all of the very numerous conversions coming from them were 100% bogus.
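
One simple way to quantify "too regular" is the variance-to-mean ratio (index of dispersion) of click counts across comparable traffic buckets. The sketch below is an illustration of that standard statistic, not the test actually used to catch spiralup; the function name and the idea of comparing the ratio to 1 are assumptions. For counts that follow a Poisson distribution the ratio is close to 1, so a ratio far below 1 suggests counts that are more regular than random arrivals would produce.

```python
import statistics


def dispersion_index(counts):
    """Return the sample variance of the counts divided by their mean.

    counts: clicks per bucket (for example per hour or per IP block).
    Values far below 1 indicate counts more regular than a Poisson process.
    """
    if len(counts) < 2:
        raise ValueError("need at least two buckets")
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("mean click count is zero")
    return statistics.variance(counts) / mean
```

Daily counts such as 3, 3, 2, 3, 3, 4 give a ratio well below 1, whereas more erratic counts such as 1, 5, 2, 8, 0, 2 give a ratio above 1. A low ratio is only a hint, and the buckets being compared need to be genuinely comparable for it to mean anything.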

Below are four clicks from spiralup:

13/May/2007:08:58:54, query=data+marts, IP=xxx.139.16.154
02/May/2007:04:31:47, query=on+line+shopping+sears+canada, IP=xxx.55.121.2
06/Jan/2007:02:22:23, query=malpractice, IP=xxx.115.106.226
13/Feb/2007:19:33:17, query=fort+myers+mesothelioma+lawyers, IP=xxx.152.21.8

Details on the four clicks:

Each click is from a different advertiser.
Each click has a Google gclid tag.
The time zone is from the advertiser log.
Sessions consist of one or two pages: the landing page, and quite frequently, the conversion page. Time to conversion is less than 10 seconds.
The first click was billed at full price (even days later, the charge did not disappear). It resulted in a bogus conversion. It also triggered an HTTP request on the target page for a blank stylesheet.
This means that the botnet is a parasite of Internet Explorer: it does not have its own code to connect to the Internet, but instead relies on Internet Explorer to do so.
All four clicks have IE 6 as a user agent, as one would expect.
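
The click records and session characteristics above can be turned into a small sketch. The parser assumes that log lines follow exactly the three-field layout of the four sample clicks; the session dictionary layout, the function names and the default thresholds (two pages, 10 seconds) are hypothetical, taken from the description above and not from any real log schema.

```python
import re
from datetime import datetime
from urllib.parse import unquote_plus

# Matches lines such as: 13/May/2007:08:58:54, query=data+marts, IP=xxx.139.16.154
LINE_RE = re.compile(r"^(?P<ts>\S+), query=(?P<query>.*), IP=(?P<ip>\S+)$")


def parse_click(line):
    """Parse one sample-style click line into a dict with time, query and ip."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized click line: %r" % line)
    return {
        "time": datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S"),
        "query": unquote_plus(match.group("query")),  # "+" becomes a space
        "ip": match.group("ip"),  # left masked, as in the sample
    }


def matches_signature(session, max_pages=2, max_seconds_to_conversion=10):
    """True for a very short session that converts almost immediately.

    session: dict with "pages" (list of pages viewed) and
    "seconds_to_conversion" (a number, or None if there was no conversion).
    """
    delay = session["seconds_to_conversion"]
    short = len(session["pages"]) <= max_pages
    fast = delay is not None and delay < max_seconds_to_conversion
    return short and fast
```

Legitimate visitors can also produce short, fast sessions, so a match is a signal to combine with other evidence and not proof of fraud on its own.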

Spiralup's exponential traffic growth:
