Key Findings from 37GB of Dot-cm Typosquatting Scheme Logs
By Jüri Shamov-Liiver on 13 June 2018
Earlier this year Brian Krebs (KrebsOnSecurity) published an article about a large typosquatting scheme (https://www.krebsonsecurity.com/2018/04/dot-cm-typosquatting-sites-visited-12m-times-so-far-in-2018). The story is based on five years of logs from the typosquatting operation and includes many details about how the scheme was set up and who was behind it. When I got hold of these logs, I wanted to dig deeper into the figures and build a more comprehensive overview of the operation. With SpectX at my disposal, I could easily analyse the entire dataset rather than rely on a few samples from selected periods.
The logs, about 37 GB in total, cover the period of operation from 2013/08/09 to 2018/03/30. They contain five types of daily rotated files:
The following analysis is based on the main [dd-MM-yyyy].log access logs. They record requests related to a campaign, i.e. interactions with a visitor governed by the rules of a particular typosquatted domain. Some of the requests originate from visitors, others from redirects made by the campaign.
The whole operation is organised around campaigns, i.e. the lifecycle of a typosquatted domain. The [dd-MM-yyyy].log access log contains a numerical ID that is the same every time a request for the same domain appears in the log. Requests for domains that do not have an assigned ID (i.e. are not registered with the system) end up in [dd-MM-yyyy]-notinsystem.log.
The ID appears to be assigned in increasing order over time: when IDs are sorted by the date of their first appearance, their values increase. Most databases use an incremental ID, and this appears to be the case here as well. The few exceptions to that rule most likely mean that the typosquatting domain was registered earlier but did not appear in the log for a while (i.e. did not "work"). Similarly, there are gaps in the ID sequence, which likely represent campaigns that never worked.
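The ordering check described above can be sketched in a few lines of Python. This is only an illustration, not the SpectX query used for the article: it assumes the log has already been reduced to hypothetical (date, campaign ID) pairs, and the function names are made up for the example.

```python
from datetime import date

def first_seen_by_id(records):
    """Map each campaign ID to the earliest date on which it appears."""
    first_seen = {}
    for day, campaign_id in records:
        if campaign_id not in first_seen or day < first_seen[campaign_id]:
            first_seen[campaign_id] = day
    return first_seen

def out_of_order_ids(first_seen):
    """IDs that first appear after a higher ID has already been seen."""
    # Walk the IDs in order of first appearance (ties broken by ID).
    ordered = sorted(first_seen.items(), key=lambda item: (item[1], item[0]))
    exceptions, highest = [], None
    for campaign_id, _day in ordered:
        if highest is not None and campaign_id < highest:
            exceptions.append(campaign_id)
        else:
            highest = campaign_id
    return exceptions

def missing_ids(first_seen):
    """IDs inside the observed range that never show up in the log."""
    present = set(first_seen)
    return [i for i in range(min(present), max(present) + 1) if i not in present]

# Synthetic sample: ID 4 shows up after ID 5, and ID 3 never appears.
sample = [
    (date(2013, 8, 9), 1),
    (date(2013, 8, 9), 2),
    (date(2013, 8, 10), 2),
    (date(2013, 8, 12), 5),
    (date(2013, 8, 15), 4),
]
seen = first_seen_by_id(sample)
print(out_of_order_ids(seen), missing_ids(seen))
```

A short list of out-of-order IDs relative to the total would be consistent with the incremental-ID reading. The missing IDs are the candidates for campaigns that never worked.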
Several of the domains appear to be clustered. When sorted sequentially by the ID, similar domains appear together. If newly added domains correspond to incrementally higher ID numbers, it is likely that this pattern was produced by someone who added together all similar variants of a site being typosquatted. For example, the following cluster:
The incremental nature of the ID lets us infer the rate at which new typosquat-domains were added:
I can also see that the operators were gradually getting better at choosing domain names. Leaving aside the first year (as I don't know at which ID the real operations started), the ratio of registered to "working" typosquatted domains improves over time. The numbers are large enough to imply that someone was working on it full time.
In total, 4376 unique domains were observed throughout the operation of the scheme. For some reason, 60 of them were registered twice under different IDs. A wide variety of domains is present, including imitations of Fortune 500 company domains, domains imitating sites at the top of the Alexa rankings, and some domains that are likely to be mistyped, such as single-character domains (like a.com.cm) that range through the whole alphabet and every digit. A large number of porn sites, technology companies, news organizations, banks and retailers are also represented.
The campaigns are distributed between 81 different top-level domains:
More than half of the typosquatted domains are in .com. But as we will see later, the most "successful" ones are in .cm and .om (the country-code top-level domains of Cameroon and Oman), which together represent almost a quarter of the domains. Alongside .info and .org, a large number of other country-code TLDs are involved.
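A breakdown like this can be reproduced from a plain list of campaign domains. The sketch below is a minimal illustration that takes the last DNS label as the top-level domain, so a.com.cm counts towards .cm. The helper names are hypothetical, and the sample list reuses only a handful of domains mentioned in this article.

```python
from collections import Counter

def tld_of(domain):
    """Return the last DNS label, e.g. 'cm' for 'a.com.cm'."""
    return domain.strip().rstrip(".").lower().rsplit(".", 1)[-1]

def tld_shares(domains):
    """Count unique domains per TLD and return (tld, count, share) rows."""
    unique = {d.strip().rstrip(".").lower() for d in domains}
    counts = Counter(tld_of(d) for d in unique)
    total = sum(counts.values())
    return [(tld, n, n / total) for tld, n in counts.most_common()]

# Illustrative input only, not the real list of 4376 domains.
sample_domains = [
    "facebook.cm", "pornhub.cm", "a.com.cm", "naver.cm",
    "youtube.com.vn", "ilshelter.tk", "youjiis.com", "Facebook.cm",
]
for tld, count, share in tld_shares(sample_domains):
    print(f".{tld}: {count} domains ({share:.1%})")
```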
Let's look at some other campaign properties. The number of unique referral domains (i.e. domains from the referrer field) characterises the exposure of a campaign. Some of them are client-originated (for example, a Google search), others represent redirects made by the scheme. The greater the variety of referral domains, the broader the target focus of a typosquatting campaign. The following table lists the top 5 campaigns with the broadest focus:
The number of unique visitors (i.e. unique IP addresses) obviously shows the exposure of a campaign. The following table lists the top 5 campaigns with the largest number of visitors:
The number of requests is perhaps the measure most closely correlated with the campaign operator's income: the more requests, the more ads displayed and the more malware, scareware, adware, etc. installed. This is what they are paid for. The following table lists the top 5 campaigns with the most traffic:
Unsurprisingly, facebook.cm is by far the most prevalent campaign. It is present in all the "top 5" tables.
Campaign lifetime shows the distribution of requests over time. The following diagram displays daily requests of the top 6 campaigns over the whole operation period:
The nature of the traffic differs considerably between campaigns. facebook.cm is the most prolific, with very stable daily traffic throughout the whole period. It shows a steady (almost constant) rate of decline, from 23000 - 25000 daily requests in Jan 2014 to 8000 - 9000 daily requests in Mar 2018. There are two clearly abnormal peaks where traffic surges to 3-10 times the daily average: on 2017/09/13 and on 2017/11/24.
On 2017/09/13 the peak is caused by requests from a single IPv4 address (54.144.35.xxx) in the US. All the requests are identical, with an empty referrer, a request for the root path ("/") and a user agent belonging to the Firefox browser on Windows XP: "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6".
On 2017/11/24 the peak is caused by requests from a single IPv4 address (197.251.139.xxx) in Ghana. All the requests are identical, with an empty referrer, a request for the root path ("/") and a user agent belonging to the Chrome 62 browser on Windows 7: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3177.0 Safari/537.36".
Both cases show obvious signs of an automated script gone awry. The average volume is too low to represent a genuine DoS attack. Perhaps it was a script for monitoring the quality of the service?
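Peaks of this kind, and the single address behind them, can be found mechanically. The sketch below is one possible approach, not the query used for the article: it assumes hypothetical (day, client IP) pairs for one campaign, uses the median daily volume as the baseline (my choice; the text above compares against the daily average), and flags days that reach at least three times that baseline.

```python
from collections import Counter, defaultdict
from statistics import median

def find_peaks(records, factor=3.0):
    """Return (day, requests, top_ip, top_ip_share) for abnormally busy days.

    `records` is an iterable of (day, ip) pairs for a single campaign and
    must not be empty. A day counts as a peak when its request count is at
    least `factor` times the median daily count.
    """
    per_day = defaultdict(Counter)
    for day, ip in records:
        per_day[day][ip] += 1
    totals = {day: sum(ips.values()) for day, ips in per_day.items()}
    baseline = median(totals.values())
    peaks = []
    for day in sorted(totals):
        if totals[day] >= factor * baseline:
            top_ip, hits = per_day[day].most_common(1)[0]
            peaks.append((day, totals[day], top_ip, hits / totals[day]))
    return peaks

# Synthetic data using documentation-only addresses (RFC 5737).
sample_requests = (
    [("day-1", f"192.0.2.{i}") for i in range(3)]
    + [("day-2", "198.51.100.7")] * 10
    + [("day-2", "192.0.2.1"), ("day-2", "192.0.2.2")]
    + [("day-3", f"192.0.2.{i}") for i in range(4)]
)
for day, total, ip, share in find_peaks(sample_requests):
    print(f"{day}: {total} requests, {share:.0%} from {ip}")
```

A top address with a share close to 100% on a peak day points to a single automated client rather than a genuine surge of visitors.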
pornhub.cm also has quite stable traffic, at a much lower rate than facebook.cm but with steadily increasing daily volumes:
ilshelter.tk has a much shorter lifecycle than the others in the top 6, just about a year from October 2014 to October 2015 (with a final burst in January 2016):
HOW MUCH OF THIS TRAFFIC CAN BE ATTRIBUTED TO HUMANS?
There are about 1.6 million different user agent strings in the dataset. Many of them are invalid, and the vast majority of those contain hacking attempts (such as shell commands or exploits for various PHP vulnerabilities). This is not an uncommon technique for targeting applications that parse user agent strings.
Filtering the list for web browsers and weeding out various bots and crawlers (not all of them), we end up with 1 406 686 user agents. The top 5:
It is possible that many of them are not actual browsers, as many malware crawlers mask themselves with a valid browser user agent string.
Applying the same filter to all of the traffic we end up with 81.89% being initiated by browsers.
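The exact filter behind these figures is not shown here, so the following is only a rough sketch of the idea: accept user agents that look like mainstream browsers, reject those that announce themselves as bots or monitoring tools, and compute the share of requests that pass. Both regular expressions are illustrative assumptions and are far less thorough than a real filter would need to be.

```python
import re

# Hypothetical patterns: a Mozilla-style prefix plus a known browser token...
BROWSER_RE = re.compile(
    r"^Mozilla/\d\.\d \(.*\b(?:Firefox|Chrome|Safari|MSIE|Trident)\b"
)
# ...and a few tokens that typically identify bots and monitoring tools.
BOT_RE = re.compile(r"bot|crawl|spider|check_http", re.IGNORECASE)

def looks_like_browser(user_agent):
    """True if the string resembles a browser and does not self-identify as a bot."""
    return bool(BROWSER_RE.search(user_agent)) and not BOT_RE.search(user_agent)

def browser_share(user_agents):
    """Fraction of requests (one user agent per request) that pass the filter."""
    agents = list(user_agents)
    if not agents:
        return 0.0
    return sum(looks_like_browser(ua) for ua in agents) / len(agents)

sample_agents = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/62.0.3177.0 Safari/537.36",
    "check_http/v1.4.15 (nagios-plugins 1.4.15)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "() { :; }; /bin/ls",  # made-up example of a shell command in the UA field
]
print(f"{browser_share(sample_agents):.2%} of sample requests look like browsers")
```

As noted above, a filter like this cannot tell a real browser from a crawler that sends a valid browser string, so the resulting share is an upper bound on human traffic.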
Of all the unique IP addresses, the vast majority are IPv4 and only a small fraction are IPv6:
The following chart shows the country breakdown of IPv4 visitors:
Unsurprisingly, the US contributes nearly 35% of the IPv4 traffic. What is surprising, though, is to see Vietnam and South Korea among the top 5 countries, while China is only in fifth place with half as many visitors as South Korea. This suggests that some of the campaigns must be targeted specifically at these countries.
Among the IPv4 addresses there are a number that belong to private address space (such as 192.168.0.1 and the other address ranges defined in RFC 1918). In total there are 464894 such addresses, which constitute 1.25% of all observed IPv4 addresses.
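Membership in the RFC 1918 ranges is easy to test with Python's standard ipaddress module. The sketch below lists the three RFC 1918 blocks explicitly rather than relying on the is_private attribute, which also covers loopback, link-local and other reserved ranges. The function names and sample addresses are illustrative.

```python
import ipaddress

# The three private IPv4 blocks defined in RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

def is_rfc1918(address):
    """True if `address` is a valid IPv4 address inside an RFC 1918 block."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # not an IP address at all
    return ip.version == 4 and any(ip in net for net in RFC1918_NETWORKS)

def private_share(addresses):
    """Fraction of unique addresses that fall into RFC 1918 space."""
    unique = set(addresses)
    if not unique:
        return 0.0
    return sum(is_rfc1918(a) for a in unique) / len(unique)

sample_ips = ["192.168.0.1", "192.168.0.1", "10.1.2.3", "8.8.8.8", "1.1.1.1"]
print(f"{private_share(sample_ips):.2%} of unique sample addresses are private")
```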
About 28% of the traffic from private-range addresses can be associated with service monitoring via the user agent string "check_http/v1.4.15 (nagios-plugins 1.4.15)".
For the remaining part of the traffic we can only guess that it might represent some testing activity: 99.6% of the requests seem to be made directly at a typosquatted domain with no referrer, with the path set to root ("/") and no query. The large number of different user agents seems to speak in favor of that hypothesis. Or it could equally well be customer-initiated traffic from networks connected to the system via VPN tunnelling. Computing the correlation between daily requests from private and public IP addresses for the top 6 campaigns, we see a very strong correlation for the facebook.cm campaign (0.99), a positive correlation for the ilshelter.tk campaign (0.78), but a very weak correlation for youjiis.com (0.12) and naver.cm (0.29). Such inconsistency could be explained by a mix of testing and tunnelled user-initiated traffic.
IPV6 VISITORS
Although IPv6 is only a small fraction of all the traffic, it may still be interesting to see the originating country:
Unsurprisingly, the US is the clear leader here (41.2% of IPv6 addresses). India's share of 31% is somewhat surprising, though, as India does not show up in any of the other top lists.
From the visitor IPV4 country top 5 I get a slight hint that some of the campaigns may be targeting specific countries. By looking at the flows from "country" to "campaign" we can see if the same countries appear in the top 5 list:
Sure enough, Korea, Vietnam and the US are there. Now let's look at the flows of the most popular campaigns:
Although the US is in the lead, the overall distribution does not suggest that this campaign attracts users from a specific country. However, youtube.com.vn looks quite different:
Here the distribution of traffic clearly suggests that the campaign is targeted at Vietnam, matching its ccTLD .vn.
How do People End Up in a Typosquatted Domain?
Referrer values could give us some hints to answer this question. Computing the distribution of unique referrers over all requests, we find that 86.7% of all requests are made with no (empty) referrer value, i.e. these clients came to a typosquatted domain directly.
Next, a lot of requests are referred from search and redirection sites with a fractional share of traffic.
If such a large proportion of all requests is made directly, we should see a similar distribution for individual campaigns. And indeed, computing the same distribution for facebook.cm we see 99.17% of requests being made directly, and for youtube.com.vn 97.42%.
The only exception seems to be the ilshelter.tk campaign, where 60% of requests are made with a reference to the campaign homepage and only 30.85% of requests are made directly.
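The per-campaign share of direct requests can be computed along these lines. The sketch assumes hypothetical (campaign domain, referrer) pairs and treats both an empty string and a lone "-" as "no referrer"; the latter is a common access-log convention, not something confirmed for these logs.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

def referrer_domain(referrer):
    """Reduce a referrer to its host name; return '' for a direct request."""
    ref = (referrer or "").strip()
    if ref in ("", "-"):
        return ""
    try:
        host = urlsplit(ref).hostname
    except ValueError:
        host = None  # malformed URL, keep the raw value instead
    return (host or ref).lower()

def direct_share_by_campaign(records):
    """Map each campaign to the fraction of its requests with no referrer."""
    counts = defaultdict(Counter)
    for campaign, referrer in records:
        counts[campaign][referrer_domain(referrer)] += 1
    return {c: refs[""] / sum(refs.values()) for c, refs in counts.items()}

# Synthetic requests, for illustration only.
sample_hits = [
    ("facebook.cm", ""),
    ("facebook.cm", "-"),
    ("facebook.cm", "http://www.google.com/search?q=facebok"),
    ("ilshelter.tk", "http://ilshelter.tk/"),
    ("ilshelter.tk", ""),
]
for campaign, share in direct_share_by_campaign(sample_hits).items():
    print(f"{campaign}: {share:.2%} direct")
```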
Application logs can reveal interesting information from an unexpected angle. For instance, if the log record fields contain data with a type that should not be there, it is usually ignored. But what such records actually can reveal is that the application has somehow failed to interpret the data as intended. And that is almost always a big red warning flag.
The second field of [dd-MM-yyyy-lang]-*.log records is supposed to contain client information: one or more IPv4 or IPv6 addresses or hostnames (i.e. very similar to the content of the x-forwarded-for header). When looking at matching failures of this field, interesting stuff starts to appear:
22.214.171.124[CHR(0)]<!--#exec cmd=\"ls -la;factor 228000\" -->
-1));select pg_sleep(9); --
The latter decodes to:
You can see attempts at shell command execution, XSS testing, attempts to install a PHP plugin via the MySQL driver, etc. Whether they succeeded cannot be determined from these logs.
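The same "look at what fails to parse" idea can be sketched in Python. The example assumes the client field is a comma-separated list, as in an x-forwarded-for header, and accepts an entry only if it is a valid IP address or a syntactically plausible hostname. The hostname pattern and helper names are illustrative assumptions, not the pattern used in the original analysis.

```python
import ipaddress
import re

# Dot-separated labels of letters, digits and hyphens, not starting or
# ending with a hyphen, at most 253 characters overall.
HOSTNAME_RE = re.compile(
    r"^(?=.{1,253}$)(?!-)[A-Za-z0-9-]{1,63}(?<!-)"
    r"(?:\.(?!-)[A-Za-z0-9-]{1,63}(?<!-))*$"
)

def is_valid_client(token):
    """True if `token` is an IPv4/IPv6 address or a plausible hostname."""
    token = token.strip()
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return bool(HOSTNAME_RE.match(token))

def invalid_client_fields(fields):
    """Return the field values in which at least one entry fails validation."""
    return [
        field for field in fields
        if not all(is_valid_client(token) for token in field.split(","))
    ]

sample_fields = [
    "203.0.113.5",
    "203.0.113.5, 2001:db8::1",
    "proxy.example.net",
    "-1));select pg_sleep(9); --",  # payload quoted above
]
print(invalid_client_fields(sample_fields))
```

Everything this returns deserves a manual look: in a field that should only ever hold addresses and hostnames, any other content is either a parsing bug or someone probing the application.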