Exploring BlueCoat Proxy Logs
Having successfully defined and verified the virtual structure for BlueCoat W3C ELFF-formatted logs, we can now proceed to the more interesting part: actual analysis (using logs from http://log-sharing.dreamhosters.com/bluecoat_proxy_big.zip ).
First let’s select the interesting fields (fields like s_sitename, s_hierarchy, etc. are not relevant for us):
```
/*
 Notes on usage:
 1) The view depends on the pattern located at current directory ./bluecoat-w3c-elff.sxp.
    Adjust the value of $pattern according to where you have saved it.
 2) The view reads files (Demo_log_001.log Demo_log_004.log) from
    http://log-sharing.dreamhosters.com/bluecoat_proxy_big.zip.
 3) When executing the view it saves the result data to /user/bcdata.sxt.
    If you prefer to query directly from the view then comment out the save()
    statement below and save the view script. If you want to use a different
    path/filename change the save() path accordingly.
*/
$pattern = $[./bluecoat-w3c-elff.sxp];

LIST('http://log-sharing.dreamhosters.com/bluecoat_proxy_big.zip')
| PARSE(pattern:$pattern, archive_src:'Demo_log_*.log')
| filter(hdrLine is null)
| select(dateTime
  ,c_ip                //client ip addr
  ,sc_status           //Protocol status code from appliance to client
  ,cs_method           //Request method used from client to appliance
  ,cs_uri_scheme       //Scheme from the 'log' URL
  ,cs_host             //Hostname from the 'log' URL
  ,cs_uri_path         //Path from the 'log' URL. Does not include query.
  ,cs_uri_query        //Query from the 'log' URL
  ,cs_username         //Relative username of a client authenticated to the proxy (i.e. not fully distinguished)
  ,s_supplier_name     //Hostname of the upstream host (not available for a cache hit)
  ,rs_content_type     //Response header: Content-Type
  ,cs_useragent        //Request header: User-Agent
  ,sc_filter_result    //Content filtering result: Denied, Proxied or Observed
  ,sc_filter_category  //Content filtering category of the request URL
  ,x_virus_id          //Identifier of a virus if one was detected
  ,x_virus_details)    //Details of a virus if one was detected
| save('/user/bcdata.sxt', true);
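For readers without SpectX at hand, the W3C ELFF structure itself is simple to parse: a `#Fields` directive names the columns, and each log line is space-separated. A minimal Python sketch (the sample lines below are hypothetical, and real BlueCoat logs also need quoted-value handling):

```python
def parse_elff(lines):
    """Minimal W3C ELFF reader: maps the #Fields header onto each log line.
    A sketch only; values containing spaces are quoted in real logs and
    would need proper tokenization."""
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            fields = line.split()[1:]   # e.g. ['date', 'time', 'c-ip', ...]
        elif line.startswith("#") or not line:
            continue                    # skip other directives and blank lines
        elif fields:
            yield dict(zip(fields, line.split()))

# Hypothetical sample, not taken from the actual log files
sample = [
    "#Software: SGOS",
    "#Fields: date time c-ip sc-status cs-method cs-host",
    "2005-04-28 17:34:12 10.0.0.5 200 GET techlabs.bluecoat.com",
]
rec = next(parse_elff(sample))
print(rec["c-ip"], rec["sc-status"])   # 10.0.0.5 200
```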
Reading raw data over the Internet can take a long time. We can speed up further queries by saving the parsed results to /user/bcdata.sxt and querying from there. Now even the most complex queries take just a few seconds.
We have very little knowledge about the environment the logs come from; we only know that they come from a lab (at the Honeynet project) and that they contain both production (i.e. live) and simulated (artificially generated) traffic. We don’t know how the BlueCoat proxy was deployed, what sort of clients there are, etc. So let’s see what we can deduce from the logs themselves.
Traffic Dynamics, Origin and Destination Distributions
One of the first things I always check is what the traffic dynamics look like:
```
@src = @[/user/bcdata.sxt];  //declare variable @src referring to stored parsed data

@hourlyReq = @src            //declare query using variable @hourlyReq
                             //data from @src gets piped to next statement:
 | select(
     dateTime[1 hour] as hourTrunc,  //truncate timestamp to hour precision, assign alias hourTrunc
     count(*) as requests)           //calculate counts, assign alias requests
 | group(hourTrunc)                  //compute counts by hourTrunc groups
;

@hourlyReq                   //execute query
```
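The hour-truncation-and-count idiom is worth spelling out; a minimal Python equivalent (with hypothetical timestamps standing in for the parsed records) would be:

```python
from collections import Counter
from datetime import datetime

# Hypothetical parsed records; in the real analysis these come from bcdata.sxt.
records = [
    {"dateTime": datetime(2005, 4, 28, 17, 34, 12)},
    {"dateTime": datetime(2005, 4, 28, 17, 59, 1)},
    {"dateTime": datetime(2005, 4, 28, 18, 2, 44)},
]

# Truncate each timestamp to hour precision and count requests per hour,
# mirroring select(dateTime[1 hour] as hourTrunc, count(*)) | group(hourTrunc).
hourly = Counter(r["dateTime"].replace(minute=0, second=0, microsecond=0)
                 for r in records)

for hour, requests in sorted(hourly.items()):
    print(hour, requests)
```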
We have approximately a month’s worth of records (from April 4th to May 7th, 2005). We can observe three distinct periods: the first ~10 days (April 5th to 14th) with very little traffic (a few tens of hourly requests), then a period of complete silence from April 13th to 27th, and finally the last 10 days with rapidly increasing traffic. Hmm … it is very unlikely that such a traffic increase is caused by malicious activity unless it’s a DDoS. The latter should show convergence to very few targets, so it is easy to verify. A more likely hypothesis is that the lab guys tested the initial setup during the first 10 days, then went on to improve it (or party?), and after 2 weeks they launched the real thing. Ok, let’s see what the target distribution is:
```
@hourlyTargets = @src
 | select(
     hourTrunc:dateTime[1 hour],  //truncate timestamp to hour precision, assign alias hourTrunc
     cs_host,
     count(*))                    //compute unique hourTrunc and cs_host pairs
 | group(@1, @2)                  //group by first and second position fields (hourTrunc and cs_host)
 | select(hourTrunc, targets:count(*))  //compute the count of hosts per hourTrunc
 | group(@1)
 | join(@hourlyReq on left.hourTrunc = right.hourTrunc)
 | select(tHour:left.hourTrunc, targets, requests)
;

@hourlyTargets
```
As we can see the number of target hosts follows the same pattern as the number of requests during the “live” period. To be sure about it we can also compute the Pearson correlation coefficient between the two:
```
@hourlyTargets | select(CORR(targets, requests));
```
which gives us 0.98 - i.e we see a strong correlation between the number of requests and the number of different hosts. We can now safely rule out the DDOS hypothesis.
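For reference, the Pearson coefficient that CORR() computes is plain sample covariance over the product of standard deviations; a small Python sketch with made-up hourly pairs (the real data gave ~0.98):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, as computed by CORR() in the queries above."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical hourly (targets, requests) pairs, roughly linear like the real data
targets  = [10, 25, 40, 80, 120]
requests = [100, 260, 410, 790, 1250]
print(round(pearson(targets, requests), 2))
```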
Next, let’s have a look at the clients. Their distribution should give us some clue about how the proxy is set up.
```
@hourlyClients = @src
 | select(
     hourTrunc:dateTime[1 hour],  //truncate timestamp to hour precision, assign alias hourTrunc
     c_ip,
     count(*))                    //compute unique hourTrunc and c_ip pairs
 | group(@1, @2)
 | select(hourTrunc, clients:count(*))  //count unique clients per hourTrunc
 | group(@1)
 | join(@hourlyReq on left.hourTrunc = right.hourTrunc)
 | select(tHour:left.hourTrunc, clients, requests)
;

@hourlyClients;
```
Here the upper diagram displays the first period (area grayed on the lower diagram). The lower diagram shows the trend for the entire period.
We can see a constant chatter of a few tens of requests per hour from a few clients, mixed with occasional bursts of requests from still the same low number of clients. Considering that the number of targets remains low as well, this is perhaps simulated traffic. Since the overall number of requests and targets is low, we can look directly at the requests from this period:
```
@src
 | filter(dateTime > T(2005,04,08) and dateTime < T(2005,04,12))
 | select(cs_host, cs_uri_path, count(*))
 | group(@1,@2)
 | sort(@3 desc)
;
```
This looks very much like setup and testing activities: updating McAfee AV engine, virus patterns, downloading and applying operating system updates, setting up BlueCoat, etc.
The clients’ graph of the “live” period fits the waves of requests and targets distribution (requests are not shown for better visibility):
Let’s compute the correlation to be sure:
```
@clientRequestCorrelation = @hourlyClients | select(CORR(clients, requests));
```
We get 0.98 again. Considering the quite high number of both clients and target hosts, this looks like real live traffic. This is also suggested by the period of the waves on the graph: in the GMT timezone, traffic starts increasing at 1 pm and dies off at 3 am.
IP Network Analysis
Let’s see each IP country’s contribution to hourly traffic:
```
@hourlyCC = @src
 | select(ccIP:cc(c_ip),
     dayTime:count(HOUR(dateTime)>8 and HOUR(dateTime)<16),
     nightTime:count(HOUR(dateTime)<=8 or HOUR(dateTime)>=16))
 | group(@1);
```
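The conditional count() idiom used here (counting only rows matching a predicate) can be sketched in Python; the grouping by country via cc() is omitted for brevity, and the timestamps below are made up:

```python
from datetime import datetime

# Hypothetical request timestamps; in the real data these come from dateTime.
timestamps = [
    datetime(2005, 5, 5, 9, 15),
    datetime(2005, 5, 5, 14, 30),
    datetime(2005, 5, 5, 22, 45),
]

# Conditional counts, mirroring count(HOUR(dateTime)>8 and HOUR(dateTime)<16)
# and its night-time complement.
day_time   = sum(1 for t in timestamps if 8 < t.hour < 16)
night_time = sum(1 for t in timestamps if t.hour <= 8 or t.hour >= 16)
print(day_time, night_time)   # 2 1
```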
The biggest contributor is an unknown country. Internal address space? Let’s see traffic from class A networks:
```
@top5Aclass = @src
 | select(c_ip & 0xFF000000l as network, requests:count(*))  //compute count by /8 network
 | group(@1)
 | sort(@2 desc)
 | select(*, percentage:(DOUBLE(requests)/$totalRequests)*100)  //compute the percentage of total
 | limit(5)
;

@top5Aclass;
```
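The bitmask `c_ip & 0xFF000000` keeps only the first octet, i.e. the /8 ("class A") network. The same trick in Python, on hypothetical client IPs:

```python
import ipaddress
from collections import Counter

def slash8(ip: str) -> str:
    """Mask an IPv4 address to its /8 network, mirroring c_ip & 0xFF000000."""
    masked = int(ipaddress.IPv4Address(ip)) & 0xFF000000
    return str(ipaddress.IPv4Address(masked)) + "/8"

# Hypothetical client IPs; in the real data one /8 dominated overwhelmingly.
ips = ["45.12.7.9", "45.200.1.1", "10.0.0.5"]
top = Counter(slash8(ip) for ip in ips).most_common()
print(top)   # [('45.0.0.0/8', 2), ('10.0.0.0/8', 1)]
```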
Funny that 98.4% of the traffic originates from the 45.0.0.0/8 range, which in 2005 belonged to the Interop Show Network - an annual trade fair for information technology organized by UBM (https://en.wikipedia.org/wiki/Interop). Coincidentally, the NetWorld(SM)+Interop(R) 2005 conference was held May 1-6, 2005 at the Mandalay Bay Convention Center in Las Vegas (https://www.businesswire.com/news/home/20041207005103/en/NetworldInterop-2005-Showcase-Largest-Assembly-Exhibitors-North). Did the Honeynet guys set up their proxy to intercept conference traffic?
The requests graph @hourlyReq is computed with the timezone set to GMT, which shows activity increasing at nighttime. When changing the timezone to PST (that’s where Las Vegas happens to be) we observe activity starting to increase at 6-7 am and coming down to the nightly level by 7 pm. There is also a distinct “saddle” around lunchtime. It looks like normal human behavior to me.
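The timezone shift is easy to verify with Python’s zoneinfo (note that in May 2005 Las Vegas was actually on PDT, UTC-7; the timestamp below is an arbitrary example):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# An example GMT timestamp from the log period, shifted to Las Vegas local time
gmt = datetime(2005, 5, 5, 14, 0, tzinfo=timezone.utc)
local = gmt.astimezone(ZoneInfo("America/Los_Angeles"))
print(local.isoformat())   # 2005-05-05T07:00:00-07:00
```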
What does BlueCoat content filtering say? If it was a computer and information technology related event, we should see Computers/Internet in the top 5:
```
@contentFilter = @src
 | select(sc_filter_category, count(*)).group(@1).sort(@2 desc).limit(15)
;

@contentFilter;
```
Sure enough, we do, along with Business, Economy, News, and Email in prominent positions. Oh, and porn is there too, in 13th place. In my opinion, porn is a good indicator of human behavior - I have never seen a bot or trojan show any interest in it. Let’s have a look at what the favorite time is for enjoying this sort of entertainment:
```
@hourlyPorn = @src
 | select(
     hourTrunc:dateTime[1 hour],  //truncate timestamp to hour precision, assign alias hourTrunc
     pornCnt:count(sc_filter_category = 'Pornography'),
     total:count(*))
 | group(@1)
;

@hourlyPorn;
```
Apparently, there was no porn watching during the testing period (it seems the Honeynet guys were truly committed to their work setting up the lab environment). Porn traffic appears only during the “live” period, and at PST daytime. Are attendees bored with the conference? Let’s see the related traffic from class A networks:
```
@AClassPornNetworks = @src
 | filter(sc_filter_category = 'Pornography')
 | select(c_ip & 0xFF000000l as network, requests:count(*))  //compute count by /8 network
 | group(@1)
 | sort(@2 desc)
;

@AClassPornNetworks;
```
Yeah, 99% are from the conference. Still, we’re only human, and this further confirms the hypothesis of real live traffic.
Moving on, let’s see the BlueCoat virus information:
```
@viruses = @src
 | select(x_virus_id,
     testPeriodCnt:count(dateTime < T(2005,04,27)),
     livePeriodCnt:count(dateTime >= T(2005,04,27)))
 | group(@1)
 | sort(@3 desc)
;

@viruses;
```
| x_virus_id | testPeriodCnt | livePeriodCnt |
|---|---|---|
| EICAR test file | 11 | 81 |
25 different viruses were identified. They make up 0.002% of all requests, and half of them are not real virus infections but testing (the EICAR test file). What sites distribute this stuff?
```
@dropperSites = @src
 | filter(x_virus_id is not null AND x_virus_id != 'EICAR test file')
 | select(cs_host, count(*)).group(@1).sort(@2 desc)
;

@dropperSites;
```
37 sites, and techlabs.bluecoat.com is among them. How come? They’re supposed to be a decent vendor, not a dodgy place to catch viruses! Let’s look closer:
```
@bluecoatTechlabs = @src
 | filter(x_virus_id is not null AND x_virus_id != 'EICAR test file')
 | filter(cs_host = 'techlabs.bluecoat.com')
 | select(dateTime, sc_status, cs_uri_path, x_virus_id, x_virus_details)
;

@bluecoatTechlabs;
```
| dateTime | sc_status | cs_uri_path | x_virus_id | x_virus_details |
|---|---|---|---|---|
| 2005-04-28 17:34:12 | 200 | /eicar/noisybear.cab | JV/g | Virus: JV/g; File: noisybear.cab; Sub File: NoisyBear.class; Vendor: Network Associates, Inc (McAfee); Engine error code: None; Engine version: 10.21; Pattern version: 18.104.22.168865; Pattern date: 04/28/2005 04:40:00 |
noisybear.cab is Mark LaDue’s proof-of-concept malicious Java program (http://www.cigital.com/hostile-applets). It displays a picture of a bear emitting an annoying noise and is used for testing AV detection, similarly to EICAR. We can safely exclude it from the real dropper sites.
How many clients were infected? With which viruses? When? The following query produces a list of client IPs with real infection attempts. In an enterprise environment, this would serve as a good starting point for CSIRT mitigation actions.
```
@infectedClients = @src
 | filter(x_virus_id is not null AND sc_status = 200)
 | filter(x_virus_id != 'EICAR test file')
 | filter(cs_host != 'techlabs.bluecoat.com')
 | select(
     c_ip,
     x_virus_id,
     min_dateTime:min(dateTime))  //get the earliest client timestamp of infection with a virus
 | group(@1, @2)
;
```
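The min-per-group step (earliest infection timestamp per client/virus pair) can be sketched in Python; the event tuples below are hypothetical stand-ins for the filtered records:

```python
from datetime import datetime

# Hypothetical filtered records (status 200, real virus, non-test site);
# in the real analysis these come from the query above.
events = [
    ("10.1.2.3", "W32/Sdbot", datetime(2005, 5, 2, 11, 5)),
    ("10.1.2.3", "W32/Sdbot", datetime(2005, 5, 1, 9, 30)),
    ("10.9.8.7", "JS/Exploit", datetime(2005, 5, 4, 16, 0)),
]

# Keep the earliest timestamp per (client IP, virus) pair,
# mirroring select(c_ip, x_virus_id, min(dateTime)) | group(@1, @2).
earliest = {}
for c_ip, virus, ts in events:
    key = (c_ip, virus)
    if key not in earliest or ts < earliest[key]:
        earliest[key] = ts

for (c_ip, virus), ts in sorted(earliest.items()):
    print(c_ip, virus, ts)
```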
The logs provided us with records of web traffic from April 4th to May 7th, 2005. They include three distinct periods, which we identified as:
- a test period of initial setup (April 5th to 14th),
- a period of complete silence (April 13th to 27th), and
- a “live” period from April 27th to May 7th.
The testing period is characterized by a relatively low number of requests (ten to a few hundred per hour) to a few selected sites, related to setting up BlueCoat and the antivirus detection engine and updating the operating system (Win XP SP1).
The silence period could mean the lab folks spent their time setting up the surrounding environment for the live period.
Finally, the “live” period is characterized by a rapidly increasing amount of traffic, peaking on May 5th at 260,000 requests per hour and then gradually decreasing. The numbers of requests, originating clients, and target hosts are high enough to make artificial generation very complicated and unlikely. The distribution of originating clients and target hosts follows the activity pattern of the traffic very closely.
The activity patterns seem to match normal daily human activity: a rapid increase from the early morning hours and a decrease in the afternoon, with a distinct lower “saddle” indicating a lunch break between them. The timezone matching this pattern places the originating activity on the US west coast (PST). Another aspect indicating the human nature of the traffic’s origin is the presence of adult site visits.
Network analysis strongly suggests that the traffic originated from participants of the NetWorld(SM)+Interop(R) conference held May 1-6 at the Mandalay Bay Convention Center in Las Vegas.
Observed virus infections were relatively low: only 0.002% of the traffic relates to known viruses, and half of that relates to testing antivirus engines. Of 3499 clients, 49 were performing real virus downloads (i.e. infection attempts).
You can download the whole script from here. Logs are available for download from: http://log-sharing.dreamhosters.com/bluecoat_proxy_big.zip