Exploring BlueCoat Proxy Logs

Having successfully defined and verified a virtual structure for the BlueCoat W3C ELFF formatted logs, we can now proceed to the more interesting part: actual analysis.

First let's select the fields of interest (fields like s_sitename, s_hierarchy, etc. are not really relevant for us):

$pattern = $[/shared/patterns/bluecoat-w3c-elff.sxp];
@src = PARSE(src:'file://data/chuvakin/bluecoat_proxy_big/Demo_log_*.log', pattern:$pattern);

@src
 .filter(hdrLine is null)
 .select(                       //select only fields of interest
         dateTime
         ,c_ip                  //client ip addr
         ,sc_status             //Protocol status code from appliance to client
         ,cs_method             //Request method used from client to appliance
         ,cs_uri_scheme         //Scheme from the 'log' URL
         ,cs_host               //Hostname from the 'log' URL.
         ,cs_uri_path           //Path from the 'log' URL. Does not include query.
         ,cs_uri_query          //Query from the 'log' URL.
         ,cs_username           //Relative username of a client authenticated to the proxy (i.e. not fully distinguished)
         ,s_supplier_name       //Hostname of the upstream host (not available for a cache hit)
         ,rs_content_type       //Response header: Content-Type
         ,cs_useragent          //Request header: User-Agent
         ,sc_filter_result      //Content filtering result: Denied, Proxied or Observed
         ,sc_filter_category    //Content filtering category of the request URL
         ,x_virus_id            //Identifier of a virus if one was detected
         ,x_virus_details       //Details of a virus if one was detected
  )
 .save('/shared/demo/bluecoat/bcdata.sxt')
;

Reading the raw data from the local hard disk and extracting and transforming (parsing) all four logs (2.6 GB in total) takes about 10 seconds on my MacBook Pro with 8 CPU cores and 16 GB of RAM. Not too bad, but we can make it even faster by saving the parsed results and running further queries against those. (Note that if our dataset had included continuously updated live data, we could not have taken that approach.) Now even the most complex queries take just a few seconds.
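
The follow-up queries in the next sections therefore start by referring to this stored result instead of the raw logs:

@src = @[/shared/demo/bluecoat/bcdata.sxt];   //query the stored parsed data instead of re-parsing the raw log files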

We have very little knowledge about the environment the logs come from; we only know that they come from a lab (at the Honeynet Project) and contain both production (i.e. live) and simulated (artificially generated) traffic. We don't know how the BlueCoat proxy was deployed, what sort of clients there are, etc. Hence let's see what we can deduce from the logs themselves.

Sidenote: in the query examples I'm mostly using the piped command style. All of them could also be written in SQL style.

Traffic Dynamics, Origin and Destination Distributions

One of the first things I always check is what the traffic dynamics look like:

@src = @[/shared/demo/bluecoat/bcdata.sxt]; //declare variable @src referring to stored parsed data

@hourlyReq = @src                           //declare query using variable @hourlyReq
                                            //data from @src gets piped to next statement:
 .select(
   dateTime[1 hour] as hourTrunc,           //truncate timestamp to hour precision, assign alias hourTrunc
   count(*) as requests)                    //calculate counts, assign alias requests
 .group(hourTrunc)                          //compute counts by hourTrunc groups
;

@hourlyReq                                  //execute query

/*the query written in SQL style will produce the exact same result:
@hourlyReq =
SELECT
   dateTime[1 hour] as hourTrunc,           //truncate timestamp to hour precision, assign alias hourTrunc
   count(*) as requests                     //calculate counts, assign alias requests
FROM @src
GROUP BY hourTrunc                          //compute counts by hourTrunc groups
;
*/
[Figure bcexp1.png: hourly request counts]

We have approximately a month's worth of records (from April 4th to May 7th, 2005). We can observe three distinct periods: the first ~10 days (April 5th to 14th) with very little traffic (a few tens of hourly requests), then a period of complete silence from April 13th to 27th, and finally the last 10 days with rapidly increasing traffic. Hmm ... it is very unlikely that such a traffic increase is caused by some malicious activity, unless it's a DDoS. The latter should show convergence to very few targets, so it's easy to verify. The more likely hypothesis is that the lab guys tested the initial setup during the first 10 days, then went on to improve it (or to party?) and after two weeks launched the real thing. OK, let's see what the target distribution is:

@hourlyTargets = @src
 .select(
   hourTrunc:dateTime[1 hour],   //truncate timestamp to hour precision, assign alias hourTrunc
   cs_host,
   count(*))                     //compute unique hourTrunc and cs_host pairs
 .group(@1, @2)                  //group by first and second position fields (hourTrunc and cs_host)
 .select(hourTrunc, targets:count(*))    //compute the count of hosts per hourTrunc
 .group(@1)
 .join(@hourlyReq on left.hourTrunc = right.hourTrunc)
 .select(tHour:left.hourTrunc, targets, requests)
;

@hourlyTargets
[Figure bcexp2.png: hourly counts of target hosts and requests]

As we can see, the number of target hosts clearly follows the same pattern as the number of requests during the "live" period. To be sure, we can also compute the Pearson correlation coefficient between the two:

@hourlyTargets
 .select(CORR(targets, requests))    //compute Pearson correlation coefficient
;

which gives us 0.98, i.e. we see a strong correlation between the number of requests and the number of distinct hosts. We can now safely rule out the DDoS hypothesis.
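
For reference, the Pearson correlation coefficient of the two hourly series (targets x_i and requests y_i) is defined as

    r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}

A value close to 1 means the two series rise and fall together, which is exactly what the 0.98 result tells us.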

Next let's have a look at the clients. Their distribution should give us some clue as to how the proxy is set up.

@hourlyClients = @src
 .select(
   hourTrunc:dateTime[1 hour],   //truncate timestamp to hour precision, assign alias hourTrunc
   c_ip,
   count(*))                     //compute unique hourTrunc and c_ip pairs
 .group(@1, @2)
 .select(hourTrunc, clients:count(*)) //count unique clients per hourTrunc
 .group(@1)
 .join(@hourlyReq on left.hourTrunc = right.hourTrunc)
 .select(tHour:left.hourTrunc, clients, requests)
;

@hourlyClients
;

Here the upper diagram displays the first period (the area grayed out on the lower diagram). The lower diagram shows the trend for the entire period.

[Figure bcexp3.png: hourly client counts (upper: first period, lower: entire period)]

We can see constant chatter of a few tens of requests per hour from a handful of clients, mixed with occasional bursts of requests from still the same low number of clients. Considering that the number of targets remains low as well, this is perhaps simulated traffic. Since the overall number of requests and targets is low, we can look directly at the requests from this period:

@testPeriodTraffic = @src
 .filter(dateTime > T(2005,04,08) and dateTime < T(2005,04,12))
 .select(cs_host, cs_uri_path, count(*))
 .group(@1,@2)
 .sort(@3 desc)
;

@testPeriodTraffic;
cs_host cs_uri_path count_star
www.ositis.com /UpdateLocation/ProxyAV/McafeeAntivirusEngineLocation 361
www.ositis.com /UpdateLocation/ProxyAV/McafeeAntivirusPatternLocation 350
www.ositis.com /tests/testconnectivity.asp 171
10.2.13.200 /index.html 117
10.2.13.200 /favicon.ico 95
register.ositis.com /cgi/RegCGI.cgi 68
download.windowsupdate.com /v5a/windowsupdate/b/SelfUpdate/AU/x86/NetServer/en/wusetup.cab 37
10.2.13.200 / 33
download.windowsupdate.com /ws03sp1/windowsupdate/redir/wuredir.cab 23
download.bluecoat.com /release/ProxyAV/2.2.1/2000E/update-22.asp 22
stats.update.microsoft.com /ReportingWebService/ReportingWebService.asmx 20
. . .    

This looks very much like setup- and testing-related activity: updating the McAfee AV engine and virus patterns, downloading and applying operating system updates, setting up the BlueCoat, etc.

The clients graph of the "live" period fits the waves of the requests and targets distributions (requests are not shown for better visibility):

[Figure bcexp4.png: hourly clients and targets during the "live" period]

Let’s compute the correlation to be sure:

@clientRequestCorrelation = @hourlyClients
 .select(CORR(clients, requests))    //compute Pearson correlation coefficient
;

@clientRequestCorrelation;

We get 0.98 again. Considering the quite high number of both clients and target hosts, this looks like real live traffic. This is also suggested by the period of the waves on the graph: in the GMT timezone, traffic starts increasing at 1 pm and dies off at 3 am.

IP Network Analysis

Let's see how the IP countries contribute to the traffic, split into daytime and nighttime hours:

@hourlyCC = @src
 .select(ccIP:cc(c_ip),
             dayTime:count(HOUR(dateTime)>8 and HOUR(dateTime)<16),
             nightTime:count(HOUR(dateTime)<=8 or HOUR(dateTime)>=16))
 .group(@1)
;
@hourlyCC;
[Figure bcexp5.png: daytime vs. nighttime request counts by IP country]

The biggest contributor is an unknown country. Internal address space? Let's see the traffic by class A network (masking the client IP with 0xFF000000 keeps only the first octet, e.g. 45.67.89.10 becomes 45.0.0.0):

@top5Aclass = @src
 .select(c_ip & 0xFF000000l as network, requests:count(*))    //compute count by /8 network
 .group(@1)
 .sort(@2 desc)
 .select(*, percentage:(DOUBLE(requests)/$totalRequests)*100) //percentage of total; $totalRequests is assumed to hold the overall request count computed beforehand
 .limit(5)
;
@top5Aclass;
network requests percentage
45.0.0.0 7999532 98.38808745736779
192.0.0.0 27032 0.3324727971770806
81.0.0.0 19310 0.23749813974139636
10.0.0.0 19181 0.23591153901500383
213.0.0.0 15171 0.18659162496202614

Funny that 98.4% of the traffic originates from the 45.0.0.0/8 range, which in 2005 belonged to the Interop Show Network - Interop being an annual information technology trade fair organized by UBM (https://en.wikipedia.org/wiki/Interop). Coincidentally, the NetWorld(SM)+Interop(R) 2005 conference was held May 1-6, 2005 at the Mandalay Bay Convention Center in Las Vegas (https://www.businesswire.com/news/home/20041207005103/en/NetworldInterop-2005-Showcase-Largest-Assembly-Exhibitors-North). Did the Honeynet guys set up their proxy to intercept conference traffic?

The requests graph @hourlyReq is computed with the timezone set to GMT, which places the activity increase at nighttime. When changing the timezone to PST (which is where Las Vegas happens to be) we observe activity starting to increase at 6-7 am and coming back down to the nightly level by 7 pm. There is also a distinct "saddle" around lunchtime. Looks like normal human behaviour to me.
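
As a rough cross-check of that daily rhythm, we can also group the requests directly by hour of day. Below is a minimal sketch, assuming HOUR() and grouping behave as in the earlier queries and that sort() defaults to ascending order; the shift from GMT to Las Vegas local time (UTC-7 in early May) is left to the reader:

@hourOfDayProfile = @src
 .select(gmtHour:HOUR(dateTime),    //hour of day in GMT; subtract 7 to get Las Vegas local time
         requests:count(*))
 .group(@1)
 .sort(@1)
;
@hourOfDayProfile;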

Content Analysis

What does BlueCoat content filtering say? If this was a computer and information technology related event, we should see Computers/Internet in the top 5:

@contentFilter = @src
 .select(sc_filter_category, count(*)).group(@1).sort(@2 desc).limit(15)
;
@contentFilter;
sc_filter_category count_star
Computers/Internet 1445157
Web%20Advertisements 1429511
none 969769
Business/Economy 863626
News/Media 680855
Search%20Engines/Portals 381611
Email 259397
Travel 257823
Shopping 252155
Sports/Recreation/Hobbies 231239
Arts/Entertainment 141181
Brokerage/Trading 138826
Pornography 118691
Web%20Communications 108456
Education 76700

Sure enough we do, as well as Business/Economy, News/Media and Email in prominent positions. Oh, and porn is there too, in 13th place. In my opinion porn is a good indicator of human behaviour - I have never seen a bot or trojan show any interest in it. Let's have a look at the favourite time for enjoying this sort of entertainment:

@hourlyPorn = @src
 .select(
   hourTrunc:dateTime[1 hour],   //truncate timestamp to hour precision, assign alias hourTrunc
   pornCnt:count(sc_filter_category = 'Pornography'),
   total:count(*))
 .group(@1)
;
@hourlyPorn;
[Figure bcexp6.png: hourly Pornography-category requests vs. total requests]

Apparently there was no porn watching during the testing period (it seems the Honeynet guys were truly committed to their work setting up the lab environment). Porn traffic appears only during the "live" period and during PST daytime. Attendees bored of the conference? Let's see the related traffic by class A network:

@AClassPornNetworks = @src
 .filter(sc_filter_category = 'Pornography')
 .select(c_ip & 0xFF000000l as network, requests:count(*))    //compute count by /8 network
 .group(@1)
 .sort(@2 desc)
;
@AClassPornNetworks;
network requests
45.0.0.0 117813
204.0.0.0 878

Yeah, 99% comes from the conference. Still, we're only human, and this further supports the hypothesis of real live traffic.

Moving on, let’s see the BlueCoat virus information:

@viruses = @src
 .select(x_virus_id,
             testPeriodCnt:count(dateTime < T(2005,04,27)),
             livePeriodCnt:count(dateTime >= T(2005,04,27))
            )
 .group(@1).sort(@3 desc)
;
@viruses;
x_virus_id testPeriodCnt livePeriodCnt
NULL 11509 8118891
EICAR test file 11 81
W32/Sober.gen@MM 0 48
Exploit-MhtRedir.gen 0 8
BackDoor-CPT 0 5
Downloader-RK 0 5
Exploit-ByteVerify 0 4
JS/Exploit-MhtRedir.ldr 0 4
StartPage-DU.dll 0 4
Downloader-LG 0 2
Downloader-YH.dr 0 2
W32/Sober.p@MM!zip 0 2
Downloader-VG 0 1
Downloader-YK 0 1
Exploit-DcomRpc.gen 0 1
Exploit-HelpZonePass 0 1
Exploit-ObscuredHtml 0 1
FDoS-UDPFlood 0 1
Generic Downloader.g 0 1
JS/Noclose.gen 0 1
JV/g 0 1
Prutec 0 1
VBS/Inor 0 1
VBS/Psyme 0 1
Vundo 0 1
W32/Netsky.z@MM!zip 0 1

25 different viruses were identified. They account for only about 0.002% of all requests (roughly 190 virus-flagged requests out of ~8.13 million), and about half of those are not real virus infections but tests (the EICAR test file). Which sites distribute this stuff?

@dropperSites = @src
 .filter(x_virus_id is not null AND x_virus_id != 'EICAR test file')
 .select(cs_host, count(*)).group(@1).sort(@2 desc)
;
@dropperSites;

37 sites. And techlabs.bluecoat.com is among them. How come? They're supposed to be a decent vendor, not a dodgy place to catch viruses! Let's look closer:

@bluecoatTechlabs = @src
 .filter(x_virus_id is not null AND x_virus_id != 'EICAR test file')
 .filter(cs_host = 'techlabs.bluecoat.com')
 .select(dateTime, sc_status, cs_uri_path, x_virus_id, x_virus_details)
;
@bluecoatTechlabs;
dateTime sc_status cs_uri_path x_virus_id x_virus_details
2005-04-28 17:34:12 200 /eicar/noisybear.cab JV/g Virus: JV/g; File: noisybear.cab; Sub File: NoisyBear.class; Vendor: Network Associates, Inc (McAfee); Engine error code: None; Engine version: 10.21; Pattern version: 4.3.20.124865; Pattern date: 04/28/2005 04:40:00

noisybear.cab is Mark LaDue's proof-of-concept malicious Java applet (http://www.cigital.com/hostile-applets). It displays a picture of a bear emitting an annoying noise and is used for testing AV detection, similarly to EICAR. We can safely exclude it from the real dropper sites.

How many clients were infected? With which viruses? When? The following query produces a list of client IPs with real infection attempts. In an enterprise environment this would serve as a good starting point for CSIRT mitigation actions.

@infectedClients = @src
 .filter(x_virus_id is not null AND sc_status = 200)
 .filter(x_virus_id != 'EICAR test file')
 .filter(cs_host != 'techlabs.bluecoat.com')
 .select(
   c_ip,
   x_virus_id,
   min_dateTime:min(dateTime))      //get the earliest client timestamp of infection with a virus
 .group(@1, @2)
;
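
To turn this into the headline numbers quoted in the conclusion (how many distinct clients actually downloaded viruses), here is a minimal follow-up sketch, assuming chained aggregations work the same way as in the queries above:

@infectedClients
 .select(c_ip, count(*))                   //collapse to one row per infected client
 .group(@1)
 .select(infectedClientCount:count(*))     //count the remaining rows, i.e. distinct infected clients
;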

Conclusion

The logs provided us with records of web traffic from April 4th to May 7th, 2005. They include three distinct periods, which we identified as:

  1. a test period of initial setup (April 5th to 14th),
  2. a period of complete silence (April 13th to 27th), and
  3. a "live" period from April 27th to May 7th.

The testing period is characterized by a relatively low number of requests (ten to a few hundred per hour) to a few selected sites, related to setting up the BlueCoat and the antivirus detection engine and updating the operating system (Win XP SP1).

The silence period could mean the lab folks spent their time setting up the surrounding environment for the live period.

Finally, the "live" period is characterized by a rapidly increasing amount of traffic, peaking on May 5th at 260,000 requests per hour and then gradually decreasing. The numbers of requests, originating clients and target hosts are high enough to make artificial generation very complicated and unlikely. The distributions of originating clients and target hosts follow the activity pattern of the traffic very closely.

The activity patterns match normal daily human activity: a rapid increase from the early morning hours and a decrease in the afternoon, with a distinct lower "saddle" in between indicating a lunch break. The timezone matching this pattern places the originating activity on the US west coast (PST). Another aspect pointing to the human origin of the traffic is the presence of visits to adult sites.

Network analysis strongly suggests that the traffic originates from participants of the NetWorld(SM)+Interop(R) conference held May 1-6 at the Mandalay Bay Convention Center in Las Vegas.

Observed virus infections were relatively rare: only 0.002% of the traffic relates to known viruses, and half of that relates to testing antivirus engines. Out of 3499 clients, 49 performed real virus downloads (i.e. infection attempts).

You can download the whole script from here. The logs are available for download from http://log-sharing.dreamhosters.com/bluecoat_proxy_big.zip