Distributed Grep of Credit Card Numbers and SSNs

Sample scripts: scripts.tar.gz

Data privacy laws and regulations may require keeping logs free of sensitive data such as credit card numbers and social security numbers. Finding these in logs with regular expressions alone is difficult: in addition to syntax rules, such identifiers also follow semantic rules (such as a checksum), which a regular expression cannot verify.

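Example 1: Finding Finnish social security numbers. First we need a function that verifies the checksum of the SSN. The checksum character is selected by the remainder of dividing the nine-digit number (the birth date followed by the registration sequence) by 31.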

/**** functions to verify checksum ****/
$lookup(checkSum) =
   CASE
     WHEN $checkSum == 0 THEN '0'
     WHEN $checkSum == 1 THEN '1'
     WHEN $checkSum == 2 THEN '2'
     WHEN $checkSum == 3 THEN '3'
     WHEN $checkSum == 4 THEN '4'
     WHEN $checkSum == 5 THEN '5'
     WHEN $checkSum == 6 THEN '6'
     WHEN $checkSum == 7 THEN '7'
     WHEN $checkSum == 8 THEN '8'
     WHEN $checkSum == 9 THEN '9'
     WHEN $checkSum == 10 THEN 'A'
     WHEN $checkSum == 11 THEN 'B'
     WHEN $checkSum == 12 THEN 'C'
     WHEN $checkSum == 13 THEN 'D'
     WHEN $checkSum == 14 THEN 'E'
     WHEN $checkSum == 15 THEN 'F'
     WHEN $checkSum == 16 THEN 'H'
     WHEN $checkSum == 17 THEN 'J'
     WHEN $checkSum == 18 THEN 'K'
     WHEN $checkSum == 19 THEN 'L'
     WHEN $checkSum == 20 THEN 'M'
     WHEN $checkSum == 21 THEN 'N'
     WHEN $checkSum == 22 THEN 'P'
     WHEN $checkSum == 23 THEN 'R'
     WHEN $checkSum == 24 THEN 'S'
     WHEN $checkSum == 25 THEN 'T'
     WHEN $checkSum == 26 THEN 'U'
     WHEN $checkSum == 27 THEN 'V'
     WHEN $checkSum == 28 THEN 'W'
     WHEN $checkSum == 29 THEN 'X'
     WHEN $checkSum == 30 THEN 'Y'
     ELSE 'Z'
   END
;

$verify(birthDate, seq, checkSum) =
  $lookup(PARSE('INT:i',$birthDate+$seq)%31) = $checkSum;
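
To cross-check the arithmetic outside SpectX, the same rule can be sketched in Python. This is only an illustration of what $lookup and $verify compute; the name `verify_hetu` is hypothetical and not part of the sample scripts.

```python
# Check characters indexed by remainder 0..30 (same mapping as $lookup above).
CHECK_CHARS = "0123456789ABCDEFHJKLMNPRSTUVWXY"


def verify_hetu(birth_date: str, seq: str, check_sum: str) -> bool:
    """Return True if check_sum matches (birth_date + seq) modulo 31."""
    # Concatenate as text, then parse the nine digits as one integer.
    number = int(birth_date + seq)
    return CHECK_CHARS[number % 31] == check_sum


print(verify_hetu("311299", "987", "2"))  # True: 311299987 % 31 == 2
print(verify_hetu("640823", "323", "4"))  # False: 640823323 % 31 == 3
```

The string lookup replaces the long CASE expression; since the remainder is always in the range 0..30, no fallback character is needed here.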

Next, the pattern that describes the syntax of the SSN:

/**** pattern to describe syntax of Finnish SSN ****/
$ssnPattern = <<<PATTERN_END
 <pos>:pos(                 // metafield containing position of following pattern
    DIGIT{6,6}:birthDate    // export birthDate part
    ('-' | '+' | 'A')       // century id
    DIGIT{3,3}:seq          // export registration sequence
    [0-9A-Y]:checkSum       // export checksum
 ):ssnCode;
PATTERN_END;
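
For comparison, a rough regular-expression analogue of the pattern can be written in Python. This is a sketch under assumptions: the names `SSN_RE` and `find_candidates` are hypothetical, the offsets are character offsets within a string, and SpectX's own matching and position semantics may differ.

```python
import re

# Hypothetical regex counterpart of $ssnPattern; group names mirror the exported fields.
SSN_RE = re.compile(
    r"(?P<birthDate>\d{6})"    # six digits: birth date
    r"[-+A]"                   # century id
    r"(?P<seq>\d{3})"          # registration sequence
    r"(?P<checkSum>[0-9A-Y])"  # checksum character
)


def find_candidates(text: str):
    """Return (offset, ssnCode, birthDate, seq, checkSum) for each syntactic match."""
    return [
        (m.start(), m.group(0), m["birthDate"], m["seq"], m["checkSum"])
        for m in SSN_RE.finditer(text)
    ]


sample = "id 311299-9872 and 010101A999T"
for row in find_candidates(sample):
    print(row)
```

Both strings in the sample match the syntax, which is exactly why the checksum has to be verified as a separate step.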

Now we are ready to check. Let's see whether the web page describing the Finnish SSN structure contains valid samples (a sample access log is scanned in the same query):

/* execute root statement */
PARSE(pattern:$ssnPattern,
      src:['http://www.tuomas.salste.net/doc/hetu/tunnus.html',
           'https://docs.spectx.com/_downloads/access.log']
 ).select(_uri,
         ssnCode,
         pos as locatedAt,
         $verify(birthDate, seq, checkSum) as isValid
 );

Result:

_uri                                               ssnCode      locatedAt  isValid
http://www.tuomas.salste.net/doc/hetu/tunnus.html  311299-9872  8842       true
http://www.tuomas.salste.net/doc/hetu/tunnus.html  640823-3234  13593      false
http://www.tuomas.salste.net/doc/hetu/tunnus.html  640823+3234  13625      false
http://www.tuomas.salste.net/doc/hetu/tunnus.html  010101A999T  22750      false
http://www.tuomas.salste.net/doc/hetu/tunnus.html  210198-118E  34235      true
http://www.tuomas.salste.net/doc/hetu/tunnus.html  120672-063K  35263      false

Example 2: Does my Apache access log contain credit card numbers? They come in a variety of lengths, formats and even encodings (in particular when they are submitted via URLs). It is actually very simple: just use CREDITCARD in the pattern.

$pattern = '<pos>:pos CREDITCARD:cc';

PARSE(pattern:$pattern, src:'https://docs.spectx.com/_downloads/access.log')
 .select(_uri, *);
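
The text above does not say which checks CREDITCARD performs internally. Card numbers conventionally end in a Luhn check digit, so the Python sketch below illustrates the kind of semantic rule involved, including the URL-encoding problem. The name `luhn_valid` and the accepted length range of 12 to 19 digits are assumptions made for this illustration, not a description of SpectX.

```python
import re
from urllib.parse import unquote


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in candidate pass the Luhn check."""
    # Decode URL escapes first, so "%20" becomes a space rather than the digits 2 and 0.
    digits = [int(ch) for ch in re.sub(r"\D", "", unquote(candidate))]
    if not 12 <= len(digits) <= 19:  # assumed plausible card number lengths
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))        # True (well-known test number)
print(luhn_valid("4111%201111%201111%201111"))  # True after URL decoding
print(luhn_valid("4111111111111112"))           # False
```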