Linux Syslog

Linux BSD formatted syslog is an example of log with events from different applications. For instance, the authorization log contains events about SSH logins, sudo attempts or user and group creations. They are written by different operating system components using the local Syslog server. And it also means that the format of the records are the same Syslog format but the message part of them is different based on which program is writing each particular line.

Example:

Nov 26 23:51:59 dong sudo:    iiton : TTY=pts/30 ; PWD=/home/iiton ; USER=root ; COMMAND=/usr/bin/apt install postgresql-client-common postgresql-client
Nov 28 11:48:31 dong sudo: pam_unix(sudo:auth): auth could not identify password for [iiton]
Dec  8 15:17:56 dong sshd[1564]: Accepted password for iiton from 83.166.53.102 port 51413 ssh2
Dec 15 06:25:03 dong sshd[14854]: Failed password for invalid user openssl from 35.224.205.57 port 49816 ssh2
Dec 15 06:25:26 dong sshd[14992]: Failed password for root from 49.88.112.69 port 57915 ssh2
Dec 15 06:25:31 dong sshd[14992]: last message repeated 2 times: [ Failed password for root from 49.88.112.69 port 57915 ssh2]
Dec 16 10:43:32 dong sshd[9559]: Accepted publickey for iiton from 90.191.124.115 port 59965 ssh2: RSA SHA256:jzw04DjKPoQnzNMFLoBjXkO0wojJDp+KgSfnT6laQgA

Hint

You can find sample log file by navigating with Input Data Browser to s3s://spectx-docs/formats/log/syslog/auth.log.gz

Parse

As you can see, the BSD Syslog format is pretty simple, the date-time is followed by hostname, process and (optionally) PID enclosed in square brackets. Last until the end of the line comes the message part.

Generic pattern to parse Syslog events:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
/*
    Pattern for parsing BSD syslog format https://tools.ietf.org/html/rfc3164
*/
/* declare the header fields:     */
$hdr =
TIME('MMM d HH:mm:ss', tz='UTC'):syslog_time // timestamp, followed by a space
' '
LD:host ' '                                  // host/ip-addr, followed by a space
[!:\[ ]+:process                             // process name
('[' INT:pid ']')?                           // optional pid, enclosed in sq. brackets
':'? ' '?                                    // followed by optional colon and space
;

/* declare the message field: */
$msg =
DATA{0,250000}:message                       // capture message of 0 to 250kB until
(EOF | (EOL >> TIME('MMM d HH:mm:ss')));     // the end of file/stream or next line
                                             // beginning with a timestamp
/*declare the repeated message fields: */
$repeated =                                  // export header fields separately
TIME('MMM d HH:mm:ss', tz='UTC'):rep_time    // repeated msg timestamp
' '
LD:rep_host ' '                              // repeated msg host/ip-addr
[!:\[ ]+:rep_process                         // repeated msg process name
('[' INT:rep_pid ']')?                       // repeated msg pid
':'? ' '?
LD*                                          // skip various prefixed phrases of
                                             // different syslog implementations
' message repeated ' INT:count ' times'      // extract repeated count
 LD*:rep_message EOL                         // extract original message (if any)
;

// the main pattern consists of:
// header followed by message followed optionally by repeated message
// (exported as tuple field named 'repeated')
$hdr $msg $repeated?:repeated

When parsing example data above with this pattern we get:

syslog_time host process pid message repeated
2019-11-26 23:51:59.000 +0000 dong sudo NULL iiton : TTY=pts/30 ; PWD=/home/iiton ; USER=root ; COMMAND=/usr/bin/apt install postgresql-client-common postgresql-client NULL
2019-11-28 11:48:31.000 +0000 dong sudo NULL pam_unix(sudo:auth): auth could not identify password for [iiton] NULL
2019-12-08 15:17:56.000 +0000 dong sshd 1564 Accepted password for iiton from 83.166.53.102 port 51413 ssh2 NULL
2019-12-15 06:25:03.000 +0000 dong sshd 14854 Failed password for invalid user openssl from 35.224.205.57 port 49816 ssh2 NULL
2019-12-15 06:25:26.000 +0000 dong sshd 14992 Failed password for root from 49.88.112.69 port 57915 ssh2 {rep_time=2019-12-15 08:25:31.000 +0200 rep_host=”dong” rep_process=”sshd” rep_pid=14992 count=2 rep_message=”: [ Failed password for root from 49.88.112.69 port 57915 ssh2]”}
2019-12-16 10:43:32.000 +0000 dong sshd 9559 Accepted publickey for iiton from 90.191.124.115 port 59965 ssh2: RSA SHA256:jzw04DjKPoQnzNMFLoBjXkO0wojJDp+KgSfnT6laQgA NULL

Note

There is one tricky part though - the “last message repeated n times” feature. Although this space-saving compression method is outdated and unreliable (see discussion here) it is still switched on by default in many Linux distributions. See discussion below of considerations of why and how to use it.

Normalizing Query

When “last message repeated” is in use we’d like to present the Syslog records with their true occurrence count. We use the user defined function $get_repeated_count to make sure the repeat count is genuine :

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
LIST(src:'s3s://spectx-docs/formats/log/syslog/auth.log.gz')
| parse(pattern:FETCH('https://raw.githubusercontent.com/spectx/resources/master/examples/patterns/syslog/syslog.sxp'))
| select(syslog_time,
         host,
         process,
         pid,
         message,
         count:1 + $get_repeated_count(syslog_time, host, process, pid,
                        repeated[rep_time], repeated[rep_host],
                        repeated[rep_process], repeated[rep_pid],
                        repeated[count]))
;
syslog_time host method result username invalid_user ip port count
2019-12-08 15:17:56.000 +0000 dong password Accepted iiton false 83.166.53.102 51413 1
2019-12-15 06:25:03.000 +0000 dong password Failed openssl true 35.224.205.57 49816 1
2019-12-15 06:25:26.000 +0000 dong password Failed root false 49.88.112.69 57915 3
2019-12-16 10:43:32.000 +0000 dong publickey Accepted iiton false 90.191.124.115 59965 1

Note

You may have noticed that the syslog_time does not contain the year. SpectX automatically adjusts the parsed timestamp value to the year from file’s path_time or last_modified dates (taken from LIST command output).

By saving the script as /user/syslog.sx we create a normalised view for parsing any syslog format log. To analyze messages from a specific application we need to filter the messages from that process and parse the message field according to the expected structure.

Example: Extract information from the sshd process for analyzing unauthorized login attempts from the Linux /var/log/auth.log. The messages we’re going to extract are from the sshd records in the example data above (ignoring other messages from the sshd process).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
@[/user/syslog.sx]
| filter (process = 'sshd')
| select(det:PARSE(FETCH('https://raw.githubusercontent.com/spectx/resources/master/examples/patterns/syslog/authlog-login-msg.sxp'),
                   message),*)
| filter(det is not null)
| select(syslog_time,
       host,
       det[method],
       det[result],
       det[username],
       invalid_user:IF(det[invalid_user] is not null, true, false),
       det[ip],
       det[port],
       count)
;

where line

  1. calls our saved Syslog view
  2. include only messages from sshd process
  3. parse messages
  4. filter out messages which failed to parse (i.e those what we ignore)
  5. select fields of relevance
syslog_time host method result username invalid_user ip port count
2019-11-17 16:30:06.000 +0000 dong password Failed root false 49.88.112.70 36486 1
2019-11-17 23:46:44.000 +0000 dong password Failed root false 49.88.112.70 28819 3
2019-11-18 08:13:30.000 +0000 dong password Failed root false 49.88.112.70 17368 3
2019-11-18 18:58:31.000 +0000 dong password Failed root false 218.92.0.181 49368 6
2019-11-19 04:16:18.000 +0000 dong password Failed root false 218.92.0.181 60477 1

Now from this output, it is very easy to query all kinds of statistics. Let’s save it so we can use it as a view /user/authlog-logins.sx Let’s find out how many attempts of authentication methods and the results we have:

1
2
3
4
5
[/user/authlog-logins.sx]
| select(method, result, sum:sum(count))
| group(@1)
| sort(sum DESC)
;
method result sum
password Failed 962.0
none Failed 37.0
publickey Accepted 4.0

Hint

You can download full code of the patterns and queries at https://github.com/spectx/resources/tree/master/examples/patterns/syslog

Repeated Messages

The decision whether to use repeated count or not must be given some consideration. The repeated message must follow immediately the original message in the log. However, it is not always so - the Syslog implementations have been reported generating these out of order, same can happen at the collection of relayed messages from remote hosts.

Largely due to this reason it has been suggested by many that the repeated message count should be ignored when analyzing logs. This may lead to significant errors though. For instance, it is not uncommon for a host exposed to Internet getting ssh login scannings in volume triggering Syslog information reduction. The counts in those cases usually exceed the count of logged attempts.

To allow to use repeated count more reliably we extract all elements of repeated record: time, host, process, PID and original message (see https://www.rsyslog.com/doc/v8-stable/configuration/global/index.html $RepeatedMsgContainsOriginalMsg). Ideally, the host, process, PID, and message must be the same when classifying a message as repeated. However the different versions and implementations of Syslog are behaving differently in providing process, PID and the original message.

Hence It becomes your choice whether and which one of the available elements you want to take into account when making calculations with the “repeated count”.

Example: user defined function for getting “repeated count” on condition that i)host, process, PID must be the same, and ii) the time between original and repeated messages is within the allowed period.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
$get_repeated_count(time,   host,   process,   pid,
                    r_time, r_host, r_process, r_pid, rep_count) =
    IF( $r_time IS NOT NULL AND         //if no repeated info then return 0
      $time[2 min] = $r_time[2 min] AND //repeated time within allowed timeframe
      $host = $r_host AND               //repeated host is the same as original
      $process = $r_process AND         //repeated process is the same as original
      $pid = $r_pid                     //pid is the same as original
      ,$rep_count  //if all conditions above are true only then return repeated count
      ,0)          //else return 0
;