JSON

JavaScript Object Notation (JSON) has become a widely accepted format for data exchange. The following examples illustrate different ways of using JSON objects in datasets.

JSON Lines

The terms JSON Lines (jsonl), newline-delimited JSON (ndjson), and line-delimited JSON (ldjson) all refer to a format where each line (separated by LF or CRLF) contains a valid JSON object:

{"name":"Homer Simpson","age":40,"married":true,"cars":["pink sedan"]}<LF>
{"name":"Charles Montgomery Burns","age":104,"married":false,"cars":["936 Stutz Bearcat","Bugatti Royale","Ford Quadricycle"]}<LF>
{"name":"Nedward Flanders Jr.","age":60,"married":false,"cars":["Geo Metro", "Honda Civix CX"]}<LF>

The pattern for parsing this is very simple:

JSON:simpson_character (EOL | WINEOL)
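
The JSON matcher extracts all members automatically. If you prefer to declare the expected members explicitly (for instance, to reject malformed records up front), the member declaration syntax used in the Concatenated JSON example below should also work here; a minimal sketch reusing the member names from the sample data:

JSON{
    STRING+:name,
    INT+:age,
    BOOLEAN+:married,
    JSON_ARRAY:cars
}:simpson_character (EOL | WINEOL)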

Pretty Printed JSON

RFC 8259 allows JSON objects to be formatted with embedded newline characters for better human readability. Here is a pretty-printed version of the example above:

{"name":"Homer Simpson",
    "age":40,
    "married":true,
    "cars":["pink sedan"]}
{
    "name":"Charles Montgomery Burns",
    "age":104,
    "married":false,
    "cars":["936 Stutz Bearcat","Bugatti Royale","Ford Quadricycle"]}
{
    "name":"Nedward Flanders Jr.",
    "age":60,
    "married":false,
    "cars":["Geo Metro", "Honda Civix CX"]
}

The JSON matcher has built-in support for pretty-printed JSON objects, hence we can use exactly the same pattern:

JSON:simpson_character (EOL | WINEOL)

JSON Sequence

JSON Sequence is another format for serializing large sequences of JSON objects (see RFC 7464):

<RS>{"name":"Homer Simpson","age":40,"married":true,"cars":["pink sedan"]}<LF>
<RS>{"name":"Charles Montgomery Burns","age":104,"married":false,"cars":["936 Stutz Bearcat","Bugatti Royale","Ford Quadricycle"]}<LF>
<RS>{"name":"Nedward Flanders Jr.","age":60,"married":false,"cars":["Geo Metro", "Honda Civix CX"]}<LF>

It differs from line-delimited JSON only in that each JSON object is prefixed with a record-separator character (RS, 0x1E). Hence it is very easy to adapt the pattern from the previous example:

'\x1e' JSON:simpson_character (EOL | WINEOL)

Concatenated JSON

Concatenated JSON is a way of streaming multiple JSON objects without any delimiters between them. Note that it also allows pretty-printed JSON objects (provided that the reader is able to parse the objects correctly):

{"name":"Homer Simpson","age":40,"married":true,"cars":["pink sedan"]}{
"name":"Charles Montgomery Burns",
"age":104,
"married":false,
"cars":["936 Stutz Bearcat","Bugatti Royale","Ford Quadricycle"]}{
    "name":"Nedward Flanders Jr.",
    "age":60,
    "married":false,
    "cars":["Geo Metro", "Honda Civix CX"]
}

Parsing this format with the JSON matcher in automatic extraction mode may cause data loss (i.e. unmatched objects) at the borders of input data chunks [1]. To avoid this, it is highly recommended to declare a minimum set of top-level members as mandatory. See also Semantic validation of a JSON object.

JSON{
    STRING+:name,
    INT+:age,
    BOOLEAN+:married,
    JSON_ARRAY:cars
}:simpson_character

Large JSON Arrays

Sometimes you may encounter datasets where JSON-formatted records are members of an array. Attempting to parse such datasets with JSON_ARRAY makes no sense because:

  • you want to get records as rows in a resultset, but JSON_ARRAY returns a single array object
  • the overall size most likely exceeds the capacity of JSON_ARRAY (even when using the maxlen configuration parameter)

Instead, you should treat such datasets as a list of JSON objects separated by commas and ignore the square brackets of the enclosing array.

Important

It is highly recommended to use JSON object semantic validation when defining the array member object. This avoids unmatched objects near the borders of input data chunks. [1]

Example. Consider an array with a large number of JSON objects as elements:

[{"id":"1","a":[3,6],"b":{"foo":"bar"}},{"id":"2","a":[30,2673,1]},{"id":"256000","a":[256,1193,2],"b":{"foo":"nobar"}}]

The following pattern parses the JSON object records while ignoring the enclosing array:

1    (BOF '[')?
2    JSON_OBJECT{INT+:id}(greedy='others'):record ','?
3    (']' EOF)?

where:

  1. Matches the opening square bracket at the beginning of the file. The sequence is made optional to allow the pattern to match bytes in the middle of a file.
  2. Extracts a JSON object into the resultset field record, followed by a comma (optional, since the last object has no trailing comma). Note that id is made mandatory to avoid data loss [1]. Other members are extracted automatically into the others field.
  3. Matches the closing square bracket at the end of the file. The sequence is made optional to allow the pattern to match bytes in the middle of a file.
[1]

SpectX splits input data into 64 MB chunks and processes them in parallel. To guarantee non-duplication of parsed fields, a pattern that matches a data block of a certain size must not also match a smaller data block inside it.

This condition is almost always satisfied, except when the JSON matcher is used in automatic extraction mode, either alone (when parsing Concatenated JSON) or together with optional matchers (a comma-separated list of JSON objects). In such cases the pattern may also match objects embedded within a JSON object.

To avoid this, use JSON object semantic validation: to distinguish the top-level extracted JSON object from the objects embedded inside it, declare a minimum set of its top-level members as mandatory. The rest of the members can still be extracted automatically by specifying the greedy parameter of the JSON matcher.
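
As an illustration, a minimal sketch for the Concatenated JSON stream above, assuming the greedy parameter accepts a field name for the JSON matcher just as it does for JSON_OBJECT in the array example: name is declared mandatory, so an embedded object lacking it cannot be matched on its own, while all remaining members are still collected into the others field:

JSON{
    STRING+:name
}(greedy='others'):record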

Non-standard JSON

By default, the JSON matcher treats JSON object members strictly according to RFC 8259 syntax. However, under certain circumstances relaxing this validation may be a good idea, for instance when investigating malicious behavior or debugging an application that produces invalid JSON objects. This can be achieved by setting the strict configuration parameter to false.

Example: Suppose an incoming HTTP request contains the following JSON object with a hostname, department name, and IP address. Note the unquoted member names and string values, and the IP address presented as a Unicode-escaped character string:

{
host:"example.com"
    ,
"department" : HR    ,
    ip : "\u0031\u0039\u0032.\u0031\u0036\u0038.\u0034\u0037.\u0030"
}

Pattern:

JSON(strict=false):data

The result is (double-click on the resultset row to see the details):

name              value          type
data[host]        example.com    VARIANT<STRING>
data[department]  HR             VARIANT<STRING>
data[ip]          192.168.47.0   VARIANT<IPADDR>
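
If you also need to guarantee that specific members are present in such relaxed input, it may be possible to combine member declarations with the strict parameter; the following is only a sketch under that assumption, mirroring the declaration and parameter syntax of the array example above:

JSON{
    STRING+:host
}(strict=false, greedy='others'):data

Here host would be mandatory, while the remaining members (department, ip) would still be extracted automatically into the others field.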

See also

parsing JSONified Windows Event Log