JSON Arrays

JSON_ARRAY

JSON syntax allows constructing arrays in a variety of ways. They can range from very simple primitive type members to very complex nested object members. A JSON array can contain members of a different type. And it is perfectly valid that they can appear also outside of a JSON object. JSON_ARRAY is meant to handle all of them.

output type:

VARIANT_ARRAY (default) or ARRAY (depending on typed configuration parameter setting)

quantifier:

none

configuration:

charset = character set name enclosed in single or double quotes (for example charset="ISO-8859-1")

locale = string specifying IETF BCP 47 language tag enclosed in single or double quotes (see the list here ). The default locale is English.

typed = Boolean value, allowing to control JSON_ARRAY output type: “true” sets the output type to ARRAY, “false” sets output type to VARIANT_ARRAY. Default “false”

strict = Boolean value. “false” allows parsing arrays not following JSON specification. Unquoted JSON names and string values consisting of one word can be parsed. Default is “true”. See Parsing non-standard JSON objects.

maxlen = numeric value representing the maximum byte size of an array. Allows parsing large JSON arrays (exceeding default size of 128000 bytes).

The easiest is to use without any parameters. It will enumerate all array elements, transform them to SpectX data types from their declared Json type and returns parsed elements as VARIANT_ARRAY object.

Example. Consider the following array with different type elements:

[5,null,2.2,"17.9","5\u0037\u0037\u0037\u0037\u00372"]

Pattern:

1
JSON_ARRAY:array

Resulting VARIANT_ARRAY holds all parsed array elements (double-click on the resultset row to see the details):

name value type
array[0] 5 VARIANT<LONG>
array[1] null VARIANT
array[2] 2.2 VARIANT<DOUBLE>
array[3] 17.9 VARIANT<STRING>
array[4] 5777772 VARIANT<STRING>

Controlling Output Type and Conversion

JSON_ARRAY{ type }

Converts all array elements to explicitly set type.

Note

When conversion fails the entire output will be set to NULL.

Configuration parameter typed=true will change the output to ARRAY.

Conversion to explicit types allows you to use array expressions in accessing array elements in query statements (variant types require using Cast).

Example:

[223423,-343.8e7,null,"3.14"]

Following pattern attempts converting all array elements to DOUBLE, with output set to ARRAY:

1
JSON_ARRAY{ DOUBLE }(typed=true):double_arr

The result (double-click on the resultset row to see the details).

name value type
double_arr[0] 223423.0 DOUBLE
double_arr[1] -3.438E9 DOUBLE
double_arr[2] NULL DOUBLE
double_arr[3] 3.14D DOUBLE

Parsing Large Arrays

Sometimes you may encounter datasets, where Json formatted records are members of an array. Attempting to parse such datasets using JSON_ARRAY makes no sense because:

  • you want to get records as rows in a resultset, but JSON_ARRAY returns one array object
  • the overall size most likely exceeds the capacity of JSON_ARRAY (even when using maxlen configuration parameter)

Instead, you should threat such datasets as a list of JSON objects separated by a comma and ignore the square brackets of enclosing array.

Important

It is highly recommended using JSON object semantic validation to define the array member object. This avoids unmatched objects near the border of chunks of input data. [1]

Example. Consider an array with a large number of JSON objects as elements:

[{"id":"1","a":[3,6],"b":{"foo":"bar"}},{"id":"2","a":[30,2673,1]},{"id":"256000","a":[256,1193,2],"b":{"foo":"nobar"}}]

Following pattern parses Json object records by ignoring the encapsulating array.

1
2
3
(BOF '[')?
JSON_OBJECT{INT+:id}(greedy='others'):record ','?
(']' EOF)?

where:

  1. Matches opening square brackets at the beginning of the file. The sequence is made optional, to allow pattern match for bytes in the middle of a file.
  2. Extracts Json object to resultset field record, followed by a comma (optional since the last object has no trailing comma). Note that the id is made mandatory to avoid data loss [1]. Other members are extracted automatically into others field.
  3. Matches closing brackets at the end of a file. The sequence is made optional, to allow pattern match for bytes in the middle of a file.

Result:

record _unmatched
{id=1 others={‘a’:[3,6],’b’:{‘foo’:’bar’}}} NULL
{id=2 others={‘a’:[30,2673,1]}} NULL
 
{id=256000 others={‘a’:[256,1193,2],’b’:{‘foo’:’nobar’}}} NULL
[1](1, 2)

SpectX splits input data into 64Mb chunks and processes them in parallel. To guarantee non-duplication of parsed fields the pattern must match the smallest size of data block within its maximum length. I.e - a pattern which matches data block of certain size must not match a smaller data block inside it.

This condition is almost always satisfied, except when JSON is used in automatic extraction mode, either alone (when parsing Concatenated JSON) or with optional matchers (comma-separated list of JSON objects). Then the pattern may also match embedded objects within a JSON object.

To avoid this happening use JSON object semantic validation - to distinguish the largest extracted JSON object from its embedded objects, declare a minimum set of its top-level members as mandatory. The rest of the members can be extracted automatically by specifying greedy parameter to JSON matcher. See the example above.