CSV

Specification

RFC 4180 proposes a specification for the CSV format; however, actual practice often does not follow the RFC and the term “CSV” might refer to any file that:

  1. is plain text using a character set such as ASCII, various Unicode character sets (e.g. UTF-8), EBCDIC, or Shift JIS,
  2. consists of records (typically one record per line),
  3. with the records divided into fields separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
  4. where every record has the same sequence of fields.

Within these general constraints, many variations are in use. Therefore, without additional information (such as whether RFC 4180 is honored), a file claimed simply to be in “CSV” format is not fully specified. As a result, many applications supporting CSV files allow users to preview the first few lines of the file and then specify the delimiter character(s), quoting rules, etc. If a particular CSV file’s variations fall outside what a particular receiving program supports, it is often feasible to examine and edit the file by hand (i.e., with a text editor) or write a script or program to produce a conforming format.
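As a rough illustration of why the dialect has to be stated explicitly, the sketch below uses Python's standard csv module to read semicolon-delimited data twice: once with the default comma and once with the delimiter the data actually uses. The sample rows and the helper name parse_records are made up for this example.

```python
import csv
import io


def parse_records(text, delimiter=",", quotechar='"'):
    """Parse CSV text with an explicitly chosen delimiter and quote character."""
    # newline="" leaves line endings untouched so the csv module can handle them.
    reader = csv.reader(
        io.StringIO(text, newline=""), delimiter=delimiter, quotechar=quotechar
    )
    return list(reader)


# Hypothetical sample: same kind of record as below, but semicolon-delimited.
sample = "1997;Ford;E350\n2000;Mercury;Cougar\n"

# With the default comma delimiter each line comes back as a single field.
wrong = parse_records(sample)

# With the delimiter the data actually uses, the three fields are separated.
right = parse_records(sample, delimiter=";")
```

The same text yields different records depending on the chosen dialect, which is why a bare "CSV" label is not enough to parse a file reliably.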

The examples below show CSV data that follows the RFC 4180 rules, each with a SpectX pattern that parses it.

Examples

Example 1: Basic CSV Data.

1997,Ford,E350

Pattern:

INT:year ',' LD:make ',' LD:model EOL

Result record:

field value type
year 1997 INTEGER
make Ford STRING
model E350 STRING
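For comparison outside SpectX, a minimal Python sketch of the same typed result might look like the following. It assumes the standard csv module; the function name parse_basic is invented here, and the int() conversion stands in for the INT matcher in the pattern above.

```python
import csv
import io


def parse_basic(line):
    """Split one unquoted CSV line and convert the first field to an integer."""
    year, make, model = next(csv.reader(io.StringIO(line, newline="")))
    # csv.reader returns every field as a string; convert the year explicitly.
    return {"year": int(year), "make": make, "model": model}


record = parse_basic("1997,Ford,E350\n")
```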

Example 2: Any field may be quoted (that is, enclosed within double-quote characters).

"1997","Ford","E350"

Pattern:

CSVDQS:year ',' CSVDQS:make ',' CSVDQS:model EOL

Result record:

field value type
year 1997 STRING
make Ford STRING
model E350 STRING

Example 3: Fields with embedded commas or double-quote characters must be quoted.

1997,Ford,E350,"Super, luxurious truck"

Pattern:

CSVDQS:year ',' CSVDQS:make ',' CSVDQS:model ',' CSVDQS:txt EOL

Result record:

field value type
year 1997 STRING
make Ford STRING
model E350 STRING
txt Super, luxurious truck STRING

Example 4: Each of the embedded double-quote characters must be represented by a pair of double-quote characters.

1997,Ford,E350,"Super, ""luxurious"" truck"

Pattern:

CSVDQS:year ',' CSVDQS:make ',' CSVDQS:model ',' CSVDQS:txt EOL

Result record:

field value type
year 1997 STRING
make Ford STRING
model E350 STRING
txt Super, "luxurious" truck STRING
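The quote-doubling rule can be checked with a short round trip. The sketch below assumes Python's standard csv module, whose default dialect doubles embedded quotes; it is an illustration of the rule, not of how SpectX implements CSVDQS.

```python
import csv
import io

line = '1997,Ford,E350,"Super, ""luxurious"" truck"\n'

# Reading: each pair of double quotes inside a quoted field becomes one quote.
fields = next(csv.reader(io.StringIO(line, newline="")))

# Writing: the default dialect quotes only the field that needs it
# and doubles the embedded quotes again.
out = io.StringIO(newline="")
csv.writer(out, lineterminator="\n").writerow(fields)
written = out.getvalue()
```

Here the written text is identical to the input line, so the escaping survives a read and write cycle unchanged.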

Example 5: Fields with embedded line breaks must be quoted (although many CSV implementations do not support embedded line breaks).

1997,Ford,E350,"Go get one now
they are going fast"

Pattern:

LD:year ',' LD:make ',' LD:model ',' CSVDQS:txt EOL

Result record:

field value type
year 1997 STRING
make Ford STRING
model E350 STRING
txt Go get one now
they are going fast STRING

The txt value spans two lines because it keeps the embedded line break.
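The practical consequence of embedded line breaks is that a record can no longer be found by splitting the input on newlines. The following sketch, which assumes Python's standard csv module, contrasts a naive line split with a quote-aware reader on the data from Example 5.

```python
import csv
import io

text = '1997,Ford,E350,"Go get one now\nthey are going fast"\n'

# Naive approach: splitting on line breaks cuts the quoted field in half.
naive_lines = text.splitlines()

# Quote-aware approach: the reader keeps the line break inside the field.
# newline="" prevents the stream from translating the embedded newline.
records = list(csv.reader(io.StringIO(text, newline="")))
```

The naive split produces two physical lines, while the quote-aware reader returns a single record of four fields whose last field still contains the line break.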