CSVeed

Easy-to-use CSV to Java Bean utility


RFC 4180

Original RFC

The RFC was intended to provide a proper implementation guide for the text/csv MIME type, so we must be careful not to read more into this 'specification' than it set out to solve. Nevertheless, the reality is that RFC 4180 does hold some sway. With that caveat, here are the seven rules it lays down for a CSV file:

1. Every record has its own line

Each record is located on a separate line, delimited by a line break (CRLF). For example:

aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF

2. Optional end-of-line for the last record

The last record in the file may or may not have an ending line break. For example:

aaa,bbb,ccc CRLF
zzz,yyy,xxx

3. Optional file header

There may be an optional header line appearing as the first line of the file, with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type). For example:

field_name,field_name,field_name CRLF
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF

4. Cells separated by a separator

Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma. For example:

aaa,bbb,ccc

5. Cells may be delimited by double quotes

Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:

"aaa","bbb","ccc" CRLF
zzz,yyy,xxx

6. Cells containing special characters must be delimited by double quotes

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx

7. A double quote in a cell must be escaped by another double quote

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

"aaa","b""bb","ccc"
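
Taken together, the seven rules can be sketched as a small character-by-character parser. This is a minimal illustration only, not CSVeed's implementation: the class name Rfc4180Sketch and its parse method are hypothetical, the separator, quote and line break are hard-coded as the RFC prescribes, and a bare CR or LF outside quotes is simply treated as data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the seven RFC 4180 rules; not part of CSVeed.
final class Rfc4180Sketch {

    /** Parses CSV text into records, each record being a list of cell values. */
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        boolean anyContent = false; // true once the current record has seen a character

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean hasNext = i + 1 < text.length();
            if (inQuotes) {
                if (c == '"' && hasNext && text.charAt(i + 1) == '"') {
                    field.append('"'); // rule 7: doubled quote is an escaped quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;  // closing quote
                } else {
                    field.append(c);   // rule 6: commas and line breaks are data here
                }
            } else if (c == '"') {
                inQuotes = true;       // rule 5: optional enclosing quotes
                anyContent = true;
            } else if (c == ',') {
                record.add(field.toString()); // rule 4: comma separates cells
                field.setLength(0);
                anyContent = true;
            } else if (c == '\r' && hasNext && text.charAt(i + 1) == '\n') {
                i++;                   // rule 1: CRLF ends the record
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
                anyContent = false;
            } else {
                field.append(c);       // spaces are part of the field
                anyContent = true;
            }
        }
        if (anyContent) {              // rule 2: last record may lack a CRLF
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }
}
```

Rule 3 (the optional header) needs no parsing logic of its own: the caller decides whether the first record returned is a header or data.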

Objections to using RFC 4180 as a standard

RFC 4180 is a great starting point for identifying legitimate CSV. I like to look at it as a subset of a fuller set of valid CSV formats. There are a number of reasons why it would be a bad idea to accept only RFC 4180-compliant files as legitimate CSV:

Its symbols are fixed

The USA uses ',' as the separator between cells, whereas Northern Europe is accustomed to the ';' symbol. This probably started when the Microsoft Excel people tried to cope with locales that use a comma instead of a dot as the decimal symbol in money amounts.

Nevertheless, the legacy is there for us to cope with. Both kinds of CSV are perfectly legitimate, yet RFC 4180 caters for only one type.

No trailing and leading spaces

I have seen files with spaces around the cells many times, some of them quite recently. Take the following example:

  first name, surname, street, city, trademark
  'Stephen', 'Hawking', '110th Avenue', 'New York', 'History of the \'Universe\''
  'Albert', 'Einstein', '', 'Berlin', '\'E=mc2\''

The spaces before the quotes must be read and then ignored; they must not lead to an error.

Delimiter and escape cannot vary

Admittedly, it happens rarely, if ever, but the doubled double quote rather offends the sensibilities of those raised in the world of Unix. Surely, there must be a better way to deal with escaping than this, no?

At any rate, CSVeed is ready for those brave trailblazers who aim to change the face of CSV.
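
The three objections so far (fixed symbols, surrounding spaces, fixed escape) can be illustrated with a line parser whose separator, quote and escape symbols are constructor arguments. This is a sketch under stated assumptions, not CSVeed's API: LenientLineParser is a hypothetical class, it handles one line at a time (no line breaks inside cells), it trims unquoted cells as a deliberate leniency, and an escape symbol only has meaning directly before a quote.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a configurable, space-tolerant line parser; not part of CSVeed.
final class LenientLineParser {
    private final char separator;
    private final char quote;
    private final char escape;

    LenientLineParser(char separator, char quote, char escape) {
        this.separator = separator;
        this.quote = quote;
        this.escape = escape;
    }

    /** Splits a single line into cells, ignoring spaces around each cell. */
    List<String> parseLine(String line) {
        List<String> cells = new ArrayList<>();
        int n = line.length();
        int i = 0;
        while (true) {
            // Read and ignore spaces before the cell instead of raising an error.
            while (i < n && line.charAt(i) == ' ') i++;
            StringBuilder cell = new StringBuilder();
            if (i < n && line.charAt(i) == quote) {
                i++; // consume the opening quote
                while (i < n) {
                    char c = line.charAt(i);
                    if (c == escape && i + 1 < n && line.charAt(i + 1) == quote) {
                        cell.append(quote); // escaped quote becomes a literal quote
                        i += 2;
                    } else if (c == quote) {
                        i++;                // closing quote
                        break;
                    } else {
                        cell.append(c);
                        i++;
                    }
                }
                // Ignore whatever sits between the closing quote and the separator.
                while (i < n && line.charAt(i) != separator) i++;
            } else {
                int start = i;
                while (i < n && line.charAt(i) != separator) i++;
                cell.append(line.substring(start, i).trim());
            }
            cells.add(cell.toString());
            if (i >= n) break;
            i++; // consume the separator
        }
        return cells;
    }
}
```

With new LenientLineParser(',', '\'', '\\') the two data lines of the example above parse into five cells each. Passing '"' for both quote and escape reproduces the RFC 4180 doubling rule, and passing ';' as the separator covers the European variant.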

No comment lines

Comment lines do not occur often, and when they do, they mostly sit above the header row. Comment lines in CSVeed have a comment symbol at the first position of the record.

No customary start line

A variant on the comment line: there is no comment symbol, but you know at which line the content starts.

No skipping empty lines

A reader must have the flexibility to skip empty lines, since these do occur.
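
Comment lines, a known start line and empty lines can all be handled by filtering lines before they reach the parser. The sketch below is illustrative only: LineFilterSketch, contentLines and its parameters are hypothetical names, not CSVeed's API, and it works on physical lines, so it assumes no quoted cell spans several lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-parse line filter; not part of CSVeed.
final class LineFilterSketch {

    /**
     * Returns the lines that should be parsed as header or records.
     *
     * @param startLine      1-based line number at which the content starts
     * @param commentSymbol  symbol that marks a comment when at the first position
     * @param skipEmptyLines whether empty lines are dropped
     */
    static List<String> contentLines(List<String> lines, int startLine,
                                     char commentSymbol, boolean skipEmptyLines) {
        List<String> kept = new ArrayList<>();
        for (int i = Math.max(0, startLine - 1); i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.isEmpty()) {
                if (!skipEmptyLines) kept.add(line);
                continue;
            }
            if (line.charAt(0) == commentSymbol) continue; // comment line
            kept.add(line);
        }
        return kept;
    }
}
```

Lines before startLine are dropped unconditionally, which covers the case where a file opens with free text that carries no comment symbol.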

No rule on the number of columns

All the aspects mentioned above can be dealt with in a satisfying way. However, CSV files that contain a varying number of columns per record are a definite no-go area. CSVeed will reject those records.
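
A check for a varying number of columns could look roughly like the following. This is a hypothetical sketch, not how CSVeed implements the rejection: ColumnCountCheck and firstMismatch are made-up names, and the width of the first record (typically the header) is assumed to be the expected width.

```java
import java.util.List;

// Hypothetical column-count validation; not part of CSVeed.
final class ColumnCountCheck {

    /**
     * Returns the 1-based position of the first record whose number of cells
     * differs from that of the first record, or -1 if all records agree.
     */
    static int firstMismatch(List<List<String>> records) {
        if (records.isEmpty()) return -1;
        int expected = records.get(0).size();
        for (int i = 1; i < records.size(); i++) {
            if (records.get(i).size() != expected) return i + 1;
        }
        return -1;
    }
}
```

A caller can turn a non-negative result into an error that names the offending record rather than silently padding or truncating it.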