Fork me on GitHub

What is CSV?

The comma-separated values (CSV) format is a widely used text file format often used to exchange data between applications. It contains multiple records (one per line), and each field is delimited by a comma. Wikipedia has a good explanation of the CSV format and its history.

There is no definitive standard for CSV, however the most commonly accepted definition is RFC 4180 - the MIME type definition for CSV. Super CSV is 100% compliant with RFC 4180, while still allowing some flexibility where CSV files deviate from the definition.

The following shows each rule defined in RFC 4180, and how it is treated by Super CSV.

Rule 1

1. Each record is located on a separate line, delimited by a line
   break (CRLF).  For example:

   aaa,bbb,ccc CRLF
   zzz,yyy,xxx CRLF

Super CSV accepts all line breaks (Windows, Mac or Unix) when reading CSV files, and uses the end of line symbols specified by the user (via the CsvPreference object) when writing CSV files.

Rule 2

2. The last record in the file may or may not have an ending line
   break.  For example:

   aaa,bbb,ccc CRLF
   zzz,yyy,xxx

Super CSV will add a line break when writing the last line of a CSV file, but a line break on the last line is optional when reading.

Rule 3

3. There maybe an optional header line appearing as the first line
   of the file with the same format as normal record lines.  This
   header will contain names corresponding to the fields in the file
   and should contain the same number of fields as the records in
   the rest of the file (the presence or absence of the header line
   should be indicated via the optional "header" parameter of this
   MIME type).  For example:

   field_name,field_name,field_name CRLF
   aaa,bbb,ccc CRLF
   zzz,yyy,xxx CRLF

Super CSV provides methods for reading and writing headers, if required. It also makes use of the header for mapping between CSV and POJOs (see CsvBeanReader/CsvBeanWriter).

Rule 4

4. Within the header and each record, there may be one or more
   fields, separated by commas.  Each line should contain the same
   number of fields throughout the file.  Spaces are considered part
   of a field and should not be ignored.  The last field in the
   record must not be followed by a comma.  For example:

   aaa,bbb,ccc

The delimiter in Super CSV is configurable via the CsvPreference object, though it is typically a comma.

Super CSV expects each line to contain the same number of fields (including the header). In cases where the number of fields varies, CsvListReader/CsvListWriter should be used, as they contain methods for reading/writing lines of arbitrary length.

By default, Super CSV considers spaces part of a field. However, if you require that surrounding spaces should not be part of the field (unless within double quotes), then you can enable surroundingSpacesNeedQuotes in your CsvPreference object. This will ensure that surrounding spaces are trimmed when reading (if not within double quotes), and that quotes are applied to a field with surrounding spaces when writing.

Rule 5

5. Each field may or may not be enclosed in double quotes (however
   some programs, such as Microsoft Excel, do not use double quotes
   at all).  If fields are not enclosed with double quotes, then
   double quotes may not appear inside the fields.  For example:

   "aaa","bbb","ccc" CRLF
   zzz,yyy,xxx

By default Super CSV only encloses fields in double quotes when they require escaping (see Rule 6), but it is possible to enable quotes always, for particular columns, or for some other reason by supplying a QuoteMode in the CsvPreference object.

The quote character is configurable via the CsvPreference object, though is typically a double quote (").

Rule 6

6. Fields containing line breaks (CRLF), double quotes, and commas
   should be enclosed in double-quotes.  For example:

   "aaa","b CRLF
   bb","ccc" CRLF
   zzz,yyy,xxx

Super CSV handles multi-line fields (as long as they're enclosed in quotes) when reading, and encloses a field in quotes when writing if it contains a newline, quote character or delimiter (defined in the CsvPreference object).

Rule 7

7. If double-quotes are used to enclose fields, then a double-quote
   appearing inside a field must be escaped by preceding it with
   another double quote.  For example:

   "aaa","b""bb","ccc"

Super CSV escapes double-quotes with a preceding double-quote. Please note that the sometimes-used convention of escaping double-quotes as \" (instead of "") is not supported.