Why you can’t parse CSV with a regular expression

Regular expressions are a great tool in a programmer’s toolbox. But they can’t do everything. And one of the things they can’t do is reliably parse CSV (comma separated value) files. This is because a regular expression doesn’t store state. You need a state machine (or something equivalent) to parse a CSV file.

For example, consider this (very short) CSV file (3 double quotes + 1 comma + 3 double quotes):

""","""

This is correctly interpreted as:

quote to start the data value + escaped quote + comma + escaped quote + quote to end the data value

I.e. a single value of:

","

How each character is interpreted depends on what characters come before and after it. E.g. the first quote puts you into an ‘inside data’ state. The second quote puts you into a ‘could be an escape for the following character or could be end of data’ state. The third quote puts you back into an ‘inside data’ state.
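The states described above can be sketched as a small state machine in Python. This is only an illustration, not a complete CSV parser: the names State and parse_csv are made up for this example, it assumes comma separators and '\n' line endings, and it is lenient about stray quotes inside unquoted values.

```python
from enum import Enum, auto

class State(Enum):
    FIELD_START = auto()  # at the start of a value
    UNQUOTED = auto()     # inside an unquoted value
    QUOTED = auto()       # 'inside data' of a quoted value
    QUOTE_SEEN = auto()   # saw a quote inside data: escape or end of data?

def parse_csv(text):
    """Hypothetical helper: parse CSV text into a list of rows."""
    rows, row, field = [], [], []
    state = State.FIELD_START
    for ch in text:
        if state is State.QUOTED:
            if ch == '"':
                state = State.QUOTE_SEEN  # can't decide yet what this quote means
            else:
                field.append(ch)  # commas and newlines are just data here
        elif state is State.QUOTE_SEEN and ch == '"':
            field.append('"')  # it was an escaped quote
            state = State.QUOTED
        elif ch == ',':
            row.append(''.join(field))
            field = []
            state = State.FIELD_START
        elif ch == '\n':
            row.append(''.join(field))
            field = []
            rows.append(row)
            row = []
            state = State.FIELD_START
        elif ch == '"' and state is State.FIELD_START:
            state = State.QUOTED  # opening quote of a quoted value
        else:
            field.append(ch)
            state = State.UNQUOTED
    # Flush the last value if the text did not end with a newline.
    if field or row or state is not State.FIELD_START:
        row.append(''.join(field))
        rows.append(row)
    return rows

print(parse_csv('""","""'))  # [['","']]
```

The same quote character is handled by three different branches depending on the current state, which is exactly the information a regex has no place to keep.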

No matter how complicated a regex you come up with, it will always be possible to create a CSV file that your regex can’t correctly parse. And once the parsing goes wrong, everything after that point is likely to be garbage.

You can write a regex that can handle CSV files where you are guaranteed there are no commas, quotes or carriage returns in the data values. But commas, quotes and carriage returns in the data values are perfectly valid in CSV files. So it is only ever going to handle a subset of all the possible well-formed CSV files.
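To make the ‘subset’ point concrete, here is a small Python sketch. The helper names naive_split and proper_split are made up for this example. The first treats every comma as a separator, which is roughly what a simple regex does. The second uses the csv module from the Python standard library for comparison.

```python
import csv
import re

def naive_split(line):
    # Hypothetical helper: treats every comma as a separator.
    return re.split(r',', line)

def proper_split(line):
    # Hypothetical helper: lets the standard library's csv module parse one line.
    return next(csv.reader([line]))

# Fine when the values contain no commas or quotes.
print(naive_split('red,green,blue'))     # ['red', 'green', 'blue']

# Wrong as soon as a quoted value contains a comma.
print(naive_split('"Smith, John",42'))   # ['"Smith', ' John"', '42']
print(proper_split('"Smith, John",42'))  # ['Smith, John', '42']
```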

Note that you can parse a TSV (tab separated value) file with a regex, as TSV files are (generally!) not allowed to contain tabs or carriage returns in data and therefore don’t need escaping.
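A minimal Python sketch of the TSV case, assuming the data values really do contain no tabs or line breaks (the function name parse_tsv is made up for this example). Because nothing needs escaping, every tab is a separator and every line break ends a row, so two simple regexes are enough:

```python
import re

def parse_tsv(text):
    """Hypothetical helper: parse TSV text, assuming no tabs or line breaks in values."""
    lines = re.split(r'\r?\n', text)
    if lines and lines[-1] == '':
        lines.pop()  # ignore the empty string left after a trailing newline
    return [re.split(r'\t', line) for line in lines]

print(parse_tsv('name\tcity\nAnn\tOslo\n'))  # [['name', 'city'], ['Ann', 'Oslo']]
```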

See also on Stack Overflow:

Using regular expressions to parse HTML: why not?