This is part of an ongoing series on interesting code snippets: what they do and why I used them.
Here is a CSV parser. I wrote this for an interview once. Actually, I wrote it wrong twice, and then wrote this version, which is less wrong and more succinct.
I like this because it takes the essential building blocks of any CSV file, namely a row and a field, and calls `split_elements` on each of them. The `split_elements` method serves both purposes, and I like that it is general-purpose. Finally, when we’ve reached the lowest divisible level by splitting the rows into fields, we call the less likeable method, `unquote`. I think Ruby has provided us with ham-fisted tools for what this method does, or maybe I was ham-fisted when I wrote it, having expended my creative energies on the `split_elements` method. The `unquote` method has the unenviable task of matching arbitrary quote chars which are not preceded by escape chars. Think about something like `"foo says \"bar\""`. As we iterate through each character of the string after the first (which we expect to be a quote char if the field contains a quote), we have three choices:
- We’re dealing with the char after a quote char
- We’re dealing with an escape char, something like `\`
- We’re dealing with some other character
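To make the cases above concrete, here is a minimal sketch of what a method like `unquote` could look like. It is a reconstruction for illustration, not the code from the interview. The constants `ESCAPE` and `QUOTES`, the choice of backslash as the escape char, and the extra branch that stops at the closing quote are all assumptions.

```ruby
# Hypothetical sketch, not the original method from this post.
ESCAPE = "\\"                # assumed escape char
QUOTES = ['"', "'"].freeze   # assumed set of accepted quote chars

def unquote(field)
  quote = field[0]
  # Fields that do not start with a quote char are returned untouched.
  return field unless QUOTES.include?(quote)

  result = +""
  escaped = false
  field[1..-1].each_char do |char|
    if escaped
      # The char right after an escape char is kept literally.
      result << char
      escaped = false
    elsif char == ESCAPE
      # An escape char is dropped; it only marks the next char as literal.
      escaped = true
    elsif char == quote
      # An unescaped closing quote ends the field.
      break
    else
      # Any other character is copied through.
      result << char
    end
  end
  result
end
```

With this sketch, `unquote('"foo says \"bar\""')` yields the string `foo says "bar"`, and an unquoted field such as `plain` comes back unchanged.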
At first glance, the splitting of elements shouldn’t be so complicated. After all, can’t we just iterate through until we find an unescaped quote character? The answer is no, and this is why I implemented this wrongly twice before writing this version. That approach is wrong because of nested quotes. We must split rows, then fields, holistically, and cannot determine if we are in a row or field without looking at the document as a whole. Since we do have to look at the whole document, this exposes a potential attack vector for this piece of code. Think about what would happen if this code were given a massive document, with millions of rows (or millions of fields in one row). Attempting to split and process each of the document’s elements could exhaust the Ruby VM’s memory. It is for this reason that the solution here is a fun thing to submit for an interview, but is not something you’d want to use for real.
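For illustration, a quote-aware splitter in the spirit of `split_elements` might look like the sketch below. It is not the original method: the signature, the `quotes` and `escape` keyword arguments, and the decision to leave quote chars in place for a later unquoting step are all assumptions. The point it shows is that a delimiter only counts when we are not inside a quoted section, and that the same method can split a document into rows and a row into fields.

```ruby
# Hypothetical sketch of a quote-aware splitter, not the post's original code.
def split_elements(text, delimiter, quotes: ['"', "'"], escape: "\\")
  elements = []
  current = +""
  open_quote = nil  # the quote char we are currently inside, if any
  escaped = false   # true when the previous char was an escape char

  text.each_char do |char|
    if escaped
      # Char after an escape char: always part of the current element.
      current << char
      escaped = false
    elsif char == escape
      # Keep the escape char so a later unquoting step can interpret it.
      current << char
      escaped = true
    elsif open_quote
      # Inside quotes, delimiters are ordinary text; only the matching
      # quote char closes the quoted section.
      open_quote = nil if char == open_quote
      current << char
    elsif quotes.include?(char)
      open_quote = char
      current << char
    elsif char == delimiter
      # An unquoted delimiter ends the current element.
      elements << current
      current = +""
    else
      current << char
    end
  end

  elements << current
end

# One method, two levels: rows first, then the fields of each row.
document = "id,note\n1,\"two\nlines\"\n2,'a, b'"
rows = split_elements(document, "\n")
table = rows.map { |row| split_elements(row, ",") }
# table => [["id", "note"], ["1", "\"two\nlines\""], ["2", "'a, b'"]]
```

Note that the sketch accumulates every element in an array before returning, which is the memory cost described above: the whole document, and then every row of it, is held at once.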
I did not take the job, but I appreciated the nature of the code challenge.