Parsing embedded JSON out of a CSV file - ruby

I'm receiving tab delimited files with embedded JSON data in one of the columns. My goal is to split the columns, then do some work to process the JSON.
When I try to use the built-in Ruby CSV library (with Ruby 2.2.3) I get the following error:
Illegal quoting in line 1. (CSV::MalformedCSVError)
Here's a minimal example that demonstrates the problem. The following lines work fine:
puts 'red,"blue",green'.parse_csv
puts 'red,{blue},green'.parse_csv
But this line produces the MalformedCSVError message:
puts 'red,{"blue"},green'.parse_csv
Any idea how I can parse that file and treat the middle value (which happens to be JSON) as a string literal?
Thanks in advance!

The double quote (") is, by default, the character used to surround fields that may contain the CSV column delimiter (a tab in your case).
You can get around this by setting the :quote_char option to something else, such as backticks or \0. Additionally, for tab-delimited data you're going to need to set :col_sep.
This should give you what you're looking for,
'red,{"blue"},green'.parse_csv(quote_char: '`')
=> ["red", "{\"blue\"}", "green"]
%Q{red\t{"blue"}\tgreen}.parse_csv(quote_char: '`', col_sep: "\t")
=> ["red", "{\"blue\"}", "green"]
Note that this breaks if either
The JSON column contains tabs and is not surrounded by :quote_char, or
The JSON data contains :quote_char (e.g. it contains a backtick).
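Putting the two options together, here's a minimal sketch (with made-up data, and assuming the JSON never contains tabs or backticks) that splits a tab-delimited line and then parses the embedded JSON:

```ruby
require 'csv'
require 'json'

line = %Q{red\t{"blue": 1}\tgreen}

# Use a quote_char that can't appear in the data, so the embedded
# double quotes are treated as literal characters.
fields = line.parse_csv(quote_char: '`', col_sep: "\t")
# fields => ["red", "{\"blue\": 1}", "green"]

# The middle field is a JSON string; hand it to the JSON parser.
data = JSON.parse(fields[1])
```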

Related

How to escape both " and ' when importing each row

I import a text file and save each row as a new record:
CSV.foreach(csv_file_path) do |row|
# saving each row to a new record
end
Strangely enough, the following escapes double quotes, but I have no clue how to escape different characters:
CSV.foreach(csv_file_path, {quote_char: "\""}) do |row|
How do I escape both the characters " and '?
Note that you have additional options available to configure the CSV handler. The useful options for specifying character delimiter handling are these:
:col_sep - defines the column separator character
:row_sep - defines the row separator character
:quote_char - defines the quote separator character
Now, for traditional CSV (comma-separated) files, these values default to { col_sep: ",", row_sep: "\n", quote_char: "\"" }. These will satisfy many needs, but not necessarily all. You can specify the right set to suit your well-formed CSV needs.
However, for non-standard CSV input, consider using a two-pass approach to reading your CSV files. I've done a lot of work with CSV files from Real Estate MLS systems, and they're basically all broken in some fundamental way. I've used various pre- and post-processing approaches to fixing the issues, and had quite a lot of success with files that were failing to process with default options.
In the case of handling single quotes as a delimiter, you could possibly strip off leading and trailing single quotes after you've parsed the file using the standard double quotes. Iterating on the values and using a gsub replacement may work just fine if the single quotes were used in the same way as double quotes.
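As a sketch of that second pass (hypothetical input; this assumes single quotes only ever wrap whole fields, never appear inside them):

```ruby
require 'csv'

# Hypothetical row where some fields are wrapped in single quotes.
row = CSV.parse_line(%q{'one',"two",'three'})
# => ["'one'", "two", "'three'"]

# Second pass: strip single quotes that surround an entire field.
cleaned = row.map { |f| f.sub(/\A'(.*)'\z/m, '\1') }
```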
There's also an "automatic" converter mechanism that the CSV parser uses when retrieving values for individual columns. You can specify the :converters option, like so: { converters: [:my_converter] }
Writing a converter is pretty simple: it's just a small function that checks whether the column value matches the right format, and returns the re-formatted value. Here's one that should strip leading and trailing single quotes:
CSV::Converters[:strip_surrounding_single_quotes] = lambda do |field|
  return nil if field.nil?
  match = field.match(/^'([^']*)'$/)
  match.nil? ? field : match[1]
end

CSV.parse(input, { converters: [:strip_surrounding_single_quotes] })
You can use as many converters as you like, and they're evaluated in the order that you specify. For instance, to use the pre-defined :all along with the custom converter, you can write it like so:
CSV.parse(input, { converters: [:all, :strip_surrounding_single_quotes] })
If there's an example of the input data to test against, we can probably find a complete solution.
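For a quick end-to-end check, here's the converter above run against some made-up input:

```ruby
require 'csv'

# Strips single quotes that wrap an entire field value.
CSV::Converters[:strip_surrounding_single_quotes] = lambda do |field|
  return nil if field.nil?
  match = field.match(/^'([^']*)'$/)
  match.nil? ? field : match[1]
end

rows = CSV.parse("'a',b\n'c',d",
                 converters: [:strip_surrounding_single_quotes])
# rows => [["a", "b"], ["c", "d"]]
```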
In general, you can't, because that would create a CSV-like record that is not standard CSV (Wikipedia presents the rules in an easier-to-read format). In CSV, only double quotes are escaped - by doubling them, not by using a backslash.
What you are trying to write is not a CSV; you should not use a CSV library to do it.

Freemarker <compress> tag is trimming data inside ${} also

I have code like this :
FTL:
<#compress>
${doc["root/uniqCode"]}
</#compress>
Input is XML Nodemodel
The xml element has data like: "ID_234   567_89  " (three spaces in the middle, trailing spaces at the end)
When it is processed the output is: "ID_234 567_89"
The three white spaces between 234 and 567 are trimmed down to one white-space, and all the white space at the end of the value is lost.
I need the value as it is: "ID_234   567_89  "
When I removed the <#compress> tags it works as expected, irrespective of newFactory.setIgnoringElementContentWhitespace(true).
Why does the <#compress> tag trim data resulting from ${}?
Please help.
You could simply replace the characters you don't want manually (in the following example tabs, carriage returns and newlines), e.g.
${doc["root/uniqCode"]?replace("[\\t\\r\\n]", "", "rm")}
See ?replace built-in for strings: http://freemarker.org/docs/ref_builtins_string.html#ref_builtin_replace

Reading a specific column of data from a text file in Ruby

I have tried Googling, but I can only find solutions for other languages and the ones about Ruby are for CSV files.
I have a text file which looks like this
0.222222 0.333333 0.4444444 this is the first line.
There are many lines in the same format. All of the numbers are floats.
I want to be able to read just the third column of data (0.4444444 and the values under it) and ignore the rest of the data. How can I accomplish this?
You can still use CSV; just set the column separator to the space character:
require 'csv'
CSV.open('data', :col_sep=>" ").each do |row|
puts row[2].to_f
end
You don't need CSV, however, and if the whitespace separating fields is inconsistent, this is easiest:
File.readlines('data').each do |line|
puts line.split[2].to_f
end
I'd recommend breaking the task down mentally to:
How can I read the lines of a file?
How can I split a string around whitespace?
Those are two problems that are easy to learn how to handle.

Tokenise lines with quoted elements in Ruby

I need to tokenise strings in Ruby - string.split is almost perfect, except some of the strings may be enclosed in double-quotes, and within them, whitespace should be preserved. In the absence of lex for Ruby (correct?), writing a character-by-character tokenizer seems silly. What are my options?
I want a loop that's essentially:
until file.eof?
  line = file.readline
  tokens = line.tokenize # like split, but handles "some thing" as one token
end
I.e. an array of white-space delimited fields, but with correct handling of quoted sequences. Note there is no escape sequence for the quotes that I need to handle.
The best I can come up with so far is repeatedly match()ing a regex which matches either a quoted sequence or everything up to the next whitespace character, but even then I'm not sure how to formulate that neatly.
Like Andrew said, the most straightforward way is to parse the input with the stock CSV library and set the appropriate :col_sep and :quote_char options.
If you insist on parsing manually, you could use the following pattern in a more Ruby-ish way:
file.each do |line|
  tokens = line.scan(/\s*("[^"]+")|(\w+)/).flatten.compact
  # do whatever with the array of tokens
end
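For example, on a sample line (note that the quotes remain part of the token, and `\w+` won't match punctuation - adjust to taste):

```ruby
line = %q{alpha "two words" beta}

# Each scan match yields a two-element capture array; flatten and
# drop the nils from whichever alternative didn't match.
tokens = line.scan(/\s*("[^"]+")|(\w+)/).flatten.compact
# tokens => ["alpha", "\"two words\"", "beta"]
```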
split accepts a regex, so you could just write the regexp you want and call split on the line you just read.
line.split(/\s+/)
Try using Ruby's CSV library, and use a space (" ") as the :col_sep
:col_sep
The String placed between each field. This String will be transcoded
into the data’s Encoding before parsing.
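A sketch of that approach (note it breaks if tokens are separated by more than one space, since each extra space reads as an empty field):

```ruby
require 'csv'

# A space as col_sep makes quoted runs come back as single tokens.
tokens = CSV.parse_line('alpha "two words" beta', col_sep: ' ')
# tokens => ["alpha", "two words", "beta"]
```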

Overcoming a basic problem with CSV parsing using the FasterCSV gem

I have found a CSV parsing issue with FasterCSV (1.5.0) which seems like a genuine bug, but which I'm hoping there's a workaround for.
Basically, adding a space after the separator (in my case a comma) when the fields are enclosed in quotes generates a MalformedCSVError.
Here's a simple example:
# No quotes on fields -- works fine
FasterCSV.parse_line("one,two,three")
=> ["one", "two", "three"]
# Quotes around fields with no spaces after separators -- works fine
FasterCSV.parse_line("\"one\",\"two\",\"three\"")
=> ["one", "two", "three"]
# Quotes around fields but with a space after the first separator -- fails!
FasterCSV.parse_line("\"one\", \"two\",\"three\"")
=> FasterCSV::MalformedCSVError: Illegal quoting on line 1.
Am I going mad, or is this a bug in FasterCSV?
The MalformedCSVError is correct here.
Leading/trailing spaces in the CSV format are not ignored; they are considered part of a field. This means you have started a field with a space and then included unescaped double quotes in that field, which causes the illegal-quoting error.
Maybe this library is just more strict than others you have used.
Maybe you could set the :col_sep option to ', ' to make it parse files like that.
I had hoped that the :col_sep option might allow a regular expression, but it seems to be used for both reading and writing, which is a shame. The documentation doesn't hold out much hope and your need is probably more immediate than could be satisfied by requesting a change or submitting a patch ;-)
If you're calling #parse_line explicitly, then you could always call
gsub(/,\s*/, ',')
on your input line. That regular expression might need to change significantly if you anticipate the possibility of comma-space within quoted strings. (I'd suggest reposting such a question here with a suitable tag and let the RegEx mavens loose on it should that be the case).
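For example, with the stock CSV parser (same behavior as FasterCSV here; this assumes no comma-space sequence ever occurs inside a quoted field):

```ruby
require 'csv'

line = %Q{"one", "two","three"}

# Pre-process: collapse separator-plus-whitespace into a bare comma.
cleaned = line.gsub(/,\s*/, ',')

row = CSV.parse_line(cleaned)
# row => ["one", "two", "three"]
```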
