I import a text file and save each row as a new record:
CSV.foreach(csv_file_path) do |row|
# saving each row to a new record
end
Strangely enough, the following escapes double quotes, but I have no clue how to escape other characters:
CSV.foreach(csv_file_path, {quote_char: "\""}) do |row|
How do I escape both the characters " and '?
Note that you have additional options available to configure the CSV handler. The useful options for specifying character delimiter handling are these:
:col_sep - defines the column separator character
:row_sep - defines the row separator character
:quote_char - defines the quoting character
Now, for traditional CSV (comma-separated) files, these values default to { col_sep: ",", row_sep: :auto, quote_char: "\"" }, where :auto detects the row separator from the data. These defaults will satisfy many needs, but not necessarily all; you can specify the right set to suit your well-formed CSV needs.
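For example (a sketch only; the separators here are placeholders for whatever your file actually uses), a semicolon-separated file with Windows line endings could be read like so:

require 'csv'

CSV.foreach(csv_file_path, col_sep: ";", row_sep: "\r\n", quote_char: "\"") do |row|
  # row is an Array of field values for one record
end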
However, for non-standard CSV input, consider using a two-pass approach to reading your CSV files. I've done a lot of work with CSV files from Real Estate MLS systems, and they're basically all broken in some fundamental way. I've used various pre- and post-processing approaches to fixing the issues, and had quite a lot of success with files that were failing to process with default options.
In the case of handling single quotes as a delimiter, you could possibly strip off leading and trailing single quotes after you've parsed the file using the standard double quotes. Iterating on the values and using a gsub replacement may work just fine if the single quotes were used in the same way as double quotes.
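As a sketch (assuming the single quotes always wrap the entire field, which the question doesn't confirm), the post-processing could look like:

CSV.foreach(csv_file_path) do |row|
  cleaned = row.map { |field| field && field.gsub(/\A'|'\z/, "") }
  # save the cleaned row to a new record
end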
There's also an "automatic" converter mechanism that the CSV parser will use when retrieving the values for individual columns. You can specify the :converters option, like so: { converters: [:my_converter] }
Writing a converter is pretty simple: it's just a small lambda that checks whether the column value matches the right format and then returns the re-formatted value. Here's one that should strip leading and trailing single quotes:
CSV::Converters[:strip_surrounding_single_quotes] = lambda do |field|
return nil if field.nil?
match = field.match(/^'([^']*)'$/)
return match.nil? ? field : match[1]
end
CSV.parse(input, converters: [:strip_surrounding_single_quotes])
You can use as many converters as you like, and they're evaluated in the order that you specify. For instance, to use the pre-defined :all along with the custom converter, you can write it like so:
CSV.parse(input, converters: [:all, :strip_surrounding_single_quotes])
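For instance, with a hypothetical single-quoted input (none was posted) and the converter registered as above:

CSV.parse("'abc';'def'\n", col_sep: ";", converters: [:strip_surrounding_single_quotes])
# => [["abc", "def"]]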
If there's an example of the input data to test against, we can probably find a complete solution.
In general, you can't, because that would create a CSV-like record that is not standard CSV (Wikipedia has the rules in an easier-to-read format). In CSV, only double quotes are escaped - by doubling them, not by using a backslash.
What you are trying to write is not a CSV; you should not use a CSV library to do it.
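You can see the standard behaviour by letting the library write a row for you; the embedded double quote is doubled, while the single quote is left alone:

require 'csv'

puts CSV.generate_line(['she said "hi"', "it's fine"])
# prints: "she said ""hi""",it's fine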
I'm writing a serialize method that converts a Tree into a string for storage. I was looking for a delimiter to use in the serialization and wasn't sure what to use.
I can't use , because that might exist as a data value in a node. e.g.
  A
 / \
B   ,
would serialize to A, B, ,, and break my deserialization method. Can I use non-printable ASCII characters, or should I just guess what character(s) are unlikely to show up as input and use those as my delimiters?
Here's what my serialize method looks like, if you're curious:
def serialize(root)
if root.nil?
""
else
root.val + DELIMITER +
serialize(root.left) + DELIMITER +
serialize(root.right)
end
end
There are several common methods I can think of:
Escaping: you define an escape symbol that "escapes" from the "special" interpretation. Think about how \ acts as an escape character in Ruby string literals.
Fixed Fields / Length Encoding: you know in advance where a field begins and ends. (Fixed fields are basically a special-case of length encoding where you can leave out the length because it is always the same.)
Example for escaping:
def serialize(root)
if root.nil?
""
else
"#{escape(root.val)},#{serialize(root.left)},#{serialize(root.right)}" # using ,
end
end
private def escape(str) str.gsub(/[\\,]/) { |c| "\\#{c}" } end # prefix each \ and , with a backslash
Example for length encoding:
def serialize(root)
if root.nil?
"0,"
else
"#{root.val.size},#{root.val}#{serialize(root.left)}#{serialize(root.right)}" # using length encoding
end
end
Any , you find within size characters belongs to the value. Fixed fields would basically just concatenate the values and assume that they are all the same fixed length.
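For completeness, here is a minimal sketch of the matching deserializer for the length-encoded format. It assumes node values are non-empty strings (so "0," can only mean a nil subtree) and a simple Node struct; both details are my assumptions, not part of the format above.

Node = Struct.new(:val, :left, :right)

# Returns [subtree, position just past the consumed prefix].
def deserialize(str, pos = 0)
  comma = str.index(",", pos)            # end of the length prefix
  size  = str[pos...comma].to_i
  return [nil, comma + 1] if size.zero?  # "0," encodes a nil subtree
  val = str[comma + 1, size]             # the next size chars are the value
  left,  pos = deserialize(str, comma + 1 + size)
  right, pos = deserialize(str, pos)
  [Node.new(val, left, right), pos]
end

tree, _ = deserialize("1,A1,B0,0,0,")    # rebuilds A with left child B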
You might want to look at how existing serialization formats handle it, like OGDL: Ordered Graph Data Language, YAML: YAML Ain't Markup Language, JSON, CSV (Comma-Separated Values), XML (eXtensible Markup Language).
If you want to look at binary formats, you can check out Ruby's Marshal format or ASN.1.
Your idea of finding a seldom-used character is good: even if you use escaping, you will still need less escaping with a rarely used character. Just imaginee what it would look likee if 'ee' was thee eescapee characteer. However, I think using a non-printable character goes too far: unless you specifically want to design a binary format (such as Ruby's Marshal, Python's Pickle, or Java's Serialization), "less debuggability" (i.e. debugging by simply inspecting the output with less) is a nice property to have and one that you should not give up easily.
I'm receiving tab delimited files with embedded JSON data in one of the columns. My goal is to split the columns, then do some work to process the JSON.
When I try to use the built-in Ruby CSV library (with Ruby 2.2.3) I get the following error:
Illegal quoting in line 1. (CSV::MalformedCSVError)
Here's a minimal example that demonstrates the problem. The following lines work fine:
puts 'red,"blue",green'.parse_csv
puts 'red,{blue},green'.parse_csv
But this line produces the MalformedCSVError message:
puts 'red,{"blue"},green'.parse_csv
Any idea how I can parse that file and treat the middle value (which happens to be JSON) as a string literal?
Thanks in advance!
Double quotes (") is, by default, the character used to surround fields that may contain the CSV column delimiter (tab in your case).
You can get around this by setting the :quote_char option to something else, such as backticks or \0. Additionally, for tab-delimited data you're going to need to set :col_sep.
This should give you what you're looking for,
'red,{"blue"},green'.parse_csv(quote_char: '`')
=> ["red", "{\"blue\"}", "green"]
%Q{red\t{"blue"}\tgreen}.parse_csv(quote_char: '`', col_sep: "\t")
=> ["red", "{\"blue\"}", "green"]
Note that this breaks if either
The JSON column contains tabs and is not surrounded by the :quote_char, or
The JSON data contains :quote_char (e.g. it contains a backtick).
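Once the row is split, the JSON column is just a string and can be handed to the JSON parser; a sketch, assuming the JSON sits in the middle column as in the example (the payload here is made up so that it parses):

require 'csv'
require 'json'

row = %Q{red\t{"blue": 1}\tgreen}.parse_csv(quote_char: '`', col_sep: "\t")
JSON.parse(row[1])
# => {"blue"=>1}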
I'm currently struggling to clean, with sed or awk or a script, CSV files that are generated automatically and whose fields contain the CSV separator and the field delimiter.
The source software has no settings to play with to improve the situation.
Format of the csv:
"111111";"text";"";"text with ; and " sometimes "; or ;" multiple times";"user";
Fortunately, the csv is "well" formatted; the exporting software just doesn't escape or replace "forbidden" chars in the fields.
In the last few days I have tried to improve my knowledge of regular expressions and to find an expression to clean the files, but I have failed.
What I managed to do so far:
RegEx to find the fields (I wanted to find the fields and perform a replace inside but I didn't find a way to do it)
(?:";"|^")(.*?)(?=";"|";\n)
RegEx that finds a semicolon; it does not work if the semicolon is the last char of the field, and it only finds one per field.
(?:^"|";")(?:.*?)(;)(?:[^"\n].*?)(?=";"|";\n)
RegEx to find the double quotes; it seems to pick only the first double quote of the line in online regex testers.
(?:^"|";")(?:.*?)[^;](")(?:[^;].*?)(?=";"|";\n)
I thought of adding a space between every char in the fields, then searching for lonely semicolons and double quotes, and removing the single spaces afterwards, but I don't know if that's even possible, and it seems like a poor solution anyway.
Any standard library should be able to handle it if there is no explicit error in the CSV itself. This is why we have quote-characters and escape characters.
When you create a CSV yourself, you may forget to handle such cases and end up shipping a file in this state. AWK is not a CSV reader but simply a text-processing utility.
This is what your row should look like instead:
"111111";"text";"";"text with \; and \" sometimes \"; or ;\" multiple times";"user";
So if you can still re-fetch the data, find a way to export the CSV either through the database's own functionality or through the CSV library of the language you work with.
In Python, this would look like this:
mywriter = csv.writer(csvfile, delimiter=';', quotechar='"', escapechar="\\")
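If Ruby is an option (most of this thread is Ruby), a sketch of the same export via its CSV library; note that Ruby's CSV doubles embedded double quotes instead of backslash-escaping them, which is equally valid CSV:

require 'csv'

puts CSV.generate_line(["111111", %q{text with ; and " sometimes}], col_sep: ";")
# prints: 111111;"text with ; and "" sometimes"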
But if you can't create the CSV again, the only hope is that you can expect some pattern within the fields, as in this question: parse a csv file that contains commas in the fields with awk
But this is rarely true in textual data, especially comments or posts on a webpage. Another idea in such situations would be to use '\t' as the separator.
I need to tokenise strings in Ruby - string.split is almost perfect, except some of the strings may be enclosed in double-quotes, and within them, whitespace should be preserved. In the absence of lex for Ruby (correct?), writing a character-by-character tokenizer seems silly. What are my options?
I want a loop that's essentially:
until file.eof?
  line = file.readline
  tokens = line.tokenize # like split(), but handles "some thing" as one token
end
I.e. an array of white-space delimited fields, but with correct handling of quoted sequences. Note there is no escape sequence for the quotes that I need to handle.
The best I can imagine so far is repeatedly match()ing a regex which matches either a quoted sequence or everything up to the next whitespace character, but even then I'm not sure how to formulate that neatly.
Like Andrew said, the most straightforward way is to parse the input with the stock CSV library and set the appropriate :col_sep and :quote_char options.
If you insist on parsing manually, you may use the following pattern for a more Ruby-like approach:
file.each do |line|
tokens = line.scan(/\s*("[^"]+")|(\w+)/).flatten.compact
# do whatever with array of tokens
end
split accepts a regex, so you could just write the regexp you want and call split on the line you just read:
line.split(/\s+/)
Try using Ruby's CSV library, and use a space (" ") as the :col_sep
:col_sep
The String placed between each field. This String will be transcoded
into the data’s Encoding before parsing.
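A quick sketch of that approach; the default quote_char of " already handles the quoted sequences, and the question says there are no escape sequences to worry about:

require 'csv'

CSV.parse_line('foo "some thing" bar', col_sep: " ")
# => ["foo", "some thing", "bar"]

One caveat: a run of multiple spaces between tokens will show up as empty fields, so this works best when the fields are separated by exactly one space.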
I'm using the CLPB_IMPORT function module to get the clipboard into an internal table, and that part works. I'm copying two-column Excel data, so it fills the table with the delimiter '#', like:
4448#3000
4449#4000
4441#5000
But the problem is splitting these strings. I'm trying:
LOOP AT foytab.
SPLIT foytab-tab AT '#' INTO temp1 temp2.
ENDLOOP.
But it doesn't split; it puts the whole line into temp1. I think the delimiter is not what I thought it was ('#'), because when I write a string manually with the delimiter '#', it does split.
Do you have any idea how to split this?
You should not use CLPB_IMPORT since it's explicitly marked as obsolete. Use CL_GUI_FRONTEND_SERVICES=>CLIPBOARD_IMPORT instead.
The data is probably not separated by # but by a tab character. You can check this in the hex view of the debugger. # is just a replacement symbol the UI uses for any unprintable character. If the delimiter is the tab character, you can use the constant CL_ABAP_CHAR_UTILITIES=>HORIZONTAL_TAB.
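A short sketch of that fix (ABAP, reusing the loop and variable names from the question):

LOOP AT foytab.
  SPLIT foytab-tab AT cl_abap_char_utilities=>horizontal_tab INTO temp1 temp2.
ENDLOOP.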