Parsing delimited text with escape characters - ruby

I'm trying to parse (in Ruby) what's effectively the UNIX passwd file-format: comma delimiters, with an escape character \ such that anything escaped should be considered literally. I'm trying to use a regular expression for this, but I'm coming up short — even when using Oniguruma for lookahead/lookbehind assertions.
Essentially, all of the following should work:
a,b,c # => ["a", "b", "c"]
\a,b\,c # => ["a", "b,c"]
a,b,c\
d # => ["a", "b", "c\nd"]
a,b\\\,c # => ["a", "b\,c"]
Any ideas?
The first response looks pretty good. With a file containing
\a,,b\\\,c\,d,e\\f,\\,\
g
it gives:
[["\\a,"], [","], ["b\\\\\\,c\\,d,"], ["e\\\\f,"], ["\\\\,"], ["\\\ng\n"], [""]]
which is pretty close. I don't need the unescaping done on this first pass, as long as everything splits correctly on the commas. I tried Oniguruma and ended up with (the much longer):
Oniguruma::ORegexp.new(%{
(?: # - begins with (but doesn't capture)
(?<=\A) # - start of line
| # - (or)
(?<=,) # - a comma
)
(?: # - contains (but doesn't capture)
.*? # - any set of characters
[^\\\\]? # - not ending in a slash
(\\\\\\\\)* # - followed by an even number of slashes
)*?
(?: # - ends with (but doesn't capture)
(?=\Z) # - end of line
| # - (or)
(?=,)) # - a comma
},
'mx'
).scan(s)

Try this:
s.scan(/((?:\\.|[^,])*,?)/m)
It doesn't translate the characters following a \, but that can be done afterwards as a separate step.
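If a regex is not a hard requirement, the splitting and the unescaping can also be done in one pass. The following is only a sketch: `split_escaped` is a hypothetical helper name, and it assumes that a backslash makes the next character literal (including a newline) and that a lone trailing backslash can be dropped.

```ruby
# Hypothetical helper (not from the answers above): split on unescaped
# delimiters and drop the escape character in the same pass.
def split_escaped(line, delimiter: ",", escape: "\\")
  fields = ["".dup]            # start with one empty, mutable field
  chars = line.each_char       # external enumerator over the characters
  loop do                      # Enumerator#next raises StopIteration at the end, which ends the loop
    c = chars.next
    if c == escape
      fields.last << chars.next  # take the escaped character literally
    elsif c == delimiter
      fields << "".dup           # unescaped delimiter: start a new field
    else
      fields.last << c
    end
  end
  fields
end

p split_escaped('\a,b\,c')        # => ["a", "b,c"]
p split_escaped("a,b,c\\\nd")     # => ["a", "b", "c\nd"]
```

Unlike the scan-based version, this keeps empty fields (`"a,,b"` gives three fields) and never has to strip a trailing comma afterwards.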

I'd give the CSV class a try.
And a regex solution (hack?) might look like this:
#!/usr/bin/ruby -w
# contents of test.csv:
# a,b,c
# \a,b\,c
# a,b,c\
# d
# a,b\\\,c
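# Use + rather than *: with *, the first alternative can match the empty
# string at a comma or newline, so scan returns "" tokens and the \r?\n
# branch never matches. The trade-off is that empty fields are skipped.
tokens = File.read("test.csv").scan(/(?:\\.|[^,\r\n])+|\r?\n/m)
puts "-----------"
tokens.each do |token|
  if token == "\n" || token == "\r\n"
    puts "-----------"
  else
    puts ">" + token + "<"
  end
end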
which will produce the output:
-----------
>a<
>b<
>c<
-----------
>\a<
>b\,c<
-----------
>a<
>b<
>c\
d<
-----------
>a<
>b\\\,c<
-----------

Related

How can I get a Ruby variable from a line read from a file?

I need to reuse a text file that is filled with one-liners such as:
export NODE_CODE="mio12"
How can I make my Ruby program create each variable and assign it the value given in the text file?
If the file were a Ruby file, you could require it and be able to access the variables after that:
# variables.rb
VAR1 = "variable 1"
VAR2 = 2
# ruby.rb
require "variables"
puts VAR1
If you're not so lucky, you could read the file and then loop through the lines, looking for lines that match your criteria (Rubular is great here) and making use of Ruby's instance_variable_set method. The gsub is to deal with extra quotes when the matcher grabs a variable set as a string.
# variables.txt
export VAR1="variable 1"
export VAR2=2
# ruby.rb
variable_line = Regexp.new('export\s(\w*)=(.*)')
File.readlines("variables.txt").each do |line|
  if match = variable_line.match(line)
    instance_variable_set("@#{match[1].downcase}", match[2].gsub("\"", ""))
  end
end
puts @var1
puts @var2
Creating a hash from this file can be a fairly simple thing.
For var.txt:
export BLAH=42
export WOOBLE=67
File.readlines("var.txt").each_with_object({}) { |line, h|
h[$1] = $2 if line =~ /^ export \s+ (.+?) \s* \= \s* (.+) $/x
}
# => {"BLAH"=>"42", "WOOBLE"=>"67"}
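The hash approach can be combined with the quote stripping from the previous answer. This is a minimal sketch, assuming each line has the form `export NAME=value` with an optional pair of surrounding quotes; `parse_exports` is a hypothetical name, and shell features such as variable expansion or escaped quotes are not handled.

```ruby
# Hypothetical helper: turn "export NAME=value" lines into a Hash.
def parse_exports(text)
  text.each_line.each_with_object({}) do |line, vars|
    # Skip anything that is not an export line (comments, blank lines).
    next unless line =~ /\A\s*export\s+(\w+)=(.*)$/
    name, value = $1, $2.strip
    # Remove one pair of matching surrounding quotes, if present.
    vars[name] = value.sub(/\A(["'])(.*)\1\z/, '\2')
  end
end

vars = parse_exports(%(export NODE_CODE="mio12"\nexport VAR2=2\n))
p vars  # => {"NODE_CODE"=>"mio12", "VAR2"=>"2"}
```

All values come back as strings; convert with `Integer(...)` or similar where a number is expected.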

What would the regular expression to extract the 3 be?

I basically need to get the bit after the last pipe
"3083505|07733366638|3"
What would the regular expression for this be?
You can do this without regex. Here:
"3083505|07733366638|3".split("|").last
# => "3"
With regex (assuming it's always going to be integer values):
"3083505|07733366638|3".scan(/\|(\d+)$/)[0][0] # or use \w+ if you want to extract any word after `|`
# => "3"
Try this regex:
.*\|(.*)
The capture group holds whatever comes after the last |, because the leading .* is greedy.
You could do that most easily by using String#rindex:
line = "3083505|07733366638|37"
line[line.rindex('|')+1..-1]
#=> "37"
If you insist on using a regex:
r = /
.* # match any number of any character (greedily!)
\| # match pipe
(.+) # match one or more characters in capture group 1
/x # extended mode
line[r,1]
#=> "37"
Alternatively:
r = /
.* # match any number of any character (greedily!)
\| # match pipe
\K # forget everything matched so far
.+ # match one or more characters
/x # extended mode
line[r]
#=> "37"
or, as suggested by @engineersmnky in a comment on @shivam's answer:
r = /
(?<=\|) # match a pipe in a positive lookbehind
\d+ # match any number of digits
\z # match end of string
/x # extended mode
line[r]
#=> "37"
I would use split and last, but you could do
last_field = line.sub(/.+\|/, "")
That removes all characters up to and including the last pipe.
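Another regex-free option is String#rpartition, which splits around the last occurrence of a separator. A small sketch (`last_field` is a hypothetical name):

```ruby
# Hypothetical helper: return the text after the last pipe.
# rpartition returns [before, separator, after]; when no pipe is present
# it returns ["", "", whole_string], so the whole line comes back.
def last_field(line)
  line.rpartition("|").last
end

p last_field("3083505|07733366638|37")  # => "37"
```

Unlike the rindex version, this does not raise when the line contains no pipe at all.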

Check if string1 is before string2 on the same line

I am trying to match comment lines in a c#/sql code. CREATE may come before or after /*. They can be on the same line.
line6 = " CREATE /* this is ACTIVE line 6"
line5 = " charlie /* CREATE inside this is comment 5"
In the first case, it will be an active line; in the second, it will be a comment. I could probably do some kind of charindex, but maybe there is a simpler way.
regex1 = /\/\*|--/
if (line6 =~ regex1) then puts "Match comment___" + line6 else puts '____' end
if (line5 =~ regex1) then puts "Match comment___" + line5 else puts '____' end
With the regex
r = /
\/ # match forward slash
\* # match asterisk
\s+ # match > 0 whitespace chars
CREATE # match chars
\b # match word break (to avoid matching CREATED)
/x # extended mode for regex def
you can return an array of the comment lines thus:
[line6, line5].select { |l| l =~ r }
#=> [" charlie /* CREATE inside this is comment 5"]
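The "is CREATE before the comment marker" test from the question can also be written as a direct position comparison, since String#=~ returns the index of the first match. This is a sketch only: `create_before_comment?` is a hypothetical name, and it assumes that `/*` and `--` are the only comment openers and that neither appears inside a string literal.

```ruby
# Hypothetical helper: true when CREATE appears on the line and is not
# preceded by a comment opener.
def create_before_comment?(line)
  create_at  = line =~ /\bCREATE\b/   # index of CREATE, or nil
  comment_at = line =~ %r{/\*|--}     # index of the first /* or --, or nil
  return false if create_at.nil?
  comment_at.nil? || create_at < comment_at
end

line6 = " CREATE /* this is ACTIVE line 6"
line5 = " charlie /* CREATE inside this is comment 5"
p create_before_comment?(line6)  # => true
p create_before_comment?(line5)  # => false
```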

Replacing escape quotes with just quotes in a string

So I'm having an issue replacing \" in a string.
My Objective:
Given a string, if there's an escaped quote in the string, replace it with just a quote
So for example:
"hello\"74" would be "hello"74"
simp"\"sons would be simp"sons
jump98" would be jump98"
I'm currently trying the following, but it doesn't work and messes everything up. Any assistance would be awesome.
str.replace "\\"", "\""
I think you are mistaken about how \ works. You can never define a string as
a = "hello"74"
Also, the escape character is used only while defining the string literal; it's not part of the value. E.g.:
a = "hello\"74"
# => "hello\"74"
puts a
# hello"74
However, in case my assumption above is incorrect, the following example should help you:
a = 'hello\"74'
# => "hello\\\"74"
puts a
# hello\"74
a.gsub!("\\","")
# => "hello\"74"
puts a
# hello"74
EDIT
The above gsub will replace all instances of \; however, the OP only needs to replace \" with ". The following should do the trick:
a.gsub!("\\\"","\"")
# => "hello\"74"
puts a
# hello"74
You can use gsub:
word = 'simp"\"sons'
print word.gsub(/\\"/, '"')
#=> simp""sons
I'm currently trying str.replace "\\"", "\"" but obviously that doesn't work and messes everything up, any assistance would be awesome
str.replace "\\"", "\"" doesn't work for two reasons:
It's the wrong method. String#replace takes a single argument and replaces the entire contents of the string with it; you are looking for String#gsub.
"\\"" is incorrect: " starts the string, \\ is a backslash (correctly escaped) and " ends the string. The last " starts a new string.
You have to either escape the double quote:
puts "\\\"" #=> \"
Or use single quotes:
puts '\\"' #=> \"
Example:
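# Quote the heredoc terminator so the backslashes are kept literally;
# an unquoted heredoc would already turn \" into " before gsub runs.
content = <<-'EOF'
"hello\"74"
simp"\"sons
jump98"
EOF
puts content.gsub('\\"', '"')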
Output:
"hello"74"
simp""sons
jump98"

Read files line by line with \r, \n or \r\n as line separator

I want to process files line by line. However, these files have different line separators: "\r", "\n" or "\r\n". I don't know which one they use or which kind of OS they come from.
I have two solutions:
using bash command to translate these separators to "\n".
cat file |
tr '\r\n' '\n' |
tr '\r' '\n' |
ruby process.rb
read the whole file and gsub these separators
text = File.read('xxx.txt')
text.gsub!(/\r\n?/, "\n")
text.each_line do |line|
  # do something with line
end
The second solution is not good when the file is huge, because it reads the whole file into memory. See reference. Is there a more idiomatic and efficient Ruby solution?
I suggest you first determine the line separator. I've assumed that you can do that by reading characters until you encounter "\n" or "\r" (or reach the end of the file, in which case we can regard "\n" as the line separator). If the character "\n" is found, I assume that to be the separator; if "\r" is found I attempt to read the next character. If I can do so and it is "\n", I return "\r\n" as the separator. If "\r" is the last character in the file or is followed by a character other than "\n", I return "\r" as the separator.
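def separator(fname)
  File.open(fname) do |f|           # block form closes the file afterwards
    enum = f.each_char
    c = enum.next                   # note: raises StopIteration on an empty file
    loop do                         # loop rescues StopIteration at end of file
      case c[/\r|\n/]
      when "\n" then break
      when "\r"
        c << "\n" if enum.peek == "\n"
        break
      end
      c = enum.next
    end
    # c is a separator if one was found, otherwise the last character read
    c[0][/\r|\n/] ? c : "\n"
  end
end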
Then process the file line-by-line
def process(fname)
sep = separator(fname)
IO.foreach(fname, sep) { |line| puts line }
end
I haven't converted "\r" or "\r\n" to "\n", but of course you could do that easily. Just open a file for writing and in process read each line and write it to the output file with the default line separator.
Let's try it (for clarity I show the value returned by separator):
fname = "temp"
IO.write(fname, "slash n line 1\nslash n line 2\n")
#=> 30
separator(fname)
#=> "\n"
process(fname)
# slash n line 1
# slash n line 2
IO.write(fname, "slash r line 1\rslash r line 2\r")
#=> 30
separator(fname)
#=> "\r"
process(fname)
# slash r line 1
# slash r line 2
IO.write(fname, "slash r slash n line 1\r\nslash r slash n line 2\r\n")
#=> 48
separator(fname)
#=> "\r\n"
process(fname)
# slash r slash n line 1
# slash r slash n line 2
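The conversion to "\n" mentioned earlier could be sketched as follows. `normalize_newlines` is a hypothetical name, and the separator is passed in as an argument (for example the value returned by separator above). The file is streamed line by line, so it is never held in memory as a whole.

```ruby
# Hypothetical helper: copy src to dest, replacing the given line
# separator with "\n". Reads one line at a time.
def normalize_newlines(src, dest, sep)
  File.open(dest, "w") do |out|
    File.foreach(src, sep) do |line|
      # chomp(sep) removes the separator if the line ends with it
      out << line.chomp(sep) << "\n"
    end
  end
end
```

For example, `normalize_newlines("temp", "temp.out", separator("temp"))` would write a "\n"-separated copy of temp. Note that a final line without a separator still gets a trailing "\n" in the output.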
