Converting a multi-line string to an array in Ruby using line breaks as delimiters

I would like to turn this string
"P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
into an array that looks like this in Ruby:
["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
Using split on its own doesn't return what I would like because of the line breaks.

This is one way to deal with blank lines:
string.split(/\n+/)
For example,
string = "P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
string.split(/\n+/)
#=> ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS",
# "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
To accommodate files created under Windows (having line terminators \r\n) replace the regular expression with /(?:\r?\n)+/.
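As a quick check of the Windows-friendly pattern, here is a minimal sketch. The sample string windows_text is made up for illustration; it mixes \r\n and \n terminators and contains one blank line.

```ruby
# Hypothetical sample: Windows (\r\n) and Unix (\n) terminators mixed,
# with a blank line in the middle.
windows_text = "P07091 MMCNEFFEG\r\nP06870 IVGGWECEQHS\r\n\r\nSP0A8M0 VVPVADVLQGR\nP01019 VIHNESTCEQ"

# One or more consecutive line terminators act as a single delimiter.
records = windows_text.split(/(?:\r?\n)+/)

p records
```

If the string can start with a line break, split returns a leading empty string, so a trailing .reject(&:empty?) may still be worth adding.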

I like to use this as a pretty generic method for handling newlines and returns:
lines = string.split(/\n+|\r+/).reject(&:empty?)

string = "P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
Using CSV::parse
require 'csv'
CSV.parse(string).flatten
# => ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
Another way, using String#each_line:
ar = []
string.each_line { |line| ar << line.strip unless line == "\n" }
ar # => ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
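A related variant, sketched here under the assumption of Ruby 2.4 or later, is String#lines with the chomp: true keyword, which removes the terminator from each element so no accumulator array is needed. The sample text and variable names are illustrative only.

```ruby
# Illustrative input with one blank line and a trailing newline.
sample = "P07091 MMCNEFFEG\nP06870 IVGGWECEQHS\n\nSP0A8M0 VVPVADVLQGR\nP01019 VIHNESTCEQ\n"

# lines(chomp: true) returns each line without its terminator (Ruby 2.4+);
# blank lines come back as empty strings and are rejected afterwards.
rows = sample.lines(chomp: true).reject(&:empty?)

p rows
```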

Building off of @Martin's answer:
lines = string.split("\n").reject(&:blank?)
That'll give you only the lines that have content. Note that blank? is provided by ActiveSupport (Rails), not by core Ruby.
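Outside of Rails, String#blank? is not available, so a plain-Ruby sketch of the same idea could strip each line and test it with empty? instead. The sample text below is hypothetical.

```ruby
# Hypothetical input containing a whitespace-only line and a trailing blank line.
text = "P07091 MMCNEFFEG\n   \nP06870 IVGGWECEQHS\n\n"

# strip.empty? treats whitespace-only lines as blank, much like blank? would.
kept_lines = text.split("\n").reject { |line| line.strip.empty? }

p kept_lines
```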

Split can take a parameter in the form of the character to use to split, so you can do:
lines = string.split("\n")

I think it should be noted that in some situations, line breaks can include not only newlines (\n) but also carriage returns (\r) and that there could potentially be any combination or quantity thereof. Let's take the following string for example:
str = "Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4... \n
Useful Line 5\r \n
Useful Line 6\n\r
Useful Line 7\n\r\n\r
Useful Line 8 \r\n\r\n
Useful Line 9\r\r\r Useful Line 10\n\n\n\n\nUseful Line 11 \r Useful Line 12"
To deal with all instances of \n and \r, I would first replace every \r with \n using gsub, and then collapse consecutive \n characters into one using squeeze(arg):
str.gsub("\r", "\n").squeeze("\n")
which would result in :
#=>
"Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4...
Useful Line 5
Useful Line 6
Useful Line 7
Useful Line 8
Useful Line 9
Useful Line 10
Useful Line 11
Useful Line 12"
...which brings me to our next issue. Sometimes those extra line breaks surround lines that contain only whitespace, so they are not truly blank or empty. To deal with not only line breaks but also these unwanted whitespace-only lines, I would add the each_line, reject, and strip methods like so:
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.join
which would result in the desired string:
#=>
Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4...
Useful Line 5
Useful Line 6
Useful Line 7
Useful Line 8
Useful Line 9
Useful Line 10
Useful Line 11
Useful Line 12
Now more specifically to the OP, we could then simply use split("\n") to finish it all off (as was already mentioned by others):
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.join.split("\n")
or we could skip straight to the desired array by splitting first and then rejecting the blank entries, leaving off the unnecessary join like so:
str.gsub("\r", "\n").squeeze("\n").split("\n").reject{|x| x.strip == ""}
both of which would result in:
#=>
["Useful Line 1 ....", " Useful Line 2", "Useful Line 3", " Useful Line 4... ", "Useful Line 5", " Useful Line 6", "Useful Line 7", " Useful Line 8 ", "Useful Line 9", " Useful Line 10", "Useful Line 11 ", " Useful Line 12"]
NOTE:
You may also want to strip off leading and trailing whitespace from each line in which case we could replace .join.split("\n") with .map(&:strip) like so:
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.map(&:strip)
or
str.gsub("\r", "\n").squeeze("\n").split("\n").reject{|x| x.strip == ""}.map(&:strip)
which would both result in:
#=>
["Useful Line 1 ....", "Useful Line 2", "Useful Line 3", "Useful Line 4...", "Useful Line 5", "Useful Line 6", "Useful Line 7", "Useful Line 8", "Useful Line 9", "Useful Line 10", "Useful Line 11", "Useful Line 12"]
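For comparison, the same cleanup can be sketched more compactly by splitting on any run of \r or \n characters in one step. This is an alternative formulation, not the chain used above, and the sample string messy is made up for the demonstration.

```ruby
# Made-up sample with mixed terminators and whitespace-only lines.
messy = "Useful Line 1 ....\n Useful Line 2\r \nUseful Line 3\n\r\n\r Useful Line 4 \r\r\rUseful Line 5"

cleaned = messy
  .split(/[\r\n]+/)   # any run of CR/LF characters is one delimiter
  .map(&:strip)       # trim leading and trailing whitespace on each line
  .reject(&:empty?)   # drop lines that held only whitespace

p cleaned
```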

Related

Multiline RegEx: match last occurrence only

I have a string containing a Python stack trace like this (with some irrelevant text before and after):
Traceback (most recent call last):
File "/workspace/r111.py", line 232, in test_assess
exec(code)
File "a111.py", line 17, in
def reset(self):
File "/workspace/r111.py", line 123, in failed
raise AssertionError(msg)
AssertionError: Dein Programm funktioniert nicht. Python sagt:
Traceback (most recent call last):
File "a111.py", line 6, in
File "/workspace/r111.py", line 111, in runcaptured
exec(c, variables)
File "", line 1, in
ZeroDivisionError: division by zero
Now I want to extract the line in which the error occurred (1 extracted from File "", line 1) using a multiline RegEx (in Ruby).
/File ".*", line ([0-9]+)/ works nicely, but matches all occurrences. I only want the last. Iterating over the matches in the target environment is not a valid solution, as I can't change the business logic there.
You may use
/(?m:.*)(?-m:File ".*", line ([0-9]+))/
Details
(?m:.*) - a modifier group with the multiline flag on, so the dot matches any character including line breaks; it greedily matches as many characters as possible, up to the last occurrence of the subpattern that follows
(?-m:File ".*", line ([0-9]+)) - another modifier group with the multiline flag off, so the dot matches any character except line breaks:
File - a literal substring with a space after it
".*" - a double quote, zero or more characters other than line breaks, and then another double quote
, line - a comma, a space, the literal "line", and a space
([0-9]+) - Group 1, capturing one or more digits.
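A minimal sketch of the pattern in use follows. The traceback is a shortened, hypothetical one (the function name reset on the last File line is invented), and String#[] with a capture index is just one convenient way to read Group 1.

```ruby
# Hypothetical, shortened traceback used only for illustration.
trace = <<~TRACE
  Traceback (most recent call last):
    File "/workspace/r111.py", line 232, in test_assess
      exec(code)
    File "a111.py", line 6, in reset
  ZeroDivisionError: division by zero
TRACE

first_pattern = /File ".*", line ([0-9]+)/
last_pattern  = /(?m:.*)(?-m:File ".*", line ([0-9]+))/

# str[regex, 1] returns capture group 1 of the first match, or nil.
first_line_number = trace[first_pattern, 1]  # line number of the first File entry
last_line_number  = trace[last_pattern, 1]   # line number of the last File entry

puts "first: #{first_line_number}, last: #{last_line_number}"
```

Because the leading group is greedy, the regex engine backtracks from the end of the string, so the single match it reports ends at the final File entry.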

Ruby question mark in filename

I have a little piece of ruby that creates a file containing tsv content with 2 columns, a date, and a random number.
#!/usr/bin/ruby
require 'date'
require 'set'
startDate=Date.new(2014,11,1)
endDate=Date.new(2015,9,1)
dates=File.new("/PATH_TO_FILE/dates_randoms.tsv","w+")
rands=Set.new
while startDate <= endDate do
  random=rand(1000)
  while rands.add?(random).nil? do
    random=rand(1000)
  end
  dates.puts("#{startDate.to_s.gsub("-","")}\t#{random}")
  startDate=startDate+1
end
Then, from another program, I read this file and create a file named after the random number:
dates_file=File.new(DATES_FILE_PATH,"r")
dates_file.each_line do |line|
  parts=line.split("\t")
  random=parts.at(1)
  table=File.new("#{TMP_DIR}#{random}.tsv","w")
end
But when I check the output, I see a file named 645?.tsv, for example.
I initially thought the cause was the line separator in the TSV file (the one containing the date and the random number), but everything runs on the same Unix filesystem; the file is not transferred from DOS to Unix.
Some lines from the file:
head dates_randoms.tsv
20141101 356
20141102 604
20141103 680
20141104 668
20141105 995
20141106 946
20141107 354
20141108 234
20141109 429
20141110 384
Any advice?
parts = line.split("\t")
random = parts.at(1)
Here, line will contain a trailing newline character. So for a line
"whatever\t1234\n"
random will contain "1234\n". That newline character then becomes part of the filename, and you see it as a question mark. The simplest workaround is to do some sanitization:
random = parts.at(1).chomp
# alternatively use .strip if you want to remove whitespaces
# from beginning of the value too
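The effect is easy to reproduce in isolation. The sketch below uses a made-up line in the same date, tab, number layout and compares the second field with and without chomp; the variable names are illustrative.

```ruby
# A made-up line as each_line would yield it, terminator included.
line = "20141101\t356\n"

raw_field   = line.split("\t").at(1)        # still carries the newline
clean_field = line.chomp.split("\t").at(1)  # terminator removed before splitting

# inspect makes the invisible newline visible.
puts raw_field.inspect
puts clean_field.inspect
```

Calling chomp on the whole line before splitting, as here, or on the field afterwards, as in the answer, gives the same filename.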

How to write some value to a text file in ruby based on position

I need help with a somewhat unusual problem. I have a text file in which I have to replace some values based on their position. It is not a big file and will always contain 5 lines, each with a fixed length. But I have to replace text only at specific positions. Alternatively, I could put placeholder text at the required position and replace that placeholder with the required value every time. I am not sure how to implement this. I have given an example below.
Line 1 - 00000 This Is Me 12345 trying
Line 2 - 23456 This is line 2 987654
Line 3 - This is 345678 line 3 67890
Consider the above to be the file in which I have to replace some values. For example, in line 1 I have to replace '00000' with '11111', and in line 2 I have to replace 'This' with 'Line' or any required four-character text. The positions will always remain the same in the text file.
I have a solution that works, but it is for reading the file based on position, not for writing. Can someone please give a similar solution for writing based on position as well?
Solution for reading the file based on position:
def read_var file, line_nr, vbegin, vend
IO.readlines(file)[line_nr][vbegin..vend]
end
puts read_var("read_var_from_file.txt", 0, 1, 3) #line 0, beginning at 1, ending at 3
#=>308
puts read_var("read_var_from_file.txt", 1, 3, 6)
#=>8522
I have also tried this solution for writing. It works, but I need it to work based on position or based on the text present in a specific line.
Explored solution to write to the file:
open(Dir.pwd + '/Files/Try.txt', 'w') { |f|
f << "Four score\n"
f << "and seven\n"
f << "years ago\n"
}
I made you a working sample, anagraj.
in_file = "in.txt"
out_file = "out.txt"
=begin
=>contents of file in.txt
00000 This Is Me 12345 trying
23456 This is line 2 987654
This is 345678 line 3 67890
=end
def replace_in_file in_file, out_file, shreds
  File.open(out_file,"wb") do |file|
    File.read(in_file).each_line.with_index do |line, index|
      shreds.each do |shred|
        if shred[:index]==index
          line[shred[:begin]..shred[:end]]=shred[:replace]
        end
      end
      file << line
    end
  end
end
shreds = [
{index:0, begin:0, end:4, replace:"11111"},
{index:1, begin:6, end:9, replace:"Line"}
]
replace_in_file in_file, out_file, shreds
=begin
=>contents of file out.txt
11111 This Is Me 12345 trying
23456 Line is line 2 987654
This is 345678 line 3 67890
=end
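If the replacement should land in the same file rather than in a separate out.txt, one possible sketch is to read all lines, overwrite the slice, and write the lines back. The helper name overwrite_at and the temporary demo file are assumptions made for this example, and it presumes the replacement text has the same length as the text it overwrites so that the fixed columns stay aligned.

```ruby
require "tempfile"

# Hypothetical helper: on line `line_index` of `path`, overwrite the characters
# starting at column `start` with `text` (same length, so columns do not shift).
def overwrite_at(path, line_index, start, text)
  lines = File.readlines(path)
  lines[line_index][start, text.length] = text
  File.write(path, lines.join)
end

# Demo on a temporary file so no real data is touched.
demo = Tempfile.new("fixed_width")
demo.write("00000 This Is Me 12345 trying\n23456 This is line 2 987654\n")
demo.close

overwrite_at(demo.path, 0, 0, "11111")
overwrite_at(demo.path, 1, 6, "Line")

result = File.read(demo.path)
puts result
```

Rewriting the whole file is reasonable here because the file is described as small (five lines); for large files a streaming approach like the one above would be the safer choice.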

Is there a SnakeYaml DumperOptions setting to avoid double-spacing output?

I seem to see double-spaced output when parsing/dumping a simple YAML file with a pipe-text field.
The test is:
public void yamlTest()
{
    DumperOptions printOptions = new DumperOptions();
    printOptions.setLineBreak(DumperOptions.LineBreak.UNIX);
    Yaml y = new Yaml(printOptions);
    String input = "foo: |\n" +
                   " line 1\n" +
                   " line 2\n";
    Object parsedObject = y.load(new StringReader(input));
    String output = y.dump(parsedObject);
    System.out.println(output);
}
and the output is:
{foo: 'line 1
line 2
'}
Note the extra space between line 1 and line 2, and after line 2 before the end of the string.
This test was run on Mac OS X 10.6, java version "1.6.0_29".
Thanks!
Mark
In the original string you use the literal style, which is indicated by the '|' character. When you dump the text, the single-quoted style is used. In that style a single line break does not count as a '\n', so each '\n' in the value has to be written as an empty line. That is why the output looks double-spaced.
Try setting a different scalar style in DumperOptions:
// other options include FOLDED and DOUBLE_QUOTED
printOptions.setDefaultScalarStyle(DumperOptions.ScalarStyle.LITERAL);

replace every occurrence of 'line 2' with line_2 with regex

I'm parsing some text from an XML file which has sentences like
"Subtract line 4 from line 1.", "Enter the amount from line 5"
I want to replace all occurrences of line with line_,
e.g. Subtract line 4 from line 1 --> Subtract line_4 from line_1
Also, there are sentences like "Are the amounts on lines 4 and 8 the same?" and "Skip lines 9 through 12; go to line 13."
I want to process these sentences to become
"Are the amounts on line_4 and line_8 the same?"
and
"Skip line_9 through line_12; go to line_13."
Here's a working implementation with an rspec test. You call it like this: output = LineIdentifier[input]. To test, run spec file.rb after installing the rspec gem.
require 'spec'
class LineIdentifier
  def self.[](input)
    output = input.gsub(/line (\d+)/, 'line_\1')
    output.gsub(/lines (\d+) (and|from|through) (line )?(\d+)/, 'line_\1 \2 line_\4')
  end
end
describe "LineIdentifier" do
  it "should identify line mentions" do
    examples = {
      # Input                        Output
      'Subtract line 4 from line 1.' => 'Subtract line_4 from line_1.',
      'Enter the amount from line 5' => 'Enter the amount from line_5',
      'Subtract line 4 from line 1' => 'Subtract line_4 from line_1',
    }
    examples.each do |input, output|
      LineIdentifier[input].should == output
    end
  end
  it "should identify line ranges" do
    examples = {
      # Input                        Output
      'Are the amounts on lines 4 and 8 the same?' => 'Are the amounts on line_4 and line_8 the same?',
      'Skip lines 9 through 12; go to line 13.' => 'Skip line_9 through line_12; go to line_13.',
    }
    examples.each do |input, output|
      LineIdentifier[input].should == output
    end
  end
end
This works for the specific examples including the ones in the OP comments. As is often the case when using regex to do parsing, it becomes a hodge-podge of additional cases and tests to handle ever-increasing known inputs. This handles the lists of line numbers using a while loop with a non-greedy match. As written, it is simply processing an input line-by-line. To get series of line numbers across line boundaries, it would need to be changed to process it as one chunk with matching across lines.
open( ARGV[0], "r" ) do |file|
  while ( line = file.gets )
    # replace both "line ddd" and "lines ddd" with line_ddd
    line.gsub!( /(lines?\s)(\d+)/, 'line_\2' )
    # Now replace the known sequences with a non-greedy match
    while line.gsub!( /(line_\d+[a-z]?,?)(\sand\s|\sthrough\s|,\s)(\d+)/, '\1\2line_\3' )
    end
    puts line
  end
end
Sample Data: For this input:
Subtract line 4 from line 1.
Enter the amount from line 5
on lines 4 and 8 the same?
Skip lines 9 through 12; go to line 13.
... on line 10 Form 1040A, lines 7, 8a, 9a, 10, 11b, 12b, and 13
Add lines 2, 3, and 4
It produces this output:
Subtract line_4 from line_1.
Enter the amount from line_5
on line_4 and line_8 the same?
Skip line_9 through line_12; go to line_13.
... on line_10 Form 1040A, line_7, line_8a, line_9a, line_10, line_11b, line_12b, and line_13
Add line_2, line_3, and line_4
sed is your friend:
lines.sed:
#!/bin/sed -rf
s/lines? ([0-9]+)/line_\1/g
s/\b([0-9]+[a-z]?)\b/line_\1/g
lines.txt:
Subtract line 4 from line 1.
Enter the amount from line 5
Are the amounts on lines 4 and 8 the same?
Skip lines 9 through 12; go to line 13.
Enter the total of the amounts from Form 1040A, lines 7, 8a, 9a, 10, 11b, 12b, and 13
Add lines 2, 3, and 4
demo:
$ cat lines.txt | ./lines.sed
Subtract line_4 from line_1.
Enter the amount from line_5
Are the amounts on line_4 and line_8 the same?
Skip line_9 through line_12; go to line_13.
Enter the total of the amounts from Form 1040A, line_7, line_8a, line_9a, line_10, line_11b, line_12b, and line_13
Add line_2, line_3, and line_4
You can also make this into a sed one-liner if you prefer, although the file is more maintainable.
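Since the question is about Ruby, the same two substitutions can be sketched with String#gsub. The method name tag_line_numbers is made up for this example, and the patterns are copied from the sed script rather than tuned further.

```ruby
# Hypothetical helper that applies the two sed substitutions in order.
def tag_line_numbers(sentence)
  sentence
    .gsub(/lines? ([0-9]+)/, 'line_\1')      # "line 4" / "lines 4" -> "line_4"
    .gsub(/\b([0-9]+[a-z]?)\b/, 'line_\1')   # remaining bare numbers such as "8a"
end

samples = [
  "Skip lines 9 through 12; go to line 13.",
  "Form 1040A, lines 7, 8a, 9a, 10, 11b, 12b, and 13",
]

tagged = samples.map { |sentence| tag_line_numbers(sentence) }
puts tagged
```

The second pattern tags every standalone number, not only those in a line list, so text that contains other numbers (amounts, years) would need a stricter rule.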
