How to parse text by new lines? - ruby

I have some text that spans multiple lines, and I want to organize it by each new line. An example text is:
Save $5.00 on Candy with Your Pickup Purchase
Other
when you purchase $15.00 worth of candy. Offer valid only when
Exp 02/09/2019
I'd like to put each line into its own array element, but I'm not sure how to tell the lines apart.

You can use:
> str = <<e
> First Line
> Second line
>
>
> Fifth Line
>
> Seventh Line
> e
# => "First Line\nSecond line\n\n\nFifth Line\n\nSeventh Line\n"
> str.split("\n")
# => ["First Line", "Second line", "", "", "Fifth Line", "", "Seventh Line"]
This splits the string into an array at each newline character.
Each element of the array is one line of text; an empty string represents a blank line.
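One detail worth knowing: without a limit argument, split drops trailing empty strings, so blank lines at the very end of the text disappear. A minimal sketch of two ways to keep them, using a made-up sample string (not the asker's data):

```ruby
# Hypothetical sample text that ends with two blank lines.
str = "First Line\n\nThird Line\n\n\n"

# Without a limit, trailing empty strings are dropped.
default_split = str.split("\n")

# A negative limit keeps every field, including trailing empty ones.
keep_all = str.split("\n", -1)

# String#lines yields one entry per line; chomp: true (Ruby 2.4+) removes the terminators.
chomped = str.lines(chomp: true)

p default_split # ["First Line", "", "Third Line"]
p keep_all      # ["First Line", "", "Third Line", "", "", ""]
p chomped       # ["First Line", "", "Third Line", "", ""]
```

Which variant fits depends on whether trailing blank lines carry meaning in your input.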

<<~_.lines
Save $5.00 on Candy with Your Pickup Purchase
Other
when you purchase $15.00 worth of candy. Offer valid only when
Exp 02/09/2019
_
# =>
# [
# "Save $5.00 on Candy with Your Pickup Purchase\n",
# "\n",
# "Other\n",
# "\n",
# "when you purchase $15.00 worth of candy. Offer valid only when \n",
# "Exp 02/09/2019\n"
# ]
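If the line terminators and the blank entries are not wanted, one possible follow-up is to strip each line and reject the empty ones. This is only a sketch; the heredoc below re-creates the sample text with blank lines between entries, which is an assumption based on the output shown above.

```ruby
# Sample text, assuming blank lines separate the entries.
text = <<~TEXT
  Save $5.00 on Candy with Your Pickup Purchase

  Other

  when you purchase $15.00 worth of candy. Offer valid only when
  Exp 02/09/2019
TEXT

# Strip whitespace (including the newline) from each line, then drop blank lines.
cleaned = text.each_line.map(&:strip).reject(&:empty?)

p cleaned
```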

Related

Regex-negation to exclude an element

I have the following strings:
"4 sprigs of fresh rosemary"
"1 x 600 g jar of quality white beans"
and I would like to exclude everything that's before "of" like this:
"fresh rosemary"
"quality white beans"
I tried using gsub, but I can't find the proper regex.
Not sure why you are using gsub. It does not make sense.
"4 sprigs of fresh rosemary".sub(/.*(?=of)/, "")
# => "of fresh rosemary"
"1 x 600 g jar of quality white beans".sub(/.*(?=of)/, "")
# => "of quality white beans"
By the way, what you described and what you are expecting do not match.
Or,
"1 x 600 g jar of quality white beans".sub(/.*of\s*/, "")
# => "quality white beans"
"4 sprigs of fresh rosemary".sub(/.*of\s*/, "")
# => "fresh rosemary"
Not using a regex:
"4 sprigs of fresh rosemary".split('of ').last
# => "fresh rosemary"
This would not behave as expected if the sentence contained "of " more than once, since only the text after the last occurrence is kept.
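When "of" can appear several times, String#partition and String#rpartition make the choice explicit without a regex. A small sketch, using one of the multi-"of" sentences from further down this thread:

```ruby
sentence = "I have a love of many things including a love of fresh rosemary"

# partition splits at the FIRST occurrence of the separator.
after_first = sentence.partition(" of ").last

# rpartition splits at the LAST occurrence of the separator.
after_last = sentence.rpartition(" of ").last

# When the separator is missing, partition returns an empty tail.
no_separator = "plain text".partition(" of ").last

p after_first # "many things including a love of fresh rosemary"
p after_last  # "fresh rosemary"
p no_separator # ""
```

Using " of " with surrounding spaces also avoids matching words that merely contain those letters.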
You could use the match method on a string using a regex.
"4 sprigs of fresh rosemary".match(/of (.+)/)[1]
# => "fresh rosemary"
The parentheses around .+ define a capture group; match returns a MatchData object, and [1] retrieves the text of that first group.
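Note that match returns nil when nothing matches, so calling [1] on the result raises NoMethodError. A hedged alternative is String#[] with a regex and a group index, which simply returns nil in that case. The helper name after_of is made up for this sketch:

```ruby
# Hypothetical helper: returns the text after the first standalone "of", or nil.
def after_of(text)
  # \b keeps words such as "offensive" from matching; 1 selects the capture group.
  text[/\bof\s+(.+)/, 1]
end

found = after_of("4 sprigs of fresh rosemary")
missing = after_of("I will reject this offensive flower")

p found   # "fresh rosemary"
p missing # nil
```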
Here is another option if you always want the end of the line after "of" even if there are multiples of "of" in a sentence
paragraph = "4 sprigs of fresh rosemary
1 x 600 g jar of quality white beans
I have a love of many things including a love of fresh rosemary
I have an of for all things but of the things I of, I of quality white beans the most
I will reject this offensive flower"
paragraph.scan(/(?<=\sof\s)(?!.*\sof\s).+/)
#=> ["fresh rosemary", "quality white beans",
# "fresh rosemary", "quality white beans the most"]
This regex says:
(?<=\sof\s) : Look behind for the literal " of "
(?!.*\sof\s) : Look ahead to make sure there are no more occurrences of " of "
.+ expect one or more characters after the final " of "

Parse multiline text with pattern

here is a little example:
02-09-17 1:01 PM - Some User (Add comments)
Hello,
How are you?
Regards,
02-09-17 3:29 PM - Another User (Add comments)
Hey,
Thanks, all is fine.
Some another text here.
02-09-17 4:30 AM - Just a User (Add comments)
some text
with
multiline
I want to parse and process these three comments. What is the best way to do this?
I tried a regex like this - http://www.rubular.com/r/k1CHJ1STTD - but have problems with the /m flag. Without the multiline flag, the regex can't capture the "body" of a comment.
Also tried to split by regex:
text_above.split(/^(\d{1,2}-\d{1,2}-\d{2} \d{1,2}:\d{1,2} [AP]M - .+ \(Add comments\))/)
=> ["",
"02-09-17 1:01 PM - Some User (Add comments)",
"\n" + "Hello,\n" + "\n" + "How are you?\n" + "\n" + "Regards,\n" + "\n",
"02-09-17 3:29 PM - Another User (Add comments)",
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text here.\n" + "\n",
"02-09-17 4:30 AM - Just a User (Add comments)",
"\n" + "some text\n" + "with\n" + "multiline\n" + "\n",
"02-09-17 5:29 PM - Another User (Add comments)",
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text here.\n" + "\n",
"02-09-17 6:30 AM - Just a User (Add comments)",
"\n" + "some text\n" + "with\n" + "multiline\n"]
But this is not a convenient solution.
Ideally I want to get regex captures with three or two group matches, for example:
1. 02-09-17 1:01 PM
2. Some User (Add comments)
3. Hello,
How are you?
Regards,
for each comment, or, Array of comments:
[['02-09-17 1:01 PM - Some User (Add comments) Hello,
How are you?
Regards,'],[...]]
Any ideas? Thanks.
You can keep it simple using two splits (one for the whole string and one for each block):
text.split(/\n\n(?=\d\d-)/).map { |m| m.split(/ - |\n/, 3) }
You can also use the scan method, but it's a little more fiddly:
text.scan(/([\d-]+[^-]+) - (.*)\n(.*(?>\n.*)*?(?=\n\n\d\d-|\z))/)
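To see what the two-split approach produces, here is a small sketch on a shortened sample. It assumes that a blank line precedes every comment header (the first split relies on that) and wraps each result in a hash whose keys are chosen only for illustration:

```ruby
# Shortened sample; a blank line separates the two comments.
text = "02-09-17 1:01 PM - Some User (Add comments)\nHello,\nHow are you?\n\n" \
       "02-09-17 3:29 PM - Another User (Add comments)\nHey,\nThanks.\n"

comments = text.split(/\n\n(?=\d\d-)/).map do |block|
  # The limit of 3 keeps every later " - " or newline inside the body.
  time, user, body = block.split(/ - |\n/, 3)
  { time: time, user: user, body: body.to_s.strip }
end

comments.each { |c| p c }
```

The lookahead (?=\d\d-) keeps the header at the start of the next block instead of consuming it.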
slice_before might be easier to understand than a huge scan, and it has the advantage of keeping the pattern (split removes it)
data = text.each_line.slice_before(/^\d\d\-\d\d\-\d\d/).map do |block|
time, user = block.shift.strip.split(' - ')
[time, user, block.join.strip]
end
p data
# [["02-09-17 1:01 PM",
# "Some User (Add comments)",
# "Hello,\n\nHow are you?\n\nRegards,"],
# ["02-09-17 3:29 PM",
# "Another User (Add comments)",
# "Hey,\n\nThanks, all is fine.\n\nSome another text here."],
# ["02-09-17 4:30 AM",
# "Just a User (Add comments)",
# "some text\nwith\nmultiline"]]
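If the timestamps need to be processed further, they could be turned into Time objects. The sketch below assumes the dates are in month-day-year order, which the question does not state; swap %m and %d if they are day-month-year.

```ruby
require "time"

# Assumed format: MM-DD-YY, 12-hour clock with AM/PM.
stamp = Time.strptime("02-09-17 1:01 PM", "%m-%d-%y %I:%M %p")

puts stamp.year  # 2017
puts stamp.month # 2
puts stamp.hour  # 13
```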
You can use this regular expression:
(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$)
(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) matches the first group, the date and time. The date must consist of three numbers, separated by a dash, followed by the time with AM/PM
(.*?)\r?\n((?:.|\r?\n)+?) matches the username up to the first line break (\r?\n) as the second group. Afterwards, anything including linebreaks is matching and building the third group, the comment.
On its own this won't work, because it would treat everything from the beginning of the first comment up to the end of the file as one comment. The match therefore has to stop at the next date/time header. Simply repeating the date/time pattern after the comment and matching non-greedily would consume the next header as part of the current match, so the next match could no longer start there (every second comment would be skipped). To circumvent this, use a positive lookahead: (?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$). It checks that the next header follows, but does not include it in the match. The last comment then ends at the end of the string, $.
You need to use the global flag /g but mustn't use the multi-line flag /m, because $ has to match only at the end of the whole string, not at the end of every line.
Here is a live example: https://regex101.com/r/o63GQE/2
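The pattern above is written for a PCRE-style tester. In Ruby, $ always matches at the end of each line and there is no /g flag, so a direct port needs \z for the end of the string and scan for repeated matching. A sketch under those assumptions (the constant names HEADER and COMMENT are made up here):

```ruby
# Date/time header, e.g. "02-09-17 1:01 PM" (no capture groups inside).
HEADER = /\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)/

# Groups: 1 = date/time, 2 = user, 3 = body up to the next header or end of string (\z).
COMMENT = /(#{HEADER}) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n#{HEADER} - |\z)/

text = "02-09-17 1:01 PM - Some User (Add comments)\nHello,\n\nHow are you?\n" \
       "02-09-17 3:29 PM - Another User (Add comments)\nHey,\nThanks."

# scan returns one [time, user, body] array per comment.
matches = text.scan(COMMENT)
matches.each { |m| p m }
```

If the input ends with a newline, the last body will include it, so a final strip on the body may be needed.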

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it to the first line, inside the <A> tag.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time-thing it should be sufficient.
import re
dt = re.compile('^<DT>')
dd = re.compile('^<DD>')
# start empty so a leading line that is neither <DT> nor <DD> cannot hit an undefined name
current_dt = ""
current_dd = ""
with open('bookmarks.xml', 'r') as f:
    for line in f:
        if re.match(dt, line):
            current_dt = line.strip()
        elif re.match(dd, line):
            current_dd = line
            # words starting with '#' are treated as tags
            tags = [w for w in line[4:].split(' ') if w.startswith('#')]
            current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join([t[1:] for t in tags]) + '">', current_dt)
            for t in tags:
                current_dd = current_dd.replace(t + ' ', '')
            if current_dd.strip() == '<DD>':
                current_dd = ""
        else:
            print(current_dt)
            print(current_dd)
            current_dt = ""
            current_dd = ""
    print(current_dt)
    print(current_dd)
If some parts of the code are not clear, just tell me. You can of course use python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
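For comparison, the core step (separating the #tags from the description on a <DD> line) can be sketched in Ruby. The helper name split_tags is hypothetical, and the sketch assumes every whitespace-separated word starting with # is a tag; writing the TAGS attribute into the <A> element is left out.

```ruby
# Hypothetical helper: splits a <DD> line into a tag list and a description.
def split_tags(dd_line)
  # Drop the leading <DD>, then split on any whitespace (this also removes the newline).
  words = dd_line.sub(/\A<DD>/, "").split
  # Words starting with "#" are tags; everything else is the description.
  tags, rest = words.partition { |w| w.start_with?("#") }
  { tags: tags.map { |t| t[1..-1] }.join(","), description: rest.join(" ") }
end

with_description = split_tags("<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland\n")
tags_only = split_tags("<DD>#bug #tracking\n")

p with_description
p tags_only
```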
script.awk
BEGIN{FS="#"}
/^<DT>/{
if(d==1) print "<DT>"s # for printing lines with no tags
s=substr($0,5);tags="" # Copying the line after "<DT>". You'll know why
d=1
}
/^<DD>/{
d=0
m=match(s,/>/) # Find the end of the HREF descriptor (first match of ">")
for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i} # Concatenate tags
td=match(tags,/ /) # Parse for tag description (marked by a preceding space).
if(td==0){ # No description exists
tags=substr(tags,2)
tagdes=""
}
else{ # Description exists
tagdes=substr(tags,td)
tags=substr(tags,2,td-2)
}
print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland

Converting a multi line string to an array in Ruby using line breaks as delimiters

I would like to turn this string
"P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
into an array in Ruby that looks like this:
["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
Using split doesn't return what I would like because of the line breaks.
This is one way to deal with blank lines:
string.split(/\n+/)
For example,
string = "P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
string.split(/\n+/)
#=> ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS",
# "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
To accommodate files created under Windows (having line terminators \r\n) replace the regular expression with /(?:\r?\n)+/.
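A quick sketch of the difference, using a made-up string with Windows line endings and one blank line:

```ruby
# Hypothetical input with \r\n terminators and a blank line.
windows_text = "P07091 MMCNEFFEG\r\n\r\nP06870 IVGGWECEQHS\r\nSP0A8M0 VVPVADVLQGR\n"

# Splitting on \n alone leaves stray carriage returns behind.
unix_only = windows_text.split(/\n+/)

# Treating an optional \r plus \n as one terminator handles both conventions.
both = windows_text.split(/(?:\r?\n)+/)

p unix_only
p both # ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR"]
```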
I like to use this as a pretty generic method for handling newlines and returns:
lines = string.split(/\n+|\r+/).reject(&:empty?)
string = "P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
Using CSV::parse
require 'csv'
CSV.parse(string).flatten
# => ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
Another way using String#each_line :-
ar = []
string.each_line { |line| ar << line.strip unless line == "\n" }
ar # => ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
Building off of @Martin's answer:
lines = string.split("\n").reject(&:blank?)
That'll give you only the non-blank lines. Note that blank? comes from ActiveSupport (Rails), not core Ruby.
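Outside of Rails, a plain-Ruby version of the same idea could look like the sketch below (the sample string is made up):

```ruby
# Hypothetical input with a whitespace-only line and an empty line.
string = "P07091 MMCNEFFEG\n   \n\nP06870 IVGGWECEQHS\n"

# strip.empty? treats whitespace-only lines as blank, without needing ActiveSupport.
lines = string.split("\n").reject { |line| line.strip.empty? }

p lines # ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS"]
```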
Split can take a parameter in the form of the character to use to split, so you can do:
lines = string.split("\n")
I think it should be noted that in some situations, line breaks can include not only newlines (\n) but also carriage returns (\r) and that there could potentially be any combination or quantity thereof. Let's take the following string for example:
str = "Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4... \n
Useful Line 5\r \n
Useful Line 6\n\r
Useful Line 7\n\r\n\r
Useful Line 8 \r\n\r\n
Useful Line 9\r\r\r Useful Line 10\n\n\n\n\nUseful Line 11 \r Useful Line 12"
To deal with all instances of \n and \r, I would do the following to replace all instances of \r with \n using gsub, and then I would combine all consecutive instances of \n using squeeze(arg):
str.gsub("\r", "\n").squeeze("\n")
which would result in :
#=>
"Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4...
Useful Line 5
Useful Line 6
Useful Line 7
Useful Line 8
Useful Line 9
Useful Line 10
Useful Line 11
Useful Line 12"
...which brings me to our next issue. Sometimes those extra line breaks contain unwanted whitespace and not truly blank or empty lines. To deal with not only line breaks but also unwanted empty lines, I would add the each_line, reject, and strip method like so:
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.join
which would result in the desired string:
#=>
Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4...
Useful Line 5
Useful Line 6
Useful Line 7
Useful Line 8
Useful Line 9
Useful Line 10
Useful Line 11
Useful Line 12
Now more specifically to the OP, we could then simply use split("\n") to finish it all off (as was already mentioned by others):
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.join.split("\n")
or we could skip straight to the desired array by splitting first and rejecting the blank entries, leaving off the unnecessary join like so:
str.gsub("\r", "\n").squeeze("\n").split("\n").reject{|x| x.strip == ""}
both of which would result in:
#=>
["Useful Line 1 ....", " Useful Line 2", "Useful Line 3", " Useful Line 4... ", "Useful Line 5", " Useful Line 6", "Useful Line 7", " Useful Line 8 ", "Useful Line 9", " Useful Line 10", "Useful Line 11 ", " Useful Line 12"]
NOTE:
You may also want to strip off leading and trailing whitespace from each line in which case we could replace .join.split("\n") with .map(&:strip) like so:
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.map(&:strip)
or
str.gsub("\r", "\n").squeeze("\n").split("\n").reject{|x| x.strip == ""}.map(&:strip)
which would both result in:
#=>
["Useful Line 1 ....", "Useful Line 2", "Useful Line 3", "Useful Line 4...", "Useful Line 5", "Useful Line 6", "Useful Line 7", "Useful Line 8", "Useful Line 9", "Useful Line 10", "Useful Line 11", "Useful Line 12"]
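The same result can probably be reached in a single pass by splitting on any run of \r and \n characters. This is a sketch with a shortened, made-up input rather than the full string above:

```ruby
# Shortened sample mixing \r\n, \n\r, whitespace-only lines and bare \r runs.
str = "Useful Line 1 \r\n\r\n  Useful Line 2\n\r \nUseful Line 3\r\r\rUseful Line 4"

# Split on runs of CR/LF, trim each piece, then drop what is left empty.
lines = str.split(/[\r\n]+/).map(&:strip).reject(&:empty?)

p lines # ["Useful Line 1", "Useful Line 2", "Useful Line 3", "Useful Line 4"]
```

The character class [\r\n]+ replaces the gsub and squeeze steps, since it treats any mix of the two characters as one separator.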

replace every occurrence of 'line 2' with line_2 with regex

I'm parsing some text from an XML file which has sentences like
"Subtract line 4 from line 1.", "Enter the amount from line 5"
I want to replace all occurrences of line with line_
e.g. Subtract line 4 from line 1 --> Subtract line_4 from line_1
Also, there are sentences like "Are the amounts on lines 4 and 8 the same?" and "Skip lines 9 through 12; go to line 13."
I want to process these sentences to become
"Are the amounts on line_4 and line_8 the same?"
and
"Skip line_9 through line_12; go to line_13."
Here's a working implementation with an rspec test. You call it like this: output = LineIdentifier[input]. To test, run spec file.rb after installing the rspec gem.
require 'spec'
class LineIdentifier
def self.[](input)
output = input.gsub /line (\d+)/, 'line_\1'
output.gsub /lines (\d+) (and|from|through) (line )?(\d+)/, 'line_\1 \2 line_\4'
end
end
describe "LineIdentifier" do
it "should identify line mentions" do
examples = {
#Input Output
'Subtract line 4 from line 1.' => 'Subtract line_4 from line_1.',
'Enter the amount from line 5' => 'Enter the amount from line_5',
'Subtract line 4 from line 1' => 'Subtract line_4 from line_1',
}
examples.each do |input, output|
LineIdentifier[input].should == output
end
end
it "should identify line ranges" do
examples = {
#Input Output
'Are the amounts on lines 4 and 8 the same?' => 'Are the amounts on line_4 and line_8 the same?',
'Skip lines 9 through 12; go to line 13.' => 'Skip line_9 through line_12; go to line_13.',
}
examples.each do |input, output|
LineIdentifier[input].should == output
end
end
end
This works for the specific examples including the ones in the OP comments. As is often the case when using regex to do parsing, it becomes a hodge-podge of additional cases and tests to handle ever-increasing known inputs. This handles the lists of line numbers using a while loop with a non-greedy match. As written, it is simply processing an input line-by-line. To get series of line numbers across line boundaries, it would need to be changed to process it as one chunk with matching across lines.
open( ARGV[0], "r" ) do |file|
while ( line = file.gets )
# replace both "line ddd" and "lines ddd" with line_ddd
line.gsub!( /(lines?\s)(\d+)/, 'line_\2' )
# Now replace the known sequences with a non-greedy match
while line.gsub!( /(line_\d+[a-z]?,?)(\sand\s|\sthrough\s|,\s)(\d+)/, '\1\2line_\3' )
end
puts line
end
end
Sample Data: For this input:
Subtract line 4 from line 1.
Enter the amount from line 5
on lines 4 and 8 the same?
Skip lines 9 through 12; go to line 13.
... on line 10 Form 1040A, lines 7, 8a, 9a, 10, 11b, 12b, and 13
Add lines 2, 3, and 4
It produces this output:
Subtract line_4 from line_1.
Enter the amount from line_5
on line_4 and line_8 the same?
Skip line_9 through line_12; go to line_13.
... on line_10 Form 1040A, line_7, line_8a, line_9a, line_10, line_11b, line_12b, and line_13
Add line_2, line_3, and line_4
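As an alternative to the while loop, the whole list after "line" or "lines" could be matched once and its numbers rewritten inside a gsub block. This is only a sketch: the names NUM, LINE_REF and tag_line_numbers are made up, and it covers just the separators seen in the sample data (commas, "and", "through").

```ruby
# One number token, optionally followed by a lowercase letter (e.g. "8a").
NUM = /\d+[a-z]?/

# "line" or "lines", a first number, then any run of ", N", ", and N", " and N" or " through N".
LINE_REF = /\blines?\s+(#{NUM}(?:(?:,\s*(?:and\s+)?|\s+(?:and|through)\s+)#{NUM})*)/

def tag_line_numbers(text)
  # The outer gsub drops the "line(s) " prefix; the inner gsub prefixes each number in the list.
  text.gsub(LINE_REF) { $1.gsub(NUM) { |n| "line_#{n}" } }
end

puts tag_line_numbers("Skip lines 9 through 12; go to line 13.")
puts tag_line_numbers("... on line 10 Form 1040A, lines 7, 8a, 9a, 10, 11b, 12b, and 13")
```

Because numbers are only rewritten inside a matched list, a form number such as 1040A is left alone.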
sed is your friend:
lines.sed:
#!/bin/sed -rf
s/lines? ([0-9]+)/line_\1/g
s/\b([0-9]+[a-z]?)\b/line_\1/g
lines.txt:
Subtract line 4 from line 1.
Enter the amount from line 5
Are the amounts on lines 4 and 8 the same?
Skip lines 9 through 12; go to line 13.
Enter the total of the amounts from Form 1040A, lines 7, 8a, 9a, 10, 11b, 12b, and 13
Add lines 2, 3, and 4
demo:
$ cat lines.txt | ./lines.sed
Subtract line_4 from line_1.
Enter the amount from line_5
Are the amounts on line_4 and line_8 the same?
Skip line_9 through line_12; go to line_13.
Enter the total of the amounts from Form 1040A, line_7, line_8a, line_9a, line_10, line_11b, line_12b, and line_13
Add line_2, line_3, and line_4
You can also make this into a sed one-liner if you prefer, although the file is more maintainable.
