Unable to split data properly with Ruby Regex Rubular - ruby

I am trying to organize and break up the contents of emails that have been extracted through Net::POP3. In the code, when I use
p mail.pop
I get
****************************\r\n>>=20\r\n>>11) <> Summary: Working with Vars on Social Influence =\r\nplatform=20\r\n>>=20\r\n>> Name: Megumi Lindon \r\n>>=20\r\n>> Category: Social Psychology=20\r\n>>=20\r\n>> Email: information@example.com =\r\n<mailto:information@example.com>=20\r\n>>=20\r\n>> Journal News: Saving Grace \r\n>>=20\r\n>> Deadline: 10:00 PM EST - 15 February=20\r\n>>=20\r\n>> Query:=20\r\n>>=20\r\n>> Lorem ipsum dolor sit amet \r\n>> consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.\r\n>>=20\r\n>> Duis aute irure dolor in reprehenderit in voluptate \r\n>> velit esse cillum dolore eu fugiat nulla pariatur. =20\r\n>>=20\r\n>> Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.=20\r\n>> Requirements:=20\r\n>>=20\r\n>> Psychologists; anyone with good knowdledge\r\n>> with sociology and psychology.=20\r\n>>=20\r\n>> Please do send me your article and profile\r\n>> you want to be known as well. Thank you!=20\r\n>> Back to Top <x-msg://30/#top> Back to Category Index =\r\n<x-msg://30/#SocialPsychology>\r\n>>-----------------------------------\r\n>>=20\r\n>>
I am trying to break it up and organize it to
11) Summary: Working with Vars on Social Influence
Name: Megumi Lindon
Category: Social Psychology
Email: information@example.com
Journal News: Saving Grace
Deadline: 10:00 PM EST - 15 February
Questions:Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Requirements: Psychologists; anyone with good knowdledge with sociology and psychology.
So far, I have been using rubular, but with varying results as I am still learning how to use regex, gsub and split properly. My code thus far is as below.
p mail.pop.scan(/Summary: (.+) Name:/)
p mail.pop.scan(/Name: (.+) Category:/)
p mail.pop.scan(/Category: (.+) Email:/)
p mail.pop.scan(/Email: (.+) Journal News:/)
p mail.pop.scan(/Journal News: (.+) Deadline:/)
p mail.pop.scan(/Deadline: (.+) Questions:/)
p mail.pop.scan(/Questions:(.+) Requirements:/)
p mail.pop.scan(/Requirements:(.+) Back to Top/)
But I have been getting empty arrays.
[]
[]
[]
[]
[]
[]
[]
[]
Wondering how I can do this better. Thanks in advance.

Oh, my! What a mess!
There are many ways to approach this problem, of course, but I expect they all involve multiple steps and lots of trial and error. I can only say how I went about it.
Taking lots of little steps is a good thing, for a couple of reasons. Firstly, it breaks the problem down into manageable tasks whose solutions can be tested individually. Secondly, the parsing rules may change in the future. If you have several steps you may only have to change and/or add one or two operations. If you have few steps and complex regular expressions, you may as well start over, particularly if the code was written by someone else.
Let's say text is a variable containing your string.
Firstly, I don't like all those newlines, because they complicate regexes, so the first thing I'd do is get rid of them:
s1 = text.gsub(/\n/, '')
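To see why the newlines matter: in Ruby, "." does not match "\n" unless the /m flag is set, which is one reason the scan calls in the question come back empty. A minimal sketch, assuming a shortened, made-up fragment of the message (the variable names are only for illustration):

```ruby
# Shortened, made-up fragment of the message body (an assumption for illustration).
text = "Summary: Working with Vars =\r\nplatform=20\r\n>>=20\r\n>> Name: Megumi Lindon \r\n"

# Without /m, "." stops at "\n", so the pattern can never reach " Name:".
single_line = text.scan(/Summary: (.+) Name:/)
p single_line         # => []

# With /m, "." also matches "\n", so the capture spans several lines
# (and still contains all the "=20", "\r\n" and ">>" noise).
multi_line = text.scan(/Summary: (.+) Name:/m)
p multi_line.length   # => 1
```

Even with /m the captured text is still full of noise, which is why the cleanup steps that follow are needed.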
Next, there are many "20\r"'s which can be troublesome, as we may want to keep other text that contains numbers, so we can remove those (as well as "7941\r"):
s2 = s1.gsub(/\d+\r/, '')
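As a side note, the "=20" sequences and the "=" signs at the ends of lines look like quoted-printable encoding ("=20" is an encoded space, "=" before a line break is a soft line break). If that assumption holds, a different technique from the digit-stripping above is to decode the text with Ruby's built-in String#unpack1 and the "M" directive. A rough sketch on a made-up fragment:

```ruby
# Assumption: the mail body is quoted-printable encoded.
# The sample is a shortened, hypothetical fragment of the message.
raw = "Summary: Working with Vars on Social Influence =\r\nplatform=20\r\n"

# "M" is the quoted-printable directive: "=20" becomes a space and
# "=" followed by a line break (a soft line break) is dropped.
decoded = raw.unpack1("M")

p decoded
```

This is not what the rest of this answer does; the remaining steps keep working on s2 as computed above.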
Now let's look at the fields you want and the immediately-preceding and immediately-following text:
puts s2.scan(/.{4}(?:\w+\s+)*\w+:.{15}/)
# <> Summary: Working with V
#=>> Name: Megumi Lindon
#=>> Category: Social Psychol
#=>> Email: information@ex
#<mailto:information@exa
#=>> Journal News: Saving Grace
#=>> Deadline: 10:00 PM EST -
#=>> Query:=>>=>> Lorem ip
#=>> Requirements:=>>=>> Psycholo
# <x-msg://30/#top> Back
#<x-msg://30/#SocialPsy
We see that the fields of interest begin with "> " and the field name is followed by ": " or ":=". Let's simplify by changing ":=" to ": " after the field name and "> " to " :" before the field name:
s3 = s2.gsub(/(?<=\w):=/, ": ")
s4 = s3.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
In the regex for s3, (?<=\w) is a "positive lookbehind": the match must be immediately preceded by a word character (which is not included as part of the match); in the regex for s4, (?=(?:\w+\s+)*\w+: ) is a "positive lookahead": the match must be immediately followed by one or more words followed by a colon then a space. Note that s3 and s4 must be calculated in the given order.
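The two lookarounds can be tried in isolation. The sketch below uses short, made-up strings (not the real message) to show that only the matched part is replaced, while the text the lookbehind and lookahead inspect is left alone:

```ruby
# Lookbehind: ":=" is replaced only when a word character comes right before it.
fixed_colon = "Query:=>> Lorem".gsub(/(?<=\w):=/, ": ")
p fixed_colon   # => "Query: >> Lorem"

# Lookahead: "> " is replaced only when a field name and ": " follow.
# The field name itself is not consumed, so it survives the substitution.
marked = "=>> Name: Megumi".gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
p marked        # => "=> :Name: Megumi"
```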
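We can now remove everything other than letters, digits, spaces and a handful of punctuation characters. Note that inside the character class "!-(" is a range ("!" through "("), not three literal characters, so hyphens are removed as well: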
s5 = s4.gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
and then (finally) split on the fields:
a1 = s5.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
# => ["11) :", "Summary: ", "Working with Vars on Social Influence platform :",
# "Name: ", "Megumi Lindon :",
# "Category: ", "Social Psychology :",
# "Email: ", "informationexample.com mailto:informationexample.com :",
# "Journal News: ", "Saving Grace :",
# "Deadline: ", "10:00 PM EST 15 February :",
# "Query: ", "Lorem ipsum ...laborum. :",
# "Requirements: ", "Psychologists; anyone...psychology...Top xmsg:30#top...Psychology"]
Note that I have enclosed (?<= :)(?:\w+\s+)*\w+:\s+ in a capture group, so that String#split will include the bits that it splits on in the resulting array.
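A small illustration of that behaviour of String#split, using a made-up string and a simpler pattern than the one above:

```ruby
line = "Name: Megumi Lindon Category: Social Psychology"

# Without a capture group the separators are thrown away.
without_group = line.split(/\w+: /)
p without_group   # => ["", "Megumi Lindon ", "Social Psychology"]

# With a capture group the separators are kept as separate elements.
with_group = line.split(/(\w+: )/)
p with_group      # => ["", "Name: ", "Megumi Lindon ", "Category: ", "Social Psychology"]
```

The leading empty string appears because the string starts with a separator.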
All that remains is some cleaning-up:
a2 = a1.map { |s| s.chomp(':') }
a2[0] = a2.shift + a2.first
#=> "11) Summary: "
a3 = a2.each_slice(2).to_a
#=> [["11) Summary: ", "Working with Vars on Social Influence platform "],
# ["Name: ", "Megumi Lindon "],
# ["Category: ", "Social Psychology "],
# ["Email: ", "informationexample.com mailto:informationexample.com "],
# ["Journal News: ", "Saving Grace "],
# ["Deadline: ", "10:00 PM EST 15 February "],
# ["Query: ", "Lorem...est laborum. "],
# ["Requirements: ", "Psychologists;...psychology. Please...xmsg:30#SocialPsychology"]]
idx = a3.index { |n,_| n =~ /Email: / }
#=> 3
a3[idx][1] = a3[idx][1][/.*?\s/] if idx
#=> "informationexample.com "
Join the strings and remove extra spaces:
a4 = a3.map { |b| b.join(' ').split.join(' ') }
#=> ["11) Summary: Working with Vars on Social Influence platform",
# "Name: Megumi Lindon",
# "Category: Social Psychology",
# "Email: informationexample.com",
# "Journal News: Saving Grace",
# "Deadline: 10:00 PM EST 15 February",
# "Query: Lorem...laborum.",
# "Requirements: Psychologists...psychology. Please...well. Thank...Psychology"]
"Requirements" is still problematic, but without additional rules, nothing more can be done. We cannot limit all category values to a single sentence because "Query" can have more than one. If you wish to limit "Requirements" to one sentence:
idx = a4.index { |n,_| n =~ /Requirements: / }
#=> 7
a4[idx] = a4[idx][/.*?[.!?]/] if idx
# => "Requirements: Psychologists; anyone with good knowdledge with sociology and psychology."
If you wish to combine these operations:
def parse_it(text)
  a1 = text.gsub(/\n/, '')
           .gsub(/\d+\r/, '')
           .gsub(/(?<=\w):=/, ": ")
           .gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
           .gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
           .split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
           .map { |s| s.chomp(':') }
  a1[0] = a1.shift + a1.first
  a2 = a1.each_slice(2).to_a
  idx = a2.index { |n,_| n =~ /Email: / }
  a2[idx][1] = a2[idx][1][/.*?\s/] if idx
  a3 = a2.map { |b| b.join(' ').split.join(' ') }
  idx = a3.index { |n,_| n =~ /Requirements: / }
  a3[idx] = a3[idx][/.*?[.!?]/] if idx
  a3
end

Related

XPath: extract element nodes before/after substring

I have an XML file which basically is a list of entry elements:
...
<entry> Lorem ipsum <name n="1">dolor</name> sit amet, <name n="2">consectetur</name> adipiscing
elit, XXXXX sed do <name n="3">eiusmod</name> tempor incididunt ut labore et dolore magna
aliqua. YYYYY Ut enim ad minim veniam, quis <name n="3">nostrud</name> exercitation ullamco
laboris nisi ut aliquip ex ea commodo consequat.
</entry>
...
Each entry element has text with an arbitrary number of name elements. The text of each entry is split in three sections by the strings XXXXX and YYYYY. I'd like to access the name elements as three distinct lists:
1. all names of section I (from start of text to XXXXX)
2. all names of section II (from XXXXX to YYYYY)
3. all names of section III (from YYYYY to end of text)
My first instinct for 1. was to split the string at XXXXX, and look into results for extracting the name nodes, but //entry/tokenize(string(), "XXXXX") returns a list of strings, without the name nodes.
Is there an XPath expression that solves this?

Parsing lines of text from external file in Ruby

I am trying to parse a raw email. The desired result is a hash of the lines that contain specific headers.
This is the Ruby file:
raw_email = File.open("sample-email.txt", "r")
parsed_email = Hash.new('')
raw_email.each do |line|
  puts line
  header = line.chomp(":")
  puts header
  if header == "Delivered-To"
    parsed_email[:to] = line
  elsif header == "From"
    parsed_email[:from] = line
  elsif header == "Date"
    parsed_email[:date] = line
  elsif header == "Subject"
    parsed_email[:subject] = line
  end
end
puts parsed_email
And this is the raw email:
Delivered-To: user1@example.com
From: John Doe <user2@example.com>
Date: Tue, 12 Dec 2017 13:30:14 -0500
Subject: Testing the parser
To: user1@example.com
Content-Type: multipart/alternative;
boundary="123456789abcdefghijklmnopqrs"
--123456789abcdefghijklmnopqrs
Content-Type: text/plain; charset="UTF-8"
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer nec
odio. Praesent libero. Sed cursus ante dapibus diam. Sed nisi. Nulla
quis sem at nibh elementum imperdiet. Duis sagittis ipsum.
--123456789abcdefghijklmnopqrs
Content-Type: text/html; charset="UTF-8"
<div dir="ltr">Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Integer nec odio. Praesent libero. Sed cursus ante dapibus diam.
Sed nisi. Nulla quis sem at nibh elementum imperdiet. Duis sagittis
ipsum.<br clear="all">
</div>
--089e082c24dc944a9f056028d791--
The puts statements are just for my own testing to see if data is being passed along.
What I am getting is each full line put twice and an empty hash put at the end.
I have also tried changing different bits to strings or arrays and I've also tried using line.split(":", 1) instead of line.chomp(":")
Can someone please explain why this isn't working?
Try this
raw_email = File.open("sample-email.txt", "r")
parsed_email = {}
raw_email.each do |line|
  case line.split(":")[0]
  when "Delivered-To"
    parsed_email[:to] = line
  when "From"
    parsed_email[:from] = line
  when "Date"
    parsed_email[:date] = line
  when "Subject"
    parsed_email[:subject] = line
  end
end
puts parsed_email
=> {:to=>"Delivered-To: user1@example.com\n", :from=>"From: John Doe <user2@example.com>\n", :date=>"Date: Tue, 12 Dec 2017 13:30:14 -0500\n", :subject=>"Subject: Testing the parser\n"}
Explanation
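line.chomp(":") only removes a ":" at the very end of the string, so header is still the whole line and never equals "Delivered-To", "From" and so on. Likewise, line.split(":", 1) returns the whole line as a single element, because a limit of 1 means "at most one piece". Split the line on ":" and take the first element instead: line.split(":")[0]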
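The difference between these calls can be checked on a single header line. This is a small sketch using the Date header from the sample email; the variable names are only for illustration:

```ruby
line = "Date: Tue, 12 Dec 2017 13:30:14 -0500\n"

# chomp(":") strips a trailing ":" only; this line ends in "\n", so nothing changes.
chomped = line.chomp(":")
p chomped == line          # => true

# A limit of 1 means "at most one piece", so the whole line comes back.
limit_one = line.split(":", 1)
p limit_one.length         # => 1

# Splitting on ":" and taking the first element yields the header name.
name = line.split(":")[0]
p name                     # => "Date"

# A limit of 2 keeps the value intact even though it contains more colons.
_, value = line.split(":", 2)
p value.strip              # => "Tue, 12 Dec 2017 13:30:14 -0500"
```

Using a limit of 2 is one way to store only the value rather than the full line, if that is what the hash should hold.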

Getting the MoreLinq MaxBy function to return more than one element

I have a situation in which I have a list of objects with an int property and I need to retrieve the 3 objects with the highest value for that property. The MoreLinq MaxBy function is very convenient to find the single highest, but is there a way I can use that to find the 3 highest? (Not necessarily of the same value). The implementation I'm currently using is to find the single highest with MaxBy, remove that object from the list and call MaxBy again, etc. and then add the objects back into the list once I've found the 3 highest. Just thinking about this implementation makes me cringe and I'd really like to find a better way.
Update: In version 3, MaxBy (including MinBy) of MoreLINQ was changed to return a sequence as opposed to a single item.
Use MoreLINQ's PartialSort or PartialSortBy. The example below uses PartialSortBy to find and print the longest 5 words in a given text:
var text = @"
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Etiam gravida nec mauris vitae sollicitudin. Suspendisse
malesuada urna eu mi suscipit fringilla. Donec ut ipsum
aliquet, tincidunt mi sed, efficitur magna. Nulla sit
amet congue diam, at posuere lectus. Praesent sit amet
libero vehicula dui commodo gravida eget a nisi. Sed
imperdiet arcu eget erat feugiat gravida id non est.
Nam malesuada nibh sit amet nisl sollicitudin vestibulum.";
var words = Regex.Matches(text, @"\w+");
var top =
from g in words.Cast<Match>()
.Select(m => m.Value)
.GroupBy(s => s.Length)
.PartialSortBy(5, m => m.Key, OrderByDirection.Descending)
select g.Key + ": " + g.Distinct().ToDelimitedString(", ");
foreach (var e in top)
Console.WriteLine(e);
It will print:
14: malesuadafsfjs
12: sollicitudin
11: consectetur, Suspendisse
10: adipiscing, vestibulum
9: malesuada, fringilla, tincidunt, efficitur, imperdiet
In this case, you could simply do
yourResult.OrderByDescending(m => m.YourIntProperty)
.Take(3);
Now, this will retrieve you 3 objects.
So if you've got 4 objects sharing the same value (which is the max), 1 will be skipped. Not sure if that's what you want, or if it's not a problem...
But MaxBy will also retrieve only one element if you have many elements with the same "max value".

how to extract file names with .txt from a text?

I have a text like below,
Lorem ipsum dolor sit amet, consectetur sample1.txt adipiscing elit. Morbi nec urna non ante varius semper eget vitae ipsum. Pellentesque habitant sample2.txt morbi tristique senectus et netus et malesuada fames.
The text above contains sample1.txt and sample2.txt. The names are not always sample1 and sample2; they vary. I just need to fetch the file names using C#.
Can anyone please help me with this?
Since you tagged it LINQ:
var filesnames = text.Split(new char[] { }) // split on whitespace into words
.Where(word => word.EndsWith(".txt"));
Try something like this
var filesnames = text.Split(' ')
.Where(o => o.EndsWith(".txt")) // keep only the words that end in ".txt"
.ToList();
It may be possible with a regular expression if there's a good way to capture what your file names will look like. I'm assuming here it's always blah.txt with alphanumeric characters:
var matches = Regex.Matches(input, @"\b[a-zA-Z0-9]+\.txt\b");

Why are different delimiters used in percent notation?

I have seen different people use different types of braces/brackets for this. I tried them out in script console, and they all work. Why do they all work and does it matter which is used?
%w|one two|
%w{one two}
%w[one two]
%w(one two)
Actually, a much wider variety of characters can be used. Any non-alphanumeric character except = can be used.
%w!a!
%w@b@
%w#c#
%w$d$
%w%e%
%w^f^
%w&g&
%w*h*
%w(i)
%w_j_
%w-k-
%w+l+
%w\m\
%w|n|
%w`o`
%w~p~
%w[q]
%w{r}
%w;s;
%w:t:
%w'u'
%w"v"
%w,w,
%w<x>
%w.y.
%w/z/
%w?aa?
No difference. The reason for the flexibility is so that you can pick delimiters that won't appear within your %w() string.
You get to choose your own delimiter. Pick the one that saves you from having to escape characters.
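A short sketch of what that means in practice (the strings and variable names are made up for illustration):

```ruby
# All of these build the same array; only the delimiter differs.
pipes    = %w|one two|
braces   = %w{one two}
brackets = %w[one two]
parens   = %w(one two)
p [pipes, braces, brackets, parens].uniq   # => [["one", "two"]]

# Paired delimiters nest, so balanced parentheses need no escaping.
nested = %w(a (b) c)
p nested                                   # => ["a", "(b)", "c"]

# Picking a delimiter that does not occur in the text avoids backslashes.
quoted = %q|it's a "(quoted)" string|
puts quoted
```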
There is no difference, just a personal preference for multiline strings. Some people like to use...
<<eos
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud.
eos
As long as the beginning and end are the same then you are fine.
