Parsing lines of text from an external file in Ruby

I am trying to parse a raw email. The desired result is a hash of the lines that contain specific headers.
This is the Ruby file:
raw_email = File.open("sample-email.txt", "r")
parsed_email = Hash.new('')
raw_email.each do |line|
puts line
header = line.chomp(":")
puts header
if header == "Delivered-To"
parsed_email[:to] = line
elsif header == "From"
parsed_email[:from] = line
elsif header == "Date"
parsed_email[:date] = line
elsif header == "Subject"
parsed_email[:subject] = line
end
end
puts parsed_email
And this is the raw email:
Delivered-To: user1@example.com
From: John Doe <user2@example.com>
Date: Tue, 12 Dec 2017 13:30:14 -0500
Subject: Testing the parser
To: user1@example.com
Content-Type: multipart/alternative;
boundary="123456789abcdefghijklmnopqrs"
--123456789abcdefghijklmnopqrs
Content-Type: text/plain; charset="UTF-8"
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer nec
odio. Praesent libero. Sed cursus ante dapibus diam. Sed nisi. Nulla
quis sem at nibh elementum imperdiet. Duis sagittis ipsum.
--123456789abcdefghijklmnopqrs
Content-Type: text/html; charset="UTF-8"
<div dir="ltr">Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Integer nec odio. Praesent libero. Sed cursus ante dapibus diam.
Sed nisi. Nulla quis sem at nibh elementum imperdiet. Duis sagittis
ipsum.<br clear="all">
</div>
--089e082c24dc944a9f056028d791--
The puts statements are just for my own testing to see if data is being passed along.
What I am getting is each full line put twice and an empty hash put at the end.
I have also tried changing different bits to strings or arrays and I've also tried using line.split(":", 1) instead of line.chomp(":")
Can someone please explain why this isn't working?

Try this
parsed_email = {}
# File.foreach yields each line and closes the file when it is done.
File.foreach("sample-email.txt") do |line|
  # The header name is everything before the first colon.
  case line.split(":")[0]
  when "Delivered-To"
    parsed_email[:to] = line
  when "From"
    parsed_email[:from] = line
  when "Date"
    parsed_email[:date] = line
  when "Subject"
    parsed_email[:subject] = line
  end
end
puts parsed_email
=> {:to=>"Delivered-To: user1@example.com\n", :from=>"From: John Doe <user2@example.com>\n", :date=>"Date: Tue, 12 Dec 2017 13:30:14 -0500\n", :subject=>"Subject: Testing the parser\n"}
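Explanation
line.chomp(":") only removes a ":" from the very end of the string. Your lines end with a newline, not a colon, so header is still the whole line. That is why every line is printed twice and none of the comparisons match. line.split(":", 1) does not help either, because a limit of 1 returns the whole line as a single element.
You need to split the line on ":" and take the first element, like this: line.split(":")[0]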
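If you want the hash to hold only the header values rather than the full lines, something along these lines might work. It is a minimal sketch: the WANTED table and the parse_headers helper are names made up for illustration, and it assumes every header fits on a single line.

```ruby
# Hypothetical lookup table: header name => key in the result hash.
WANTED = {
  "Delivered-To" => :to,
  "From"         => :from,
  "Date"         => :date,
  "Subject"      => :subject
}.freeze

# Hypothetical helper: takes anything that yields lines (an array, File.foreach, ...).
def parse_headers(lines)
  parsed = {}
  lines.each do |line|
    # A limit of 2 splits on the first colon only, so colons inside
    # the value (such as the time in the Date header) are left alone.
    name, value = line.chomp.split(":", 2)
    key = WANTED[name]
    parsed[key] = value.strip if key && value
  end
  parsed
end

sample = [
  "Delivered-To: user1@example.com\n",
  "Date: Tue, 12 Dec 2017 13:30:14 -0500\n",
  "Content-Type: text/plain; charset=\"UTF-8\"\n"
]
p parse_headers(sample)
```

With a file you could pass the enumerator directly, e.g. parse_headers(File.foreach("sample-email.txt")).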

Related

Paginating imported XML Data

All,
I'm having a hell of a time getting this to work. I have a very basic XML structure:
<root>
<item>
<header>NEW HEADER</header>
<body>NEW BODY - Sed auctor justo et erat rutrum, nec molestie neque placerat. Quisque efficitur condimentum velit nec volutpat. Nunc sed magna vel mauris convallis sodales</body>
<footer>NEW - Footer: Donec in nibh risus. Sed placerat felis non pellentesque placerat. In non risus a elit malesuada consectetur.</footer>
</item>
<item>
<header>NEW HEADER 2</header>
<body>NEW BODY - Sed auctor justo et erat rutrum, nec molestie neque placerat. Quisque efficitur condimentum velit nec volutpat. Nunc sed magna vel mauris convallis sodales</body>
<footer>NEW - Footer: Donec in nibh risus. Sed placerat felis non pellentesque placerat. In non risus a elit malesuada consectetur.</footer>
</item>
</root>
I've created an InDesign template with tagged text-area placeholders. What I want to achieve is create a new page for each <item> tag and populate the data appropriately. When I load my XML, it loads each <item> but it doesn't generate a new page for each one.
Any help would be appreciated.
That's because you need to understand some basic rules. Number one is that XML is just about text within InDesign. In your case, your template needs to contain a generic set of tags and a page break character. You then ask InDesign to duplicate that set and character at every occurrence of the repeated incoming node. I wrote a blog post that talks about all those peculiarities. Especially for rookies ;) : http://www.ozalto.com/en/5-errors-you-will-do-with-indesign-xml/
You'll want to take a look at the "Merge Mode" section of Adobe's Importing XML documentation here:
https://helpx.adobe.com/indesign/using/importing-xml.html
From that page:
Merge mode not only makes automated layout possible, it provides more
advanced import options, including the ability to filter incoming text
and clone elements for repeating data.
It sounds like you need the "clone elements" feature.
To get a new page for each <item>, put a page break at the end of each <item>.
Then make sure to set a "Primary Text Frame" on your master page.
https://helpx.adobe.com/indesign/using/whats-new-cs6.html#id_16192
With this set, InDesign will simply create a new page as needed.

Unable to split data properly with Ruby Regex Rubular

I am trying to organize and break up the contents of emails that have been extracted through Net::POP3. In the code, when I use
p mail.pop
I get
****************************\r\n>>=20\r\n>>11) <> Summary: Working with Vars on Social Influence =\r\nplatform=20\r\n>>=20\r\n>> Name: Megumi Lindon \r\n>>=20\r\n>> Category: Social Psychology=20\r\n>>=20\r\n>> Email: information#example.com =\r\n<mailto:information#example.com>=20\r\n>>=20\r\n>> Journal News: Saving Grace \r\n>>=20\r\n>> Deadline: 10:00 PM EST - 15 February=20\r\n>>=20\r\n>> Query:=20\r\n>>=20\r\n>> Lorem ipsum dolor sit amet \r\n>> consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.\r\n>>=20\r\n>> Duis aute irure dolor in reprehenderit in voluptate \r\n>> velit esse cillum dolore eu fugiat nulla pariatur. =20\r\n>>=20\r\n>> Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.=20\r\n>> Requirements:=20\r\n>>=20\r\n>> Psychologists; anyone with good knowdledge\r\n>> with sociology and psychology.=20\r\n>>=20\r\n>> Please do send me your article and profile\r\n>> you want to be known as well. Thank you!=20\r\n>> Back to Top <x-msg://30/#top> Back to Category Index =\r\n<x-msg://30/#SocialPsychology>\r\n>>-----------------------------------\r\n>>=20\r\n>>
I am trying to break it up and organize it to
11) Summary: Working with Vars on Social Influence
Name: Megumi Lindon
Category: Social Psychology
Email: information#example.com
Journal News: Saving Grace
Deadline: 10:00 PM EST - 15 February
Questions:Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Requirements: Psychologists; anyone with good knowdledge with sociology and psychology.
So far, I have been using rubular, but with varying results as I am still learning how to use regex, gsub and split properly. My code thus far is as below.
p mail.pop.scan(/Summary: (.+) Name:/)
p mail.pop.scan(/Name: (.+) Category:/)
p mail.pop.scan(/Category: (.+) Email:/)
p mail.pop.scan(/Email: (.+) Journal News:/)
p mail.pop.scan(/Journal News: (.+) Deadline:/)
p mail.pop.scan(/Deadline: (.+) Questions:/)
p mail.pop.scan(/Questions:(.+) Requirements:/)
p mail.pop.scan(/Requirements:(.+) Back to Top/)
But I have been getting empty arrays.
[]
[]
[]
[]
[]
[]
[]
[]
Wondering how I can do this better. Thanks in advance.
Oh, my! What a mess!
There are many ways to approach this problem, of course, but I expect they all involve multiple steps and lots of trial and error. I can only say how I went about it.
Lots of little steps is a good thing, for a couple of reasons. Firstly, it breaks the problem down into manageable tasks whose solutions can be tested individually. Secondly, the parsing rules may change in the future. If you have several steps you may only have to change and/or add one or two operations. If you have few steps and complex regular expressions, you may as well start over, particularly if the code was written by someone else.
Let's say text is a variable containing your string.
Firstly, I don't like all those newlines, because they complicate regex's, so the first thing I'd do is get rid of them:
s1 = text.gsub(/\n/, '')
Next, there are many "20\r"'s which can be troublesome, as we may want to keep other text that contains numbers, so we can remove those (as well as "7941\r"):
s2 = s1.gsub(/\d+\r/, '')
Now let's look at the fields you want and the immediately-preceding and immediately-following text:
puts s2.scan(/.{4}(?:\w+\s+)*\w+:.{15}/)
# <> Summary: Working with V
#=>> Name: Megumi Lindon
#=>> Category: Social Psychol
#=>> Email: information#ex
#<mailto:information#exa
#=>> Journal News: Saving Grace
#=>> Deadline: 10:00 PM EST -
#=>> Query:=>>=>> Lorem ip
#=>> Requirements:=>>=>> Psycholo
# <x-msg://30/#top> Back
#<x-msg://30/#SocialPsy
We see that the fields of interest begin with "> " and the field name is followed by ": " or ":=". Let's simplify by changing ":=" to ": " after the field name and "> " to " :" before the field name:
s3 = s2.gsub(/(?<=\w):=/, ": ")
s4 = s3.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
In the regex for s3, (?<=\w) is a "positive lookbehind": the match must be immediately preceded by a word character (which is not included as part of the match); in the regex for s4, (?=(?:\w+\s+)*\w+: ) is a "positive lookahead": the match must be immediately followed by one or more words followed by a colon then a space. Note that s3 and s4 must be calculated in the given order.
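To see the two lookarounds in isolation, here is a small sketch on a made-up string (the sample text is an assumption, not taken from the email). Note how the second "> " is left alone because no field name follows it.

```ruby
s = "> Name:=Megumi > Lorem ipsum"

# Lookbehind: ":=" is replaced only when a word character comes right before it.
# The word character itself is not part of the match, so it is kept.
s3 = s.gsub(/(?<=\w):=/, ": ")
puts s3   # prints "> Name: Megumi > Lorem ipsum"

# Lookahead: "> " is replaced only when words ending in ": " come right after it.
# The field name is only checked, not consumed, so it is kept.
s4 = s3.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
puts s4   # prints " :Name: Megumi > Lorem ipsum"
```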
We can now remove all the non-word characters other than punctuation characters and spaces:
s5 = s4.gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
and then (finally) split on the fields:
a1 = s5.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
# => ["11) :", "Summary: ", "Working with Vars on Social Influence platform :",
# "Name: ", "Megumi Lindon :",
# "Category: ", "Social Psychology :",
# "Email: ", "informationexample.com mailto:informationexample.com :",
# "Journal News: ", "Saving Grace :",
# "Deadline: ", "10:00 PM EST 15 February :",
# "Query: ", "Lorem ipsum ...laborum. :",
# "Requirements: ", "Psychologists; anyone...psychology...Top xmsg:30#top...Psychology"]
Note that I have enclosed (?<= :)(?:\w+\s+)*\w+:\s+ in a capture group, so that String#split will include the bits that it splits on in the resulting array.
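The effect of the capture group is easiest to see side by side. This is a small sketch on a made-up string, using a simplified version of the pattern (single-word field names only):

```ruby
s = "a :Name: Megumi :Category: Social"

# Without a capture group, the text matched by the separator is discarded.
without_group = s.split(/(?<= :)\w+:\s+/)
p without_group   # prints ["a :", "Megumi :", "Social"]

# With a capture group, String#split also returns what the group matched.
with_group = s.split(/((?<= :)\w+:\s+)/)
p with_group      # prints ["a :", "Name: ", "Megumi :", "Category: ", "Social"]
```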
All that remains is some cleaning-up:
a2 = a1.map { |s| s.chomp(':') }
a2[0] = a2.shift + a2.first
#=> "11) Summary: "
a3 = a2.each_slice(2).to_a
#=> [["11) Summary: ", "Working with Vars on Social Influence platform "],
# ["Name: ", "Megumi Lindon "],
# ["Category: ", "Social Psychology "],
# ["Email: ", "informationexample.com mailto:informationexample.com "],
# ["Journal News: ", "Saving Grace "],
# ["Deadline: ", "10:00 PM EST 15 February "],
# ["Query: ", "Lorem...est laborum. "],
# ["Requirements: ", "Psychologists;...psychology. Please...xmsg:30#SocialPsychology"]]
idx = a3.index { |n,_| n =~ /Email: / }
#=> 3
a3[idx][1] = a3[idx][1][/.*?\s/] if idx
#=> "informationexample.com "
Join the strings and remove extra spaces:
a4 = a3.map { |b| b.join(' ').split.join(' ') }
#=> ["11) Summary: Working with Vars on Social Influence platform",
# "Name: Megumi Lindon",
# "Category: Social Psychology",
# "Email: informationexample.com",
# "Journal News: Saving Grace",
# "Deadline: 10:00 PM EST 15 February",
# "Query: Lorem...laborum.",
# "Requirements: Psychologists...psychology. Please...well. Thank...Psychology"]
"Requirements" is still problematic, but without additional rules, nothing more can be done. We cannot limit all category values to a single sentence because "Query" can have more than one. If you wish to limit "Requirements" to one sentence:
idx = a4.index { |n,_| n =~ /Requirements: / }
#=> 7
a4[idx] = a4[idx][/.*?[.!?]/] if idx
# => "Requirements: Psychologists; anyone with good knowdledge with sociology and psychology."
If you wish to combine these operations:
def parse_it(text)
  a1 = text.gsub(/\n/, '')
           .gsub(/\d+\r/, '')
           .gsub(/(?<=\w):=/, ": ")
           .gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
           .gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
           .split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
           .map { |s| s.chomp(':') }
  a1[0] = a1.shift + a1.first
  a2 = a1.each_slice(2).to_a
  idx = a2.index { |n,_| n =~ /Email: / }
  a2[idx][1] = a2[idx][1][/.*?\s/] if idx
  a3 = a2.map { |b| b.join(' ').split.join(' ') }
  idx = a3.index { |n,_| n =~ /Requirements: / }
  a3[idx] = a3[idx][/.*?[.!?]/] if idx
  a3
end
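A different route from the steps above, offered only as a sketch: the "=20" sequences and the "=" at the end of some lines look like quoted-printable encoding, where "=20" stands for a space and a trailing "=" is a soft line break. If that assumption holds for your mail, String#unpack1 with the "M" directive can decode it before any regex work. The sample string and variable names below are made up and shortened.

```ruby
# Hypothetical sample in the same shape as the string in the question.
raw = ">> Name: Megumi Lindon \r\n>>=20\r\n" \
      ">> Category: Social Psychology=20\r\n>>=20\r\n" \
      ">> Email: info@example.com =\r\n<mailto:info@example.com>=20\r\n"

# "M" decodes quoted-printable: "=20" becomes a space and "=\r\n" is dropped,
# which joins the two halves of the Email line.
decoded = raw.unpack1("M")

# Remove the ">>" reply markers and surrounding whitespace, then drop empty lines.
lines = decoded.lines.map { |l| l.sub(/\A>+/, "").strip }.reject(&:empty?)
puts lines
```

Each remaining line then starts with its field name, so it can be split on the first ": ". Multi-line fields such as "Query" would still need to be joined in a further step.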

Getting the MoreLinq MaxBy function to return more than one element

I have a situation in which I have a list of objects with an int property and I need to retrieve the 3 objects with the highest value for that property. The MoreLinq MaxBy function is very convenient to find the single highest, but is there a way I can use that to find the 3 highest? (Not necessarily of the same value). The implementation I'm currently using is to find the single highest with MaxBy, remove that object from the list and call MaxBy again, etc. and then add the objects back into the list once I've found the 3 highest. Just thinking about this implementation makes me cringe and I'd really like to find a better way.
Update: In version 3, MaxBy (including MinBy) of MoreLINQ was changed to return a sequence as opposed to a single item.
Use MoreLINQ's PartialSort or PartialSortBy. The example below uses PartialSortBy to find and print the longest 5 words in a given text:
var text = @"
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Etiam gravida nec mauris vitae sollicitudin. Suspendisse
malesuada urna eu mi suscipit fringilla. Donec ut ipsum
aliquet, tincidunt mi sed, efficitur magna. Nulla sit
amet congue diam, at posuere lectus. Praesent sit amet
libero vehicula dui commodo gravida eget a nisi. Sed
imperdiet arcu eget erat feugiat gravida id non est.
Nam malesuada nibh sit amet nisl sollicitudin vestibulum.";
var words = Regex.Matches(text, @"\w+");
var top =
    from g in words.Cast<Match>()
                   .Select(m => m.Value)
                   .GroupBy(s => s.Length)
                   .PartialSortBy(5, m => m.Key, OrderByDirection.Descending)
    select g.Key + ": " + g.Distinct().ToDelimitedString(", ");
foreach (var e in top)
    Console.WriteLine(e);
It will print:
14: malesuadafsfjs
12: sollicitudin
11: consectetur, Suspendisse
10: adipiscing, vestibulum
9: malesuada, fringilla, tincidunt, efficitur, imperdiet
In this case, you could simply do
yourResult.OrderByDescending(m => m.YourIntProperty)
.Take(3);
This returns exactly 3 objects.
So if 4 objects share the same maximum value, one of them will be left out. Not sure if that's what you want, or if it's not a problem...
But MaxBy (before MoreLINQ version 3) will also return only one element if several elements share the same "max value".

How to extract file names ending in .txt from a text?

I have a text like below,
Lorem ipsum dolor sit amet, consectetur sample1.txt adipiscing elit. Morbi nec urna non ante varius semper eget vitae ipsum. Pellentesque habitant sample2.txt morbi tristique senectus et netus et malesuada fames.
I have sample1.txt and sample2.txt in the above text. The names will not always be sample1 and sample2. I just need to fetch the file names using C#.
Can anyone please help me with this?
Since you tagged it LINQ:
var filesnames = text.Split(new char[] { }) // split on whitespace into words
.Where(word => word.EndsWith(".txt"));
Try something like this
var filesnames = text.Split(' ')
.Where(o => o.EndsWith(".txt"))
// Substring(0, ...) keeps the name without the ".txt" extension
.Select(o => o.Substring(0, o.LastIndexOf('.'))).ToList();
It may be possible with a regular expression if there's a good way to capture what your file names will look like. I'm assuming here it's always blah.txt with alphanumeric characters:
var matches = Regex.Matches(input, @"\b[a-zA-Z0-9]+\.txt\b");

How to highlight multiple selections?

For example, I have some text in ace-editor and a list of ranges (rows and columns) in the text where highlighting should happen. Like this (they're bolded):
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Nam cursus.
Morbi ut mi. Nullam enim leo, egestas id, condimentum at, laoreet mattis,
massa. Sed eleifend nonummy diam. Praesent mauris ante, elementum et,
bibendum at, posuere sit amet, nibh.
How to highlight these words by using ace-editor API?
How to highlight multiple lines?
Finally I've got the answer.
Highlight the word:
var Range = ace.require("ace/range").Range;
var range = new Range(rowStart, columnStart, rowEnd, columnEnd);
var marker = editor.getSession().addMarker(range,"ace_selected_word", "text");
Remove the highlighted word:
editor.getSession().removeMarker(marker);
Highlight the line:
editor.getSession().addMarker(range,"ace_active_line","background");
