Ruby ARGF & RegEx: How to split on paragraph carriage return "\r\n" but not end of line "\r\n" - ruby

I am trying to pre-process some text using regex in ruby to input into a mapper job and would like to split on the carriage return denoting the paragraph.
The text will come into the mapper via ARGF.each as part of a Hadoop streaming job:
"\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
"daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
"1789\"\r\n"
"\r\n" # <----- this is where I would like to split
"Precisely such had the paragraph originally stood from the printer's\r\n"
Once I have done this, I will chomp the newline/carriage return off each line.
This will look something like this:
ARGF.each do |text|
paragraph = text.split(INSERT_REGEX_HERE)
#some more blah will happen beyond here
end
UPDATE:
The desired output then is an array as follows:
[
[0] "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
"daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
"1789\"\r\n"
[1] "Precisely such had the paragraph originally stood from the printer's\r\n"
]
Ultimately what I want is the following array with no carriage returns within the array:
[
[0] "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,"
"daughter of James Stevenson, Esq. of South Park, in the county of"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,"
"1789\""
[1] "Precisely such had the paragraph originally stood from the printer's"
]
Thanks in advance for any insights.

Beware: with ARGF.each do |text|, text is a single line on each iteration, NOT the whole text block.
You can pass ARGF.each a custom line separator. With "\r\n\r\n" it yields two "lines", which are the two paragraphs in your case.
Try this:
paragraphs = ARGF.each("\r\n\r\n").map{|p| p.gsub("\r\n","")}
First, split the input into paragraphs, then use gsub to remove the unwanted line breaks (substitute " " instead of "" if the joined lines should stay separated by a space).
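If the goal is one array of chomped lines per paragraph, the same separator idea can be sketched as follows. This is only a sketch: paragraphs_from is a hypothetical helper name, StringIO stands in for ARGF so the example is self-contained, and it assumes paragraphs are separated by exactly one blank "\r\n" line.

```ruby
require 'stringio'

# Hypothetical helper: reads paragraphs separated by a blank "\r\n" line
# and returns one array of lines (without line endings) per paragraph.
def paragraphs_from(io)
  io.each_line("\r\n\r\n").map do |paragraph|
    # split drops the "\r\n" endings; reject guards against stray blank lines
    paragraph.split("\r\n").reject(&:empty?)
  end
end

# StringIO stands in for ARGF here; in the mapper, pass ARGF instead.
sample = StringIO.new("first line,\r\nsecond line\r\n\r\nnext paragraph\r\n")
result = paragraphs_from(sample)
p result # => [["first line,", "second line"], ["next paragraph"]]
```

Calling join(" ") on each inner array would give one string per paragraph instead.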

To split the text use:
result = text.gsub(/(?<!\")\\r\\n|(?<=\\\")\\r\\n/, '').split(/[\r\n]+\"\\r\\n\".*?[\r\n]+/)

Related

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers

from os import listdir
from os.path import isfile, join
from datasets import load_dataset
from transformers import BertTokenizer
test_files = [join('./test/', f) for f in listdir('./test') if isfile(join('./test', f))]
dataset = load_dataset('json', data_files={"test": test_files}, cache_dir="./.cache_dir")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
def encode(batch):
    return tokenizer.encode_plus(batch["abstract"], max_length=32, add_special_tokens=True, pad_to_max_length=True,
                                 return_attention_mask=True, return_token_type_ids=False, return_tensors="pt")
dataset.set_transform(encode)
When I run this code, I have
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
Instead of having a list of strings, I have a list of lists of strings. Here is the content of batch["article"]:
[['eleven politicians from 7 parties made comments in letter to a newspaper .', "said dpp alison saunders had ` damaged public confidence ' in justice .", 'ms saunders ruled lord janner unfit to stand trial over child abuse claims .', 'the cps has pursued at least 19 suspected paedophiles with dementia .'], ['an increasing number of surveys claim to reveal what makes us happiest .', 'but are these generic lists really of any use to us ?', 'janet street-porter makes her own list - of things making her unhappy !'], ["author of ` into the wild ' spoke to five rape victims in missoula , montana .", "` missoula : rape and the justice system in a college town ' was released april 21 .", "three of five victims profiled in the book sat down with abc 's nightline wednesday night .", 'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football players .', "huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .", 'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable cause .', 'mr krakauer wrote book after realizing close friend was a rape victim .'], ['tesco announced a record annual loss of £ 6.38 billion yesterday .', 'drop in sales , one-off costs and pensions blamed for financial loss .', 'supermarket giant now under pressure to close 200 stores nationwide .', 'here , retail industry veterans , plus mail writers , identify what went wrong .'], ..., ['snp leader said alex salmond did not field questions over his family .', "said she was not ` moaning ' but also attacked criticism of women 's looks .", 'she made the remarks in latest programme profiling the main party leaders .', 'ms sturgeon also revealed her tv habits and recent image makeover .', 'she said she relaxed by eating steak and chips on a saturday night .']]
How could I fix this issue?

Saving text file info into clojure data structure [duplicate]

I have the following data in a .txt file:
1|John Smith|123 Here Street|456-4567
2|Sue Jones|43 Rose Court Street|345-7867
3|Fan Yuhong|165 Happy Lane|345-4533
I get the data and convert it to a vector using the following code:
(def custContents (slurp "cust.txt"))
(def custVector (clojure.string/split custContents #"\||\n"))
(def testing (into [] (partition 4 custVector )))
Which gives me the following vector:
[(1 John Smith 123 Here Street 456-4567) (2 Sue Jones 43 Rose Court Street
345-7867) (3 Fan Yuhong 165 Happy Lane 345-4533)]
I would like to convert it into a vector of vectors like this:
[[1 John Smith 123 Here Street 456-4567] [2 Sue Jones 43 Rose Court Street
345-7867] [3 Fan Yuhong 165 Happy Lane 345-4533]]
I would do it slightly differently, so you first break it up into lines, then process each line. It also makes the regex simpler:
(ns tst.demo.core
(:require
[clojure.string :as str] ))
(def data
"1|John Smith|123 Here Street|456-4567
2|Sue Jones|43 Rose Court Street|345-7867
3|Fan Yuhong|165 Happy Lane|345-4533")
(let [lines (str/split-lines data)
line-vecs-1 (mapv #(str/split % #"\|" ) lines)
line-vecs-2 (mapv #(str/split % #"[|]") lines)]
...)
with result:
lines => ["1|John Smith|123 Here Street|456-4567"
"2|Sue Jones|43 Rose Court Street|345-7867"
"3|Fan Yuhong|165 Happy Lane|345-4533"]
line-vecs-1 =>
[["1" "John Smith" "123 Here Street" "456-4567"]
["2" "Sue Jones" "43 Rose Court Street" "345-7867"]
["3" "Fan Yuhong" "165 Happy Lane" "345-4533"]]
line-vecs-2 =>
[["1" "John Smith" "123 Here Street" "456-4567"]
["2" "Sue Jones" "43 Rose Court Street" "345-7867"]
["3" "Fan Yuhong" "165 Happy Lane" "345-4533"]]
Note that there are 2 ways of writing the regex. line-vecs-1 shows a regex where the pipe character is escaped. Since regex escaping varies between platforms (e.g. in a Java string one would need "\\|"), line-vecs-2 uses a character class containing a single character (the pipe), which sidesteps the need for escaping the pipe.
Update
Other Clojure Learning Resources:
Brave Clojure
Clojure CheatSheet
ClojureDocs.org
Clojure-Doc.org (similar but different)
> (mapv vec testing)
=> [["1" "John Smith" "123 Here Street" "456-4567"]
["2" "Sue Jones" "43 Rose Court Street" "345-7867"]
["3" "Fan Yuhong" "165 Happy Lane" "345-4533"]]

No padding for hours in strftime

When using strftime #tzformat = "%F,%l:00 %p":
I want exactly one space between the comma and the hour. But %l gives no space for 10, 11 and 12, whereas if I put “ %l” I get two spaces for 1-9 (one from the padding and another from the space I add).
Month has no-padding option. I don’t see the same for hour.
What am I missing?
The - modifier removes padding. If you use %-l instead of %l it will not put a space at all, and you can manually add a space.
tzformat = "%F, %-l:00 %p"
Time.now.strftime(tzformat) #=> "2015-01-29, 8:00 PM"
(Time.now + 3600*2).strftime(tzformat) #=> "2015-01-29, 10:00 PM"
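To make the difference reproducible, here is a small sketch with fixed times instead of Time.now. The variable names are illustrative only.

```ruby
evening = Time.new(2015, 1, 29, 20, 0, 0) # 8 PM, a single-digit hour
late    = Time.new(2015, 1, 29, 22, 0, 0) # 10 PM, a two-digit hour

# %l blank-pads single-digit hours, so the spacing differs between the two
padded_single = evening.strftime("%F,%l:00 %p") # => "2015-01-29, 8:00 PM"
padded_double = late.strftime("%F,%l:00 %p")    # => "2015-01-29,10:00 PM"

# %-l drops the padding, so a literal space gives exactly one space in both cases
fixed_single = evening.strftime("%F, %-l:00 %p") # => "2015-01-29, 8:00 PM"
fixed_double = late.strftime("%F, %-l:00 %p")    # => "2015-01-29, 10:00 PM"
```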

How to find out the date of second Monday of each month of given year?

My customer has an event each second Monday of each month.
I need to mark them with red in calendar.
How do I "cleanly" find the dates of those Mondays?
Here's my version.
If the eighth of the month is a Monday, then it is the second Monday. If it is not a Monday, then how many days until the next Monday?
require 'date'
oct_2012 = Date.new 2012, 10, 8
oct_2012.wday # => 1, We're done!
nov_2012 = Date.new 2012, 11, 8
nov_2012.wday # => 4
nov_2012 + (8 - nov_2012.wday) % 7 # => 2012-11-12 (the % 7 keeps a Sunday the 8th from jumping to the 16th)
Does that help?
Edit
Easier version: start from the first of the month, step forward to the first Monday, then add a week. This works even if the month starts on a Monday.
oct_2012 = Date.new 2012, 10, 1
oct_2012 + (8 - oct_2012.wday) % 7 + 7 # => 2012-10-08
nov_2012 = Date.new 2012, 11, 1
nov_2012 + (8 - nov_2012.wday) % 7 + 7 # => 2012-11-12
One rule and done!
Your second Monday will always fall between the 8th and the 14th of each month.
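For the original question (every second Monday of a given year), the rule can be wrapped up roughly like this. The method names second_monday and second_mondays are made up for illustration. The sketch steps from the 1st to the first Monday and then adds a week.

```ruby
require 'date'

# Hypothetical helper: the second Monday of one month.
def second_monday(year, month)
  first = Date.new(year, month, 1)
  # (8 - wday) % 7 is the number of days from the 1st to the first Monday
  first + (8 - first.wday) % 7 + 7
end

# Hypothetical helper: the twelve second Mondays of a year.
def second_mondays(year)
  (1..12).map { |month| second_monday(year, month) }
end

dates_2012 = second_mondays(2012)
puts dates_2012.first # prints 2012-01-09
```

Every returned date lands between the 8th and the 14th, which makes a handy sanity check.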

Business date/holiday handling

I've posted this question for C# but I may be working in Ruby instead. So I'm asking the same question about Ruby:
I'm looking for a Ruby class/library/module that works similarly to the Perl module Date::Manip as far as business/holiday dates. Using that module in Perl, I can pass it a date and find out whether it's a business day (ie, Mon-Fri) or a holiday. Holidays are very simple to define in a config file (see Date::Manip::Holidays). You can enter a 'fixed' date that applies to every year like:
12/25 = Christmas
or 'dynamic' dates for every year like:
last Monday in May = Memorial Day
or 'fixed' dates for a given year like:
5/22/2010 = Bob's Wedding
You can also pass in a date and get back the next/previous business day (which is any day that's not a weekend and not a holiday).
Does anyone know of anything like that in the Ruby world?
You may use the holidays-gem.
http://rubygems.org/gems/holidays
Some national (and regional) holidays are already predefined, you may define your own holiday definitions.
The business_time gem should do what you need.
The example at bottom of the README doc is a good starting example:
require 'rubygems'
require 'active_support'
require 'business_time'
# We can adjust the start and end time of our business hours
BusinessTime::Config.beginning_of_workday = "8:30 am"
BusinessTime::Config.end_of_workday = "5:30 pm"
# and we can add holidays that don't count as business days
# July 5 in 2010 is a monday that the U.S. takes off because
# our independence day falls on that Sunday.
three_day_weekend = Date.parse("July 5th, 2010")
BusinessTime::Config.holidays << three_day_weekend
friday_afternoon = Time.parse("July 2nd, 2010, 4:50 pm")
tuesday_morning = 1.business_hour.after(friday_afternoon)
You're probably going to need the chronic gem to help you build the holiday dates from your config file. However, YMMV, because your example "last Monday in May" doesn't work in chronic. A workaround is to do something like this:
# last monday in May (2010)
Chronic.parse('last monday', :now => Time.parse('2010-06-01'))
And look at the tickle gem which works on top of chronic for a way to add recurring events.
/I3az/
You could take a look at my Workpattern gem. It allows you to specify working and resting times. It was aimed at producing a "Calendar" like is used in planning tools such as Microsoft Project and Primavera P6, so you can specify right down to the minute.
Here is a simple example:
Create a new Workpattern:
mywp=Workpattern.new('My Workpattern',2011,10)
This is for 10 years from 2011, but you can make it longer or shorter.
Tell it you want the Weekends to be resting and that you also want to rest during the week so you work between 9 and 12 in the morning and 1 and 6 in the afternoon.
mywp.resting(:days => :weekend)
mywp.resting(:days =>:weekday, :from_time=>Workpattern.clock(0,0),:to_time=>Workpattern.clock(8,59))
mywp.resting(:days =>:weekday, :from_time=>Workpattern.clock(12,0),:to_time=>Workpattern.clock(12,59))
mywp.resting(:days =>:weekday, :from_time=>Workpattern.clock(18,0),:to_time=>Workpattern.clock(23,59))
Now just calculate using minutes
mydate=DateTime.civil(2011,9,1,9,0)
result_date = mywp.calc(mydate,1920) # => 6/9/11#18:00
1920 is 4 days * 8 hours a day * 60 minutes an hour.
I wrote the gem to learn Ruby, and have only scratched the surface.
Check out the biz gem.
Here's an example configuration:
require 'biz'
Biz.configure do |config|
config.hours = {
mon: {'09:00' => '17:00'},
tue: {'00:00' => '24:00'},
wed: {'09:00' => '17:00'},
thu: {'09:00' => '12:00', '13:00' => '17:00'},
sat: {'10:00' => '14:00'}
}
config.holidays = [Date.new(2014, 1, 1), Date.new(2014, 12, 25)]
config.time_zone = 'America/Los_Angeles'
end
When you use the optional core extensions, it's as easy as the following to find out if a date is a business day:
require 'biz/core_ext'
Date.new(2014, 12, 25).business_day? # => false
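For simple cases, the definition from the question (a business day is any day that is neither a weekend nor a holiday) can also be sketched with only the standard library. Everything below is a hypothetical illustration rather than any gem's API: holidays_for, business_day? and next_business_day are made-up names, and the holiday list just mirrors the three kinds of entries from the question.

```ruby
require 'date'

# Hypothetical helper: holidays for one year, one entry per kind in the question.
def holidays_for(year)
  last_of_may = Date.new(year, 5, -1)                      # day -1 means the last day of the month
  memorial_day = last_of_may - (last_of_may.wday - 1) % 7  # 'dynamic': last Monday in May
  days = [Date.new(year, 12, 25), memorial_day]            # 'fixed' every year: Christmas
  days << Date.new(2010, 5, 22) if year == 2010            # 'fixed' for one year: Bob's Wedding
  days
end

# A business day is Mon-Fri and not a holiday.
def business_day?(date)
  !(date.saturday? || date.sunday?) && !holidays_for(date.year).include?(date)
end

# Walk forward one day at a time until a business day is found.
def next_business_day(date)
  candidate = date + 1
  candidate += 1 until business_day?(candidate)
  candidate
end

puts business_day?(Date.new(2014, 12, 25))     # prints false
puts next_business_day(Date.new(2010, 5, 28))  # prints 2010-06-01
```

The second call skips the weekend and Memorial Day 2010 (May 31). A previous-business-day helper would walk backwards the same way.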
