Multiple line regex in ruby - ruby

I am trying to strip some repeated text out of my Kindle clippings that look like this:
The starting point,obviously,is a thorough analysis ofthe intellectual property portfolio,the contents ofwhich can be broadly divided into two categories:property that is in use and property that is not in use
==========
Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
- Highlight on Page 25 | Added on Friday, 25 November 11 10:53:36 Greenwich Mean Time
commentators (a euphemism for prolific writers with little experience
==========
Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
- Highlight on Page 26 | Added on Friday, 25 November 11 10:54:29 Greenwich Mean Time
I am trying to strip out everything between "Essentials" and "Time". The regexp I am playing with right now looks like this:
Essentials([^,]+)Time
But obviously it is not working:
http://rubular.com/r/gwSJFgOQai
Any help for this nuby would be massively appreciated!

You need the /m modifier which makes . match a newline:
/Essentials(.*?)Time/m
See it working here:
http://rubular.com/r/qgmkWnLzW6
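As a minimal sketch of how that pattern could be used to do the stripping, String#gsub can delete every match. The sample text is shortened for illustration, and the trailing \n? is an addition that is not part of the pattern above; it is there only so the removed block does not leave an empty line behind.

```ruby
# Shortened stand-in for the Kindle clippings from the question.
clippings = <<~TEXT
  first highlight
  ==========
  Essentials of Licensing (Author)
  - Highlight on Page 25 | Added on Friday Greenwich Mean Time
  second highlight
  ==========
  Essentials of Licensing (Author)
  - Highlight on Page 26 | Added on Friday Greenwich Mean Time
TEXT

# /m lets . match newlines; .*? is non-greedy, so each match stops at the
# first "Time" after "Essentials" instead of swallowing the whole file.
cleaned = clippings.gsub(/Essentials.*?Time\n?/m, "")

puts cleaned
```

Note that the non-greedy match ends at the first occurrence of "Time", so a book title or author name containing that word would cut the match short.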

Why don't you use this:
/Essentials(.*?)Time/m
Updated. Forgot the m for multiline.

Regex are powerful, but you'll find they also often add needless complexity to a problem.
This is how I'd go about the problem:
text = <<EOT
The starting point,obviously,is a thorough analysis ofthe intellectual property portfolio,the contents ofwhich can be broadly divided into two categories:property that is in use and property that is not in use
==========
Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
- Highlight on Page 25 | Added on Friday, 25 November 11 10:53:36 Greenwich Mean Time
commentators (a euphemism for prolific writers with little experience
==========
Essentials of Licensing Intellectual Property (Alexander I. Poltorak, Paul J. Lerner)
- Highlight on Page 26 | Added on Friday, 25 November 11 10:54:29 Greenwich Mean Time
EOT
text.each_line do |l|
  l.chomp!
  next if ((l =~ /\AEssentials/) .. (l =~ /Time\z/))
  puts l
end
Which outputs:
The starting point,obviously,is a thorough analysis ofthe intellectual property portfolio,the contents ofwhich can be broadly divided into two categories:property that is in use and property that is not in use
==========
commentators (a euphemism for prolific writers with little experience
==========
This works because the .., AKA range operator, gains new capability when used with an if, and turns into what we call the flip-flop operator. In operation what happens is ((l =~ /\AEssentials/) .. (l =~ /Time\z/)) returns false, until (l =~ /\AEssentials/) matches. From then until (l =~ /Time\z/) matches it returns true. Once the final regex matches it returns to returning false.
This behavior works really well for extracting sections from text.
If you are aggregating text, for subsequent output, replace the puts l with something to append l to a buffer, then output that buffer at the end of your run.
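A minimal sketch of that buffer variant, assuming you want the kept lines back as a single string. The method name strip_metadata and the shortened sample text are hypothetical, chosen only for illustration.

```ruby
# Collects every line outside an Essentials...Time block instead of printing it.
def strip_metadata(text)
  kept = []
  text.each_line do |line|
    line = line.chomp
    # Flip-flop: true from a line starting with "Essentials"
    # through the next line ending in "Time", inclusive.
    next if (line =~ /\AEssentials/)..(line =~ /Time\z/)
    kept << line
  end
  kept.join("\n")
end

sample = [
  "quote one",
  "==========",
  "Essentials of X (Author)",
  "- Highlight on Page 25 | Added on Friday Greenwich Mean Time",
  "quote two",
  "=========="
].join("\n")

result = strip_metadata(sample)
puts result
```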

Related

What is a simple way in Ruby to show the date and time?

I would like something in Ruby roughly equivalent to time.asctime() in Python:
import time
print(time.asctime())
outputs:
Sun Sep 11 10:12:48 2022
I'd like to avoid having to use strftime and having to remember or look up the formats. Also, ideally I'd like both day of the week (e.g., Sun) and the UTC timezone difference (e.g., -0400), but I'd settle for just day of the week.
puts Time.now.asctime
outputs:
Sun Sep 11 10:24:46 2022
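If the UTC offset matters too, one option that avoids strftime is Time#rfc2822 from the standard library's time extension. This is a sketch with a fixed timestamp (an assumption, so the output is reproducible); with Time.now the values will differ.

```ruby
require "time" # adds Time#rfc2822 to the core Time class

# Fixed moment with an explicit -04:00 offset, used only for illustration.
t = Time.new(2022, 9, 11, 10, 24, 46, "-04:00")

puts t.asctime  # Sun Sep 11 10:24:46 2022
puts t.rfc2822  # Sun, 11 Sep 2022 10:24:46 -0400
```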
Simple String Output
Ruby supports lots of Time, Date, and DateTime objects and output formats. While I think the first answer is closer to the output format you want, the following is potentially simpler and possibly sufficient for many needs when just considering standard output or standard error:
p Time.now.to_s
#=> "2022-09-11 14:10:57 -0400"
# using interpolation
p "Time: #{Time.now.to_s}"
#=> "Time: 2022-09-11 14:15:51 -0400"
Other Considerations
Note that if you want to use the results for any sort of comparison or calculation, you'll likely need to convert the result to one of the three object types described above. That's the main reason I mention them. Unless it's just printing to the screen, you should think about how you plan to use the result before deciding which of the objects will be most useful for you.
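As a small sketch of that conversion, assuming the string has the default Time#to_s shape shown earlier, Time.parse from the standard library turns it back into a Time object that supports comparison and arithmetic. The variable names are illustrative.

```ruby
require "time" # provides Time.parse

printed = "2022-09-11 14:10:57 -0400"

# Back to a Time object, keeping the offset that was in the string.
parsed = Time.parse(printed)
one_hour_later = parsed + 3600

puts one_hour_later > parsed   # true
puts one_hour_later - parsed   # 3600.0 (difference in seconds)
```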

Need an algorithm that detects diffs between two files for additions and reorders

I am trying to figure out if there are existing algorithms that can detect changes between two files in terms of additions but also reorders. I have an example below:
1 - User1 commit
processes = 1
a = 0
allactive = []
2 - User2 commit
processes = 2
a = 0
allrecords = range(10)
allactive = []
3 - User3 commit
a = 0
allrecords = range(10)
allactive = []
processes = 2
I need to be able to say that for example user1 code is the three initial lines of code, user 2 added the "allrecords = range(10)" part (as well as a number change), and user 3 did not change anything since he/she just reordered the code.
Ideally, at commit 3, I want to be able to look at the code and say that from character 0 to 20 (this is user1's code), 21-25 user2's code, 26-30 user1's code etc.
I know there are two popular algorithms, Longest common subsequence and longest common substring but I am not sure which one can correctly count additions of new code but be able also to identify reorders.
Of course this still leaves out the question of having the same substring existing twice in a text. Are there any other algorithms that are better suited to this problem?
Each "diff" algorithm defines a set of possible code-change edit types, and then (typically) tries to find the smallest set of such changes that explains how the new file resulted from the old. Usually such algorithms are defined purely syntactically; semantics are not taken into account.
So what you want, based on your example, is an algorithm that allows "change line", "insert line", "move line" (and presumably "delete line" [not in your example but necessary for a practical set of edits]). Given this you ought to be able to define a dynamic programming algorithm to find a smallest set of edits to explain how one file differs from another. Note that this set is defined in terms of edits to whole lines, rather like classical "diff"; of course classical diff does not have "change line" or "move line", which is why you are looking for something else.
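One simple way to approximate "move line" on top of a classical line diff can be sketched as follows: compute a longest-common-subsequence diff with dynamic programming, then treat any line that shows up as both deleted and inserted as moved. This is an illustrative heuristic, not a minimal-edit algorithm, and the function names are hypothetical. It does not model "change line", and it does not handle the same line appearing twice.

```ruby
# table[i][j] = length of the LCS of old_lines[i..] and new_lines[j..]
def lcs_table(old_lines, new_lines)
  table = Array.new(old_lines.size + 1) { Array.new(new_lines.size + 1, 0) }
  (old_lines.size - 1).downto(0) do |i|
    (new_lines.size - 1).downto(0) do |j|
      table[i][j] =
        if old_lines[i] == new_lines[j]
          table[i + 1][j + 1] + 1
        else
          [table[i + 1][j], table[i][j + 1]].max
        end
    end
  end
  table
end

def classify_changes(old_lines, new_lines)
  table = lcs_table(old_lines, new_lines)
  deleted = []
  inserted = []
  i = j = 0
  # Walk the table, emitting deletions and insertions around the LCS.
  while i < old_lines.size && j < new_lines.size
    if old_lines[i] == new_lines[j]
      i += 1
      j += 1
    elsif table[i + 1][j] >= table[i][j + 1]
      deleted << old_lines[i]
      i += 1
    else
      inserted << new_lines[j]
      j += 1
    end
  end
  deleted.concat(old_lines.drop(i))
  inserted.concat(new_lines.drop(j))

  # A line that was both deleted and inserted is reported as a move.
  moved = deleted & inserted
  { moved: moved, deleted: deleted - moved, inserted: inserted - moved }
end

commit2 = ["processes = 2", "a = 0", "allrecords = range(10)", "allactive = []"]
commit3 = ["a = 0", "allrecords = range(10)", "allactive = []", "processes = 2"]

changes = classify_changes(commit2, commit3)
p changes
```

For the commit 2 to commit 3 step in the question, this reports "processes = 2" as moved and nothing as inserted or deleted.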
You could pick different types of deltas. Your example explicitly noted "number change"; if narrowly interpreted, this is NOT an edit on lines, but rather within lines. Once you start to allow partial line edits, you need to define how much of a partial line edit is allowed ("unit of change"). (Will your edit set allow "change of digit"?)
Our Smart Differencer family of tools defines the set of edits over well-defined sub-phrases of the targeted language; we use formal language grammar (non)terminals as the unit of change. [This makes each member of the family specific to the grammar of some language] Deltas include programmer-centric concepts such as "replace phrase by phrase", "delete listmember", "move listmember", "copy listmember", "rename identifier"; the algorithm operates by computing a minimal tree difference in terms of these operations. To do this, the SmartDifferencer needs (and has) a full parser (producing ASTs) for the language.
You didn't identify the language for your example. But in general, for a language looking like that, the SmartDifferencer would typically report that User2 commit changes were:
Replaced (numeric literal) "1" in line 1 column 13 by "2"
Inserted (statement) "allrecords = range(10)" after line 2
and that User3 commit changes were:
Move (statement) at line 1 after line 4
If you know who contributed the original code, with the edits you can straightforwardly determine who contributed which part of the final answer. You have to decide the unit of reporting; e.g., whether you want to report such contributions on a line-by-line basis for easy readability, or whether you really want to track that Mary wrote the code, but Joe modified the number.
Detecting that User3's change is semantically null can't be done with a purely syntax-driven diff tool of any kind. To do this, the tool has to be able to compute the syntactic deltas somehow, and then compute the side effects of all statements (well, "phrases"), requiring a full static analyzer of the language to interpret the deltas to see if they have such null effects. Such a static analyzer requires a parser anyway, so it makes sense to do a tree-based differencer, but it also requires a lot more than just a parser [We have such language front ends and have considered building such tools, but haven't gotten there yet].
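As a rough sketch of line-by-line attribution, assuming the whole line is the unit of reporting and that a line is identified purely by its text, each line of the final commit can be credited to the first author who committed that exact text. The data layout and variable names are hypothetical. A changed line (such as the number change) is credited entirely to whoever changed it, and identical lines occurring twice are not distinguished.

```ruby
# Commits from the question, oldest first: [author, lines].
commits = [
  ["User1", ["processes = 1", "a = 0", "allactive = []"]],
  ["User2", ["processes = 2", "a = 0", "allrecords = range(10)", "allactive = []"]],
  ["User3", ["a = 0", "allrecords = range(10)", "allactive = []", "processes = 2"]]
]

# Remember who first introduced each distinct line of text.
first_author = {}
commits.each do |author, lines|
  lines.each { |line| first_author[line] ||= author }
end

# Attribute every line of the latest commit to its first author.
blame = commits.last[1].map { |line| [first_author[line], line] }

blame.each { |author, line| puts "#{author}: #{line}" }
```

Because User3 only reordered existing lines, no line of the final commit is attributed to User3.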
Bottom line: there is no simple algorithm for determining "that user3 did not change anything". There is reasonable hope that such tools can be built.

Rewriting sentences while retaining semantic meaning

Is it possible to use WordNet to rewrite a sentence so that the semantic meaning of the sentence stays the same (or mostly the same)?
Let's say I have this sentence:
Obama met with Putin last week.
Is it possible to use WordNet to rephrase the sentence into alternatives like:
Obama and Putin met the previous week.
Obama and Putin met each other a week ago.
If changing the sentence structure is not possible, can WordNet be used to replace only the relevant synonyms?
For example:
Obama met Putin the previous week.
If the question is whether WordNet can be used to paraphrase sentences: it is possible, but only together with substantial grammatical/syntactic components. You would need a system that can:
First get the individual semantics of the tokens and parse the sentence for its syntax.
Then understand the overall semantics of the composite sentence (especially if it's metaphorical)
Then rehash the sentence with some grammatical generator.
So far, the only tool I know of that can do something like that is the ACE parser/generator, but it takes a LOT of hacking to make it work as a paraphrase generator: http://sweaglesw.org/linguistics/ace/
So to answer your questions,
Is it possible to use WordNet to rephrase the sentence into alternatives? Sadly, WordNet isn't a silver bullet. You will need more than semantics for a paraphrase task.
If changing the sentence structure is not possible, can WordNet be used to replace only the relevant synonyms? Yes this is possible. BUT to figure out which synonym is replace-able is hard... And you would also need some morphology/syntax component.
First you will run into a problem of multiple senses per word:
from nltk.corpus import wordnet as wn
sent = "Obama met Putin the previous week"
for i in sent.split():
    possible_senses = wn.synsets(i)
    print(i, len(possible_senses), possible_senses)
[out]:
Obama 0 []
met 13 [Synset('meet.v.01'), Synset('meet.v.02'), Synset('converge.v.01'), Synset('meet.v.04'), Synset('meet.v.05'), Synset('meet.v.06'), Synset('meet.v.07'), Synset('meet.v.08'), Synset('meet.v.09'), Synset('meet.v.10'), Synset('meet.v.11'), Synset('suffer.v.10'), Synset('touch.v.05')]
Putin 1 [Synset('putin.n.01')]
the 0 []
previous 3 [Synset('previous.s.01'), Synset('former.s.03'), Synset('previous.s.03')]
week 3 [Synset('week.n.01'), Synset('workweek.n.01'), Synset('week.n.03')]
Then even if you know the sense (let's say the first sense), you get multiple words per sense and not every word can be replaced in the sentence. Moreover, they are in the lemma form not a surface form (e.g. verbs are in their base form (simple present tense) and nouns are in singular):
from nltk.corpus import wordnet as wn
sent = "Obama met Putin the previous week"
for i in sent.split():
    possible_senses = wn.synsets(i)
    if possible_senses:
        # lemma_names is a method in current NLTK releases
        print(i, possible_senses[0].lemma_names())
    else:
        print(i)
[out]:
Obama
met ['meet', 'run_into', 'encounter', 'run_across', 'come_across', 'see']
Putin ['Putin', 'Vladimir_Putin', 'Vladimir_Vladimirovich_Putin']
the
previous ['previous', 'old']
week ['week', 'hebdomad']
Another approach is grammatical analysis with NLTK: parse the sentence, then convert it into the active or passive voice.

How can I extract the meaning of a paragraph?

I need to develop a method that extracts the meaning from a string for a record in a database. Here is an example of such a string:
MyString = "Purse $75,000. (up To $14,250 Nysbfoa) For Maidens, Fillies And Mares Three Years Old And Upward. Three Year Olds, 118 Lbs.; Older, 123 Lbs. One And One Eighth Miles. (Inner turf)"
Given the string, I need to process it in such a way that I can create a race_record:
race_record[:purse] = 75000
race_record[:race_type] = "Maidens"
race_record[:sex] = "Fillies And Mares"
race_record[:age] = "Three Years Old And Upward"
race_record[:distance] = "One And One Eighth Miles"
race_record[:surface] = "inner turf"
I was planning to use Ruby and a series of regular expressions to extract the data. For example:
race_record[:purse] = MyString.scan(/(?<=Purse\s[$])(.*?)(?=\.)/)
race_record[:race_type] = MyString.sub(....)
etc.
My question isn't so much what the correct regular expressions are. Given the objective, is the approach I proposed the right way to go, or is there a better approach or even a gem that can do the heavy lifting?
You could use one regex to extract all the relevant parts into capturing groups at once:
regexp =
/Purse\s\$ # Leading text
([\d,]+) # Group 1
.*?For\s # Intervening text
(\w+) # Group 2
,\s # Intervening text
(\w+\sAnd\s\w+) # Group 3, etc. etc.
\s
([^.]*)
\.[^;]*;[^.]*\.\s
([^.]*)
\.\s\(
([^()]*)
\)/x
Then you can do
irb(main):025:0> match = regexp.match(mystring)
=> #<MatchData "Purse $75,000. (up To $14,250 Nysbfoa) For Maidens, Fillies And Mares Three Years Old And Upward. Three Year Olds, 118 Lbs.; Older, 123 Lbs. One And One Eighth Miles. (Inner turf)"
1:"75,000" 2:"Maidens" 3:"Fillies And Mares" 4:"Three Years Old And Upward"
5:"One And One Eighth Miles" 6:"Inner turf">
irb(main):026:0> match[1]
=> "75,000"
irb(main):027:0> match[2]
=> "Maidens"
...etc.
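A possible variation on the same idea, sketched here with named capture groups so the match can be turned into the race_record hash from the question directly. The group names and the constant RACE_PATTERN are illustrative choices, and the pattern assumes every input follows exactly the layout of the sample string.

```ruby
text = "Purse $75,000. (up To $14,250 Nysbfoa) For Maidens, Fillies And Mares " \
       "Three Years Old And Upward. Three Year Olds, 118 Lbs.; Older, 123 Lbs. " \
       "One And One Eighth Miles. (Inner turf)"

# Same structure as the pattern above, with a name on each group.
RACE_PATTERN = /
  Purse\s\$(?<purse>[\d,]+)       # purse amount, commas included
  .*?For\s(?<race_type>\w+),\s    # race type
  (?<sex>\w+\sAnd\s\w+)\s         # sex restriction
  (?<age>[^.]*)                   # age restriction, up to the next period
  \.[^;]*;[^.]*\.\s               # weight clauses, skipped
  (?<distance>[^.]*)              # distance
  \.\s\((?<surface>[^()]*)\)      # surface, inside parentheses
/x

match = RACE_PATTERN.match(text)
raise ArgumentError, "unrecognised race description" unless match

# Symbol keys, and the purse converted to an integer.
race_record = match.named_captures.transform_keys(&:to_sym)
race_record[:purse] = race_record[:purse].delete(",").to_i

p race_record
```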
If your input is fairly structured, i.e. it has a specific and known grammar, you could build a 'parser' to parse the grammar.
In the old days, we'd do this with yacc and lex, two old unix tools used to build compilers. Yacc and Lex have Ruby implementations. While the original intent was to output lower level code (such as machine assembly codes when building a real compiler), there is nothing that prevents you from calling any ruby code when a specific grammatical construct has been recognized by your parser.
NOTE: even though there is a Yacc/lex Ruby gem out there, I wouldn't say it will 'DO THE HEAVY LIFTING'; yacc and lex come with a small learning curve. Still, using something like yacc/lex would make your life easier in the long run, especially if you have a large grammar and must constantly adjust it.

Is there a programmatic way to get short day names in Windows?

Is there a way to get a 2 character day-name of the week such as MO/TU/WE/TH/FR/SA/SU?
Currently I only know of using FormatDateTime():
"ddd" returns "Fri"
"dddd" returns "Friday"
The main reason is that I want to obtain localized version of the 1 or 2 character day names:
Say FRIDAY in "ddd" would return:
French Windows = "Vendredi", the 2 char would be "VE", note it's the 1st and 2nd char.
Chinese Windows = "星期五", the char would be "五", note it's the 3rd char.
Japanese Windows = "金曜日", the char would be "金", note it's the 1st char.
Edit1:
Currently using Delphi, but I think this applies to other languages too.
Edit2:
Simply put, I'm looking to obtain the shorter version of "ShortDayName" through the use of some functions or constants, so that I don't have to build a table of constants containing the 7 day "Shorter" day names for every possible windows language.
I wonder if such functions really exist.
Maybe the calendar 1 or 2 char day names in Outlook are hard-coded themselves, right?
You can get the local names for the days of the week with ShortDayNames and LongDayNames, and you can use DayOfWeek to get the numeric value for the day.
ShortDayNames[Index]; //Returns Fri
or
LongDayNames[Index]; //Returns Friday
The only way I know to shorten them to two chars would be to trim the resulting string
LeftStr(LongDayNames[Index],2);//Returns Fr
So today's Day would be
LeftStr(LongDayNames[DayOfWeek(date)],2); //Returns Fr
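For readers outside Delphi, the same truncation idea can be sketched in Ruby. This is only an analogue of LeftStr and not the Delphi API: Date::ABBR_DAYNAMES is English-only, and the date is a fixed Friday chosen for illustration. The last lines show why plain truncation does not give the localized result the question asks for.

```ruby
require "date"

friday = Date.new(2011, 11, 25) # a Friday

# Take the first two characters of the abbreviated English name.
two_letter = Date::ABBR_DAYNAMES[friday.wday][0, 2].upcase
puts two_letter # FR

# Truncation is not locale-aware: for the Chinese name the
# distinguishing character is the third one, not the first two.
chinese_prefix = "星期五"[0, 2]
puts chinese_prefix # 星期
```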
The documentation on custom date formatting describes the standard format strings.
You may also use the 'ddd' format and trim the result.
Delphi's routines do nothing special: they just ask the OS.
Here is how to do it: Retrieving Time and Date Information. I looked through the MSDN docs and found this.
Note that there is really no such thing as a "2 character day-name" or "3 character day-name" here. There are native ("long" in Delphi), abbreviated ("short" in Delphi), and short (Vista and above, not present in Delphi) formats.
For example, abbreviated name of the day of the week for Monday: Mon (3 chars, en-US), Пн (2 chars, ru-RU).
So you are probably looking for the LOCALE_SSHORTESTDAYNAMEX format (which is called "short" by MSDN and doesn't appear in Delphi), but it is available only on Vista and above.
For example, the following code:
const
  LOCALE_SSHORTESTDAYNAME1 = $60;
procedure TForm1.Button1Click(Sender: TObject);
begin
  SetThreadLocale($409);
  ShowMessage(
    GetLocaleStr(GetThreadLocale, LOCALE_SSHORTESTDAYNAME1, '') + #13#10 +
    GetLocaleStr(GetThreadLocale, LOCALE_SABBREVDAYNAME1, '')
  );
end;
will show you:
Mo
Mon
But doing this for Russian will output:
Пн
Пн
Hope my edits make answer more clear ;)
