Ruby and RegExp

Sorry if this has already been asked.
I have about 1 million text documents contained in psql
I am trying to see if they contain certain words, for example cancer, died, heart_attack, etc. The list of words is also quite long.
A document only needs to contain one of the words.
If a document contains one of the words, I then copy it to a different folder.
My current code is:
directory = "disease" #Creates a directory called heart attacks
FileUtils.mkpath(directory) # Makes the directory if it doesn't exists
cancer = Eightk.where("text ilike '%cancer%'")
died = Eightk.where("text ilike '%died%'")
cancer.each do |filing| #filing can be used instead of eightks
filename = "#{directory}/#{filing.doc_id}.html"
File.open(filename,"w").puts filing.text
puts "Storing #{filing.doc_id}..."
died.each do |filing| #filing can be used instead of eightks
filename = "#{directory}/#{filing.doc_id}.html"
File.open(filename,"w").puts filing.text
puts "Storing #{filing.doc_id}..."
end
end
But this is not working, for the following reasons:
It doesn't match the exact word.
It is very time consuming, since it involves copying the same code and changing just one word.
So I have tried using Regexp.union as follows, but am a bit lost:
directory = "disease" # Folder for filings that match a disease word
FileUtils.mkpath(directory) # Makes the directory if it doesn't exist
keywords = [/dead/, /killed/, /cancer/]
re = Regexp.union(keywords)
So I am trying to search the text files for these keywords and then copy the text documents.
Any help is really appreciated.

Since you said:
I have about 1 million text documents contained in psql
and use "iLike" text search operator to search words in those documents.
IMHO, that is an inefficient implementation because your data is huge, your query will process all 1 million text documents for every search and it will be very slow.
Before moving forward, I think you should take a look at PG Full Text Searching first. (if you simply want to use built-in full text search in PG) or you could also take a look at some other products like elasticsearch, solr etc. that are dedicated to text search problem.
Regarding PG full text search, in Ruby, you could use pg_serach gem. Though, if you use Rails, I wrote a post about simple full text search implementaion with PG in Rails.
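For instance, a minimal pg_search setup might look like this (a sketch only; it assumes the pg_search gem and the Eightk model from your question):

require "pg_search"

class Eightk < ActiveRecord::Base
  include PgSearch::Model
  # Defines Eightk.search_text, backed by Postgres tsvector matching
  # with English stemming, instead of a sequential ILIKE scan.
  pg_search_scope :search_text,
                  against: :text,
                  using: { tsearch: { dictionary: "english" } }
end

Eightk.search_text("cancer")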
I hope you may find this useful.
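And if you do stay with plain Ruby matching for now, here is a minimal sketch of the Regexp.union approach from your question (it assumes the Eightk model above; the \b anchors give whole-word matches, which fixes the exact-word problem, and a single keyword list removes the copied code):

directory = "disease"
FileUtils.mkpath(directory)
keywords = %w[dead killed cancer]
# Escape each word and anchor it so only whole words match.
re = Regexp.union(keywords.map { |w| /\b#{Regexp.escape(w)}\b/i })
Eightk.find_each do |filing| # loads rows in batches rather than all at once
  next unless filing.text =~ re
  File.open("#{directory}/#{filing.doc_id}.html", "w") { |f| f.puts filing.text }
  puts "Storing #{filing.doc_id}..."
end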

Related

Finding and Editing Multiple Regex Matches on the Same Line

I want to add markdown to key phrases in a (gollum) wiki page that will link to the relevant wiki page in the form:
This is the key phrase.
Becomes
This is the [[key phrase|Glossary#key phrase]].
I have a list of key phrases such as:
keywords = ["golden retriever", "pomeranian", "cat"]
And a document:
Sue has 1 golden retriever. John has two cats.
Jennifer has one pomeranian. Joe has three pomeranians.
I want to iterate over every line and find every match (that isn't already a link) for each keyword. My current attempt looks like this:
File.foreach(target_file) do |line|
  keywords.each do |gloss|
    len = gloss.length
    # Create the regex. Avoid anything that starts with [
    # or (, ends with ] or ), and ignore case.
    re = /(?<![\[\(])#{gloss}(?![\]\)])/i
    # Find every instance of this gloss on this line.
    positions = line.enum_for(:scan, re).map { Regexp.last_match.begin(0) }
    positions.each do |pos|
      line.insert(pos, "[[")
      # +2 because we just inserted 2 ahead.
      line.insert(pos + len + 2, "|#{page}\##{gloss}]]") # page is the target wiki page, e.g. "Glossary"
    end
  end
  puts line
end
However, this will run into a problem if there are two matches for the same key phrase on the same line. Because I insert things into the line, the position I found for each match isn't accurate after the first one. I know I could adjust for the size of my insertions every time but, because my insertions are a different size for each gloss, it seems like the most brute-force, hacky solution.
Is there a solution that allows me to make multiple insertions on the same line at the same time without several arbitrary adjustments each time?
After looking at @BryceDrew's online Python version, I realized Ruby probably also has a way to fill in the match. I now have a much more concise and faster solution.
First, I needed to make regexes of my glosses:
glosses = keywords.map { |gloss| /(?<![\[\(])#{gloss}(?![\]\)])/i }
Note: The majority of that regex is look-ahead and look-behind assertions to prevent catching a phrase that's already part of a link.
Then, I needed to make a union of all of them:
re = Regexp.union(glosses)
After that, it's as simple as doing gsub on every line, and filling in my matches:
File.foreach(target_file) do |line|
  line = line.gsub(re) { |match| "[[#{match}|Glossary##{match.downcase}]]" }
  puts line
end
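Put together against the sample document from the question, the whole thing is only a few lines (a self-contained sketch; the file loop is replaced by a string purely for illustration):

keywords = ["golden retriever", "pomeranian", "cat"]
glosses = keywords.map { |k| /(?<![\[\(])#{k}(?![\]\)])/i }
re = Regexp.union(glosses)
doc = "Sue has 1 golden retriever. John has two cats.\n" \
      "Jennifer has one pomeranian. Joe has three pomeranians.\n"
doc.each_line do |line|
  puts line.gsub(re) { |match| "[[#{match}|Glossary##{match.downcase}]]" }
end
# Sue has 1 [[golden retriever|Glossary#golden retriever]]. John has two [[cat|Glossary#cat]]s.
# Jennifer has one [[pomeranian|Glossary#pomeranian]]. Joe has three [[pomeranian|Glossary#pomeranian]]s.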

Stemming and partial search using MongoDB 2.4

What is the correct way of doing full text search and partial searches in MongoDB?
E.g. the norwegian word "sokk" (sock).
When searching for "sokk" I want to match "sokker" (plural of sock), "sokk" and "sokkepose".
A search for "sokker" should match "sokk" and "sokker".
I get the desired result by using this Ruby snippet:
def self.search(q)
  result = []
  # Full text search first
  result << Ad.text_search(q).to_a
  # Then search for parts of the word
  result << Ad.any_of({ title: /.*#{q}.*/i }, { description: /.*#{q}.*/i }).to_a
  result.flatten!
  result.uniq
end
Any suggestions? :)
Cheers,
Martin Stabenfeldt
Martin,
A few suggestions / recommendations / corrections:
Full Text Search in 2.4 is not production ready and should not be deployed in production without knowing the tradeoffs being made. You can find more details at - http://docs.mongodb.org/manual/tutorial/enable-text-search/
For Text Search to work, you need to provide the appropriate language for the document while adding it (or for specific fields in 2.6). This ensures the words are appropriately stemmed and stop words are removed when indexing that field.
Specify the language while searching a specific field as well, so that the query is appropriately stemmed, stop words are removed, and the results are ranked appropriately. You can find more details about both indexing and searching at http://docs.mongodb.org/manual/reference/command/text/ . You can also see the languages supported by MongoDB FTS on that page.
Ideally you would not be using regular expressions while doing a full text search, but rather specify the words / strings that you are looking for along with the language.
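For illustration, running the 2.4 text command with an explicit language from Ruby might look like this (a sketch assuming a Mongoid 3 / Moped setup; the Ad model is from the question):

# Run MongoDB 2.4's "text" command with Norwegian stemming,
# so "sokker" and "sokk" reduce to the same stem.
results = Ad.mongo_session.command(
  text: Ad.collection_name.to_s,
  search: "sokker",
  language: "norwegian"
)["results"]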

Include both single and multiple text strings with Regex (in Ruby)?

So I have this problem where I am supposed to list every country in an Excel list using Open-URI. Everything is working properly, but I can't seem to figure out how to get my regexp to match single-word countries (like "Sweden") as well as countries like "South Africa" that contain whitespace. I hope I've made myself reasonably clear; below are the relevant pieces of code.
The text I want to match is the following (for example):
Wallis and Futuna
Yemen
I am currently stuck with this Regexp:
/a.+="\w{2}.html">(\w*)<.+{1}/
As you see, there is no problem with matching 'Yemen'.
Though I still want the code to be able to match both "Wallis and Futuna" and "Yemen".
Perhaps there is a way to include everything inside the given ">blabla bla<"?
Any thoughts? I would be very grateful!
It is generally a bad idea to use regular expressions for HTML extraction; use a real parser like Nokogiri instead:
require 'nokogiri'

parser = Nokogiri::HTML.parse(your_html)
country_links = parser.css("a")
country_links.each { |link| puts link['href']; puts link.text }
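Since your_html is a placeholder, here is a quick self-contained check (the href values are made up to match the pattern in your regexp):

require 'nokogiri'

your_html = '<a href="wf.html">Wallis and Futuna</a><a href="ye.html">Yemen</a>'
Nokogiri::HTML.parse(your_html).css("a").each do |link|
  puts "#{link['href']}: #{link.text}"
end
# wf.html: Wallis and Futuna
# ye.html: Yemen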
For your test sample, this regexp works (note the escaped dot, so it matches a literal "."):
/<a[^>]+href="\w{2}\.html">([\w\s]+)<\/a>/
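In Ruby you could apply it with String#scan (same made-up hrefs as above):

html = '<a href="wf.html">Wallis and Futuna</a><a href="ye.html">Yemen</a>'
html.scan(/<a[^>]+href="\w{2}\.html">([\w\s]+)<\/a>/).flatten
# => ["Wallis and Futuna", "Yemen"]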

Matching keywords with sentence database, how to avoid duplicated keywords in results?

I'm very new to programming and am a beginner in Ruby. I've done a lot of searching to try to find the answers I need, but nothing seems to match what I'm looking for.
I need to make a program for work that will:
Get keywords from the user
Match those keywords with the same keywords in a database of sentences, and then
Spit out randomized sentences that:
contain all of the keywords exactly once
do NOT contain keywords that were not listed
do NOT duplicate keywords
Important to know: sentences each contain a mix of several keywords, NOT one keyword per sentence.
Steps 1 and 2 are OK; I've been able to do those. My problem is with part 3. I've tried long lists of "if include?" conditions, but it never ends up working, and I know there must be a better way to do this.
My grasp of Ruby (and programming generally) is basic and I don't really know what it can and can't do, so any tips or hints in what functions would be useful would be very very much appreciated.
If a match is found, why don't you pop it out of your array/db as you go? That ensures no duplication, since the record will no longer be present to be matched later. No?
Consider this snippet:
db = ["It is hot today", "It is going to rain", "Where are you, sonny?", "sentence contains is and are"]
keyw = %w(is am are)
de = []
keyw.each do |word|
  for index in 0...db.length
    if db[index].include?(word)
      puts "Matched #{word} with #{db[index]}"
      de << index
    end
  end
  # Delete matched sentences (highest index first) so a later
  # keyword can't match them again.
  until de.empty?
    db.delete_at(de.pop)
  end
end
db is the example database and keyw contains the keywords.
Corresponding output:
Matched is with It is hot today
Matched is with It is going to rain
Matched is with sentence contains is and are
Matched are with Where are you, sonny?
No duplication. :)

Putting spaces back into a string of text with unreliable space information

I need to parse some text from pdfs but the pdf formatting results in extremely unreliable spacing. The result is that I have to ignore the spaces and have a continuous stream of non-space characters.
Any suggestions on how to parse the string and put spaces back into the string by guessing?
I'm using ruby. Or should I say I'musingruby?
Edit: I've pulled the text out using pdf-reader. Some of the pdf files are nicely formatted and some are not. An example of text mixed with positioning:
.7aspe-5.5cts-715.1o0.6f-708.5f-0.4aces-721.4that-716.3are-720.0i-1.8mportant-716.3in-713.9soc-5.5i-1.8alcommunica6.6tion6.3.-711.6Althoug6.3h-708.1m-1.9od6.3els-709.3o6.4f-702.8f5.4ace-707.9proc6.6essing-708.2haveproposed-611.2ways-615.5to-614.7deal-613.2with-613.0these-613.9diff10.4erent-613.7tasks,-611.9it-617.1remainsunclear-448.0how-450.7these-443.2mechanisms-451.7might-446.7be-447.7implemented-447.2in-450.3visualOne-418.9model-418.8of-417.3human-416.4face-421.9processing-417.5proposes-422.7that-419.8informa-tion-584.5is-578.0processed-586.1in-583.1specialised-584.7modules-577.0(Breen-584.4et-582.9al.,-582.32002;Bruce-382.1and-384.0Y92.0oung,-380.21986;-379.2Haxby-379.9et-380.5al.,-
and if I print just the string data (I added returns at the end of each line to keep it from messing up the layout here):
'Distinctrepresentationsforfacialidentityandchangeableaspectsoffacesinthehumantemporal
lobeTimothyJ.Andrews*andMichaelP.EwbankDepartmentofPsychology,WolfsonResearchInstitute,
UniversityofDurham,UKReceived23December2003;revised26March2004;accepted27July2004Availab
leonline14October2004Theneuralsystemunderlyingfaceperceptionmustrepresenttheunchanging
featuresofafacethatspecifyidentity,aswellasthechangeableaspectsofafacethatfacilitates
ocialcommunication.However,thewayinformationaboutfacesisrepresentedinthebrainremainsc
ontroversial.Inthisstudy,weusedfMRadaptation(thereductioninfMRIactivitythatfollowsthe
repeatedpresentationofidenticalimages)toaskhowdifferentface-andobject-selectiveregionsofvisualcortexcontributetospecificaspectsoffaceperception'
The data is spit out by callbacks so if I print each string as it is returned it looks like this:
'The
-571.3
neural
-573.7
system
-577.4
underly
13.9
ing
-577.2
face
-573.0
perc
13.7
eption
-574.9
must
-572.1
repr
20.8
esent
-577.0
the
unchangin
14.4
g
-538.5
featur
16.5
es
-529.5
of
-536.6
a
-531.4
face
'
On examination it looks like the true spaces are large negative numbers (< -300) and the false spaces are much smaller positive numbers. Thanks guys. Just getting to the point where I am asking the question clearly helped me answer it!
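Based on that, a sketch of rebuilding the spaces from the callback stream (tokens is a hypothetical array holding the alternating strings and numeric offsets shown above; the -300 cutoff is just eyeballed from this document):

text = ""
tokens.each do |t|
  if t.is_a?(Numeric)
    text << " " if t < -300 # large negative offsets are real word gaps
  else
    text << t # string fragments are concatenated as-is
  end
end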
Hmmmm... I'd have to say that guessing is never a good idea. Looking at the root cause of the problem and solving that is the answer; anything else is a kludge.
If the spacing is unreliable from the PDF, how is it unreliable? The PDF viewer needs to be able to reliably space the text so the data is there somewhere, you just need to find it.
EDIT following comment:
The idea of parsing the file using a dictionary (your only other option really, apart from randomly inserting spaces and hoping for the best) and inserting spaces at identified word boundaries (a real problem when dealing with punctuation, plurals that don't alter the base word i.e. plural, etc) would, I believe, be a much greater programming challenge than correctly parsing the PDF in the first place. After all, PDF is clearly defined whereas English is somewhat wooly.
Why not look down the route of existing solutions like ps2ascii on Linux? Call it from your Ruby and pick up the result.
PDF doesn't only store spaces as space characters, but also uses layout commands for spacing (so it doesn't print a space, but moves the "pen" to the right). Perhaps you should have a look at the PDF Reference (the big PDF at the bottom of the site); Chapter 9, "Text", should be what you're looking for.
EDIT: After reading your comment to Lazarus' answer, this doesn't seem to be what you're looking for. I think you should try to get a word list from somewhere and try to split your text using it. A good strategy would be to do that using recursion, because for example:
"meandyou"
The first word could be "me" or "mean", but if you try "mean", "dyou" doesn't make sense, so it must be "me". The same goes for the next word, which could be "a", "an" or "and"; only "and" makes sense.
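As a sketch, that recursive splitting might look like this in Ruby (the tiny DICT here is a stand-in for a real word list):

require "set"

DICT = %w[me mean a an and you].to_set

# Returns every way to split s into dictionary words.
def segment(s)
  return [[]] if s.empty?
  results = []
  (1..s.length).each do |i|
    word, rest = s[0...i], s[i..-1]
    next unless DICT.include?(word)
    segment(rest).each { |tail| results << [word, *tail] }
  end
  results
end

segment("meandyou") # => [["me", "and", "you"]]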
If it were me, I'd go back to the source PDFs and try a different method of extracting the text, such as iText (for Java) or maybe some kind of PDF-to-HTML-to-text conversion tool.
