Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
This is an interesting issue I have been playing around with but have been unable to find an answer for.
I have a text file of unstructured data that includes emails as well as full names. I already have the emails extracted, but I want to map first and last names to each email as well.
So suppose the email is ksmith@gmail.com, and somewhere on the page is 'Kevin Smith'.
I'd want to use whatever is in front of '@' to map the full name from somewhere in the text. But obviously searching for 'ksmith' will return no match. So then, starting from the left, I'd search with one less character, i.e. 'smith', which would match.
But then when I find 'Smith', I also want to find the first name. So maybe assume this will always be the last name (since most emails contain last names rather than first names) and search left from 'Smith' until reaching the next space (in front of 'Kevin'), figuring that what lies between the space before 'Smith' and the one before 'Kevin' is the first name.
But then, what if the full name is "Kevin Michael Smith" or "Kevin P. Smith"? In which case I don't want "Michael" or "P.", but Kevin as the first name.
Or what if the email structure is smithk@gmail.com, in which case shrinking the substring from the left will never produce a match and I'd need to try from the other side as well.
Basically I need a method smart enough to recognize these full names in a number of cases.
Any help would be appreciated!
I am trying to do this in Ruby, if that helps.
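The shrinking-substring search described above could be sketched like this (a hypothetical helper, not from the original post; names are illustrative):

```ruby
# Hypothetical sketch: generate candidate last names from an email local part
# by shrinking from the left ("ksmith" -> "smith"), then from the right
# ("smithk" -> "smith").
def name_candidates(local)
  from_left  = (0...local.length).map { |i| local[i..-1] }
  from_right = (0...local.length).map { |i| local[0..i] }.reverse
  (from_left + from_right).uniq
end

p name_candidates("ksmith").first(2)  # => ["ksmith", "smith"]
```

Each candidate could then be searched for in the text, longest first, stopping at the first hit.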
When you find the last name and move back to the first name, instead of just moving left of 'Smith' until reaching the next space, you should check whether there is a space before the first letter of the word you just found, and if so keep moving. For example, for 'Kevin P. Smith' your algorithm will find 'P.', but since there is a space before 'P' you move on to the previous part of the name. So for 'Kevin Micheal John Smith' you get 'Kevin': first you reach 'John', then you see there is a space before 'J', so you move back to 'Micheal'; again there is a space before 'M', so you move to 'Kevin'. As there is no space before 'Kevin', you have the first name.
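A rough sketch of that walk-back (a hypothetical helper; it assumes the full name stands alone as a run of capitalized words right before the last name, so in running prose it would need extra filtering):

```ruby
# Hypothetical sketch: capture the run of capitalized words immediately
# before the last name, then take the leftmost one as the first name.
def first_name_before(text, last_name)
  m = text.match(/((?:[A-Z][a-z']*\.?\s+)+)#{Regexp.escape(last_name)}\b/)
  m && m[1].split.first
end

p first_name_before("Kevin P. Smith", "Smith")           # => "Kevin"
p first_name_before("Kevin Micheal John Smith", "Smith") # => "Kevin"
```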
The easiest solution is to use split, for example:
parts = full_name.split(" ")
first_name = parts[0]
My suggestion is to write an algorithm which matches against an array of full names. For example:
names = ["kevin smith", "andrew john", "thom devid", "M. K. Add", "k smith"]
email = "ksmith@gmail.com"
local = email.split('@')[0]
#=> "ksmith"
first_initial = local[0]
#=> "k"
last = local[1..-1]
#=> "smith"
names.each do |name|
  parts = name.split(" ")
  if parts.length == 2
    if (parts[0][0] == first_initial && parts[1] == last) ||
       (parts[0][0] == last && parts[1] == first_initial)
      p name
    end
  elsif parts.length == 3
    if (parts[0][0] == first_initial && parts[2] == last) ||
       (parts[0][0] == last && parts[2] == first_initial)
      p parts[0] + " " + parts[2]
    end
  end
end
This will work for you, and you can use a case statement instead of if/elsif if there are multiple scenarios.
Related
I have 4 characters; the first one is a letter ('L', for example), the next two are numbers, and the last one is a letter again, all of them separated by one space. The user enters them in the Ruby console. I need to check that they are separated by one space, don't contain other weird characters, and that there is nothing after the last letter.
So if a user enters, for example, 'L 5 7 A' via gets.chomp, I need to check that everything is OK and separated by only one space, and return input[1], input[2], input[3]. How can I do that? Thanks.
You can do something like this:
puts "Enter string"
input = gets.chomp
r = /^(L)\s(\d)\s(\d)\s([A-Z])$/
matches = input.match r
puts matches ? "inputs: #{$1}, #{$2}, #{$3}, #{$4}" : "input-format incorrect"
Here $1 is the first capture, similarly for $2, $3 etc. If you want to store the result in an array you can use:
matches = input.match(r).to_a
then the first element is the entire match, followed by each capture.
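For instance, with the pattern above:

```ruby
input = "L 5 7 A"
r = /^(L)\s(\d)\s(\d)\s([A-Z])$/
matches = input.match(r).to_a
p matches  # => ["L 5 7 A", "L", "5", "7", "A"]
```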
Try
/^\w\s(\d)\s(\d)\s(\w)$/
Rubular is a good sandbox site for experimenting with and debugging regexes.
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I need to split a string, for food products, such as "Chocolate Biscuits 200g"
I need to extract the "200g" from the String and then split this by number and then by the measurement/weight.
So I need the "200" and "g" separately.
I have written a Ruby regex to find the "200g" in the string (sometimes there may be a space between the number and the measurement, so I have included an optional whitespace between them):
([0-9]*[?:\s ]?[a-zA-Z]+)
And I think it works. But now that I have the result ("200g") that it matched from the entire String, I need to split this by number and measurement.
I wrote two regexes to split these:
([0-9]+)
to split by number and
([a-zA-Z]+)
to split by letters.
But the .split method is not working with these.
I get the following error:
undefined method 'split' for #&lt;MatchData "200"&gt;
Of course I will need to convert the 200 to a number instead of a String.
Any help is greatly appreciated,
Thank you!
UPDATE:
I have tested the 3 regexes on http://www.rubular.com/.
My issue seems to be around splitting up the result from the first regex into number and measurement.
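One minimal sketch of that second step, assuming a simplified version of the pattern: read the number and the unit straight off the MatchData captures instead of calling split on it.

```ruby
# Capture the digits and the letters as two groups, then use the captures.
m = "Chocolate Biscuits 200g".match(/(\d+)\s?([a-zA-Z]+)/)
number = m[1].to_i  # => 200
unit   = m[2]       # => "g"
```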
One way among many is to use String#scan with a regex. See the last sentence of the doc concerning the treatment of capture groups.
str = "Chocolate Biscuits 200g"
r = /
(\d+) # match one or more digits in capture group 1
([[:alpha:]]+) # match one or more alphabetic characters in capture group 2
/x # free-spacing regex definition mode
number, weight = str.scan(r).flatten
#=> ["200", "g"]
number = number.to_i
#=> 200
I'm not an expert in Ruby, but I guess the following code does the deal:
my_string = "Chocolate Biscuits 200g"
weight = 0
unit = ""
my_string.split(/(?:([a-zA-Z]+)|([0-9]+))/).each do |val|
  if val =~ /\A[0-9]+\z/
    weight = val.to_i
  elsif weight > 0 && val.length > 0
    unit = val
  end
end
p weight
p unit
I'm just learning Ruby and have been tackling small code projects to accelerate the process.
What I'm trying to do here is read only the alphabetic words from a text file into an array, then delete from the array the words that are less than 5 characters long. Then, where the puts is at the bottom, I intend to use the array. My code currently works, but is very, very slow, since it has to read the entire file and then individually check and delete the appropriate elements. This seems like too much work.
goal = File.read('big.txt').split(/\s/).map do |word|
word.scan(/[[:alpha:]]+/).uniq
end
goal.each { |word|
if word.length < 5
goal.delete(word)
end
}
puts goal.sample
Is there a way to apply the criteria to my File.read block to keep it from mapping the short words to begin with? I'm open to anything that would help me speed this up.
You might want to change your regex instead to catch only words longer than 5 characters to begin with:
goal = File.read('C:\Users\bkuhar\Documents\php\big.txt').split(/\s/).flat_map do |word|
word.scan(/[[:alpha:]]{6,}/).uniq
end
Further optimization might be to maintain a Set instead of an Array, to avoid re-scanning for uniqueness:
require 'set'

goal = Set.new
File.read('C:\Users\bkuhar\Documents\php\big.txt').scan(/\b[[:alpha:]]{6,}\b/).each do |w|
  goal << w
end
In this case, use the delete_if method:
goal # => your array
goal.delete_if { |w| w.length < 5 }
This modifies goal in place, deleting the words shorter than 5 characters (it returns the same array, not a new one).
Hope this helps.
I really don't understand what a lot of the stuff you are doing in the first loop is for.
You take every chunk of text separated by whitespace, map each one to an array of the unique groups of letter characters it contains, and collect those arrays into one array.
This is way too complicated for what you want. Try this:
goal = File.readlines('big.txt').map(&:chomp).select do |word|
  word =~ /^[a-zA-Z]+$/ &&
    word.length >= 5
end
This makes it easy to add new conditions, too. If the word can't contain 'q' or 'Q', for example:
goal = File.readlines('big.txt').map(&:chomp).select do |word|
  word =~ /^[a-zA-Z]+$/ &&
    word.length >= 5 &&
    !word.upcase.include?('Q')
end
This assumes that each word in your dictionary is on its own line. You could go back to splitting it on white space, but it makes me wonder if the file you are reading in is written, human-readable text; a.k.a, it has 'words' ending in periods or commas, like this sentence. In that case, splitting on whitespace will not work.
Another note: map is the wrong array method to use here. It transforms the values of one array and creates a new array out of those values. You want to select certain values from an array, not transform them. The Array#select method is what you want.
Also, feel free to modify the regex back to using the [[:alpha:]] character class if you are expecting non-standard letter characters.
Edit: Second version
goal = File.readlines('big.txt').join(" ").scan(/[a-z][a-z']{4,}/i)
Explanation: Load the file and join all its lines together with a space, then scan for every occurrence of a group of letters at least 5 long, possibly containing but not starting with a '. scan returns all those occurrences as an array. (Ruby regexes have no /g flag; String#scan is what gives you every match.)
This works well, and it's only one line for your whole task, but it'll match
sugar'
in
I'd like some 'sugar', if you know what I mean
Like above, if your word can't contain q or Q, you could change the regex to
/([a-pr-z][a-pr-z']{4,})[ .'",]/i
And an idea: do another select on goal, removing all those entries that end with a '. This overcomes the limitations of my regex.
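That extra pass might look like this sketch (the array contents here are made up for illustration):

```ruby
# Drop any scanned word that ends with a stray apostrophe.
goal = ["sugar'", "chocolate", "kevin's"]
goal = goal.reject { |w| w.end_with?("'") }
p goal  # => ["chocolate", "kevin's"]
```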
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I've got a string that has variable length sections. The length of the section precedes the content of that section. So for example, in the string:
13JOHNSON,STEVE
The first 2 characters define the content length (13), followed by the actual content. I'd like to be able to parse this using named capture groups with a backreference, but I'm not sure it is possible. I was hoping this would work:
(?<length>\d{2})(?<name>.{\k<length>})
But it doesn't. It seems the backreference isn't interpreted as a number. This works fine, though:
(?<length>\d{2})(?<name>.{13})
No, that will not work, of course. You need to recompile your regular expression after extracting the first number.
I would recommend using two different expressions:
a first one that extracts the number, and a second one that extracts the text based on the number extracted by the first.
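A sketch of that two-pass idea, interpolating the extracted length into a second pattern (variable names are illustrative):

```ruby
s = "13JOHNSON,STEVE"
# First pass: pull out the two-digit length prefix.
length = s.match(/\A(?<length>\d{2})/)[:length].to_i  # => 13
# Second pass: build a new regex using that length.
name = s.match(/\A\d{2}(?<name>.{#{length}})/)[:name]
p name  # => "JOHNSON,STEVE"
```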
You can't do that.
>> s = '13JOHNSON,STEVE'
=> "13JOHNSON,STEVE"
>> length = s[/^\d{2}/].to_i # s[0,2].to_i
=> 13
>> s[2,length]
=> "JOHNSON,STEVE"
This really seems like you're going after this the hard way. I suspect the sample string is not as simple as you said, based on:
I've got a string that has variable length sections. The length of the section precedes the content of that section.
Instead I'd use something like:
str = "13JOHNSON,STEVE 08Blow,Joe 10Smith,John"
str.scan(/\d{2}(\S+)/).flatten # => ["JOHNSON,STEVE", "Blow,Joe", "Smith,John"]
If the string can be split accurately, then there's this:
str.split.map{ |s| s[2..-1] } # => ["JOHNSON,STEVE", "Blow,Joe", "Smith,John"]
If you only have length bytes followed by strings, with nothing between them, something like this works:
offset = 0
str.delete!(' ') # => "13JOHNSON,STEVE08Blow,Joe10Smith,John"
str.scan(/\d+/).map{ |l| s = str[offset + 2, l.to_i]; offset += 2 + l.to_i ; s }
# => ["JOHNSON,STEVE", "Blow,Joe", "Smith,John"]
won't work if the names have digits in them – tihom
str = "13JOHNSON,STEVE 08Blow,Joe 10Smith,John 1012345,7890"
str.scan(/\d{2}(\S+)/).flatten # => ["JOHNSON,STEVE", "Blow,Joe", "Smith,John", "12345,7890"]
str.split.map{ |s| s[2..-1] } # => ["JOHNSON,STEVE", "Blow,Joe", "Smith,John", "12345,7890"]
With a minor change, and a minor addition, it'll continue to work correctly with strings not containing delimiters:
str.delete!(' ') # => "13JOHNSON,STEVE08Blow,Joe10Smith,John1012345,7890"
offset = 0
str.scan(/\d{2}/).map{ |l| s = str[offset + 2, l.to_i]; offset += 2 + l.to_i ; s }.compact
# => ["JOHNSON,STEVE", "Blow,Joe", "Smith,John", "12345,7890"]
\d{2} grabs the numerics in groups of two. For the names, where the numeric is a leading length value of two characters (as in the OP's sample), the correct thing happens. For a solid numeric "name" several false positives are returned, which would yield nil values; compact cleans those out.
What about this?
a = '13JOHNSON,STEVE'
puts a.match /(?<length>\d{2})(?<name>(.*),(.*))/
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 4 years ago.
After using a keyword API to get popular keywords and phrases, I also get a lot of "dirty" terms with too many extra words ("the", "a", etc.).
I'd also like to isolate names in search terms.
Is there a Ruby library to clean up keyword lists? Does such an algorithm exist at all?
You're talking about "stopwords", which are articles of speech such as "the" and "a", plus words that are encountered so often they are worthless.
Stopword lists exist; WordNet has one if I remember right, and there might be one in the Lingua or Ruby WordNet or readability modules, but really they're pretty easy to generate yourself. And you probably need to, since the junk words vary depending on the particular subject matter.
The easiest thing to do is run a preliminary pass over several sample documents and split your text into words, then loop over them and increment a counter for each one. When you're finished, look for the words that are two to four letters long and have disproportionately high counts. Those are good candidates for stopwords.
Then run passes over your target documents, splitting the text like you did previously, counting occurrences as you go. You can either ignore words in your stopword list and not add them to your hash, or process everything then delete the stopwords.
text = <<EOT
You have reached this web page by typing "example.com", "example.net","example.org"
or "example.edu" into your web browser.
These domain names are reserved for use in documentation and are not available
for registration. See RFC 2606, Section 3.
EOT
# do this against several documents to build a stopword list. Tweak as necessary to fine-tune the words.
stopwords = text.downcase.split(/\W+/).inject(Hash.new(0)) { |h,w| h[w] += 1; h }.select{ |n,v| n.length < 5 }
print "Stopwords => ", stopwords.keys.sort.join(', '), "\n"
# >> Stopwords => 2606, 3, and, are, by, com, edu, for, have, in, into, net, not, or, org, page, rfc, see, this, use, web, you, your
Then, you're ready to do some keyword gathering:
text = <<EOT
You have reached this web page by typing "example.com", "example.net","example.org"
or "example.edu" into your web browser.
These domain names are reserved for use in documentation and are not available
for registration. See RFC 2606, Section 3.
EOT
stopwords = %w[2606 3 and are by com edu for have in into net not or org page rfc see this use web you your]
keywords = text.downcase.split(/\W+/).inject(Hash.new(0)) { |h,w| h[w] += 1; h }
stopwords.each { |s| keywords.delete(s) }
# output in order of most often seen to least often seen.
keywords.keys.sort{ |a,b| keywords[b] <=> keywords[a] }.each { |k| puts "#{k} => #{keywords[k]}"}
# >> example => 4
# >> names => 1
# >> reached => 1
# >> browser => 1
# >> these => 1
# >> domain => 1
# >> typing => 1
# >> reserved => 1
# >> documentation => 1
# >> available => 1
# >> registration => 1
# >> section => 1
After you've narrowed down your list of words you can run the candidates through WordNet and find synonyms, homonyms, word relations, strip plurals, etc. If you're doing this to a whole lot of text you'll want to keep your stopwords in a database where you can continually fine-tune them. The same thing applies to your keywords, because from those you can start to determine tone and other semantic goodness.
BTW, I decided to go this route:
bad_words = ["the", "a", "for", "on"] #etc etc
# Strip non alpha chars, and split into a temp array, then cut out the bad words
tmp_str = str.gsub(/[^A-Za-z0-9\s]/, "").split - bad_words
str = tmp_str.join(" ")