My problem is I have a very large file, an example:
f = %q(1:9- The cost of\n
51:10- The beams cost so much\n
41:11- Should we buy more beams\n
21:12- Why buy more)
What I need to do is extract the word beams from every line that contains it. Each extracted word must come with the reference of the line it came from, like this:
51:10 beams\n
41:11 beams\n
Any help is gratefully appreciated.
/(\d{2,2}:\d{2,2})-.*?(beams)/
The first capture will contain the line reference and the second the word beams
You can extract using scan:
f.scan(/^(\d+\:\d+).+?(beams)/)
=> [["51:10", "beams"], ["41:11", "beams"]]
And for the output:
f.scan(/^(\d+\:\d+).+?(beams)/).each do |pair|
puts pair.join(" ")
end
=>
51:10 beams
41:11 beams
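Since the question mentions a very large file, a line-at-a-time variant may be useful. This is only a sketch: the method name `refs_for` and its parameters are hypothetical, and the `\b` word boundaries around the search word are an addition that the answers above do not use. It accepts anything that responds to `each_line`, such as a String or an open File.

```ruby
# Hypothetical helper: returns "REF word" for every line that starts with a
# "digits:digits" reference and contains the given word.
def refs_for(source, word)
  pattern = /^(\d+:\d+).*?\b(#{Regexp.escape(word)})\b/
  source.each_line.filter_map do |line|
    match = pattern.match(line)
    "#{match[1]} #{match[2]}" if match
  end
end

sample = "1:9- The cost of\n51:10- The beams cost so much\n" \
         "41:11- Should we buy more beams\n21:12- Why buy more\n"
puts refs_for(sample, "beams")
```

For a file on disk, something like `File.open(path) { |file| refs_for(file, "beams") }` should avoid loading the whole file at once (`filter_map` needs Ruby 2.7 or newer).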
Related
I'm trying to get the uppercase words from a text. How can I use .match() for this?
Example
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
and I need something like:
r = /[A-Z]/
puts r.match(text)
I have never used match, and I need a method that gets all uppercase words (acronyms).
If you only want acronyms, you can use something like:
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
text.scan(/\b[A-Z]+\b/)
# => ["PS"]
It's important to match entire words, which is where \b helps, as it marks word boundaries.
The problem is when your text contains single, stand-alone capital letters:
text = "Pediatric stroke (PS) I U.S.A"
text.scan(/\b[A-Z]+\b/)
# => ["PS", "I", "U", "S", "A"]
At that point we need a bit more intelligence and foreknowledge of the text content being searched. The question is, are single-letter acronyms valid? If not, then a minor modification will help:
text.scan(/\b[A-Z]{2,}\b/)
# => ["PS"]
{2,} is explained in the Regexp documentation, so read that for more information.
i only want acronym type " (ACRONYM) ", in this case PS
It's not easy to tell what you want by your description. An acronym is defined as:
An acronym is an abbreviation used as a word which is formed from the initial components in a phrase or a word. Usually these components are individual letters (as in NATO or laser) or parts of words or names (as in Benelux).
according to Wikipedia. By that definition, lowercase, all caps and mixed case can be valid.
If you mean you only want all-caps words within parentheses, then you can easily modify the regex to honor that, but you'll fail on other acronyms you could encounter, either by missing ones you should want or by capturing others you should want to ignore.
text = "(PS) (CT/CAT scan)"
text.scan(/\([A-Z]+\)/) # => ["(PS)"]
text.scan(/\([A-Z]+\)/).map{ |s| s[1..-2] } # => ["PS"]
text.scan(/\(([A-Z]+)\)/) # => [["PS"]]
text.scan(/\(([A-Z]+)\)/).flatten # => ["PS"]
are various ways to grab the text, but this only opens a new can of worms when you look at "List of medical abbreviations" and "Medical Acronyms / Abbreviations".
Typically I'd have a table of the ones I'll accept, use a simple pattern to capture anything that looks like something I'd want, check to see if it's in the table then keep it or reject it. How to do that is for you to figure out as it's a completely different question and doesn't belong in this one.
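For readers who want a starting point, the table-lookup idea described above could be sketched as follows. The `ACCEPTED` set and the method name `known_acronyms` are hypothetical; a real table would come from a curated list.

```ruby
require "set"

# Hypothetical whitelist of acronyms we are willing to accept.
ACCEPTED = Set["PS", "CT", "CAT", "MRI"].freeze

# Capture anything that looks like an acronym, then keep only known ones.
def known_acronyms(text, accepted = ACCEPTED)
  text.scan(/\b[A-Z]{2,}\b/).select { |candidate| accepted.include?(candidate) }
end

p known_acronyms("Pediatric stroke (PS) after a CT/CAT scan in the USA")
```

The loose pattern deliberately over-captures (it also finds USA here), and the table decides what survives.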
Wrong function for the job. Use String#scan.
To get all words that start with uppercase, use String#scan with \b\p{Lu}\w*\b:
text = "Pediatric stroke (PS) is a relatively rare disease, having an estimated incidence of 2.5–13/100,000/year [1–4], but remains one of the most common causes of death in childhood, with a mortality rate of 0.6/100,000 dead/year [5, 6]"
puts text.scan(/\b\p{Lu}\w*\b/).flatten
See demo
String#match will only get you the first match, while String#scan returns all matches.
The regex \b\p{Lu}\w*\b matches:
\b - word boundary
\p{Lu} - an uppercase Unicode letter
\w* - 0 or more word characters (letters, digits or underscores)
\b - a trailing word boundary
To only match linguistic words (made of letters) you can use
puts text.scan(/\b\p{Lu}\p{M}*+(?>\p{L}\p{M}*+)*\b/).flatten
See another demo
Here, \p{Lu}\p{M}*+ matches any Unicode uppercase letter, including one followed by combining diacritics (which \p{M} matches), and (?>\p{L}\p{M}*+)* matches 0 or more letters.
To only get words in ALLCAPS, use
puts text.scan(/\b(?>\p{Lu}\p{M}*+)+\b/).flatten
See the 3rd demo
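To compare the three patterns side by side, here is a small sketch. The sample sentence and the constant names are made up for illustration; the regexes are the ones given above.

```ruby
SAMPLE = "Pediatric stroke (PS) on the R2D2 ward in the USA"

STARTS_UPPER = /\b\p{Lu}\w*\b/                        # any word starting with an uppercase letter
LETTERS_ONLY = /\b\p{Lu}\p{M}*+(?>\p{L}\p{M}*+)*\b/   # same, but letters only
ALL_CAPS     = /\b(?>\p{Lu}\p{M}*+)+\b/               # uppercase letters only

p SAMPLE.scan(STARTS_UPPER)  # => ["Pediatric", "PS", "R2D2", "USA"]
p SAMPLE.scan(LETTERS_ONLY)  # => ["Pediatric", "PS", "USA"]
p SAMPLE.scan(ALL_CAPS)      # => ["PS", "USA"]
```

Because none of these patterns contains a capture group, scan already returns a flat array of strings, so flatten is not required.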
Yes, you can use String#match for this. It may not be the best way, but you didn't ask if it was. You'd have to do something like this:
text.split.map { |s| s.match(/[A-Z]\w*/) }.compact.map { |md| md[0] }
#=> ["Pediatric", "PS"]
If you knew in advance that text contained two words beginning with a capital letter, you could write:
text.match(/([A-Z]\w*).*?([A-Z]\w*)/)
[$1,$2]
#=> ["Pediatric", "PS"]
Note that using a regex is not your only option:
text.delete('.,!?()[]{}').split.select { |str| ('A'..'Z').cover?(str[0]) }
#=> ["Pediatric", "PS"]
def pick_random_line
chosen_line = nil
File.foreach("id'sForCascade.txt").each_with_index do |line, id|
chosen_line = line if rand < 1.0/(id+1)
end
return chosen_line
end
Hey, I'm trying to make that code pick 37 different lines. How would I do that? I'm stuck and confused.
Assuming you don't want the same line to repeat more than once, I would do it in one line like this:
File.read("test.txt").split("\n").shuffle.first(37)
File.read("test.txt") reads the entire file.
split("\n") splits the file into lines based on the \n delimiter (I assume your file is textual and has lines separated by a newline character).
shuffle is a very convenient method of Array that shuffles the lines randomly. You can read about it here:
http://docs.ruby-lang.org/en/2.0.0/Array.html#method-i-shuffle
Finally, first(37) gives you the first 37 lines of the shuffled array. Because the array was shuffled, these form a random selection with no line picked twice.
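As an aside, `Array#sample` with a count argument picks distinct elements in one step, so the shuffle-then-first combination can be shortened. This is a swapped-in alternative rather than the method described above, and the helper name `random_lines` is hypothetical.

```ruby
require "tempfile"

# Hypothetical helper: pick up to `count` distinct lines at random.
# Like the one-liner above, it still reads the whole file into memory.
def random_lines(path, count)
  File.readlines(path, chomp: true).sample(count)
end

# Demo on a throwaway file with 100 lines.
Tempfile.create("lines") do |file|
  (1..100).each { |i| file.puts("line #{i}") }
  file.flush
  puts random_lines(file.path, 37).length
end
```

If the file has fewer lines than requested, `sample` simply returns all of them in random order.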
You can do something like this:
input_lines = File.foreach("test.txt").map(&:to_s)
output_lines = []
37.times do
output_lines << input_lines.delete_at(rand(input_lines.length))
end
puts output_lines
This will ensure that you aren't grabbing duplicate lines and you don't need to do any fancy checking.
However, if your file has fewer than 37 lines this may cause a problem. It also assumes that your file exists.
EDIT:
What happens is that each rand call draws from a range that depends on the current number of input lines. Because delete_at removes the chosen line from the array, the array shrinks and the same line cannot be picked twice.
If you want to save relatively few lines from a large file, reading the entire file into an array (and then randomly selecting lines) could be costly. It might be better to count the number of lines in the file, randomly select line offsets and then save the lines at those offsets to an array. This approach is no more difficult to implement than the former one, but makes the method more robust, even if the files in the current application are not overly large.1
Suppose your filename were given by FName. Here are three ways to count the numbers of lines in the file:
Count lines, literally
cnt = File.foreach(FName).reduce(0) { |c,_| c+1 }
Use $.
File.foreach(FName) {}
cnt = $.
On Unix-family computers, shell-out to the operating system
cnt = %x{wc -l #{FName}}.split.first.to_i
The third option is very fast.
Random offsets (base 1) for n lines to be saved could be computed as follows:
lines = (1..cnt).to_a.sample(n).sort
Saving the lines at those offsets to an array is straightforward; for example:
File.foreach(FName).with_object([]) do |line,a|
if lines.first == $.
a << line
lines.shift
break a if lines.empty?
end
end
Note that $. #=> 1 after the first line is read, and $. is incremented by 1 after each successive line is read. (Hence base 1 for line offsets.)
1 Moreover, many programmers, not just Rubyists, are repelled by the idea of amassing large numbers of anything and then discarding all but a few.
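Put together, the count-then-pick approach might look like the sketch below. The helper name `sample_lines` is hypothetical, and it uses `with_index(1)` in place of the global `$.`, which is a stylistic substitution rather than what the answer shows.

```ruby
require "tempfile"

# Hypothetical helper: two passes over the file, keeping only the chosen
# lines in memory. Lines come back in file order, not in random order.
def sample_lines(path, n)
  total  = File.foreach(path).count             # pass 1: count the lines
  wanted = (1..total).to_a.sample(n).sort       # random 1-based line offsets
  picked = []
  File.foreach(path).with_index(1) do |line, number|
    break if wanted.empty?                      # stop reading once we have them all
    next unless wanted.first == number
    picked << line.chomp
    wanted.shift
  end
  picked
end

# Demo on a throwaway file with 1000 lines.
Tempfile.create("big") do |file|
  (1..1000).each { |i| file.puts("row #{i}") }
  file.flush
  puts sample_lines(file.path, 37).length
end
```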
The following question was posted by #ruhroe about an hour ago. I was about to post an answer when it was taken down. That's unfortunate, as I thought it was rather interesting. I'm putting it back up in case the OP sees this and also to give others an opportunity to post solutions.
The original question (which I've edited):
The problem is to split a string on some spaces in the string, based on criteria which depend in part on a number given by the user. If that number were, say, 5, each substring would contain either:
one word having 5 or more characters or
as many consecutive words (separated by spaces) as possible, provided the resulting string has at most 5 characters.
For example, if the string were:
"abcdefg fg hijkl mno pqrs tuv wx yz"
the result would be:
["abcdefg", "fg", "hijkl", "mno", "pqrs", "tuv", "wx yz"]
"abcdefg" is on a separate line because it has at least five characters.
"fg" is on a separate line because "fg" contains 5 or fewer characters and when combined with the following word, with a space between them, the resulting string, "fg hijkl", contains more than 5 characters.
"hijkl" is on a separate line because it satisfies both criteria.
How can I do that?
I believe this does it:
str = "abcdefg fg hijkl e mn pqrs tuv wx yz"
str.scan(/\b(?:\w{5,}|\w[\w\s]{0,3}\w|\w)\b/)
#=> ["abcdefg", "fg", "hijkl", "e mn", "pqrs", "tuv", "wx yz"]
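Since the limit is supposed to come from the user, the same pattern can be built from a variable. This is a sketch only: the method name `split_by_limit` is hypothetical, and it assumes a limit of at least 2 so that the `{0,limit - 2}` quantifier stays valid.

```ruby
# Hypothetical helper: generalises the hard-coded 5 in the regex above.
def split_by_limit(str, limit)
  raise ArgumentError, "limit must be at least 2" if limit < 2

  # Either one long word, or a run of words and spaces no longer than limit,
  # or a single-character word.
  pattern = /\b(?:\w{#{limit},}|\w[\w\s]{0,#{limit - 2}}\w|\w)\b/
  str.scan(pattern)
end

p split_by_limit("abcdefg fg hijkl e mn pqrs tuv wx yz", 5)
# => ["abcdefg", "fg", "hijkl", "e mn", "pqrs", "tuv", "wx yz"]
```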
As you iterate through the words in your collection (splitting the original string up into words should be trivial), it seems like there are three possible scenarios:
It's a blank line, and we should insert the current word into the line
It's a non-blank line, and the word can fit
It's a non-blank line, and the word can't fit and it should go into a new line
Something like this should work (note - I haven't tested this much outside of your solution. You'll definitely want to do that):
lines = []
line = ""
words.each do |word|
  if line.empty?
    # this is a new line, so start it with the current word
    line = word.dup
  elsif word_can_fit_line?(line, word, length)
    # the word fits, so append it to the current line
    line << " #{word}"
  else
    # the word doesn't fit, so keep this line and start a new one with
    # the current word
    lines << line
    line = word.dup
  end
end
# add the last line and we're done
lines << line
lines
Note that the implementation of word_can_fit_line? should be trivial - you just want to see if the current line length, plus a space, plus the word length, is less than or equal to your desired line length.
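A self-contained version of this loop, including the helper, could look like the following sketch. The wrapper name `fill_lines` is hypothetical; `word_can_fit_line?` follows the description above.

```ruby
# True when `word`, preceded by one space, still fits within `length`.
def word_can_fit_line?(line, word, length)
  line.length + 1 + word.length <= length
end

# Hypothetical wrapper: greedily packs words into lines of at most `length`
# characters; a word longer than `length` ends up on a line of its own.
def fill_lines(str, length)
  lines = []
  line = +""
  str.split.each do |word|
    if line.empty?
      line = word.dup
    elsif word_can_fit_line?(line, word, length)
      line << " " << word
    else
      lines << line
      line = word.dup
    end
  end
  lines << line unless line.empty?
  lines
end

p fill_lines("abcdefg fg hijkl mno pqrs tuv wx yz", 5)
# => ["abcdefg", "fg", "hijkl", "mno", "pqrs", "tuv", "wx yz"]
```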
I'm just learning Ruby and have been tackling small code projects to accelerate the process.
What I'm trying to do here is read only the alphabetic words from a text file into an array, then delete the words from the array that are less than 5 characters long. Then where the stdout is at the bottom, I'm intending to use the array. My code currently works, but is very very slow since it has to read the entire file, then individually check each element and delete the appropriate ones. This seems like it's doing too much work.
goal = File.read('big.txt').split(/\s/).map do |word|
word.scan(/[[:alpha:]]+/).uniq
end
goal.each { |word|
if word.length < 5
goal.delete(word)
end
}
puts goal.sample
Is there a way to apply the criteria to my File.read block to keep it from mapping the short words to begin with? I'm open to anything that would help me speed this up.
You might want to change your regex instead so that it only catches words of at least 5 characters to begin with:
goal = File.read('C:\Users\bkuhar\Documents\php\big.txt').split(/\s/).flat_map do |word|
  word.scan(/[[:alpha:]]{5,}/).uniq
end
A further optimization might be to maintain a Set instead of an Array, to avoid re-scanning for uniqueness:
require 'set'
goal = Set.new
File.read('C:\Users\bkuhar\Documents\php\big.txt').scan(/\b[[:alpha:]]{5,}\b/).each do |w|
  goal << w
end
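If reading the whole file at once is the concern, a line-by-line variant of the Set idea is possible. The sketch below is an assumption-laden illustration: the helper name `long_words` and the `min_length` parameter are hypothetical, with the default of 5 taken from the cutoff stated in the question.

```ruby
require "set"
require "tempfile"

# Hypothetical helper: collects the unique alphabetic words of at least
# `min_length` letters, reading one line at a time.
def long_words(path, min_length = 5)
  pattern = /[[:alpha:]]{#{min_length},}/
  File.foreach(path).each_with_object(Set.new) do |line, words|
    words.merge(line.scan(pattern))
  end
end

# Demo on a throwaway file.
Tempfile.create("words") do |file|
  file.puts("The quick brown foxes jumped over quick")
  file.puts("lazy dogs happily")
  file.flush
  p long_words(file.path).to_a
end
```

Calling `to_a.sample` on the result gives a random word, as in the question's final line.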
In this case, use the delete_if method:
# goal is your array
goal.delete_if { |w| w.length < 5 }
This removes the words shorter than 5 characters from goal in place and returns that same array.
Hope this helps.
I really don't understand what a lot of the stuff you are doing in the first loop is for.
You take every chunk of text separated by white space, and map it to a unique value in an array generated by chunking together groups of letter characters, and plug that into an array.
This is way too complicated for what you want. Try this:
goal = File.readlines('big.txt', chomp: true).select do |word|
  word =~ /^[a-zA-Z]+$/ &&
    word.length >= 5
end
This makes it easy to add new conditions, too. If the word can't contain 'q' or 'Q', for example:
goal = File.readlines('big.txt', chomp: true).select do |word|
  word =~ /^[a-zA-Z]+$/ &&
    word.length >= 5 &&
    !word.upcase.include?('Q')
end
This assumes that each word in your dictionary is on its own line. You could go back to splitting it on white space, but it makes me wonder if the file you are reading in is written, human-readable text; a.k.a, it has 'words' ending in periods or commas, like this sentence. In that case, splitting on whitespace will not work.
Another note: map is the wrong array method to use here. It transforms each value of one array and builds a new array from the results. You want to select certain values from an array without modifying them, which is what Array#select does.
Also, feel free to change the regex back to the [[:alpha:]] character class if you are expecting non-standard letter characters.
Edit: Second version
goal = File.read('big.txt').scan(/[a-z][a-z']{4,}/i)
Explanation: Read the whole file and capture every occurrence of a group of letters at least 5 long, possibly containing but not starting with a '. String#scan returns all of those occurrences as an array. (Ruby regex literals have no g flag, and match returns only the first match, so scan is what does the global matching.)
This works well, and it's only one line for your whole task, but it'll match
sugar'
in
I'd like some 'sugar', if you know what I mean
Like above, if your word can't contain q or Q, you could change the regex to
/([a-pr-z][a-pr-z']{4,})[ .'",]/i
And an idea - do another select on goal, removing all those entries that end with a '. This overcomes the limitations of my Regex
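That clean-up pass might be sketched like this. The helper name and the sample array are made up for illustration; `reject` is used here as the inverse of `select`.

```ruby
# Hypothetical helper: drops entries that end with an apostrophe.
def drop_trailing_apostrophes(words)
  words.reject { |word| word.end_with?("'") }
end

# Made-up sample standing in for the result of the scan.
p drop_trailing_apostrophes(["sugar'", "don't", "beams", "mean'"])
# => ["don't", "beams"]
```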
I was wondering if anyone had any advice on parsing a file with fixed length records in Ruby. The file has several sections, each section has a header, n data elements and a footer. For example (This is total nonsense - but has roughly similar content)
1923 000-230SomeHeader 0303030
209231-231992395 MoreData
293894-329899834 SomeData
298342-323423409 OtherData
3 3423942Footer record 9832422
Headers, Footers and Data rows each begin with a specific number (1,2 & 3) in this example.
I have looked at http://rubyforge.org/projects/file-formatter/ and it looks good - except that the documentation is light and I can't see how to have n data elements.
Cheers,
Dan
There are a number of ways to do this. The unpack method of String can be used to define a pattern of fields, as follows:
"209231-231992395    MoreData".unpack('aa5A1A9a4Z*')
This returns an array as follows:
["2", "09231", "-", "231992395", "    ", "MoreData"]
See the documentation for a description of the pack/unpack format.
Several options exist as usual.
If you want to do it manually I would suggest something like this:
very pseudo-code:
Read file
while lines in file
handle_line(line)
end
def handle_line
type=first_char
parse_line(type)
end
def parse_line
split into elements and do_whatever_to_them
end
Splitting the line into elements of fixed width can be done with, for instance, unpack():
irb(main):001:0> line="1923  000-230SomeHeader     0303030"
=> "1923  000-230SomeHeader     0303030"
irb(main):002:0> list=line.unpack("A1A5A7a15A10")
=> ["1", "923", "000-230", "SomeHeader     ", "0303030"]
irb(main):003:0>
The pattern used for unpack() will vary with the field lengths of the different kinds of records, and the code will depend on whether you want trailing spaces and such. See the unpack reference for details.
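The pseudo-code and the unpack() call can be combined into something runnable along these lines. Treat it as a sketch under assumptions: the header and data layouts are the format strings used in the answers, the footer layout is invented, the sample rows assume the column padding implied by those format strings, and the names `LAYOUTS`, `parse_line` and `parse_records` are hypothetical.

```ruby
# Field layouts keyed by the first character of each record.
# The footer layout is made up; adjust it to the real column widths.
LAYOUTS = {
  "1" => [:header, "A1A5A7a15A10"],
  "2" => [:data,   "aa5A1A9a4Z*"],
  "3" => [:footer, "A1A8A*"]
}.freeze

# Returns [record_kind, fields] for one fixed-width line.
def parse_line(line)
  kind, layout = LAYOUTS.fetch(line[0]) do
    raise ArgumentError, "unknown record type: #{line[0].inspect}"
  end
  [kind, line.chomp.unpack(layout)]
end

# Handles any number of data rows between a header and a footer.
def parse_records(lines)
  lines.map { |line| parse_line(line) }
end

records = parse_records([
  "1923  000-230SomeHeader     0303030",
  "209231-231992395    MoreData"
])
records.each { |kind, fields| puts "#{kind}: #{fields.inspect}" }
```

With a real file, `parse_records(File.foreach(path))` would process one record per line.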