I'm doing some web scraping. This is the format of the data:
Sr.No. Course_Code Course_Name Credit Grade Attendance_Grade
The actual string that I receive is of the following form:
1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M
The things that I am interested in are the Course_Code, Course_Name and the Grade, in this example the values would be
Course_Code : CA727
Course_Name : PRINCIPLES OF COMPILER DESIGN
Grade : A
Is there some way for me to use a regular expression or some other technique to easily extract this information instead of manually parsing through the string?
I'm using jruby in 1.9 mode.
Let's use Ruby's named captures and a self-describing regex!
course_line = /
^ # Starting at the front of the string
(?<SrNo>\d+) # Capture one or more digits; call the result "SrNo"
\s+ # Eat some whitespace
(?<Code>\S+) # Capture all the non-whitespace you can; call it "Code"
\s+ # Eat some whitespace
(?<Name>.+\S) # Capture as much as you can
# (while letting the rest of the regex still work)
# Make sure you end with a non-whitespace character.
# Call this "Name"
\s+ # Eat some whitespace
(?<Credit>\S+) # Capture all the non-whitespace you can; call it "Credit"
\s+ # Eat some whitespace
(?<Grade>\S+) # Capture all the non-whitespace you can; call it "Grade"
\s+ # Eat some whitespace
(?<Attendance>\S+) # Capture all the non-whitespace; call it "Attendance"
$ # Make sure that we're at the end of the line now
/x
str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
parts = str.match(course_line)
puts "
Course Code: #{parts['Code']}
Course Name: #{parts['Name']}
Grade: #{parts['Grade']}".strip
#=> Course Code: CA727
#=> Course Name: PRINCIPLES OF COMPILER DESIGN
#=> Grade: A
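If you want every field at once, one option is to pair MatchData#names with MatchData#captures. The sketch below is only an illustration: COURSE_LINE is a compact one-line version of the pattern above, and parse_course is a made-up helper name.

```ruby
# Compact one-line version of the pattern above (constant name invented for this sketch).
COURSE_LINE = /^(?<SrNo>\d+)\s+(?<Code>\S+)\s+(?<Name>.+\S)\s+(?<Credit>\S+)\s+(?<Grade>\S+)\s+(?<Attendance>\S+)$/

# Hypothetical helper: returns a hash of field name => captured text,
# or nil when the line does not match the expected layout.
def parse_course(line)
  m = COURSE_LINE.match(line)
  return nil unless m
  Hash[m.names.zip(m.captures)]
end

fields = parse_course("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
puts fields['Code'], fields['Name'], fields['Grade']
```

Returning nil for a non-matching line lets the caller skip header rows or malformed lines explicitly.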
Just for fun:
str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
tok = str.split /\s+/
data = {'Sr.No.' => tok.shift, 'Course_Code' => tok.shift, 'Attendance_Grade' => tok.pop,'Grade' => tok.pop, 'Credit' => tok.pop, 'Course_Name' => tok.join(' ')}
Do I see that correctly that the delimiter is always 3 spaces? Then just:
serial_number, course_code, course_name, credit, grade, attendance_grade =
the_string.split(/ {3}/) # split on exactly three spaces
Assuming everything except for the course description consists of single words and there are no leading or trailing spaces:
/^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/
Your example string will yield the following match groups:
1. 1
2. CA727
3. PRINCIPLES OF COMPILER DESIGN
4. 3
5. A
6. M
This answer isn't very idiomatic Ruby, because in this case I think clarity is better than being clever. All you really need to do to solve the problem you described is to split your lines with whitespace:
line = '1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M'
array = line.split /\t|\s{2,}/
puts array[1], array[2], array[4]
This assumes your data is regular. If not, you will need to work harder at tuning your regular expression and possibly handling edge cases where you don't have the required number of fields.
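As a rough sketch of the edge-case handling mentioned above, you could check the field count before indexing. The helper name extract_course, the EXPECTED_FIELDS constant, and the tab-separated sample line are assumptions made for this example.

```ruby
EXPECTED_FIELDS = 6 # Sr.No., Course_Code, Course_Name, Credit, Grade, Attendance_Grade

# Hypothetical helper: returns [code, name, grade], or nil for a malformed line.
def extract_course(line)
  fields = line.strip.split(/\t|\s{2,}/)
  return nil unless fields.size == EXPECTED_FIELDS
  fields.values_at(1, 2, 4)
end

p extract_course("1\tCA727\tPRINCIPLES OF COMPILER DESIGN\t3\tA\tM")
p extract_course("1 CA727") # too few fields, so this prints nil
```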
A Note for Posterity
The OP changed the input string, and modified the delimiter to a single space between fields. I'll leave my answer to the original question as-is (including the original input string for reference) as it may help others besides the OP in a less-specific case.
Related
Working on a Ruby challenge to convert dash/underscore delimited words into camel casing. The first word within the output should be capitalized only if the original word was capitalized (known as Upper Camel Case).
My solution so far..:
def to_camel_case(str)
str.split('_,-').collect.camelize(:lower).join
end
However, I believe .camelize(:lower) is a Rails method and doesn't work in plain Ruby. Is there an equally simple alternative? I can't seem to find one. Or do I need to approach the challenge from a completely different angle?
main.rb:4:in `to_camel_case': undefined method `camelize' for #<Enumerator: []:collect> (NoMethodError)
from main.rb:7:in `<main>'
I assume that:
Each "word" is made up of one or more "parts".
Each part is made up of characters other than spaces, hyphens and underscores.
The first character of each part is a letter.
Each successive pair of parts is separated by a hyphen or underscore.
It is desired to return a string obtained by modifying each part and removing the hyphen or underscore that separates each successive pair of parts.
For each part all letters but the first are to be converted to lowercase.
All characters in each part of a word that are not letters are to remain unchanged.
The first letter of the first part is to remain unchanged.
The first letter of each part other than the first is to be capitalized (if not already capitalized).
Words are separated by spaces.
If this describes the problem correctly, the following method could be used.
R = /(?:(?<=^| )|[_-])[A-Za-z][^ _-]*/
def to_camel_case(str)
str.gsub(R) do |s|
c1 = s[0]
case c1
when /[A-Za-z]/
c1 + s[1..-1].downcase
else
s[1].upcase + s[2..-1].downcase
end
end
end
to_camel_case "Little Miss-muffet sat_on_HE$R Tuffett eating-her_cURDS And_whey"
# => "Little MissMuffet satOnHe$r Tuffett eatingHerCurds AndWhey"
The regular expression can be written in free-spacing mode to make it self-documenting.
R = /
(?: # begin non-capture group
(?<=^| ) # use a positive lookbehind to assert that the next character
# is preceded by the beginning of the string or a space
| # or
[_-] # match '_' or '-'
) # end non-capture group
[A-Za-z] # match a letter
[^ _-]* # match 0+ characters other than ' ', '_' and '-'
/x # free-spacing regex definition mode
Most Rails methods can be added into basic Ruby projects without having to pull in the whole Rails source.
The trick is to figure out the minimum number of files to require in order to define the method you need. If we go to APIDock, we can see that camelize is defined in active_support/inflector/methods.rb.
Therefore active_support/inflector seems like a good candidate to try. Let's test it:
irb(main)> require 'active_support/inflector'
=> true
irb(main)> 'foo_bar'.camelize
=> "FooBar"
Seems to work. Note that this assumes you already ran gem install activesupport earlier. If not, then do it first (or add it to your Gemfile).
In pure Ruby, no Rails, given str = 'my-var_name' you could do:
delimiters = Regexp.union(['-', '_'])
str.split(delimiters).then { |first, *rest| [first, rest.map(&:capitalize)].join }
#=> "myVarName"
Where str = 'My-var_name' the result is "MyVarName", since the first element of the splitting result is untouched, while the rest is mapped to be capitalized.
It works only with "dash/underscore delimited words", no spaces, or you need to split by spaces, then map with the presented method.
This method splits the string on multiple delimiters (as explained in "Split string by multiple delimiters")
and chains the result with Object#then (available from Ruby 2.6).
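For input that also contains spaces, a minimal sketch of the "split by spaces, then map" idea could look like this. The helper names camelize_word and to_camel_case, and the sample input, are invented for this example; it uses plain destructuring instead of Object#then so it also runs on older Rubies.

```ruby
DELIMITERS = Regexp.union(['-', '_'])

# Camel-cases one dash/underscore delimited word, leaving its first part untouched.
def camelize_word(word)
  first, *rest = word.split(DELIMITERS)
  [first, *rest.map(&:capitalize)].join
end

# Applies camelize_word to each space-separated word.
def to_camel_case(str)
  str.split(' ').map { |word| camelize_word(word) }.join(' ')
end

puts to_camel_case('my-var_name The_stealth-warrior')
```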
This is my expected result:
input a string and get three strings back.
I have no idea how to do this with a regex in Ruby.
This is my rough idea:
match(/(.*?)(_)(.*?)(\d+)/)
Input and expected output:
# "R224_OO2003" => R224, OO, 2003
# "R2241_OOP2003" => R2241, OOP, 2003
If the example description I gave in my comment on the question is correct, you need a very straightforward regex:
r = /(.+)_(.+)(\d{4})/
Then:
"R224_OO2003".scan(r).flatten #=> ["R224", "OO", "2003"]
"R2241_OOP2003".scan(r).flatten #=> ["R2241", "OOP", "2003"]
Assuming that your three parts consist of (R and one or more digits), then an underbar, then (one or more non-whitespace characters), before finally (a 4-digit numeric date), then your regex could be something like this:
^(R\d+)_(\S+)(\d{4})$
The ^ indicates start of string, and the $ indicates end of string. \d+ indicates one or more digits, while \S+ says one or more non-whitespace characters. The \d{4} says exactly four digits.
To recover data from the matches, you could either use the pre-defined globals that line up with your groups, or you could use named captures.
To use the match globals just use $1, $2, and $3. In general, you can figure out the number to use by counting the left parentheses of the specific group.
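A small sketch of the globals approach, using the pattern above (the variable names first, second and third are just for illustration):

```ruby
x = "R2241_OOP2003"

# =~ sets $1, $2, $3 from the capture groups when the match succeeds.
if x =~ /^(R\d+)_(\S+)(\d{4})$/
  first, second, third = $1, $2, $3
  puts first, second, third
end
```

Copying the globals into local variables right away is a reasonable habit, since the next successful match overwrites them.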
To use the named captures, include ?<name> right after the left paren of a particular group. For example:
x = "R2241_OOP2003"
match_data = /^(?<first>R\d+)_(?<second>\S+)(?<third>\d{4})$/.match(x)
puts match_data['first'], match_data['second'], match_data['third']
yields
R2241
OOP
2003
as expected.
As long as your pattern covers all possibilities, then you just need to use the match object to return the 3 strings:
my_match = "R224_OO2003".match(/(.*?)(_)(.*?)(\d+)/)
#=> #<MatchData "R224_OO2003" 1:"R224" 2:"_" 3:"OO" 4:"2003">
puts my_match[0] #=> "R224_OO2003"
puts my_match[1] #=> "R224"
puts my_match[2] #=> "_"
puts my_match[3] #=> "OO"
puts my_match[4] #=> "2003"
A MatchData object holds each capture group starting at index [1]. As you can see, index [0] returns the entire match. If you don't want to capture the "_", you can leave its parentheses out.
Also, I'm not sure you are getting what you want with the part:
(.*?)
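This is a lazy (non-greedy) match: zero or more of any character, taking as few as possible while still letting the rest of the pattern match. Because it can match nothing at all, it may accept input you did not intend.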
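To see how the lazy (.*?) groups behave, here is a small comparison. The second input string and the anchored "strict" variant are assumptions made up for this illustration, not something from the question.

```ruby
loose  = /(.*?)_(.*?)(\d+)/        # the question's idea, without the group around "_"
strict = /\A(.*?)_(.*?)(\d{4})\z/  # assumed variant: anchored, exactly four trailing digits

p "R224_OO2003".match(loose).captures   # the lazy groups happen to give the wanted split
p "R224_O2O2003".match(loose).captures  # stops at the first digit it can find
p "R224_O2O2003".match(strict).captures # anchoring forces the last four digits
```

With the loose pattern the third group is satisfied by the first digit after the lazy group, so a digit inside the middle part cuts the match short.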
I'm trying to write a regular expression that will match a set of characters without regard to order. For example:
str = "act"
str.scan(/Insert expression here/)
would match:
cat
act
tca
atc
tac
cta
but would not match ca, ac or cata.
I read through a lot of similar questions and answers here on StackOverflow, but have not found one that matches my objectives exactly.
To clarify a bit, I'm using ruby and do not want to allow repeat characters.
Here is your solution
^(?:([act])(?!.*\1)){3}$
See it here on Regexr
^ # matches the start of the string
(?: # open a non capturing group
([act]) # The characters that are allowed and a capturing group
(?!.*\1) # That character is matched only if it does not occur once more, Lookahead assertion
){3} # Defines the amount of characters
$
The only special thing is the lookahead assertion, which ensures that a character is not repeated.
^ and $ are anchors to match the start and the end of the string.
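A possible way to try this from Ruby, generalized to any string of distinct characters. The helper name permutation_regex is made up for this sketch, and it uses \A and \z instead of ^ and $ so that only the whole string can match, not a single line inside it.

```ruby
# Hypothetical helper: builds the lookahead pattern for a string of distinct characters.
def permutation_regex(chars)
  set = Regexp.escape(chars)
  /\A(?:([#{set}])(?!.*\1)){#{chars.length}}\z/
end

regex = permutation_regex("act")
%w[cat act tca atc tac cta ca ac cata].each do |w|
  puts "#{w}: #{w =~ regex ? 'match' : 'no match'}"
end
```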
[act]{3} or ^[act]{3}$ will do it in most regular expression dialects. If you can narrow down the system you're using, that will help you get a more specific answer.
Edit: as mentioned by @georgydyer in the comments below, it's unclear from your question whether or not repeated characters are allowed. If not, you can adapt the answer from this question and get:
^(?=[act]{3}$)(?!.*(.).*\1).*$
That is, a positive lookahead to check a match, and then a negative lookahead with a backreference to exclude repeated characters.
Here's how I'd go about it:
regex = /\b(?:#{ Regexp.union(str.split('').permutation.map{ |a| a.join }).source })\b/
# => /\b(?:act|atc|cat|cta|tac|tca)\b/
%w[
cat act tca atc tac cta
ca ac cata
].each do |w|
puts '"%s" %s' % [w, w[regex] ? 'matches' : "doesn't match"]
end
That outputs:
"cat" matches
"act" matches
"tca" matches
"atc" matches
"tac" matches
"cta" matches
"ca" doesn't match
"ac" doesn't match
"cata" doesn't match
I use the technique of passing an array into Regexp.union for a lot of things; it works especially well with the keys of a hash, and passing the hash into gsub for rapid search/replace on text templates. This is the example from the gsub documentation:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
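As a sketch of the hash-keys idea applied to a text template (the placeholder syntax, the replacements hash and the template string are all invented for this example):

```ruby
replacements = { '{name}' => 'Alice', '{day}' => 'Monday' }

# Regexp.union escapes each string key, so the braces are matched literally.
pattern  = Regexp.union(replacements.keys)
template = 'Hello {name}, see you on {day}.'

# gsub looks up each matched placeholder in the hash.
result = template.gsub(pattern, replacements)
puts result
```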
Regexp.union creates a regex, and it's important to use source instead of to_s when extracting the actual pattern being generated:
puts regex.to_s
=> (?-mix:\b(?:act|atc|cat|cta|tac|tca)\b)
puts regex.source
=> \b(?:act|atc|cat|cta|tac|tca)\b
Notice how to_s embeds the pattern's flags inside the string. If you don't expect them you can accidentally embed that pattern into another, which won't behave as you expect. Been there, done that and have the dented helmet as proof.
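Here is a small sketch of how that can bite. The two-word list and variable names are made up for illustration; interpolating a Regexp into a regex literal uses to_s, so the embedded flags switch off the outer /i.

```ruby
words = Regexp.union(%w[act cat])

puts words.to_s    # includes the flag group
puts words.source  # just the pattern text

embedded_to_s   = /\b#{words}\b/i              # interpolation calls to_s
embedded_source = /\b(?:#{words.source})\b/i   # outer /i applies to the whole pattern

p "CAT" =~ embedded_to_s     # the inner (?-mix:...) keeps it case-sensitive
p "CAT" =~ embedded_source
```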
If you really want to have fun, look into the Perl Regexp::Assemble module available on CPAN. Using that, plus List::Permutor, lets us generate more complex patterns. On a simple string like this it won't save much space, but on long strings or large arrays of desired hits it can make a huge difference. Unfortunately, Ruby has nothing like this, but it is possible to write a simple Perl script with the word or array of words, and have it generate the regex and pass it back:
use List::Permutor;
use Regexp::Assemble;
my $regex_assembler = Regexp::Assemble->new;
my $perm = new List::Permutor split('', 'act');
while (my @set = $perm->next) {
$regex_assembler->add(join('', @set));
}
print $regex_assembler->re, "\n";
(?-xism:(?:a(?:ct|tc)|c(?:at|ta)|t(?:ac|ca)))
See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more information about using Regexp::Assemble with Ruby.
I will assume several things here:
- You are looking for permutations of given characters
- You are using ruby
str = "act"
permutations = str.split(//).permutation.map{|p| p.join("")}
# and for the actual test
permutations.include?("cat")
It is no regex though.
No doubt - the regex that uses positive/negative lookaheads and backreferences is slick, but if you're only dealing with three characters, I'd err on the side of verbosity by explicitly enumerating the character permutations like @scones suggested.
"act".split('').permutation.map(&:join)
=> ["act", "atc", "cat", "cta", "tac", "tca"]
And if you really need a regex out of it for scanning a larger string, you can always:
/\b(?:#{Regexp.union("act".split('').permutation.map(&:join)).source})\b/
=> /\b(?:act|atc|cat|cta|tac|tca)\b/
Obviously, this strategy doesn't scale if your search string grows, but it's much easier to observe the intent of code like this in my opinion.
EDIT: Added word boundaries for false positive on cata based on @theTinMan's feedback.
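If the growth of the permutation list is a concern and you only need to test whole strings, a different technique (not a regex, and not the permutation approach above) is to compare sorted characters. The helper name same_letters? is made up for this sketch.

```ruby
# Hypothetical helper: true when candidate uses exactly the characters
# of letters, each the same number of times, in any order.
def same_letters?(letters, candidate)
  letters.chars.sort == candidate.chars.sort
end

%w[cat act tca ca ac cata].each do |w|
  puts "#{w}: #{same_letters?('act', w)}"
end
```

This does one sort per comparison instead of generating every permutation, but it cannot scan for matches inside a larger string the way a regex can.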
Here is one output that requires multiple regex evaluations but gets what I want to do done (remove everything except the text).
words = IO.read("file.txt").
gsub(/\s/, ""). # delete white spaces
gsub(".",""). # delete periods
gsub(",",""). # delete commas
gsub("?","") # delete Q marks
puts words
# output
# WheninthecourseofhumaneventsitbecomesnecessaryIwanttobelieveyoureallyIdobutwhoamItoblameWhenthefactsarecountedthenumberswillbereportedLotsoflaughsCharlieIthinkIheardthatonetentimesbefore
Looking at this post - Ruby gsub : is there a better way - I figured I would try to do a match to accomplish the same result without multiple regex evaluations. But I don't get the same output.
words = IO.read("file.txt").
match(/(\w*)+/)
puts words
# output - this only gets the first word
# When
And this only gets the first sentence:
words = IO.read("file.txt").
match(/(...*)+/)
puts words
# output - this only gets the first sentence
# When in the course of human events it becomes necessary.
Any suggestions on getting the same output (including stripping out white spaces and non-word characters) on a match rather than gsub?
You can do what you want in one gsub operation:
s = 'When in the course of human events it becomes necessary.'
s.gsub /[\s.,?]/, ''
# => "Wheninthecourseofhumaneventsitbecomesnecessary"
You don't need multiple regex evaluations for this.
str = "# output - this only gets the first sentence
# When in the course of human events it becomes necessary."
p str.gsub(/\W/, "")
#=>"outputthisonlygetsthefirstsentenceWheninthecourseofhumaneventsitbecomesnecessary"
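Since the question asked for a match-style approach rather than gsub, one possible sketch is String#scan, which collects every match instead of only the first, followed by join. The sample text is made up for this example. Note that \w also keeps digits and underscores.

```ruby
text = "When in the course of human events, it becomes necessary. Really?"

# scan returns an array of every run of word characters; join glues them together.
words = text.scan(/\w+/).join
puts words
```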
I want to do something like this
def get_count(string)
string.split(' ').count
end
I think there might be a better way; String may have a built-in method to do this.
Array#count does work here, but length (or size) is the more conventional way to get the number of elements:
def get_count(string)
string.split(' ').length
end
Edit: If your string is really long, splitting it creates an array that needs extra memory, so here's a way that counts the spaces without building one:
def get_count(string)
(0..(string.length-1)).inject(1){|m,e| m += string[e].chr == ' ' ? 1 : 0 }
end
If the only word boundary is a single space, just count them.
puts "this sentence has five words".count(' ')+1 # => 5
If there are spaces, line endings, tabs, commas followed by a space etc. between the words, then scanning for word boundaries is a possibility:
puts "this, is.\tfour words".scan(/\b/).size/2 # => 4
I know this is an old question, but this might help someone stumbling here. Counting words is a complicated problem. What is a "word"? Do numbers and special characters count as words? Etc...
I wrote the words_counted gem for this purpose. It's a highly flexible, customizable string analyser. You can ask it to analyse any string for word count, word occurrences, and exclude words/characters using regexp, strings, and arrays.
counter = WordsCounted::Counter.new("Hello World!", exclude: "World")
counter.word_count #=> 1
counter.words #=> ["Hello"]
Etc...
The documentation and full source are on Github.
Using a regular expression will also cover multiple spaces:
sentence.split(/\s+/).size
String doesn't have anything pre-built to do what you wanted. You can define a method in your class or extend the String class itself for what you want to do:
def word_count( string )
return 0 if string.empty?
string.split.size
end
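A minimal sketch of the "extend the String class" option. Reopening a core class affects the whole program, so treat this as an illustration rather than a recommendation; the method name word_count is simply chosen to mirror the function above.

```ruby
class String
  # Counts whitespace-separated words; split with no argument ignores
  # leading, trailing and repeated whitespace.
  def word_count
    split.size
  end
end

puts "this sentence has five words".word_count
puts "".word_count
```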
Regex split on any non-word character:
string.split(/\W+/).size
...although it counts a word containing an apostrophe as two words, so depending on how small the margin of error needs to be, you might want to build your own regular expression.
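One possible way to keep apostrophes inside words is to scan for words instead of splitting on separators. The WORD pattern and sample sentence below are only a sketch under that assumption; other punctuation such as hyphens is not handled.

```ruby
# A word is a run of word characters, optionally followed by
# apostrophe-plus-word-characters groups (don't, rock'n'roll).
WORD = /\w+(?:'\w+)*/

sentence = "don't stop"
puts sentence.split(/\W+/).size # apostrophe splits "don't" in two
puts sentence.scan(WORD).size   # "don't" stays one word
```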
I recently found that String#count is faster than splitting up the string by over an order of magnitude.
Unfortunately, String#count only accepts a string, not a regular expression. Also, it would count two adjacent spaces as two things, rather than a single thing, and you'd have to handle other whitespace characters separately.
p " some word\nother\tword.word|word".strip.split(/\s+/).size #=> 4
I'd rather check for word boundaries directly:
"Lorem Lorem Lorem".scan(/\w+/).size
=> 3
If you need to match rock-and-roll as one word, you could do like:
"Lorem Lorem Lorem rock-and-roll".scan(/[\w-]+/).size
=> 4