How do I split apart a CSV string in Ruby? - ruby

I have this line as an example from a CSV file:
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
I want to split it into an array. The immediate thought is to just split on commas, but some of the strings have commas in them, e.g. "Life and Living Processes, Life Processes", and these should stay as single elements in the array. Note also that there are two commas with nothing in between; I want to get these as empty strings.
In other words, the array I want to get is
[2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes","","",1,0,"endofline"]
I can think of hacky ways involving eval but I'm hoping someone can come up with a clean regex to do it...
cheers, max

This is not a suitable task for regular expressions. You need a CSV parser, and Ruby has one built in:
http://ruby-doc.org/stdlib/libdoc/csv/rdoc/classes/CSV.html
And an arguably superior 3rd-party library:
http://fastercsv.rubyforge.org/

str=<<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF
require 'csv' # built in
p CSV.parse(str)
# That's it! However, empty fields appear as nil.
# Makes sense to me, but if you insist on empty strings then do something like:
parser = CSV.new(str)
parser.convert{|field| field.nil? ? "" : field}
p parser.readlines
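If empty strings are wanted rather than nil, another option is to convert the fields after parsing instead of inside the parser. This is a minimal sketch, assuming a single-line input; the names line and fields are illustrative. Note that CSV returns every field as a string, so 2412 arrives as "2412" unless it is converted separately.

```ruby
require 'csv'

line = '2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"'

# CSV.parse_line parses one row into an array; empty unquoted fields come back as nil.
# nil.to_s is "", and to_s leaves the String fields unchanged.
fields = CSV.parse_line(line).map(&:to_s)

p fields
```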

EDIT: I failed to read the Ruby tag. The good news is, the guide will explain the theory behind building this, even if the language specifics aren't right. Sorry.
Here is a fantastic guide to doing this:
http://knab.ws/blog/index.php?/archives/10-CSV-file-parser-and-writer-in-C-Part-2.html
and the csv writer is here:
http://knab.ws/blog/index.php?/archives/3-CSV-file-parser-and-writer-in-C-Part-1.html
These examples cover the case of having a quoted literal in a csv (which may or may not contain a comma).

text=<<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF
x=[]
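# "\042" is the double-quote character written as an octal escape.
text.chomp.split("\042").each_with_index do |y,i|
  # Even-numbered pieces lie outside quotes, so split them on commas; odd ones are quoted fields.
  i%2==0 ? x<< y.split(",") : x<<y
end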
print x.flatten
output
$ ruby test.rb
["2412", "21", "Which of the following is not found in all cells?", "Curriculum", "Life and Living Processes, Life Processes", "", "", "", "1", "0", "endofline"]

This morning I stumbled across a CSV Table Importer project for Ruby on Rails. You may find the code helpful:
Github TableImporter

My preference is @steenstag's solution, but an alternative is to use String#scan with the following regular expression.
r = /(?<![^,])(?:(?!")[^,\n]*(?<!")|"[^"\n]*")(?![^,])/
If the variable str holds the string given in the example, we obtain:
puts str.scan r
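displays the following (the two empty fields are matched as empty strings, which puts prints as blank lines; they are not shown here)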
2412
21
"Which of the following is not found in all cells?"
"Curriculum"
"Life and Living Processes, Life Processes"
1
0
"endofline"
Start your engine!
See also regex101 which provides a detailed explanation of each token of the regex. (Move your cursor across the regex.)
Ruby's regex engine performs the following operations.
(?<![^,]) : negative lookbehind asserts current location is not preceded
by a character other than a comma
(?: : begin non-capture group
(?!") : negative lookahead asserts next char is not a double-quote
[^,\n]* : match 0+ chars other than a comma and newline
(?<!") : negative lookbehind asserts preceding character is not a
double-quote
| : or
" : match double-quote
[^"\n]* : match 0+ chars other than double-quote and newline
" : match double-quote
) : end of non-capture group
(?![^,]) : negative lookahead asserts current location is not followed
by a character other than a comma
Note that (?<![^,]) is equivalent to (?<=,|\A) and (?![^,]) is equivalent to (?=,|\z).
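To try the scan approach end to end, here is a small sketch. It assumes the line has no trailing newline, because the final lookahead only accepts a comma or the end of the string. The names line, raw and fields are illustrative, and the quote-stripping step is an addition, not part of the regex above.

```ruby
r = /(?<![^,])(?:(?!")[^,\n]*(?<!")|"[^"\n]*")(?![^,])/

line = '2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"'

# scan returns every field; quoted fields keep their quotes and empty fields are "".
raw = line.scan(r)

# Drop the surrounding double quotes from quoted fields; other fields pass through unchanged.
fields = raw.map { |f| f.sub(/\A"(.*)"\z/, '\1') }

p fields
```

If the line comes from a heredoc or a file, calling chomp on it first removes the trailing newline that would otherwise stop the last field from matching.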

Related

Working with Ruby class: Capitalizing a string

I'm trying to get my head around how to work with Classes in Ruby and would really appreciate some insight on this area. Currently, I've got a rather simple task to convert a string with the start of each word capitalized. For example:
Not Jaden-Cased: "How can mirrors be real if our eyes aren't real"
Jaden-Cased: "How Can Mirrors Be Real If Our Eyes Aren't Real"
This is my code currently:
class String
def toJadenCase
split
capitalize
end
end
#=> usual case: split.map(&:capitalize).join(' ')
Output:
Expected: "The Moment That Truth Is Organized It Becomes A Lie.",
instead got: "The moment that truth is organized it becomes a lie."
I suggest you not pollute the core String class with the addition of an instance method. Instead, just add an argument to the method to hold the string. You can do that as follows, by downcasing the string then using gsub with a regular expression.
def to_jaden_case(str)
  str.downcase.gsub(/(?<=\A| )[a-z]/) { |c| c.upcase }
end
to_jaden_case "The moMent That trUth is organized, it becomes a lie."
#=> "The Moment That Truth Is Organized, It Becomes A Lie."
Ruby's regex engine performs the following operations.
(?<=\A| ) : use a positive lookbehind to assert that the following match
is immediately preceded by the start of the string or a space
[a-z] : match a lowercase letter
(?<=\A| ) can be replaced with the negative lookbehind (?<![^ ]), which asserts that the match is not preceded by a character other than a space.
Notice that by using String#gsub with a regular expression (unlike the split-process-join dance), extra spaces are preserved.
When spaces are to be matched by a regular expression one often sees whitespaces (\s) matched instead. Here, for example, /(?<=\A|\s)[a-z]/ works fine, but sometimes matching whitespaces leads to problems, mainly because they also match newlines (\n) (as well as spaces, tabs and a few other characters). My advice is to match space characters if spaces are to be matched. If tabs are to be matched as well, use a character class ([ \t]).
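The difference between the approaches can be seen with a string that contains a double space and a newline. This is a rough sketch; the helper names jaden_with_gsub and jaden_with_split are hypothetical, and the whitespace variant is written as (?<!\S), which asserts that the match is not preceded by a non-whitespace character.

```ruby
# Hypothetical helper names, used only to compare the approaches.
def jaden_with_gsub(str)
  # Upcase a letter only at the start of the string or directly after a space.
  str.downcase.gsub(/(?<![^ ])[a-z]/) { |c| c.upcase }
end

def jaden_with_split(str)
  # split with no argument splits on any run of whitespace, so spacing is normalized.
  str.split.map(&:capitalize).join(' ')
end

text = "how  can mirrors\nbe real"

space_only = jaden_with_gsub(text)
split_join = jaden_with_split(text)
# Whitespace variant: the letter after the newline is capitalized too.
any_space  = text.gsub(/(?<!\S)[a-z]/) { |c| c.upcase }

p space_only, split_join, any_space
```

The gsub version keeps the double space and the newline, the split version collapses both to single spaces, and the whitespace variant also capitalizes the word that follows the newline.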
Try:
def toJadenCase
  self.split.map(&:capitalize).join(' ')
end

Could someone please explain the following Ruby code to me in detail?

I nearly had this challenge on Code Wars in the bag but, I blew it because my knowledge of gsub is sub-par at best. While I roughly understand the concept of gsub, I would like a more thorough understanding of it (different ways you can use it could be helpful to my development) as well as a bit by bit explanation of the code below.
def autocorrect(input)
  input.gsub(/\b(you+|u)\b/i, 'your sister')
end
gsub scans the whole string, finds every match of the regular expression shown, and replaces each one with the second argument, which in this case is "your sister". Regular expressions can be a bit tricky, but essentially this one says:
/ #starts the regexp
\b #a word boundary
(you+|u) #either 'yo' followed by one or more of the letter 'u' (so 'you' and 'youuuuu' both fit), or just the letter 'u' alone, without a 'y' or 'o'. The pipe symbol is an "or" in a regexp, taking one alternative or the other for a match.
\b #another word boundary, finishing the word
/i #closes the expression; the trailing i makes the match case-insensitive.
Check out Rubular for tips. http://rubular.com/
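Since the question also asks about different ways to use gsub, here is a short sketch of the common forms: a plain string replacement, a replacement with backreferences, a block, and a hash. The sample strings and variable names are made up for illustration.

```ruby
def autocorrect(input)
  input.gsub(/\b(you+|u)\b/i, 'your sister')
end

# Plain replacement: every whole-word "u" or "you", "youu", ... is replaced.
corrected = autocorrect("I miss u, Youuu know")
# "youtube" is left alone because no word boundary follows "you".
untouched = autocorrect("youtube")

# Backreferences: \1 and \2 refer to the capture groups (single quotes keep the backslashes literal).
swapped = "John Smith".gsub(/(\w+) (\w+)/, '\2 \1')

# Block form: the block receives each match and returns its replacement.
titled = "hello world".gsub(/\b\w/) { |c| c.upcase }

# Hash form: each match is looked up in the hash.
mapped = "cat".gsub(/[ca]/, 'c' => 'b', 'a' => 'i')

p corrected, untouched, swapped, titled, mapped
```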

Regular expression to find first letter in a string

Consider this example string:
mystr ="1. moody"
I want to capitalize the first letter that occurs in mystr. I am trying this regular expression in Ruby, but it still returns all the letters in mystr (moody) instead of the letter m only.
puts mystr.scan(/[a-zA-Z]{1}/)
Any help appreciated!
Do as below using String#sub
(arup~>~)$ pry --simple-prompt
>> s = "1. moody"
=> "1. moody"
>> s.sub(/[a-z]/i,&:upcase)
=> "1. Moody"
>>
If you want to modify the source string in place, use s.sub!(/[a-z]/i,&:upcase).
Just for completeness: although it doesn’t directly answer your question as posed, it could be relevant. Consider this variation:
mystr ="1. école"
The line mystr.sub(/[a-z]/i,&:upcase) (as in Arup Rakshit’s answer) will match the second letter of the word, producing
1. éCole
The line mystr.sub /\b\s?[a-zA-Z]{1}/, &:upcase (diego.greyrobot’s answer) won’t match at all and so the line will be unchanged.
There are two problems here. The first is that [a-zA-Z] doesn’t match accented characters, so é isn’t matched. The fix for this is to use the \p{Letter} character property:
mystr.sub /\p{Letter}/, &:upcase
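This will match the character in question but, on Ruby versions before 2.4, won’t change it. This is due to the second problem, which is that upcase (and downcase) only worked on characters in the ASCII range in those versions; from Ruby 2.4 onwards they handle Unicode letters such as é by default. On older versions this is almost as easy to fix, but relies on using an external library such as unicode_utils: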
require 'unicode_utils'
mystr.sub(/\p{Letter}/) { |c| UnicodeUtils.upcase(c)}
This results in:
1. École
which is probably what is wanted in this case.
This may not affect you if you are sure all your data is just ASCII, but is worth knowing for other situations.
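On Ruby 2.4 and later, String#upcase applies Unicode case mapping by default, so the external library may not be needed there. A minimal sketch, assuming Ruby 2.4+ and a UTF-8 source file; the variable names are illustrative.

```ruby
mystr = "1. école"

# The ASCII-only class skips é and upcases the c instead.
ascii_class = mystr.sub(/[a-z]/i, &:upcase)

# \p{Letter} matches é, and upcase maps it to É on Ruby 2.4+.
unicode_class = mystr.sub(/\p{Letter}/, &:upcase)

p ascii_class, unicode_class
```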
Your attempt returns all the letters because scan does just that: it returns every substring that matches the regex, which in your case means every letter. For your use case you should use sub, since you only want to substitute one letter.
I use http://rubular.com to practice my Ruby Regexes. Here's what I came up with http://rubular.com/r/fAQEDFVEVn
The regex is: /\b[a-z]/
It uses \b to find a word boundary, and then asks for a single lowercase letter with [a-z]
Finally we'll use sub to replace it with its upcased version:
"1. moody".sub /\b[a-z]/, &:upcase
=> "1. Moody"
Hope that helps.
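The difference between scan and sub, which the answers above rely on, can be checked side by side. A small sketch using the string from the question; String#[] with a regex is included as one more way to get only the first match.

```ruby
mystr = "1. moody"

# scan collects every match, so each letter comes back as its own element.
all_letters = mystr.scan(/[a-zA-Z]/)

# String#[] with a regex returns only the first match (or nil if there is none).
first_letter = mystr[/[a-zA-Z]/]

# sub replaces only the first match and returns a new string.
capitalized = mystr.sub(/[a-zA-Z]/) { |c| c.upcase }

p all_letters, first_letter, capitalized
```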

Working with Regular Expressions - Repeating Patterns

I am trying to use regular expressions to match some text.
The following pattern is what I am trying to gather.
#Identifier('VariableA', 'VariableB', 'VariableX', ..., 'VariableZ')
I would like to grab a dynamic number of variables rather than a fixed set of two or three.
Is there any way to do this? I have an existing Regular Expression:
\#(\w+)\W+(\w+)\W+(\w+)\W+(\w+)
This captures the Identifier and up to three variables.
Edit: Is it just me, or are regular expressions not as powerful as I'm making them out to be?
You want to use scan for this sort of thing. The basic pattern would be this:
s.scan(/\w+/)
That would give you an array of all the contiguous sequences for word characters:
>> "#Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ')".scan(/\w+/)
=> ["Identifier", "VariableA", "VariableB", "VariableX", "VariableZ"]
You say you might have multiple instances of your pattern with arbitrary stuff surrounding them. You can deal with that with nested scans:
s.scan(/#(\w+)\(([^)]+?)\)/).map { |m| [ m.first, m.last.scan(/\w+/) ] }
That will give you an array of arrays; each inner array will have the "Identifier" part as the first element and the "Variable" parts as an array in the second element. For example:
>> s = "pancakes #Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ') pancakes #Pancakes('one','two','three') eggs"
>> s.scan(/#(\w+)\(([^)]+?)\)/).map { |m| [ m.first, m.last.scan(/\w+/) ] }
=> [["Identifier", ["VariableA", "VariableB", "VariableX", "VariableZ"]], ["Pancakes", ["one", "two", "three"]]]
If you might be facing escaped quotes inside your "Variable" bits then you'll need something more complex.
Some notes on the expression:
# # A literal "#".
( # Open a group
\w+ # One or more ("+") word characters ("\w").
) # Close the group.
\( # A literal "(", parentheses are used for grouping so we escape it.
( # Open a group.
[ # Open a character class.
^) # The "^" at the beginning of a [] means "not", the ")" isn't escaped because it doesn't have any special meaning inside a character class.
] # Close a character class.
+? # One or more of the preceding pattern but don't be greedy.
) # Close the group.
\) # A literal ")".
You don't really need [^)]+? here, just [^)]+ would do but I use the non-greedy forms by habit because that's usually what I mean. The grouping is used to separate the #Identifier and Variable parts so that we can easily get the desired nested array output.
But alex thinks that you meant you wanted to capture the same thing four times. If you want to capture the same pattern, but different things, then you may want to consider two things:
Iteration. In perl, you can say
while ($variable =~ /regex/g) {
the 'g' stands for 'global', and means that each time the regex is called, it matches the /next/ instance.
The other option is recursion. Write your regex like this:
/(what you want)(.*)/
Then, you have backreference 1 containing the first thing, which you can push to an array, and backreference 2 which you'll then recurse over until it no longer matches.
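The recursive idea can be sketched in Ruby as well, although String#scan is usually the shorter route there. The method name collect_words is hypothetical, and the /m flag is an addition so that .* also spans newlines.

```ruby
# Hypothetical helper: take the first word, then recurse over the rest of the text.
def collect_words(text, found = [])
  m = /(\w+)(.*)/m.match(text)
  return found if m.nil?            # nothing left to match: recursion ends
  collect_words(m[2], found + [m[1]])
end

words = collect_words("#Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ')")
p words
```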
You may use simply (\w+).
Given the input string
#Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ')
The results would be:
Identifier
VariableA
VariableB
VariableX
VariableZ
This would work for an arbitrary number of variables.
For future reference, it's easy and fun to play around with regexp ideas on Rubular.
So you are asking if there is a way to capture both the identifier and an arbitrary number of variables. I am afraid that you can only do this with regex engines that support captures. Note here that captures and capturing groups are not one and the same thing. You want to remember all the "variables". This can't be done with simple capturing groups.
I am unaware whether Ruby supports this or not, but I am sure that .NET and the new PERL 6 support it.
In your case you could use two regexes. One to capture the identifier e.g. ^\s*#(\w+)
and another one to capture all variables e.g. result = subject.scan(/'[^']+'/)
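Put together in Ruby, the two-regex approach might look like the sketch below. The variable names are illustrative, and a capture group is added to the second regex so that the quotes are left out of the results.

```ruby
subject = "#Identifier('VariableA', 'VariableB', 'VariableX', 'VariableZ')"

# First regex: String#[] with a group index returns just the captured identifier.
identifier = subject[/^\s*#(\w+)/, 1]

# Second regex: scan returns one single-element array per match when there is a group, so flatten it.
variables = subject.scan(/'([^']+)'/).flatten

p identifier, variables
```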

Strip words beginning with a specific letter from a sentence using regex

I'm not sure how to use regular expressions in a function so that I could grab all the words in a sentence starting with a particular letter. I know that I can do:
word =~ /^#{letter}/
to check if the word starts with the letter, but how do I go from word to word? Do I need to convert the string to an array and then iterate through each word, or is there a faster way using regex? I'm using Ruby, so that would look like:
matching_words = Array.new
sentence.split(" ").each do |word|
  matching_words.push(word) if word =~ /^#{letter}/
end
Scan may be a good tool for this:
#!/usr/bin/ruby1.8
s = "I think Paris in the spring is a beautiful place"
p s.scan(/\b[it][[:alpha:]]*/i)
# => ["I", "think", "in", "the", "is"]
\b means "word boundary".
[[:alpha:]] matches any upper- or lowercase letter.
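To use a letter held in a variable, as in the question, the scan pattern can be built with interpolation. A minimal sketch; the method name words_starting_with is hypothetical, and Regexp.escape is there in case the "letter" is a character with special meaning in a regex.

```ruby
# Hypothetical helper: return every word that starts with the given letter, ignoring case.
def words_starting_with(sentence, letter)
  sentence.scan(/\b#{Regexp.escape(letter)}[[:alpha:]]*/i)
end

s = "I think Paris in the spring is a beautiful place"
t_words = words_starting_with(s, "t")
i_words = words_starting_with(s, "i")

p t_words, i_words
```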
You can use \b. It matches word boundaries--the invisible spot just before and after a word. (You can't see them, but oh they're there!) Here's the regex:
/\b(a\w*)\b/
The \w matches a word character: a letter, a digit, or an underscore.
You can see me testing it here: http://rubular.com/regexes/13347
Similar to Anon.'s answer:
/\b(a\w*)/g
and then see all the results with (usually) $n, where n is the n-th hit. Many libraries will return /g results as arrays on the $n-th set of parenthesis, so in this case $1 would return an array of all the matching words. You'll want to double-check with whatever library you're using to figure out how it returns matches like this, there's a lot of variation on global search returns, sadly.
As to the \w vs [a-zA-Z], you can sometimes get faster execution by using the built-in definitions of things like that, as it can easily have an optimized path for the preset character classes.
The /g at the end makes it a "global" search, so it'll find more than one. It's still restricted by line in some languages / libraries, though, so if you wish to check an entire file you'll sometimes need /gm, to make it multi-line
If you want to remove results, like your title (but not question) suggests, try:
/\ba\w*//g
which does a search-and-replace in most languages (/<search>/<replacement>/). Sometimes you need a "s" at the front. Depends on the language / library. In Ruby's case, use:
string.gsub(/(\b)a\w*(\b)/, "\\1\\2")
to retain the non-word characters, and optionally put any replacement text between \1 and \2. gsub for global, sub for the first result.
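For plain removal, a simpler variant is to replace each matching word, together with the whitespace after it, with an empty string. This is a sketch under that assumption; the method name strip_words_starting_with is hypothetical. Because \b also sits after an apostrophe, a word like "don't" would lose its trailing "t" when stripping the letter t.

```ruby
# Hypothetical helper: delete every word starting with the given letter, ignoring case.
def strip_words_starting_with(sentence, letter)
  # \s* also removes the whitespace after each deleted word; strip trims what is left at the ends.
  sentence.gsub(/\b#{Regexp.escape(letter)}\w*\s*/i, "").strip
end

without_t = strip_words_starting_with("this is a test", "t")
without_a = strip_words_starting_with("An apple a day", "a")

p without_t, without_a
```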
/\ba[a-z]*\b/i
will match any word starting with 'a'.
The \b indicates a word boundary - we want to only match starting from the beginning of a word, after all.
Then there's the character we want our word to start with.
Then we have as many as possible letter characters, followed by another word boundary.
To match all words starting with t, use:
\bt\w+
That will match test but not footest; \b means "word boundary".
Personally I think that regex is overkill for this application; simply running a select is more than capable of solving this particular problem.
"this is a test".split(' ').select{ |word| word[0,1] == 't' }
result => ["this", "test"]
or if you are determined to use regex then go with grep
"this is a test".split(' ').grep(/^t/)
result => ["this", "test"]
Hope this helps.
