String#split in Ruby not behaving as expected - ruby

File.open(path, 'r').each do |line|
row = line.chomp.split('\t')
puts "#{row[0]}"
end
path is the path of a file whose content looks like: name, age, profession, hobby
I'm expecting output to be name only but I am getting the whole line.
Why is it so?

The question already has an accepted answer, but it's worth noting what the cause of the original problem was:
This is the problem part:
split('\t')
Ruby has several forms of quoted strings, and the differences between them are usually useful ones.
Quoting from Ruby Programming at wikibooks.org:
...double quotes are designed to
interpret escaped characters such as
new lines and tabs so that they appear
as actual new lines and tabs when the
string is rendered for the user.
Single quotes, however, display the
actual escape sequence, for example
displaying \n instead of a new line.
Read further in the linked article to see the use of %q and %Q strings. Or Google for "ruby string delimiters", or see this SO question.
So '\t' is interpreted as "backslash+t", whereas "\t" is a tab character.
String#split will also take a Regexp, which in this case might remove the ambiguity:
split(/\t/)
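A minimal sketch of the difference, assuming a tab-separated line like the one described in the question (the sample data is made up for illustration):

```ruby
# Hypothetical sample line; the real file contents are an assumption.
line = "name\tage\tprofession\thobby\n"

puts '\t'.length  # 2: a backslash followed by the letter t
puts "\t".length  # 1: a single tab character

# The line contains no literal backslash+t, so nothing is split here.
single_quoted = line.chomp.split('\t')
# Double quotes produce a real tab, so the fields are separated.
double_quoted = line.chomp.split("\t")
# Inside a Regexp literal, \t always means a tab.
with_regexp = line.chomp.split(/\t/)

puts single_quoted.length  # 1
puts double_quoted[0]      # name
puts with_regexp[0]        # name
```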

Your question was not very clear.
split("\n") - if you want to split by lines
split - if you want to split by whitespace
As far as I can tell, you also do not need chomp in that case, because a bare split already discards the trailing "\n".

Related

Match & includes? method

My code is about a robot that has 3 possible answers, depending on what you put in the message.
One of these possible answers depends on whether the input is a question, and to check that, I think it has to identify the "?" symbol in the string.
Should I use the "match" method or includes?
This code is going to be included in a loop that may answer in 3 possible ways.
Example:
puts "whats your meal today?"
answer = gets.chomp
answer.includes? "?"
or
answer.match('?')
Take a look at String#end_with? I think that is what you should use.
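A small sketch of how these checks differ, assuming the goal is to detect input that ends in a question mark. Note that the method is spelled include? (no trailing "s"), and that match('?') raises a RegexpError because a bare ? is a quantifier in a regular expression. The helper name below is hypothetical:

```ruby
# Hypothetical helper name, used only for this illustration.
def ends_with_question_mark?(text)
  text.end_with?("?")
end

answer = "whats your meal today?"

puts answer.include?("?")                    # true: "?" occurs somewhere in the string
puts ends_with_question_mark?(answer)        # true: "?" is the last character

# include? cannot tell where the "?" is, end_with? can.
puts "really? no".include?("?")              # true
puts ends_with_question_mark?("really? no")  # false
```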
Use String#match? Instead
String#chomp will only remove OS-specific newlines from a String, but neither String#chomp nor String#end_with? will handle certain edge cases like multi-line matches or strings where you have whitespace characters at the end. Instead, use a regular expression with String#match?. For example:
print "Enter a meal: "
answer = gets.chomp
answer.match?(/\?\s*\z/m)
With the Regexp literal /\?\s*\z/m, String#match? returns true if the (possibly multi-line) String in answer contains:
a literal question mark (which is why it's escaped)...
followed by zero or more whitespace characters...
anchored to the end-of-string with or without newline characters, e.g. \n or \r\n, although those will generally have been removed by #chomp already.
This will be more robust than your current solution, and will handle a wider variety of inputs while being more accurate at finding strings that end with a question mark without regard to trailing whitespace or line endings.
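As a rough illustration of the cases described above, here is a sketch that wraps the same pattern in a helper (the constant and method names are made up for this example):

```ruby
# Hypothetical names; the pattern is the one from the answer above.
TRAILING_QUESTION = /\?\s*\z/m

def question?(text)
  text.match?(TRAILING_QUESTION)
end

puts question?("lunch?")                       # true
puts question?("lunch?  \n")                   # true: trailing whitespace is ignored
puts question?("first line\nsecond line?\r\n") # true: works on multi-line input
puts question?("lunch? no")                    # false: "?" is not at the end
puts "lunch? ".end_with?("?")                  # false: end_with? is stricter
```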

Tokenise lines with quoted elements in Ruby

I need to tokenise strings in Ruby - string.split is almost perfect, except some of the strings may be enclosed in double-quotes, and within them, whitespace should be preserved. In the absence of lex for Ruby (correct?), writing a character-by-character tokenizer seems silly. What are my options?
I want a loop that's essentially:
while !file.eof:
line = file.readline
tokens = line.tokenize() # like split() but handles "some thing" as one token
end
I.e. an array of whitespace-delimited fields, but with correct handling of quoted sequences. Note there is no escape sequence for the quotes I need to handle.
The best I can imagine so far is repeatedly match()ing a regex which matches either the quoted sequence or everything until the next whitespace character, but even then I'm not sure how to formulate that neatly.
As Andrew said, the most straightforward way is to parse the input with the stock CSV library and set appropriate :col_sep and :quote_char options.
If you insist on parsing manually, you may use the following pattern in a more Ruby-like way:
file.each do |line|
tokens = line.scan(/\s*("[^"]+")|(\w+)/).flatten.compact
# do whatever with array of tokens
end
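A variation on the scan approach, offered as a sketch rather than a drop-in replacement: it assumes that quoted tokens should come back without their quotes and that unquoted tokens may contain any non-whitespace characters (not only \w). The method name tokenize is borrowed from the question's pseudocode.

```ruby
# Sketch: split a line into tokens, treating "quoted text" as one token.
# Assumes quotes cannot be escaped, as stated in the question.
def tokenize(line)
  # Group 1 captures the inside of a quoted token, group 2 a bare token.
  line.scan(/"([^"]*)"|(\S+)/).map { |quoted, bare| quoted || bare }
end

p tokenize(%q(foo "some thing" bar-baz))  # ["foo", "some thing", "bar-baz"]
p tokenize("  one   two  \n")              # ["one", "two"]
```

This does not try to handle a quote that starts in the middle of a bare token, such as a"b c".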
split accepts a regex, so you could just write the regexp you want and call split on the line you just read.
line.split(/\s+/)
Try using Ruby's CSV library, and use a space (" ") as the :col_sep
:col_sep
The String placed between each field. This String will be transcoded
into the data’s Encoding before parsing.
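A minimal sketch of the CSV suggestion, assuming fields are separated by exactly one space and quoted with double quotes (the default quote character). The sample line is made up:

```ruby
require "csv"

# Hypothetical input line with one quoted field.
line = 'foo "some thing" bar'

# col_sep tells the parser to treat a single space as the field separator.
tokens = CSV.parse_line(line, col_sep: " ")

p tokens  # ["foo", "some thing", "bar"]
```

One caveat with this approach: runs of several spaces are not collapsed, since each separator delimits a field, so input with irregular spacing may need to be normalized first.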

Semantic differences between percent literals and heredocs in Ruby?

Looking at some documentation, I saw a multiline string defined using a percent literal:
command %Q{
do this;
do that;
}
In the past, I've always used heredocs when I needed multiline strings:
command <<-heredoc
echo "stuff" | do stuff;
heredoc
What are the semantic differences between them? Is there any reason why I would want to use %Q and not a heredoc?
I tend to evaluate how much text is being used when deciding which to use.
I use %Q when there's not a lot of text (for example, a single line), e.g. %Q|foobar|. The value that %Q provides, is it allows you to easily mix quotes, e.g.
%Q|"Get a Job" ~Mom's words|
I use "heredoc"s when there is a lot of text that spans multiple lines.
For example, suppose you're pasting a lot of text into a REPL (like the content of a YAML file). Unless you traverse the whole file, you can't be certain whether or not you will have a conflict with whatever %Q separator you have chosen. With a "heredoc" you just use some really obscure piece of text that you're fairly certain will not have a conflict, e.g.
<<-BatMobilePrettyObscure
... Lots of text ...
BatMobilePrettyObscure
As far as I know, semantically, there are just a few small differences:
%Q can only use one character to delimit strings
%Q can be multi-line or single-line
"heredoc"s must be multi-line, with the closing "heredoc" standing alone
%Q delimiters can be "mashed" up against their strings, e.g. %Q|foobar|
There's a funky trick that you can use with heredocs: the first line can be used as if it was a complete string. For example, all of the following examples are valid Ruby code:
puts(<<-EOS)
Hello, world!
EOS
<<-EOS.upcase
Hello, world!
EOS
puts(<<-EOS.upcase)
Hello, world!
EOS
However, you will not find that very often in the wild. Other than that, they are the same as double quoted strings or %Q{} and %{} literals, except that you can choose multi-character delimiters. This comes in handy when all of the possible percent literal delimiters may occur in the string. This especially applies to long strings.
There isn't really a semantic difference, and it doesn't have to do with multiline strings either. All strings can be multiline in Ruby. These are all the same string:
'a
b
'
"a
b
"
%Q{a
b
}
<<-heredoc
a
b
heredoc
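The claim can be checked directly. This sketch assigns each form to a variable (the variable names are arbitrary) and compares them:

```ruby
single = 'a
b
'

double = "a
b
"

percent = %Q{a
b
}

from_heredoc = <<-HEREDOC
a
b
HEREDOC

# Every form yields the same three-character-plus-newline content.
p single                                               # "a\nb\n"
p [double, percent, from_heredoc].all? { |s| s == single }  # true
```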
The question of which to use is decided by whether you need interpolation and the convenience of escaping characters. For example:
Do you need interpolation? If not then '' or %q()
Will there be lots of quote characters to escape? Then use %Q()
Do you want to write a lot of text without thinking about escaping characters? Use heredocs.
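A short sketch of that decision list, using a made-up variable name to show where interpolation happens and where quotes need no escaping:

```ruby
name = "world"  # hypothetical value used for interpolation

plain   = 'Hello, #{name}!'               # single quotes: no interpolation
raw     = %q(Hello, #{name}!)             # %q behaves like single quotes
quoted  = %Q(She said "Hello, #{name}!")  # %Q interpolates, quotes need no escaping
long_text = <<-EOS
Hello, #{name}! Quotes like " and ' need no escaping here.
EOS

puts plain      # Hello, #{name}!
puts raw        # Hello, #{name}!
puts quoted     # She said "Hello, world!"
puts long_text  # Hello, world! Quotes like " and ' need no escaping here.
```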

Matching an unescaped balanced pair of delimiters

How can I match a balanced pair of delimiters not escaped by backslash (that is itself not escaped by a backslash) (without the need to consider nesting)? For example with backticks, I tried this, but the escaped backtick is not working as escaped.
regex = /(?!<\\)`(.*?)(?!<\\)`/
"hello `how\` are` you"
# => $1: "how\\"
# expected "how\\` are"
And the regex above does not consider a backslash that is escaped by a backslash and is in front of a backtick, but I would like to.
How does StackOverflow do this?
The purpose of this is not much complicated. I have documentation texts, which include the backtick notation for inline code just like StackOverflow, and I want to display that in an HTML file with the inline code decorated with some span material. There would be no nesting, but escaped backticks or escaped backslashes may appear anywhere.
Lookbehind is the first thing everyone thinks of for this kind of problem, but it's the wrong tool, even in flavors like .NET that support unrestricted lookbehinds. You can hack something up, but it's going to be ugly, even in .NET. Here's a better way:
`[^`\\]*(\\.[^`\\]*)*`
The first part starts from the opening delimiter and gobbles up anything that's not the delimiter or a backslash. If the next character is a backslash, it consumes that and the character following it, whatever it may be. It could be the delimiter character, another backslash, or anything else, it doesn't matter.
It repeats those steps as many times as necessary, and when neither [^`\\] nor \\. can match, the next character must be the closing delimiter. Or the end of the string, but I'm assuming the input is well formed. But if it's not well formed, this regex will fail very quickly. I mention that because of this other approach I see a lot:
`(?:[^`\\]+|\\.)*`
This works fine on well-formed input, but what happens if you remove the last backtick from your sample input?
"hello `how\` are you"
According to RegexBuddy, after encountering the first backtick, this regex performed 9,252 distinct operations (or steps) before it could give up and report failure; mine failed in ten steps.
EDIT To extract just the part inside the delimiters, wrap that part in a capturing group. You'll still have to remove the backslashes manually.
`([^`\\]*(?:\\.[^`\\]*)*)`
I also changed the other group to non-capturing, which I should have done from the start. I don't avoid capturing religiously, but if you are using them to capture stuff, any other groups you use should be non-capturing.
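Applied in Ruby, the capturing version might look like the sketch below. The constant name and the <code> wrapper are assumptions made for illustration; the question only says the inline code should be decorated with some span markup.

```ruby
# The capturing pattern from above, as a Ruby Regexp literal.
INLINE_CODE = /`([^`\\]*(?:\\.[^`\\]*)*)`/

text = 'hello `how\` are` you'

# Group 1 still contains the escaping backslash.
captured = text[INLINE_CODE, 1]
# Remove the backslashes manually: keep only the escaped character.
unescaped = captured.gsub(/\\(.)/, '\1')

# Hypothetical decoration step: wrap each inline-code segment in <code> tags.
html = text.gsub(INLINE_CODE) do
  inner = Regexp.last_match(1)
  "<code>#{inner.gsub(/\\(.)/, '\1')}</code>"
end

puts captured   # how\` are
puts unescaped  # how` are
puts html       # hello <code>how` are</code> you

# With the closing backtick removed there is simply no match.
p 'hello `how\` are you'[INLINE_CODE]  # nil
```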
EDIT I think I've been reading too much into the question. On StackOverflow, if you want to include literal backticks in an inline-code segment or a comment, you use three backticks as the delimiter, not just one. Since there's no need to escape backticks, you can ignore backslashes as well. Your regex could turn out to be as simple as this:
```(.*?)```
Dealing with the possibility of false delimiters, you use the same basic technique:
```([^`]*(?:`(?!``)[^`]*)*)```
Is this what you're after?
By the way, this answer doesn't contradict @nneonneo's comment above. This answer doesn't consider the context in which the match is taking place. Is it in the source code of a program or web page? If it is, did the match occur inside a comment or a string literal? How do I even know the first backtick I found wasn't escaped? Regexes don't know anything about the context in which they operate; that's what parsers are for.
If you don't need nesting, regexes can indeed be a proper tool. Lexers of programming languages, for instance, use regexes to tokenize strings, and strings usually allow their own delimiters as an escaped content. Anything more complicated than that will probably need a full-blown parser though.
The "general formula" is to match an escaped character (\\.) or any character that's valid as content but doesn't need to be escaped ([^{list of invalid chars}]). A "naïve" solution would be joining them with or (|), but for a more efficient variant see @AlanMoore's answer.
The complete example is shown below, in two variants: the first assumes that backslashes should only be used for escaping inside the string, the second assumes that a backslash anywhere in the text escapes the next character.
`((?:\\.|[^`\\])*)`
(?:\\.|[^`\\])*`((?:\\.|[^`\\])*)`
Working examples here and here. However, as @nneonneo commented (and I endorsed), regexes are not meant to do a complete parse, so you'd better keep things simple if you want them to work out right (do you want to find a token in the text, or do you want to delimit it already knowing where it starts? The answer to that question is important to decide which strategy works best for your case).

gsub ASCII code characters from a string in ruby

I am using nokogiri to screen scrape some HTML. In some occurrences, I am getting some weird characters back, I have tracked down the ASCII code for these characters with the following code:
@parser.leads[0].phone_numbers[0].each_byte do |c|
puts "char=#{c}"
end
The characters in question have an ASCII code of 194 and 160.
I want to somehow strip these characters out while parsing.
I have tried the following code but it does not work.
@parser.leads[0].phone_numbers[0].gsub(/160.chr/,'').gsub(/194.chr/,'')
Can anyone tell me how to achieve this?
I found this question while trying to strip out invisible characters when "trimming" a string.
s.strip did not work for me and I found that the invisible character had the ord number 194
None of the methods above worked for me, but then I found the question "Convert non-breaking spaces to spaces in Ruby", which says:
Use /\u00a0/ to match non-breaking spaces: s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces
Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
So glad I found that! Now I'm using:
s.gsub(/[[:space:]]/,'')
This doesn't answer the question of how to gsub specific character codes, but if you're just trying to remove whitespace it seems to work pretty well.
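A quick sketch of what is going on, using a made-up phone number: bytes 194 and 160 together are the UTF-8 encoding of a single non-breaking space (U+00A0), which strip and /\s/ do not treat as whitespace.

```ruby
# Hypothetical scraped value with non-breaking spaces around and inside it.
phone = "\u00a0555\u00a01234\u00a0"

p phone.bytes.first(2)            # [194, 160]: one NBSP encoded as two bytes
p phone.strip == phone            # true: strip leaves the NBSPs alone
p phone.match?(/\s/)              # false: \s only covers ASCII whitespace

p phone.gsub(/\u00a0/, " ")       # " 555 1234 "
p phone.gsub(/[[:space:]]/, "")   # "5551234"
```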
Your problem is that you want to do a method call but instead you're creating a Regexp. You're searching and replacing strings consisting of the string "160" followed by any character and then the string "chr", and then doing the same except with "160" replaced with "194".
Instead, do gsub(160.chr, '').
Update (2018): This code does not work in current Ruby versions. Please refer to other answers.
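To illustrate why the original attempt matched nothing, and one variant that may work on current Rubies, here is a sketch. Passing an encoding to Integer#chr is my own assumption about a suitable fix for UTF-8 input, not something taken from the answers here, and the sample string is made up.

```ruby
# /160.chr/ is a pattern, not a method call: "160", any character, then "chr".
p(/160.chr/.match?("160xchr"))  # true
p(/160.chr/.match?("\u00a0"))   # false

# Build the character in the same encoding as the string being cleaned.
nbsp = 160.chr(Encoding::UTF_8)

phone = "555\u00a01234"         # hypothetical scraped value
p phone.gsub(nbsp, "")          # "5551234"
p phone.delete(nbsp)            # "5551234"
```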
You can also try
s.gsub(/\xA0|\xC2/, '')
or
s.delete 160.chr+194.chr
My first thought: should you be using gsub! instead of gsub?
gsub returns a new string, whereas gsub! performs the substitution in place.
I was getting an "invalid multibyte escape" error while trying the above solution, but in a different situation. Google was returning \xA0 when the number was greater than 999 and I wanted to remove it. So what I did was use return_value.gsub(/[\xA0]/n,"") instead, and it worked perfectly fine for me.
