When collecting user input in Ruby, is there ever a time when using chomp on that input would not be the desired behavior? That is, when would simply using gets, and not gets.chomp, be appropriate?
Yes, if you specify the maximum length for input, having a "\n" included in the gets return value allows you to tell if Ruby gave you x characters because it encountered the "\n", or because x was the maximum input size:
> gets 5
abcdefghij
=> "abcde"
vs:
> gets 5
abc\n
=> "abc\n"
If the returned string contains no trailing newline, it means there are still characters in the buffer.
Without the limit on the input, the trailing newline, or any other delimiter, probably doesn't have much use, but it's kept for consistency.
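The difference can be reproduced without typing anything. This is a minimal sketch that assumes a StringIO object standing in for standard input; the variable names are only for illustration.

```ruby
require "stringio"

# StringIO stands in for $stdin so the example runs non-interactively.
long_line  = StringIO.new("abcdefghij\n")
short_line = StringIO.new("abc\n")

first  = long_line.gets(5)   # the limit is reached before the newline
second = short_line.gets(5)  # the newline is reached before the limit

p first                      # => "abcde"
p second                     # => "abc\n"
p first.end_with?("\n")      # => false, part of the line is still unread
p second.end_with?("\n")     # => true, the whole line was returned
p long_line.gets(5)          # => "fghij"
```

Calling chomp on both results would leave "abcde" and "abc", and the information about which read stopped at the limit would be gone.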
So I'm trying to figure out what a question from an old exam means and I'm slightly confused about one or two parts.
#!/bin/bash
awk '{$0 = tolower($0)
gsub(/[,.?;:#!\(\)]/),"",$0)
for(a=1;a<=NF;a++)
b[$a]++}
END print b[a],a}'
sort -sk2
Here is my interpretation:
target the bash script location
scan file with awk
convert string to lower case
sub all occurrences of symbols with nothing (ie. remove) and overwrite string
(here is my issue) for every field increment a by 1?
(again not sure what this is doing) b takes a's number and increments by 1?
end the for loop and print (b, a)
sort by size of the second field
I think the last four lines are my main issue. Also is it just me or is there an extra } in that question?
Thanks in advance.
The for loop is weirdly formatted. Here it is again with proper indentation:
for(a=1; a<=NF; a++)
b[$a]++
In other words, we loop over the field positions; for each, the count in the associative array b is incremented. So if the current input line is
foo bar poo bar baz
the script will do
b["foo"]++ # a is 1; $a is $1
b["bar"]++
b["poo"]++
b["bar"]++
b["baz"]++
So now b contains a set of tokens as keys, and the number of times each occurred as their respective values. In other words, this collects word counts for each word in the input.
The case folding and removal of punctuation normalizes the input so that
Word word word, word!
will count as four occurrences of "word", rather than one each for the capitalized version, the undecorated normal form, and the ones with punctuation attached at the end. It slightly distorts e.g. words which should properly be capitalized, and conflates into homographs words which are differentiated only by capitalization (such as china porcelain vs China the country.)
The END block is executed only when all input lines have been consumed, and thus b is fully loaded with all input words from all input lines, with their final counts. (Though here, there is no valid END block actually, because the opening brace after END is missing; this is a fatal syntax error. There isn't one closing brace too many, there's one non-optional opening brace missing.)
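For comparison, here is a rough Ruby analogue of what the script appears to be meant to do. It is a sketch, not the exam's awk: the method name word_counts and the sample text are made up for illustration, and the final sort by word is my reading of the sort step.

```ruby
# Count how often each word occurs, ignoring case and punctuation.
def word_counts(text)
  counts = Hash.new(0)                               # missing keys start at 0, like awk arrays
  text.each_line do |line|
    cleaned = line.downcase.gsub(/[,.?;:#!()]/, "")  # same character set as the gsub call
    cleaned.split.each { |word| counts[word] += 1 }  # one increment per field
  end
  counts
end

counts = word_counts("Word word word, word!\nfoo bar poo bar baz\n")

# Print "count word" pairs ordered by word.
counts.sort_by { |word, _n| word }.each { |word, n| puts "#{n} #{word}" }
# 2 bar
# 1 baz
# 1 foo
# 1 poo
# 4 word
```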
I see people use the following code:
gets.chomp.to_i
or
gets.chomp.to_f
I don't understand why, when the result of those lines is always the same as when there is no chomp after gets.
Is gets.chomp.to_i really necessary, or is gets.to_i just enough?
From the documentation for String#to_i:
Returns the result of interpreting leading characters in str as an
integer base base (between 2 and 36). Extraneous characters past the
end of a valid number are ignored. If there is not a valid number at
the start of str, 0 is returned
String#to_f behaves the same way, excluding, of course, the base numbers.
Extraneous characters past the end of a valid number are ignored, and that includes the newline. So there is no need to use chomp.
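A quick check of that behavior, using string literals in place of what gets would return:

```ruby
line = "42\n"                # what gets returns when the user types 42

p line.to_i                  # => 42
p line.chomp.to_i            # => 42, same result with the extra step
p "12abc\n".to_i             # => 12, parsing stops at the first non-digit
p "abc\n".to_i               # => 0, no valid number at the start
p "ff\n".to_i(16)            # => 255, the optional base argument
```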
There is no need to use chomp method because:
String#chomp returns a new String with the given record separator removed from the end of str (if present). If $/ has not been changed from the default Ruby record separator, then chomp also removes carriage return characters (that is, it will remove "\n", "\r", and "\r\n").
String#to_f returns the result of interpreting leading characters in str as a floating point number. Extraneous characters past the end of a valid number are ignored. If there is not a valid number at the start of str, 0.0 is returned. This method never raises an exception.
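A few illustrative calls, assuming $/ still has its default value:

```ruby
# chomp removes at most one trailing line ending.
p "hello\n".chomp      # => "hello"
p "hello\r\n".chomp    # => "hello"
p "hello\r".chomp      # => "hello"
p "hello".chomp        # => "hello"
p "hello\n\n".chomp    # => "hello\n"

# to_f ignores whatever follows the number and never raises.
p "3.14\n".to_f        # => 3.14
p "1e3 apples".to_f    # => 1000.0
p "abc".to_f           # => 0.0
```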
It is my opinion that it works the same either way, so there is no need for the chomp after gets if you are going to immediately do to_i or to_f.
In practice, I have never seen an error raised or different behavior because of leaving chomp out of the line.
I find it distracting when I see it used in answers, and there is absolutely no need for it. It doesn't add to a "style", and it is, as @TheTinMan states, wasted CPU cycles.
I have a string separated by \t and ,, but the number of \t is not fixed, for example:
a = "seg1\tseg2\t\tseg3,seg4"
seg2 and seg3 are separated by two \t.
So I tried to split them with
a.split(/\t+|,/)
It prints the right answer:
["seg1", "seg2", "seg3", "seg4"]
I also tried this
a.split(/[\t+,]/)
but the answer is
["seg1", "seg2", "", "seg3", "seg4"]
Why does Ruby print different results?
Because \t+ inside [] does not mean "one or more tabs", it means "a tab or a plus". Since it finds two consecutive tabs, it splits twice, and the string in the middle becomes empty.
Most special characters, like . + * ? etc, become "regular" characters when placed in an interval (more commonly called a character class). There are some exceptions, like ^ (which negates the interval when placed at the beginning), the - (which forms a range when placed between two characters), the \ (which escapes the next character(s), just like it does outside intervals) and the ] (which closes the interval; another [ is also disallowed there). So, [\t+,] actually means '\t' or '+' or ','.
Unfortunately, I don't know of any reference for the full set of characters that need or don't need escaping inside an interval. When in doubt, I tend to escape just to be sure. In any case, an interval will always match a single character only; if you want something different, you must put your quantifier outside the interval. (For example: [\t,]+, if you also accept two commas in a row; otherwise, your first regex is really the correct one.)
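The three patterns can be compared side by side on the string from the question. The last two lines use made-up inputs to show where the patterns differ.

```ruby
s = "seg1\tseg2\t\tseg3,seg4"

p s.split(/\t+|,/)     # => ["seg1", "seg2", "seg3", "seg4"]
p s.split(/[\t+,]/)    # => ["seg1", "seg2", "", "seg3", "seg4"]
p s.split(/[\t,]+/)    # => ["seg1", "seg2", "seg3", "seg4"]

# Inside the brackets, + is a literal plus sign.
p "a+b".split(/[\t+,]/)   # => ["a", "b"]

# The quantified class also swallows repeated commas.
p "a,,b".split(/[\t,]+/)  # => ["a", "b"]
```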
I never need the ending newline I get from gets. Half of the time I forget to chomp it and it is a pain in the....
Why is it there?
Like puts (which sounds similar), it is designed to work with lines, using the \n character.
gets takes an optional argument that is used for "splitting" the input (or just "reading until it arrives"). It defaults to the special global variable $/, which contains a \n by default.
gets is a pretty generic method for reading streams and includes this separator in its result. If it did not, parts of the stream content would be lost.
var = gets.chomp
This puts it all on one line for you.
If you look at the documentation of IO#gets, you'll notice that the method takes an optional parameter sep which defaults to $/ (the input record separator). You can decide to split input on other things than newlines, e.g. paragraphs ("a zero-length separator reads the input a paragraph at a time (two successive newlines in the input separate paragraphs)"):
>> gets('')
dsfasdf
fasfds
dsafadsf #=> "dsfasdf\nfasfds\n\n"
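The separator argument can be tried without a terminal. This sketch assumes StringIO objects standing in for real input streams; the sample strings are made up.

```ruby
require "stringio"

# Split on commas instead of newlines; the separator stays in the result.
csv = StringIO.new("a,b,c")
p csv.gets(",")            # => "a,"
p csv.gets(",")            # => "b,"
p csv.gets(",")            # => "c"
p csv.gets(",")            # => nil, end of input

# chomp accepts the same separator to strip it again.
p "a,".chomp(",")          # => "a"

# A zero-length separator reads one paragraph at a time.
paragraphs = StringIO.new("one\ntwo\n\nthree\n")
p paragraphs.gets("")      # => "one\ntwo\n\n"
p paragraphs.gets("")      # => "three\n"
```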
From a performance perspective, the better question would be "why should I get rid of it?". It's not a big cost, but under the hood you have to pay to chomp the string being returned. While you may never have had a case where you need it, you've surely had plenty of cases where you don't care: s = gets; puts stuff() if s =~ /y/i, etc. In those cases, you'll see a (tiny, tiny) performance improvement by not chomping.
How I auto-detect line endings:
# file open in binary mode
line_ending = 13.chr + 10.chr
check = file.read(1000)
case check
when /\r\n/
  # already set
when /\n/
  line_ending = 10.chr
when /\r/
  line_ending = 13.chr
end
file.rewind
while !file.eof?
  line = file.gets(line_ending).chomp
  ...
end
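The same idea as a self-contained sketch that can be run as is. The helper name detect_line_ending, the StringIO stand-in for the file, and the sample text are assumptions made for illustration.

```ruby
require "stringio"

# Guess the line ending from the first bytes of a stream (hypothetical helper).
def detect_line_ending(io, sample_size = 1000)
  sample = io.read(sample_size) || ""
  io.rewind
  case sample
  when /\r\n/ then "\r\n"   # check CRLF first, since it contains the other two
  when /\n/   then "\n"
  when /\r/   then "\r"
  else "\r\n"               # same default as the snippet above
  end
end

io = StringIO.new("first\rsecond\rthird")
separator = detect_line_ending(io)

lines = []
lines << io.gets(separator).chomp(separator) until io.eof?

p separator   # => "\r"
p lines       # => ["first", "second", "third"]
```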
I have a regular expression that I'm testing against an input stream of characters. I wonder if there is a way to match the regular expression against the input and determine if it is a partial match that consumes the entire input buffer? I.e. the end of the input buffer is reached before the regexp completes. I would like the implementation to decide whether to wait for more input characters, or abort the operation.
In other words, I need to determine which one is true:
The end of the input buffer was reached before the regexp was matched
E.g. "foo" =~ /^foobar/
The regular expression matches completely
E.g. "foobar" =~ /^foobar/
The regular expression failed to match
E.g. "fuubar" =~ /^foobar/
The input is not packetized.
Is this the scenario you are solving? You are waiting for a literal string, e.g. 'foobar'. If the user types a partial match, e.g. 'foo', you want to keep waiting. If the input is a non-match you want to exit.
If you are working with literal strings I would just write a loop to test the characters in sequence. Or,
if (input.Length < target.Length && target.StartsWith(input))
// keep trying
If you are trying to match more complex regular expressions, I don't know how to do this with regular expressions. But I would start by reading more about how the platform implements regular expressions.
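For the literal-string case, the three outcomes from the question can be told apart with two prefix checks. This is a minimal Ruby sketch; the method name classify and the returned symbols are made up for illustration, and it does not handle general regular expressions.

```ruby
# Classify buffered input against a literal target that must appear at the start.
def classify(input, target)
  if input.start_with?(target)
    :match        # the whole target has been seen
  elsif target.start_with?(input)
    :partial      # input ran out first, keep waiting for more characters
  else
    :no_match     # a character already differs, abort
  end
end

p classify("foo", "foobar")     # => :partial
p classify("foobar", "foobar")  # => :match
p classify("fuubar", "foobar")  # => :no_match
```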
tom
I'm not sure if this is your question but.
Regular expressions either match or not. And the expression will match a variable amount of input. So, it can't be determined directly.
However, it is possible, if you believe there is a possibility of overlap, to use a smart buffering scheme to accomplish the same thing.
There are many ways to do this.
One way is to match all that does not match via assertions, up until you get the start of a match (but not the full match you seek).
These you simply throw away and clear from your buffer. When you get the match you seek, clear the buffer of that data and the data before it.
Example: /(<function.*?>)|([^<]*)/ The part you throw away/clear from the buffer is in group 2 capture buffer.
Another way is if you are matching finite length strings, if you don't match anything in the buffer, you can safely throw away all from the beginning of the buffer to the end of the buffer minus the length of the finite string you are searching for.
Example: Your buffer is 64k in size. You are searching for a string of length 10. It was not found in the buffer. You can safely clear (64k - 10) bytes, retaining the last 10 bytes. Then append (64k-10) bytes to the end of the buffer. Of course you only need a buffer of size 10 bytes, constantly removing/adding 1 character but a larger buffer is more efficient and you could use thresholds to reload more data.
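The fixed-length case can be sketched in Ruby as follows. The method name stream_include?, the tiny chunk size, and the StringIO input are assumptions chosen to keep the example short. It keeps one byte less than the search string's length, which is enough to catch a match straddling two chunks; keeping the full length as described above is equally safe.

```ruby
require "stringio"

# Search a stream for a literal string while holding only a small tail in memory.
def stream_include?(io, needle, chunk_size = 8)
  buffer = String.new
  keep = needle.length - 1                 # longest prefix that could still complete
  while (chunk = io.read(chunk_size))
    buffer << chunk
    return true if buffer.include?(needle)
    # Discard everything except the tail that might start a match.
    buffer = keep > 0 ? (buffer[-keep..-1] || buffer) : String.new
  end
  false
end

p stream_include?(StringIO.new("xxxxxfoobarxxx"), "foobar", 4)  # => true
p stream_include?(StringIO.new("xxxxxfoobazxxx"), "foobar", 4)  # => false
```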
If you can create a buffer that easily contracts/expands, more buffering options are available.