I know that
my_str.split("\n").first
gives me the first line of the string.
But sadly that cuts the entire string into an array. If that string is several MB in size and I only need the first 5 lines then... There's gotta be a better alternative. I could write my own method to process the string character by character, but there is probably a better way, or even a built-in method, for what I need?
There's String#each_line:
my_str.each_line.take(5)
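A minimal sketch of why this avoids the full split, assuming a small made-up sample string in place of the multi-MB one. Called without a block, each_line returns an Enumerator, and take stops pulling lines once it has five:

```ruby
# Hypothetical sample string; in practice this could be several MB.
my_str = "line 1\nline 2\nline 3\nline 4\nline 5\nline 6\nline 7\n"

# each_line without a block returns an Enumerator, so take(5) stops
# scanning after the fifth line instead of splitting the whole string.
first_five = my_str.each_line.take(5)

# Each element keeps its trailing newline; chomp strips it.
first_five_clean = first_five.map(&:chomp)

# The first line alone, without building an array of all lines.
first_line = my_str.each_line.first
```

Note that the lines keep their "\n" terminator, unlike the pieces returned by split("\n").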
I have some HTML page and on this page I will provide the possibility for free text.
For example, it is possible to write in textbox either: 10 or 10 apples.
If I write 10 apples, I get a NumberFormatException, which is correct, but it would be good for me to extract only the number automatically, without writing any JavaScript.
Is it possible to map a string from the HTML page to a number in my Java entity? Maybe with some annotation, or some other way?
Try:
final String stripped = textbox.getText().replaceAll("[^0-9]", "");
This takes the contents of the text field and strips out any characters that aren't digits. If you need to deal with floating point or negative numbers, it can be done, but becomes more complicated.
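To show what "more complicated" might look like, here is a rough sketch of the same idea, written in Ruby rather than Java. Stripping every non-digit turns "1.5" into "15" and loses the minus sign, so this version pulls out the first number-like token instead. The helper name extract_number is made up for illustration and is not part of any library:

```ruby
# Hypothetical helper, not part of any library: returns the first number
# found in free text, or nil when there is none.
def extract_number(text)
  # Optional minus sign, digits, optional fractional part.
  match = text[/-?\d+(?:\.\d+)?/]
  return nil if match.nil?

  # Keep integers as integers; only use a float when a fraction is present.
  match.include?(".") ? match.to_f : match.to_i
end

whole    = extract_number("10 apples")
fraction = extract_number("-3.5 kg")
missing  = extract_number("apples")
```

This only handles plain decimal notation. Thousands separators, exponents and locale-specific decimal commas would need extra rules.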
I have a large CSV with a large number of columns. I am trying to count the number of lines using
File.open(file).readlines.to_a.compact.count.to_i
It displays 57 although there are only 56 rows. Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
You need to show an example of the incoming data if you want us to help beyond generic answers.
To fix the problem, you have to be able to identify the line. We can't help you there because it could look like anything. Making a wild guess, I'd say that one of the columns had an embedded new-line in it, which forces the line to wrap.
If the file is a true CSV file, that column should be wrapped in double-quotes, so you could search the file for lines that do NOT end with whatever data type should be in the last column, then read the next line, join them, then rewrite the file. But, again, we have nothing to work with, because your file's format could be a huge number of different things.
Your best bet is to use the CSV class that comes with Ruby, and let it read the file, instead of trying to treat it like a text file. CSV files are text, but they are formatted to maintain the columns and rows, so using the CSV class will give you a better chance of getting at the data.
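As a sketch of that guess, assuming the wrapped line really is a quoted field with an embedded new-line (the sample data here is invented), the CSV class counts it as one row while a plain line count sees two. On recent Ruby versions csv ships as a bundled gem, so it may need to be listed in a Gemfile:

```ruby
require "csv"

# Hypothetical data: the second record has a quoted field with an
# embedded newline, so it spans two physical lines.
data = "id,note\n1,\"first half\nsecond half\"\n2,plain\n"

# Counts physical lines, which over-counts the wrapped record.
physical_lines = data.each_line.count

# Counts parsed CSV rows (header row included).
records = CSV.parse(data).length

# For a file on disk, CSV.foreach yields one parsed row at a time:
# row_count = CSV.foreach("/path/to/file").count
```

If the count from CSV still comes out one too high, the wrapped field is probably not quoted, and the file would need repairing before any parser can read it correctly.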
Looking at your code:
There are a number of ways to count the number of lines in a file, including the easiest which is:
`wc -l /path/to/file`.to_i
if you're using *nix.
Using File.open(file).readlines.to_a is horribly redundant and not fast or scalable if your file is big.
readlines returns an array.
to_a returns an array.
Why turn the array into an array?
readlines loads an entire file into memory, then splits it on line ends into an array. That process can be a lot slower than simply reading the file line-by-line and incrementing a counter, plus "slurping" can make your program crawl if the file is larger than available memory.
See "Why is "slurping" a file not a good practice?" for more information.
compact removes nils from an array. readlines should never return any nils so compact will iterate over the array looking for something that shouldn't exist.
count returns an integer.
to_i converts the receiver to an integer.
In other words, to_i is turning an integer into an integer. Why?
If you want to do it in Ruby instead of using wc -l, do something simple and fast:
lines_in_file = 0
File.foreach(some_file) { lines_in_file += 1 }
After running that, lines_in_file will contain the number of lines read. Memory won't be impacted and it'll run like blue blazes on huge files.
I have a long string, consisting of multiple sentences, of various length, divided by a "-".
I want to iterate over the string and extract everything between the -'s, preferably to an array.
From another thread I found something that gets me pretty close, but not all the way:
longString.scan( /-([^-]*)-/)
Needless to say, I am new to Ruby, and especially to RegEx.
What's wrong with using String#split?
longString.split('-')
Why not just use string.split()?
longString.split('-');
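A small sketch of the difference, using an invented sample string. The scan pattern from the question consumes both the opening and the closing "-" on each match, so every second segment gets skipped, whereas split treats each "-" as a separator exactly once:

```ruby
# Hypothetical sample; the variable name follows the question.
longString = "-first sentence-second one-third-"

# The closing "-" of one match is consumed, so it cannot also open the
# next match: "second one" is skipped.
with_scan = longString.scan(/-([^-]*)-/)

# split keeps a leading empty string when the input starts with "-";
# trailing empty strings are dropped by default.
with_split = longString.split('-')

# Dropping empty pieces leaves just the sentences.
sentences = longString.split('-').reject(&:empty?)

# A scan that matches the segments themselves, not the delimiters.
segments = longString.scan(/[^-]+/)
```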
I have a very large data file (2GB-3GB). I need to parse some data out of it and check whether there is any duplication. So I start with an empty string, and each piece of data that I parse out of the input file is checked against this string. If it is not already there, I append it. This string can potentially be very, very long. Is it dangerous?
It is not dangerous. You just might not have enough memory to store a very, very long string, in which case you will encounter an out-of-memory error.
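The question does not name a language, so as a sketch only: a different technique from the one asked about is to keep the values seen so far in a hash-based set instead of one long string. The values below are made up, and memory still grows with the number of unique values, but each membership check no longer scans everything collected so far:

```ruby
require "set"

# Hypothetical parsed values; in practice these would come from the
# large input file, read line by line.
parsed_values = ["alpha", "beta", "alpha", "gamma", "beta"]

seen = Set.new
duplicates = []

parsed_values.each do |value|
  # Set#add? returns nil when the value is already present.
  duplicates << value unless seen.add?(value)
end
```

Searching a long string for a substring can also report false matches (for example "bet" inside "alphabeta"), which a set of whole values avoids.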
Background: I've been writing a little interpreter in Scheme (R5RS).
The reader/lexer takes a (sometimes long) string from input and tokenises it. It does this by matching the first few characters of the string against some token and returning the token and the remaining unmatched part of the string.
Problem: to return the remaining portion of the string, a new string is created every time a token is read. This means the reader is O(n^2) in the number of tokens present in the string.
Possible solution: convert the string to a list, which can be done in time O(n), then pull tokens from the list instead of the string, returning the remainder of the list instead of the remainder of the string. But this seems terribly inefficient and artificial.
Question: am I imagining it, or is there just no other way to do this efficiently in Scheme due to its purely functional outlook?
Edit: in R5RS Scheme, there isn't a way to return a pointer into a string. The "substring" function is the only function which extracts an object which is itself a string. But the Scheme standard insists this be a newly allocated string. Why? Because strings are not immutable in Scheme R5RS, e.g. see the "string-set!" function!!
One solution suggested below which works is to store an index into the string. Then one can read off the characters one at a time from that index until a token is read. Too bad the regexp library I'm using for the tokenisation requires an actual string not an index into one...
Consider making a shared-substring implementation of strings (this is how Java's String.substring worked before Java 7 update 6, for example; later versions copy the characters). So when you want to grab a substring of a given string, rather than copying the characters, simply keep a pointer to (some location in) those characters, and a length.
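The idea is not tied to Scheme, so here is a rough sketch of it in Ruby. The class name SharedSubstring and its methods are invented for illustration. Each view holds a reference to the original string plus an offset and a length, so taking "the rest of the input" after a token costs O(1) instead of copying:

```ruby
# Hypothetical class, not a standard library type: a view onto part of a
# string that stores a reference, an offset and a length instead of a copy.
class SharedSubstring
  attr_reader :offset, :length

  def initialize(base, offset = 0, length = base.length - offset)
    @base = base
    @offset = offset
    @length = length
  end

  # Character at position i of the view, read from the shared base.
  def [](i)
    return nil if i < 0 || i >= @length

    @base[@offset + i]
  end

  # Remainder of the view after dropping n characters. No characters
  # are copied; only the offset and length change.
  def drop(n)
    n = [n, @length].min
    SharedSubstring.new(@base, @offset + n, @length - n)
  end

  # Materialise the characters only when a real string is needed.
  def to_s
    @base[@offset, @length]
  end
end

input = SharedSubstring.new("(define x 42)")
after_paren  = input.drop(1)        # view starting at "define"
after_define = after_paren.drop(7)  # view starting at "x"
```

Because the base string is shared, any later mutation of it (string-set! in the Scheme case) shows through every view, which is the trade-off against substring's fresh copy.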