I have a very large data file (2GB-3GB). I need to parse some data out of it and check for duplicates. I start with an empty string, and each piece of data I parse out of the input file is checked against this string. If it is not already there, I append it. This string can potentially become very, very long. Is it dangerous?
It is not dangerous. You just might not have enough memory to store a very, very long string, in which case you will encounter an out-of-memory error.
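One way to sidestep the ever-growing string is to keep the values already seen in a hash-based set, so each duplicate check is a lookup rather than a search through one long string. The following is only a sketch in Ruby, assuming one record per line; the helper name `unique_records` and the sample data are hypothetical and not from the question.

```ruby
require 'set'

# Hypothetical helper: accepts any enumerable of lines (an Array,
# or the enumerator returned by File.foreach(path)) and returns the
# records that have not been seen before, in first-seen order.
def unique_records(lines)
  seen = Set.new
  unique = []
  lines.each do |line|
    record = line.chomp
    # Set#add? returns nil when the record is already in the set.
    unique << record if seen.add?(record)
  end
  unique
end

sample = ["alpha\n", "beta\n", "alpha\n", "gamma\n", "beta\n"]
deduplicated = unique_records(sample)
```

The set still has to hold every distinct record, so memory use grows with the number of unique values, not with the size of the input file.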
I have a large CSV with a large number of columns. I am trying to count the number of lines using
File.open(file).readlines.to_a.compact.count.to_i
It displays 57 although there are only 56 rows. Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
You need to show an example of the incoming data if you want us to help beyond generic answers.
To fix the problem, you have to be able to identify the line. We can't help you there because it could look like anything. Making a wild guess, I'd say that one of the columns had an embedded new-line in it, which forces the line to wrap.
If the file is a true CSV file, that column should be wrapped in double-quotes, so you could search the file for lines that do NOT end with whatever data type should be in the last column, then read the next line, join them, then rewrite the file. But, again, we have nothing to work with, because your file's format could be a huge number of different things.
Your best bet is to use the CSV class that comes with Ruby, and let it read the file, instead of trying to treat it like a text file. CSV files are text, but they are formatted to maintain the columns and rows, so using the CSV class will give you a better chance of getting at the data.
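As a rough illustration of the difference, here is a small sketch with made-up data in which one quoted field contains an embedded newline. The sample string and variable names are assumptions for the example, since the real file was never shown.

```ruby
require 'csv'

# Hypothetical sample: a header plus one record whose second field
# contains a newline inside double quotes.
data = "id,comment\n1,\"first line\nsecond line\"\n"

# Treating the data as plain text counts physical lines.
physical_lines = data.lines.count

# Letting CSV parse it counts logical records (header included).
csv_rows = CSV.parse(data).count
```

For a file on disk, `CSV.foreach(path).count` should give the same kind of record count without loading the whole file at once.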
Looking at your code:
There are a number of ways to count the number of lines in a file, including the easiest which is:
`wc -l /path/to/file`.to_i
if you're using *nix.
Using File.open(file).readlines.to_a is horribly redundant and not fast or scalable if your file is big.
readlines returns an array.
to_a returns an array.
Why turn the array into an array?
readlines loads an entire file into memory, then splits it on line ends into an array. That process can be a lot slower than simply reading the file line-by-line and incrementing a counter, plus "slurping" can make your program crawl if the file is larger than available memory.
See "Why is "slurping" a file not a good practice?" for more information.
compact removes nils from an array. readlines should never return any nils so compact will iterate over the array looking for something that shouldn't exist.
count returns an integer.
to_i converts the receiver to an integer.
In other words, to_i is turning an integer into an integer. Why?
If you want to do it in Ruby instead of using wc -l, do something simple and fast:
lines_in_file = 0
File.foreach(some_file) { lines_in_file += 1 }
After running that, lines_in_file will contain the number of lines read. Memory won't be impacted and it'll run like blue blazes on huge files.
I scraped some text from the internet, which I put in a UTF8String. I can use this string normally, but when I select some specific characters (strange characters with accents, like in my case ú), which are not part of the UTF8 standard, I get an error saying that I used invalid indexes. This only happens when the string contains strange characters; my code works with normal strings that do not contain strange characters.
Any way to solve this?
EDIT:
I have a variable word of type SubString{UTF8String}
When I do method(word), no problems occur. When I do method(word[2:end]) (assuming a length of at least 2), I get an error if the second character is strange (not in UTF8).
Julia does indexing on byte positions instead of character positions. It is way more efficient for a variable-length encoding like UTF-8, but it makes some operations need more boilerplate.
The problem is that some code points are encoded as multiple bytes, and when you slice the string from 2:end you get half of the first character (which is invalid, so you get an error).
The solution is to get the second valid index instead of 2 in the slice. I think that is something like str[nextind(str, 1):end]
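The byte-versus-character distinction can be seen in other languages too. The sketch below uses Ruby, not Julia, purely as an analogy: Ruby's normal indexing is character-based, so `byteslice` is used here to imitate byte-position slicing. The sample word is an assumption.

```ruby
# "aúb": three characters, but "ú" takes two bytes in UTF-8.
word = "a\u00FAb"

char_count = word.length     # number of characters
byte_count = word.bytesize   # number of bytes

# Slicing from a byte offset that lands in the middle of "ú"
# leaves a dangling continuation byte at the front.
broken = word.byteslice(2..-1)

# Slicing from the byte offset where "ú" actually starts is fine.
intact = word.byteslice(1..-1)
```

Here `broken.valid_encoding?` is false while `intact` is the string "úb", which is the same kind of failure the Julia slice runs into when the start index is not a valid character boundary.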
PS. Sorry for a less than clear answer on my phone.
EDIT:
I tried this, and it seems like SubString{UTF8String} and UTF8String have different behaviour on slicing. I've reported it as bug #7811 on GitHub.
I know that
my_str.split("\n").first
gives me the first line of the string.
But sadly that cuts the entire string into an array. If that string is several MB in size and I only need the first 5 lines then... There's gotta be a better alternative. I could write my own method to process the string character by character, but there is probably some better method or even a built-in one for what I need?
There's String#each_line:
my_str.each_line.take(5)
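A short sketch of how that behaves, using a made-up seven-line string. Note that `each_line` keeps the trailing newline on each line; the `chomp: true` variant shown second assumes Ruby 2.4 or later.

```ruby
# Hypothetical sample input with seven lines.
my_str = "line 1\nline 2\nline 3\nline 4\nline 5\nline 6\nline 7\n"

# Stops iterating once five lines have been collected;
# each element still ends in "\n".
first_five = my_str.each_line.take(5)

# Same idea, with the line endings removed.
first_five_chomped = my_str.each_line(chomp: true).first(5)
```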
I have a number in Mathematica, a large number. I have even gotten this number in base 16 form, using OutputForm[]. I am basically trying to write out a number to a file in hex format.
Please keep in mind I am using 123456 in these examples instead of my 70,000 digit number.
Whenever I write a file using a simple Put[123456, "file.raw"] command, I get a raw data file with the actual data 3132333435360A with a line ending.
If I use Put[OutputForm[BaseForm[123456, 16]], "file.raw"] command, I get a raw data file with the data in hex format 31653234300A202020202031360A but still not written as raw data.
I would like the Hex Form of the Number Dumped as Data.
I have tried Export, BinaryWrite, and DumpSave, but can't figure it out.
I just am getting a headache I guess cause I can't see past what I need to do.
One thing I did try was doing:
Export["file.raw", 123456];
But the file is not raw enough. What I mean by that is there is header data and extra crap.
Would love to get this working thanks.
Please let us know what you expect to see in your output file, and what you want to use it for. Do you want something a human can read, or something in a specified format to be used by a computer? Please provide an example.
The two examples using Put[] correctly provide files containing ASCII characters corresponding to the text representations of your inputs, and which are human-readable.
I think what you're looking for is IntegerString[_,16]:
In[33]:= IntegerString[123456, 16]
Out[33]= "1e240"
str = OpenWrite[];
WriteString[str, IntegerString[123456, 16]];
Close[str];
FilePrint[%]
1e240
(using WriteString instead of Put avoids having the string characters
Background: I've been writing a little interpreter in Scheme (R5RS).
The reader/lexer takes a (sometimes long) string from input and tokenises it. It does this by matching the first few characters of the string against some token and returning the token and the remaining unmatched part of the string.
Problem: to return the remaining portion of the string, a new string is created every time a token is read. This means the reader is O(n^2) in the number of tokens present in the string.
Possible solution: convert the string to a list, which can be done in time O(n), then pull tokens from the list instead of the string, returning the remainder of the list instead of the remainder of the string. But this seems terribly inefficient and artificial.
Question: am I imagining it, or is there just no other way to do this efficiently in Scheme due to its purely functional outlook?
Edit: in R5RS Scheme, there isn't a way to return a pointer into a string. The "substring" function is the only function which extracts an object which is itself a string. But the Scheme standard insists this be a newly allocated string. Why? Because strings are not immutable in Scheme R5RS, e.g. see the "string-set!" function!!
One solution suggested below which works is to store an index into the string. Then one can read off the characters one at a time from that index until a token is read. Too bad the regexp library I'm using for the tokenisation requires an actual string not an index into one...
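For comparison, some languages ship a scanner that combines both ideas: it keeps an index into the original string and applies a regexp at that index. The sketch below uses Ruby's `StringScanner` rather than Scheme, and the token rules are a deliberately tiny, hypothetical grammar, not the interpreter's real one.

```ruby
require 'strscan'

# Hypothetical tokenizer for a minimal Scheme-like syntax:
# parentheses are tokens, everything else splits on whitespace.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    # Each branch advances the scanner's internal position;
    # the remainder of the input is never copied.
    if scanner.skip(/\s+/)
      next
    elsif (token = scanner.scan(/[()]/))
      tokens << token
    elsif (token = scanner.scan(/[^\s()]+/))
      tokens << token
    end
  end
  tokens
end

tokens = tokenize("(define x 42)")
```

Only the matched token is allocated on each step, so the work is proportional to the input length instead of quadratic in the number of tokens.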
Consider making a shared-substring implementation of strings (this is how Java did it before Java 7 update 6, for example). So when you want to grab a substring of a given string, rather than copying the characters, simply keep a pointer to (some location in) those characters, and a length.
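A minimal sketch of that idea, written in Ruby for illustration rather than Scheme. The class name `StringView` and its methods are hypothetical, and the sketch assumes the backing string is not mutated while views on it exist (which is exactly the guarantee R5RS strings do not give, because of string-set!).

```ruby
# Hypothetical view type: shares the backing string and stores
# only an offset and a length.
class StringView
  attr_reader :offset, :length

  def initialize(source, offset = 0, length = source.length - offset)
    @source = source
    @offset = offset
    @length = length
  end

  # Returns a view of the remainder after skipping n characters.
  # No characters are copied; only the two integers change.
  def drop(n)
    n = [n, @length].min
    StringView.new(@source, @offset + n, @length - n)
  end

  # Character at position i within the view, or nil if out of range.
  def [](i)
    return nil if i < 0 || i >= @length
    @source[@offset + i]
  end

  # Copies the characters out only when a real string is required.
  def to_s
    @source[@offset, @length]
  end
end

input = "(define x 42)"
rest = StringView.new(input).drop(1)   # skip the opening parenthesis
after_define = rest.drop(7)            # skip "define "
```

Taking the remainder is then constant-time bookkeeping, and a string is allocated only when `to_s` is called, for instance to hand a token to a regexp library.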