I am working on an application which processes large CSV files (several hundred MB). Recently I faced a problem which at first looked like a memory leak in the application, but after some investigation it appears to be a combination of a badly formatted CSV file and CsvListReader's attempt to parse a never-ending line. As a result, I got the following exception:
at java.lang.OutOfMemoryError.<init>(<unknown string>)
at java.util.Arrays.copyOf(<unknown string>)
Local Variable: char[]#13624
at java.lang.AbstractStringBuilder.expandCapacity(<unknown string>)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(<unknown string>)
at java.lang.AbstractStringBuilder.append(<unknown string>)
at java.lang.StringBuilder.append(<unknown string>)
Local Variable: java.lang.StringBuilder#3
at org.supercsv.io.Tokenizer.readStringList(<unknown string>)
Local Variable: java.util.ArrayList#642
Local Variable: org.supercsv.io.Tokenizer#1
Local Variable: org.supercsv.io.PARSERSTATE#2
Local Variable: java.lang.String#14960
at org.supercsv.io.CsvListReader.read(<unknown string>)
By analyzing the heap dump and the CSV file based on the dump findings, I noticed that one of the columns in one of the CSV lines was missing its closing quote, which obviously resulted in the reader trying to find the end of the line by appending file content to an internal string buffer until there was no more heap memory.
Anyway, that was the problem, and it was due to badly formatted CSV - once I removed the critical line, the problem disappeared. What I want to achieve is to tell the reader that:
All the content it should interpret always ends with a newline character, even if quotes are not closed properly (no multi-line support)
Alternatively, to enforce a certain limit (in bytes) on the length of a CSV line
Is there some clear way to do this in SuperCSV using CsvListReader (preferred in my case)?
That issue has been reported, and I'm working on some enhancements (for a future major release) at the moment that should make both options a bit easier.
For now, you'll have to supply your own Tokenizer to the reader (so Super CSV uses yours instead of its own). I'd suggest taking a copy of Super CSV's Tokenizer and modifying it with your changes. That way you don't have to modify Super CSV itself, and you won't waste time reimplementing the parsing that already works.
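For illustration only, a stripped-down tokenizer that never lets a record span physical lines might look roughly like this (the interface names follow Super CSV 2.x's ITokenizer, and the quote handling is deliberately naive - a real implementation should start from a copy of Super CSV's own Tokenizer):

import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;
import java.util.List;

import org.supercsv.exception.SuperCsvException;
import org.supercsv.io.ITokenizer;

public class SingleLineTokenizer implements ITokenizer {

    private final LineNumberReader reader;
    private final char delimiter;
    private String untokenizedRow;

    public SingleLineTokenizer(Reader reader, char delimiter) {
        this.reader = new LineNumberReader(reader);
        this.delimiter = delimiter;
    }

    @Override
    public boolean readColumns(List<String> columns) throws IOException {
        columns.clear();
        String line = reader.readLine(); // a record never spans physical lines
        if (line == null) {
            return false; // end of file
        }
        untokenizedRow = line;
        StringBuilder column = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // naive: no support for escaped quotes
            } else if (c == delimiter && !inQuotes) {
                columns.add(column.toString());
                column.setLength(0);
            } else {
                column.append(c);
            }
        }
        if (inQuotes) {
            // the quote was never closed on this line: fail fast instead of
            // silently swallowing the rest of the file
            throw new SuperCsvException("unclosed quote on line " + getLineNumber());
        }
        columns.add(column.toString());
        return true;
    }

    @Override
    public String getUntokenizedRow() {
        return untokenizedRow;
    }

    @Override
    public int getLineNumber() {
        return reader.getLineNumber();
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}

You would then wire it in with something like new CsvListReader(new SingleLineTokenizer(fileReader, ','), CsvPreference.STANDARD_PREFERENCE).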
I have a CSV file that I've introduced into my pipeline (for testing purposes) in two ways. First, I'm using GetFile to read it in from the server file system. Second, I'm using GenerateFlowFile. The content of these files is identical; I copied and pasted the content from the GetFile output to insert as text into GenerateFlowFile. Yet, when I run these through a ReplaceText processor, I am seeing different results.
The file from GenerateFlowFile is working as expected, and the regex string in ReplaceText is being found and replaced with an empty string exactly as I want. But, the file from GetFile is returning a file with no change after running through ReplaceText. How is this possible, and how can I fix this?
I tried to create a reproducible example, but I'm only seeing the issue with my data and can't replicate it with non-PII data. If it makes a difference, the regex used in ReplaceText is ^.*"\(Line.*,\n and the replacement value is set to an empty string. Essentially, I want to drop the extraneous first line.
A common problem in Big Data is getting data into a Big Data-friendly format (Parquet or TSV).
In Spark, wholeTextFiles, which currently returns RDD[(String, String)] (path -> whole file as a string), is a useful method for this, but it causes many issues when the files are large (mainly memory issues).
In principle it ought to be possible to write a method as follows using the underlying Hadoop API:
def wholeTextFilesIterators(path: String): RDD[(String, Iterator[String])]
where the iterator yields the lines of the file (assuming newline as the delimiter) and encapsulates the underlying file reading and buffering.
After reading through the code for a while I think a solution would involve creating something similar to WholeTextFileInputFormat and WholeTextFileRecordReader.
UPDATE:
After some thought, this probably means also implementing a custom org.apache.hadoop.io.BinaryComparable so the iterator can survive a shuffle (it's hard to serialise the iterator as it holds a file handle).
See also https://issues.apache.org/jira/browse/SPARK-22225 and Spark - Obtaining file name in RDDs
As per Hyukjin's comment on the JIRA, something close to what is wanted is given by
spark.read.format("text").load("...").selectExpr("value", "input_file_name()")
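For reference, a fuller sketch of that approach (Spark 2.x SQL API; the path and app name are just placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, input_file_name}

val spark = SparkSession.builder.appName("lines-with-file-names").getOrCreate()

// One row per line of input; input_file_name() tags each row with its
// source file, so whole files never have to be materialised as single strings.
val lines = spark.read.format("text")
  .load("hdfs:///data/*.txt")
  .select(input_file_name().as("path"), col("value"))

lines.show(5, truncate = false)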
Seems like it must be easy, but I just can't figure it out. How do you delete the very last character of a file using Ruby IO?
I took a look at the answer for deleting the last line of a file with Ruby but didn't fully understand it, and there must be a simpler way.
Any help?
There is File.truncate:
truncate(file_name, integer) → 0
Truncates the file file_name to be at most integer bytes long. Not available on all platforms.
So you can say things like:
File.truncate(file_name, File.size(file_name) - 1)
That should truncate the file with a single system call to adjust the file's size in the file system, without copying anything. Keep in mind that it removes the final byte; if the last character is a multi-byte one, you'll need to subtract its byte length instead of 1.
Note the "not available on all platforms" caveat, though. File.truncate should be available on anything unixy (such as Linux or OS X); I can't say anything useful about Windows support.
I assume you are referring to a text file. The usual way of changing such a file is to read it, make the changes, then write a new file:
text = File.read(in_fname)
File.write(out_fname, text[0..-2])
Insert the name of the file you are reading from for in_fname and the name of the file you are writing to for out_fname. They can be the same file, but if that's the intent it's safer to write to a temporary file, copy the temporary file to the original file, then delete the temporary file (see the sketch below). That way, if something goes wrong before the operations are completed, you will probably still have either the original or the temporary file. text[0..-2] is a string comprised of all characters read except for the last one. You could alternatively do this:
File.write(out_fname, File.read(in_fname, File.stat(in_fname).size-1))
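The safer temporary-file route mentioned above might look like this (a sketch; the .tmp suffix is arbitrary):

require "fileutils"

# Write the trimmed content to a temporary file first, then move it over
# the original, so a failure mid-write can't destroy the original file.
def trim_last_char(fname)
  text = File.read(fname)
  tmp = fname + ".tmp"
  File.write(tmp, text[0..-2]) # everything except the last character
  FileUtils.mv(tmp, fname)
end

trim_last_char("example.txt")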
I wrote a little program that creates a hash called movies. Then I can add, update, delete, and display all current movies in the hash by typing the title.
Instead of having it start a new hash each time, I want the program to auto-load the hash from a file on startup (creating the file if it doesn't exist), save anything added to that file, and, when an entry is updated or deleted, update or delete the corresponding key/value pair in the file.
I have no idea how to go about doing this.
After reading a lot of the comments, I have decided that maybe I should do this with SQL instead; it seems like a much better approach!
You can't store Ruby objects directly on the disk; you will first need to convert them to some sequence of bytes (i.e. a string). This is called serialization, and there are several different ways to do it and several different formats the data could be in. I think I would recommend JSON, but you might also want to try YAML or Marshal.
Any of those libraries will allow you to convert your hash into a string and allow you to convert that same string back into a hash. Then you can use Ruby's File class to save and load that string from the disk.
This should get you pointed in the right direction. From here you can search for more specific things like "how do I convert a hash to JSON" or "how do I write a string to a file".
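For example, a minimal sketch with JSON (the file name and hash contents are just placeholders):

require "json"

MOVIES_FILE = "movies.json"

# Load the hash from disk, or start with an empty one if the file
# doesn't exist yet.
def load_movies
  File.exist?(MOVIES_FILE) ? JSON.parse(File.read(MOVIES_FILE)) : {}
end

# Convert the hash to a JSON string and write it out.
def save_movies(movies)
  File.write(MOVIES_FILE, JSON.pretty_generate(movies))
end

movies = load_movies
movies["The Matrix"] = "1999"
save_movies(movies)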
You have the ability to marshal your data in a few ways.
YAML if you would like to use a gem, or JSON. There is also the built-in Marshal.
RI tells us:
Marshal
(from ruby site)
-----------------------------------------------------------------------------
The marshaling library converts collections of Ruby objects into a byte stream, allowing them to be stored outside the currently active script. This data may subsequently be read and the original objects reconstituted.
Marshaled data has major and minor version numbers stored along with the object information. In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number. If Ruby's "verbose" flag is set (normally using -d, -v, -w, or --verbose) the major and minor numbers must match exactly. Marshal versioning is independent of Ruby's version numbers. You can extract the version by reading the first two bytes of marshaled data.
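In practice that boils down to something like this (a quick sketch; the file name is arbitrary):

movies = { "Alien" => "1979" }
File.binwrite("movies.dat", Marshal.dump(movies))   # serialize to disk
restored = Marshal.load(File.binread("movies.dat")) # and reconstitute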
And I will leave it at that for Marshal. But there is a bit more documentation there.
You can also use IO#puts to write to a file, and then modify that file to load later, which I use sometimes for config settings. Why use YAML or another external format when plain Ruby is easy enough for a user to modify? You use YAML when it needs to be more generally accessible, as the Tin Man points out.
For example, this file is the sample file, but it is intended for interactive editing (with constraints, of course), and it is simply valid Ruby. It gets read by a Ruby program and is a valid object (in this case a Hash stored in a constant).
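As an illustration (the file and constant names here are made up), such a file can be nothing more than:

# settings.rb - plain Ruby, simple enough for a user to edit by hand
SETTINGS = {
  verbose: true,
  retries: 3,
}

and the program reads it back with something like:

load "settings.rb" # defines SETTINGS
puts SETTINGS[:retries]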
ruby somescript.rb somehugelonglistoftextforprocessing
Is this a bad idea? Should I rather create a separate flat file containing the somehugelonglistoftextforprocessing and let somescript.rb read it?
Does it matter if the script argument is very, very long text (1 KB~300 KB)? What are some problems that can arise, if any?
As long as the limits of your command-line handling code (e.g., bash or ruby itself) are not exceeded, you should have no technical problems in doing this.
Whether it's a good idea is another matter. Do you really want to have to type in a couple of hundred kilobytes every single time you run your program? Do you want to have to remember to put quotes around your data if it contains spaces?
There are a number of ways I've seen this handled which you may want to consider (this list is by no means exhaustive):
Change your code so that, if there are no arguments, it reads the information from standard input - this will allow you to do either
ruby somescript.rb myData
or
ruby somescript.rb < myFile.txt
Use a special character to indicate file input (I've seen # used in this way). So,
ruby somescript.rb myData
would use the data supplied on the command line whilst
ruby somescript.rb #myFile.txt
would get the data from the file.
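A sketch putting the two conventions together (deliberately simple; the names are made up):

# Read from stdin if there is no argument, from a file if the argument
# starts with '#' (quote it - some shells treat a leading '#' as a
# comment), otherwise treat the argument itself as the data.
raw = ARGV.first

data =
  if raw.nil?
    $stdin.read             # ruby somescript.rb < myFile.txt
  elsif raw.start_with?("#")
    File.read(raw[1..-1])   # ruby somescript.rb '#myFile.txt'
  else
    raw                     # ruby somescript.rb myData
  end

puts "read #{data.bytesize} bytes"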
My advice would be to use the file-based method for that size of data and allow an argument to be used if specified. This covers both possible scenarios:
Lots of data, put it in a file so you won't have to retype it every time you want to run your command.
Not much data, allow it to be passed as an argument so that you don't have to create a file for something that's easier to type in on the command line.