opencsv RFC4180Parser readNext is too slow when there are many '\n' characters in a cell (high CPU)

We recently encountered very slow parsing when downloading CSV files, accompanied by soaring CPU usage and constant young GC. We use RFC4180Parser to parse the files, and the slowdown appears to be caused by cells containing a large number of '\n' characters. Has anyone encountered this, and how can it be solved?
I read the opencsv source code. It looks like the line reader decides where a record ends based on '\n', so a quoted cell with many embedded newlines is assembled from many physical line reads, and each of those reads slows the parse down. Is it possible to change the newline symbol the reader looks for?
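For reference, a minimal sketch of the kind of setup being described, assuming the opencsv 5.x API; the file name and the per-row processing are placeholders:

import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import com.opencsv.RFC4180Parser;
import com.opencsv.RFC4180ParserBuilder;
import java.io.FileReader;
import java.io.Reader;

public class Rfc4180Demo {
    public static void main(String[] args) throws Exception {
        RFC4180Parser parser = new RFC4180ParserBuilder().build();
        try (Reader in = new FileReader("data.csv");                 // placeholder file
             CSVReader reader = new CSVReaderBuilder(in)
                     .withCSVParser(parser)
                     .build()) {
            String[] row;
            // A quoted cell may contain embedded '\n'; readNext() then has to keep
            // consuming physical lines until the closing quote appears, which is
            // where the time reportedly goes when a cell holds many newlines.
            while ((row = reader.readNext()) != null) {
                // process row
            }
        }
    }
}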

Related

Protobuf message - parsing difference between binary and text files

While implementing a protocol buffer application, I tried to work with text pbtxt files to ease my programming. The idea was to switch to the pb binary format afterwards, once I had a clearer understanding of the API. (I am working in C++.)
I got my application working by importing the file with TextFormat::Parse (the content of the file came from TextFormat::Print). I then generated the corresponding binary file, which I tried to import with myMessageVariable.ParseFromCodedStream (the file is not compressed). But I noticed that only a very small part of the message is imported. myMessageVariable.IsInitialized returns true, so I guess the library "thinks" it has completely imported the file.
So my question: is there something different in the way the files are imported that could make the import "half-fail"? (Besides the obvious reason that one is binary and the other is text?) And what can we do about it?
There are a few differences in reading text data and reading binary data:
Text files sometimes use automatic linefeed conversion (\r\n vs. \n), especially on Windows platforms. This has to be disabled by opening the file in binary mode.
Binary files can contain null bytes at any point. Some text processing functions stop reading at the first null byte.
It could help if you can determine more precisely how much of the message gets parsed. Then you can look at what kind of bytes are near the problem point, using e.g. a hex editor.
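For what it's worth, the safest pattern is to read the binary file as raw bytes and hand them straight to the parser, so no text-mode newline translation or null-byte handling can interfere. The question is C++ (where this means opening the stream with std::ios::binary), but here is a rough Java-flavoured sketch of the same idea; MyMessage stands in for a hypothetical generated message class and the path is a placeholder:

import com.google.protobuf.InvalidProtocolBufferException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BinaryParseCheck {
    public static void main(String[] args) throws Exception {
        // Read the whole file as raw bytes (never through a text reader),
        // so \r\n translation and embedded null bytes cannot corrupt the data.
        byte[] raw = Files.readAllBytes(Paths.get("message.pb"));    // placeholder path

        try {
            MyMessage msg = MyMessage.parseFrom(raw);                // hypothetical generated class
            // Rough diagnostic: compare the file size with the re-serialized size
            // to get a feel for how much of the input actually made it into the message.
            System.out.println("file bytes: " + raw.length
                    + ", re-serialized bytes: " + msg.getSerializedSize());
        } catch (InvalidProtocolBufferException e) {
            System.err.println("parse failed: " + e.getMessage());
        }
    }
}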

Hadoop: Control Characters in output inspiring compression

It's Friday, I'm super tired, and I was up against a really strange issue.
In my Reducer, I have a Text output. It contains a string with a custom delimiter, to be split on the next MapReduce job.
Thinking I was clever, the delimiter I used was a control character, U+0002.
When it was output, the file came out compressed; it had not been compressed before I introduced the splitting. I very specifically need to avoid compression, for my own reasons. I tried turning compression off manually, but to no avail. I was very frustrated for an hour or two, trying everything I could think of.
The answer is... don't use control characters in your output. Or at least that's the answer as far as I can tell! I'd be curious to hear if anyone else has come up against the same issue.
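For reference, a minimal driver sketch of the two things discussed above: the standard Hadoop knob for disabling file output compression (which reportedly did not help here) and swapping the U+0002 delimiter for a plain tab, which is the fix actually suggested. Everything except the Hadoop classes and methods is a made-up placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "reducer-with-plain-delimiter");   // placeholder job name

        // Explicitly turn file output compression off.
        FileOutputFormat.setCompressOutput(job, false);

        // Use an ordinary visible delimiter (e.g. tab) instead of U+0002,
        // so downstream tools don't mistake the output for binary data.
        job.getConfiguration().set("my.custom.delimiter", "\t");           // placeholder key read by the Reducer

        // ... mapper/reducer classes, input/output paths, and job submission as usual.
    }
}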

Limit CsvListReader to one line

I am working on an application which processes large CSV files (several hundred MB). Recently I faced a problem which at first looked like a memory leak in the application, but after some investigation it turned out to be a combination of a badly formatted CSV file and CsvListReader's attempt to parse a never-ending line. As a result, I got the following exception:
at java.lang.OutOfMemoryError.<init>(<unknown string>)
at java.util.Arrays.copyOf(<unknown string>)
Local Variable: char[]#13624
at java.lang.AbstractStringBuilder.expandCapacity(<unknown string>)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(<unknown string>)
at java.lang.AbstractStringBuilder.append(<unknown string>)
at java.lang.StringBuilder.append(<unknown string>)
Local Variable: java.lang.StringBuilder#3
at org.supercsv.io.Tokenizer.readStringList(<unknown string>)
Local Variable: java.util.ArrayList#642
Local Variable: org.supercsv.io.Tokenizer#1
Local Variable: org.supercsv.io.PARSERSTATE#2
Local Variable: java.lang.String#14960
at org.supercsv.io.CsvListReader.read(<unknown string>)
By analyzing the heap dump and the CSV file based on the dump findings, I noticed that one of the columns in one of the CSV lines was missing its closing quote, which obviously resulted in the reader trying to find the end of the line by appending file content to an internal string buffer until there was no more heap memory.
Anyway, that was the problem and it was due to badly formatted CSV - once I removed the critical line, the problem disappeared. What I want to achieve is to tell the reader that:
All the content it interprets always ends with a newline character, even if quotes are not closed properly (i.e. no multi-line support)
Alternatively, to enforce a certain limit (in bytes) on the length of a CSV line
Is there some clean way to do this in Super CSV using CsvListReader (preferred in my case)?
That issue has been reported, and I'm working on some enhancements (for a future major release) at the moment that should make both options a bit easier.
For now, you'll have to supply your own Tokenizer to the reader (so Super CSV uses yours instead of its own). I'd suggest taking a copy of Super CSV's Tokenizer and modifying it with your changes. That way you don't have to modify Super CSV itself, and you won't waste time.
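For the wiring part: assuming a Super CSV version whose CsvListReader exposes the (ITokenizer, CsvPreference) constructor, plugging in the modified copy looks roughly like the sketch below. MaxLineTokenizer is the hypothetical modified copy of Tokenizer (with whatever line-length cap you add); its constructor and the file name are placeholders:

import java.io.FileReader;
import java.util.List;
import org.supercsv.io.CsvListReader;
import org.supercsv.io.ICsvListReader;
import org.supercsv.prefs.CsvPreference;

public class LimitedReadDemo {
    public static void main(String[] args) throws Exception {
        // MaxLineTokenizer: a copy of org.supercsv.io.Tokenizer that gives up
        // (or truncates) once a single row exceeds a configured number of
        // characters -- a hypothetical class, written as suggested above.
        MaxLineTokenizer tokenizer = new MaxLineTokenizer(
                new FileReader("big.csv"),                  // placeholder file
                CsvPreference.STANDARD_PREFERENCE,
                1_000_000);                                 // max chars per row

        try (ICsvListReader reader = new CsvListReader(
                tokenizer, CsvPreference.STANDARD_PREFERENCE)) {
            List<String> row;
            while ((row = reader.read()) != null) {
                // process row
            }
        }
    }
}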

Is there an elegant way to parse a text file *backwards*? [duplicate]

Possible Duplicate:
How to read a file from bottom to top in Ruby?
In the course of working on my Ruby program, I had the eureka moment that it would be much simpler to write if I were able to parse the text files backwards rather than forwards.
It seems like it would be simple to read the text file, line by line, into an array, then write the lines backwards into a temporary text file, parse this temp file forwards (which would now effectively be going backwards), make any necessary changes, re-catalog the resulting lines into an array, and write them backwards a second time, restoring the original direction, before saving the modifications as a new file.
While feasible in theory, I see several problems with it in practice, the biggest of which is that if the size of the text file is very large, a single array will not be able to hold the entirety of the document at once.
Is there a more elegant way to accomplish reading a text file backwards?
If you are not using lots of UTF-8 characters, you can use the Elif library, which works just like File.open: just load Elif and replace File.open with Elif.open.
Elif.open('read.txt', "r").each_line { |s|
  puts s
}
This is a great library, but the only problem I am experiencing right now is that it has several problems with line endings in UTF-8. I now have to rethink a way to iterate over my files.
Additional Details
While googling for a way to solve this problem for UTF-8 reverse file reading, I found an approach that is already implemented by the File library.
To read a file backwards, you can try the following code:
File.readlines('manga_search.test.txt').reverse_each { |s|
  puts s
}
This does a good job as well.
There's no software limit on the size of a Ruby array. There are some memory limitations, though: see "Array size too big - ruby".
Your approach would work much faster if you can read everything into memory, operate there and write it back to disk. Assuming the file fits in memory of course.
Let's say your lines are 80 chars wide on average, and you want to read the last 100 lines. If you want it to be efficient (as opposed to implemented with the least amount of code), go back 80*100 bytes from the end (using seek with the "relative to end" option), then read ONE line (it is likely a partial one, so throw it away). Remember your current position via tell, then read everything up to the end.
You now have either more or fewer than 100 lines in memory. If fewer, go back (100 + 1.5*no_of_missing_lines)*80 bytes and repeat the above steps, but this time only read lines until you reach your remembered position from before. Rinse and repeat.
How about just going to the end of the file and iterating backwards over each char until you reach a newline, read the line, and so on? Not elegant, but certainly effective.
Example: https://gist.github.com/1117141
I can't think of an elegant way to do something so unusual as this, but you could probably do it using the file-tail library. It uses random access files in Ruby to read it backwards (and you could even do it yourself, look for random access at this link).
You could go through the file once forwards, storing only the byte offset of each \n instead of the full string for each line. Then you traverse your offset array backwards and can use ios.sysseek and ios.sysread to get lines out of the file. Unless your file is truly enormous, that should alleviate the memory issue.
Admittedly, this absolutely fails the elegance test.
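A sketch of that byte-offset index idea, written in Java here for concreteness (the Ruby version uses IO#sysseek and IO#sysread exactly as described); the file name is a placeholder. Splitting on the byte 0x0A is safe even for UTF-8, since that byte never occurs inside a multi-byte sequence:

import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BackwardLines {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile("big.log", "r")) {   // placeholder file
            // Forward pass: remember only the byte offset at which each line starts.
            // (Byte-at-a-time reads are kept for brevity; buffer this in real code.)
            List<Long> lineStarts = new ArrayList<>();
            lineStarts.add(0L);
            long pos = 0;
            int b;
            while ((b = f.read()) != -1) {
                pos++;
                if (b == '\n') {
                    lineStarts.add(pos);
                }
            }
            long fileLen = f.length();

            // Backward pass: seek to each line start, newest first, and read that line.
            for (int i = lineStarts.size() - 1; i >= 0; i--) {
                long start = lineStarts.get(i);
                long end = (i + 1 < lineStarts.size()) ? lineStarts.get(i + 1) : fileLen;
                if (end <= start) {
                    continue;   // a trailing newline produces an empty final entry
                }
                byte[] buf = new byte[(int) (end - start)];
                f.seek(start);
                f.readFully(buf);
                System.out.print(new String(buf, StandardCharsets.UTF_8));
            }
        }
    }
}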

Why should I have to bother putting a linefeed at the end of every file?

I've occasionally encountered software - including compilers - that refuses to accept or properly handle text files that aren't properly terminated with a newline. I've even encountered explicit errors of the form:
no newline at the end of the file
...which would seem to indicate that they're explicitly checking for this case and then rejecting it just to be stubborn.
Am I missing something here? Why would - or should - anything care whether or not a file ends with a seemingly-superfluous bit of whitespace?
Historically, at least in the Unix world, "newline" or rather U+000A Line Feed was a line terminator. This stands in stark contrast to the practice in Windows for example, where CR+LF is a line separator.
A naïve solution for reading every line in a file would be to append characters to a buffer until an LF is encountered. If done really stupidly, this would ignore the last line in the file if it isn't terminated by LF.
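To make that concrete, here is a deliberately naive sketch of such a reader (in Java for illustration; the file name is a placeholder): it only emits a line when it sees LF, so a final line without a trailing LF silently disappears.

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class NaiveLineReader {
    public static void main(String[] args) throws IOException {
        try (Reader in = new FileReader("source.txt")) {     // placeholder file
            StringBuilder line = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                if (c == '\n') {
                    System.out.println("line: " + line);
                    line.setLength(0);
                } else {
                    line.append((char) c);
                }
            }
            // The bug described above: if the file does not end with '\n',
            // whatever is left in 'line' at this point is never emitted,
            // so the last line is silently lost.
        }
    }
}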
Another thing to consider are macro systems that allow including files. A line such as
%include "foo.inc"
might be replaced by the contents of the mentioned file where, if the last line wasn't ended with an LF, it would get merged with the next line. And yes, I've seen this behavior with a particular macro assembler for an embedded platform.
Nowadays I firmly believe that (a) it's a relic of ancient times and (b) I haven't seen modern software that can't handle it, and yet we still carry around numerous editors on Unix-like systems which helpfully put one byte more than needed at the end of a file.
Generally I would say that a lack of a newline at the end of a source file would mean that something went wrong in the editor or source code control client and not all of the code in the buffer got flushed. While it's likely that this would result in other errors, knowing that something likely went wrong in the editor/SCM and code may be missing is a pretty useful bit of knowledge. Certainly something that I would want to check.

Resources