So I was splitting some large files, and everything worked properly until an 81 GB file came along. The split command seems to have done its job, but the last files have unexpected names (see the bottom right of the screenshot).
And I'm using the command like this:
split -b 125M ./2014.txt 2014/2014_
Does anyone know why it created 2014_zaaa instead of 2014_za?
You can only have 676 (26 × 26) files named [a-z][a-z], while your command required more names than the default two-letter suffixes could provide.
Here are some options for what split could do:
Crash.
This is the behavior mandated by POSIX, and followed by macOS.
Start writing larger suffixes.
This is a bad choice because after _zz comes _aaa, but now the files will show up in the wrong order in ls and cat * will no longer join them in the correct order.
Save the last range, _z, for longer suffixes.
This is a good choice because after _yz comes _zaaa, which has room to grow while still remaining in alphabetical order. This is what GNU split does, and it is the behavior you're seeing.
If you want all the names to be uniform without triggering any of these behaviors, just use a larger suffix length with -a 6 to ensure you have enough room.
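For example, a hypothetical re-run of the command above with six-character suffixes:
split -a 6 -b 125M ./2014.txt 2014/2014_
That names the pieces 2014_aaaaaa, 2014_aaaaab, and so on, which stays in alphabetical order far beyond 676 files.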
I have the following problem to solve:
There is a flat file to read, but the information is unfortunately spread over two rows, so I need to merge those two rows.
I thought about creating an incomplete object first and then adding the information from the next row, then moving on to the next pair, but I don't really see how to manage that.
Is there a way to read two lines and then process them together, or to carry an object over from one step to the next? I'm quite confused.
Any hint would be appreciated. Thanks.
This is a perfect use case for a SingleItemPeekableItemReader. Check out this older answer for an example.
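A minimal sketch of the idea, assuming the delegate is a FlatFileItemReader<String> wrapped in a SingleItemPeekableItemReader<String> and that every logical record spans exactly two physical rows; the class name and the string-joining are placeholders, and the linked answer shows a fuller example:
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.support.SingleItemPeekableItemReader;

// Hypothetical reader that turns every pair of consecutive lines into one logical record.
public class TwoLineRecordReader implements ItemReader<String> {

    private SingleItemPeekableItemReader<String> delegate;

    public void setDelegate(SingleItemPeekableItemReader<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public String read() throws Exception {
        String first = delegate.read();
        if (first == null) {
            return null;                                   // end of file
        }
        // peek() shows the next line without consuming it, so you can decide
        // whether it completes the current record before actually reading it.
        String second = (delegate.peek() != null) ? delegate.read() : "";
        // Return both physical rows as one item; map it to a domain object
        // in a processor or with a custom LineMapper in a real job.
        return first + " " + second;
    }
}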
I have a Hadoop application that, depending on a parameter, only needs certain (few!) input files from the input directory. My question is: where is the best place (read: as early as possible) to skip those files? Right now I have customized a RecordReader to take care of that, but I was wondering whether I could skip those files sooner. In my current implementation Hadoop still has a huge overhead due to the irrelevant files.
Maybe I should add that it is very easy to see whether I need a certain input file: if the filename starts with the parameter, it is needed. Structuring my input directory hierarchically might be a solution, but not a very likely one for my project, since every file would end up alone in its own directory.
I'd suggest filtering the input files by applying the appropriate pattern to the input paths, as described here: https://stackoverflow.com/a/13454344/1050422
Note that this solution doesn't consider subdirectories; alter it to be able to recursively visit all subdirectories within the base path.
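A rough sketch of that kind of filter (the prefix is a made-up parameter value, and note that Hadoop also passes directories through the filter, which is why the subdirectory caveat above matters):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: keep only inputs whose file name starts with the parameter prefix.
public class PrefixPathFilter implements PathFilter {

    private static final String PREFIX = "paramA";   // made-up parameter value

    @Override
    public boolean accept(Path path) {
        return path.getName().startsWith(PREFIX);
    }
}

// Registered in the driver with something like:
// FileInputFormat.setInputPathFilter(job, PrefixPathFilter.class);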
I've had success using the setInputPaths() method on TextInputFormat to specify a single String containing comma-separated file names.
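Something like this in the driver, given an already configured Job named job (the file names are made up; setInputPaths is inherited from FileInputFormat):
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Pass only the files that are actually needed, comma-separated.
TextInputFormat.setInputPaths(job, "/input/paramA_00.txt,/input/paramA_01.txt");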
I have two files with different data formats in HDFS. What would the job setup look like if I needed to reduce across both data files?
For example, imagine the common word-count problem, where in one file you have a space as the word delimiter and in the other file an underscore. In my approach I need different mappers for the various file formats, which then feed into a common reducer.
How can I do that?
Or is there a better solution than mine?
Check out the MultipleInputs class, which solves this exact problem. It's pretty neat: for each input path you pass in the InputFormat and, optionally, the Mapper class.
If you are looking for code examples on Google, search for "Reduce-side join", which is where this method is typically used.
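A driver-side sketch of that setup, assuming an existing Job named job (the paths, mapper classes, and reducer class are hypothetical):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// One mapper per input format, all feeding the same reducer.
MultipleInputs.addInputPath(job, new Path("/data/space_delimited"),
        TextInputFormat.class, SpaceDelimitedMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/underscore_delimited"),
        TextInputFormat.class, UnderscoreDelimitedMapper.class);
job.setReducerClass(WordCountReducer.class);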
On the other hand, sometimes I find it easier to just use a hack. For example, if you have one set of files that is space-delimited and another that is underscore-delimited, load both with the same mapper and TextInputFormat and tokenize on both possible delimiters. Count the number of tokens in the two resulting sets and, in the word-count example, pick the one with more tokens.
This also works if both files use the same delimiter but have a different number of standard columns. You can tokenize on the comma and then see how many tokens there are: if there are, say, 5 tokens the record is from data set A; if there are 7, it is from data set B.
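A sketch of that hack for the space/underscore word-count case (the class name is made up):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical word-count mapper that copes with either delimiter.
public class EitherDelimiterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] bySpace = line.split(" ");
        String[] byUnderscore = line.split("_");
        // The delimiter that yields more tokens is the one this line actually uses.
        String[] tokens = bySpace.length >= byUnderscore.length ? bySpace : byUnderscore;
        for (String token : tokens) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}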
In the course of working on my Ruby program, I had the Eureka Moment that it would be much simpler to write if I were able to parse the text files backwards, rather than forward.
It seems like it would be simple to read the text file, line by line, into an array, then write the lines backwards into a temporary file, parse this temp file forwards (which would now effectively be going backwards), make any necessary changes, read the resulting lines back into an array, and write them backwards a second time, restoring the original direction, before saving the modifications as a new file.
While feasible in theory, I see several problems with it in practice, the biggest of which is that if the size of the text file is very large, a single array will not be able to hold the entirety of the document at once.
Is there a more elegant way to accomplish reading a text file backwards?
If you are not using lots of UTF-8 characters, you can use the Elif library, which works just like File.open: just load Elif and replace File.open with Elif.open.
Elif.open('read.txt', "r").each_line{ |s|
puts s
}
It's a great library, but the only problem I'm experiencing right now is that it has several problems with line endings in UTF-8. I now have to rethink a way to iterate over my files.
Additional Details
While googling for a way to solve this UTF-8 reverse-file-reading problem, I found an approach that is already supported by the File library.
To read a file backwards you can try the following code:
File.readlines('manga_search.test.txt').reverse_each { |s|
  puts s
}
This can do a good job as well
There's no software limit on Ruby arrays. There are some memory limitations, though: Array size too big - ruby
Your approach would work much faster if you can read everything into memory, operate on it there, and write it back to disk, assuming the file fits in memory, of course.
Let's say your lines are 80 chars wide on average, and you want to read 100 lines. If you want it efficient (as opposed to implemented with the least amount of code), then go back 80*100 bytes from the end (using seek with the "relative to end" option), then read ONE line (this is likely a partial one, so throw it away). Remember your current position via tell, then read everything up until the end.
You now have either more or fewer than 100 lines in memory. If fewer, go back (100 + 1.5 * no_of_missing_lines) * 80 bytes and repeat the above steps, but only read lines until you reach the position you remembered before. Rinse and repeat.
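A rough Ruby sketch of the first pass of that approach (the file name and numbers are made up, and the go-back-further refinement is only hinted at in the comment):
lines_wanted = 100
avg_line_len = 80

File.open('big.log') do |f|
  offset = [lines_wanted * avg_line_len, f.size].min
  f.seek(-offset, IO::SEEK_END)         # jump to roughly 100 lines before EOF
  f.readline unless offset == f.size    # throw away the (likely partial) first line
  start_pos = f.tell                    # remember where this chunk begins
  tail = f.readlines                    # read everything from here to the end
  # If tail.size < lines_wanted, seek back further and repeat, this time reading
  # only up to start_pos, as described above.
  puts tail.last(lines_wanted)
end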
How about just going to the end of the file and iterating backwards over each char until you reach a newline, read the line, and so on? Not elegant, but certainly effective.
Example: https://gist.github.com/1117141
I can't think of an elegant way to do something as unusual as this, but you could probably do it using the file-tail library. It uses random access files in Ruby to read the file backwards (and you could even do it yourself; look for random access at this link).
You could go through the file once forwards, storing only the byte offset of each \n instead of the full string for each line. Then you traverse your offset array backwards and can use ios.sysseek and ios.sysread to pull the lines out of the file. Unless your file is truly enormous, that should alleviate the memory issue.
Admittedly, this absolutely fails the elegance test.
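For what it's worth, a minimal sketch of that offset-index idea (the file name is made up; the file is reopened for the backward pass so buffered and unbuffered reads aren't mixed on the same handle):
offsets = [0]
File.open('big.log', 'rb') do |f|
  # Forward pass: record the byte offset at which every line starts.
  f.each_line { |line| offsets << offsets.last + line.bytesize }
end
offsets.pop   # the final entry is end-of-file, not the start of a line

File.open('big.log', 'rb') do |f|
  # Backward pass: jump to each line start and read exactly that line.
  (offsets.size - 1).downto(0) do |i|
    length = (offsets[i + 1] || f.size) - offsets[i]
    f.sysseek(offsets[i])
    puts f.sysread(length)
  end
end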