Hadoop: Control Characters in output inspiring compression

It's Friday, I'm super tired, and I was up against a really strange issue.
In my Reducer, I have a Text output. It contains a string with a custom delimiter, to be split on the next MapReduce job.
Thinking I was clever, the delimiter I used was a control character, U+0002.
When it was output, the file was compressed. It was not compressed before I introduced the delimiter for splitting. I very specifically need to avoid compression for my own reasons. I tried turning compression off manually, but to no avail. I spent a frustrating hour or two trying everything I could think of.

The answer is... don't use control characters in your output. Or at least that's the answer as far as I can tell! I'd be curious to hear if anyone else has come up against the same issue.

Related

Odd Characters in Ruby Cron Output

I have some Ruby scripts that I run overnight. They output to a cronlog.txt as a report I can check daily. This is using the puts method. It has some characters output which don't show up in the terminal when run.
How can I get rid of these odd characters? I am assuming I need to force some kind of formatting.
Here's a sample of the output:
[H[2J[0;31;49m### wobbly/assign_city_pc.rb 2020_Sep_01 3:09[0m
[0;33;49mlocations count: 16484[0m
[0;33;49mgrabbed count: 150[0m
[0;33;49mtargets count: 16334[0m
Desired output:
### wobbly/assign_city_pc.rb 2020_Sep_01 3:09
locations count: 16484
grabbed count: 150
targets count: 16334
Solved:
As per @Stefan, ANSI escape codes.
I took out the gem 'colorize' from that script.
Now, as @Stefan properly noted, it was the resulting codes that gave the characters in my situation. There are other gems that also could do this, but I haven't used them. I don't know what their names are. I don't wish to use them. I will not research them. I am assuming there are others that accomplish this. I'm also thinking there are other gems that provide other markup as well. I haven't used them. I don't wish to use them.
None of the aforementioned gems I will research in order to post a novel about new markup. I used one, and I stopped using it and the characters went away. I can safely assume that this specific change is what brought on the fix.
As you can see, I am assuming that some gems providing "ANSI escape codes" would be a sufficient answer to this question, as it was evident more characters were being added. My original point was to say "I took out anything that generated more characters", but that wasn't good enough.
So the answer to the specific characters I had, was to take out colorize only. Any other new characters generated are outside of my knowledge and I will not be testing any further on something that was kindly answered by three words.

How to split a large CSV file into multiple files in Go?

I am a novice Go programmer, trying to learn Go's features. I want to split a large CSV file into multiple files in Go, with each file containing the header. How do I do this? I have searched everywhere but couldn't find the right solution. Any help in this regard will be greatly appreciated.
Also please suggest me a good book for reference.
Thanking You
Depending on your shell fu this problem might be better suited for common shell utilities but you specifically mentioned go.
Let's think through the problem.
How big is this CSV file? Are we talking 100 lines or is it 5 GB?
If it's smallish I typically use this:
http://golang.org/pkg/io/ioutil/#ReadFile
However, this package also exists:
http://golang.org/pkg/encoding/csv/
Regardless - let's return to the abstraction of the problem. You have a header (which is the first line) and then the rest of the document.
So what we probably want to do (if ignoring csv for the moment) is to read in our file.
Then we want to split the file body by all the newlines in it.
You can use this to do so:
http://golang.org/pkg/strings/#Split
You didn't mention it, but do you know how many files you want to split into, or would you rather split by line count or byte count? What's the actual limitation here?
Generally it's not going to be file count but if we pretend it is we simply want to divide our line count by our expected file count to give lines/file.
Now we can take slices of the appropriate size and write the file back out via:
http://golang.org/pkg/io/ioutil/#WriteFile
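The steps above can be sketched roughly as follows. This is a minimal sketch under a few assumptions, not a definitive implementation: it assumes the whole file fits in memory, it splits by row count rather than file count, and it parses with encoding/csv instead of strings.Split so that quoted fields containing newlines are not cut in half. It uses os.ReadFile and os.WriteFile, which replaced the ioutil functions linked above in Go 1.16. The name splitCSV, the 1000-row limit and the part_NNN.csv file names are illustrative choices only.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"os"
)

// splitCSV parses data as CSV and returns the contents of each output part.
// Every part starts with the header row, followed by at most rowsPerFile
// data rows. A file that contains only a header yields zero parts.
func splitCSV(data []byte, rowsPerFile int) ([][]byte, error) {
	if rowsPerFile < 1 {
		return nil, fmt.Errorf("rowsPerFile must be at least 1, got %d", rowsPerFile)
	}
	records, err := csv.NewReader(bytes.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, fmt.Errorf("input has no header row")
	}
	header, rows := records[0], records[1:]

	var parts [][]byte
	for start := 0; start < len(rows); start += rowsPerFile {
		end := start + rowsPerFile
		if end > len(rows) {
			end = len(rows)
		}
		var buf bytes.Buffer
		w := csv.NewWriter(&buf)
		if err := w.Write(header); err != nil {
			return nil, err
		}
		// WriteAll writes the rows and then flushes the writer.
		if err := w.WriteAll(rows[start:end]); err != nil {
			return nil, err
		}
		parts = append(parts, buf.Bytes())
	}
	return parts, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: splitcsv FILE.csv")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	parts, err := splitCSV(data, 1000) // 1000 rows per file is an arbitrary example
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for i, part := range parts {
		name := fmt.Sprintf("part_%03d.csv", i+1)
		if err := os.WriteFile(name, part, 0o644); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}
```

For a file in the multi-gigabyte range, reading everything at once is probably the wrong trade-off; reading records one at a time with csv.Reader.Read and starting a new output file every N rows would keep memory use flat.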
A trick I sometimes use to help me think through these things is to write down our mission statement.
"I want to split a large csv file into multiple files in go"
Then I start breaking that up into pieces but take the divide/conquer approach - don't try to solve the entire problem in one go - just break it up to where you can think about it.
Also - make gratuitous use of pseudo-code until you can comfortably write the real code itself. Sometimes it helps to just write a short comment inline with how you think the code should flow and then get it down to the smallest portion that you can code and work from there.
By the way - many of the golang.org packages have example links where you can literally run in your browser the example code and cut/paste that to your own local environment.
Also, I know I'll catch some haters with this - but as for books - imo - you are going to learn a lot faster just by trying to get things working rather than reading. Action trumps passivity always. Don't be afraid to fail.
Here is a package that might help. You can set the chunk size you need in bytes, and the file will be split into the appropriate number of chunks.

Reliably parse unpredictable CSV formats

I'm looking for some kind of solution to a problem we're having importing CSV files with Ruby. We keep running into all kinds of exceptions ranging from malformed lines to line ending problems. Right now we're using FasterCSV and have this hacky exception catching solution to try different combinations of delimiters and quotation styles. I don't like it.
All in all, it's an inelegant solution and it seems like this shouldn't be something we should have to deal with. I'm looking for a lib, in any language, that I can point to a file and it'll just figure out how it's formatted and give me the data I need from any CSV.
thanks
The Python csv module is pretty good at this: its csv.Sniffer class can guess the delimiter and quoting style from a sample of the file. However, when dealing with unpredictable CSV formats, I expect you'll have to do maintenance no matter what library you pick.

Utility to Stamp/Watermark Unicode Text Into a PDF

I am looking for a (preferably) command line utility to stamp/watermark unicode text content into a PDF document.
I tried PDF Stamp and a couple of others that I found over the net, but to no avail with Greek characters (e.g. ΓΔΘΛ become ÃÄÈË).
Many thanks for any help!
With sufficiently "odd" characters, you generally need to specify a font and an encoding. I suspect that at least one of the tools you experimented with has the capability to define such things.
Reading their docs, it looks like PDFStamp will let you specify a font, but not an encoding. That doesn't bode well. It might always pick "Identity-H" for system fonts... worth trying.
I must admit, I'm surprised. "Disappointed" even. Have you contacted their email support?
Once upon a time, iText shipped with a number of command line tools that were mostly intended as examples but were nonetheless useful. I suspect you could dig them out of the SVN archive on SourceForge and get them to build again, if your Java-fu is up to the task. Just be sure to use BaseFont.IDENTITY_H whenever you're given a choice of encodings for a font.

What is the best character to use as a delimiter in a custom batch syntax?

I've written a little program to download images to different folders from the web. I want to create a quick and dirty batch file syntax and was wondering what the best delimiter would be for the different variables.
The variables might include urls, folder paths, filenames and some custom messages.
So are there any characters that cannot be used for the first three? That would be the obvious choice to use as a delimiter. How about the good old comma?
Thanks!
You can use either:
A control character: control characters rarely appear in ordinary text such as URLs, paths and filenames. Tab (\t) is probably the best choice here.
Some combination of characters which is unlikely to occur in your files, for example #s#.
Tab is the generally preferred choice though.
Why not just use something that exists already? There are one or two choices: Perl, Python, Ruby, bash, sh, csh, Groovy, ECMAScript or, heaven forbid, Windows scripting files.
I can't see what you'd gain by writing yet another batch file syntax.
Tabs. And then expand or compress any tabs found in the text.
Choose a delimiter that has the least chance of collision with the names of any variable that you may have (which precludes #, /, : etc). The comma (,) looks good to me (unless your custom message has a few) or < and > (subject to previous condition).
However, you may also need to 'escape' delimiter characters occurring as part of the variables you want to delimit.
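To illustrate what escaping can look like, here is a minimal sketch in Go. It assumes tab as the delimiter (as other answers suggest) and backslash as the escape character; both are choices made for this example, and the names escaper, unescaper, encodeLine and decodeLine are made up for it. A raw tab inside a field is written as a backslash followed by the letter t, and a backslash is doubled, so the encoded line only ever contains raw tabs between fields.

```go
package main

import (
	"fmt"
	"strings"
)

// escaper doubles every backslash and turns a raw tab into the two
// characters backslash + 't', so encoded fields never contain a raw tab.
var escaper = strings.NewReplacer(`\`, `\\`, "\t", `\t`)

// unescaper reverses escaper. strings.Replacer scans left to right without
// overlapping matches, so an escaped backslash is consumed as a pair before
// the character that follows it is examined.
var unescaper = strings.NewReplacer(`\\`, `\`, `\t`, "\t")

// encodeLine escapes each field and joins the results with raw tabs.
func encodeLine(fields []string) string {
	escaped := make([]string, len(fields))
	for i, f := range fields {
		escaped[i] = escaper.Replace(f)
	}
	return strings.Join(escaped, "\t")
}

// decodeLine splits on raw tabs and then undoes the escaping in each field.
func decodeLine(line string) []string {
	fields := strings.Split(line, "\t")
	for i, f := range fields {
		fields[i] = unescaper.Replace(f)
	}
	return fields
}

func main() {
	// A URL containing a comma, a Windows path with backslashes, and a
	// message containing a tab all survive the round trip.
	line := encodeLine([]string{"http://example.com/a,b.png", `C:\temp\img`, "saved\tok"})
	fmt.Printf("%q\n", line)
	fmt.Printf("%q\n", decodeLine(line))
}
```

The same idea works with a comma or any other delimiter: pick one escape character, escape it first, then escape the delimiter.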
This sounds like a really bad idea. There is no need to create yet another (data-representation) language; there are plenty of existing ones which might fit your needs. In addition to Ruby, Perl, etc., you may want to consider YAML.
Designing good syntax for this sort of thing is difficult and fraught with peril. Does reinventing the wheel ring a bell?
I would use '|'
It's one of the rarest characters.
How about String.fromCharCode(1) ?
