Hadoop converting \r\n to \n and breaking ARC format

I am trying to parse data from commoncrawl.org using Hadoop streaming. I set up a local Hadoop installation to test my code, and have a simple Ruby mapper which uses a streaming ARC file reader. When I invoke my code myself like this:
cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb
It works as expected.
It seems that hadoop automatically sees that the file has a .gz extension and decompresses it before handing it to a mapper - however while doing so it converts \r\n linebreaks in the stream to \n. Since ARC relies on a record length in the header line, the change breaks the parser (because the data length has changed).
To double check, I changed my mapper to expect uncompressed data, and did:
cat 1262876244253_18.arc.gz | zcat | mapper.rb | reducer.rb
And it works.
I don't mind hadoop automatically decompressing (although I can quite happily deal with streaming .gz files), but if it does I need it to decompress in 'binary' without doing any linebreak conversion or similar. I believe that the default behaviour is to feed decompressed files to one mapper per file, which is perfect.
How can I either ask it not to decompress .gz (renaming the files is not an option) or make it decompress properly? I would prefer not to use a special InputFormat class which I have to ship in a jar, if at all possible.
All of this will eventually run on AWS ElasticMapReduce.

Looks like the Hadoop PipeMapper.java is to blame (at least in 0.20.2):
PipeMapper.java (0.20.2)
Around line 106, the input from TextInputFormat is passed to this mapper (at which stage the \r\n has been stripped), and the PipeMapper is writing it out to stdout with just a \n.
A suggestion would be to check whether this 'feature' still exists in your version of PipeMapper.java and, if so, amend the source as required (maybe allowing the behaviour to be set via a configuration property).
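To illustrate why this breaks ARC parsing, here is a minimal Python sketch (not Hadoop code) of what happens when a line-oriented reader strips the terminators and a writer re-emits each line with a bare \n. The record content and the function name `relay_lines` are made up for illustration.

```python
def relay_lines(data: bytes) -> bytes:
    """Mimic a line-based pipeline: split on line breaks, re-emit with bare \\n."""
    lines = data.splitlines()  # drops \r\n, \n and \r terminators
    return b"".join(line + b"\n" for line in lines)


# Hypothetical record body; an ARC header would declare its length as len(record).
record = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>\r\n"
relayed = relay_lines(record)

# Every CRLF lost one byte, so the declared length no longer matches.
print(len(record), len(relayed))
```

Each \r\n pair shrinks by one byte, so a parser that trusts the length in the header reads past the end of the record.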

Related

Update the tar.bz2 compressed file

We have hundreds of files in a trx_date.tar.bz2 compressed archive, containing requests and responses. The structure of trx_date.tar.bz2 is: trx_date.tar contains the directory trx_date, which contains the files log1, log2, log3. These hold XML requests with some sensitive information that I would like to mask with a default value. For example, a request has the tag <number>1234567</number>, and I want to update it in the log file to <number>3333333</number>.
I am able to grep it using:
Number1=bzcat $LOGDIR/$LOG_FORMAT | grep "<number>[0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9]"
How can we overwrite those values in the log files using a shell script?
The log files contain requests and responses, with tags like <number>123456</number> as well as other tags. I want to read every line of each log file, replace the value of that specific tag with 333333, and save the result back into the same file. Some tags already contain 333333, and I don't want to touch those.
In principle, you cannot do directly what you want (without extracting the file from your .tar.bz2 compressed archive), since a .tar.bz2 file is a bzip2-ed compression of a tar archive. So the only good solution would be to extract files from the archive, do the modification on the extracted files (e.g. with sed(1) or awk), and recreate an archive from it. Using sed on one particular textual file to replace a pattern like <number>[0-9]*</number> by <number>0000000</number> is easy. Writing a bash for loop to iterate that on several files is easy. So combine both approaches, or write a tiny shell or Python script doing that (on the extracted files).
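The extract, modify, recreate approach could be sketched in Python roughly as follows. This is only an illustration under assumptions: the tag name `<number>` and the mask value come from the question's example, and `mask_archive` is a hypothetical helper name.

```python
import re
import tarfile
import tempfile
from pathlib import Path

# Assumed tag name and mask value, taken from the question's example.
NUMBER_TAG = re.compile(rb"<number>\d+</number>")
MASKED = b"<number>3333333</number>"


def mask_archive(src: str, dst: str) -> None:
    """Extract src (.tar.bz2), mask <number> tags in every file, write dst."""
    with tempfile.TemporaryDirectory() as workdir:
        with tarfile.open(src, "r:bz2") as tar:
            tar.extractall(workdir)  # only do this with archives you trust
        for path in Path(workdir).rglob("*"):
            if path.is_file():
                path.write_bytes(NUMBER_TAG.sub(MASKED, path.read_bytes()))
        # Recreate the archive from the modified files.
        with tarfile.open(dst, "w:bz2") as tar:
            for entry in sorted(Path(workdir).iterdir()):
                tar.add(entry, arcname=entry.name)
```

Writing to a new archive and replacing the old one only after checking the result keeps the original intact if something goes wrong. Each file is read fully into memory, which is fine for small logs but not for very large ones.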
In practice (but that is risky and I don't recommend it) you could hope that <number> digits </number> occurs only in the files inside the tar archive that you want to modify in place. You could then perhaps replace such sequences directly in the uncompressed tar archive, using e.g. sed(1), with other sequences of the same byte length (read more about the tar format: metadata such as file sizes appears in textual form, padded with NUL bytes).
You might also consider using tardy, a tar post-processor (that you need to install).
I strongly recommend extracting the tar archive, operating on the extracted files, then recreating the archive. Of course, you need enough disk space, and you have to estimate it. But tell your manager that disk space is cheap, generally cheaper than your labor costs.
PS. The command given in your question is really wrong and does not do what you dream of. Read more about redirection, pipelines, globbing, and Unix shells. Read carefully the documentation of Bash (notably basic shell features, shell expansion, command substitution). Read also the documentation of each command that you want to use (e.g. tar(1), grep(1), sed(1), etc.). Read the relevant man-pages(7), perhaps with the man(1) command.

Does HDFS support special characters (Umlauts, etc..)?

I am trying to add a file with umlauts to hdfs but when I do I get an error message like this below
++ hdfs dfs -put $'data/R\366\337el.doc' solr/test/test.data
put: `test.data/R��el.doc': No such file or directory
What should I do then? Translate the file names, for example ä to ae, or is there another way to handle this?
HDFS stores these strings using Java, whose strings are UTF-16 encoded. On the wire, Hadoop's RPC uses UTF-8, which can represent umlauts and many other characters.
What you've probably encountered is that your shell does not support the encoding or the characters.
When in doubt, you can always use the Java API to put files into HDFS, which requires writing some code.
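As a small illustration of the encoding issue, the octal escapes \366\337 in the traced command are the bytes 0xF6 0xDF, which are not valid UTF-8. The Python sketch below checks this and re-encodes the name, assuming (this is only a guess) that the file was created under a Latin-1 locale.

```python
raw_name = b"R\xf6\xdfel.doc"  # the bytes behind $'R\366\337el.doc'

try:
    raw_name.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# Assumption: the name was created under a Latin-1 (ISO-8859-1) locale.
name = raw_name.decode("latin-1")
utf8_name = name.encode("utf-8")

print(is_utf8, ascii(name), utf8_name)
```

If that assumption holds, renaming the local file to its UTF-8 form before running the put may be enough, with no need to transliterate ä to ae.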

Stream processing lots of stuff to OVA

So one of our developers needs me to batch a bunch of information and process it into an OVA to be presented back for download. This is an easy process using the long method (i.e. writing to the filesystem), but the developers want a cleaner, streamlined solution that will scale better. They have therefore requested that I stream the entire process, which is proving difficult. Can someone please give me some direction? Here are the steps that need to be accomplished:
1. Get input from webserver (Webserver will pass these as stream eventually.)
   - Random password
   - XML file
2. Modify boot script on file system (i.e. insert random password generated by server)
3. Create ISO of XML file and boot script
4. Calculate the SHA1 sum of ISO
5. Append SHA1 sum of ISO to manifest file in OVF directory
6. Create OVA from OVF directory
Here is an example directory structure (I outlined this in / just for simplicity)
/--
|
|--ISO/
|  |
|  |--boot.sh (Where the random password gets inserted)
|  |--config.xml (This is handed from the web server. Needs to stream from server)
|
|--OVF/
   |
   |--disk.vmdk
   |--ovf.xml
   |--manifest.mf (Contains SHA1 of all files in OVF directory)
   |--boot.iso (This file will exist once created from ISO directory)
Here is what I have so far (I'll explain the issues afterwards. Yes... there are a lot of issues):
cat /ISO/boot.sh | sed "s%DEFAULT%RANDOM%" | mkisofs /ISO/* | echo "SHA1(boot.iso)= " && sha1sum >> manifest.mf | tar -cvf success.ova /OVF/*
NOTE
In boot.sh there is a variable set to DEFAULT like this (Just for testing purposes):
PASSWORD="DEFAULT"
NOTE
This is what a line in the manifest file should look like:
SHA1(boot.iso)= 5fbc0d70 BLAH BLAH BLAH a91c9121bb
So I've never tried to write an entire script in one stream before. Usually I write to the filesystem a lot as I go. The first issue I see with this is that sed is replacing the string, but what it pipes to mkisofs will not be used, as mkisofs is just going to make an ISO of what it finds in /ISO. I don't even know if you can pass something like that to mkisofs. Piping is sometimes weird to think about.
Next, I think mkisofs is OK because I didn't specify an output file, so it should write to stdout, which will be passed to sha1sum. But here is the next problem I see: I need to append some additional text to the file before the SHA1 sum gets added, which kind of interrupts the stream.
Finally, the last problem I see is how to pass everything to tar to create the OVA without writing to the filesystem first (writing to manifest.mf).
Oh, and the last BIG problem, which I should have mentioned first, is the config.xml file. Right now I'm dealing with it as just a file. The dev guys want to pass it to this script as a stream as well. I don't have a clue how to handle that.
Any help would be greatly appreciated. These concepts are a little beyond my knowledge.
Thanks!
UPDATE 12/11/13 2:11PM EST
Testing each part individually right now. Will report findings below soon.
UPDATE 12/11/13 2:14PM EST
The following works:
cat /ISO/boot.sh | sed "s%DEFAULT%RANDOM%"
and produces the following output:
PASSWORD="RANDOM"
Exactly as expected.
You are correct NeronLeVelu, I will have to come back later and look at sed more carefully when real random passwords are being generated. ie. Making sure proper characters are escaped. Right now though, I'm just testing the logic. I will worry about regex and escaping later. We have not even decided on random password yet. It's only temporary and will most likely be alphanumeric.
Moving onto next part. Still not sure how to take the output from sed (stdout) and use it to include in ISO creation without actually creating a file that gets written to the file system. It may not be possible without writing to file system. More to come soon
# Escape \, & and the separator used in the sed below (/) in the password
Password4Sed=$(printf '%s\n' "${PASSWORD}" | sed 's|[\\/&]|\\&|g')
# No need for cat with sed
sed "s/DEFAULT/${Password4Sed}/" /ISO/boot.sh > /tmp/mkisofs.input
Process the rest from this input, and add tests to validate each step, such as checking for an empty checksum value or an empty mkisofs.input. This will help at runtime when errors occur in production.
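For the checksum step, one way to avoid the "append text, then append the hash" problem is to hash the stream in chunks and build the whole manifest line at once. The Python sketch below is an illustration only: `sha1_manifest_line` is a hypothetical helper, and the in-memory bytes stand in for whatever stream actually carries the ISO.

```python
import hashlib
import io


def sha1_manifest_line(name: str, stream, chunk_size: int = 65536) -> str:
    """Hash a binary stream in chunks and format an OVF-style manifest line."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return f"SHA1({name})= {digest.hexdigest()}"


# Stand-in for the ISO bytes; in practice this could be a pipe from mkisofs.
fake_iso = io.BytesIO(b"not a real ISO image")
print(sha1_manifest_line("boot.iso", fake_iso))
```

The returned string has the same shape as the manifest line shown in the question, so it can be appended to manifest.mf in a single write. The stream is consumed by hashing, so the ISO bytes still have to be kept somewhere if they are needed again for the OVA.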

extract payload from tcpflow output

Tcpflow outputs a bunch of files, many of which are HTTP responses from a web server. Inside, they contain HTTP headers, including Content-type: , and other important ones. I'm trying to write a script that can extract just the payload data (i.e. image/jpeg; text/html; et al.) and save it to a file [optional: with an appropriate name and file extension].
The EOL chars are \r\n (CRLF), which makes them difficult to handle in GNU distros (in my experience).
I've been trying something along the lines of:
sed /HTTP/,/^$/d
to delete all text from the beginning of HTTP (inclusive) to the end of \r\n\r\n (inclusive), but I have had no luck. I'm looking for help from anyone with good experience in sed and/or awk. I have zero experience with Perl, so I'd prefer to use common GNU command line utilities for this.
Find a sample tcpflow output file here. (bad link)
Thanks,
Felipe
This article recommends running foremost on output from tcpflow to extract the images. It's available at that link and in the repositories of (at least) Debian, Fedora and Ubuntu.
I tried it on the sample file you linked to and it seemed to work fine.
foremost -i tcpflow.out
It created a directory called "output" with subdirectories called "gif" and "jpeg" with files in each. The names of the files don't match the filenames in the headers, though.
To change the line endings of your files do:
dos2unix filename
or in a pipe:
dos2unix < filename | nextcommand
Other links of interest:
httpflow - parses tcpflow output
tcpxtract - another file extractor
Forensic Tools for Unix - a list of open source tools
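On the original sed attempt: with CRLF line endings the "blank" line after the headers still contains a \r, so the pattern ^$ never matches it. If a script is acceptable instead, the header/body split can be sketched in Python as below. This assumes one complete HTTP response per file and ignores chunked transfer encoding and compression; `split_http_message` is a hypothetical helper name.

```python
def split_http_message(raw: bytes):
    """Return (headers, body) from one raw HTTP response captured as bytes."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line found; not a complete HTTP message")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        key, _, value = line.partition(b":")
        headers[key.strip().lower().decode("latin-1")] = value.strip().decode("latin-1")
    return headers, body


sample = b"HTTP/1.1 200 OK\r\nContent-Type: image/gif\r\n\r\nGIF89a..."
headers, body = split_http_message(sample)
print(headers["content-type"], len(body))
```

The Content-Type value could then be used to choose a file extension when writing the body to disk.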

ruby mechanize: how to read a downloaded binary csv file

I'm not very familiar using ruby with binary data. I'm using mechanize to download a large number of csv files to my local disk. I then need to search these files for specific strings.
I use the save_as method in mechanize to save the file (which saves the file as binary). The content type of the file (according to mechanize) is:
application/vnd.ms-excel;charset=x-UTF-16LE-BOM
From here, I'm not sure how to read the file. I've tried reading it in as a normal file in ruby, but I just get the binary data. I've also tried just using standard unix tools (strings/grep) to try and search without any luck.
When I run the 'file' command on one of the files, I get:
foo.csv: Little-endian UTF-16 Unicode Pascal program text, with very long lines, with CRLF, CR, LF line terminators
I can see the data just fine with cat or vi. With vi I also see some control characters.
I've also tried both the csv and fastercsv ruby libraries, but I get 'IllegalFormatError' exception for these. I've also tried this solution without any luck.
Any help would be greatly appreciated. Thanks.
You can use the 'iconv' command to convert the file to UTF-8:
# iconv -f 'UTF-16LE' -t 'UTF-8' bad_file.csv > good_file.csv
There is also a wrapper for iconv in the standard library, you could use that to convert the file after reading it into your program.
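The same idea, decoding UTF-16 while reading rather than treating the file as raw bytes, is sketched below in Python purely for illustration (the question itself is about Ruby). It assumes the file starts with a BOM and is comma-separated; `read_utf16_csv` is a hypothetical helper name.

```python
import csv


def read_utf16_csv(path: str):
    """Read a UTF-16 CSV (the BOM decides byte order) into a list of rows."""
    # newline="" lets the csv module handle CRLF/CR/LF terminators itself.
    with open(path, encoding="utf-16", newline="") as handle:
        return list(csv.reader(handle))
```

Once the rows are decoded text, searching them for specific strings works as with any other CSV. If the export turns out to be tab-separated, passing `delimiter="\t"` to `csv.reader` would be the adjustment to try.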
