Generating "terasort" input data set with TeraGen - hadoop

I want to generate a data set (for my own "terasort" MapReduce job) by running the TeraGen program that ships with Hadoop (inside hadoop-examples.jar):
hadoop jar /<full-path>/lib/hue/apps/oozie/examples/lib/hadoop-examples.jar teragen 1000 ./teragen
I am not getting the expected output that should follow the format:
(10 bytes key) (10 bytes rowid) (78 bytes filler) \r \n
I am getting a file that:
starts with JimGrayRIP followed by a NUL character (when I try to paste it here it gets truncated; I uploaded a copy to Dropbox),
contains two bytes repeated every 100 bytes, but, instead of 0D 0A, they are EE FF.
What can be wrong?
Can this be any encoding issue?
Is a sample "terasort" data set available for download anywhere?
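One way to sanity-check what TeraGen actually wrote is to copy a part file out of HDFS (for example with hdfs dfs -get ./teragen/part-00000 ., where the part filename is a guess) and dump its first few 100-byte records against the layout above; a minimal Python sketch, with a hypothetical local filename:
# Dump the first few 100-byte records of a locally copied TeraGen part file and
# compare them with the expected layout: 10-byte key, 10-byte rowid, 78-byte
# filler, then \r\n. The local filename is hypothetical.
with open('part-00000', 'rb') as f:
    for i in range(3):
        rec = f.read(100)
        if len(rec) < 100:
            break
        key, rowid, eol = rec[:10], rec[10:20], rec[98:]
        print('record %d: key=%r rowid=%r eol=%r' % (i, key, rowid, eol))
If the output followed the expected format, eol would print as '\r\n'; for the file described above it should show the EE FF bytes instead.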

Related

Hadoop: does the returned file size include the replication factor?

I have a file stored on HDFS and I need to get its size. I used the following line at the command prompt to get the file size:
hadoop fs -du -s train.csv | awk '{s+=$1} END {printf s}'
I know that Hadoop stores multiple copies of each file, as determined by the replication factor. So when I run the line above, is the returned size the file size times the replication factor, or just the file size?
From Hadoop documentation:
The du returns three columns with the following format:
size disk_space_consumed_with_all_replicas full_path_name
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
As you can see, the first column is the size of the file, while the second column is the space consumed including all replicas.
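For completeness, here is a small sketch that reads both columns from that command and compares them; it assumes the hadoop CLI is on the PATH and a Hadoop version whose du -s output has the three-column format quoted above:
# Run `hadoop fs -du -s` and pick up the two size columns; their ratio is
# roughly the replication factor in effect for train.csv.
import subprocess
out = subprocess.check_output(['hadoop', 'fs', '-du', '-s', 'train.csv'])
size, with_replicas = [int(x) for x in out.decode().split()[:2]]
print('size: %d  with replicas: %d  ratio: %.1f'
      % (size, with_replicas, float(with_replicas) / size))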

Mainframe pkunzip generates PEX013W Record(s) being truncated to lrecl=

I'm sending binary .gz files from Linux to z/OS via ftps. The file transfers seem to be fine, but when the mainframe folks pkunzip the file, they get a warning:
PEX013W Record(s) being truncated to lrecl= 996. Record# 1 is 1000 bytes.
Currently I’m sending the site commands:
SITE TRAIL
200 SITE command was accepted
SITE CYLINDERS PRIMARY=50 SECONDARY=50
200 SITE command was accepted
SITE RECFM=VB LRECL=1000 BLKSIZE=32000
200 SITE command was accepted
SITE CONDDISP=delete
200 SITE command was accepted
TYPE I
200 Representation type is Image
...
250 Transfer completed successfully.
QUIT
221 Quit command received. Goodbye.
They could read the file after the pkunzip, but having a warning is not a good thing.
Output from pkunzip:
SDSF OUTPUT DISPLAY RMD0063A JOB22093 DSID 103 LINE 25 COLUMNS 02- 81
COMMAND INPUT ===> SCROLL ===> CSR
PCM123I Authorized services are unavailable.
PAM030I INPUT Archive opened: TEST.FTP.SOA5021.GZ
PAM560I ARCHIVE FASTSEEK processing is disabled.
PDA000I DDNAME=SYS00001,DISP_STATUS=MOD,DISP_NORMAL=CATALOG,DISP_ABNORMAL=
PDA000I SPACE_TYPE=TRK,SPACE_TYPE=CYL,SPACE_TYPE=BLK
PDA000I SPACE_PRIMARY=4194304,SPACE_DIRBLKS=5767182,INFO_ALCFMT=00
PDA000I VOLUMES=DPPT71,INFO_CNTL=,INFO_STORCLASS=,INFO_MGMTCLASS=
PDA000I INFO_DATACLASS=,INFO_VSAMRECORG=00,INFO_VSAMKEYOFF=0
PDA000I INFO_COPYDD=,INFO_COPYMDL=,INFO_AVGRECU=00,INFO_DSTYPE=00
PEX013W Record(s) being truncated to lrecl= 996. Record# 1 is 1000 bytes.
PEX002I TEST.FTP.SOA5021
PEX003I Extracted to TEST.FTP.SOA5021I.TXT
PAM140I FILES: EXTRACTED EXCLUDED BYPASSED IN ERROR
PAM140I 1 0 0 0
PMT002I PKUNZIP processing complete. RC=00000004 4(Dec) Start: 12:59:48.86 End
Is there a better set of site commands to transfer a .gz file from Linux to z/OS to avoid this error?
**** Update ****
Using SaggingRufus's answer below, it turns out it doesn't much matter how you send the .gz file, as long as it's binary. His suggestion pointed us to the parameters passed to pkunzip for the output file, which was RECFM=VB and was truncating 4 bytes off each record.
Because it is a variable-blocked (VB) file, 4 bytes of each record are taken up by the record descriptor word, so a 1000-byte record needs an LRECL of 1004. Allocate the output file with an LRECL of 1004 and it will be fine.
Rather than generating a .zip file, perhaps generate a .tar.gz file and transfer it to z/OS UNIX? Tar is shipped with z/OS by default, and Rocket Software provides a port of gzip that is optimized for z/OS.
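If you take the .tar.gz route, the archive can be built on the Linux side before the transfer; a minimal sketch using Python's tarfile module (the directory name is hypothetical), with the resulting file sent in binary mode to a z/OS UNIX path rather than to a classic data set:
# Package a directory as payload.tar.gz (the equivalent of `tar czf`), to be
# transferred in binary and unpacked with the tar/gzip ports on z/OS UNIX.
import tarfile
with tarfile.open('payload.tar.gz', 'w:gz') as archive:
    archive.add('outbound')  # hypothetical directory (or files) to package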

Working with input splits (Hadoop)

I have a .txt file as follows:
This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxxx
(ignoring the blank line after each record)
I have set the block size to 64 bytes. What I am trying to check is whether a single record can end up split across two blocks.
Now logically, since the block size is 64 bytes, after uploading the file to HDFS it should create 3 blocks of 64, 64, and 27 bytes respectively, which it does. Also, since the size of the first block is 64 bytes, it should contain the following data only:
This is xyz
This is my home
This is my PC
This is my room
Th
Now I want to see whether the first block really looks like this. If I browse HDFS via the web UI and download the file, it downloads the entire file, not a single block.
So I decided to run a MapReduce job that would display only the record values (setting reducers=0, writing the mapper output as context.write(null, record_value), and changing the default delimiter to "").
While running the job, the counters show 3 splits, which is expected, but when I check the output directory after completion, there are 3 mapper output files, of which 2 are empty, and the first mapper output file contains the entire content of the file as it is.
Can anyone help me with this? Is it possible that newer versions of Hadoop handle incomplete records automatically?
Steps followed to reproduce the scenario
1) Created a file sample.txt with the following content, total size ~153 B
cat sample.txt
This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxxx
2) Added the following property to hdfs-site.xml so that such a small block size is accepted by the NameNode
<property>
  <name>dfs.namenode.fs-limits.min-block-size</name>
  <value>10</value>
</property>
and loaded the file into HDFS with a block size of 64 B:
hdfs dfs -Ddfs.bytes-per-checksum=16 -Ddfs.blocksize=64 -put sample.txt /
This created three blocks of sizes 64B, 64B and 25B.
Content in Block0:
This is xyz
This is my home
This is my PC
This is my room
This i
Content in Block1:
s ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xx
Content in Block2:
xx xxxxxxxxxxxxxxxxxxxxx
3) A simple mapper.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
    print line
4) Hadoop Streaming with 0 reducers:
yarn jar hadoop-streaming-2.7.1.jar -Dmapreduce.job.reduces=0 -file mapper.py -mapper mapper.py -input /sample.txt -output /splittest
The job ran with 3 input splits, invoking 3 mappers, and generated 3 output files, with one file holding the entire content of sample.txt and the other two being 0 B files.
hdfs dfs -ls /splittest
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/_SUCCESS
-rw-r--r-- 3 user supergroup 168 2017-03-22 11:13 /splittest/part-00000
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/part-00001
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/part-00002
The file sample.txt is split into 3 splits, and these splits are assigned to the mappers as follows:
mapper1: start=0, length=64B
mapper2: start=64, length=64B
mapper3: start=128, length=25B
This only determines which portion of the file a mapper has to read; it does not have to be exact. What a mapper actually reads is determined by the InputFormat and its record boundaries, here TextInputFormat.
TextInputFormat uses LineRecordReader to read the content of each split, with \n as the delimiter (line boundary). For a file that isn't compressed, the lines are read by each mapper as explained below.
For the mapper whose start index is 0, line reading starts from the beginning of the split. If the split ends with \n, reading ends at the split boundary; otherwise the reader looks for the first \n past the length of the split assigned to it (here 64 B), so that it does not end up processing a partial line.
For all the other mappers (start index != 0), the reader checks whether the character preceding its start index (start - 1) is \n. If it is, it reads from the start of the split; otherwise it skips the content between its start index and the first \n encountered in that split (that content is handled by another mapper) and starts reading after that \n.
Here, mapper1 (start index 0) starts with Block0, whose split ends in the middle of a line. It therefore continues reading that line, which consumes the whole of Block1, and since Block1 contains no \n character, mapper1 keeps reading until it finds one, consuming the whole of Block2 as well. That is how the entire content of sample.txt ended up in a single mapper's output.
For mapper2 (start index != 0), the character preceding its start index is not \n, so it skips that line and ends up with no content: an empty mapper output. mapper3 hits the identical scenario.
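For illustration, here is a minimal Python sketch of that rule (a simplified model, not Hadoop's actual LineRecordReader code), applied to the local sample.txt and the three splits listed above:
# Simplified model of how LineRecordReader decides which lines a split owns.
def lines_for_split(data, start, length):
    end = start + length
    pos = start
    if start != 0 and data[start - 1] != '\n':
        # the partial line at our start belongs to the previous mapper: skip it
        nl = data.find('\n', start)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    # a mapper reads every line that *starts* before its split end,
    # even if that line runs past the boundary
    while pos < end and pos < len(data):
        nl = data.find('\n', pos)
        if nl == -1:
            lines.append(data[pos:])
            pos = len(data)
        else:
            lines.append(data[pos:nl])
            pos = nl + 1
    return lines

data = open('sample.txt').read()
for name, (start, length) in [('mapper1', (0, 64)),
                              ('mapper2', (64, 64)),
                              ('mapper3', (128, 25))]:
    print('%s reads %d line(s)' % (name, len(lines_for_split(data, start, length))))
Running it reproduces the behaviour above: mapper1 gets all five lines, while mapper2 and mapper3 get none.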
Try changing the content of sample.txt like this to see different results
This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx
xxxx xxxx xxxx xxxx xxxx xxxx xxxx
xxxxxxxxxxxxxxxxxxxxx
Use the following command to get the block list for your file on HDFS
hdfs fsck PATH -files -blocks -locations
where PATH is the full HDFS path where your file is located.
The output (shown partially below) will be something like this:
Connecting to namenode via http://ec2-54-235-1-193.compute-1.amazonaws.com:50070/fsck?ugi=student6&files=1&blocks=1&locations=1&path=%2Fstudent6%2Ftest.txt
FSCK started by student6 (auth:SIMPLE) from /172.31.11.124 for path /student6/test.txt at Wed Mar 22 15:33:17 UTC 2017
/student6/test.txt 22 bytes, 1 block(s): OK 0. BP-944036569-172.31.11.124-1467635392176:blk_1073755254_14433 len=22 repl=1 [DatanodeInfoWithStorage[172.31.11.124:50010,DS-4a530a72-0495-4b75-a6f9-75bdb8ce7533,DISK]]
Copy the block name from the output (the blk_1073755254 part in the example above, excluding the _14433 suffix).
Go to the Linux file system on your datanode, to the directory where the blocks are stored (this is pointed to by the dfs.datanode.data.dir parameter of hdfs-site.xml), and search the entire subtree under that location for a filename containing the block name you just copied. That will tell you which subdirectory under dfs.datanode.data.dir contains a file with that string in its name (exclude any filename with a .meta suffix). Once you have located such a file, you can run the Linux cat command on it to see your file contents.
Remember that although the file is an HDFS file, under the covers it is actually stored on the Linux file system, and each block of the HDFS file is a separate Linux file. The block file is identified by the Linux file system using the block name shown in the fsck output above.
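The search described above can also be scripted; a small sketch (the data directory below is a placeholder, substitute your own dfs.datanode.data.dir value):
# Walk the datanode data directory looking for the block file whose name
# contains the blk_ id copied from the fsck output, skipping the .meta files.
import os
block_id = 'blk_1073755254'     # the block name from the fsck output above
data_dir = '/hadoop/hdfs/data'  # placeholder for dfs.datanode.data.dir
for root, dirs, files in os.walk(data_dir):
    for name in files:
        if block_id in name and not name.endswith('.meta'):
            print(os.path.join(root, name))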

How can two 100% identical files have different sizes?

I have two 100% identical empty .sh shell script files on Mac:
encrypt.sh: 299 bytes
decrypt.sh: 13 bytes (this size is correct, since the file has 13 bytes: 11 characters + two newlines)
The contents of encrypt.sh and its hexdump:
The contents of decrypt.sh and its hexdump:
The file info window of encrypt.sh:
The file info window of decrypt.sh:
They have the exact same hexdump, then how is it possible that they have different sizes?
The Mac OS X file system implements forks, so the larger one likely has something stored in its resource fork (it shows up as the com.apple.ResourceFork extended attribute).
Use ls -l@ to get more details.

Word Count Hadoop Example

I am running the word count example that comes with Hadoop (version 0.20.3-dev) on a 41 GB file, with the default configuration settings. The code gives correct output for a small file, but it produces garbage for the 41 GB file. Why is this happening?
Thanks, everybody. It may produce wrong output because Hadoop by default does not know your file format; it treats every file as a simple text file.
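If the input format really is the cause, a quick check is to peek at the first bytes of the file (or of a copied sample of it) and confirm it is plain text rather than something compressed or binary; the path below is hypothetical:
# Print the first bytes of the input; gzip, for example, starts with 1f 8b and
# would explain garbage output from a job that expects plain text.
with open('bigfile.txt', 'rb') as f:
    print(repr(f.read(16)))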
