Working with input splits(HADOOP)

I have a .txt file as follows:
This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxxx
(ignoring the blank line after each record)
I have set the block size to 64 bytes. What I am trying to check is whether a single record can end up broken across two blocks.
Logically, since the block size is 64 bytes, uploading the file to HDFS should create 3 blocks of 64, 64 and 27 bytes respectively, which it does. And since the first block is 64 bytes, it should contain only the following data:
This is xyz
This is my home
This is my PC
This is my room
Th
Now I want to see whether the first block really looks like this. If I browse HDFS via the browser and download the file, it downloads the entire file, not a single block.
So I decided to run a map-reduce job that only outputs the record values (setting reducers=0, writing the mapper output as context.write(null, record_value), and changing the default delimiter to "").
While running, the job counters show 3 splits, as expected. But when I check the output directory after completion, it shows 3 mapper output files, of which 2 are empty and the first contains the entire content of the file as it is.
Can anyone help me with this? Is it possible that newer versions of Hadoop handle incomplete records automatically?

Steps followed to reproduce the scenario
1) Created a file sample.txt with the following content (total size ~153B):
cat sample.txt
This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxxx
2) Added the property to hdfs-site.xml
<property>
<name>dfs.namenode.fs-limits.min-block-size</name>
<value>10</value>
</property>
and loaded the file into HDFS with a block size of 64B:
hdfs dfs -Ddfs.bytes-per-checksum=16 -Ddfs.blocksize=64 -put sample.txt /
This created three blocks of sizes 64B, 64B and 25B.
Content in Block0:
This is xyz
This is my home
This is my PC
This is my room
This i
Content in Block1:
s ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xx
Content in Block2:
xx xxxxxxxxxxxxxxxxxxxxx
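The block contents above can be reproduced offline by slicing the file's bytes into 64-byte chunks. This is a minimal sketch, assuming sample.txt has exactly the content shown in step 1 with a single \n after each line; the names SAMPLE and to_blocks are illustrative and not part of any Hadoop API.

```python
# Rebuild sample.txt in memory (content assumed from step 1) and cut it
# into fixed-size chunks, mimicking how a file is cut into HDFS blocks.
SAMPLE = (
    b"This is xyz\n"
    b"This is my home\n"
    b"This is my PC\n"
    b"This is my room\n"
    + b"This is ubuntu PC " + b"xxxx " * 11 + b"x" * 21 + b"\n"
)


def to_blocks(data, block_size):
    """Slice data into consecutive chunks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


blocks = to_blocks(SAMPLE, 64)
for n, block in enumerate(blocks):
    print("Block%d (%dB): %r" % (n, len(block), block))
```

The first chunk ends in the middle of the last line, and the second chunk contains no \n at all, which matters for how the records are later read.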
3) A simple mapper.py
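#!/usr/bin/env python
# Identity mapper (Python 2 syntax): echoes every input line to stdout.
import sys
for line in sys.stdin:
    # line keeps its trailing newline, so print emits an extra blank line
    print line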
4) Hadoop Streaming with 0 reducers:
yarn jar hadoop-streaming-2.7.1.jar -Dmapreduce.job.reduces=0 -file mapper.py -mapper mapper.py -input /sample.txt -output /splittest
The job ran with 3 input splits, invoking 3 mappers, and generated 3 output files: one holding the entire content of sample.txt and the other two empty (0B).
hdfs dfs -ls /splittest
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/_SUCCESS
-rw-r--r-- 3 user supergroup 168 2017-03-22 11:13 /splittest/part-00000
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/part-00001
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/part-00002
The file sample.txt is divided into 3 splits, and one split is assigned to each mapper:
mapper1: start=0, length=64B
mapper2: start=64, length=64B
mapper3: start=128, length=25B
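The offsets listed above follow from simple arithmetic on the file size and split size. The sketch below is only an illustration: compute_splits is a hypothetical helper, and Hadoop's own split computation has additional rules (for example around a small trailing remainder) that it ignores.

```python
def compute_splits(file_size, split_size):
    """Return (start, length) pairs that cover file_size bytes in order."""
    splits = []
    start = 0
    while start < file_size:
        length = min(split_size, file_size - start)
        splits.append((start, length))
        start += length
    return splits


# 153-byte file, 64-byte splits, as in the example above
for n, (start, length) in enumerate(compute_splits(153, 64), 1):
    print("mapper%d: start=%d, length=%dB" % (n, start, length))
```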
A split only determines which portion of the file a mapper is responsible for; the mapper does not necessarily read exactly those bytes. The actual content read by a mapper is determined by the InputFormat and its record boundaries, here TextInputFormat.
TextInputFormat uses LineRecordReader to read the content of each split, with \n as the delimiter (line boundary). For a file that isn't compressed, each mapper reads its lines as explained below.
For the mapper whose start index is 0, reading starts at the beginning of the split. If the split ends with \n, reading stops at the split boundary; otherwise the reader continues past the assigned length (here 64B) up to the first \n, so that it never processes a partial line.
For every other mapper (start index != 0), the reader checks whether the character preceding its start index (start - 1) is \n. If it is, reading begins at the start of the split. Otherwise the reader skips everything between its start index and the first \n it encounters (that content is handled by another mapper) and starts reading just after that \n.
Here, mapper1 (start index 0) starts with Block0, whose split ends in the middle of a line. It therefore keeps reading that line, which consumes all of Block1; since Block1 contains no \n, mapper1 continues until it finds one, which consumes all of Block2 as well. That is how the entire content of sample.txt ended up in a single mapper's output.
For mapper2 (start index != 0), the character preceding its start index is not \n, so it skips the remainder of that line and ends up with no content, hence the empty output. mapper3 is in the same situation as mapper2.
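The reading rules described above can be modelled in a few lines of Python. This is a simplified sketch of the rules as stated in this answer, not a port of Hadoop's LineRecordReader; SAMPLE, SPLITS and read_split are illustrative names, and the file content is assumed to match step 1 exactly.

```python
SAMPLE = (
    b"This is xyz\n"
    b"This is my home\n"
    b"This is my PC\n"
    b"This is my room\n"
    + b"This is ubuntu PC " + b"xxxx " * 11 + b"x" * 21 + b"\n"
)
SPLITS = [(0, 64), (64, 64), (128, 25)]


def read_split(data, start, length):
    """Return the complete lines a mapper reads for one split."""
    end = start + length
    pos = start
    # A non-first split that begins mid-line skips up to the next newline.
    if start != 0 and data[start - 1:start] != b"\n":
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    # Read every line that starts inside the split; the last one may run
    # past the split boundary until its terminating newline.
    while pos < min(end, len(data)):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline + 1
        records.append(data[pos:stop])
        pos = stop
    return records


counts = [len(read_split(SAMPLE, start, length)) for start, length in SPLITS]
print(counts)  # [5, 0, 0]
```

Under this model the first mapper reads all five lines and the other two read nothing, matching the output files observed in step 4.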
Try changing the content of sample.txt like this to see different results
This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx
xxxx xxxx xxxx xxxx xxxx xxxx xxxx
xxxxxxxxxxxxxxxxxxxxx
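Under the rules above, a line is processed by the mapper whose split contains the line's first byte, so the expected distribution for this modified file can be worked out from line offsets alone. A small sketch, assuming each line ends with a single \n and has no trailing spaces (ALT and lines_per_split are illustrative names):

```python
ALT = (
    b"This is xyz\n"
    b"This is my home\n"
    b"This is my PC\n"
    b"This is my room\n"
    b"This is ubuntu PC xxxx xxxx xxxx xxxx\n"
    b"xxxx xxxx xxxx xxxx xxxx xxxx xxxx\n"
    + b"x" * 21 + b"\n"
)
SPLIT_SIZE = 64

# Count how many lines start inside each 64-byte split.
lines_per_split = {}
offset = 0
for line in ALT.splitlines(keepends=True):
    split_index = offset // SPLIT_SIZE
    lines_per_split[split_index] = lines_per_split.get(split_index, 0) + 1
    offset += len(line)

print(lines_per_split)  # {0: 5, 1: 1, 2: 1}
```

With this content, none of the three mapper outputs should be empty.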

Use the following command to get the block list for your file on HDFS:
hdfs fsck PATH -files -blocks -locations
where PATH is the full HDFS path of your file.
The output (shown partially below) will look something like this:
Connecting to namenode via http://ec2-54-235-1-193.compute-1.amazonaws.com:50070/fsck?ugi=student6&files=1&blocks=1&locations=1&path=%2Fstudent6%2Ftest.txt
FSCK started by student6 (auth:SIMPLE) from /172.31.11.124 for path /student6/test.txt at Wed Mar 22 15:33:17 UTC 2017
/student6/test.txt 22 bytes, 1 block(s): OK
0. BP-944036569-172.31.11.124-1467635392176:blk_1073755254_14433 len=22 repl=1 [DatanodeInfoWithStorage[172.31.11.124:50010,DS-4a530a72-0495-4b75-a6f9-75bdb8ce7533,DISK]]
Copy the block name from the output, excluding the trailing generation stamp (in the example above, blk_1073755254 without the _14433).
On your datanode, go to the directory on the Linux file system where blocks are stored (the dfs.datanode.data.dir parameter in hdfs-site.xml points to it) and search the entire subtree under that location for a file whose name contains the string you just copied. That tells you which subdirectory under dfs.datanode.data.dir holds the block (ignore any file name with a .meta suffix). Once you have located the file, you can run the Linux cat command on it to see the contents of that block.
Remember that although the file is an HDFS file, under the covers it is stored on the Linux file system, and each block of the HDFS file is a separate Linux file. The Linux file name of a block contains the block name copied from the fsck output.
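The copy-and-search steps above could be scripted roughly as follows. This is a sketch to be run on the datanode itself; block_id_from_fsck and find_block_files are hypothetical helper names, and the data directory you pass in must be whatever dfs.datanode.data.dir points to on your system.

```python
import os
import re


def block_id_from_fsck(fsck_line):
    """Extract blk_<id> (without the trailing _<generation stamp>)."""
    match = re.search(r"(blk_\d+)_\d+", fsck_line)
    if match is None:
        raise ValueError("no block name found in fsck output line")
    return match.group(1)


def find_block_files(data_dir, block_id):
    """Walk data_dir and return block files for block_id, skipping .meta."""
    hits = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.startswith(block_id) and not name.endswith(".meta"):
                hits.append(os.path.join(root, name))
    return hits


line = ("0. BP-944036569-172.31.11.124-1467635392176:"
        "blk_1073755254_14433 len=22 repl=1")
print(block_id_from_fsck(line))  # blk_1073755254
```

Calling find_block_files with your data directory and the extracted block name returns the paths you can then cat.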

Related

Hadoop streaming job create huge temp files

I was trying to run a Hadoop job to do word shingling, and all my nodes soon went into an unhealthy state because their storage was used up.
Here is my mapper part:
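import sys
shingle = 5  # shingle length in characters
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # emit every 5-character window of the line with a count of 1
    for i in range(0, len(line) - shingle + 1):
        print('%s\t%s' % (line[i:i+shingle], 1))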
My understanding is that 'print' generates temp files on each node, which occupy storage space. Taking a txt file as an example:
cat README.txt |./shingle_mapper.py >> temp.txt
I can see the size of the original and temp file:
-rw-r--r-- 1 root root 1366 Nov 13 02:46 README.txt
-rw-r--r-- 1 root root 9744 Nov 14 01:43 temp.txt
The temp file is more than 7 times the size of the input file, so I guess this is why every node runs out of storage.
My question is: do I understand the temp files correctly? If so, is there a better way to reduce their size (adding storage is not an option for me)?
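The size ratio observed above is what the mapper's output format implies: every 5-character window becomes an 8-byte record (the shingle, a tab, "1" and a newline). The sketch below only works out that arithmetic, assuming single-byte characters; mapper_output_bytes is a hypothetical helper, and it says nothing about where Hadoop actually stores intermediate data.

```python
SHINGLE = 5


def mapper_output_bytes(line, shingle=SHINGLE):
    """Bytes the shingle mapper prints for one input line."""
    text = line.strip()
    windows = max(0, len(text) - shingle + 1)
    # each record is: shingle + "\t" + "1" + "\n"
    return windows * (shingle + 3)


line = "a" * 100 + "\n"
print(len(line), mapper_output_bytes(line))  # 101 768
```

For long lines the ratio approaches 8, while short lines contribute less, which is consistent with the roughly 7x growth measured on README.txt.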

hive insert overwrite directory only overwrite direct path of generated file not the directory

-bash-4.1$ hadoop fs -ls /mytest/warehouse/mytable/
Found 4 items
-rwxrwxrwx 3 myvm users 1163 2016-11-24 03:11 /mytest/warehouse/mytable/000000_0
-rwxrwxrwx 3 myvm users 0 2016-11-24 03:09 /mytest/warehouse/mytable/000000_1
-rwxrwxrwx 3 myvm users 0 2016-11-24 03:09 /mytest/warehouse/mytable/000000_2
-rwxrwxrwx 3 myvm users 0 2016-11-24 03:09 /mytest/warehouse/mytable/000000_3
QUESTION
insert overwrite directory "/mytest/warehouse/mytable" select * from my_table
The above command only overwrites the file it generates, that is /mytest/warehouse/mytable/000000_0.
I expected it to remove all the files under the path and create 1 file with the desired output.
It seemed to work fine before moving to hive-1.1.0-cdh5.5.1.

It is generating 4 part files because your number of reducers is 4. To generate only one part file in the output, you can set this Hive property in your Hive terminal:
set mapred.reduce.tasks=1
The number of reducers also depends on the size of the input: Hive estimates one reducer per hive.exec.reducers.bytes.per.reducer bytes of input.
By default this is 1GB (1000000000 bytes) in older Hive releases (the default was lowered to 256MB in Hive 0.14). You can change it by setting the property hive.exec.reducers.bytes.per.reducer,
either by changing hive-site.xml:
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>1000000</value>
</property>
or using set
$ hive -e "set hive.exec.reducers.bytes.per.reducer=1000000"
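The dependence on input size can be illustrated with a little arithmetic. This is a simplified model of the estimate, not Hive's source code: estimate_reducers is a hypothetical helper, max_reducers stands in for the hive.exec.reducers.max setting, and an explicit mapred.reduce.tasks value takes precedence over any such estimate.

```python
def estimate_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Roughly one reducer per bytes_per_reducer of input, at least 1."""
    needed = -(-input_bytes // bytes_per_reducer)  # integer ceiling division
    return max(1, min(max_reducers, needed))


# 3.5 MB of input with 1000000 bytes per reducer
print(estimate_reducers(3500000, 1000000, 999))  # 4
```

Lowering bytes_per_reducer raises the estimate, and raising it lowers the estimate toward a single reducer.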

Zero-length file in S3 folder possibly prevents accessing that folder with Hive?

I cannot access a folder on AWS S3 with Hive; presumably a zero-length file in that directory is the reason. A folder created in the AWS Management Console is a zero-byte object whose key ends with a slash, i.e. "folder_name/". I think Hive or Hadoop may have a bug in how they define a folder scheme on S3.
Here is what I have done.
CREATE EXTERNAL TABLE is_data_original (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/logs/';
SELECT * FROM is_data_original LIMIT 10;
Failed with exception java.io.IOException:java.lang.NullPointerException
username@client:~$ hadoop fs -ls s3n://bucketname/logs/
Found 4 items
-rwxrwxrwx 1 0 2015-01-22 20:30 /logs/data
-rwxrwxrwx 1 8947 2015-02-27 18:57 /logs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-27 18:57 /logs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-27 18:57 /logs/data_2015-02-15.csv
hadoop fs -mkdir s3n://bucketname/copylogs/
hadoop fs -cp s3n://bucketname/logs/*.csv s3n://bucketname/copylogs/
username@client:~$ hadoop fs -ls s3n://bucketname/copylogs/
Found 3 items
-rwxrwxrwx 1 8947 2015-02-28 05:09 /copylogs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-28 05:09 /copylogs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-28 05:09 /copylogs/data_2015-02-15.csv
CREATE EXTERNAL TABLE is_data_copy (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/copylogs/';
SELECT * FROM is_data_copy LIMIT 10;
The latter, after copying, works fine.
Both of the following commands work:
hadoop fs -cat s3n://bucketname/logs/data_2015-02-15.csv
hadoop fs -cat s3n://bucketname/copylogs/data_2015-02-15.csv
Versions: Hive 0.11.0 and Hadoop 1.0.3.
Is this some kind of bug? Is it related to AWS S3? Any ideas? I need to be able to read the original location, because that is where the data keeps flowing in.
I have no control over the processes that created the directory and placed the log files there, so I cannot check anything on that end.
I carried out an experiment: I created a key/folder on S3 and placed a file in it in two different ways, using the AWS Management Console and using hadoop fs.
When I used the AWS Console, I can see a zero-byte file in the folder, and I get a NullPointerException when accessing it with Hive. With hadoop fs I don't have this problem. I assume the zero-byte file was supposed to be deleted, but it was not in the AWS Console case. I am sure that in my case the S3 folder was not created from the AWS Console, but possibly from Ruby or JavaScript.

Seems like a Hive bug. Hive 0.12.0 does not have that problem.

Generating "terasort" input data set with TeraGen

I want to generate a data set (for my own "terasort" MapReduce job) by running the TeraGen program that ships with Hadoop (inside hadoop-examples.jar):
hadoop jar /<full-path>/lib/hue/apps/oozie/examples/lib/hadoop-examples.jar teragen 1000 ./teragen
I am not getting the expected output that should follow the format:
(10 bytes key) (10 bytes rowid) (78 bytes filler) \r \n
I am getting a file that:
starts with JimGrayRIP followed by a NUL character (when I am trying to paste it, it gets truncated; I uploaded a copy to Dropbox),
contains two characters repeated every 100 bytes, but instead of 0D 0A they are EE FF.
What can be wrong?
Can this be any encoding issue?
Is a sample "terasort" data set available for download anywhere?
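One way to inspect a generated file against the expected layout is to walk it in 100-byte records and collect the last two bytes of each. The sketch below does that on synthetic data built purely for illustration (it is not real TeraGen output); record_terminators is a hypothetical helper name.

```python
RECORD_LEN = 100  # 10-byte key + 10-byte rowid + 78-byte filler + \r\n


def record_terminators(data, record_len=RECORD_LEN):
    """Return the distinct 2-byte endings found across all full records."""
    endings = set()
    for start in range(0, len(data) - record_len + 1, record_len):
        endings.add(data[start + record_len - 2:start + record_len])
    return endings


# Synthetic records in the expected text layout (illustrative only)
text_like = (b"k" * 10 + b"r" * 10 + b"f" * 78 + b"\r\n") * 3
print(record_terminators(text_like))  # {b'\r\n'}
```

To check a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes in; a result other than {b'\r\n'} means the records do not follow the text layout described above.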

Hadoop WordCount Output

I am new to hadoop and am running some of the examples to become more familiar with it. I ran wordcount and when I went to check the output hadoop fs -cat outt I got 3 directories instead of the usual one named outt/part-00000. Here are the directories I have:
-rw-r--r-- 1 hadoop supergroup 0 2014-07-11 20:13 outt/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 15 2014-07-11 20:13 outt/part-r-00000
-rw-r--r-- 1 hadoop supergroup 0 2014-07-11 20:13 outt/part-r-00001
When I do hadoop fs -cat outt/_SUCCESS and hadoop fs -cat outt/part-r-00001, nothing appears. However, when I do hadoop fs -cat outt/part-r-00000 I get: record_count 1.
My file just says "Hello World" so I am expecting the result: Hello 1 World 1.
Does anyone know how to get the correct output?

1.) _SUCCESS and part-r-00000/part-r-00001 are not directories but files. A directory is a container for files and other directories.
2.) The _SUCCESS file is created automatically by Hadoop when the submitted job completes successfully and the result set is complete.
3.) If you are getting two part files, your job is configured with two reducers. Check the code for a statement like job.setNumReduceTasks(2);. The part numbered 00000 is the output of the first reducer and 00001 is the output of the second. The 'r' indicates that the output comes from a reducer. If you see 'm' instead of 'r', the job has no reducer and is a map-only job.

When you run hadoop fs -cat outt/part-r-00000 and get the output record_count 1,
it probably means you are counting the number of lines in the input file.
Once you read a line, you need to tokenize it and emit each word (token) separately.
Here is sample code:
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    context.write(word, one);
}
You can find the full code here: WordCount
Instead of StringTokenizer, you can use the split method of the Java String API.
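The tokenize-then-count idea can also be tried outside Hadoop. The following is a plain Python analogue of the map and reduce steps, not Hadoop API code; map_line and reduce_counts are illustrative names.

```python
from collections import Counter


def map_line(line):
    """Map step: emit (word, 1) for every whitespace-separated token."""
    for token in line.split():
        yield token, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


result = reduce_counts(map_line("Hello World"))
print(result)  # {'Hello': 1, 'World': 1}
```

Emitting one pair per token, rather than one per line, is what turns a line count into a word count.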
