Hadoop streaming job creates huge temp files - hadoop

I was trying to run a Hadoop job to do word shingling, and all my nodes soon went into an unhealthy state because their storage was used up.
Here is my mapper part:
import sys

shingle = 5
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    for i in range(0, len(line) - shingle + 1):
        print('%s\t%s' % (line[i:i+shingle], 1))
My understanding is that 'print' generates temp files on each node, which occupy storage space. Taking a txt file as an example:
cat README.txt |./shingle_mapper.py >> temp.txt
I can see the size of the original and temp file:
-rw-r--r-- 1 root root 1366 Nov 13 02:46 README.txt
-rw-r--r-- 1 root root 9744 Nov 14 01:43 temp.txt
The temp file is over 7 times the size of the input file, so I guess this is why each of my nodes has used up all of its storage.
My question is: do I understand the temp files correctly? If so, is there a better way to reduce their size (adding additional storage is not an option for me)?
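For what it's worth, a standard way to shrink the intermediate output (besides turning on map output compression via mapreduce.map.output.compress) is to aggregate counts inside the mapper before emitting, so each distinct shingle is written once per mapper rather than once per occurrence. A minimal sketch of such an in-mapper aggregating mapper (an assumption on my part, not code from the original post):
#!/usr/bin/env python
# In-mapper aggregation: sum shingle counts locally, then emit one record per
# distinct shingle, which can cut the spilled intermediate data considerably.
import sys
from collections import defaultdict

shingle = 5
counts = defaultdict(int)

for line in sys.stdin:
    line = line.strip()
    for i in range(len(line) - shingle + 1):
        counts[line[i:i+shingle]] += 1

for key, count in counts.items():
    print('%s\t%d' % (key, count))
A similar effect can also be had by passing a separate summing script to Hadoop Streaming's -combiner option.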

Related

What is the numerical difference in the number of files in two different directories for every sequence (seq 1-current)?

Every time I write a new batch of data, two new directories are created, called a sequence.
Directory 1 should always be 9 files larger than Directory 2.
I'm using ls | wc -l to output the number of files in each directory and then manually computing the difference.
For example
Sequence 151
Directory 1 - /raid2/xxx/xxxx/NHY274938WSP1151-OnlineSEHD-hyp (1911 files) - the seq number follows WSP1.
Directory 2 - /raid/xxx/ProjectNumber/xxxx/seq0151 (1902 files)
Sequence 152
Directory 1 /raid2/xxx/xxxx/NHY274938WSP1152-OnlineSEHD-hyp (1525 files)
Directory 2 - /raid/xxx/ProjectNumber/xxxx/seq0152 (1516 files)
Is there a script that will output the difference (minus 9) for every sequence?
i.e.
151 diff=0
152 diff=0
That works great, however:
I can now see that for some sequences, Directory 1 (RAW/all files) contains extra files that I don't want compared against Directory 2. These are:
Warmup files at the beginning (not a set amount every sequence)
Duplicate files with an _
For example:
20329.uutt -warmup
20328.uutt -warmup
.
.
21530.uutt First good file after warmup
.
.
19822.uutt
19821.uutt
19820.uutt
19821_1.uutt
Directory 2 (reprocessed/missing files) doesn't include warmup shots or duplicate files with an _
For example:
Missing shots
*021386 - first available file (files are missing before).
*021385
.
.
*019822
*019821
*019820
If we remove the warmup files and any duplicates, I should be left with the number of missing files?
Or output:
diff, D1#warmup files, D1#duplicate files, TOTdiff
To get D1#duplicate files, maybe I could count the total number of occurrences of _.uutt.
To get D1#warmup files, I have a log file where warmup shots have a "WARM" at the end of each line, in /raid2/xxx/xxxx/NHY274938WSP1151.log
i.e.
"01/27/21 15:33:51 :FLD211018WSP1004: SP:21597: SRC:2: Shots:1037: Manifold:2020:000 Vol:4000:828 Spread: 1.0:000 FF: nan:PtP: 0.000:000 WARM"
"01/27/21 15:34:04 :FLD211018WSP1004: SP:21596: SRC:4: Shots:1038: Manifold:2025:000 Vol:4000:000 Spread: 0.2:000 FF: nan:PtP: 0.000:000 WARM"
Is there a script that will output the difference (minus 9) for every sequence. Ie 151 diff= 0 152 diff =0
Here it is:
#!/bin/bash
d1p=/raid2/xxx/xxxx/NHY274938WSP1 # Directory 1 prefix
d1s=-OnlineSEHD-hyp # Directory 1 suffix
d2=/raid/xxx/ProjectNumber/xxxx/seq0
for d in $d2*
do  s=${d: -3}   # extract sequence from Directory 2
    echo $s diff=$(expr `ls $d1p$s$d1s | wc -l` - `ls $d | wc -l` - 9)
done
With filename expansion * we get all the directory names, and by removing the fixed part with the parameter expansion ${parameter:offset} we get the sequence.
For comparison here's a variant using arrays as suggested by tripleee:
#!/bin/bash
d1p=/raid2/xxx/xxxx/NHY274938WSP1 # Directory 1 prefix
d1s=-OnlineSEHD-hyp # Directory 1 suffix
d2=/raid/xxx/ProjectNumber/xxxx/seq0
shopt -s nullglob # make it work also for 0 files
for d in $d2*
do  s=${d: -3}          # extract sequence from Directory 2
    f1=($d1p$s$d1s/*)   # expand files from Directory 1
    f2=($d/*)           # expand files from Directory 2
    echo $s diff=$((${#f1[@]} - ${#f2[@]} - 9))
done
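The follow-up about warmup and duplicate files isn't handled by the scripts above. A rough Python sketch of that extra bookkeeping (the paths, the _ / .uutt pattern, and the WARM log convention are taken from the question; treat it as a starting point rather than a drop-in):
#!/usr/bin/env python
# For each sequence: diff (minus 9), number of warmup shots from the log,
# number of duplicate files (an "_" in the name), and the remaining difference.
import glob
import os

d1p = '/raid2/xxx/xxxx/NHY274938WSP1'   # Directory 1 prefix
d1s = '-OnlineSEHD-hyp'                 # Directory 1 suffix
d2 = '/raid/xxx/ProjectNumber/xxxx/seq0'

for d in sorted(glob.glob(d2 + '*')):
    seq = d[-3:]                        # sequence number from Directory 2 name
    d1_files = os.listdir(d1p + seq + d1s)
    d2_files = os.listdir(d)
    dups = sum(1 for f in d1_files if '_' in f and f.endswith('.uutt'))
    warmups = 0
    log = d1p + seq + '.log'            # e.g. NHY274938WSP1151.log
    if os.path.exists(log):
        with open(log) as fh:
            warmups = sum(1 for line in fh if line.rstrip().endswith('WARM'))
    diff = len(d1_files) - len(d2_files) - 9
    print('%s diff=%d warmup=%d dup=%d totdiff=%d'
          % (seq, diff, warmups, dups, diff - warmups - dups))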

Working with input splits (HADOOP)

I have a .txt file as follows:
This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxxx
(ignoring the blank line after each record)
I have set the block size as 64 bytes. What I am trying to check is, whether there exists a situation when a single record is broken into two blocks or not.
Now logically, since the block size is 64 bytes, after uploading the file to HDFS it should create 3 blocks of sizes 64, 64, and 27 bytes respectively, which it does. Also, since the size of the first block is 64 bytes, it should contain only the following data:
This is xyz
This is my home
This is my PC
This is my room
Th
Now I want to see whether the first block is actually like this or not. If I browse HDFS via the browser and download the file, it downloads the entire file, not a single block.
So I decided to run a map-reduce job which would only display the record values (setting reducers=0, the mapper output as context.write(null, record_value), and changing the default delimiter to "").
While running the job, the job counters show 3 splits, which is expected, but after completion, when I check the output directory, it shows 3 mapper output files, of which 2 are empty, and the first mapper output file has all the content of the file as it is.
Can anyone help me with this? Is there a possibility that newer versions of Hadoop handle incomplete records automatically?
Steps followed to reproduce the scenario
1) Created a file sample.txt with the above content (total size ~153B)
cat sample.txt
This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxxx
2) Added the property to hdfs-site.xml
<property>
    <name>dfs.namenode.fs-limits.min-block-size</name>
    <value>10</value>
</property>
and loaded the file into HDFS with a block size of 64B:
hdfs dfs -Ddfs.bytes-per-checksum=16 -Ddfs.blocksize=64 -put sample.txt /
This created three blocks of sizes 64B, 64B and 25B.
Content in Block0:
This is xyz
This is my home
This is my PC
This is my room
This i
Content in Block1:
s ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xx
Content in Block2:
xx xxxxxxxxxxxxxxxxxxxxx
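(A quick sanity check of these block boundaries, assuming every line, including the last one, ends with \n:)
# Line lengths of sample.txt, counting the trailing newline of each line.
lens = [len(s) + 1 for s in [
    'This is xyz',
    'This is my home',
    'This is my PC',
    'This is my room',
    'This is ubuntu PC xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxxxxxxxxxxxxxxxxxxx',
]]
print(sum(lens))      # 153 bytes in total, matching the 64 + 64 + 25 byte blocks
print(sum(lens[:4]))  # 58 bytes, so Block0's 64 bytes end 6 characters into line 5 ("This i")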
3) A simple mapper.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
    print(line)
4) Hadoop Streaming with 0 reducers:
yarn jar hadoop-streaming-2.7.1.jar -Dmapreduce.job.reduces=0 -file mapper.py -mapper mapper.py -input /sample.txt -output /splittest
The job ran with 3 input splits invoking 3 mappers and generated 3 output files, with one file holding the entire content of sample.txt and the rest being 0B files.
hdfs dfs -ls /splittest
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/_SUCCESS
-rw-r--r-- 3 user supergroup 168 2017-03-22 11:13 /splittest/part-00000
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/part-00001
-rw-r--r-- 3 user supergroup 0 2017-03-22 11:13 /splittest/part-00002
The file sample.txt is split into 3 splits, and these splits are assigned to the mappers as follows:
mapper1: start=0, length=64B
mapper2: start=64, length=64B
mapper3: start=128, length=25B
This only determines which portion of the file has to be read by each mapper; it is not necessarily exact. What a mapper actually reads is determined by the FileInputFormat and its record boundaries, here TextInputFormat.
This uses LineRecordReader to read the content from each split, with \n as the delimiter (line boundary). For a file that isn't compressed, the lines are read by each mapper as explained below.
For the mapper whose start index is 0, line reading starts from the start of the split. If the split ends with \n, reading ends at the split boundary; otherwise it looks for the first \n past the assigned split length (here 64B), so that it does not end up processing a partial line.
For all other mappers (start index != 0), the reader checks whether the character preceding the start index (start - 1) is \n. If yes, it reads the content from the start of the split; otherwise it skips the content between its start index and the first \n character encountered in that split (as that content is handled by another mapper) and starts reading after that first \n.
Here, mapper1 (start index 0) starts with Block0, whose split ends in the middle of a line. It therefore continues to read that line, which consumes the entire Block1, and since Block1 does not contain a \n either, mapper1 keeps reading until it finds a \n, which ends up consuming the entire Block2 as well. That is how the entire content of sample.txt ended up in a single mapper's output.
For mapper2 (start index != 0), the character preceding its start index is not a \n, so it skips ahead to the next line and finds no content left within its split: an empty mapper output. mapper3 is in the identical situation to mapper2.
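To make this concrete, here is a small toy simulation of those rules in Python (an illustration of the behaviour described above, not the actual LineRecordReader code), run against the sample.txt from step 1:
#!/usr/bin/env python
# Simulate the split-reading rules described above for the three splits.
def read_split(data, start, length):
    end = start + length
    pos = start
    if start != 0:
        # The partial first line belongs to the previous mapper unless the
        # byte just before this split is already a newline.
        if data[start - 1:start] != b'\n':
            nl = data.find(b'\n', start)
            if nl == -1:
                return []           # the whole split is the tail of another mapper's line
            pos = nl + 1
    lines = []
    while pos < end:                # read whole lines until we pass the split end
        nl = data.find(b'\n', pos)
        if nl == -1:
            lines.append(data[pos:])
            break
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines

data = open('sample.txt', 'rb').read()
for start, length in [(0, 64), (64, 64), (128, 25)]:
    lines = read_split(data, start, length)
    print('split start=%d length=%d -> %d line(s) read' % (start, length, len(lines)))
This prints 5 lines for the first split and 0 for the other two, matching the one non-empty and two empty mapper outputs above.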
Try changing the content of sample.txt like this to see different results
This is xyz
This is my home
This is my PC
This is my room
This is ubuntu PC xxxx xxxx xxxx xxxx
xxxx xxxx xxxx xxxx xxxx xxxx xxxx
xxxxxxxxxxxxxxxxxxxxx
Use the following command to get the block list for your file on HDFS
hdfs fsck PATH -files -blocks -locations
where PATH is the full HDFS path where your file is located.
The output (shown partially below) will be something like this:
Connecting to namenode via http://ec2-54-235-1-193.compute-1.amazonaws.com:50070/fsck?ugi=student6&files=1&blocks=1&locations=1&path=%2Fstudent6%2Ftest.txt
FSCK started by student6 (auth:SIMPLE) from /172.31.11.124 for path /student6/test.txt at Wed Mar 22 15:33:17 UTC 2017
/student6/test.txt 22 bytes, 1 block(s): OK 0. BP-944036569-172.31.11.124-1467635392176:blk_1073755254_14433 len=22 repl=1 [DatanodeInfoWithStorage[172.31.11.124:50010,DS-4a530a72-0495-4b75-a6f9-75bdb8ce7533,DISK]]
Copy the block name from that output, blk_1073755254 in the example above (excluding the _14433).
Go to the Linux file system on your datanode, to the directory where the blocks are stored (this is pointed to by the dfs.datanode.data.dir parameter of hdfs-site.xml), and search the entire subtree from that location for a filename that contains the string you just copied. That will tell you which subdirectory under dfs.datanode.data.dir contains a file with that string in its name (exclude any filename with a .meta suffix). Once you have located such a file, you can run the Linux cat command on it to see your file's contents.
Remember that although the file is an HDFS file, under the covers it is actually stored on the Linux file system, and each block of the HDFS file is a unique Linux file. The block is identified on the Linux file system by the name shown in the fsck output above.
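Purely as an illustration of that search, a small Python walk (the data directory path here is an assumption; use whatever dfs.datanode.data.dir points to on your datanode):
#!/usr/bin/env python
# Walk the datanode's block storage and print block files matching the id.
import os

block = 'blk_1073755254'        # block name taken from the fsck output above
data_dir = '/hadoop/dfs/data'   # assumption: your dfs.datanode.data.dir value
for root, dirs, files in os.walk(data_dir):
    for name in files:
        if block in name and not name.endswith('.meta'):
            print(os.path.join(root, name))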

hive insert overwrite directory only overwrites the direct path of the generated file, not the directory

-bash-4.1$ hadoop fs -ls /mytest/warehouse/mytable/
Found 4 items
-rwxrwxrwx 3 myvm users 1163 2016-11-24 03:11 /mytest/warehouse/mytable/000000_0
-rwxrwxrwx 3 myvm users    0 2016-11-24 03:09 /mytest/warehouse/mytable/000000_1
-rwxrwxrwx 3 myvm users    0 2016-11-24 03:09 /mytest/warehouse/mytable/000000_2
-rwxrwxrwx 3 myvm users    0 2016-11-24 03:09 /mytest/warehouse/mytable/000000_3
QUESTION
insert overwrite directory "/mytest/warehouse/mytable" select * from my_table
The above command will only overwrite the file it is generating, that is: /mytest/warehouse/mytable/000000_0
I expected it to remove all the files under the path and create 1 file with the desired output.
It seemed to be working fine before moving to hive-1.1.0-cdh5.5.1.
It is generating 4 part files because your number of reducers is 4. To generate only one part file in the output,
you can set this Hive property in your Hive terminal:
set mapred.reduce.tasks=1
Also,
the number of reducers depends on the size of the input file.
By default it is 1GB (1000000000 bytes) per reducer. You could change that by setting the property hive.exec.reducers.bytes.per.reducer:
either by changing hive-site.xml
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>1000000</value>
</property>
or using set
$ hive -e "set hive.exec.reducers.bytes.per.reducer=1000000"

Hadoop Log File Analysis from 2 separate machines

I am new to Hadoop. I have to find the trend of symbols traded among users.
I have 2 machines b040n10 and b040n11. The files in the machine are as mentioned below:
b040n10:/u/ssekar>ls -lrt
-rw-r--r-- 1 root root 482342353 Feb 8 2014 A.log
-rw-r--r-- 1 root root 481231231 Feb 8 2014 B.log
b040n11:/u/ssekar>ls -lrt
-rw-r--r-- 1 root root 412312312 Feb 8 2014 C.log
-rw-r--r-- 1 root root 412356315 Feb 8 2014 D.log
There is a field called "symbol_name" on all these logs (example below).
IP=145.45.34.2;symbol_name=ABC;timestamp=12:13:05
IP=145.45.34.2;symbol_name=XYZ;timestamp=12:13:56
IP=145.45.34.2;symbol_name=ABC;timestamp=12:14:56
I have Hadoop running on my Laptop and I have 2 machines connected to my Laptop (can be used as Datanodes).
My task now is to get the list of symbol_name and the Symbol count.
As mentioned below:
ABC - 2
XYZ - 1
Should I now:
1. Copy all the files (A.log, B.log, C.log, D.log) from b040n10 and b040n11 to my laptop,
2. Issue a copyFromLocal command to the HDFS system and analyze the data?
Or is there a better way to find out the symbol_name and count without copying these files to my laptop?
The question is a basic one, but I am new to Hadoop; please help me to understand and use Hadoop better. Please let me know if more information on the question is needed.
Thanks
Copying the files from Hadoop to your local laptop defeats the entire purpose of Hadoop, which is to move the processing to the data, not the other way around: when you really have "Big Data", you won't be able to move the data around to process it locally.
Your problem is a typical case for Map/Reduce; all you need is a job that counts the occurrences of each symbol. Just search for the Map/Reduce WordCount example and adapt it to your case.
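For instance, with Hadoop Streaming (as used elsewhere on this page), a mapper/reducer pair for this count might look like the sketch below; the script names and the semicolon-separated field layout are assumptions based on the quoted log lines, not code from the answer.
#!/usr/bin/env python
# symbol_mapper.py (hypothetical name): emit "symbol<TAB>1" for every log line.
import sys

for line in sys.stdin:
    for field in line.strip().split(';'):
        if field.startswith('symbol_name='):
            print('%s\t%s' % (field.split('=', 1)[1], 1))

#!/usr/bin/env python
# symbol_reducer.py (hypothetical name): sum the counts per symbol. Hadoop
# sorts the mapper output by key, so equal symbols arrive consecutively.
import sys

current, count = None, 0
for line in sys.stdin:
    symbol, value = line.rstrip('\n').split('\t', 1)
    if symbol == current:
        count += int(value)
    else:
        if current is not None:
            print('%s\t%d' % (current, count))
        current, count = symbol, int(value)
if current is not None:
    print('%s\t%d' % (current, count))
These would be wired up with the streaming jar's -mapper and -reducer options, much like the streaming command shown in the input-splits question above.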

what does the terminal command ls -l show?

I know that it outputs the "long" version but what do each of the sections mean?
On my mac, when I type in
ls -l /Users
I get
total 0
drwxr-xr-x+ 33 MaxHarris staff 1122 Jul 1 14:06 MaxHarris
drwxrwxrwt 8 root wheel 272 May 20 13:26 Shared
drwxr-xr-x+ 14 admin staff 476 May 17 11:25 admin
drwxr-xr-x+ 44 hugger staff 1496 Mar 17 21:13 hugger
I know that the first column is the permissions, although I don't know what the order is. It would be great if that could be explained too. Then what's the number after it?
Basically, what does each one of these things mean? Why are the usernames sometimes written twice, and why don't they match other times?
The option '-l' tells the command to use a long list format. It gives back several columns which correspond to:
Permissions
Number of hardlinks
File owner
File group
File size
Modification time
Filename
The first letter in the permissions column shows the file's type. A 'd' means a directory and a '-' means a normal file (there are other characters, but those are the basic ones).
The next nine characters are divided into 3 groups of permissions. Each letter in a group corresponds to the read, write, and execute permissions, and the groups correspond to the owner of the file, the group of the file, and everyone else.
[ File type ][ Owner permissions ][ Group permissions ][ Everyone permissions ]
The characters can be one of four options:
r = read permission
w = write permission
x = execute permission
- = no permission
Finally, the "+" at the end means the file has some extended permissions (such as an access control list).
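To make the grouping concrete, here is a tiny illustrative Python snippet (not part of the original answer) that slices up the first column of the MaxHarris line from the listing above:
# Decode "drwxr-xr-x+", the mode column of the MaxHarris entry shown earlier.
mode = 'drwxr-xr-x+'
print('type     : ' + ('directory' if mode[0] == 'd' else 'regular file'))
print('owner    : ' + mode[1:4])    # rwx -> read, write, execute (search)
print('group    : ' + mode[4:7])    # r-x -> read, execute (search)
print('everyone : ' + mode[7:10])   # r-x -> read, execute (search)
print('extra    : ' + mode[10:])    # '+' -> extended security info (e.g. an ACL)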
If you type the command
$ man ls
You’ll get the documentation for ls, which says in part:
The Long Format
If the -l option is given, the following information is displayed for each file: file mode, number of links, owner name, group name, number of bytes in the file, abbreviated month, day-of-month file was last modified, hour file last modified, minute file last modified, and the pathname. In addition, for each directory whose contents are displayed, the total number of 512-byte blocks used by the files in the directory is displayed on a line by itself, immediately before the information for the files in the directory. If the file or directory has extended attributes, the permissions field printed by the -l option is followed by a '#' character. Otherwise, if the file or directory has extended security information (such as an access control list), the permissions field printed by the -l option is followed by a '+' character.
…
The man command is short for “manual”, and the articles it shows are called “man pages”; try running man manpages to learn even more about them.
The following information is provided:
permissions
number of hard links
owner of the file
group the file belongs to
size
modification/creation date and time
file/directory name
