hadoop mapreduce.partition.keypartitioner.options not working

I want to partition the data so that all records whose key has the same first field go to the same reducer, for example all the [ 11 * * * ] records.
But the key partitioner does not seem to work, and I can't figure out why.
The script run.sh is:
#!/usr/bin/sh
hadoop fs -rm -r /training/likang/tmp2
hadoop fs -rm /training/likang/tmp/testfile
hadoop fs -put testfile1 /training/likang/tmp/testfile
hadoop-streaming -D stream.map.output.field.separator="\t" \
-D stream.num.map.output.key.fields=2 \
-D map.output.key.field.separator="\t" \
-D mapreduce.partition.keypartitioner.options=-k1,1 \
-D mapreduce.job.maps=2 \
-D mapreduce.job.reduces=2 \
-D mapred.job.name="lk_filt_rid" \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-input /training/likang/tmp/testfile \
-output /training/likang/tmp2 \
-mapper "cat" -reducer "cat"
hadoop fs -cat /training/likang/tmp2/part-00000
echo "------------------"
hadoop fs -cat /training/likang/tmp2/part-00001
The input file testfile1 is:
11 5 333 111
11 5 777 000
11 3 888 999
11 9 988 888
11 7 234 2342
11 5 4 4
15 9 230 134
12 8 232 834
15 77 220 000
15 33 256 399
11 5 999 888
15 9 222 111
14 88 372 233
15 9 66 77
11 5 821 221
11 0 11 11
15 0 22 22
12 0 33 33
14 0 44 44
The result shows that the [ 11 * * * ] records are not all sent to the same reducer... Does anybody know why? Thank you.

Now I know: it helps to delete this line:
-D map.output.key.field.separator="\t" \
After removing that option the result is correct, but the reason is even more confusing.
The default value of map.output.key.field.separator seems to be just a tab, but when I write it out explicitly it breaks the partitioning.
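A likely explanation: inside double quotes the shell does not expand \t, so the job configuration receives the literal two-character string \t as the key field separator instead of a real tab. KeyFieldBasedPartitioner then splits the key on that literal string, never finds it, treats the whole two-field key as a single field, and -k1,1 ends up hashing the entire key, which scatters the 11s. With the option removed, the default separator is a real tab and the split works. A common workaround is to pass a real tab from the shell (e.g. $'\t' in bash) or simply to omit the option. Below is a small Python sketch of that splitting logic; it is a simplified stand-in, not Hadoop's actual hash function:
# Simplified stand-in for KeyFieldBasedPartitioner: split the key on the
# configured separator, take field 1 (-k1,1), hash it, mod numReduceTasks.
def partition(key, separator, num_reducers):
    first_field = key.split(separator)[0]
    return hash(first_field) % num_reducers

keys = ["11\t5", "11\t9", "11\t3"]

# Default separator (a real tab): every key starting with 11 lands together.
print [partition(k, "\t", 2) for k in keys]   # all three values are equal

# Literal two-character '\t' (what the quoted -D option passes through):
# the split never matches, the whole key is hashed, and the 11s scatter.
print [partition(k, "\\t", 2) for k in keys]  # values can differ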

Related

How to sort by key and value in mapreduce?

I have a text file:
10 1 15
10 12 30
10 9 45
10 8 40
10 15 55
12 9 0
12 7 18
12 10 1
9 1 1
9 2 1
9 0 1
14 5 5
And I would like to get this file as an output of my MapReduce job:
9 0 1
9 1 1
9 2 1
10 1 15
10 8 40
10 9 45
10 12 30
10 15 55
12 7 18
12 9 0
12 10 1
14 5 5
It means it has to be sorted by 1st, 2nd and 3rd columns.
I use this command:
#!/bin/bash
IN_DIR="/user/cloudera/temp"
OUT_DIR="/user/cloudera/temp_out"
NUM_REDUCERS=1
hdfs dfs -rmr ${OUT_DIR} > /dev/null
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.job.name="Parsing mista pages job 1 (parsing)" \
-D stream.num.map.output.key.fields=3 \
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
-D mapreduce.partition.keycomparator.options='-k1,1n -k2,2n -k3,3n' \
-D mapreduce.job.reduces=${NUM_REDUCERS} \
-mapper 'cat' \
-reducer 'cat' \
-input ${IN_DIR} \
-output ${OUT_DIR}
hdfs dfs -cat ${OUT_DIR}/* | head -100
And get exactly what I want. BUT. When I do NUM_REDUCERS=2 I get this output:
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/temp_out/part-00000 | head -100
9 1 1
10 9 45
10 12 30
10 15 55
12 7 18
12 10 1
14 5 5
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/temp_out/part-00001 | head -100
9 0 1
9 2 1
10 1 15
10 8 40
12 9 0
Why does the partitioner send rows with the same key (for example '9') to different reducers?
How can I force the partitioner to split the mapper output by key and sort it by value? For example, if I have 4 reducers, the reducer inputs should be:
reducer 1
9 0 1
9 1 1
9 2 1
reducer 2
10 1 15
10 8 40
10 9 45
10 12 30
10 15 55
reducer 3
12 7 18
12 9 0
12 10 1
reducer 4:
14 5 5
You can override the default Partitioner to send each key to its own reducer. Set the number of reducers to match the number of keys, so that each reducer handles exactly one key. For example:
groupMap.put("9", 0);
groupMap.put("10", 1);
groupMap.put("12", 2);
groupMap.put("14", 3);
Add the -partitioner argument to use your own partitioner in the job.
I think it might work for you.
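For illustration, here is a small Python sketch of the getPartition logic that answer describes. In a real streaming job this would be a Java Partitioner class shipped with the job and selected via -partitioner; the Python below only demonstrates the routing, and the names are made up:
# Fixed map from the first key field to a reducer number, as in the
# groupMap example above; unknown keys fall back to hashing.
group_map = {"9": 0, "10": 1, "12": 2, "14": 3}

def get_partition(key, num_reducers):
    first_field = key.split()[0]
    return group_map.get(first_field, hash(first_field)) % num_reducers

for line in ["9 0 1", "10 1 15", "12 9 0", "14 5 5"]:
    print line, "-> reducer", get_partition(line, 4)
Note that the stock KeyFieldBasedPartitioner from the first question, with -D mapreduce.partition.keypartitioner.options=-k1,1, also groups rows by their first field without any custom code; the difference is that it hashes the field rather than letting you pin specific keys to specific reducers.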

Is there a way to know what the Oracle Data Pump utility updates in the dmp file after the extract?

I do not want to wait for Oracle Data Pump (expdp) to finish writing the dump file, so I start reading data from the moment the file is created and write that data to another file.
This worked OK: the file sizes are the same (the one Data Pump created and the one my monitoring script created).
But when I run cmp it shows a difference of 27 bytes:
cmp -l ora.dmp monitor_10k_rows.dmp
3 263 154
4 201 131
5 174 173
6 103 75
48 64 70
58 0 340
64 0 1
65 0 104
66 0 110
541 60 61
545 60 61
552 60 61
559 60 61
20508 0 15
20509 0 157
20510 0 230
20526 0 10
20532 0 15
20533 0 225
20534 0 150
913437 0 226
913438 0 37
913454 0 10
913460 0 1
913461 0 104
913462 0 100
ls -al ora.dmp
-rw-r--r-- 1 oracle oinstall 999424 Jun 20 11:35 ora.dmp
python -c 'print 999424-913462'
85962
od ora.dmp -j 913461 -N 1
3370065 000100
3370066
od monitor_10k_rows.dmp -j 913461 -N 1
3370065 000000
3370066
Even if I extract more data the difference is still 27 bytes but different addresses/values:
cmp -l ora.dmp monitor_30k_rows.dmp
3 245 134
4 222 264
5 377 376
6 54 45
48 36 43
57 0 2
58 0 216
64 0 1
65 0 104
66 0 120
541 60 61
545 60 61
552 60 61
559 60 61
20508 0 50
20509 0 126
20510 0 173
20526 0 10
20532 0 50
20533 0 174
20534 0 120
2674717 0 226
2674718 0 47
2674734 0 10
2674740 0 1
2674741 0 104
2674742 0 110
Some writes are the same.
Is there a way to know the addresses of the bytes that will differ ahead of time?
ls -al ora.dmp
-rw-r--r-- 1 bicadmin bic 2760704 Jun 20 11:09 ora.dmp
python -c 'print 2760704-2674742'
85962
How can I update my monitored copy after Data Pump updates the original at address 2674742, using Python for example?
Exactly the same thing happens if I use the COMPRESSION=DATA_ONLY option.
Update: I figured out how to sync the bytes that differ between the two files:
import os

def patch_file(fn, diff):
    # each diff line is cmp -l output: 1-based offset, byte in the first
    # file, byte in the second file (both octal); copy the first file's
    # byte into fn at that offset
    for line in diff.split(os.linesep):
        if line:
            addr, to_octal, _ = line.strip().split()
            with open(fn, 'r+b') as f:
                f.seek(int(addr) - 1)
                f.write(chr(int(to_octal, 8)))
diff="""
3 157 266
4 232 276
5 272 273
6 16 25
48 64 57
58 340 0
64 1 0
65 104 0
66 110 0
541 61 60
545 61 60
552 61 60
559 61 60
20508 15 0
20509 157 0
20510 230 0
20526 10 0
20532 15 0
20533 225 0
20534 150 0
913437 226 0
913438 37 0
913454 10 0
913460 1 0
913461 104 0
913462 100 0
"""
patch_file(f3,diff)
Then I wrote a more general patch using Python:
import binascii
import os

addr = [3, 4, 5, 6, 48, 58, 64, 65, 66, 541, 545, 552, 559,
        20508, 20509, 20510, 20526, 20532, 20533, 20534]
last_range = [85987, 85986, 85970, 85964, 85963, 85962]

def get_bytes(addr):
    # read one byte at each 1-based offset and return (offset, octal value)
    out = []
    with open(f1, 'r+b') as f:
        for a in addr:
            f.seek(a - 1)
            data = f.read(1)
            hex = binascii.hexlify(data)
            binary = int(hex, 16)
            octa = oct(binary)
            out.append((a, octa))
    return out

def patch_file(fn, bytes_to_update):
    with open(fn, 'r+b') as f:
        for (a, to_octal) in bytes_to_update:
            print (a, to_octal)
            f.seek(int(a) - 1)
            f.write(chr(int(to_octal, 8)))

if 1:
    from_file = f1  # f1: path of the file Data Pump is writing
    fsize = os.stat(from_file).st_size
    # fixed header offsets plus offsets measured back from the end of file
    bytes_to_read = addr + [fsize - x for x in last_range]
    bytes_to_update = get_bytes(bytes_to_read)
    to_file = f3    # f3: path of the monitored copy
    patch_file(to_file, bytes_to_update)
The reason I do dmp file monitoring is because it cuts backup time in half.

Remove rows that have a specific numeric value in a field

I have a very bulky file of about 1M lines, like this:
4001 168991 11191 74554 60123 37667 125750 28474
8 145 25 101 83 51 124 43
2985 136287 4424 62832 50788 26847 89132 19184
3 129 14 101 88 61 83 32 1 14 10 12 7 13 4
6136 158525 14054 100072 134506 78254 146543 41638
1 40 4 14 19 10 35 4
2981 112734 7708 54280 50701 33795 75774 19046
7762 339477 26805 148550 155464 119060 254938 59592
1 22 2 12 10 6 17 2
6 136 16 118 184 85 112 56 1 28 1 5 18 25 40 2
1 26 2 19 28 6 18 3
4071 122584 14031 69911 75930 52394 89733 30088
1 9 1 3 4 3 11 2 14 314 32 206 253 105 284 66
I want to remove rows that have a value less than 100 in the second column.
How can I do this with sed?
I would use awk to do this. Example:
awk ' $2 >= 100 ' file.txt
This will display only those rows from file.txt whose second column ($2) is greater than or equal to 100.
Use the following approach:
sed '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' -E /tmp/test.txt
(replace /tmp/test.txt with your current file path)
([0-9]{1,2}|[0][0-9]+) - will match either a number from 0 to 99 OR a number with a leading zero (e.g. 012, 00982)
d - delete the pattern space;
-E(--regexp-extended) - Use extended regular expressions rather than basic regular expressions
To remove the matched lines in place, use the -i option:
sed -i -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' /tmp/test.txt
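For cross-checking either answer, here is a minimal Python sketch of the same filter (using the placeholder path from above):
import sys

# Keep only rows whose second whitespace-separated column is >= 100;
# mirrors awk ' $2 >= 100 ' file.txt.
with open("/tmp/test.txt") as src:
    for line in src:
        fields = line.split()
        if len(fields) > 1 and float(fields[1]) >= 100:
            sys.stdout.write(line)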

Proper way to Re-Balance a 2-3 Tree after deleting the root node

The goal is to remove 22 from the root node and re-balance the tree.
First I remove 22, and replace it by its in-order successor 28.
Secondly I rebalance the resulting tree, by moving the empty node to the left. The resulting tree is below.
Is moving 28 up the right procedure, and did I balance the left side correctly in the end?
       22,34
      /  |  \
    16   28   37
   / \  / \  / \
  15 21 25 33 35 43
      [28],34
      /  |  \
    16   *    37
   / \  / \  / \
  15 21 25 33 35 43
         34
       /    \
   16,28     37
   /  |  \   / \
  15 21,25 33 35 43
Thanks!
To delete 22 from
       22,34
      /  |  \
    16   28   37
   / \  / \  / \
  15 21 25 33 35 43 ,
we replace it by its in-order successor 25, leaving a hole (*).
       25,34
      /  |  \
    16   28   37
   / \  / \  / \
  15 21  * 33 35 43
We can't fix the hole by borrowing, so we merge its parent into its sibling, moving the hole up.
       25,34
      /  |  \
    16   *    37
   / \   |   / \
  15 21 28,33 35 43
The hole has two siblings now, so we can redistribute one of the parent's keys down.
         34
       /    \
   16,25     37
   /  |  \   / \
  15 21 28,33 35 43
(I'm working from this set of lecture notes. Don't bother memorizing the details here unless it's for an exam. Even then... I really wish data structure courses did not emphasize balanced search trees to the degree that they do.)
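If you would rather check the result mechanically than by eye, here is a short sketch that verifies the two 2-3 tree invariants on the final tree: every leaf sits at the same depth, and an in-order traversal yields the keys in sorted order. The tuple encoding of nodes is invented for this example:
def check(node, depth=0):
    # node is (keys, children); returns (leaf depth, in-order key list)
    keys, children = node
    if not children:
        return depth, list(keys)
    assert len(children) == len(keys) + 1, "internal node needs k+1 children"
    depths, inorder = set(), []
    for i, child in enumerate(children):
        d, sub = check(child, depth + 1)
        depths.add(d)
        inorder += sub
        if i < len(keys):
            inorder.append(keys[i])
    assert len(depths) == 1, "all leaves must be at the same depth"
    assert inorder == sorted(inorder), "keys out of order"
    return depths.pop(), inorder

leaf = lambda *ks: (ks, ())

# The final tree from the answer above.
final = ((34,), (
    ((16, 25), (leaf(15), leaf(21), leaf(28, 33))),
    ((37,), (leaf(35), leaf(43))),
))
print check(final)   # (2, [15, 16, 21, 25, 28, 33, 34, 35, 37, 43])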

convert comma separated list in text file into columns in bash

I've managed to extract data (from an html page) that goes into a table, and I've isolated the columns of said table into a text file that contains the lines below:
[30,30,32,35,34,43,52,68,88,97,105,107,107,105,101,93,88,80,69,55],
[28,6,6,50,58,56,64,87,99,110,116,119,120,117,114,113,103,82,6,47],
[-7,,,43,71,30,23,28,13,13,10,11,12,11,13,22,17,3,,-15,-20,,38,71],
[0,,,3,5,1.5,1,1.5,0.5,0.5,0,0.5,0.5,0.5,0.5,1,0.5,0,-0.5,-0.5,2.5]
Each bracketed list of numbers represents a column. What I'd like to do is turn these lists into actual columns that I can work with in different data formats. I'd also like to be sure to include the blank parts of these lists (i.e., "[,,,]").
This is basically what I'm trying to accomplish:
30 28 -7 0
30 6
32 6
35 50 43 3
34 58 71 5
43 56 30 1.5
52 64 23 1
. . . .
. . . .
. . . .
I'm parsing data from a web page, and ultimately planning to make the process as automated as possible so I can easily work with the data after I output it to a nice format.
Anyone know how to do this, have any suggestions, or thoughts on scripting this?
Since you have your lists in Python, just do it in Python:
l = [["30", "30", "32"], ["28", "6", "6"], ["-7", "", ""], ["0", "", ""]]
for i in zip(*l):
    print "\t".join(i)
produces
30 28 -7 0
30 6
32 6
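One caveat: zip() stops at the shortest list, and the four sample lists have different lengths, so trailing values would be silently dropped. A padded variant (a sketch, sticking with Python 2 to match the answer; the sample lists here are shortened and deliberately unequal):
from itertools import izip_longest

l = [["30", "30", "32"], ["28", "6", "6", "50"], ["-7", "", ""], ["0", "", ""]]
# izip_longest pads the shorter lists with fillvalue instead of truncating
for row in izip_longest(*l, fillvalue=""):
    print "\t".join(row)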
An awk-based solution:
awk -F, '{gsub(/\[|\]/, ""); for (i=1; i<=NF; i++) a[i]=a[i] ? a[i] OFS $i: $i}
END {for (i=1; i<=NF; i++) print a[i]}' file
30 28 -7 0
30 6
32 6
35 50 43 3
34 58 71 5
43 56 30 1.5
52 64 23 1
..........
..........
Another solution, but it works only for a file with 4 lines:
$ paste \
<(sed -n '1{s,\[,,g;s,\],,g;s|,|\n|g;p}' t) \
<(sed -n '2{s,\[,,g;s,\],,g;s|,|\n|g;p}' t) \
<(sed -n '3{s,\[,,g;s,\],,g;s|,|\n|g;p}' t) \
<(sed -n '4{s,\[,,g;s,\],,g;s|,|\n|g;p}' t)
30 28 -7 0
30 6
32 6
35 50 43 3
34 58 71 5
43 56 30 1.5
52 64 23 1
68 87 28 1.5
88 99 13 0.5
97 110 13 0.5
105 116 10 0
107 119 11 0.5
107 120 12 0.5
105 117 11 0.5
101 114 13 0.5
93 113 22 1
88 103 17 0.5
80 82 3 0
69 6 -0.5
55 47 -15 -0.5
-20 2.5
38
71
Update: another version, with preprocessing:
$ sed 's|\[||;s|\][,]\?||' t >t2
$ paste \
<(sed -n '1{s|,|\n|g;p}' t2) \
<(sed -n '2{s|,|\n|g;p}' t2) \
<(sed -n '3{s|,|\n|g;p}' t2) \
<(sed -n '4{s|,|\n|g;p}' t2)
If a file named data contains the data given in the problem (exactly as defined above), then the following bash command line will produce the requested output (rs is the BSD reshape utility; -T transposes its input):
$ sed -e 's/\[//' -e 's/\]//' -e 's/,/ /g' <data | rs -T
Example:
cat data
[30,30,32,35,34,43,52,68,88,97,105,107,107,105,101,93,88,80,69,55],
[28,6,6,50,58,56,64,87,99,110,116,119,120,117,114,113,103,82,6,47],
[-7,,,43,71,30,23,28,13,13,10,11,12,11,13,22,17,3,,-15,-20,,38,71],
[0,,,3,5,1.5,1,1.5,0.5,0.5,0,0.5,0.5,0.5,0.5,1,0.5,0,-0.5,-0.5,2.5]
$ sed -e 's/\[//' -e 's/\]//' -e 's/,/ /g' <data | rs -T
30 28 -7 0
30 6 43 3
32 6 71 5
35 50 30 1.5
34 58 23 1
43 56 28 1.5
52 64 13 0.5
68 87 13 0.5
88 99 10 0
97 110 11 0.5
105 116 12 0.5
107 119 11 0.5
107 120 13 0.5
105 117 22 1
101 114 17 0.5
93 113 3 0
88 103 -15 -0.5
80 82 -20 -0.5
69 6 38 2.5
55 47 71
