TotalOrderPartitioner in MapReduce example - Hadoop

I am trying to run the sample provided in the Alex Holmes book:
https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sort/total/TotalSortMapReduce.java
However, when I run the program after packaging it as a jar, I get an exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.hadoop.mapred.lib.InputSampler.writePartitionFile(InputSampler.java:338)
    at com.manning.hip.ch4.sort.total.TotalSortMapReduce.runSortJob(TotalSortMapReduce.java:44)
    at com.manning.hip.ch4.sort.total.TotalSortMapReduce.main(TotalSortMapReduce.java:12)
Can someone please help me understand how to run the code? I have provided the following arguments:
args[0] --> the input path to names.txt (the file which needs to be sorted); it is in Hadoop (HDFS).
args[1] --> the sample partition file which should be generated; also an HDFS path.
args[2] --> the output directory where the sorted file should be generated.
Please guide me on how I need to run this code.

The reason for the problem is probably that the input data file is very small, while in the code:

InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);

you set maxSplitsSampled to 10 in RandomSampler<Text, Text>(double freq, int numSamples, int maxSplitsSampled).
You can solve the problem by setting that parameter to 1, or just by making sure it is not larger than the number of splits in your input file.
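For example, a rough sketch of that change, assuming the newer org.apache.hadoop.mapreduce.lib.partition classes and a configured Job named job (this is not the book's code verbatim):

// Sketch only: lower maxSplitsSampled so it cannot exceed the number of input splits.
InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 1);  // freq, numSamples, maxSplitsSampled

TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), new Path(args[1]));
InputSampler.writePartitionFile(job, sampler);                  // samples the input and writes the cut points
job.setPartitionerClass(TotalOrderPartitioner.class);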

So, I know this thread is more than 5 years old, but I came across the same issue just today and Mike's answer did not work for me. (I think by now Hadoop internally also makes sure you don't exceed the number of available splits.)
However, I found what caused the issue for me, so I am posting this hoping that it will help anyone else whose Google search led them to this truly ancient Hadoop thread.
In my case, the problem was that the input file I specified had too few samples and my sampling frequency was too low. In that situation it can happen (not every time, mind you, only sometimes, just to really drive you insane) that you generate fewer samples than the number of reducers you specified. Every time that happened, my system crashed with this error message:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 9 out of bounds for length 9
    at org.apache.hadoop.mapreduce.lib.partition.InputSampler.writePartitionFile(InputSampler.java:336)
    at ...
In this case, for example, only 9 samples were generated and I tried to employ more than 9 reducers.
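The workaround on my end was simply to keep the reducer count at or below the number of samples the input can actually yield, along these lines (values are illustrative, not a drop-in snippet):

// Illustrative only: for a tiny test input, sample aggressively and keep the reducer
// count modest, so the sampler can produce the numReduceTasks - 1 cut points it needs.
job.setNumReduceTasks(4);  // must not exceed the number of samples drawn from the input
InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<Text, Text>(1.0, 10000, 10);  // freq = 1.0 samples every record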


Looking for a more efficient way to pull data from multiple datasets in SAS

I'm trying to find a more efficient and speedier way (if possible) to pull subsets of observations that meet certain criteria from multiple hospital claims datasets in SAS. A simplified but common type of data pull would look like this:
data out.qualifying_patients;
    set in.state1_2017
        in.state1_2018
        in.state1_2019
        in.state1_2020
        in.state2_2017
        in.state2_2018
        in.state2_2019
        in.state2_2020;
    array prcode{*} I10_PR1-I10_PR25;
    do i=1 to 25;
        if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
    end;
    if cohort=1 then output;
run;
Now imagine that instead of 2 states and 4 years we have 18 states and 9 years -- each about 1GB in size. The code above works fine but it takes FOREVER to run on our non-optimized server setup. So I'm looking for alternate methods to perform the same task but hopefully at a faster clip.
I've tried including (KEEP=) or (DROP=) options for each dataset included in the SET statement to limit the variables being scanned, but this really didn't have much of an impact on speed -- and, for non-coding-related reasons, we pretty much need to pull all the variables.
I've also experimented a bit with hash tables, but it's too much to store in memory, so that didn't seem to solve the issue. This also isn't a MERGE problem, which seems to be what hash tables excel at.
Any thoughts on other approaches that might help? Every data pull we do contains customized criteria for a given project, but we do these pulls a lot, and it seems really inefficient to constantly be processing through the same datasets over and over without benefiting from that. Thanks for any help!
I happened to have a 1GB dataset on my computer. I tried several times, and it takes SAS no more than 25 seconds to set the dataset 8 times. I think the SET statement is too simple and basic to offer much room for improving its efficiency.
I think the issue may be located in the DO loop. Your program runs the loop 25 times for each record and may assign cohort more than once, which is not necessary. You can change it like this:
do i=1 to 25 until(cohort=1);
    if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
This can save a lot of loop iterations.
First, parallelization will help immensely here. Instead of running one job, one dataset after the next, run one job per state, or one job per year, or whatever makes sense for your dataset size and CPU count (you don't want more than one job per CPU). If your server has 32 cores, then you can easily run all the jobs you need here - one per state, say - and then, after that's done, combine the results together.
Look up SAS MP Connect for one way to do multiprocessing; it basically uses rsubmits to submit code to your own machine. You can also do this by using XCMD to literally launch SAS sessions - add a state parameter to the SAS program, then run 18 of them, have them output their results to a known location tagged with the state name or number, and then have your program collect them.
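A rough sketch of the MP Connect pattern (this assumes SAS/CONNECT is licensed; the task names are made up and the exact sign-on options depend on your site setup):

options autosignon sascmd="!sascmd";   /* spawn worker SAS sessions on this machine */

rsubmit task_state1 wait=no inheritlib=(in out);
    data out.qualifying_state1;
        set in.state1_2017 - in.state1_2020;
        array prcode{*} I10_PR1-I10_PR25;
        do i = 1 to dim(prcode) until (cohort = 1);
            if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort = 1;
        end;
        if cohort = 1;
        drop i;
    run;
endrsubmit;

/* ...one rsubmit block (or a macro that generates them) per state... */

waitfor _all_;     /* block until every asynchronous task finishes */
signoff _all_;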
Second, you can optimize the DO loop more - in addition to the suggestions above, you may be able to optimize using pointers. SAS stores character array variables in memory in adjacent spots (assuming they all come from the same place) - see "From Obscurity to Utility: ADDR, PEEK, POKE as DATA Step Programming Tools" by Paul Dorfman for more details. On page 10, he shows the method I describe here: you use PEEKC to get the concatenated values and then use INDEXW to find the thing you want.
data want;
    set have;
    array prcode{*} $8 I10_PR1-I10_PR25;
    found = (^^ indexw(peekc(addr(prcode[1]), 200), '0DTJ0ZZ')) or
            (^^ indexw(peekc(addr(prcode[1]), 200), '0DTJ4ZZ'));
run;
Something like that should work. It avoids the loop.
You also could, if you want to keep the loop, exit the loop once you run into an empty procedure code. Usually these things don't go all the way to 25, at least in my experience - they're left-filled, so I10_PR1 is always filled, then some of them - say, 5 or 10 - are filled, and I10_PR11 onward are empty; once you hit an empty one, you're done for that record. So leaving not only when you hit what you're looking for, but also when you hit an empty code, saves you a lot of processing time.
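A sketch of that combined early exit, assuming the codes really are left-filled with no gaps:

do i = 1 to dim(prcode);
    if missing(prcode{i}) then leave;               /* codes are left-filled: stop at the first blank */
    if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then do;
        cohort = 1;
        leave;                                      /* found a qualifying code: stop scanning */
    end;
end;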
You probably should consider a hardware upgrade or find someone who can tune your server. This paper suggests tips to improve the processing of large datasets.
Your code is pretty straightforward. The only suggestion is to kill the loop as soon as the criterion is met, to avoid wasting unnecessary resources.
do i=1 to 25;
    if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then do;
        output;  * cohort criteria met, so output the row;
        leave;   * exit the loop immediately;
    end;
end;

mclapply and spark_read_parquet

I am relatively new as an active user of the forum, but I first have to thank you all for your contributions, because I have been looking for answers here for years...
Today I have a question that nobody has solved, or at least I am not able to find a solution...
I am trying to read files in parallel from S3 (AWS) into Spark (on my local computer) as part of a test system. I have used mclapply, but when I set more than 1 core, it fails...
Example: (the same code works when using one core, but fails when using 2)
new_rdd_global <- mclapply(seq(file_paths), function(i) {
  spark_read_parquet(sc, name = paste0("rdd_", i), path = file_paths[i])
}, mc.cores = 1)   # works

new_rdd_global <- mclapply(seq(file_paths), function(i) {
  spark_read_parquet(sc, name = paste0("rdd_", i), path = file_paths[i])
}, mc.cores = 2)   # fails

Warning message:
In mclapply(seq(file_paths), function(i) { :
  all scheduled cores encountered errors in user code
Any suggestion???
Thanks in advance.
Just read everything into one table via a single spark_read_parquet() call; this way Spark handles the parallelization for you. If you need separate tables you can split them afterwards, assuming there's a column that tells you which file the data came from. In general you shouldn't need to use mclapply() when using Spark with R.
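A sketch of what that can look like with sparklyr, assuming the files share a schema and sit under one S3 prefix (the bucket, prefix, and table name here are placeholders):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# One call over the whole prefix; Spark parallelizes the read itself.
all_data <- spark_read_parquet(
  sc,
  name = "all_files",
  path = "s3a://my-bucket/my-prefix/*.parquet"
)

# If you later need to know which file each row came from, tag each row with its
# source file (input_file_name() is passed through to Spark SQL by sparklyr)
# and split on that column afterwards.
all_data <- all_data %>% mutate(src_file = input_file_name())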

Hive query: Is there a way to use UDTF with `cluster by`?

I know I'm not supposed to use cluster by after UDTF, so select myudtf("stringValue") cluster by rand() wouldn't work.
But since my UDTF outputs 7000+ (and growing) rows every hour, I really need to distribute the subsequent processing across all my Hadoop cluster slave units.
And I imagine I don't get that without using cluster by rand(), so I tried the following cheat:
First I wrap the result up in another table, select key from (select myudtf("stringValue") as key) t limit 1; and it gives the correct result:
OK
some/key/value/string
Time taken: 0.035 seconds, Fetched: 1 row(s)
Then I add the cluster by part, select key from (select myudtf("stringValue") as key) t cluster by rand() limit 1, and I get an error:
WARNING: Hive-on-MR is deprecated in Hive ...
....
Task with the most failures(4):
-----
Task ID:
task_....
URL:
http:....
....
-----
Diagnostic Messages for this Task:
Error: tried to access class sun.security.ssl.SSLSessionContextImpl from class sun.security.ssl.SSLSessionContextImplConstructorAccess
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I did this trying to trick Hive into treating the temporary table t as a "normal" table to which I can apply cluster by, hoping that it would distribute the workload to all the Hadoop slaves, but unfortunately Hive is clever enough to see through my poorly attempted trick.
So, could someone please help me clarify my misconceptions, or give me some hint of the correct way to do this?
FYI, I asked for help from a highly experienced engineer in my company, and he thinks it may be a deeper system-level bug. He tried to trace the problem for twenty-odd minutes before he left work; he did find some library version issues but couldn't fix the problem after all. ...And I just guess it must be something I did wrong.
It turns out to be a mistake in my UDTF. I found a fix, but I don't quite understand why it worked. At the beginning, when I was implementing the UDTF, Eclipse suggested that initialize is deprecated, but I got an error if I skipped it, so I implemented it anyway. I put a variable initialization in that method, guessing the init would only be done once. The jar worked for some simpler scenarios, but if I used the UDTF output with a UDF, and then used the UDF output to do something, like the cheating cluster by or an insert, I got the previously mentioned error. An engineer friend of mine found out that initialize actually gets executed more than once. So I just put the initialization in process, with an if that checks whether the variable is null and initializes it if it is. Then everything works fine, and my cheat also worked. Still, if someone can give me a more specific explanation, I would be most grateful.
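In case it helps, the pattern that fixed it for me looks roughly like this (a hypothetical GenericUDTF skeleton, not my real one):

import java.util.Arrays;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyUDTF extends GenericUDTF {

    private transient Object expensiveState;   // whatever setup used to live in initialize()

    @Override
    public StructObjectInspector initialize(ObjectInspector[] argOIs)
            throws UDFArgumentException {
        // Only describe the output schema here; initialize() may run more than once.
        return ObjectInspectorFactory.getStandardStructObjectInspector(
            Arrays.asList("key"),
            Arrays.<ObjectInspector>asList(
                PrimitiveObjectInspectorFactory.javaStringObjectInspector));
    }

    @Override
    public void process(Object[] args) throws HiveException {
        if (expensiveState == null) {          // lazy, idempotent setup
            expensiveState = buildState();
        }
        forward(new Object[] { String.valueOf(args[0]) });
    }

    private Object buildState() {
        return new Object();                   // stand-in for the real one-time initialization
    }

    @Override
    public void close() throws HiveException {
        // nothing to clean up in this sketch
    }
}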

Mapreduce - Right way to confirm whether the file is split or not

We had a lot of XML files and wanted to process one XML file with one mapper task, for the obvious reason of making the processing (parsing) simpler.
We wrote a MapReduce program to achieve that by overriding the isSplitable method of the input format class. It seems to be working fine.
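For reference, such an override is essentially just this (a sketch assuming the newer org.apache.hadoop.mapreduce API and a TextInputFormat-based reader; the class name is illustrative, not our exact code):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeXmlInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // each XML file becomes exactly one split, hence one map task
    }
}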
However, we wanted to confirm that one mapper is used to process one XML file. Is there a way to confirm this by looking at the logs produced by the driver program, or in any other way?
Thanks
To answer your question: just check the mapper count.
It should be equal to your number of input files.
Example :
/ds/input
/file1.xml
/file2.xml
/file3.xml
Then the mapper count should be 3.
Here is the command.
mapred job -counter job_1449114544347_0001 org.apache.hadoop.mapreduce.JobCounter TOTAL_LAUNCHED_MAPS
You can get many details using the mapred job -counter command. You can check videos 54 and 55 from this playlist; they cover counters in detail.
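You can also read the same counter from the driver once the job finishes, roughly like this (this assumes a Job object named job and the new mapreduce API):

// Sketch: print how many map tasks were launched for the job.
job.waitForCompletion(true);
long launchedMaps = job.getCounters()
                       .findCounter(org.apache.hadoop.mapreduce.JobCounter.TOTAL_LAUNCHED_MAPS)
                       .getValue();
System.out.println("Launched map tasks: " + launchedMaps);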

How to get CPU load / RAM usage out of QNX?

I'm currently trying to get information about the CPU load and RAM usage out of a PowerPC running QNX. The idea is to write that information to a text file with a timestamp over a certain amount of time, but that part isn't my problem here once I have the information as a "standard value". My program will be in C++, and I already wrote this kind of program for Windows (via the PDH API). Maybe you know of a page like that, but for QNX? Probably I'm looking for the wrong keywords.
Can you help me with this problem? Any kind of direction would be most welcome, as I'm new to QNX and this kind of programming. Thanks a lot!
You will work with the /proc filesystem.
From the command line you can check the size of the memory space of the process with process ID 1234 by running:
ls -l /proc/1234/as
"as" stands for "address space", and the size of this virtual file gives a good estimate of the memory used by the process in question - 1236992 bytes in this example:
-rw-r--r-- 1 root root 1236992 Aug 21 21:25 as
To get the same value programmatically you will need to use the stat() function on the /proc/PID/as file.
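A minimal, untested sketch of that (the PID comes from the command line; error handling kept short):

#include <sys/stat.h>
#include <cstdio>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    char path[64];
    std::snprintf(path, sizeof(path), "/proc/%s/as", argv[1]);

    struct stat st;
    if (stat(path, &st) == -1) {            // size of the address-space file
        std::perror("stat");
        return 1;
    }
    std::printf("address space size: %lld bytes\n", (long long)st.st_size);
    return 0;
}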
You can refer to the following page in the documentation for a more detailed explanation:
http://www.qnx.com/developers/docs/660/index.jsp?topic=%2Fcom.qnx.doc.neutrino.cookbook%2Ftopic%2Fs3_procfs_pid_directories.html
In order to get the CPU time (system/user) used by the process, you can use the DCMD_PROC_INFO devctl() on the /proc/PID/as file. You will need to refer to the "utime" and "stime" members of the debug_process_t structure passed to the devctl().
You can find a detailed explanation and sample code on the following page in the QNX documentation:
http://www.qnx.com/developers/docs/660/index.jsp?topic=%2Fcom.qnx.doc.neutrino.cookbook%2Ftopic%2Fs3_procfs_DCMD_PROC_INFO.html
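And an untested sketch of the DCMD_PROC_INFO part (struct and field names follow the page above; check the docs for the exact units of utime and stime):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <devctl.h>
#include <sys/procfs.h>
#include <cstdio>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    char path[64];
    std::snprintf(path, sizeof(path), "/proc/%s/as", argv[1]);

    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        std::perror("open");
        return 1;
    }

    procfs_info info;                        // typedef of debug_process_t
    int rc = devctl(fd, DCMD_PROC_INFO, &info, sizeof(info), NULL);
    if (rc != EOK) {
        std::fprintf(stderr, "devctl(DCMD_PROC_INFO) failed: %d\n", rc);
        close(fd);
        return 1;
    }

    // utime / stime: CPU time consumed in user / system mode (see the linked docs for units).
    std::printf("user time:   %llu\n", (unsigned long long)info.utime);
    std::printf("system time: %llu\n", (unsigned long long)info.stime);

    close(fd);
    return 0;
}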
