Hive query: Is there a way to use UDTF with `cluster by`? - hadoop

Solved:
It turned out to be a mistake in my UDTF. I found a fix, but I don't quite understand why it works. When I first implemented the UDTF, Eclipse flagged initialize as deprecated, but I got an error when I left it out, so I implemented it anyway. I put a variable initialization in that method, assuming it would run only once. The jar worked in simpler scenarios, but when I fed the UDTF output into a UDF and then used the UDF output for something else, such as the cluster by cheat or an insert, I got the error described below. An engineer friend of mine found that initialize was actually executed more than once. So I moved the initialization into process, guarded by a check that initializes the variable only if it is null. After that everything worked, including my cheat. Still, if someone can give me an explanation, I would be most grateful.
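For readers who want to see the shape of that fix, here is a minimal sketch. It assumes nothing about the real UDTF: the class LazyInitUdtf, its fields, and its method signatures are hypothetical stand-ins that only mimic an initialize/process lifecycle, so the example runs without the Hive jars. It shows the null check in process, not why Hive calls initialize more than once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a UDTF; it mimics the initialize/process
// lifecycle without depending on Hive.
class LazyInitUdtf {
    // Counts how often the per-instance state was actually built.
    int buildCount = 0;

    // State that should be created exactly once per instance.
    private List<String> buffer;

    // May be invoked several times by the framework, so it only
    // validates its input and never allocates per-instance state.
    void initialize(int argCount) {
        if (argCount != 1) {
            throw new IllegalArgumentException("expected exactly one argument");
        }
    }

    // Builds the state on first use; later calls reuse it.
    List<String> process(String input) {
        if (buffer == null) {
            buffer = new ArrayList<>();
            buildCount++;
        }
        buffer.add(input);
        return buffer;
    }

    public static void main(String[] args) {
        LazyInitUdtf udtf = new LazyInitUdtf();
        udtf.initialize(1);
        udtf.initialize(1); // a repeated call is harmless
        udtf.process("a");
        udtf.process("b");
        System.out.println("state built " + udtf.buildCount + " time(s)");
    }
}
```

With this layout, repeated initialize calls cannot reset or duplicate the state, because only process touches it.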
Following is my original question:
I know I'm not supposed to use cluster by after a UDTF, so select myudtf("stringValue") cluster by rand() won't work.
But my UDTF outputs 7000+ rows every hour, and the number is growing, so I really need to distribute the subsequent processing across all the slave nodes of my Hadoop cluster.
I assume I can't get that without cluster by rand(), so I tried the following cheat:
First I wrapped the result in another table, select key from (select myudtf("stringValue") as key) t limit 1; which gives the correct result:
OK
some/key/value/string
Time taken: 0.035 seconds, Fetched: 1 row(s)
Then I added the cluster by part, select key from (select myudtf("stringValue") as key) t cluster by rand() limit 1, and got this error:
WARNING: Hive-on-MR is deprecated in Hive ...
....
Task with the most failures(4):
-----
Task ID:
task_....
URL:
http:....
....
-----
Diagnostic Messages for this Task:
Error: tried to access class sun.security.ssl.SSLSessionContextImpl from class sun.security.ssl.SSLSessionContextImplConstructorAccess
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I did this to trick Hive into treating the temporary table t as a "normal" table that I can apply cluster by to, hoping it would distribute the workload to all the Hadoop slaves, but unfortunately Hive is clever enough to see through my poor attempt.
So, could someone please help me clear up my misconceptions, or give me a hint about the correct way to do this?
FYI, I asked a highly experienced engineer at my company for help, and he thinks it may be a deeper system-level bug. He traced the problem for twenty-something minutes before he left work; he did find some library version issues but couldn't fix the problem. My guess is that I must have done something wrong.

Related

Oracle 12c startup error: ORA-00093: _shared_pool_reserved_min_alloc must be between 4000 and 0

We have a number of databases at our company. Among them is an Oracle 12c database (12.2.0.1.0 to be precise), but we have no (qualified) Oracle DBAs. The performance has deteriorated drastically in the last 6 months or so, and I now have the task of finding out why. My research suggested that I should raise some memory parameters in the initDBN.ora file. Here's what the original looked like:
DBN.__data_transfer_cache_size=0
DBN.__db_cache_size=50331648
DBN.__inmemory_ext_roarea=0
DBN.__inmemory_ext_rwarea=0
DBN.__java_pool_size=79691776
DBN.__large_pool_size=8388608
DBN.__oracle_base='/orabin/app/oracle'#ORACLE_BASE set from environment
DBN.__pga_aggregate_target=197132288
DBN.__sga_target=734003200
DBN.__shared_io_pool_size=12582912
DBN.__shared_pool_size=536870912
DBN.__streams_pool_size=4194304
*.audit_file_dest='/orabin/app/oracle/admin/tmf/adump'
*.audit_trail='db'
*.compatible='12.2.0'
*.control_files='/orabin/app/oracle/oradata/tmf/control01.ctl','/orabin/app/oracle/fast_recovery_area/tmf/control02.ctl'
*.db_16k_cache_size=8388608
*.db_32k_cache_size=8388608
*.db_4k_cache_size=8388608
*.db_block_size=8192
*.db_domain='ubs-hainer.com'
*.db_name='tmf'
*.db_recovery_file_dest='/orabin/app/oracle/fast_recovery_area/tmf'
*.db_recovery_file_dest_size=4096m
*.diagnostic_dest='/orabin/app/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=TMFXDB)'
*.local_listener='LISTENER_TMF'
*.memory_max_target=0
*.nls_language='GERMAN'
*.nls_territory='GERMANY'
*.open_cursors=300
*.pga_aggregate_target=188m
*.processes=300
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=700m
*.shared_pool_size=536870912
*.streams_pool_size=4194304
*.undo_tablespace='UNDOTBS1'
Please don't blame me for this, because I did not write it. It certainly doesn't look like the sample init.ora and I am not at all certain where the syntax came from. The values I changed were:
DBN.__sga_target=1024m
*.sga_target=1024m
*.memory_max_target=1408m
DBN.__pga_aggregate_target=384m and *.pga_aggregate_target=384m
That's the order in which I made the changes. After each change I used sqlplus to first recreate the spfile with:
create spfile='spfileDBN.ora' from pfile='initDBN.ora';
This was followed by an attempt to start the database with startup nomount. In each case I got an error message which led me to make the next change.
Finally I got the error which is in the title of this post. When I tried to search for information on this, the findings were grim. Mostly the information dealt with other parameters and did not explain what this error actually meant. The only thing that gave any real background was this link from Burleson Consulting. It didn't really help me solve the problem, so I decided to revert the initDBN.ora file and do some more research. A slow database is generally better than no database.
But hey! I still get that same error, even after reverting to the original init file. I'm getting desperate now. I have no idea how to fix this. From what I've read to date, setting "underscore variables" in your init file is a "NO NO".
Can anybody provide me with some helpful tips as to how to get rid of this error?
We don't know if the apps running on this database need specific block sizes, but if the priority is getting the database open, you can shrink the init.ora down to the smallest, simplest set of parameters that gets you moving forward.
*.audit_file_dest='/orabin/app/oracle/admin/tmf/adump'
*.audit_trail='db'
*.compatible='12.2.0'
*.control_files='/orabin/app/oracle/oradata/tmf/control01.ctl','/orabin/app/oracle/fast_recovery_area/tmf/control02.ctl'
*.db_block_size=8192
*.db_domain='ubs-hainer.com'
*.db_name='tmf'
*.db_recovery_file_dest='/orabin/app/oracle/fast_recovery_area/tmf'
*.db_recovery_file_dest_size=4096m
*.diagnostic_dest='/orabin/app/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=TMFXDB)'
*.local_listener='LISTENER_TMF'
*.nls_language='GERMAN'
*.nls_territory='GERMANY'
*.open_cursors=300
*.pga_aggregate_target=188m
*.processes=300
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=1000m
*.undo_tablespace='UNDOTBS1'
should get you an open database. Notice I have bumped up the sga_target to 1000m which is about the minimum you need to get a database started. The true values for sga_target and pga_aggregate_target really need to be set based on your expected usage, and the server configuration. But the init.ora above should get your database running.
I am not sure that this really qualifies as a "solution", but it does fix the initial problem. As mentioned in my reply to Connor McDonald, I set the parameter _shared_pool_reserved_min_alloc to 3000 in the initDBN.ora file, which I copied from Connor's example (thanks for that). After recreating the spfile and trying to restart the database, I got the following error:
ORA-00093: _shared_pool_reserved_min_alloc must be between 4000 and 11953766
That got me thinking that the value 0 in the original error was probably a stand-in value which really means "the maximum allowed". By actually setting the parameter, I have apparently managed to generate a more meaningful error.
The value of _shared_pool_reserved_min_alloc is now set to 4200, which is a value I recall reading in one of the less helpful posts. (No, that post did not say that this is a value that should be used, just that it could be used.) This time, after re-creating the spfile I was able to start the database.
Before I do any more fiddling with parameters, I will do a bit more research... or maybe a lot more.

Is there a way to parallelize spark.read.load(string*) when reading many files?

I noticed that in spark-shell (Spark 2.4.4), when I do a simple spark.read.format(xyz).load("a","b","c",...), it looks like Spark uses a single IPC client (or "thread") to load the files a, b, c, ... sequentially (they are HDFS paths).
Is this expected?
The reason I am asking is, for my case, I am trying to load 50K files, and the sequential load takes a long time.
Thanks
PS, I am trying to see it in the source code, but not sure if this is the one:
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L180
This might not be an exact "answer" to my original question, but I found the reason in my particular case: the name node's audit log showed some runaway jobs pegging the name node, which greatly slowed down the RPC calls. After killing these bad jobs, Spark's load speed improved greatly.

mclapply and spark_read_parquet

I am relatively new as an active user of the forum, but first I have to thank you all for your contributions, because I have been looking for answers here for years...
Today, I have a question that nobody has solved, or that I am not able to find an answer to...
I am trying to read files in parallel from S3 (AWS) into Spark (on my local computer) as part of a test system. I used mclapply, but when I set more than 1 core, it fails...
Example: (the same code works when using one core, but fails when using 2)
new_rdd_global <- mclapply(seq(file_paths), function(i){spark_read_parquet(sc, name=paste0("rdd_",i), path=file_paths[i])}, mc.cores = 1)
new_rdd_global <- mclapply(seq(file_paths), function(i){spark_read_parquet(sc, name=paste0("rdd_",i), path=file_paths[i])}, mc.cores = 2)
Warning message:
In mclapply(seq(file_paths), function(i) { :
all scheduled cores encountered errors in user code
Any suggestions?
Thanks in advance.
Just read everything into one table with a single spark_read_parquet() call; this way Spark handles the parallelization for you. If you need separate tables, you can split them afterwards, assuming there's a column that tells you which file the data came from. In general you shouldn't need to use mclapply() when using Spark with R.

TotalOrderPartitioner in Mapreduce example

I am trying to run the sample provided in the Alex Holmes book:
https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sort/total/TotalSortMapReduce.java
However, when I run the same program after packaging it as a jar, I get an exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.hadoop.mapred.lib.InputSampler.writePartitionFile(InputSampler.java:338)
at com.manning.hip.ch4.sort.total.TotalSortMapReduce.runSortJob(TotalSortMapReduce.java:44)
at com.manning.hip.ch4.sort.total.TotalSortMapReduce.main(TotalSortMapReduce.java:12)
Can someone please help me understand how to run the code? I have provided the following arguments:
args[0] --> the input path to names.txt (the file which needs to be sorted). It's in Hadoop.
args[1] --> the sample partition file which should be generated. A path in Hadoop.
args[2] --> the output directory where the sorted file should be generated.
Please tell me how I need to run this code.
The problem is probably that the input data file is very small, while in the code:
InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
you set maxSplitsSampled to 10; the constructor is RandomSampler<Text,Text>(double freq, int numSamples, int maxSplitsSampled).
You can solve the problem by setting that parameter to 1, or by making sure it is not larger than the number of splits in your input file.
So, I know this thread is more than 5 years old, but I came across the same issue just today and Mike's answer did not work for me. (I think by now hadoop internally also makes sure you don't exceed the number of available splits).
However, I found what's caused the issue for me and so I post this hoping that it will help anyone else whose google search led them to this truly ancient hadoop thread.
In my case, the problem was that the input file I specified had too few samples and my sampling frequency was too low. In that case it can happen (not every time, mind you, only sometimes, to really drive you insane) that you generate fewer samples than the number of reducers you specified. Every time that happened, my system crashed with this error message:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 9 out of bounds for length 9
at org.apache.hadoop.mapreduce.lib.partition.InputSampler.writePartitionFile(InputSampler.java:336)
at ...
In this case for example, only 9 samples were generated and I tried to employ more than 9 reducers.
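To make the failure mode concrete, here is a simplified model of how split points can be chosen from sorted samples, together with the guard that would have turned my crash into a readable error. It is a sketch written from memory of the general approach, not the Hadoop source: the class SplitPointModel and the method pickSplitPoints are hypothetical names, and plain strings stand in for the real key type. When there are fewer samples than partitions, the rounded indices repeat, the duplicate-skipping loop pushes the index forward, and it can step past the last sample, which is the ArrayIndexOutOfBoundsException above.

```java
import java.util.Arrays;

// Hypothetical, simplified model of choosing numPartitions - 1 split
// points from a set of samples. Not the actual Hadoop implementation.
class SplitPointModel {
    static String[] pickSplitPoints(String[] samples, int numPartitions) {
        // Guard: with fewer samples than partitions the index arithmetic
        // below can run past the end of the array.
        if (samples.length < numPartitions) {
            throw new IllegalArgumentException(
                "only " + samples.length + " samples for "
                + numPartitions + " partitions; sample more or use fewer reducers");
        }
        String[] sorted = samples.clone();
        Arrays.sort(sorted);

        String[] splits = new String[numPartitions - 1];
        float stepSize = sorted.length / (float) numPartitions;
        int last = -1;
        for (int i = 1; i < numPartitions; i++) {
            int k = Math.round(stepSize * i);
            // Move forward while the candidate equals the previous pick.
            while (last >= k && sorted[last].equals(sorted[k])) {
                k++;
            }
            splits[i - 1] = sorted[k];
            last = k;
        }
        return splits;
    }

    public static void main(String[] args) {
        String[] samples = {"d", "a", "c", "b"};
        System.out.println(Arrays.toString(pickSplitPoints(samples, 2)));
    }
}
```

In a real job the equivalent fix is on the configuration side: raise the sampling frequency or the number of samples, or lower the number of reducers, so that the sampler returns at least as many samples as there are partitions.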

Hadoop Load and Store

When I try to run a Pig script which has two "store" statements writing to the same file, like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
it hangs; I mean it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?
HDFS does not have an append mode. So in most cases where you are running map-reduce programs, the output file is opened once, data is written, and then the file is closed. Given this approach, you cannot write data simultaneously to the same file.
Try writing to separate files and check whether the map-reduce programs still hang. If they do, then there are some other issues.
You can obtain the result and map-reduce logs to analyze what went wrong.
[Edit:]
You cannot write to the same file or append to an existing file. The HDFS Append feature is a work in progress.
To work around this you can do one of two things:
1) If Alert_Message_Count and Warning_Message_Count have the same schema, you could use union as suggested by Chris.
2) If the schemas are not the same, do post-processing. That is, write a map-reduce program to merge the two separate outputs into one.
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess that this isn't possible either (seeing as Pig translates the commands into a series of M/R steps), but I would expect some form of error message rather than a hang.
If you open the cluster job tracker and look at the logs for the task, does the log yield anything of note which can help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';
