Setting queue name in pig v0.15 - hadoop

I am getting the exception below while trying to execute a Pig script from the shell.
JobId Alias Feature Message Outputs
job_1520637789949_340250 A,B,D,top_rec GROUP_BY Message: java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1520637789949_340250 to YARN : Application rejected by queue placement policy
I understand that this happens because the correct queue name is not set for the MR execution. To find out how to set a queue name for a MapReduce job, I looked through the help with pig --help, which listed the options below:
Apache Pig version 0.15.0-mapr-1611 (rexported)
compiled Dec 06 2016, 05:50:07
USAGE: Pig [options] [-] : Run interactively in grunt shell.
Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
Pig [options] [-f[ile]] file : Run cmds found in file.
options include:
-4, -log4jconf - Log4j configuration file, overrides log conf
-b, -brief - Brief logging (no timestamps)
-c, -check - Syntax check
-d, -debug - Debug level, INFO is default
-e, -execute - Commands to execute (within quotes)
-f, -file - Path to the script to execute
-g, -embedded - ScriptEngine classname or keyword for the ScriptEngine
-h, -help - Display this message. You can specify topic to get help for that topic.
properties is the only topic currently supported: -h properties.
-i, -version - Display version information
-l, -logfile - Path to client side log file; default is current working directory.
-m, -param_file - Path to the parameter file
-p, -param - Key value pair of the form param=val
-r, -dryrun - Produces script with substituted parameters. Script is not executed.
-t, -optimizer_off - Turn optimizations off. The following values are supported:
ConstantCalculator - Calculate constants at compile time
SplitFilter - Split filter conditions
PushUpFilter - Filter as early as possible
MergeFilter - Merge filter conditions
PushDownForeachFlatten - Join or explode as late as possible
LimitOptimizer - Limit as early as possible
ColumnMapKeyPrune - Remove unused data
AddForEach - Add ForEach to remove unneeded columns
MergeForEach - Merge adjacent ForEach
GroupByConstParallelSetter - Force parallel 1 for "group all" statement
PartitionFilterOptimizer - Pushdown partition filter conditions to loader implementing LoadMetaData
PredicatePushdownOptimizer - Pushdown filter predicates to loader implementing LoadPredicatePushDown
All - Disable all optimizations
All optimizations listed here are enabled by default. Optimization values are case insensitive.
-v, -verbose - Print all error messages to screen
-w, -warning - Turn warning logging on; also turns warning aggregation off
-x, -exectype - Set execution mode: local|mapreduce|tez, default is mapreduce.
-F, -stop_on_failure - Aborts execution on the first failed job; default is off
-M, -no_multiquery - Turn multiquery optimization off; default is on
-N, -no_fetch - Turn fetch optimization off; default is on
-P, -propertyFile - Path to property file
-printCmdDebug - Overrides anything else and prints the actual command used to run Pig, including
any environment variables that are set by the pig command.
18/03/30 13:03:05 INFO pig.Main: Pig script completed in 163 milliseconds (163 ms)
I tried pig -p mapreduce.job.queuename=my_queue and was able to start the grunt shell without any error.
However, the very first command threw the error below:
ERROR 2997: Encountered IOException. org.apache.pig.tools.parameters.ParseException: Encountered " <OTHER> ".job.queuename=my_queue "" at line 1, column 10.
Was expecting:
"=" ...
I am not sure whether I am doing this right.

To set the queue name in Pig 0.15, I found the options below (they may work for other versions too):
1) Pig can start the session with a queue name given on the command line.
Simply use the command below:
pig -Dmapreduce.job.queuename=my_queue
2) Another option is to set it in the grunt shell or in the Pig script itself.
set mapreduce.job.queuename my_queue;
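The -p attempt in the question fails because -p defines a script parameter for substitution, and the parameter parser stops at the first dot in mapreduce.job.queuename (column 10 in the error message). Hadoop properties go through -D instead. Below is a minimal Python sketch of assembling such a command line. build_pig_command and script.pig are hypothetical names used only for illustration, and putting the -D definitions ahead of the other Pig options is an assumption about the safest ordering.

```python
def build_pig_command(script_path, queue_name, extra_properties=None):
    """Return an argv list that runs a Pig script on a given YARN queue.

    Hypothetical helper: it only builds the command, it does not run Pig.
    """
    properties = {"mapreduce.job.queuename": queue_name}
    properties.update(extra_properties or {})

    argv = ["pig"]
    # Assumption: -D property definitions go before the other Pig options.
    argv.extend(f"-D{key}={value}" for key, value in properties.items())
    argv.extend(["-f", script_path])
    return argv


if __name__ == "__main__":
    # Prints the command line; pass the list to subprocess.run() to execute it.
    print(" ".join(build_pig_command("script.pig", "my_queue")))
```

Keeping the properties in a dictionary makes it easy to add further -D settings without touching the script itself.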

Related

'Wildcards' object has no attribute 'output'

I get an error for a rather simple rule. I have to write a task file for another program, expecting a tsv file. I read a certain number of parameters from my config file and write them to a file with a shell command.
Code:
rule create_tasks:
    output:
        temp("tasks_{sample}.tsv")
    params:
        ID="{sample}",
        file=lambda wc: samples["path"][wc.sample],
        bigwig=lambda wc: samples["bigwig"][wc.sample],
        ambig=lambda wc: samples["ambig"][wc.sample]
    shell:
        'echo -e "{params.ID}\t{params.file}" > {output}'
When I execute the workflow, I get the following error:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Job counts:
count jobs
1 create_tasks
1
[Mon Oct 12 14:48:15 2020]
rule create_tasks:
output: tasks_sampleA.tsv
jobid: 0
wildcards: sample=sampleA
echo -e "sampleA /Path/To/sampleA.bed " > tasks_sampleA.tsv
WorkflowError in line 23 of /path/to/workflow.snakefile:
'Wildcards' object has no attribute 'output'
File "/path/to/miniconda/envs/snakemake_submit/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 111, in run_jobs
File "/path/to/miniconda/envs/snakemake_submit/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 1233, in run
I should mention that two of the variables are empty and that I expect the tabs/whitespace in the echo command.
Does anybody have an explanation for why Snakemake is trying to find output in the wildcards? I am especially confused because it prints the correct command.
I've run into this same problem.
The issue is probably in how you invoked Snakemake from the command line.
For example, this was my Snakefile rule:
rule sort:
    input:
        "{file}.bam",
    output:
        "{file}.sorted.bam",
        "{file}.sorted.bai",
    shell:
        "sambamba sort {input}"
I don't even have params or wildcards explicitly anywhere in there.
But when I run it on my Slurm HPC I get the same error:
snakemake -j 10 -c "sbatch {cluster.params}" -u cluster.yaml
The Wildcards (note the capital "W") and params objects weren't from the rule.
They came from the cluster execution of the rule, and the error was thrown when trying to parse the cluster.yaml file.
There was no cluster parameter specification in my cluster.yaml file for the sort rule, so the error was thrown.
I fixed this by adding
sort:
    params: "..."
to my cluster.yaml file.
In your case, add cluster submission options under a create_tasks: ... entry.
You can also add a __default__: ... entry, which supplies the default submission parameters for any job whose rule has no entry of its own.
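The failure mode described above can be modelled in a few lines of Python. This is a simplified sketch and not Snakemake's actual implementation: render_submit_command and the example dictionaries are hypothetical, and the only assumption is that the per-rule entry is merged over __default__ and then exposed as attributes to the submit-command template.

```python
from types import SimpleNamespace


def resolve_cluster_params(cluster_config, rule_name):
    """Merge the __default__ entry with the rule-specific entry (rule wins)."""
    merged = dict(cluster_config.get("__default__", {}))
    merged.update(cluster_config.get(rule_name, {}))
    return merged


def render_submit_command(template, cluster_config, rule_name):
    """Fill a template such as 'sbatch {cluster.params}' for one rule."""
    cluster = SimpleNamespace(**resolve_cluster_params(cluster_config, rule_name))
    # A placeholder with no matching key raises AttributeError here,
    # analogous to the "'Wildcards' object has no attribute ..." message.
    return template.format(cluster=cluster)


if __name__ == "__main__":
    config = {"__default__": {"params": "--time=01:00:00"}}
    print(render_submit_command("sbatch {cluster.params}", config, "create_tasks"))
```

With a __default__ entry present, every rule resolves to some value, so the template can always be filled in.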

MapReduceIndexerTool output dir error "Cannot write parent of file"

I want to use Cloudera's MapReduceIndexerTool to understand how morphlines work. I created a basic morphline that just reads lines from the input file, and I tried to run the tool with this command:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphline.conf \
--output-dir hdfs:///hostname/dir/ \
--dry-run true
Hadoop is installed on the same machine where I run this command.
The error I'm getting is the following:
net.sourceforge.argparse4j.inf.ArgumentParserException: Cannot write parent of file: hdfs:/hostname/dir
at org.apache.solr.hadoop.PathArgumentType.verifyCanWriteParent(PathArgumentType.java:200)
The /dir directory has 777 permissions on it, so it is definitely allowed to write into it. I don't know what I should do to allow it to write into that output directory.
I'm new to HDFS and I don't know how I should approach this problem. Logs don't offer me any info about that.
What I have tried so far (with no result):
- created a hierarchy of 2 directories (/dir/dir2) and put 777 permissions on both of them
- changed the output-dir scheme from hdfs:///... to hdfs://... because all the examples in the --help menu are built that way, but this leads to an invalid scheme error
Thank you.
It states 'cannot write parent of file'. And the parent in your case is /. Take a look into the source:
private void verifyCanWriteParent(ArgumentParser parser, Path file) throws ArgumentParserException, IOException {
    Path parent = file.getParent();
    if (parent == null || !fs.exists(parent) || !fs.getFileStatus(parent).getPermission().getUserAction().implies(FsAction.WRITE)) {
        throw new ArgumentParserException("Cannot write parent of file: " + file, parser);
    }
}
The message prints file, which in your case is hdfs:/hostname/dir, so file.getParent() will be /.
Additionally, you can test the permissions with the hadoop fs command. For example, you can try to create a zero-length file in the path:
hadoop fs -touchz /test-file
I solved that problem after days of working on it.
The problem is with the line --output-dir hdfs:///hostname/dir/.
First of all, there should not be 3 slashes at the beginning, as I had put there during my repeated attempts to make this work; there are only 2 (as in any valid HDFS URI). I had used 3 slashes because otherwise the tool throws an invalid scheme exception! You can easily see in this code that the scheme check is done before the verifyCanWriteParent check.
I had taken the hostname by simply running the hostname command on the CentOS machine that I was running the tool on. This was the main issue. I analyzed the /etc/hosts file and saw that there are 2 hostnames for the same local IP. I took the second one and it worked. (I also attached the port to the hostname, so the final format is the following: --output-dir hdfs://correct_hostname:8020/path/to/file/from/hdfs)
This error is very confusing because everywhere you look for the namenode hostname, you will see the same thing that the hostname command returns. Moreover, the errors are not structured in a way that you can diagnose the problem and take a logical path to solve it.
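To see why the number of slashes matters, it can help to split the two URIs into their components. The sketch below uses Python's generic URI parser rather than Hadoop's own Path class, so it only illustrates the general URI rules; describe_hdfs_uri is a hypothetical helper name.

```python
from urllib.parse import urlparse


def describe_hdfs_uri(uri):
    """Split a URI into scheme, host, port and path using generic URI rules."""
    parts = urlparse(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,  # None when the authority part is empty
        "port": parts.port,      # None when no port is given
        "path": parts.path,
    }


if __name__ == "__main__":
    for uri in (
        "hdfs:///hostname/dir/",
        "hdfs://correct_hostname:8020/path/to/file/from/hdfs",
    ):
        print(uri, "->", describe_hdfs_uri(uri))
```

With three slashes the authority is empty and "hostname" becomes the first component of the path, while the two-slash form carries the host and port separately from the path.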
Additional information regarding this tool and debugging it
If you want to see the actual code that runs behind it, check the Cloudera version that you are running and select the same branch on the official repository. The master is not up to date.
If you just want to run this tool to play with the morphline (by using the --dry-run option) without connecting to Solr, you can't. You have to specify a ZooKeeper endpoint and a Solr collection or a Solr config directory, which takes additional research. This is something that could be improved in this tool.
You don't need to run the tool with -u hdfs, it works with a regular user.

How can I get DEBUG messages in the HAWQ log?

Are there any GUCs or commands with which I can get debug messages in the HAWQ log? Right now I only get ERROR or FATAL messages and no DEBUG messages. How can I print these DEBUG messages to the log file?
You can set the log_min_messages level in postgresql.conf in the HAWQ master data directory. The log level can take the following values, in order of decreasing detail:
# debug5
# debug4
# debug3
# debug2
# debug1
# info
# notice
# warning
# error
# log
# fatal
# panic
You need to restart the cluster if you change postgresql.conf. However, you can set the GUC log_min_messages in the psql session if you only want to log debug info within that session.
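The effect of the threshold can be sketched in Python. This is an illustrative model only: the ordering is copied from the list above, and is_logged is a hypothetical helper that assumes a message is written when its level sits at or after the configured log_min_messages in that ordering.

```python
# Ordering taken from the list above, from most to least detailed.
LEVELS = [
    "debug5", "debug4", "debug3", "debug2", "debug1",
    "info", "notice", "warning", "error", "log", "fatal", "panic",
]


def is_logged(message_level, log_min_messages):
    """Return True if a message of message_level passes the threshold."""
    message_rank = LEVELS.index(message_level.lower())
    threshold_rank = LEVELS.index(log_min_messages.lower())
    return message_rank >= threshold_rank


if __name__ == "__main__":
    for threshold in ("error", "debug1"):
        print(threshold, "->", is_logged("debug1", threshold))
```

Under this model a threshold of error hides DEBUG1 messages, and lowering it to debug1 lets them through.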
Different components of Apache HAWQ support different levels of debugging messages.
The overall supported levels are listed below. You may refer to https://github.com/apache/incubator-hawq/blob/master/src/include/utils/elog.h for details.
/* Error level codes */
Level Value
------------------
DEBUG5 10
DEBUG4 11
DEBUG3 12
DEBUG2 13
DEBUG1 14
LOG 15
COMMERROR 16
INFO 17
NOTICE 18
WARNING 19
ERROR 20
FATAL 21
PANIC 22
To get the DEBUG messages you want, check which debug levels the component you care about supports. Then, before running your query, use one of the settings below to get debug information:
either set the GUC persistently ("hawq config -c log_min_messages -v DEBUG_LEVEL" and then "hawq restart cluster -a")
or use session-level debugging ("set log_min_messages = DEBUG_LEVEL")
If you don't find enough log information even at the highest debug level, you can try adding it to the Apache HAWQ source code yourself.
The DEBUG you refer to may have two meanings. One is the DEBUG log level in the HAWQ code, which is answered by ztao1987; the other is where the output of your print functions goes when you debug using gdb/lldb.
The answer to the latter is also the master/segment log: HAWQ redirects stdout to the log file. For example, when you want to print a TupleTableSlot in lldb, just type "expr print_slot(yourslot)" and run tail -f your.log; the slot info will be printed on the screen.

hive output consists of these 2 warnings at the end. How do I suppress these 2 warnings

The output of a Hive query that uses UDFs ends with these 2 warnings. How do I suppress them? Please note that the 2 warnings come right after the query output, as part of the output.
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
hadoop version
Hadoop 2.6.0-cdh5.4.0
hive --version
Hive 1.1.0-cdh5.4.0
If you use beeline instead of Hive the error goes away. Not the best solution, but I'm planning to post to the CDH user group asking the same question to see if it's a bug that can be fixed.
This error occurs when the assembly jar is added; it contains classes from jcl-over-slf4j.jar (which is causing the stdout messages) and slf4j-log4j12.jar.
You can try a couple of things to begin with:
Try removing the assembly jar, if you are using it.
Look at the following link: https://issues.apache.org/jira/browse/HIVE-12179
It suggests that we can set a flag in Hive so that spark-assembly is loaded only if HIVE_ADD_SPARK_ASSEMBLY = "true".
https://community.hortonworks.com/questions/34311/warning-message-in-hive-output-after-upgrading-to.html :
There is also a workaround that avoids any other changes: manually remove the 2 lines from the end of the files using a shell script.
I tried setting HIVE_ADD_SPARK_ASSEMBLY=false, but it didn't work.
Finally, I found a question posted on the Cloudera community. See: https://community.cloudera.com/t5/Support-Questions/Warning-message-in-Hive-output-after-upgrading-to-hive/td-p/157141
You could try the following command; it works for me!
hive -S -d ns=$hiveDB -d tab=$t -d dunsCol=$c1 -d phase="$ph1" -d error=$c2 -d ts=$eColumnArray -d reporting_window=$rDate -f $dir'select_count.hsql' | grep -v "^WARN" > $gOutPut 2> /dev/null
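The grep -v "^WARN" step in that pipeline simply drops every line that starts with WARN. If the query output is post-processed in Python instead, the same filter might look like the sketch below; strip_warn_lines is a hypothetical helper and the sample text is made up for illustration.

```python
def strip_warn_lines(text):
    """Drop every line starting with 'WARN', like grep -v '^WARN'."""
    kept = [line for line in text.splitlines() if not line.startswith("WARN")]
    return "\n".join(kept)


if __name__ == "__main__":
    # Made-up query output followed by the two warnings from the question.
    sample = (
        "42\n"
        "WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.\n"
        "WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.\n"
    )
    print(strip_warn_lines(sample))
```

Like the grep version, this would also remove genuine result rows that happen to begin with WARN, so it is only safe when the data cannot start with that prefix.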

"Doesn't exist in RM" backend error in Pig

I'm getting an error in the Cloudera QuickStart VM I downloaded from http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html.
I am trying a toy example from Tom White's Hadoop: The Definitive Guide book called map_temp.pig, which "finds the maximum temperature by year".
I created a file called temps.txt that contains (year, temperature, quality) entries on each line:
1950 0 1
1950 22 1
1950 -11 1
1949 111 1
Using the example code in the book, I typed the following Pig code into the Grunt terminal:
records = LOAD '/home/cloudera/Desktop/temps.txt'
AS (year:chararray, temperature:int, quality:int);
DUMP records;
After I typed DUMP records;, I got the error:
2014-05-22 11:33:34,286 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias records. Backend error : org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1400775973236_0006' doesn't exist in RM.
…
Details at logfile: /home/cloudera/Desktop/pig_1400782722689.log
I attempted to find out what was causing the error through a Google search: https://www.google.com/search?q=%22application+with+id%22+%22doesn%27t+exist+in+RM%22.
The results there weren't helpful. For example, http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-troubleshoot-error-vpc.html mentioned this bug and said "To solve this problem, you must configure a VPC that includes a DHCP Options Set whose parameters are set to the following values..."
Amazon's suggested fix doesn't seem to apply because I'm not using AWS.
EDIT:
I think the HDFS file path is correct.
[cloudera@localhost Desktop]$ ls
Eclipse.desktop gnome-terminal.desktop max_temp.pig temps.txt
[cloudera@localhost Desktop]$ pwd
/home/cloudera/Desktop
There's another exception before your error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://localhost.localdomain:8020/home/cloudera/Desktop/temps.txt
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
Is your file in HDFS? Have you checked the file path?
I was able to solve this problem by starting the Grunt interpreter with pig -x local instead of just pig.
I should have used local mode because I did not have access to a Hadoop cluster.
Running in the default mapreduce mode gave me these errors:
2014-05-22 11:33:34,286 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias records. Backend error : org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1400775973236_0006' doesn't exist in RM.
2014-05-22 11:33:28,799 [JobControl] WARN org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:cloudera (auth:SIMPLE) cause:org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: hdfs://localhost.localdomain:8020/home/cloudera/Desktop/temps.txt
From http://pig.apache.org/docs/r0.9.1/start.html:
Pig has two execution modes or exectypes:
Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).
Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).
You can run Pig in either mode using the "pig" command (the bin/pig Perl script) or the "java" command (java -cp pig.jar ...).
Running the toy example from Tom White's Hadoop: The Definitive Guide book:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'temps.txt' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
against the following data set in temps.txt (remember that Pig's default input is tab-delimited files):
1950 0 1
1950 22 1
1950 -11 1
1949 111 1
gives this:
grunt> [cloudera@localhost Desktop]$ pig -x local -f max_temp.pig 2>log
(1949,111)
(1950,22)
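For readers less familiar with Pig Latin, the same filter, group and max steps can be mirrored in plain Python. This is only a sketch for comparison: max_temperature_by_year is a hypothetical function name, and the sample rows are the tab-delimited temps.txt contents shown above.

```python
import csv
import io

# The four sample rows from temps.txt, tab-delimited as Pig expects by default.
SAMPLE = "1950\t0\t1\n1950\t22\t1\n1950\t-11\t1\n1949\t111\t1\n"

# Quality codes accepted by the FILTER statement in max_temp.pig.
GOOD_QUALITY = {0, 1, 4, 5, 9}


def max_temperature_by_year(tsv_text):
    """Mirror the FILTER, GROUP BY year and MAX steps of max_temp.pig."""
    maxima = {}
    for year, temperature, quality in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        temperature = int(temperature)
        quality = int(quality)
        # Skip missing readings (9999) and rows with unaccepted quality codes.
        if temperature == 9999 or quality not in GOOD_QUALITY:
            continue
        maxima[year] = max(maxima.get(year, temperature), temperature)
    return maxima


if __name__ == "__main__":
    for year, temperature in sorted(max_temperature_by_year(SAMPLE).items()):
        print(f"({year},{temperature})")
```

Run on the sample rows, this yields the same two (year, maximum) pairs as the Pig script in local mode.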
