WEKA RandomForest model loading. Weka exception: No training file and no object input file given - shell

I am currently using two shell scripts to train and test a Random Forest model.
Training the model works with no problems at all. However, calling test.sh, which tries to load the saved model, generates the following error:
Weka exception: No training file and no object input file given.
General options:
-h or -help
Output help information.
-synopsis or -info
Output synopsis for classifier (use in conjunction with -h)
-t <name of training file>
Sets training file.
...
In train.sh I have:
java -Xmx1g -classpath $CLASSPATH:weka.jar weka.classifiers.trees.RandomForest \
-t "$FileNameFeaturesTrain_weka" \
-no-cv \
-I $numTrees -K $numFeat -S 0 \
-p 0 -distribution -d "$fileNameModel" > "$fname_output_pred"
In test.sh I have:
java -Xmx1g -classpath $CLASSPATH:weka.jar weka.classifiers.trees.RandomForest \
-l "$FileNameModel" \
-T "$FileNameFeaturesTest_weka" \
-no-cv \
-p 0 -distribution > "$fname_output_pred"
I don't understand why the model is not loading in my test.sh script. The same logic and flags work well with weka.classifiers.functions.Logistic, weka.classifiers.functions.MultilayerPerceptron, etc.; this error only happens with RandomForest. I am using WEKA 3.6.12.
I would appreciate any tips or comments.
Thank you,

Related

Bowtie2 error : " does not exist or is not a Bowtie 2 index"

I'm very new to bioinformatics and RNA-seq, so go easy on me. I keep getting the same error when I try to perform my analysis. For background: I have a folder called tutorial, which contains four other folders (the important ones for now are trimmed_fastq, bt2Index and insert_size). Using bowtie2-build, I created the bt2 files with the index prefix Zv9 in the bt2Index folder. Now I want to align my trimmed reads (in trimmed_fastq) to the genome.
Here's what I do and the error I get:
cd desktop/tutorial/insert_size
bowtie2 \
--minins 0 \
--maxins 1000 \
--very-fast \
-x $HOME/desktop/tutorial/bt2Index/Zv9 \
-1 $HOME/desktop/tutorial/trimmed_fastq/2cell1.trim.paired.fq \
-2 $HOME/desktop/tutorial/trimmed_fastq/2cell2.trim.paired.fq \
-S 2cells.sam
(ERR): "/Users/eag519/desktop/tutorial/bt2Index/Zv9" does not exist or is not a
Bowtie 2 index
Exiting now ...
I've tried writing it out in several different ways but can't solve it. Any ideas? Any help would be much appreciated. I'm a complete beginner.
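For reference, a Bowtie 2 index built with the prefix Zv9 should consist of six .bt2 files sharing that prefix; listing them with the same path passed to -x is a quick way to check they are where the command expects (paths taken from the command above):
# Lists the index files bowtie2 will look for under the -x prefix.
ls "$HOME/desktop/tutorial/bt2Index/"Zv9*
# Expected: Zv9.1.bt2  Zv9.2.bt2  Zv9.3.bt2  Zv9.4.bt2  Zv9.rev.1.bt2  Zv9.rev.2.bt2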

MarkDuplicates Picard

I am using Picard to mark only optical duplicates, for which I read the manual of MarkDuplicates. My script looks like this:
#!/usr/bin/bash
java -jar build/libs/picard.jar MarkDuplicates \
I=sorted.bam \
O=mark_opticalduplicate.bam \
MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 \
TAGGING_POLICY=OpticalOnly \
M=markedoptical_dup_metrics.txt
I am not sure I am getting only optical duplicates when I filter with the samtools flag 0x400.
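For reference, one way to check this (a sketch; it assumes the DT tag that MarkDuplicates writes when a TAGGING_POLICY is set, with SQ marking sequencing/optical duplicates):
# Counts every read carrying the duplicate flag (0x400), whatever kind of duplicate it is.
samtools view -c -f 0x400 mark_opticalduplicate.bam
# Counts only reads tagged as sequencing/optical duplicates (DT:Z:SQ).
samtools view mark_opticalduplicate.bam | grep -c 'DT:Z:SQ'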
Any suggestions at this point are highly appreciated.

GNU parallel read from several files

I am trying to use GNU parallel to convert individual files with a bioinformatic tool called vcf2maf.
My command looks something like this:
${parallel} --link "perl ${vcf2maf} --input-vcf ${1} \
--output-maf ${maf_dir}/${2}.maf \
--tumor-id ${3} \
--tmp-dir ${vcf_dir} \
--vep-path ${vep_script} \
--vep-data ${vep_data} \
--ref-fasta ${fasta} \
--filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}
VCF_files, results and tumor_ids contain one entry per line and correspond to one another.
When I try and run the command I get the following error for every file:
ERROR: Both input-vcf and output-maf must be defined!
This confused me, because if I run the command manually, the program works as intended, so I don't think the input/output paths are wrong. To confirm this, I also ran
${parallel} --link "cat ${1}" :::: ${VCF_files} ${results} ${tumor_ids},
which correctly prints the contents of the VCF files whose paths are listed in VCF_files.
I am really confused about what I did wrong; if anyone could help me out, I'd be very thankful!
Thanks!
For a command this long I would normally define a function:
doit() {
...
}
export -f doit
Then test this on a single input.
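A sketch of what such a function could look like here, built from the vcf2maf arguments in the question (the helper variables are assumed to be defined in the calling script):
# Hypothetical wrapper around the vcf2maf call from the question.
# $1 = input VCF, $2 = output name, $3 = tumor ID.
doit() {
  perl "$vcf2maf" --input-vcf "$1" \
    --output-maf "$maf_dir/$2.maf" \
    --tumor-id "$3" \
    --tmp-dir "$vcf_dir" \
    --vep-path "$vep_script" \
    --vep-data "$vep_data" \
    --ref-fasta "$fasta" \
    --filter-vcf "$filter_vcf"
}
# Export the function and every variable it uses, so the shells GNU Parallel starts can see them.
export -f doit
export vcf2maf maf_dir vcf_dir vep_script vep_data fasta filter_vcf
# Example single-input test (file names are illustrative):
doit sample1.vcf sample1 TUMOR01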
When it works:
parallel --link doit :::: ${VCF_files} ${results} ${tumor_ids}
But if you want to use a single command, it will look something like this:
${parallel} --link "perl ${vcf2maf} --input-vcf {1} \
--output-maf ${maf_dir}/{2}.maf \
--tumor-id {3} \
--tmp-dir ${vcf_dir} \
--vep-path ${vep_script} \
--vep-data ${vep_data} \
--ref-fasta ${fasta} \
--filter-vcf ${filter_vcf}" :::: ${VCF_files} ${results} ${tumor_ids}
GNU Parallel's replacement strings are {1}, {2}, and {3} - not ${1}, ${2}, and ${3}.
--dryrun is your friend when GNU Parallel does not do what you expect it to do.
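For example (file names are illustrative), --dryrun prints the fully substituted commands without running anything, so you can see what {1}, {2}, and {3} expand to:
# Prints one composed vcf2maf command per line of input instead of executing it.
parallel --dryrun --link \
  "perl vcf2maf.pl --input-vcf {1} --output-maf {2}.maf --tumor-id {3}" \
  :::: vcf_files.txt results.txt tumor_ids.txt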

snowsql not found from cron tab

I am trying to execute snowsql from a shell script which I have scheduled with a cron job, but I am getting an error: snowsql: command not found.
I went through many links where they suggest giving the full path to the snowsql binary. I tried that as well, but no luck.
https://support.snowflake.net/s/question/0D50Z00007ZBOZnSAP/snowsql-through-shell-script. Below is my code snippet, abc.sh:
#!/bin/bash
set -x
snowsql --config /home/basant.jain/snowsql_config.conf \
-D cust_name=mean \
-D feed_nm=lbl \
-o exit_on_error=true \
-o timing=false \
-o friendly=false \
-o output_format=csv \
-o header=false \
-o variable_substitution=True \
-q 'select count(*) from table_name'
And my crontab looks like this:
*/1 * * * * /home/basant.jain/abc.sh
Cron doesn't set PATH like your login shell does.
As you already wrote in your question, you could specify the full path of snowsql, e.g.
#!/bin/bash
/path/to/snowsql --config /home/basant.jain/snowsql_config.conf \
...
Note: /path/to/snowsql is only an example. Of course you should find out the real path of snowsql, e.g. using type snowsql.
Or you can try to source /etc/profile. Maybe this will set up PATH for calling snowsql.
#!/bin/bash
. /etc/profile
snowsql --config /home/basant.jain/snowsql_config.conf \
...
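A third variant in the same spirit (a sketch; the added directory is an assumption, use whatever type snowsql reports on your machine) is to extend PATH at the top of abc.sh:
#!/bin/bash
# Assumed install directory for snowsql; replace it with the directory
# reported by: type snowsql
export PATH="$PATH:$HOME/bin"
snowsql --config /home/basant.jain/snowsql_config.conf \
        -q 'select count(*) from table_name'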
See How to get CRON to call in the correct PATHs.

How to set PIG_HEAPSIZE value in a shell script triggering a pig job

Also, what is the maximum value that can be set for this? Please let me know of any preconditions I need to consider while setting this flag.
Thanks!
I configured the value PIG_HEAPSIZE=6144 just before the pig command, as shown below:
PIG_HEAPSIZE=6144 pig \
-logfile ${pig_log_file} \
-p ENV=${etl_env} \
-p OUTPUT_PATH=${pad_output_dir} \
${pig_script} >> $log_file 2>&1;
And it worked!
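Equivalently (a sketch of the same idea), the variable can be exported earlier in the wrapper script so that every pig invocation in it inherits the setting:
# PIG_HEAPSIZE is read by the pig launcher script and used as the Java heap size
# (in MB) for the client JVM it starts.
export PIG_HEAPSIZE=6144
pig -logfile "${pig_log_file}" \
    -p ENV="${etl_env}" \
    -p OUTPUT_PATH="${pad_output_dir}" \
    "${pig_script}" >> "$log_file" 2>&1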
