I have a two-column, tab-delimited file that contains the input for a command.
The input file looks like this:
2795.bam 2865.bam
2825.bam 2865.bam
2794.bam 2864.bam
The command line is:
macs2 callpeak -t trt.bam -c ctrl.bam -n Macs.name.bam --gsize hs --nomodel
where trt.bam stands for the file names in column 1 and ctrl.bam for the file names in column 2.
What I am trying to do is read these values from the input file and run the command with them.
To achieve this I am doing the following:
cat temp | awk '{print $1 "\t" $2 }' | macs2 callpeak -t $1 -c $2 -n Macs.$1 --gsize hs --nomodel
This is failing. The error that I get is:
usage: macs2 callpeak [-h] -t TFILE [TFILE ...] [-c [CFILE [CFILE ...]]]
[-f {AUTO,BAM,SAM,BED,ELAND,ELANDMULTI,ELANDEXPORT,BOWTIE,BAMPE,BEDPE}]
[-g GSIZE] [--keep-dup KEEPDUPLICATES]
[--buffer-size BUFFER_SIZE] [--outdir OUTDIR] [-n NAME]
[-B] [--verbose VERBOSE] [--trackline] [--SPMR]
[-s TSIZE] [--bw BW] [-m MFOLD MFOLD] [--fix-bimodal]
[--nomodel] [--shift SHIFT] [--extsize EXTSIZE]
[-q QVALUE | -p PVALUE] [--to-large] [--ratio RATIO]
[--down-sample] [--seed SEED] [--tempdir TEMPDIR]
[--nolambda] [--slocal SMALLLOCAL] [--llocal LARGELOCAL]
[--broad] [--broad-cutoff BROADCUTOFF]
[--cutoff-analysis] [--call-summits]
[--fe-cutoff FECUTOFF]
macs2 callpeak: error: argument -t/--treatment: expected at least one argument
In an ideal situation, it should take inputs like this:
macs2 callpeak -t 2795.bam -c 2865.bam -n Macs.2795 --gsize hs --nomodel
MACS2 is standalone software that runs on Linux. In the present situation, the software is failing to read the input from the file.
Any inputs are deeply appreciated.
I believe what you want to achieve is a loop over all lines in your input file. In bash, you can achieve this as follows:
while read -r tfile cfile; do
macs2 callpeak -t "$tfile" -c "$cfile" -n "Macs.$tfile" --gsize hs --nomodel
done < "input_file.txt"
See: https://mywiki.wooledge.org/BashFAQ/001 (cf. Sundeep's comment)
Original answer:
while read -a a; do
macs2 callpeak -t "${a[0]}" -c "${a[1]}" -n "Macs.${a[0]}" --gsize hs --nomodel
done < "input_file.txt"
This reads the input file input_file.txt line by line and stores each line in a bash array named a via read -a a. From that point on, you run your command with the variables ${a[0]} and ${a[1]}.
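To see what read -a does with one of your lines, a quick check at the prompt:
$ read -a a <<< "2795.bam 2865.bam"
$ echo "${a[0]} ${a[1]}"
2795.bam 2865.bam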
I am trying to assign values obtained by awk from a two-column txt file to a command that takes the two values of each line as two variables.
For example, the file I use is:
foo.txt
10 20
33 40
65 78
My command is aiming to print:
end=20 start=10
end=40 start=33
end=78 start=65
Basically, I want to iterate the code for every line; for each line there will be two output variables taken from the two columns of the input file.
I am not an awk expert (I am trying my best); what I have come up with so far is this fusion:
while read -r line ; do awk '{ second_variable=$2 ; first_variable=$1 ; }' ; echo "end=$first_name start=$second_name"; done <foo.txt
but it only gives this output:
end= start=
only once, without any variable values. I would appreciate any suggestion. Thank you.
In bash you only need while, read and printf:
while read -r start end
do printf 'end=%d start=%d\n' "$end" "$start"
done < foo.txt
end=20 start=10
end=40 start=33
end=78 start=65
With awk, you could do:
awk '{print "end=" $2, "start=" $1}' foo.txt
end=20 start=10
end=40 start=33
end=78 start=65
With sed you'd use regular expressions:
sed -E 's/([0-9]+) ([0-9]+)/end=\2 start=\1/' foo.txt
end=20 start=10
end=40 start=33
end=78 start=65
Just in Bash:
while read -r start end; do echo "end=$end start=$start"; done <foo.txt
What about using xargs?
xargs -n2 sh -c 'echo end=$2 start=$1' sh < foo.txt
Demo
xargs -n2 sh -c 'echo end=$2 start=$1' sh <<INPUT
10 20
33 40
65 78
INPUT
Output
end=20 start=10
end=40 start=33
end=78 start=65
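Note that the trailing sh is deliberate: with sh -c 'script' name args..., the word after the script becomes $0 inside the inline script, so the two data words supplied by xargs land in $1 and $2.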
Updated question:
I have a config.file in which I define a few variables that are ultimately called in a different script.
$ cat config.file
#1 Accession number ref
ref=L41223.2
#2 Accession number SRA
SRA=SRA7361534
#3 Path to SRA
path_SRA='/Volumes/5TB/sra/'
#4 Path to ref
path_ref='/Volumes/5TB/results/species1/'
Variable #3 (the path to the SRA) is constant and never changes. The other variables ($ref, $SRA and $path_ref) I would like to read one by one from the different fields of an input.file:
$ cat input.file
species1 L41223.2 SRA7361534
species2 D45023.5 SRA9473231
species3 L42823.6 SRA0918881
...
All these variables are called several times in a script.sh:
#!/bin/bash
# Path to the configuration file
. /Users/Main/config.file
# Use NCBI's e-utilities to download reference files
esearch -db nucleotide -query $ref | efetch -format fasta > $path_ref$ref.fasta
# Using NCBI's sratoolkit to download SRA file
prefetch $SRA
cd $path_SRA
mv *.sra $path_ref
# Decompress the SRA file
cd $path_ref; if fastq-dump --split-3 $SRA.sra ; then
echo "SRA file successfully decompressed. Deleting the SRA file now..."
rm $SRA.sra
else
echo "Could not decompress SRA file"
fi
# Use bwa to align DNA reads to the reference sequence
cd $path_ref;
bwa index -p INDEX $ref.fasta
bwa aln -t $core INDEX *_1.fastq > 1.sai
bwa aln -t $core INDEX *_2.fastq > 2.sai
bwa sampe INDEX 1.sai 2.sai *_1.fastq *_2.fastq | samtools view -hq 5 > $SRA.Q5.sam
# Use samtools for conversion
samtools view -bT $ref.fasta $SRA.Q5.sam > $SRA.Q5.bam
samtools sort $SRA.Q5.bam -o $SRA.sorted
# use bedtools for coverage
bedtools genomecov -d -ibam $SRA.sorted.bam > $SRA.gencov.txt
# use awk for extraction
awk '$2 ~ /81|161|97|145/ {print $0}' $SRA.Q5.sam > $SRA.OTW.sam
samtools view -bT $ref.fasta $SRA.OTW.sam > $SRA.OTW.bam
samtools sort $SRA.OTW.bam -o $SRA.OTW.sorted.bam
# Extract FLAG, POS, CIGAR and TLEN for outward-oriented reads
awk '$2 ~ /81|161|97|145/ {print $2, $4, $6, $9}' $SRA.Q5.sam > $SRA.OTW.txt
# Get per-base coverage for outward-oriented reads
bedtools genomecov -d -ibam $SRA.OTW.sorted.bam > $SRA.OTW.gencoverage.txt
# Simplify the output by averaging read coverage over 50 bp window; prints the average count value and last genomic position
awk '{sum+=$3; count++} FNR % 50 == 0 {print $2, (sum/count); count=sum = ""}' $SRA.OTW.gencoverage.txt > $SRA.OTW.50sum.txt
#### End of the script
What I would like to do is "read" from input.file into config.file. The first field (species1, ...) would be used as input for $path_ref, the second field (L41223.2, ...) for $ref, and the third field (SRA7361534, ...) for $SRA. Once the first round (basically the first line) is done, script.sh would run again and read fields 1, 2 and 3 from line 2, and so on. Basically a loop, but somewhat more complicated than the answer below, because the different variables are called at different places in the script.
This works fine for one variable; however, I couldn't extend it to three different variables called throughout the script:
while read -r c1 c2 c3; do
bwa index -p INDEX ${c2}.fasta
# place rest of your script here
done < input.file
Many thanks in advance.
In script.sh, after the line . /Users/Main/config.file, add these lines:
number_of_inputs=$(wc -l < input.file)
for (( i=1 ; i <= number_of_inputs ; i++ )); do
# extract the fields of line $i: field 2 is ref, field 3 is SRA, field 1 is the species name used to build path_ref - adjust the base path as needed
ref=$( awk "NR==$i{print \$2}" input.file)
SRA=$( awk "NR==$i{print \$3}" input.file)
path_ref="/Volumes/5TB/results/$(awk "NR==$i{print \$1}" input.file)/"
Then add a done at the end of the file, so the whole thing loops over the values in each line of input.file, setting the variables accordingly.
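If you would rather read each line only once instead of running awk three times per line, a while read loop does the same job. A sketch, assuming the base path '/Volumes/5TB/results/' from the example config.file:
. /Users/Main/config.file
while read -r species ref SRA; do
    path_ref="/Volumes/5TB/results/${species}/"
    # ... body of script.sh goes here, using $ref, $SRA and $path_ref ...
done < input.file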
The following is the output of cat /proc/softirqs:
CPU0 CPU1 CPU2 CPU3
HI: 24 13 7 54
TIMER: 344095632 253285150 121234786 108207697
NET_TX: 2366955 319 695 316044
NET_RX: 16337920 16030558 250497436 117201444
BLOCK: 19631 2747 2353 5067051
BLOCK_IOPOLL: 0 0 0 0
TASKLET: 298 93 157 20965
SCHED: 74354472 28133393 30646119 26217748
HRTIMER: 4123645358 2409060621 2466360502 401470590
RCU: 26083738 17708780 15330534 16857905
Another of my machines has 24 CPU cores, and there the output is hard to read.
I would like the output to contain only CPU0, CPU2, CPU4, CPU6, ...
I know cut or awk might be used to do that,
but I have no idea how to use them to get only the even-numbered output columns.
Edit:
awk -F" " '{printf("%10s\t%s\n", $2,$4) }'
will get
CPU1 CPU3
24 7
344095632 121234786
2366955 695
16337920 250497436
19631 2353
0 0
298 157
74354472 30646119
4123645358 2466360502
26083738 15330534
Unfortunately, CPU1 should be CPU0 and CPU3 should be CPU2,
and the first line has only 4 columns. Can I skip the first line
in this shell?
Edit 2:
watch -d "cat /proc/softirqs | awk -F" " '{printf("%10s\t%s\n",$2,$4)}' "
I encounter errors like the following:
Every 2.0s: cat /proc/softirqs | awk -F '{print }' Tue Jun 21 10:23:22 2016
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options: GNU long options: (standard)
-f progfile --file=progfile
-F fs --field-separator=fs
-v var=val --assign=var=val
Short options: GNU long options: (extensions)
-b --characters-as-bytes
-c --traditional
-C --copyright
-d[file] --dump-variables[=file]
-e 'program-text' --source='program-text'
-E file --exec=file
-g --gen-pot
-h --help
-L [fatal] --lint[=fatal]
-n --non-decimal-data
-N --use-lc-numeric
-O --optimize
-p[file] --profile[=file]
-P --posix
-r --re-interval
-S --sandbox
-t --lint-old
-V --version
To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.
gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.
Examples:
gawk '{ sum += $1 }; END { print sum }' file
gawk -F: '{ print $1 }' /etc/passwd
What else should I try?
Edit 3:
The final working shell script looks like this:
# define function encapsulating code; this prevents any need for extra layers of quoting
# or escaping.
run() {
awk 'NR>1{printf("%20s\t%10s\t%s\n",$1,$2,$4)}' </proc/softirqs|egrep 'TIMER|RX'
}
# export function
export -f run
# run function in subshell of watch, ensuring that that shell is bash
# (other shells may not honor exported functions)
watch -d "bash -c run"
One easy way to communicate code to a subprocess of watch that avoids escaping errors is to use an exported function:
# define function encapsulating code; this prevents any need for extra layers of quoting
# or escaping.
run() {
awk -F" " '{printf("%10s\t%s\n",$2,$4)}' </proc/softirqs
}
# export function
export -f run
# run function in subshell of watch, ensuring that that shell is bash
# (other shells may not honor exported functions)
watch "bash -c run"
To avoid the dependency on exported functions, one can also use declare -f to retrieve the function's source in an evalable form, and printf %q to escape it to survive processing by the outer shell invoked by watch:
run() {
awk -F" " '{printf("%10s\t%s\n",$2,$4)}' </proc/softirqs
}
printf -v run_str '%q' "$(declare -f run); run"
watch "bash -c $run_str"
To skip the first line, do:
awk -F" " 'NR>1{printf("%10s\t%s\n", $2,$4) }'
Why you need -F" " is a mystery to me. You can just as well write:
awk 'NR>1{printf("%10s\t%s\n", $2,$4) }'
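For the 24-core machine, rather than listing $2, $4, ... by hand, you can loop over every other field, which scales to any core count. A sketch (on the data lines field 1 is the softirq name and fields 2, 4, 6, ... are CPU0, CPU2, CPU4, ...; NR>1 drops the header row, which is offset by one column):
awk 'NR>1{out=$1; for(i=2;i<=NF;i+=2) out=out "\t" $i; print out}' /proc/softirqs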
(As for the watch part, see the other answers.)
I am currently building a bash script for a class, and I am trying to use the grep command to grab the values from a simple calculator program and store them in the variables I assign, but I keep receiving a syntax error message when I try to run the script. Any advice on how to fix it? My script looks like this:
#!/bin/bash
addanwser=$(grep -o "num1 + num2" Lab9 -a 5 2)
echo "addanwser"
subanwser=$(grep -o "num1 - num2" Lab9 -s 10 15)
echo "subanwser"
multianwser=$(grep -o "num1 * num2" Lab9 -m 3 10)
echo "multianwser"
divanwser=$(grep -o "num1 / num2" Lab9 -d 100 4)
echo "divanwser"
modanwser=$(grep -o "num1 % num2" Lab9 -r 300 7)
echo "modawser"`
You want to grep the output of a command.
grep reads from either a file or standard input, so you can use any of these equivalent forms:
grep X file # 1. from a file
... things ... | grep X # 2. from stdin
grep X <<< "content" # 3. using here-strings
For this case, you want to use the last one, so that you execute the program and its output feeds grep directly:
grep <something> <<< "$(Lab9 -s 10 15)"
Which is the same as saying:
Lab9 -s 10 15 | grep <something>
So grep acts on the output of your program. Since I don't know how Lab9 works, let's use a simple example with seq, which prints the numbers from 5 to 15:
$ grep 5 <<< "$(seq 5 15)"
5
15
grep is usually used for finding matching lines in a text file. To actually grab a part of the matched line, other tools such as awk are used.
Assuming the output looks like "num1 + num2 = 54" (i.e. fields are separated by space), this should do your job:
addanwser=$(Lab9 -a 5 2 | awk '{print $NF}')
echo "$addanwser"
Make sure you don't miss the '$' sign before addanwser when echo'ing it.
$NF selects the last field. You may select nth field using $n.
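For example, assuming the output line really is "num1 + num2 = 54" (hypothetical, since I don't know Lab9's exact format):
echo "num1 + num2 = 54" | awk '{print $NF}'
54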
In the sections below, you'll see the shell script I am trying to run on a UNIX machine, along with a transcript.
When I run this program, it gives the expected output but it also gives an error shown in the transcript. What could be the problem and how can I fix it?
First, the script:
#!/usr/bin/bash
while read A B C D E F
do
E=`echo $E | cut -f 1 -d "%"`
if test $# -eq 2
then
I=`echo $2`
else
I=90
fi
if test $E -ge $I
then
echo $F
fi
done
And the transcript of running it:
$ df -k | ./filter.sh -c 50
./filter.sh: line 12: test: capacity: integer expression expected
/etc/svc/volatile
/var/run
/home/ug
/home/pg
/home/staff/t
/packages/turnin
$ _
Before the line that says:
if test $E -ge $I
temporarily place the line:
echo "[$E]"
and you'll find something very much non-numeric, and that's because the output of df -k looks like this:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb1 954316620 212723892 693109608 24% /
udev 10240 0 10240 0% /dev
: :
The offending line there is the first, which will have its fifth field Use% turned into Use, which is definitely not an integer.
A quick fix may be to change your usage to something like:
df -k | sed -n '2,$p' | ./filter.sh -c 50
or:
df -k | tail -n+2 | ./filter.sh -c 50
Either of those extra filters (sed or tail) will print only from line 2 onwards.
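Alternatively, you could discard the header inside the script itself by consuming one line before the loop. A sketch of filter.sh with that change (quoting tightened a little, logic otherwise as in the question):
#!/usr/bin/bash
read -r _   # consume and discard the df header line
while read -r A B C D E F
do
    E=$(echo "$E" | cut -f 1 -d "%")
    if test $# -eq 2
    then
        I=$2
    else
        I=90
    fi
    if test "$E" -ge "$I"
    then
        echo "$F"
    fi
done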
If you're open to not needing a special script at all, you could probably just get away with something like:
df -k | awk -vlimit=40 '$5+0>=limit&&NR>1{print $5" "$6}'
The way it works is to only operate on lines where both:
the fifth field, converted to a number, is at least equal to the limit passed in with -v; and
the record number (line) is two or greater.
Then it simply outputs the relevant information for those matching lines.
This particular example outputs the file system and usage (as a percentage like 42%) but, if you want only the file system, as per your script, change the print to output $6 on its own: {print $6}.
Alternatively, if you want the percentage but without the %, you can use the same method I used in the conditional: {print $5+0" "$6}.