write a value to specific row and column using shell - bash

I am very new to shell scripting. The task at hand is divided into two shell scripts.
I want to pipeline the two shell scripts (they should run from a directory other than the one the scripts live in), and that part is presently working well.
The first shell script:
Combines around 90 .lvm files stored inside a folder.
Crops each .lvm file: removes the header and keeps everything up to the end of the data.
Now I need to print a value in the 18th column once each file has been iterated over, to mark the end of that file (here I am trying to write 500).
#!/bin/sh
clear
for file in "$1/"*.lvm; do
a=$(awk '/X_Value/{ print NR; exit }' "$file")
b=$(awk 'END {print NR}' "$file")
awk '{OFS= "\t"} {NR==$b $18,"500"}' "$file"
#specified row is $b and column number is 18
sed "s|\$a|${b}|" "$file"
done
The second shell script:
Reads specific columns from the output of the first shell script:
#!/bin/sh
clear
while read line; do
sleep 1
awk -v OFS='\t' '{print $1, $2, $3, $8, $18}'
done
Output now:
file1 196287,265000 3,902977 -39,226354 0,873427
file1 196287,266000 3,890747 -51,032699 0,519405
file1 196287,267000 3,900080 -51,472975 -0,446108
....
....
....
file2 196287,268000 3,904586 -50,627182 -0,092086
file2 196287,269000 3,870793 -30,687314 1,195265
file2 196287,270000 3,897505 -30,073244 0,744692
....
....
Desired Output:
file1 196287,265000 3,902977 -39,226354 0,873427 0
file1 196287,266000 3,890747 -51,032699 0,519405 0
file1 196287,267000 3,900080 -51,472975 -0,446108 500
...
...
...
file2 196287,268000 3,904586 -50,627182 -0,092086 0
file2 196287,269000 3,870793 -30,687314 1,195265 0
file2 196287,270000 3,897505 -30,073244 0,744692 500
...
...

I do not like awk. I do not use awk. Mock me if you must.
If bash can't handle something, I rewrite it in perl. =o)
That said - here's a clumsy all-bash version.
Many improvements to be made, but work keeps interrupting my fun, lol
nl="
"
for f in "$1/"*.lvm
do typeset out=''
while read a b c d e
do out="$out$( printf "%-25s%14s%14s%14s%14s%14s" $f $a $b $c $d $e )$nl"
done <<< "$(cut -f 1,2,3,8,18 -d ' ' $f )"
stack="$stack$nl$(printf "${out% *} 500$nl")"
done
echo "$stack"|grep -v '^$'

Related

bash for loop only returns the last value x times (x = length of array)

I have a file with IDs such as below:
A
D
E
And I have a second file with the same IDs and extra info that I need:
A 50 G25T1 7.24 298
B 20 G234T2 8.3 80
C 5 G1I1 5.2 909
D 500 G458T3 0.4 79
E 321 G46I2 45.8 901
I want to output the third column of the second file, selecting rows whose first column matches the IDs from the first file:
G25T1
G458T3
G46I2
The issue I have is that while the for loop runs, the output is as follows:
G46I2
G46I2
G46I2
Here is my code:
a=0; IFS=$'\r\n' command eval 'ids=($(awk '{print$1}' shared_single_copies.txt | sed -e 's/[[:space:]]//g'))'; for id in "${ids[@]}"; do a=$(($a+1)); echo $a' '"$id"; awk '{$1=="${id}"} END {print $3}' run_Busco_A1/A1_single_copy_ids.txt >> A1_genes_sc_Buscos.txt; done
Your code is way too complicated. Try one of these solutions: "file1" contains the ids, "file2" contains the extra info:
$ join -o 2.3 file1 file2
G25T1
G458T3
G46I2
$ awk 'NR==FNR {id[$1]; next} $1 in id {print $3}' file1 file2
G25T1
G458T3
G46I2
For more help about join, check the man page.
For more help about awk, start with the awk info page.
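One detail worth adding: join expects both inputs to be sorted on the join field. If yours are not already sorted, a small sketch with process substitution sorts them on the fly:

# join needs both inputs sorted on the join field (field 1 here)
join -o 2.3 <(sort file1) <(sort file2)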
@glenn jackman's answer was by far the most succinct and elegant, IMO. If you want to use loops, though, then this can work:
#!/bin/bash
# if output file already exists, clear it so we don't
# inadvertently duplicate data:
> A1_genes_sc_Buscos.txt
while read -r selector
do
while read -r c1 c2 c3 garbage
do
[[ "$c1" = "$selector" ]] && echo "$c3" >> A1_genes_sc_Buscos.txt
done < run_Busco_A1/A1_single_copy_ids.txt
done < shared_single_copies.txt
That should work for your use case, provided the formatting of your real files matches what you gave as input.

Bash script to print X lines of a file in sequence

I'd be very grateful for your help with something probably quite simple.
I have a table (table2.txt), which has a single column of randomly generated numbers, and is about a million lines long.
2655087
3721239
5728533
9082076
2016819
8983893
9446748
6607974
I want to create a loop that repeats 10,000 times, so that for iteration 1, I print lines 1 to 4 to a file (file0.txt), for iteration 2, I print lines 5 to 8 (file1.txt), and so on.
What I have so far is this:
#!/bin/bash
for i in {0..10000}
do
awk 'NR==((4 * "$i") +1)' table2.txt > file"$i".txt
awk 'NR==((4 * "$i") +2)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +3)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +4)' table2.txt >> file"$i".txt
done
Desired output for file0.txt:
2655087
3721239
5728533
9082076
Desired output for file1.txt:
2016819
8983893
9446748
6607974
Something is going wrong with this, because I am getting identical outputs from all my files (i.e. they all look like the desired output of file0.txt). Hopefully you can see from my script that during the second iteration, i.e. when i=1, I want the output to be the values of rows 5, 6, 7 and 8.
This is probably a very simple syntax error, and I would be grateful if you can tell me where I'm going wrong (or give me a less cumbersome solution!)
Thank you very much.
The beauty of awk is that you can do this in a single awk call:
awk '{ print > ("file"c".txt") }
(NR % 4 == 0) { ++c }
(c == 10001) { exit }' <file>
This can be slightly more optimized and file handling friendly (cfr. James Brown):
awk 'BEGIN{f="file0.txt" }
{ print > f }
(NR % 4 == 0) { close(f); f="file"++c".txt" }
(c == 10001) { exit }' <file>
Why did your script fail?
Your script fails because you used single quotes and tried to reference a shell variable inside them. Your lines should read:
awk 'NR==((4 * '$i') +1)' table2.txt > file"$i".txt
but this is very ugly and should be improved with
awk -v i=$i 'NR==(4*i+1)' table2.txt > file"$i".txt
Why is your script slow?
The way you are processing your file is a loop of 10001 iterations. Per iteration, you perform 4 awk calls. Each awk call reads the full file and writes out a single line. So in the end you read your file 40004 times.
To optimise your script step by step, I would do the following:
Terminate awk to stop reading the file after the line is printed:
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i 'NR==(4*i+1){print; exit}' table2.txt > file"$i".txt
awk -v i=$i 'NR==(4*i+2){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+3){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+4){print; exit}' table2.txt >> file"$i".txt
done
Merge the 4 awk calls into a single one. This prevents reading the first lines over and over per loop cycle.
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i '(NR<=4*i) {next} # skip line
(NR> 4*(i+1)){exit} # exit awk
1' table2.txt > file"$i".txt # print line
done
remove the final loop (see top of this answer)
This is functionally the same as @JamesBrown's answer, just written more awk-ishly, so don't accept this; I only posted it to show the more idiomatic awk syntax, since you can't put formatted code in a comment.
awk '
(NR%4)==1 { close(out); out="file" c++ ".txt" }
c > 10000 { exit }
{ print > out }
' file
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why you should avoid shell loops for manipulating text.
With just head and split you can do it very simply:
chunk=4
files=10000
head -n $(($chunk*$files)) table2.txt |
split -d -a 5 --additional-suffix=.txt -l $chunk - file
Basically, read the first $chunk * $files (40,000) lines and split them into chunks of 4 consecutive lines, using file as the prefix and .txt as the suffix for the new files.
If you want a numeric identifier, you will need 5 digits (-a 5), as pointed out in the comments (credit: @kvantour).
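Also note that with -d -a 5 and the file prefix, the generated names are zero-padded, so they will look a bit different from the file0.txt, file1.txt pattern in the question, roughly:

$ ls
file00000.txt  file00001.txt  file00002.txt  ...  file09999.txt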
Another awk:
$ awk '{if(NR%4==1){if(i==10000)exit;close(f);f="file" i++ ".txt"}print > f}' file
$ ls
file file0.txt file1.txt
Explained:
awk ' {
if(NR%4==1) { # use mod to recognize first record of group
if(i==10000) # exit after 10000 files
exit # test with 1
close(f) # close previous file
f="file" i++ ".txt" # make a new filename
}
print > f # output record to file
}' file

Shell script to compare two specific rows in a single CSV file

I am trying to learn shell scripting. I have a single CSV file, which is in the below format:
Time, value1, value2, value3
12-17 17:47:55.380,1,2,9
12-17 17:48:55.380,8,4,9
12-17 17:49:55.380,1,2,9
12-17 17:50:55.380,8,4,9
I am looking for CSV output something like below:
0,0,0,0
1,7,2,0
1,-7,-2,0
1,7,2,0
Till now I have written code:
First_value=ps -ef |awk "NR==1{print ;exit}" try.csv
Second_value=ps -ef |awk "NR==2{print ;exit}" try.csv
echo diff = $Second_value - $First_value
But I am getting error like:
read.sh: 14: read.sh: 12-17:not found.
Following are my queries:
I am not able to put this in a loop and get the output. I would also like to
know how I can write the result back to the same CSV file, but at a
particular row and column.
The following script (csvdiff.sh) will compare two lines of your choosing and output the difference, keeping the original separator characters.
#! /bin/bash
# input: $1 - CSV file
# $2 - line number of line to subtract from
# $3 - line number of line to subtract
# Save separators
head -n1 $1 | sed 's/[0-9]\+/\n/g' | head -n-1 | tail -n+2 > .seps
# check for reversed compare
if [ $3 -lt $2 ]
then
first=$3
second=$2
reversed=1
else
first=$2
second=$3
reversed=0
fi
# get requested lines ($2 & $3) from the CSV file as supplied in $1
awk -v first=$2 -v second=$3 -F'[,: ]' 'BEGIN { OFS="\n" }
NR==first{ split($0,v) }
NR==second{
split($0,w)
res=0
for (i in w) {
$i = v[i]-w[i]
}
print
}' $1 > .vals
# handle reversed compare
if [ $reversed -eq 1 ]
then
awk '{print $1 * -1}' .vals > .tmp
mv .tmp .vals
fi
# paste the differences with the original separator characters
paste -d'\0' .vals .seps | paste -sd'\0'
# remove used files
rm .vals .seps
Example usage:
$ cat file
2,2,3,4,5:10
1,2,3,4,5:12
$ chmod +x csvdiff.sh
$ bash csvdiff.sh file 1 2
1,0,0,0,0:-2
$ bash csvdiff.sh file 2 1
-1,0,0,0,0:2
Note that this script will compare fields separated by several delimiters, such as colons and commas. It will, however, not take semantics like time into account, meaning that dates won't be subtracted as a whole but component-wise.
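If all you need are the row-on-row differences for the comma-separated sample in the question, a much smaller awk sketch can do it. Assumptions: the header line is skipped, the time column only reports whether it changed (1 or 0), and the remaining columns are plain integers:

awk -F, 'NR == 1 { next }                        # skip the header line
         NR == 2 { print "0,0,0,0" }             # first data row: nothing to diff against
         NR > 2  { printf "%d,%d,%d,%d\n", ($1 != p1), $2 - p2, $3 - p3, $4 - p4 }
         { p1 = $1; p2 = $2; p3 = $3; p4 = $4 }' try.csv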

for loop and if statements in awk

I am a biologist who is starting to have to learn some elementary scripting skills to deal with large DNA sequence data sets, so please go easy on me. I am doing this all in bash. I have a file with my data formatted like this:
CLocus_58919_Sample_25_Locus_33235_Allele_0
TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
CLocus_58919_Sample_9_Locus_54109_Allele_0
TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
What I need to do is loop through this file and write all the sequences from the same sample into their own file. Just to be clear, these sequences come from samples 25 and 9. So my idea was to use awk to reformat my file in the following way:
CLocus_58919_Sample_25_Locus_33235_Allele_0_TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
CLocus_58919_Sample_9_Locus_54109_Allele_0_TGCAGGTGCTTCCAGTTGTCTTTGTAGCGTCCCACCATGATCTGCAGGTCCTTG
then pipe this into another awk if statement that says "if sample = $i then write out that entire line to a file named sample.$i". Here is my code so far:
#!/bin/bash
a=`ls /scratch/tkchafin/data/raw | wc -l`;
b=1;
c=$((a-b));
mkdir /scratch/tkchafin/data/phylogenetics
for ((i=0; i<=$((c)); i++)); do
awk 'ORS=NR%2?"_":"\n"' $1 | awk -F_ '{if($4==$i) print}' >> /scratch/tkchafin/data/phylogenetics/sample.$i
done;
I understand this is not working because $i is in single quotes so bash is not recognizing it. I know awk has a -v option for passing external variables to it, but I don't know how I would apply that in this case. I tried to move the for loop inside the awk statement but this does not produce the desired result either. Any help would be much appreciated.
You can have awk write directly to the desired output file, without a shell loop:
awk -F_ '(NR % 2) == 1 { line1 = $0; fn="/scratch/tkchafin/data/phylogenetics/sample."$4; }
(NR % 2) == 0 { print line1"_"$0 > fn; }' "$1"
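If the number of distinct samples is large, some awk implementations can run out of open file handles. A variant sketch of the same idea that appends to each output file and closes it immediately avoids that (same hard-coded output directory as above):

awk -F_ '(NR % 2) == 1 { line1 = $0; fn = "/scratch/tkchafin/data/phylogenetics/sample." $4 }
         (NR % 2) == 0 { print line1 "_" $0 >> fn; close(fn) }' "$1"

Since >> appends, clear out any old sample.* files before re-running.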
But to show how you would use -v in your version, it would be:
for ((i=0; i<=$((c)); i++)); do
awk 'ORS=NR%2?"_":"\n"' $1 | awk -F_ -v i=$i '$4 == i' >> /scratch/tkchafin/data/phylogenetics/sample.$i
done;

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
If the data of file 1 is present in file 2, it should return 1, or else 0, in a tab-separated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
The above code is not giving me the output I am looking for.
Kindly have a look and suggest a correction.
Thank you.
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on if the line was seen in file2.
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
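If you also want the tab-separated output shown in the question, the same dictionary idea works with OFS set explicitly, still using tolower to ignore case (a small sketch):

awk -v OFS='\t' 'FNR == NR { a[tolower($1)]; next }
                 { print $1, ((tolower($1) in a) ? 1 : 0) }' file2 file1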
The following code should do it.
Take a close look at the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
grep -xF -f file2 file1 | sed $'s/$/\t1/'
grep -vxF -f file2 file1 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while read; do
if [[ $REPLY = $'\t'* ]] ; then
printf "%s\t0\n" "${REPLY#?}"
else
printf "%s\t1\n" "${REPLY}"
fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have python installed.
If you're familiar with Python and are interested in the solution, you only need a bit of formatting.
#!/usr/bin/env python
f1 = open('file1').readlines()
f2 = open('file2').readlines()
f1_in_f2 = [int(x in f2) for x in f1]
for n,c in zip(f1, f1_in_f2):
print n,c
