bash to get information from two files

I have file1:
NM_000014 A2M
NM_000015 NAT2
NM_000016 ACADM
NM_000017 ACADS
NM_000018 ACADVL
NM_000019 ACAT1
NM_000020 ACVRL1
NM_000021 PSEN1
NM_000022 ADA
And file2:
NM_000019
NM_000020
NM_000020
NM_12345
I need to take the information from file1 and attach it to the entries of file2, creating file3:
NM_000019 ACAT1
NM_000020 ACVRL1
NM_000020 ACVRL1
NM_12345 NO
Note: I cannot change the original sort order (so I cannot use comm or diff). file2 contains duplicate lines, which I need to keep (wc -l file2 == wc -l file3). If there is no match, print NO.
I have about 70K rows and I do not need the fastest solution.
My code can only compare and print the matching lines.
code:
#!/bin/bash
while read -r c; do
grep $c file1 | uniq
done < file2 > file3

Using awk:
$ awk 'NR==FNR{a[$1]=$2;next} {print ($1 in a?$1 OFS a[$1]:$1 OFS "NO")}' file1 file2
NM_000019 ACAT1
NM_000020 ACVRL1
NM_000020 ACVRL1
NM_12345 NO
Explained:
NR==FNR{ # process the first file
a[$1]=$2 # hash records to a, $1 as key
next # skip to next record
}
{ # process the second file
print ($1 in a?$1 OFS a[$1]:$1 OFS "NO") # print hashed value if found or NO
# if($1 in a) # another way of saying above
# print $1, a[$1]
# else
# print $1, "NO"
}
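To produce file3 as in the question, just redirect the output of the same command:
$ awk 'NR==FNR{a[$1]=$2;next} {print ($1 in a?$1 OFS a[$1]:$1 OFS "NO")}' file1 file2 > file3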

So basically you have one file with patterns, and a second one that you want to search using those patterns:
#!/bin/bash
for PATTERN in $(cat $2); do           # read each ID from the pattern file (second argument)
    TMP=$(egrep $PATTERN $1)           # look the ID up in the data file (first argument)
    if [ ! -z "$TMP" ]; then
        echo "$TMP"                    # found: print the matching line(s)
    else
        echo "$PATTERN NO"             # not found: print the ID followed by NO
    fi
done
and a quick test:
$ bash filter.sh file1 file2
NM_000019 ACAT1
NM_000020 ACVRL1
NM_000020 ACVRL1
NM_12345 NO

Try adding this if statement to your code:
if ! grep -q $i fileone ; then
echo -e $i " NO"
fi
For example:
#!/bin/bash
while read -r c; do grep $c fileone | uniq; done < filetwo
for i in $(cat filetwo)
do
    if ! grep -q $i fileone ; then
        echo -e $i " NO"
    fi
done
It will print NO for each line of file2 that has no match in file1.

Try the paste command. It is a less elegant form than awk (I prefer awk), but paste may help you.
paste file1 file2 file3... etc ..fileN
You can redirect command output to a file as usual.
paste file1 file2 file3... etc ..fileN > fileN+1 (or whatever)
It reads the files line by line and merges the corresponding lines side by side.
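For illustration, with two hypothetical three-line files a.txt and b.txt, paste simply glues line 1 to line 1, line 2 to line 2 and so on, separating them with a tab by default:
$ printf 'x\ny\nz\n' > a.txt
$ printf '1\n2\n3\n' > b.txt
$ paste a.txt b.txt
x	1
y	2
z	3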
That's it. It is not very elegant but sometimes it is very useful until you find a different way to get the results you are looking for.
Hope that helps

Related

how to pull data from a vcf table

I have two files:
SCR_location, which has SNP locations in ascending order:
19687
36075
n...
modi_VCF, a VCF table that has information about every SNP:
19687 G A xxx:255,0,195 xxx:255,0,206
20398 G C 0/0:0,255,255 0/0:0,208,255
n...
I want to save just the lines with a matching SNP location into a new file.
I wrote the following script but it doesn't work:
cat SCR_location | while read SCR_l; do
    cat modi_VCF | while read line; do
        if [ "$SCR_l" -eq "$line" ] ;
        then echo "$line" >> file
        else :
        fi
    done
done
Would you please try a bash solution:
declare -A seen
while read -r line; do
    seen[$line]=1
done < SCR_location

while read -r line; do
    read -ra ary <<< "$line"
    if [[ ${seen[${ary[0]}]} ]]; then
        echo "$line"
    fi
done < modi_VCF > file
It first iterates over SCR_location and stores SNP locations in an associative array seen.
Next it scans modi_VCF and if the 1st column value is found in the associative array, then print the line.
If awk is an option, you can also say:
awk 'NR==FNR {seen[$1]++; next} {if (seen[$1]) print}' SCR_location modi_VCF > file
[Edit]
In order to filter out the unmatched lines, just negate the logic:
awk 'NR==FNR {seen[$1]++; next} {if (!seen[$1]) print}' SCR_location modi_VCF > file_unmatched
The code above outputs the unmatched lines only. If you want to split the matched and unmatched lines into separate files in one pass, please try:
awk 'NR==FNR {seen[$1]++; next} {if (seen[$1]) {print >> "file_matched"} else {print >> "file_unmatched"} }' SCR_location modi_VCF
Hope this helps.

Print all lines in "file2" whose line numbers are stored in column 2 of "file1"

File1:
count line_num
xy 55
ab 67
File2:
a|b|c
d|e|f
I want to print lines 55 and 67 of file2.
I am trying:
#!/usr/bin/ksh
while read file_name; do
line_num=`echo $file_name | awk '{print $2}'`
awk 'NR==$line_num{print;exit}' file2 >> file3.txt
done < file1
but it's not working!
Using awk you can do:
awk 'NR==FNR{line[$2]; next} FNR in line' file1 file2
We iterate over the first file and store its second column as keys in an array called line (we could skip the header line with NR>1, but since it doesn't contain numbers we don't need to). Once the first file is loaded into the array, we iterate over the second file and print the lines whose line numbers are in our array. NR and FNR are awk variables that track the line numbers.
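As a quick illustration of why NR==FNR selects only the first file, here are two hypothetical two-line files x and y:
$ printf 'a\nb\n' > x
$ printf 'c\nd\n' > y
$ awk '{print FILENAME, "NR=" NR, "FNR=" FNR}' x y
x NR=1 FNR=1
x NR=2 FNR=2
y NR=3 FNR=1
y NR=4 FNR=2
NR keeps counting across both files while FNR restarts at 1 for each file, so they are equal only while the first file is being read.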
You can use awk to read the line numbers in a loop and sed to print out the specific lines:
while read a; do sed -n ${a}p f2.txt; done < <(awk 'NR>1{print$2}' f1.txt)
If you have a bigger file, performance can be an issue, as Ed pointed out; in that case you can use awk alone:
awk 'NR==FNR{if(NR>1)l[$2]=1;next}{if(l[FNR])print $0}' f1.txt f2.txt
Another way, is to use xargs:
awk 'NR>1{print $2}' f1.txt | xargs -n1 -I {} sed -n {}p f2.txt
Use sed to construct a sed one-liner (in the case of file1 it'd output and run sed -n "55p;67p;" file2):
sed -n "$(sed -n '2~1{s/.* //;s/.*/&p/p}' file1)" file2
A good advertisement for awk, alas!

find and replace words in files

I have two files: file1 and file2.
Any word in file1 that matches a line in file2 should have "-W" appended.
File1:
Verb=Applaud,Beg,Deliver
Adjective=Bitter,Salty,Minty
Adverb=Quickly,Truthfully,Firmly
file2:
Gate
Salty
Explain
Quickly
Hook
Deliver
Earn
Jones
Applaud
Take
Output:
Verb=Applaud-W,Beg,Deliver-W
Adjective=Bitter,Salty-W,Minty
Adverb=Quickly-W,Truthfully,Firmly
I tried this, but it is not working and may take too long:
for i in `cat file2` ; do
nawk -v DEE="$i" '{gsub(DEE, DEE"-W")}1' file1 > newfile
mv newfile file1
done
This should work:
sed 's=^=s/\\b=;s=$=\\b/\&-W/g=' file2 | sed -f- file1
Output:
Verb=Applaud-W,Beg,Deliver-W
Adjective=Bitter,Salty-W,Minty
Adverb=Quickly-W,Truthfully,Firmly
To make changes in place:
sed 's=^=s/\\b=;s=$=\\b/\&-W/g=' file2 | sed --in-place -f- file1
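For reference, the first sed simply turns every word in file2 into a substitution command of the form s/\bword\b/&-W/g, and the second sed applies that generated script to file1. With the sample file2 the generated script should start like this:
$ sed 's=^=s/\\b=;s=$=\\b/\&-W/g=' file2 | head -3
s/\bGate\b/&-W/g
s/\bSalty\b/&-W/g
s/\bExplain\b/&-W/g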
Your approach was not that bad but I would prefer sed here, since it has an in place option.
while read i
do
sed -i "s/$i/$i-W/g" file1
done < file2
Here is one using pure bash:
#!/bin/bash
while read line
do
while read word
do
if [[ $line =~ $word ]]; then
line="${line//$word/$word-W}"
fi
done < file2
echo $line
done < file1
An awk approach (it splits file1 into records on "=" as well as newline, so a part-of-speech name and its comma-separated word list alternate as records; FNR%2 decides whether to close each record with "=" or a newline):
awk 'BEGIN{FS=OFS=",";RS="=|\n"}
NR==FNR{a[$1]++;next}
{
for (i=1;i<=NF;i++){
$i=($i in a) ? $i"-W":$i
}
printf("%s%s",$0,FNR%2?"=":"\n")
}' file2 file1
Results
Verb=Applaud-W,Beg,Deliver-W
Adjective=Bitter,Salty-W,Minty
Adverb=Quickly-W,Truthfully,Firmly

output of odd lines from sed not appearing on separate lines

I have the following file:
>A6NGG8_201_I_F
line2
>B1AK53_719_S_R
line4
>B1AK53_744_D_N
line5
>B7U540_205_R_H
line6
>B7U540_354_T_M
line7
where I want to print out all odd lines. I can do this by:
$ sed -n 1~2p file
>A6NGG8_201_I_F
>B1AK53_719_S_R
>B1AK53_744_D_N
>B7U540_205_R_H
>B7U540_354_T_M
and so I want to store the number in each line as a variable in bash. However, I run into a problem: storing the result of sed seems to put the output all on one line:
#!/bin/bash
line1=$(sed -n 1~2p file)
echo ${line1}
in which the output is:
>A6NGG8_201_I_F >B1AK53_719_S_R >B1AK53_744_D_N >B7U540_205_R_H >B7U540_354_T_M
so that when I do something like:
#!/bin/bash
line1=$(sed -n 1~2p file)
pos=$(echo ${line1} | awk -F"[__]" 'NF>2{print $2}')
echo ${pos}
I get
201
where I of course want:
201
719
744
205
354
How do I store the result of sed into separate lines so that they are processed properly when piped into my awk statement? I see you can use the /a notation, however when I tried sed -n '/1~2p/a' file this does not work in my bash script. Thanks
As said in comments, you need to quote the variable to make this happen:
echo "${line1}"
instead of
echo ${line1}
However, you can directly say:
awk -F_ 'NR%2 && NF>2 {print $2}' file
This will process the odd lines and, for them, print the 2nd _-separated field, but only if there are more than 2 fields.
From tripleee's answer I observe that a FASTA file can contain a different format. If so, I guess you will still want to get the ID in the lines starting with ">". This can be translated as:
awk -F_ '/^>/ && NF>2 {print $2}' file
See an example of how quoting preserves the format:
The file:
$ cat a
hello
bye
Read it into a variable:
$ var=$(< a)
echo without quoting:
$ echo $var
hello bye
Let's quote!
$ echo "$var"
hello
bye
If you are trying to get the header lines out of a FASTA file, your problem statement is wrong -- the data between the headers could be more than one line. You could simply do
sed -n '/^>/!d;s/^[^_]*_//;s/_.*//p' file.fasta
to get just the second underscore-delimited field out of each header line; or equivalently, in Awk,
awk -F _ '/^>/ { print $2 }' file.fasta

unix command to get lines between the first and last occurrence of a word and write to a file

I want a unix command to find the lines between the first and last occurrence of a word.
For example:
Let's imagine we have 1000 lines. The tenth line contains the word "stackoverflow", and the thirty-fifth line also contains the word "stackoverflow".
I want to print the lines between 10 and 35 and write them to a new file.
You can do it in two steps. The basic idea is to:
1) get the line numbers of the first and last match.
2) print the range of lines between them.
$ read first last <<< $(grep -n stackoverflow your_file | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file
Explanation
read first last reads two values and stores them in $first and $last.
grep -n stackoverflow your_file greps and shows the output like this: line_number:matching_line
awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}' prints the line numbers of the first and last match of stackoverflow in the file.
And
awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file prints all lines from $first line number till $last line number.
Test
$ cat a
here we
have some text
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
to make more fun
blablabla
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' a
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
By steps:
$ grep -n stackoverflow a
3:stackoverflow
9:stackoverflow
11:stackoverflow
$ grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}'
3 11
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ echo "first=$first, last=$last"
first=3, last=11
If you know an upper bound of how many lines there can be (say, a million), then you can use this simple abusive script:
(grep -A 100000 stackoverflow | grep -B 1000000 stackoverflow) < file
You can append | tail -n +2 | head -n -1 to strip the border lines as well:
(grep -A 100000 stackoverflow | grep -B 1000000 stackoverflow |
    tail -n +2 | head -n -1) < file
I'm not 100% sure from the question whether the output should be inclusive of the first and last matching lines, so I'm assuming it is. But this can be easily changed if we want exclusive instead.
This pure-bash solution does it all in one step - i.e. the file (or pipe) is only read once:
#!/bin/bash
function midgrep {
    while read ln; do
        [ "$saveline" ] && linea[$((i++))]=$ln
        if [[ $ln =~ $1 ]]; then
            if [ "$saveline" ]; then
                for ((j=0; j<i; j++)); do echo ${linea[$j]}; done
                i=0
            else
                saveline=1
                linea[$((i++))]=$ln
            fi
        fi
    done
}
midgrep "$1"
Save this as a script (e.g. midgrep.sh) and pipe whatever output you like to it as follows:
$ cat input.txt | ./midgrep.sh stackoverflow
This works as follows:
find the first matching line and buffer it in the first element of an array
continue reading lines until the next match, buffering to the array as we go
on each subsequent match, flush the buffer array to the output
continue reading the file to the end; whatever is buffered after the last match is simply discarded
The advantage of this approach is that we read through the input only once. The disadvantage is that we buffer everything between matches; if there are many lines between matches, they are all held in memory until we hit the next match.
Also, this uses the bash =~ regular expression operator to keep things pure bash, but you could replace it with a grep test instead if you are more comfortable with that.
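For comparison, here is a sketch of the same single-pass buffering idea in awk (the search string is hard-coded as "stackoverflow" purely for illustration):
awk '
    started { buf = buf $0 ORS }                        # buffer every line once the first match has been seen
    /stackoverflow/ {
        if (started) { printf "%s", buf; buf = "" }     # later match: flush everything up to and including it
        else         { started = 1; buf = $0 ORS }      # first match: start the buffer with this line
    }
' file
Anything still buffered after the last match is discarded, exactly as in the bash version.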
Using perl :
perl -00 -lne '
    chomp(my @arr = split /stackoverflow/);
    print join "\nstackoverflow", @arr[1 .. $#arr - 1]
' file.txt | tee newfile.txt
The idea is to split the whole input into chunks on the string "stackoverflow" and load them into an array. We then print the chunks from the 2nd up to the last but one, joining them back together with "stackoverflow".
