how to pull data from a vcf table - bash

I have two files:
SCR_location - which lists SNP locations in ascending order.
19687
36075
n...
modi_VCF - a VCF table that has information about every SNP.
19687 G A xxx:255,0,195 xxx:255,0,206
20398 G C 0/0:0,255,255 0/0:0,208,255
n...
I want to save just the lines with a matching SNP location into a new file.
I wrote the following script, but it doesn't work:
cat SCR_location |while read SCR_l; do
cat modi_VCF |while read line; do
if [ "$SCR_l" -eq "$line" ] ;
then echo "$line" >> file
else :
fi
done
done

Your loop compares each position against the entire VCF line with -eq, a numeric test, so the comparison errors out on every line; it also rescans modi_VCF once per location. Would you please try a bash solution that reads each file only once:
declare -A seen
while read -r line; do
seen[$line]=1
done < SCR_location
while read -r line; do
read -ra ary <<< "$line"
if [[ ${seen[${ary[0]}]} ]]; then
echo "$line"
fi
done < modi_VCF > file
It first iterates over SCR_location and stores the SNP locations in an associative array, seen.
Next it scans modi_VCF, and if the 1st column value is found in the associative array, it prints the line.
If awk is an option, you can also say:
awk 'NR==FNR {seen[$1]++; next} {if (seen[$1]) print}' SCR_location modi_VCF > file
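Since the NR==FNR idiom reappears in several answers below, here is the same one-liner expanded with comments (same logic, just formatted):
awk '
NR==FNR { seen[$1]++; next }  # 1st file (SCR_location): remember each position
seen[$1]                      # 2nd file (modi_VCF): print lines whose 1st column was seen
' SCR_location modi_VCF > file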
[Edit]
In order to filter out the unmatched lines, just negate the logic:
awk 'NR==FNR {seen[$1]++; next} {if (!seen[$1]) print}' SCR_location modi_VCF > file_unmatched
The code above outputs the unmatched lines only. If you want to separate the matched and the unmatched lines in one pass, please try:
awk 'NR==FNR {seen[$1]++; next} {if (seen[$1]) {print >> "file_matched"} else {print >> "file_unmatched"} }' SCR_location modi_VCF
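Note that >> in awk appends to whatever file_matched/file_unmatched are left over from a previous run; within a single awk process a plain > truncates each file on first use and then keeps appending, which is usually what you want:
awk 'NR==FNR {seen[$1]++; next} {print > (seen[$1] ? "file_matched" : "file_unmatched")}' SCR_location modi_VCF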
Hope this helps.

Related

Shell Script for combining 3 files

I have 3 files with below data
$ cat File1.txt
Apple,May
Orange,June
Mango,July
$ cat File2.txt
Apple,Jan
Grapes,June
$ cat File3.txt
Apple,March
Mango,Feb
Banana,Dec
I require the below output file.
$Output_file.txt
Apple,May|Jan|March
Orange,June
Mango,July|Feb
Grapes,June
Banana,Dec
The requirement is to key on the first column: when a column-1 value appears in several files, the second columns need to be joined with "|". If a value appears in only one file, its line is printed as-is.
I have tried putting this in a while loop, but it takes time as the file sizes increase. I wanted a simple solution using shell script.
This should work:
#!/bin/bash
for FRUIT in $( cat "$@" | cut -d "," -f 1 | sort | uniq )
do
    echo -ne "${FRUIT},"
    awk -F "," "\$1 == \"$FRUIT\" {printf(\"%s|\",\$2)}" "$@" | sed 's/.$/\'$'\n/'
done
Run it as :
$ ./script.sh File1.txt File2.txt File3.txt
A purely native-bash solution (calling no external tools, and thus limited only by the performance constraints of bash itself) might look like:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: Bash 4 or newer required" >&2; exit 1;; esac
declare -A items=( )
for file in "$@"; do
    while IFS=, read -r key value; do
        items[$key]+="|$value"    # accumulate "|"-prefixed values per key
    done <"$file"
done
for key in "${!items[@]}"; do
    value=${items[$key]}
    printf '%s,%s\n' "$key" "${value#'|'}"    # strip the leading "|" from the first append
done
...called as ./yourscript File1.txt File2.txt File3.txt
This is fairly easily done with a single awk command:
awk 'BEGIN{FS=OFS=","} {a[$1] = a[$1] (a[$1] == "" ? "" : "|") $2}
END {for (i in a) print i, a[i]}' File{1,2,3}.txt
Orange,June
Banana,Dec
Apple,May|Jan|March
Grapes,June
Mango,July|Feb
If you want output in the same order as strings appear in original files then use this awk:
awk 'BEGIN{FS=OFS=","} !($1 in a) {b[++n] = $1}
{a[$1] = a[$1] (a[$1] == "" ? "" : "|") $2}
END {for (i=1; i<=n; i++) print b[i], a[b[i]]}' File{1,2,3}.txt
Apple,May|Jan|March
Orange,June
Mango,July|Feb
Grapes,June
Banana,Dec
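Spelled out with comments (the same program, just formatted):
awk '
BEGIN { FS = OFS = "," }
!($1 in a) { b[++n] = $1 }                     # remember keys in first-seen order
{ a[$1] = a[$1] (a[$1] == "" ? "" : "|") $2 }  # append month, "|"-separated
END { for (i = 1; i <= n; i++) print b[i], a[b[i]] }
' File{1,2,3}.txt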

bash to get information from two files

I have file1:
NM_000014 A2M
NM_000015 NAT2
NM_000016 ACADM
NM_000017 ACADS
NM_000018 ACADVL
NM_000019 ACAT1
NM_000020 ACVRL1
NM_000021 PSEN1
NM_000022 ADA
And file2:
NM_000019
NM_000020
NM_000020
NM_12345
I need to get information from my file1 and put it to file2 - so create file3:
NM_000019 ACAT1
NM_000020 ACVRL1
NM_000020 ACVRL1
NM_12345 NO
Note - I cannot change the original sort order (so no comm or diff). file2 contains duplicate lines which I need to keep (wc -l file2 == wc -l file3). If there is no match - print NO.
I have about 70K rows and I do not need the fastest solution.
My code can only compare and print the matching lines.
code:
#!/bin/bash
while read -r c; do
grep $c file1 | uniq
done < file2 > file3
Using awk:
$ awk 'NR==FNR{a[$1]=$2;next} {print ($1 in a?$1 OFS a[$1]:$1 OFS "NO")}' file1 file2
NM_000019 ACAT1
NM_000020 ACVRL1
NM_000020 ACVRL1
NM_12345 NO
Explained:
NR==FNR{ # process the first file
a[$1]=$2 # hash records to a, $1 as key
next # skip to next record
}
{ # process the second file
print ($1 in a?$1 OFS a[$1]:$1 OFS "NO") # print hashed value if found or NO
# if($1 in a) # another way of saying above
# print $1, a[$1]
# else
# print $1, "NO"
}
So basically you have one file with patterns, and a second one that you want to search using those patterns:
#!/bin/bash
for PATTERN in $(cat "$2"); do
    TMP=$(egrep "$PATTERN" "$1")
    if [ ! -z "$TMP" ]; then
        echo "$TMP"
    else
        echo "$PATTERN NO"
    fi
done
and a quick test:
$ bash filter.sh file1 file2
NM_000019 ACAT1
NM_000020 ACVRL1
NM_000020 ACVRL1
NM_12345 NO
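One caveat with the egrep above (and with the grep in the original loop): the pattern matches anywhere in the line, so a short ID would also hit longer IDs that contain it. A sketch that anchors the pattern to the whole first field (grep -m1 is GNU grep and stops after the first match):
#!/bin/bash
# anchored variant: match only when the pattern is exactly the 1st field
while read -r pattern; do
    if ! grep -m1 "^${pattern}[[:space:]]" "$1"; then
        echo "$pattern NO"
    fi
done < "$2"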
Try adding this if statement to your code:
if ! grep -q "$i" fileone; then
    echo "$i NO"
fi
For example:
#!/bin/bash
while read -r c; do grep "$c" fileone | uniq; done < filetwo
for i in $(cat filetwo)
do
    if ! grep -q "$i" fileone; then
        echo "$i NO"
    fi
done
It will print NO for every line of file2 that has no match in file1. Note that the NO lines come after all the matches, so the original file2 order is not preserved.
Try the paste command. It is a less noble form than awk; I prefer awk, but paste may help you.
paste file1 file2 file3 ... fileN
You can redirect the command output to a file as usual:
paste file1 file2 file3 ... fileN > fileN+1 (or whatever)
It reads the files line by line and joins the corresponding lines side by side.
That's it. It is not very elegant, but sometimes it is useful until you find a better way to get the results you are looking for.
Hope that helps

comparing two files and printing lines with similar strings in one file [duplicate]

This question already has answers here:
Inner join on two text files
(5 answers)
Closed 6 years ago.
I have two files which I need to compare; if the first column in file1 matches the part before the underscore in the first column of file2, then join them side by side in file3. Below is an example:
File1:
123123,ABC,2016-08-18,18:53:53
456456,ABC,2016-08-18,18:53:53
789789,ABC,2016-08-18,18:53:53
123123,ABC,2016-02-15,12:46:22
File2
789789_TTT,567774,223452
123123_TTT,121212,343434
456456_TTT,323232,223344
output:
123123,ABC,2016-08-18,18:53:53,123123_TTT,121212,343434
456456,ABC,2016-08-18,18:53:53,456456_TTT,323232,223344
789789,ABC,2016-08-18,18:53:53,789789_TTT,567774,223452
123123,ABC,2016-02-15,18:53:53,123123_TTT,121212,343434
Thanks..
Using GNU awk:
$ awk -F, 'NR==FNR{a[gensub(/([^_]*)_.*/,"\\1","g",$1)]=$0;next} $1 in a{print $0","a[$1]}' file2 file1
123123,ABC,2016-08-18,18:53:53,123123_TTT,121212,343434
456456,ABC,2016-08-18,18:53:53,456456_TTT,323232,223344
789789,ABC,2016-08-18,18:53:53,789789_TTT,567774,223452
123123,ABC,2016-02-15,12:46:22,123123_TTT,121212,343434
Explanation:
NR==FNR { # for the first file (file2)
a[gensub(/([^_]*)_.*/,"\\1","g",$1)]=$0 # store to array
next
}
$1 in a { # if the key from second file in array
print $0","a[$1] # output
}
This awk solution matches keys formed from file2 against column 1 of file1; it should also work on Solaris using /usr/xpg4/bin/awk. I took the liberty of assuming the last line of the OP's expected output has a typo.
file1=$1
file2=$2
AWK=awk
[[ $(uname) == SunOS ]] && AWK=/usr/xpg4/bin/awk
$AWK -F',' '
BEGIN{OFS=","}
# file2 key is part of $1 till underscore
FNR==NR{key=substr($1,1,index($1,"_")-1); f2[key]=$0; next}
$1 in f2 {print $0, f2[$1]}
' "$file2" "$file1"
Tested:
123123,ABC,2016-08-18,18:53:53,123123_TTT,121212,343434
456456,ABC,2016-08-18,18:53:53,456456_TTT,323232,223344
789789,ABC,2016-08-18,18:53:53,789789_TTT,567774,223452
123123,ABC,2016-02-15,12:46:22,123123_TTT,121212,343434
Pure bash solution (bash 4+ for the associative array):
file1=$1
file2=$2
declare -A f2                      # map: key (part before "_") -> full file2 line
while IFS= read -r line; do
    key=${line%%_*}
    f2[$key]=$line
done < "$file2"
while IFS= read -r line; do
    key=${line%%,*}                # first comma-separated field of file1
    [[ -n ${f2[$key]} ]] || continue
    echo "$line,${f2[$key]}"
done < "$file1"
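Called as e.g. ./join.sh file1 file2 > file3 (the script name is just an example).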

Shell Script - Extract number at X column in current line in file

I am reading a file (test.log.csv) line by line until the end of the file, and I want to extract the value in the 4th column of the current line and write it to a text file (output.txt).
For example, say I have read up to the 2nd line (INSERT,SLT_TEST_1,TEST,1127192896,0,DEBUG1); I want to extract the number in column 4 of that line and append it to output.txt.
test.log.csv
INSERT,SLT_TEST_1,TEST,1127192896,0,DEBUG1
INSERT,SLT_TEST_1,TEST,1127192896,0,DEBUG1
INSERT,SLT_TEST_1,TEST,1127192896,0,DEBUG1
The desired output is
output.txt
1127192896
1127192896
1127192896
Right now my script is as below
#! /bin/bash
clear
rm /home/mobaxterm/Script/output.txt
while IFS= read -r line
do
if [[ $line == *"INSERT"* ]] && [[ $line == *"$1"* ]]
then
echo $line >> /home/mobaxterm/Script/output.txt
lastID=$(awk -F "," '{if (NR==curLine) { print $4 }}' curLine="${lineCount}")
echo $lastID
else
if [ lastID == "$1" ]
then
echo $line >> /home/mobaxterm/Script/output.txt
fi
fi
lineCount=$(($lineCount+1))
done < "/home/mobaxterm/Script/test.log.csv"
The parameter ($1) will be 1127192896
I tried declaring a counter in the loop and comparing NR with the counter, but the script just stopped after it found the first one.
Find all the lines where the 4th field is 1127192896 and output the 4th field:
awk -F, -v SEARCH="1127192896" '$4 ~ SEARCH {print $4}' test.log.csv
1127192896
1127192896
1127192896
Find all the lines containing the word "INSERT" and where the 4th field is 1127192896
awk -F, -v SEARCH="1127192896" '$4 ~ SEARCH && /INSERT/ {print $4}' test.log.csv
If you have the number you want to look for in a variable called $1, put that in place of the 1127192896, like this:
awk -F, -v SEARCH="$1" '$4 ~ SEARCH && /INSERT/ {print $4}' test.log.csv
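Note that $4 ~ SEARCH is a regular-expression match, so SEARCH=112 would also match 1127192896. If the field should equal the value exactly, use a string comparison instead:
awk -F, -v SEARCH="$1" '$4 == SEARCH && /INSERT/ {print $4}' test.log.csv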
You can combine parameter substitution with an array definition:
array_variable=( ${line//,/ } )
sth_you_need=${array_variable[1]}
Or you can just use awk/cut:
sth_you_need=$(echo "$line" | awk -F, '{print $2}')
# or
sth_you_need=$(echo "$line" | cut -d, -f2)
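Equivalently, you can split the line on commas with read, which avoids the word-splitting of the array assignment above (a sketch; c4 holds the 4th column the question asks about):
IFS=, read -r c1 c2 c3 c4 rest <<< "$line"
echo "$c4"    # e.g. 1127192896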

unix command to get lines from in between first and last occurence of a word and write to a file

I want a unix command to find the lines between the first and last occurrence of a word.
For example:
Let's imagine we have 1000 lines. The tenth line contains the word "stackoverflow", and the thirty-fifth line also contains "stackoverflow".
I want to print the lines between 10 and 35 and write them to a new file.
You can do it in two steps. The basic idea is to:
1) get the line numbers of the first and last match;
2) print the range of lines between them.
$ read first last <<< $(grep -n stackoverflow your_file | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file
Explanation
read first last reads two values and stores them in $first and $last.
grep -n stackoverflow your_file greps and prints each match as line_number:matching_line.
awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}' prints the line numbers of the first and last match of stackoverflow in the file.
And
awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file prints all lines from $first line number till $last line number.
Test
$ cat a
here we
have some text
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
to make more fun
blablabla
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' a
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
By steps:
$ grep -n stackoverflow a
3:stackoverflow
9:stackoverflow
11:stackoverflow
$ grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}'
3 11
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ echo "first=$first, last=$last"
first=3, last=11
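The two steps can also be collapsed into a single awk program that reads the file twice: the first pass records the first and last matching line numbers, the second pass prints that range (a sketch, using the same test file and search word):
awk 'NR==FNR { if (/stackoverflow/) { if (!first) first = FNR; last = FNR }; next }
FNR >= first && FNR <= last' a a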
If you know an upper bound on how many lines there can be (say, a million), then you can use this simple abusive script:
(grep -A 1000000 stackoverflow | grep -B 1000000 stackoverflow) < file
You can append | tail -n +2 | head -n -1 to strip the border lines as well:
(grep -A 1000000 stackoverflow | grep -B 1000000 stackoverflow |
tail -n +2 | head -n -1) < file
I'm not 100% sure from the question whether the output should be inclusive of the first and last matching lines, so I'm assuming it is. But this can be easily changed if we want exclusive instead.
This pure-bash solution does it all in one step - i.e. the file (or pipe) is only read once:
#!/bin/bash
function midgrep {
    while read -r ln; do
        [ "$saveline" ] && linea[$((i++))]=$ln
        if [[ $ln =~ $1 ]]; then
            if [ "$saveline" ]; then
                # subsequent match: flush everything buffered since the last match
                for ((j=0; j<i; j++)); do echo "${linea[$j]}"; done
                i=0
            else
                # first match: start buffering
                saveline=1
                linea[$((i++))]=$ln
            fi
        fi
    done
}
midgrep "$1"
Save this as a script (e.g. midgrep.sh) and pipe whatever output you like to it as follows:
$ cat input.txt | ./midgrep.sh stackoverflow
This works as follows:
find the first matching line and buffer it in the first element of an array
continue reading lines until the next match, buffering to the array as we go
on each subsequent match, flush the buffer array to output
continue reading the file to the end; if there are no more matches, the last buffer is simply discarded
The advantage of this approach is that we read through the input only once. The disadvantage is that we buffer everything between matches; if there are many lines between matches, they are all held in memory until we hit the next match.
Also this uses the bash =~ regular expression operator to keep this pure bash. But you could replace this with a grep instead, if you are more comfortable with that.
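For instance, the test could become if grep -qE -- "$1" <<< "$ln"; then ..., at the cost of spawning one grep process per input line, which is noticeably slower on large inputs.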
Using perl:
perl -00 -lne '
chomp(my @arr = split /stackoverflow/);
print join "\nstackoverflow", @arr[1 .. $#arr - 1]
' file.txt | tee newfile.txt
The idea is to split the input on the string "stackoverflow", then print the chunks that lie between the first and the last occurrence, rejoined with "\nstackoverflow".
