How do I remove SNPs duplicated by position using PLINK?

I am working with PLINK to analyse SNP chip data.
Does anyone know how to remove duplicated SNPs (duplicated by position)?

If the data are already in PLINK format, there will be a .bim file (binary filesets) or a .map file (text filesets); in both, the SNP names are in the 2nd column. The commands below assume a 3-column .map (chromosome, SNP ID, bp position), where the position is the 3rd column; in a standard 4-column .map, and in a .bim, the bp position is in the 4th column, so the field numbers need adjusting (an awk alternative for that case is sketched after the PLINK command below).
We need to create a list of SNPs that are duplicated:
sort -k3n myFile.map | uniq -f2 -D | cut -f2 > dupeSNP.txt
Then run PLINK with the --exclude flag:
plink --file myFile --exclude dupeSNP.txt --out myFileSubset
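If you are instead starting from a binary fileset (a .bim, where the bp position is in the 4th column) or a standard 4-column .map, the field numbers above won't match. A hedged awk alternative keyed on chromosome and position, listing every variant whose chr:bp pair occurs more than once (file names here are just placeholders):
# Two passes over the .bim: count chr:bp pairs, then print the IDs of all variants at duplicated positions
awk 'NR==FNR {count[$1":"$4]++; next} count[$1":"$4] > 1 {print $2}' myFile.bim myFile.bim > dupeSNP.txt
plink --bfile myFile --exclude dupeSNP.txt --make-bed --out myFileSubset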

You can also do it directly in PLINK 1.9 using the --list-duplicate-vars flag,
together with the require-same-ref, ids-only, or suppress-first modifiers, depending on what you want to do.
See https://www.cog-genomics.org/plink/1.9/data#list_duplicate_vars for details.
If you want to delete all occurrences of a variant that has duplicates, you will have to pass the output file of --list-duplicate-vars,
which has a .dupvar extension, to the --exclude flag.
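For example, a minimal sketch of that workflow, assuming a binary fileset called myFile and using the ids-only modifier so that the .dupvar file is a plain list of variant IDs that --exclude can read (check the linked documentation for the exact modifier behaviour, e.g. when duplicate variants share an ID):
# List all duplicated variants (all occurrences), IDs only
plink --bfile myFile --list-duplicate-vars ids-only --out myFile_dups
# Exclude them all
plink --bfile myFile --exclude myFile_dups.dupvar --make-bed --out myFileSubset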

I should caution that the two answers given above yield different results. This is because the sort | uniq method only takes the SNP ID and bp position into account, whereas the PLINK method (--list-duplicate-vars) also considers the A1 and A2 alleles.
Similar to sort | uniq on the .map file, we can use awk on a .gen file that looks like this:
22 rs1 12 A G 1 0 0 1 0 0
22 rs1 12 G A 0 1 0 0 0 1
22 rs2 16 C A 1 0 0 0 1 0
22 rs2 16 C G 0 0 1 1 0 0
22 rs3 17 T CTA 0 0 1 0 1 0
22 rs3 17 CTA T 1 0 0 0 0 1
# Get list of duplicate rsXYZ ID's
awk -F' ' '{print $2}' chr22.gen |\
sort |\
uniq -d > chr22_rsid_duplicates.txt
# Get list of duplicated bp positions
awk -F' ' '{print $3}' chr22.gen |\
sort |\
uniq -d > chr22_location_duplicates.txt
# Now match this list of bp positions to gen file to get the rsid for these locations
awk 'NR==FNR{a[$1]=$2;next}$3 in a{print $2}' \
chr22_location_duplicates.txt \
chr22.gen |\
sort |\
uniq \
> chr22_rsidBylocation_duplicates.txt
cat chr22_rsid_duplicates.txt \
chr22_rsidBylocation_duplicates.txt \
> tmp
# Get list of duplicates (by location and/or rsid)
cat tmp | sort | uniq > chr22_duplicates.txt
plink --gen chr22.gen \
--sample chr22.sample \
--exclude chr22_duplicates.txt \
--recode oxford \
--out chr22_noDups
This will classify rs2 as a duplicate; however, the PLINK --list-duplicate-vars method will not flag rs2 as a duplicate, because its two entries have different alleles.
If you want to obtain the same results using PLINK (a non-trivial task for BGEN file formats, since awk, sed, etc. do not work on binary files!), you can use the --rm-dup command from PLINK 2.0. The list of all duplicate SNPs removed can be logged (to a file ending in .rmdup.list) using the list modifier, like so:
# Export as bgen version 1.1
plink2 --bgen chr22.bgen \
--sample chr22.sample \
--rm-dup exclude-all list \
--export bgen-1.1 \
--out chr22_noDups
Note: I'm saving the output as bgen version 1.1 since PLINK 1.9 still has commands that are not available in PLINK 2.0. Therefore the only way (at this time) to use BGEN files with PLINK 1.9 is via the older 1.1 version of the format.

Related

How can I count and display only the words that are repeated more than once using unix commands?

I am trying to count and display only the words that are repeated more than once in a file. The basic idea is:
You are given a file with names and characters like commas, colons, slashes, etc.
Use the cut command to display only the first names in the file (other commands are also allowed).
Count and then display only the names repeated more than once.
I got to the point of counting and displaying all the names. However, I haven't found a way to display and to count only those names repeated more than once.
Here is a section of the file:
user1:x:80:200:Mia,Spurs:/home/user1:/bin/bash
user2:x:80:200:Martha,Dalton:/home/user2:/bin/bash
user3:x:80:200:Lucy,Carlson:/home/user3:/bin/bash
user4:x:80:200:Carl,Bingo:/home/user4:/bin/bash
Here is what I have been able to do:
Daniel@Daniel-MacBook-Pro Files % cut -d ":" -f 5-5 file1 | cut -d "," -f 1-1 | sort -n | uniq -c
1 Mia
3 Martha
1 Lucy
1 Carl
1 Jessi
1 Joke
1 Jim
2 Race
1 Sem
1 Shirly
1 Susan
1 Tim
You can filter out the rows whose count is 1 using grep.
cut -d ":" -f 5 file1 | cut -d "," -f 1 | sort | uniq -c | grep -v '^ *1 '

sort: wrong order when comparing according to numerical value

I'm trying to sort the lines in the following file according to numerical value in the second column:
2 117.336136
1 141.003021
1 342.389160
1 169.059006
1 208.173370
1 117.608192
However, for some reason, the following command returns the lines in the wrong order:
cat file | sort -n -k2
1 117.608192
2 117.336136
1 141.003021
1 169.059006
1 208.173370
1 342.389160
The first two lines are swapped. For other lines, the content of the first column does not affect the result.
Without the -k argument, the sort works exactly as expected:
cat file | cut -d' ' -f2 | sort -n
117.336136
117.608192
141.003021
169.059006
208.173370
342.389160
Why is that? Did I misunderstand the meaning of the -k argument?
Additional information:
LC_ALL=cs_CZ.utf8
sort --version gives sort (GNU coreutils) 8.31
sort orders lines according to your locale settings.
As mentioned by @KamilCuk, the decimal separator in cs_CZ is , rather than ., so with LC_ALL=cs_CZ.utf8 the digits after the . are not treated as a fraction and effectively only the integer part is compared. You can override the locale with LC_ALL=C.UTF-8 or LC_ALL=C (no UTF-8 support) to use the C locale's rules for sorting.
For floating-point values you can also use -g (general numeric), which additionally understands scientific notation, whereas -n does not.
Also important when using sort: restrict the key to the specific column, -k 2,2g, because by default the key runs from the given field to the end of the line.
$ LC_ALL=C.UTF-8 sort -k 2,2g test_sort.txt
2 117.336136
1 117.608192
1 141.003021
1 169.059006
1 208.173370
1 342.389160
$ LC_ALL=C sort -k 2,2g test_sort.txt
2 117.336136
1 117.608192
1 141.003021
1 169.059006
1 208.173370
1 342.389160
$ printf '1\t5.3\t6.0\n2\t5.3\t5.0\n'
1 5.3 6.0
2 5.3 5.0
# Sort uses the rest of the line to sort.
$ printf '1\t5.3\t6.0\n2\t5.3\t5.0\n' | LC_ALL=C.UTF-8 sort -k 2
2 5.3 5.0
1 5.3 6.0
# Sort only uses the second column.
$ printf '1\t5.3\t6.0\n2\t5.3\t5.0\n' | LC_ALL=C.UTF-8 sort -k 2,2
1 5.3 6.0
2 5.3 5.0
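As an aside, the practical difference between -n and -g is easiest to see with scientific notation (a small illustration, separate from the locale issue above):
printf '1e3\n5\n' | sort -n   # -n reads only the leading numeric prefix, so 1e3 (read as 1) sorts before 5
printf '1e3\n5\n' | sort -g   # -g parses the whole value, so 5 sorts before 1e3 (read as 1000)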

unix sort groups by their associated maximum value?

Let's say I have this input file 49142202.txt:
A 5
B 6
C 3
A 4
B 2
C 1
Is it possible to sort the groups in column 1 by the value in column 2? The desired output is as follows:
B 6 <-- B group at the top, because 6 is larger than 5 and 3
B 2 <-- 2 less than 6
A 5 <-- A group in the middle, because 5 is smaller than 6 and larger than 3
A 4 <-- 4 less than 5
C 3 <-- C group at the bottom, because 3 is smaller than 6 and 5
C 1 <-- 1 less than 3
Here is my solution:
join -t$'\t' -1 2 -2 1 \
<(cat 49142202.txt | sort -k2nr,2 | sort --stable -k1,1 -u | sort -k2nr,2 \
| cut -f1 | nl | tr -d " " | sort -k2,2) \
<(cat 49142202.txt | sort -k1,1 -k2nr,2) \
| sort --stable -k2n,2 | cut -f1,3
The first input to join sorted by column 2 is this:
2 A
1 B
3 C
The second input to join sorted by column 1 is this:
A 5
A 4
B 6
B 2
C 3
C 1
The output of join is:
A 2 5
A 2 4
B 1 6
B 1 2
C 3 3
C 3 1
Which is then sorted by the nl line number in column 2 and then the original input columns 1 and 3 are kept with cut.
I know it can be done a lot easier with for example groupby of pandas of Python, but is there a more elegant way of doing it, while sticking to the use of GNU Coreutils such as sort, join, cut, tr and nl? Preferably I want to avoid a memory inefficient awk solution, but please share those as well. Thanks!
As explained in the comment, my solution tries to reduce the number of pipes, the unnecessary cat commands, and especially the number of sort operations in the pipeline, since sorting is a complex/time-consuming operation:
I reached the following solution where f_grp_sort is the input file:
for elem in $(sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}')
do
grep $elem <(sort -k2nr f_grp_sort)
done
OUTPUT:
B 6
B 2
A 5
A 4
C 3
C 1
Explanations:
sort -k2nr f_grp_sort will generate the following output:
B 6
A 5
A 4
C 3
B 2
C 1
and sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}' will generate the output:
B
A
C
The awk just emits, in the same order, one unique element of the first column of that sorted output.
Then for elem in $(...); do grep $elem <(sort -k2nr f_grp_sort); done
will grep the lines containing B, then A, then C, which produces the required output.
Now as enhancement, you can use a temporary file to avoid doing sort -k2nr f_grp_sort operation twice:
$ sort -k2nr f_grp_sort > tmp_sorted_file && for elem in $(awk '!seen[$1]++{print $1}' tmp_sorted_file); do grep $elem tmp_sorted_file; done && rm tmp_sorted_file
So, this won't work for all cases, but if the values in your first column can be turned into bash variable names, we can use dynamically named arrays to do this instead of a bunch of joins. It should be pretty fast.
The first while block reads the contents of the file, putting the first two space-separated strings into col1 and col2. We then create a series of arrays named like ARR_A and ARR_B, where A and B are the values from column 1 (but only if $col1 contains just characters that can be used in bash variable names). Each array holds the column 2 values associated with that column 1 value.
I use your fancy sort chain to get the order in which the column 1 values should be printed; we loop through them, and for each column 1 array we sort the values and echo out column 1 and column 2.
The dynamic-variable bits can be hard to follow, but for the right values in column 1 it works. Again, if column 1 contains any character that cannot be part of a bash variable name, this solution will not work.
file=./49142202.txt
while read col1 col2 extra
do
if [[ "$col1" =~ ^[a-zA-Z0-9_]+$ ]]
then
eval 'ARR_'${col1}'+=("'${col2}'")'
else
echo "Bad character detected in Column 1: '$col1'"
exit 1
fi
done < "$file"
sort -k2nr,2 "$file" | sort --stable -k1,1 -u | sort -k2nr,2 | while read col1 extra
do
for col2 in $(eval 'printf "%s\n" "${ARR_'${col1}'[@]}"' | sort -r)
do
echo $col1 $col2
done
done
This was my test, a little more complex than your provided example:
$ cat 49142202.txt
A 4
B 6
C 3
A 5
B 2
C 1
C 0
$ ./run
B 6
B 2
A 5
A 4
C 3
C 1
C 0
Thanks a lot @JeffBreadner and @Allan! I came up with yet another solution, which is very similar to my first one but gives a bit more control, because it allows for easier nesting with for loops:
for x in $(sort -k2nr,2 $file | sort --stable -k1,1 -u | sort -k2nr,2 | cut -f1); do
awk -v x=$x '$1==x' $file | sort -k2nr,2
done
Do you mind if I don't accept either of your answers until I have time to evaluate the time and memory performance of your solutions? Otherwise I would probably just go for the awk solution by @Allan.
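For reference, here is one possible single-pass awk sketch in the same spirit (this is only an illustration, not necessarily the solution referred to above, and it does hold the grouped lines in memory): pre-sort by value descending, remember the order in which groups first appear (which is the order of their maxima), and print the buffered lines group by group.
sort -k2nr,2 49142202.txt | awk '
!seen[$1]++ { order[++n] = $1 }        # each group is first seen at its maximum, i.e. in the desired group order
{ lines[$1] = lines[$1] $0 ORS }       # buffer every line of each group
END { for (i = 1; i <= n; i++) printf "%s", lines[order[i]] }'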

Subtract corresponding lines

I have two files, file1.csv
3 1009
7 1012
2 1013
8 1014
and file2.csv
5 1009
3 1010
1 1013
In the shell, I want to subtract the count in the first column in the second file from that in the first file, based on the identifier in the second column. If an identifier is missing in the second column, the count is assumed to be 0.
The result would be
-2 1009
-3 1010
7 1012
1 1013
8 1014
The files are huge (several GB). The second columns are sorted.
How would I do this efficiently in the shell?
Assuming that both files are sorted on second column:
$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014
join will join sorted files.
-j2 will join on the second column.
-a1 will print records from file1 even if there is no corresponding row in file2.
-a2 does the same as -a1, but for file2.
-oauto is in this case the same as -o 0,1.1,2.1, which prints the join column and then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.
The output from join is three columns like:
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
This is piped to awk, which subtracts column three from column two and reformats the output.
$ awk 'NR==FNR { a[$2]=$1; next }
{ a[$2]-=$1 }
END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010
It reads the first file in memory so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:
$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2 # previous $2 = current $2 -> subtract
3 1010 2 # previous $2 =/= current and current $3=2 print -$3
7 1012 1
2 1013 1 # previous $2 =/= current and current $3=1 print prev $2
1 1013 2
8 1014 1
(I'm out of time for now, maybe I'll finish it later)
EDIT by Ed Morton
Hope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:
$ cat tst.awk
{ split(prev,p) }
$2 == p[2] {
print p[1] - $1, p[2]
prev = ""
next
}
p[2] != "" {
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
{ prev = $0 }
END {
split(prev,p)
print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk
-2 1009
-3 1010
7 1012
1 1013
8 1014
Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:
$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
All those options are required:
-j2 says to join based on the second column of each file
-o auto says to make every row have the same format, beginning with the join key
-e 0 says that missing values should be substituted with zero
-a 1 and -a 2 include rows that are absent from one file or another
the filenames (I've used names based on the question number here)
Now that we have a stream of output in that format, we can do the subtraction on each line. I used this GNU sed command to transform the above output into a dc program:
sed -e 's/.*/c& -n[ ]np/'
This takes the three values on each line and rearranges them into a dc command that performs the subtraction. For example, the first line becomes (with spaces added for clarity)
c 1009 3 5 -n [ ]n p
which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving
-2 1009
as required.
We can then pipe all these lines into dc, giving us the output file that we want:
$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
> | sed -e 's/.*/c& -n[ ]np/' \
> | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014
¹ The sorting needs to be consistent with LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.
TL;DR
The full command is:
join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc
It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.
This assumes the "csv" is blank-separated; if the separator is actually a comma, add the argument -F ','.
awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
{Discounts[$2]=$1; ids[$2]++}
END { for (id in ids) print Inits[ id] - Discounts[ id] " " id}
' file1.csv file2.csv
If memory is an issue (this could be done in a single pipeline, but using a temporary file is preferred; a single-pipeline sketch is shown below):
awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
| sort -k2 \
> file.tmp
awk 'Last != $2 {
if (NR != 1) print Result " " Last
Last = $2; Result = $1
next   # without this, the rule below would add the first line of each id twice
}
Last == $2 { Result += $1; next }
END { print Result " " Last }
' file.tmp
rm file.tmp
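For completeness, a hedged sketch of the single-pipeline variant mentioned above (no temporary file), assuming blank-separated input as in the question:
awk 'FNR==NR {print; next} {print -1 * $1 " " $2}' file1.csv file2.csv |
sort -k2,2 |
awk 'Last != $2 { if (NR != 1) print Result " " Last; Last = $2; Result = $1; next }
{ Result += $1 }
END { if (NR) print Result " " Last }'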

Minimal two column numeric input data for `sort` example, with distinct permutations

What's the least number of rows of two-column numeric input needed to produce four unique sort outputs from the following four options?
1. -sn -k1
2. -sn -k2
3. -sn -k1 -k2
4. -sn -k2 -k1
Here's a 6-row example (with 4 unique outputs):
6 5
3 7
6 3
2 7
4 4
5 2
As a convenience, here is a function that, given two columns of numbers as its arguments, prints how many of those four sorts produce unique output (it requires the moreutils pee command):
# Usage: foo c1_1 c2_1 c1_2 c2_2 ...
foo() { echo "$@" | tr -s '[:space:]' '\n' | paste - - | \
pee "sort -sn -k1 | md5sum" \
"sort -sn -k2 | md5sum" \
"sort -sn -k1 -k2 | md5sum" \
"sort -sn -k2 -k1 | md5sum" | \
sort -u | wc -l ; }
So to count the unique permutations of this input:
8 5
3 1
8 3
Run this:
foo 8 5 3 1 8 3
Output:
2
(Only two unique outputs. Not enough...)
Note: This question was inspired by the obscurity of the current version of the sort manual, specifically COLUMNS=65 man sort | grep -A 17 KEYDEF | sed 3,18d. The info sort page's treatment of KEYDEFs is much better.
KEYDEFs are more useful than they might first seem. The -u or --unique switch works nicely with the KEYDEFs, and in effect allows sort to delete unwanted redundant lines, and therefore can furnish a more concise substitute for certain sed or awk scripts and similar pipelines.
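For instance, the idiom used in answers elsewhere on this page combines two sorts with -u to keep, for each column-1 value, only the line with the largest column-2 value, a job that would otherwise call for awk (a sketch on a generic two-column file):
sort -k2nr,2 file | sort --stable -k1,1 -u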
I can do it in 3 by varying the whitespace:
1 1
2 1
1 2
Your foo function can't represent this kind of input (it normalizes the whitespace), but since it was only a "convenience" and not a part of the question proper, I declare this answer correct and minimal!
Sneakier version:
2 1
11 1
2 2
(The last line contains a tab; the others don't.)
With the -s option, I can't exploit non-numeric comparisons, but then I can exploit the stability of the sort:
1 2
2 1
1 1
The 1 1 line goes above both of the others if both fields are compared numerically, regardless of which comparison is done first. The ordering of the two comparisons determines the ordering of the other two lines.
On the other hand, if one of the fields isn't used for comparison, the 1 1 line stays below one of the other lines (and which one that is depends on which field is used for comparison).
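As a quick check, that three-row example can be fed to the foo function from the question (it has plain single-space columns, so foo's whitespace normalization doesn't matter; moreutils pee required). It should report 4:
foo 1 2 2 1 1 1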
