vcf to ped format: redefine non-dbSNPs


When I convert a vcf file to ped format (with vcftools or with the 1000G vcf to ped converter), variants that don't have a dbSNP ID get their base pair position as their ID. An example with a couple of variants:
1 rs35819278 0 23333187
1 23348003 0 23348003
1 23381893 0 23381893
1 rs18325622 0 23402111
1 rs23333532 0 23408301
1 rs55531117 0 23810772
1 23910834 0 23910834
However, I would like the variants without a dbSNP ID to get the format "chr:basepairposition". So the example above would look like:
1 rs35819278 0 23333187
1 chr1:23348003 0 23348003
1 chr1:23381893 0 23381893
1 rs18325622 0 23402111
1 rs23333532 0 23408301
1 rs55531117 0 23810772
1 chr1:23910834 0 23910834
It would be great if anyone could explain which command or script I can use to change this 2nd column for the variants without a dbSNP ID.
Thanks!

This can be done with sed. Since tabs are involved, the exact syntax may vary a bit depending on what sed is installed on your system; the following should work for Linux:
cat [.map filename] | sed 's/^\([0-9]*\)\t\([0-9]\)/\1\tchr\1:\2/g' > [new filename]
This looks for lines starting with [number][tab][digit], and makes them start with [number][tab]chr[number]:[digit] instead, while leaving other lines unchanged.
OS X is a bit more painful: the BSD sed shipped there may not understand \t, so you'll need to type a literal tab (ctrl-V followed by Tab) or use [[:blank:]] in the pattern.
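If the tab handling in sed is a concern, the same rewrite can be sketched with awk, which splits on any whitespace. This is a minimal sketch, not a tested pipeline: the function name fix_map_ids is made up for illustration, and it assumes a four-column .map file in which variants without a dbSNP ID carry an all-digit ID in column 2.

```shell
# Rewrite column 2 of a PLINK .map file: an ID made only of digits
# becomes chr<column 1>:<old ID>; every other line keeps its ID.
# Output is always tab-separated, whatever whitespace the input used.
fix_map_ids() {
  awk 'BEGIN { OFS = "\t" }
       $2 ~ /^[0-9]+$/ { $2 = "chr" $1 ":" $2 }
       { $1 = $1; print }' "$@"
}
```

It would be called as fix_map_ids [.map filename] > [new filename]. Because the test looks only at column 2, it also covers non-numeric chromosome codes such as X, which the sed pattern above skips.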

This can be done with plink2. You just need to use the --set-missing-var-ids option (https://www.cog-genomics.org/plink2/data#set_missing_var_ids) accordingly:
plink --vcf [filename] \
--keep-allele-order \
--vcf-idspace-to _ \
--double-id \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--set-missing-var-ids chr@:# \
--make-bed \
--out [prefix]
However, note that this method can assign the same ID to more than one variant (for example, several variants at the same position), and plink2 will not tolerate variants with the same ID. To learn more about converting VCF files to plink, the following resource has further insights: http://apol1.blogspot.com/2014/11/best-practice-for-converting-vcf-files.html
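One way to see whether any IDs collide is to list the values that occur more than once in column 2 of the resulting .bim (or .map) file. The helper name list_duplicate_ids below is hypothetical, and the sketch assumes the usual layout with the variant ID in the second column.

```shell
# Print every variant ID (column 2) that appears more than once.
# Reads the files given as arguments, or stdin if none are given.
list_duplicate_ids() {
  awk '{ print $2 }' "$@" | sort | uniq -d
}
```

For example, list_duplicate_ids [prefix].bim prints nothing when all IDs are unique.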

Related

Data Mining understanding ``` join aux_data.txt aux_cat.txt --header -1 1 -2 1 -t '|' -a 1 > aux_ticdata1.txt```

I'm starting with data mining on the terminal. Can anyone explain what the line join aux_data.txt aux_cat.txt --header -1 1 -2 1 -t '|' -a 1 > aux_ticdata1.txt does? I know that it joins two files and names the result "aux_ticdata1.txt". I also know that the line carries out two consecutive instructions, but I don't understand what -a 1 > aux_ticdata1.txt does. Any suggestions would be great!

replace exact number in shell

I have following matrix:
0.380451 0.381955 0 0.237594
0.317293 0.362406 0 0.320301
0.261654 0.38797 0 0.350376
0 0 0 1
0 1 0 0
0 0 0 1
0 0.001504 0 0.998496
0.270677 0.35188 0.018045 0.359398
0.36391 0.305263 0 0.330827
0.359398 0.291729 0.037594 0.311278
0.359398 0.276692 0.061654 0.302256
I want to replace only the values that are exactly 0 (not the zeros followed by a decimal point) with 0.001. How can I do that with sed or gsub?

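This is not elegant, and not super portable, but it handles your specific example (X stands in for the replacement value):
sed -e 's=^0 =X =
s= 0$= X=
:a
s= 0 = X =
ta' data.txt
First of all, it assumes that the fields in the input file are separated by one or more spaces. The first command replaces a "0" at the beginning of the line, the second a "0" at the end of the line, and the third a "0" with spaces on both sides. The third command runs in a loop (:a ... ta) because neighbouring zeros share a space, so a single pass with the g flag would skip every other zero in a run such as "0 0 0 1".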
Any particular reason to use only sed for this? I am sure that a simple awk script could do a better job, and also be more robust.
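As a rough idea of what such an awk script could look like, the sketch below compares each field as a string, so only fields that are exactly "0" are touched. The function name replace_exact_zeros is made up for illustration, and 0.001 is the replacement value asked for in the question.

```shell
# Replace every whitespace-separated field that is exactly "0" with 0.001.
# Comparing against the string "0" leaves values such as 0.001504 alone.
replace_exact_zeros() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i == "0") $i = "0.001"
    print
  }' "$@"
}
```

It can be run as replace_exact_zeros data.txt. Lines that get modified are written back with single spaces between fields.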

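Match the whitespace after the zero in your pattern. Note that this catches a 0 followed by a space or tab, but not a 0 at the end of the line.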
echo 0 0.001504 0 0.998496 | sed 's/0[\t ]/Z /g'

search lines of file for email address - returning whole line, with bash

Suppose I have a file (sizes.txt)
daveclark@foo.com 0 23252 0
mikeclark@foo.com 0 45131 1
clark@foo.com 0 55235 0
joeclark@bar.net 33632 1
maryclark@bar.net 0 55523 0
clark@bar.net 0 99356 0
Now I have another file (users.txt)
clark@foo.com
clark@bar.net
What I want to do is find each line in sizes.txt for the specific email addresses in users.txt, using a loop, bash or a one-liner in CentOS. Here's the key point: I need to find lines that only contain clark@foo.com and then clark@bar.net, meaning there should be one line only for each.
The simplest way that comes to mind...
for i in `cat users.txt`; do grep $i sizes.txt; done
...does not work, because processing the first line of users.txt will return the lines containing daveclark@foo.com, mikeclark@foo.com and clark@foo.com. I explicitly want the line containing "clark@foo.com" (the third line of sizes.txt). Processing the second line of users.txt will have the same problem (it will return the maryclark@bar.net and clark@bar.net lines). I know this has to be something totally simple that I'm overlooking.

What you are looking for is the exact match with grep. In your case that would be the -w option.
So
for i in $(cat users.txt); do
grep -w "^$i" sizes.txt
done
should do the trick.
Cheers.

You can try something like this using only bash built-in functions and syntax:
while read -r user ; do
while read -r s_user s_column_2 s_column_3 s_column_4 ; do
[ "${s_user}" = "${user}" ] && printf "%s\t%s\t%s\t%s\n" "${s_user}" "${s_column_2}" "${s_column_3}" "${s_column_4}"
done < sizes.txt
done < users.txt
This nested while loop could be slow with a big sizes.txt file. In that case you could combine it with awk.
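A sketch of what the awk variant might look like: it reads the list of users once into an array and then prints every line of the second file whose first column is in that list, so each file is read only once. The function name match_users is an assumption for illustration, and the sketch assumes the users file is not empty.

```shell
# $1: file with one address per line
# $2: file whose first whitespace-separated column is an address
# Prints the lines of $2 whose first column matches an address in $1 exactly.
match_users() {
  awk 'NR == FNR { wanted[$1] = 1; next }   # first file: remember each address
       ($1 in wanted)                        # second file: print exact matches
      ' "$1" "$2"
}
```

Called as match_users users.txt sizes.txt, it prints matching lines in the order they appear in sizes.txt.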

BASH script conditional statement issues

I'm trying to use a Bash script to run a large number of calculations (just over 2 million) using a terminal-based program called uvspec. But I've hit a serious barrier following the latest addition to the calculation...
The script opens an input file with about 2 million lines that look like this:
0 66.3426 -9.999 -9999
0 66.6192 -9.999 -9999
0 61.9212 1.655 1655
0 61.9999 1.655 1655
...
Each of these values represents a different value I want to substitute into the input file (using sed), so I read each line into an array. Many of these lines contain negative values in the 4th column, e.g. -9999, which result in errors in the program, so I would like to omit those lines and return a standard output instead. I'm doing this with the if statement. The problem is that my output is badly wrong, and I'm 99.9% sure the cause is a mistake in the following script, as I'm fairly new to bash.
Can anyone spot anything here that doesn't make sense or is bad syntax?
Any comments on the script in general would also be useful feedback.
cat ".../Maps/dniinput" | while IFS=$' ' read -r -a myArray
do
if [ "${myArray[3]}" -gt 0 ]
then
sed s/TAU/"${myArray[0]}"/ x.template x.template > a.template
sed s/SZA/"${myArray[1]}"/ a.template a.template > b.template
sed s/ALT/"${myArray[2]}"/ b.template b.template > x.inp
../bin/uvspec < x.inp >> dni.out
else
echo "0 -9999" >> dnijul.out
fi
done

Sed can do all three substitutions in one go and you can pipe the output straight into your analysis program without creating any intermediate a.template and b.template files...
sed -e "s/.../.../" -e "s/.../.../" -e "s/.../.../" x.template | ../bin/uvspec
By the way, you can also get rid of the "cat" at the start, and replace your array with variables whose names better match what they are, if you use a loop like this:
while IFS=$' ' read tau sza alt p4
do
echo $tau $sza $alt $p4
done < a
0 66.3426 -9.999 -9999
0 66.6192 -9.999 -9999
0 61.9212 1.655 1655
0 61.9999 1.655 1655
I named the fourth element "p4" because you refer to the 4th one as the altitude in your comment, but in your code you replace the word "ALT" with the third column - so I am not really sure what your parameters are, but you should hopefully get the idea from the example above.

You might want to combine those "sed" lines into something more like:
sed -e "s/TAU/${myArray[0]}/" -e "s/SZA/${myArray[1]}/" \
-e "s/ALT/${myArray[2]}/" < x.template \
| ../bin/uvspec >> dni.out
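Putting the two suggestions together, the whole loop might look like the sketch below. The function names fill_template and process_inputs are hypothetical, and the model command is passed in as a parameter (defaulting to cat) so the sketch can be tried without uvspec installed. Note that the script in the question appends results to dni.out but the fallback line to dnijul.out. This sketch writes both to standard output so that a single redirection decides where everything goes.

```shell
# Fill the TAU, SZA and ALT placeholders of a template read from stdin.
fill_template() {
  sed -e "s/TAU/$1/" -e "s/SZA/$2/" -e "s/ALT/$3/"
}

# Read "tau sza alt flag" lines from stdin.
# $1: template file; $2: command standing in for ../bin/uvspec (default: cat).
process_inputs() {
  template=$1
  run_model=${2:-cat}
  while read -r tau sza alt flag; do
    if [ "$flag" -gt 0 ]; then
      # stdin of the pipeline is the template, so the loop input is untouched
      fill_template "$tau" "$sza" "$alt" < "$template" | "$run_model"
    else
      echo "0 -9999"
    fi
  done
}
```

With the real program this would be run along the lines of process_inputs x.template ../bin/uvspec < [input file] > dni.out.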

Checking if strings exist in a file (ksh)

What this KornShell (ksh) script should do is check in my dmesg for disks, internals and externals. On the dmesg output, internal drives appear as wd[0-9] and externals as sd[0-9]. I of course do not have so many disks but I want my script to cover as many possibilities as possible. Devices 0-9 will be checked. So this is the idea:
Create two arrays of size 9 and search through dmesg: if wd0 exists, set the first element of internals to 1, otherwise to 0; if wd1 exists, set the second element to 1, otherwise to 0; and so on.
If it was to search for one specific disk e.g. wd0, I could do something like:
internal=`dmesg | grep "^wd0" | head -n 1 | cut -d' ' -f1`
which makes
internal = wd0
But how can I check whether the strings wd0-wd9 exist in the dmesg output in a "loopy" way?
# create arrays
set -A internals 0 0 0 0 0 0 0 0 0
set -A externals 0 0 0 0 0 0 0 0 0
(the code below is not ksh code, but presenting the idea in c-like syntax):
for (i=0;i<=8;i++){
if (wd0) # that is if wd0 exists, if wd1 exists etc.
internal[i] = 1;
else
internal[i] = 0;
}
And of course the same process should be followed for the externals.

I think the following should get you going, not exactly sure what values you want your arrays set to but here goes:
#!/bin/ksh
set -A internal
for i in {0..9}
do
echo "Looking for wd"$i
internal[$i]=`dmesg | grep "^wd$i" | head -n 1 | cut -d' ' -f1`
if [[ ${internal[$i]} = "wd"$i ]];then
internal[$i]=1
else
internal[$i]=0
fi
done
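To cover both the internal and the external disks without repeating the loop, the check could be wrapped in a small function. This is only a sketch: the name disk_flags is made up, the dmesg text is passed in as an argument so that it is captured once, and the loop avoids {0..9} because not every ksh implementation expands it.

```shell
# $1: device prefix (wd or sd); $2: the dmesg text to search.
# Prints ten 0/1 flags separated by spaces; flag i is 1 if some line starts
# with <prefix><i> followed by a non-digit (so wd1 does not match wd10).
disk_flags() {
  prefix=$1
  log=$2
  flags=""
  i=0
  while [ "$i" -le 9 ]; do
    if printf '%s\n' "$log" | grep -q "^${prefix}${i}[^0-9]"; then
      flags="$flags 1"
    else
      flags="$flags 0"
    fi
    i=$((i + 1))
  done
  echo $flags   # unquoted on purpose: drops the leading space
}
```

In ksh the result can then be loaded into the arrays with, for example, log=$(dmesg); set -A internals $(disk_flags wd "$log"); set -A externals $(disk_flags sd "$log").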
