How to remove snps with missing names - genetics

I have 1000 G data set which is in PLINK format, there are some snps with names as ".", is there any way in PLINK i can remove that snps?
I tried bcftool view which does not work correctly.

Execute the following command
plink --bfile $YOUR_GENOTYPE_FILE --extract SNPS_TO_EXCLUDE.txt --make-bed --out $NEW_GENOTYPE_FILE
where the $ variables are your desired PLINK BED/BIM/BAM file prefixes.
What does SNPS_TO_EXCLUDE.txt look like? From the PLINK website:
--extract normally accepts a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces), and removes all unlisted variants from the current analysis.
--exclude does the same for all listed variants.
Thus, SNPS_TO_EXCLUDE.txt should contain a line with ".".

Related

grep -w -f is not returning all matches from list

I am trying to use a list that looks like this:
List file:
1mAF
2mAF
4mAF
7mAF
9mAF
10mAF
11mAF
13mAF
18mAF
27mAF
33mAF
36mAF
37mAF
38mAF
39mAF
40mAF
41mAF
45mAF
46mAF
47mAF
49mAF
57mAF
58mAF
60mAF
61mAF
62mAF
63mAF
64mAF
67mAF
82mAF
86mAF
87mAF
95mAF
96mAF
to grab out lines that contain a word-level match in a tab delimited file that looks like this:
File_of_interest:
11mAF-NODE_111-g7687-JEFF-tig00000037_arrow-g7396-AFLA_058530 11mAF cluster63
17mAF-NODE_343-g9350 17mAF cluster07
18mAF-NODE_34-g3647-JEFF-tig00000037_arrow-g7396-AFLA_058530 18mAF cluster20
22mAF-NODE_36-g3735 22mAF cluster28
36mAF-NODE_107-g7427 36mAF cluster77
45mAF-NODE_151-g9067 45mAF cluster14
47mAF-NODE_30-g3242-JEFF-tig00000037_arrow-g7396-AFLA_058530 47mAF cluster21
67mAF-NODE_54-g4372 67mAF cluster06
69mAF-NODE_27-g2754 69mAF cluster39
71mAF-NODE_44-g4178 71mAF cluster25
73mAF-NODE_47-g4895 73mAF cluster57
78mAF-NODE_4-g688 78mAF cluster53
but when I do grep -w -f list file_of_interest these are the only ones I get:
18mAF-NODE_34-g3647-JEFF-tig00000037_arrow-g7396-AFLA_058530 18mAF cluster20
36mAF-NODE_107-g7427 36mAF cluster77
45mAF-NODE_151-g9067 45mAF cluster14
and this misses a bunch of the values that are in the original list for example note that "67mAF" is in the list and in the file but it isn't returned.
I have tried removing everything after "mAF" in the list and trying again -- no change. I have rewritten the list in a completely new file to no avail. Oddly, I get more of them if I "sort" the list into a new file and then do the grep, but I still don't get all of them. I have also removed all invisible characters using sed (sed $'s/[^[:print:]\t]//g'). no change.
I am on OSX and both files were created on OSX, but normally grep -f -w works in the fashion i'm describing above.
I am completely flummoxed. Is I thought grep -w -f would look for all word-level matches of items in the file in the target file... am I wrong?
Thanks!
My guess is at least one of these files originates from a Windows machine and has CRLF line endings. file(1) might be used to tell you. If that is the case do:
fromdos FILE
or, alternatively:
dos2unix FILE

How to write a script to fetch the address of the links to .rar files on a webpage?

Have a pile of 50 .rar files on a web server and I want to download them all.
And, the names of the files have nothing in common other than .rar.
I wanted to try aria2 to download all of them altogether, but I think I need to write a script to fetch the addresses of all the .rar files.
I have no idea how to start writing the scrip. Any hint will be appreciated.
You can try to play with wget with -A parameter in your shell script:
wget -r "https://foo/" -P /tmp -A "*.rar"
Here is an explanation of what -A does
Specify comma-separated lists of file name suffixes or patterns to accept or reject (see Types of Files). Note that if any of the wildcard characters, ‘’, ‘?’, ‘[’ or ‘]’, appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix. In this case, you have to enclose the pattern into quotes to prevent your shell from expanding it, like in ‘-A ".mp3"’ or ‘-A '*.mp3'’.

Looping over files in a folder for shell script with multiple inputs

Specifying multiple inputs for command line tool?
I am new to bash and am wanting to loop a command line program over a folder containing numerous files.
The script takes two input files (in my case, these differ in one field of the file name ("...R1" vs "...R2"). Running a single instance of the tool looks like this:
tool_name infile1 infile2 -o outfile_suffix
Actual example:
casper sample_name_R1_001.out.fastq sample_name_R2_001.out.fastq -o sample_name_merged
File name format:
DCP-137-5102-T1A3_S33_L001_R1_001.fastq
DCP-137-5102-T1A3_S33_L001_R2_001.fastq
The field in bold will vary between different pairs (e.g., 2000, 2110, 5100 etc...) with each pair distinguished by either R1 or R2.
I would like to know how to loop the script over a folder containing numerous pairs of matched files, and also ensure that the output (-o) gets the 'sample_name' suffix.
I am familiar with the basic for file in ./*.*; do ... $file...; done but that obviously won't work for this example. Any suggestions would be appreciated!
You want to loop over the R1's and derive the R2 and merged-file names from that, something like:
for file1 in ./*R1*; do
file2=${file1/R1/R2}
merge=${file1#*R1}_merged
casper ${file1} ${file2} -o ${merge}
done
Note: Markdown is showing the #*R1}_merged as a comment -- it's not

Using bash to list files with a certain combination of characters

So I have a directory with ~50 files, and each contain different things. I often find myself not remembering which files contain what. (This is not a problem with the naming -- it is sort of like having a list of programs and not remembering which files contain conditionals).
Anyways, so far, I've been using
cat * | grep "desiredString"
for a string that I know is in there. However, this just gives me the lines which contain the desired string. This is usually enough, but I'd like it to give me the file names instead, if at all possible.
How could I go about doing this?
It sounds like you want grep -l, which will list the files that contain a particular string. You can also just pass the filename arguments directly to grep and skip cat.
grep -l "desiredString" *
In the directory containing the files among which you want to search:
grep -rn "desiredString" .
This can list all the files matching "desiredString", with file names, matching lines and line numbers.

How do I open / manipulate multiple files in bash?

I have a bash script that take advantage of a local toolbox to perform an operation
my question is fairly simple
I have multiple files that are the same quantities but different time steps i would like to first untar them all, and then use the toolbox to perform some manipulation but i am not sure if i am on the right track.
=============================================
The file is as follows
INPUTS
fname = a very large number or files with same name but numbering
e.g wnd20121.grb
wnd20122.grb
.......
wnd2012100.grb
COMMANDS
> cdo -f nc copy fname ofile(s)
(If this is the ofile(s)=output file how can i store it for sequent use ? Take the ofile (output file) from the command and use it / save it as input to the next, producing a new subsequent numbered output set of ofile(s)2)
>cdo merge ofile(s) ofile2
(then automatically take the ofile(s)2 and input them to the next command and so on, producing always an array of new output files with specific set name I set but different numbering for distinguishing them)
>cdo sellon ofile(s)2 ofile(s)3
------------------------------------
To make my question clearer, I would like to know the way in which I can instruct basically through a bash script the terminal to "grab" multiple files that are usually the same name but have a different numbering to make the separate their recorded time
e.g. file1 file2 ...file n
and then get multiple outputs , with every output corresponding to the number of the file it converted.
e.g. output1 output2 ...outputn
How can I set these parameters so the moment they are generated they are stored for subsequent use in the script, in later commands?
Your question isn't clear, but perhaps the following will help; it demonstrates how to use arrays as argument lists and how to parse command output into an array, line by line:
#!/usr/bin/env bash
# Create the array of input files using pathname expansion.
inFiles=(wnd*.grb)
# Pass the input-files array to another command and read its output
# - line by line - into a new array, `outFiles`.
# The example command here simply prepends 'out' to each element of the
# input-files array and outputs each (modified) element on its own line.
# Note: The assumption is that the filenames have no embedded newlines
# (which is usually true).
IFS=$'\n' read -r -d '' -a outFiles < \
<(printf "%s\n" "${inFiles[#]}" | sed s'/^/out-/')
# Note: If you use bash 4, you could use `readarray -t outFiles < <(...)` instead.
# Output the resulting array.
# This also demonstrates how to use an array as an argument list
# to pass to _any_ command.
printf "%s\n" "${outFiles[#]}"

Resources