Error when converting .gprobs files from Impute2 to PLINK format - bioinformatics

I have a set of .gprobs files that I need to import into Plink. However, I keep getting the same error -- a problem in a specific line, even after I removed that line and the lines around it.
The data: I concatenated all 22 chromosome .gprobs files. To do so, I replaced the '---' at the beginning of the individual .gprobs files with the corresponding chromosome number (so now each line starts CHR SNP BP A1 A2...). I also removed the SNPs that weren't imputed well (INFO scores below 0.7).
Code:
plink --gen data_chrALL.gprobs_chrcol_below0.7inforemoved --sample data_chr1.sample --out data_chrALL.gprobs_plink
The error message:
--data: 13404k variants converted.
Error: Line 13404781 of .gen file has fewer tokens than expected.
As I said above, I removed that specific line and reran it, and got the same exact error message. I tried removing the lines above and below (in case the numbering was off by a header or something?) but again, same exact error.
Any thoughts or suggestions would be greatly appreciated!!! I'm not sure if this is the best place to post this, but I'm in desperate need of help.

Plink is trying to tell you that it expects a certain number of items on each line (3N+5 fields, where N is the number of samples) and that on some lines it doesn't see them. So,
(1) First of all, I would compare the lines causing errors to the ones that do not, to check that the number of tokens/columns is actually the same and correct, and that there are no extra spaces or special characters which could cause escaping or misreading of the line. I would also check which variants are causing trouble: maybe they are multiallelic or indels or something else that Plink doesn't know how to deal with. Or maybe there are no minor allele homozygotes at all for that variant and this is expressed in an incorrect manner.
(2) I would check the specifications for the input files, both .gen and .sample to see that they are correct. As the files originate from Impute2 there might be some subtle differences.
(3) I would also update the Plink version. From the code it seems that you are using either version 1.07 or 1.9. The 1.x versions cannot represent probabilities and will make hard calls, so you lose a lot of information. Plink 2.0 can use the probabilities and should also have better support for them. You will still be able to use hard calls if you want.
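The token check from point (1) can be sketched in Python. This is only a rough illustration, assuming an uncompressed, whitespace-separated .gen file with the extra chromosome column (so 3N+5 tokens per line) and an Oxford-style .sample file with two header lines; the function names are made up for this example.

```python
def count_samples(sample_path):
    """Number of samples in a .sample file, assuming two header lines."""
    with open(sample_path) as handle:
        rows = [line for line in handle if line.strip()]
    return len(rows) - 2


def find_bad_lines(gen_path, n_samples, limit=10):
    """Return (line_number, token_count) for lines that lack 3N+5 tokens."""
    expected = 3 * n_samples + 5
    bad = []
    with open(gen_path) as handle:
        for number, line in enumerate(handle, start=1):
            tokens = len(line.split())
            if tokens != expected:
                bad.append((number, tokens))
                if len(bad) >= limit:
                    break  # stop early; a few offenders are enough to inspect
    return bad
```

Calling find_bad_lines("data_chrALL.gprobs_chrcol_below0.7inforemoved", count_samples("data_chr1.sample")) would list the first few offending lines with their token counts, which can then be compared against a line that converts cleanly.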

Related

Bash: count how many times a word is contained in all the files of a given folder

I'm just trying to count the occurrences of a word without writing an iteration file by file. I don't mind which kind of file it is. The closest I got is:
COUNT=$(grep -r -n -i "theWordImSearchingFor" .)
echo $COUNT
I thought about splitting that by spaces, but the problem is that the output contains not just the filename and the line number but also the content (and that may have tons of spaces). E.g. I got:
./doc1.txt:29: This is the content containing theWordImSearchingFor but also other stuff
./doc1.txt:43: This is another line containing theWordImSearchingFor
./dir123/doc2.txt:339: .This is another...file...theWordImSearchingFor....
Any idea on how to keep it simple? TIA
To count the occurrences of a specific word, you can keep the same overall approach but simplify it. There are many ways to do this; here are two much simpler versions of the word count you listed.
The two much simpler versions:
1st way
2nd way
They both should work, unless there is a problem with package installation.
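As a separate illustration (not the answerer's two ways), here is a minimal Python sketch that walks a folder and counts case-insensitive occurrences of a word in every file. The function name count_word is made up for this example, and it counts substring matches, much like grep -i would match them.

```python
import os
import re


def count_word(root, word):
    """Count case-insensitive occurrences of `word` in all files under `root`."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="ignore" lets binary or oddly encoded files pass through
                with open(path, "r", errors="ignore") as handle:
                    for line in handle:
                        total += len(pattern.findall(line))
            except OSError:
                continue  # unreadable file: skip it
    return total
```

For the question's case, count_word(".", "theWordImSearchingFor") would return a single number for the current folder and everything below it.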

Use bash to extract data between two regular expressions while keeping the formatting

I have a question about a small piece of code using the awk command. I have not found an answer/solution anywhere.
I am trying to parse an output file and extract all data between the 1st expression (including) ATOMIC and 2nd expression (excluding) Bond. This data is to be sent to a new file $1_geom. So far I have the following:
`awk '/ATOMIC/{flag=1;next}/Bond lengths in Bohr/{flag=0}flag' $1` >> $1_geom
This script will extract the correct data for me, but there are 2 problems:
(1) The line ATOMIC is not extracted with the data.
(2) The data is extracted and appended to a single line. I want the data to retain the formatting from the parsed file (5 columns, variable number of lines). Please see the attachment for a visual example. Is there a different way to append data (other than >>) so that I can keep the formatting?
Any help is appreciated, thank you.
The next is causing the first match to be skipped; take it out if you don't want that.
The backticks by themselves are a shell syntax error (unless your Awk script happens to produce valid shell commands). I'm guessing you have a useless echo or something like that in your actual script which disarms the error, but instead produces the symptoms you describe.
This was part of a csh script, and I did have an "echo" in front of this line. Removing the "echo" makes it work perfectly and addresses the 2 questions that I had.
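The flag logic of the awk one-liner, with the next removed so that the ATOMIC line is kept, can be mirrored in a few lines of Python. This is a sketch for illustration only; extract_block is a made-up name, and it uses plain substring tests where awk uses regular expressions.

```python
def extract_block(lines, start="ATOMIC", stop="Bond lengths in Bohr"):
    """Keep each run of lines from a `start` line (included) to the next `stop` line (excluded)."""
    keep = False
    out = []
    for line in lines:
        if stop in line:
            keep = False  # the stop line itself is not kept
        elif start in line:
            keep = True   # the start line is kept, unlike with awk's `next`
        if keep:
            out.append(line)  # lines keep their own newlines, so formatting survives
    return out
```

Writing the result with writelines() on a file opened in append mode keeps one output line per input line, which is the formatting that the echo in the csh script was flattening.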

Regex for Git commit message

I'm trying to come up with a regex for enforcing Git commit messages to match a certain format. I've been banging my head against the keyboard modifying the semi-working version I have, but I just can't get it to work exactly as I want. Here's what I have now:
/^([a-z]{2,4}-[\d]{2,5}[, \n]{1,2})+\n{1}^[\w\n\s\*\-\.\:\'\,]+/i
Here's the text I'm trying to enforce:
AB-1432, ABC-435, ABCD-42

Here is the multiline description, following a blank
line after the Jira issue IDs
- Maybe bullet points, with either dashes
* Or asterisks
Currently, it matches that, but it will also match if there's no blank line after the issue IDs, and if there are multiple blank lines after.
Is there any way to enforce that, or will I just have to live with it?
It's also pretty ugly; I'm sure there's a more succinct way to write that out.
Thanks.
Your regex allows for \n as one of the possible characters after the required newline, so that's why it matches when there are multiple.
Here's a cleaned up regex:
/^([a-z]{2,4}-\d{2,5}(?=[, \n]),? ?\n?)+^\n([-\w\s*.:',]+\n)+/i
Notes:
This requires at least one [-\w\s*.:',] character before the next newline.
I changed the issue IDs to have one possible comma, space, and newline, in that order (up to one of each). Can you use lookaheads? If so, I added (?=[, \n]) to make sure the issue ID is followed by at least one of those characters.
Also notice that many of the characters don't need to be escaped in a character class.
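To try the cleaned-up regex, here is a small Python sketch. It assumes multiline mode (re.MULTILINE), because the second ^ can only match at a line start in that mode. One caveat it shows: \s also matches a newline, so the pattern as written still accepts several blank lines after the issue IDs. STRICT_RE is my own variant, not part of the answer; it swaps \s for a literal space, so tabs in the description would be rejected.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE

# The regex from the answer, unchanged apart from the flags.
COMMIT_RE = re.compile(
    r"^([a-z]{2,4}-\d{2,5}(?=[, \n]),? ?\n?)+^\n([-\w\s*.:',]+\n)+", FLAGS
)

# Hypothetical stricter variant: \s replaced by a space in the description class.
STRICT_RE = re.compile(
    r"^([a-z]{2,4}-\d{2,5}(?=[, \n]),? ?\n?)+^\n([-\w *.:',]+\n)+", FLAGS
)

good = (
    "AB-1432, ABC-435, ABCD-42\n"
    "\n"
    "Here is the multiline description, following a blank\n"
    "line after the Jira issue IDs\n"
    "- Maybe bullet points, with either dashes\n"
    "* Or asterisks\n"
)
no_blank = "AB-1432, ABC-435, ABCD-42\nHere is the description\n"
two_blank = "AB-1432\n\n\nDescription\n"


def is_valid(message, pattern=COMMIT_RE):
    """True if the commit message matches from its first character."""
    return pattern.match(message) is not None
```

With these samples, is_valid(good) holds for both patterns and is_valid(no_blank) fails for both, while two_blank passes COMMIT_RE but not STRICT_RE.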

Adding text to the beginning of multiple files in Notepad++

I have many text files, and I need to add some text (e.g. MNP) to the beginning of the first line in each file.
How can I do this in Notepad++?
(I'm using v6.6.9)
Make sure to back up your work beforehand, and set the proper file extension filter and the folder to search through before you do this.
You can use regular expressions. Several places around the internet claim that the regex \A works, but it wasn't working for me: it was cycling through the file byte by byte. I found that \A^ sticks to position 0 of the file.
Oddly, I additionally found that I couldn't replace \A or \A^ and have it take effect. This is what worked for me.
Find: \A^(.*?)
Replace: MNP\1
Truthfully, the \1 in Replace isn't even necessary since I'm cheating and basically telling Notepad++ to look for 0 characters.
This should work just as well.
Find: \A^.*?
Replace: MNP
Please back up your work beforehand though.
Alternatively, this also seems to work.
Find: .{0}(.*)
Replace: MNP\1
It effectively looks for 0 characters followed by the whole document/line (depending on whether ". matches newline" is checked; this choice won't matter for the outcome, however).
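Outside Notepad++, the same idea (anchor at the very start of the file with \A and insert the text there) can be sketched in Python. This is an illustration under assumptions: the function names, the *.txt pattern and UTF-8 encoding are all made up for the example, and the files are overwritten in place, so the advice about backing up applies here too.

```python
import glob
import os
import re


def prepend_to_text(text, prefix="MNP"):
    """Insert `prefix` once, at the very start of the text (\\A matches only there)."""
    return re.sub(r"\A", lambda _match: prefix, text)


def prepend_to_files(folder, pattern="*.txt", prefix="MNP"):
    """Rewrite every matching file in `folder` with `prefix` added to its first line."""
    changed = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        # newline="" keeps the original line endings untouched
        with open(path, "r", encoding="utf-8", newline="") as handle:
            text = handle.read()
        with open(path, "w", encoding="utf-8", newline="") as handle:
            handle.write(prepend_to_text(text, prefix))
        changed.append(path)
    return changed
```

Only the first line of each file changes; the remaining lines are written back exactly as they were read.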

Inserting characters before whatever is on a line, for many lines

I have been looking at regular expressions to try and do this, but the most I can do is find the start of a line with ^, but not replace it.
I can then find the first characters on a line to replace, but cannot do it in a way that keeps them intact.
Unfortunately I don't have access to a tool like cut since I am on a Windows machine... so is there any way to do what I want with just regexp?
Use Notepad++. It offers a way to record a sequence of actions which can then be repeated for all lines in the file.
Did you try replacing the regular expression ^ with the text you want to put at the start of each line? Also you should use the multiline option (also called m in some regex dialects) if you want ^ to match the start of every line in your input rather than just the first.
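using System;
using System.Text.RegularExpressions;
string s = "test test\ntest2 test2";
// RegexOptions.Multiline makes ^ match at the start of every line
s = Regex.Replace(s, "^", "foo", RegexOptions.Multiline);
Console.WriteLine(s);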
Result:
footest test
footest2 test2
I used to program on the mainframe and got used to SPF panels. I was thrilled to find a Windows version of the same editor at Command Technology. Makes problems like this drop-dead simple. You can use expressions to exclude or include lines, then apply transforms on just the excluded or included lines and do so inside of column boundaries. You can even take the contents of one set of lines and overlay the contents of another set of lines entirely or within column boundaries which makes it very easy to generate mass assignments of values to variables and similar tasks. I use Notepad++ for most stuff but keep a copy of SPFSE around for special-purpose editing like this. It's not cheap but once you figure out how to use it, it pays for itself in time saved.
