Error parsing strand (?) from GFF line - Happening in various programs - bioinformatics

I'm working with various genomic data, and while trying to use gffread, stringtie and cufflinks I ran into the same error each time:
Error parsing strand (?) from GFF line: NC_037304.1 RefSeq gene 58315 59481 . ? . ID=gene-DA397_mgp34;Dbxref=GeneID:36335702;Name=nad1;exception=trans-splicing;gbkey=Gene;gene=nad1;gene_biotype=protein_coding;locus_tag=DA397_mgp34;part=2
I recognize this is a formatting issue in the GFF file (the example shown here is from TAIR 10), but it has happened with several species. From what I read in this discussion, "'?' can be used for features whose strandedness is relevant, but unknown." If that's the case, shouldn't these programs recognize the '?' character? More importantly, how can I get past this? Is the only option to look for other annotations? I'd find that odd, because the TAIR 10 genome is well known for its quality, so I doubt that its GFF file contains such a simple mistake...
Thank you!
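One possible workaround, offered as a sketch rather than a confirmed fix: set aside the feature lines whose strand column (column 7) is '?' and feed only the remaining lines to the tool. The function name below is hypothetical, and the code assumes a tab-separated GFF3 file. Child features that refer to a removed gene through Parent= are not handled here and may need the same treatment.

```python
def split_unknown_strand(lines):
    """Separate GFF feature lines whose strand column is '?' from the rest.

    Hypothetical helper: comment/directive lines and blank lines are kept as-is.
    """
    kept, unknown = [], []
    for line in lines:
        # Directives (##gff-version ...), comments and blank lines pass through.
        if line.startswith("#") or not line.strip():
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) holds the strand: '+', '-', '.' or '?'.
        if len(fields) >= 7 and fields[6] == "?":
            unknown.append(line)
        else:
            kept.append(line)
    return kept, unknown
```

Writing the `unknown` lines to a separate file makes it easy to see how many features are affected (here, a trans-spliced mitochondrial gene) before deciding whether dropping them is acceptable for the analysis.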

Related

Why are '[' and ']' ascii codes not following each other?

Does anyone know why the design decision of having '[' and ']' or '{' and '}' ASCII key codes being two apart instead of one digit was made? OCD triggered.
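For reference, the code points in question can be checked directly. This is only a small illustration of the gap being asked about: in each pair, one character sits between the opening and closing bracket.

```python
# Map each bracket (and the character between each pair) to its ASCII code.
codes = {ch: ord(ch) for ch in "[\\]{|}"}

for ch, code in codes.items():
    print(repr(ch), code)
```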
Evolution of Character Codes
Have a look at the following article: https://web.archive.org/web/20050305043226/http://www.transbay.net/~enf/ascii/ascii.pdf
I found it via the following Stack Exchange answer: https://softwareengineering.stackexchange.com/a/149901/94281
ASCII was the result of an evolution, but it was based on previous work and inventions such as telegrams and Morse code.
Additionally, there were many different versions and proposals before a final order was reached.
It seems that in some of the initial proposals [ and ] were placed next to each other.
For example:
However, after the X3.2 meeting, the \ was added in between:
This is again visible in a publication from 1962:
Source: Source documents on the history of character codes, 1962-06
Substitution Characters
Reading the archive from Source documents on the history of character codes, page 38 shows that some of the characters were grouped together and were planned to be substituted with other characters in 26 character languages:
A similar note is made about the characters '< = >' in relation to substituting them for more business-friendly characters.
References:
The Evolution of Character Codes, 1874-1968 - Eric Fischer
https://github.com/ericfischer/ascii/blob/master/ascii.ms
Source documents on the history of character codes, 1962-06
Note: Researching this further to see why the X3.2 meeting resulted in this change.

Error when converting .gprobs files from Impute2 to PLINK format

I have a set of .gprobs files that I need to import into Plink. However, I keep getting the same error -- a problem in a specific line, even after I removed that line and the lines around it.
The data: I concatenated all 22 chromosome .gprobs files. To do so, I replaced the '---' at the beginning of each line of the individual .gprobs files with the corresponding chromosome number (so now each line starts CHR SNP BP A1 A2...). I also removed the SNPs that weren't imputed well (INFO scores below 0.7).
Code:
plink --gen data_chrALL.gprobs_chrcol_below0.7inforemoved --sample data_chr1.sample --out data_chrALL.gprobs_plink
The error message:
--data: 13404k variants converted.
Error: Line 13404781 of .gen file has fewer tokens than expected.
As I said above, I removed that specific line and reran it, and got the same exact error message. I tried removing the lines above and below (in case the numbering was off by a header or something?) but again, same exact error.
Any thoughts or suggestions would be greatly appreciated!!! I'm not sure if this is the best place to post this, but I'm in desperate need of help.
Plink is trying to tell you that it expects a certain number of items on each line (3N+5 fields, where N is the number of samples) and that on some lines it doesn't see them. So:
(1) First of all, I would compare the lines causing errors with ones that do not, to check that the number of tokens/columns is actually the same and correct, and that there are no extra spaces or special characters which could cause escaping or misreading of the line. I would also check which variants are causing trouble: maybe they are multiallelic or indels or something else that Plink doesn't know how to deal with. Or maybe there are no minor-allele homozygotes at all for that variant and this is expressed incorrectly.
(2) I would check the specifications for the input files, both .gen and .sample, to see that yours are correct. As the files originate from Impute2, there might be some subtle differences.
(3) I would also update Plink. From the command it seems that you are using either version 1.07 or 1.9. The 1.x versions cannot represent probabilities and will make hard calls, so you lose a lot of information because of that. Plink 2.0 can use the probabilities and should also have better support for them. You will still be able to use hard calls if you want.
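The comparison in step (1) can be sketched roughly as below. The function name and its parameters are hypothetical. The code assumes whitespace-separated fields and the 3N+5 layout mentioned above, and it assumes N is taken as the number of lines in the .sample file minus its two header lines.

```python
def find_bad_gen_lines(lines, n_samples, n_leading=5):
    """Report lines whose token count differs from 3 * n_samples + n_leading.

    Hypothetical helper: returns the expected count and a list of
    (line_number, token_count) pairs for every line that does not match.
    """
    expected = 3 * n_samples + n_leading
    bad = []
    for lineno, line in enumerate(lines, start=1):
        # split() with no argument collapses runs of spaces and tabs.
        n_tokens = len(line.split())
        if n_tokens != expected:
            bad.append((lineno, n_tokens))
    return expected, bad
```

A file object can be passed as `lines`, so the whole concatenated file is scanned without loading it into memory. If the reported line numbers cluster where one chromosome file ends and the next begins, it may be worth inspecting those boundaries in the concatenated file.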

Prolog syntax error: operator expected

I was studying Prolog and ran into "syntax error: operator expected" for the following code:
odd_list(X,Y):-process_list(X,Y,1).
process_list(X,[N1|Y],N):-N1 is 2*N-1,N1 < X,N2 is N+1,process_list(X,Y,N2).
process_list(X,[],N):-2*N-1>=X.
That's all the code I wrote. What's the problem? I found some solutions saying that there are unexpected white spaces in the functors or arguments, but I think I do not include any white space in the above-mentioned places.
Thank you all for helping me!!!
Remark: I find that when I name the source code as "Test1.pl", I get this error. But when I name it as "test1.pl", there is no error. Does it mean that the file name cannot start with an upper case letter?
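I found the reason for this problem. I used the file name 'Test1'. Most likely the issue is that an unquoted name starting with an upper-case letter is read by Prolog as a variable rather than an atom, so it cannot be used bare as a file name. I changed the file name to 'test1' and it works now.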

Using phrase_from_file to read a file's lines

I've been trying to parse a file containing lines of integers using phrase_from_file with the grammar rules
line --> I,line,{integer(I)}.
line --> ['\n'].
thusly: phrase_from_file(line,'input.txt').
It fails, and I got lost very quickly trying to trace it.
I've even tried to print I, but it doesn't even get there.
EDIT::
As none of the solutions below really fit my needs (using read/1 assumes you're reading terms, and sometimes writing that DCG might just take too long), I cannibalized this code I googled, the main changes being the addition of:
read_rest(-1,[]):-!.
read_word(C,[],C) :- ( C=32 ;
C=(-1)
) , !.
If you are using phrase_from_file/2 there is a very simple way to test your programs prior to reading actual files. Simply call the very same non-terminal with phrase/2. Thus, a goal
phrase(line,"1\n2").
is the same as calling
phrase_from_file(line,fichier)
when fichier is a file containing the above 3 characters. So you can test and experiment in a very compact manner with phrase/2.
There are further issues that @Jan Burse already mentioned. SWI reads in character codes. So you have to write
newline --> "\n".
for a newline. And then you still have to parse integers yourself. But all that is tested much easier with phrase/2. The nice thing is that you can then switch to reading files without changing the actual DCG code.
I guess there is a conceptual problem here. Although I don't know the details of phrase_from_file/2, i.e. which Prolog system you are using, I nevertheless assume that it will produce character codes. So for an integer 123 in the file you will get the character codes 0'1, 0'2 and 0'3. This is probably not what you want.
If you would like to process the characters, you would need to use a terminal list such as [I] instead of a bare variable I to fetch them. And instead of the integer test, you would need a character test, and you can do the test earlier:
line --> [I], {0'0=<I, I=<0'9}, line.
Best Regards
P.S.: Instead of going the DCG way, you could also use term read operations. See also:
read numbers from file in prolog and sorting

Parsing out abnormal characters

I have to work with text that was previously copy/pasted from an excel document into a .txt file. There are a few characters that I assume mean something to excel but that show up as an unrecognised character (i.e. that '?' symbol in gedit, or one of those rectangles in some other text editors.). I wanted to parse those out somehow, but I'm unsure of how to do so. I know regular expressions can be helpful, but there really isn't a pattern that matches unrecognisable characters. How should I set about doing this?
You could maybe work with http://spreadsheet.rubyforge.org/ to read / parse the data.
I suppose you're getting these characters because the text file contains invalid Unicode characters, which means your '?'s and rectangles could actually be unrecognized multi-byte sequences.
If you want to handle the spreadsheet contents properly, I recommend first exporting the data to CSV using (Open|Libre)Office and choosing UTF-8 as the file encoding.
https://en.wikipedia.org/wiki/Comma-separated_values
If you are not worried about multi byte sequences I find this regex to be handy:
line.gsub( /[^0-9a-zA-Z\-_]/, '*' )
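If the goal is to drop only the unrecognized characters and keep everything else, a sketch along the following lines might help. It is written in Python rather than Ruby, and the function name is hypothetical. It assumes the file is meant to be UTF-8, so bytes that do not decode are treated as the junk to remove, and it also strips control characters other than tab, newline and carriage return.

```python
import unicodedata


def clean_text(raw: bytes) -> str:
    """Decode bytes as UTF-8 and drop undecodable bytes and control characters."""
    # Invalid byte sequences become U+FFFD (the replacement character).
    text = raw.decode("utf-8", errors="replace")
    out = []
    for ch in text:
        if ch == "\ufffd":
            continue  # was an undecodable byte
        if unicodedata.category(ch) == "Cc" and ch not in "\n\t\r":
            continue  # control character other than common whitespace
        out.append(ch)
    return "".join(out)
```

Unlike the character whitelist above, this keeps spaces, punctuation and valid accented letters. Reading the file in binary mode (`open(path, "rb")`) gives the `bytes` input the function expects.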
