yaml.scanner.ScannerError: while scanning a directive


I use PyYAML to read a file. The Python code is:
import yaml

with open('demo.yml') as f:
    dataMap = yaml.load(f)
demo.yml:
%YAML:1.0
my_svm: !!opencv-ml-svm
   svm_type: C_SVC
   kernel: { type:LINEAR }
   C: 1.
The error is:
yaml.scanner.ScannerError: while scanning a directive
  in "demo.yml", line 1, column 1
expected alphabetic or numeric character, but found ':'
  in "demo.yml", line 1, column 6
Can someone help me?

The directive should be %YAML 1.0 (with a space, not a colon). You will also need a "document start" marker (---) to separate your directives from the document. E.g.:
%YAML 1.0
---
my_svm: !!opencv-ml-svm
   svm_type: C_SVC
   kernel: { type: LINEAR }
   C: 1.

You can modify the YAML file created by OpenCV 3.0.
file1 (as written by OpenCV):
1 %YAML:1.0
2 my_svm: !!opencv-ml-svm
3 svm_type: C_SVC
4 kernel: { type:LINEAR }
5 C: 1.
6 ...
file2:
1 my_svm:
2 svm_type: C_SVC
3 kernel: { type: LINEAR }
4 C: 1.
5 ...
file1 -> file2:
delete line 1
delete "!!opencv-ml-svm" in line 2
add a space after "type:" in line 4
Then you can open the file and pass the file object to yaml.load() to load your data.
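The three edits listed above can also be applied in memory before parsing. The following is only a sketch using the standard library: the function name normalize_opencv_yaml, the sample text, and its 3-space indentation are assumptions for illustration, and the regular expressions are heuristics that only cover the patterns shown in this thread.

```python
import re


def normalize_opencv_yaml(text):
    """Apply the three manual edits to OpenCV-style YAML text (heuristic sketch)."""
    lines = text.splitlines()
    # Step 1: drop the non-standard "%YAML:1.0" header line if it is present.
    if lines and lines[0].startswith("%YAML:"):
        lines = lines[1:]
    cleaned = []
    for line in lines:
        # Step 2: remove OpenCV-specific tags such as "!!opencv-ml-svm".
        line = re.sub(r"\s*!!opencv-[\w-]+", "", line)
        # Step 3: in flow mappings ({ ... }), make sure a space follows each colon.
        if "{" in line:
            line = re.sub(r":(?=[^\s}])", ": ", line)
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Hypothetical sample; the indentation is assumed, not taken from a real file.
sample = (
    "%YAML:1.0\n"
    "my_svm: !!opencv-ml-svm\n"
    "   svm_type: C_SVC\n"
    "   kernel: { type:LINEAR }\n"
    "   C: 1.\n"
)
fixed = normalize_opencv_yaml(sample)
print(fixed)
```

The returned string can then be handed to PyYAML, for example with yaml.safe_load(fixed), instead of editing the file by hand.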

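This worked for me (it uses the legacy cv module, which is only available in OpenCV 2.x, and Python 2 print statements):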
from cv2 import cv
import numpy as np
filepath = "test.yml"
matrixA = np.array( cv.Load(filepath, cv.CreateMemStorage(), "matrixA") )
matrixB = np.array( cv.Load(filepath, cv.CreateMemStorage(), "matrixB") )
print "matrixA:", matrixA
print "matrixB:", matrixB
As seen in:
http://xudongai.blogspot.jp/2013/08/how-to-use-python-to-load-opencv-yml.html

Related

How to use my own corpus on word embedding model BERT

I am trying to create a question-answering model with the word-embedding model BERT from Google. I am new to this and would really like to use my own corpus for training. At first I used an example from the Hugging Face site, and that worked fine:
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2"
)
qa_pipeline({
    'context': "Amsterdam is de hoofdstad en de dichtstbevolkte stad van Nederland.",
    'question': "Wat is de hoofdstad van Nederland?"})
Output:
> {'answer': 'Amsterdam', 'end': 9, 'score': 0.825619101524353, 'start': 0}
So, I tried creating a .txt file to test if it was possible to interchange the sentence in the context parameter with the exact same sentence but in a .txt file.
with open('test.txt') as f:
    lines = f.readlines()
qa_pipeline = pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2"
)
qa_pipeline({
    'context': lines,
    'question': "Wat is de hoofdstad van Nederland?"})
But this gave me the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-2bae0ecad43e> in <module>()
     10 qa_pipeline({
     11     'context': lines,
---> 12     'question': "Wat is de hoofdstad van Nederland?"})

5 frames
/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py in _is_whitespace(c)
     84
     85 def _is_whitespace(c):
---> 86     if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
     87         return True
     88     return False

TypeError: ord() expected a character, but string of length 66 found
I was just experimenting with ways to read and use a .txt file, but I don't seem to find a different solution. I did some research on the huggingface pipeline() function and this is what was written about the question and context parameters:

Got it! The solution was really easy. I assumed that the variable 'lines' was already a str, but that wasn't the case: f.readlines() returns a list of strings, one per line. Just by casting it to a string, the question-answering model accepted my test.txt file.
So from:
with open('test.txt') as f:
    lines = f.readlines()
to:
with open('test.txt') as f:
    lines = str(f.readlines())
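One caveat with the cast: str() applied to a list returns the list's printed form, so the context handed to the model also contains the brackets, quotes, and an escaped newline. A minimal sketch of an alternative is to read the file with f.read(), which returns one plain str. The helper name read_context and the temporary file below are only illustrative.

```python
import os
import tempfile


def read_context(path):
    """Return the whole file as a single str, without surrounding whitespace."""
    with open(path, encoding="utf-8") as f:
        return f.read().strip()


sentence = "Amsterdam is de hoofdstad en de dichtstbevolkte stad van Nederland."

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sentence + "\n")

    with open(path, encoding="utf-8") as f:
        as_list = f.readlines()       # a list with one str per line
    context = read_context(path)      # a single str, usable as 'context'

print(type(as_list).__name__, type(context).__name__)
print(str(as_list))   # the cast keeps the list syntax inside the text
print(context)
```

For a corpus spanning several lines, " ".join(line.strip() for line in as_list) would be another way to get a single string.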

How to build an empirical codon substitution matrix from a multiple sequence alignment

I have been trying to build an empirical codon substitution matrix from a multiple sequence alignment in FASTA format using Biopython.
It appears to be relatively straightforward for single-nucleotide substitution matrices using the AlignInfo module when the aligned sequences have the same length. Here is what I managed to do using Python 2.7:
#!/usr/bin/env python
import os
import argparse
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio import SubsMat
import sys
version = "0.0.1 (23.04.20)"
name = "Aln2SubMatrix.py"
parser=argparse.ArgumentParser(description="Outputs a codon substitution matrix given a multi-alignment in FastaFormat. Will raise error if alignments contain dots (\".\"), so replace those with dashes (\"-\") beforehand (e.g. using sed)")
parser.add_argument('-i','--input', action = "store", dest = "input", required = True, help = "(aligned) input fasta")
parser.add_argument('-o','--output', action = "store", dest = "output", help = "Output filename (default = <Input-file>.codonSubmatrix)")
args=parser.parse_args()
if not args.output:
    args.output = args.input + ".codonSubmatrix"  # if no output name was specified, derive it from the input name
def main():
    infile = open(args.input, "r")
    outfile = open(args.output, "w")
    align = AlignIO.read(infile, "fasta")
    summary_align = AlignInfo.SummaryInfo(align)
    replace_info = summary_align.replacement_dictionary()
    mat = SubsMat.SeqMat(replace_info)
    print >> outfile, mat
    infile.close()
    outfile.close()
    sys.stderr.write("\nfinished\n")
main()
Using a multiple sequence alignment file in FASTA format with sequences of the same length (aln.fa), the output is a half-matrix with the number of nucleotide substitutions observed in the alignment (note that gaps (-) are allowed):
python Aln2SubMatrix.py -i aln.fa
-   0
a 860 232
c 596  75 129
g 571 186  75 173
t 892  58 146  59 141
    -   a   c   g   t
What I am aiming to do is compute a similar empirical substitution matrix, but for all nucleotide triplets (codons) present in a multiple sequence alignment.
I have tried to tweak the _pair_replacement function of the AlignInfo module so that it accepts nucleotide triplets, by changing lines 305 to 308 from
for residue_num in range(len(seq1)):
    residue1 = seq1[residue_num]
    try:
        residue2 = seq2[residue_num]
to
for residue_num in range(0, len(seq1), 3):
    residue1 = seq1[residue_num:residue_num+3]
    try:
        residue2 = seq2[residue_num:residue_num+3]
At this stage it can retrieve the codons from the alignment, but it complains about the alphabet (does the module only accept single-character alphabets?).
Note that
(i) I would like to get a substitution matrix that accounts for the three possible reading frames
Any help is highly appreciated.
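One way to sidestep the alphabet check is to count codon pairs directly, without AlignInfo. The sketch below is not Biopython's method: it uses only the standard library, takes already-aligned sequences as plain strings, and the function name codon_replacements and the toy alignment are hypothetical. It compares every pair of sequences codon by codon for a given reading-frame offset and then sums the three frames.

```python
from collections import Counter
from itertools import combinations


def codon_replacements(seqs, frame=0):
    """Count unordered codon pairs seen at the same alignment position.

    seqs  : aligned sequences as plain strings of equal length
    frame : reading-frame offset (0, 1 or 2)
    """
    counts = Counter()
    for seq1, seq2 in combinations(seqs, 2):
        usable = min(len(seq1), len(seq2))
        # Step through complete triplets only; a trailing partial codon is ignored.
        for i in range(frame, usable - 2, 3):
            codon1 = seq1[i:i + 3].upper()
            codon2 = seq2[i:i + 3].upper()
            # Sort so that (AAA, AAG) and (AAG, AAA) land in the same cell.
            counts[tuple(sorted((codon1, codon2)))] += 1
    return counts


# Toy alignment (hypothetical data); gaps are kept as ordinary characters.
alignment = ["ATGAAA", "ATGAAG", "ATG---"]

frame0 = codon_replacements(alignment, frame=0)
# Accounting for the three reading frames by summing the per-frame counts.
all_frames = sum((codon_replacements(alignment, f) for f in range(3)), Counter())

for pair, n in sorted(frame0.items()):
    print(pair, n)
```

With real data the strings could be taken from the alignment read by AlignIO, for example str(record.seq) for each record, and the Counter can be written out as a half-matrix in whatever layout is needed.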

Error remove over-representative sequences : TypeError: coercing to Unicode: need string or buffer, NoneType found

Hi, I am running this Python script to remove over-represented sequences from my FASTQ files, but I keep getting an error. I am new to bioinformatics and have been following a fixed pipeline for sequence assembly. I ran the script like this:
python /home/TranscriptomeAssemblyTools/RemoveFastqcOverrepSequenceReads.py -1 R1_1.fq -2 R1_2.fq
Here is the error:
Traceback (most recent call last):
  File "TranscriptomeAssemblyTools/RemoveFastqcOverrepSequenceReads.py", line 46, in <module>
    leftseqs=ParseFastqcLog(opts.l_fastqc)
  File "TranscriptomeAssemblyTools/RemoveFastqcOverrepSequenceReads.py", line 33, in ParseFastqcLog
    with open(fastqclog) as fp:
TypeError: coercing to Unicode: need string or buffer, NoneType found
Here is the script:
import sys
import gzip
from os.path import basename
import argparse
import re
from itertools import izip,izip_longest
def seqsmatch(overreplist,read):
    flag=False
    if overreplist!=[]:
        for seq in overreplist:
            if seq in read:
                flag=True
                break
    return flag
def get_input_streams(r1file,r2file):
    if r1file[-2:]=='gz':
        r1handle=gzip.open(r1file,'rb')
        r2handle=gzip.open(r2file,'rb')
    else:
        r1handle=open(r1file,'r')
        r2handle=open(r2file,'r')
    return r1handle,r2handle
def FastqIterate(iterable,fillvalue=None):
    "Grab one 4-line fastq read at a time"
    args = [iter(iterable)] * 4
    return izip_longest(fillvalue=fillvalue, *args)
def ParseFastqcLog(fastqclog):
    with open(fastqclog) as fp:
        for result in re.findall('Overrepresented sequences(.*?)END_MODULE', fp.read(), re.S):
            seqs=([i.split('\t')[0] for i in result.split('\n')[2:-1]])
    return seqs
if __name__=="__main__":
    parser = argparse.ArgumentParser(description="options for removing reads with over-represented sequences")
    parser.add_argument('-1','--left_reads',dest='leftreads',type=str,help='R1 fastq file')
    parser.add_argument('-2','--right_reads',dest='rightreads',type=str,help='R2 fastq file')
    parser.add_argument('-fql','--fastqc_left',dest='l_fastqc',type=str,help='fastqc text file for R1')
    parser.add_argument('-fqr','--fastqc_right',dest='r_fastqc',type=str,help='fastqc text file for R2')
    opts = parser.parse_args()
    leftseqs=ParseFastqcLog(opts.l_fastqc)
    rightseqs=ParseFastqcLog(opts.r_fastqc)
    r1_out=open('rmoverrep_'+basename(opts.leftreads).replace('.gz',''),'w')
    r2_out=open('rmoverrep_'+basename(opts.rightreads).replace('.gz',''),'w')
    r1_stream,r2_stream=get_input_streams(opts.leftreads,opts.rightreads)
    counter=0
    failcounter=0
    with r1_stream as f1, r2_stream as f2:
        R1=FastqIterate(f1)
        R2=FastqIterate(f2)
        for entry in R1:
            counter+=1
            if counter%100000==0:
                print "%s reads processed" % counter
            head1,seq1,placeholder1,qual1=[i.strip() for i in entry]
            head2,seq2,placeholder2,qual2=[j.strip() for j in R2.next()]
            flagleft,flagright=seqsmatch(leftseqs,seq1),seqsmatch(rightseqs,seq2)
            if True not in (flagleft,flagright):
                r1_out.write('%s\n' % '\n'.join([head1,seq1,'+',qual1]))
                r2_out.write('%s\n' % '\n'.join([head2,seq2,'+',qual2]))
            else:
                failcounter+=1
    print 'total # of reads evaluated = %s' % counter
    print 'number of reads retained = %s' % (counter-failcounter)
    print 'number of PE reads filtered = %s' % failcounter
    r1_out.close()
    r2_out.close()

Maybe you have already solved it, but I had the same error and now it runs fine. Hope this helps.
The error appears because -fql and -fqr were not given on the command line, so opts.l_fastqc is None when ParseFastqcLog tries to open it.
(1) Files the script needs:
usage: RemoveFastqcOverrepSequenceReads.py [-h] [-1 LEFTREADS] [-2 RIGHTREADS] [-fql L_FASTQC] [-fqr R_FASTQC]
(2) Specify the fastqc_data.txt files from the FastQC output (unzip the output directory first):
'-fql','--fastqc_left',dest='l_fastqc',type=str,help='fastqc text file for R1'
'-fqr','--fastqc_right',dest='r_fastqc',type=str,help='fastqc text file for R2'
(3) Keep the reads and the fastqc_data text files in the same directory.
(4) Specify the path before each file:
python RemoveFastqcOverrepSequenceReads.py -1 ./bicho.fq.1.gz -2 ./bicho.fq.2.gz -fql ./fastqc_data_bicho_1.txt -fqr ./fastqc_data_bicho_2.txt
(5) Run! :)
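The steps above work because argparse sets every omitted optional argument to None, and open(None) is what raised the TypeError. A minimal sketch of how the script's parser could reject such a call early is shown below. The build_parser helper and its strict flag are hypothetical and not part of the original script; required=True is a standard argparse parameter.

```python
import argparse


def build_parser(strict):
    """Recreate the script's four options; strict=True marks them all as required."""
    parser = argparse.ArgumentParser(
        description="options for removing reads with over-represented sequences")
    parser.add_argument('-1', '--left_reads', dest='leftreads', type=str,
                        required=strict, help='R1 fastq file')
    parser.add_argument('-2', '--right_reads', dest='rightreads', type=str,
                        required=strict, help='R2 fastq file')
    parser.add_argument('-fql', '--fastqc_left', dest='l_fastqc', type=str,
                        required=strict, help='fastqc text file for R1')
    parser.add_argument('-fqr', '--fastqc_right', dest='r_fastqc', type=str,
                        required=strict, help='fastqc text file for R2')
    return parser


# The call from the question: only -1 and -2 are given.
argv = ['-1', 'R1_1.fq', '-2', 'R1_2.fq']

# As in the original script, the missing options silently become None.
opts = build_parser(strict=False).parse_args(argv)
print(opts.l_fastqc, opts.r_fastqc)

# With required=True, argparse prints a usage error and exits instead.
try:
    build_parser(strict=True).parse_args(argv)
    rejected = False
except SystemExit:
    rejected = True
print("rejected:", rejected)
```

Failing at argument parsing gives a usage message that names the missing options, which is easier to act on than a TypeError from open().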

statsmodels Error Message: "ValueError: v must be > 1 when p >= .9"

I am trying to perform a multiple-sample comparison with Tukey HSD using the statsmodels module, but I keep getting this error message: "ValueError: v must be > 1 when p >= .9". I have looked on the internet for a possible solution, but to no avail. Could anyone familiar with this module help me work out what I am doing wrong to trigger this error? I use Python 2.7.x and Spyder. Below is a sample of my data and the print statement. Thanks!
import numpy as np
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,MultiComparison)
###--- Here are the data I am using:
data1 = np.array([ 1, 1, 1, 1, 976, 24, 1, 1, 15, 15780])
data2 = np.array(['lau15', 'gr17', 'fri26', 'bays29', 'dantzig4', 'KAT38','HARV50', 'HARV10', 'HARV20', 'HARV41'], dtype='|S8')
####--- Here's my print statement code:
print pairwise_tukeyhsd(data1, data2, alpha=0.05)

Seems you have to provide more data than a single observation per group, in order for the test to work.
Minimal example:
from statsmodels.stats.multicomp import pairwise_tukeyhsd,MultiComparison
data=[1,2,3]
groups=['a','b','c']
print("1st try:")
try:
    print(pairwise_tukeyhsd(data,groups, alpha=0.05))
except ValueError as ve:
    print("whoops!", ve)
data.append(2)
groups.append('a')
print("2nd try:")
try:
    print( pairwise_tukeyhsd(data, groups, alpha=0.05))
except ValueError as ve:
    print("whoops!", ve)
Output:
1st try:
/home/user/.local/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3367: RuntimeWarning: Degrees of freedom <= 0 for slice
  **kwargs)
/home/user/.local/lib/python3.7/site-packages/numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
whoops! v must be > 1 when p >= .9
2nd try:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
====================================================
group1 group2 meandiff p-adj  lower    upper  reject
----------------------------------------------------
     a      b      0.5   0.1  -16.045  17.045  False
     a      c      1.5   0.1  -15.045  18.045  False
     b      c      1.0   0.1 -18.1046 20.1046  False
----------------------------------------------------
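The reason a single observation per group fails can be seen by counting: the within-group degrees of freedom are the number of observations minus the number of groups, which is 0 for the data in the question. The following is a rough pre-check using only the standard library. The function name tukey_precheck is hypothetical, and the exact threshold statsmodels enforces is not asserted here.

```python
from collections import Counter


def tukey_precheck(groups):
    """Return (within-group degrees of freedom, groups with a single observation)."""
    sizes = Counter(groups)
    n_obs = sum(sizes.values())
    n_groups = len(sizes)
    residual_df = n_obs - n_groups
    singletons = sorted(g for g, n in sizes.items() if n == 1)
    return residual_df, singletons


# Group labels from the question: ten groups, one observation each.
question_groups = ['lau15', 'gr17', 'fri26', 'bays29', 'dantzig4',
                   'KAT38', 'HARV50', 'HARV10', 'HARV20', 'HARV41']
df_question, single_question = tukey_precheck(question_groups)
print(df_question, single_question)

# Labels from the second try in the answer: group 'a' has two observations.
df_answer, single_answer = tukey_precheck(['a', 'b', 'c', 'a'])
print(df_answer, single_answer)
```

Running such a check before pairwise_tukeyhsd makes it clear which groups need more observations before the test can estimate a within-group variance.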

X10 reading from a file not as expected

I encountered the following behavior when reading from a text file.
val input = new File(inputFileName);
val inp = input.openRead();
Console.OUT.println(inp.lines().next());
if (inp.lines().hasNext())
    Console.OUT.println(inp.lines().next());
My input file contains:
0 1
0 2
0 3
As a result I get:
0 1
0 3
It seems that inp.lines().hasNext() has moved the pointer forward, and as a result one line of the text file is skipped.
Is this a bug?

Yes, this looks like a bug. x10.io.FileReader.lines().hasNext() should not be skipping forward in the text file.
Could you please raise an issue in the X10 JIRA project?
