I have prepared a script for PyMOL that works well for evaluating RMSD values for a list of residues of proteins of interest (the targeted residues are generated by commands embedded in the script). However, I wanted to extend the script to let the user select the residues to be analysed. I tried to use the input function, but it fails to work. Since the code is a bit long, I progressively simplified the scripts, and ended up with the following two scripts:
"test.py":
import Bio.PDB
import numpy as np
from pymol import cmd
import os,glob
from LFGA_functions import user_entered
List_tot = user_entered()
print (List_tot)
which calls the simple "user_entered" function from inside the "LFGA_functions.py" script:
def user_entered():
    List_res = []
    List_tot = []
    a = input("Please indicate the number of residues to analyze/per monomer:")
    numberRes = int(a)
    for i in range(numberRes):
        Res = input("Please provide each residue NUMBER and hit ENTER:")
        Res1 = int(Res)
        Res2 = Res1 + 2000
        Res3 = Res1 + 3000
        Res4 = Res1 + 4000
        Res5 = Res1 + 5000
        Res6 = Res1 + 6000
        List_res = (str(Res1), str(Res2), str(Res3), str(Res4), str(Res5), str(Res6))
        List_tot.append(List_res)
    return List_tot
The script "test.py" works well when executed by Python (3.7.5, which is installed with Pymol 2.3.4,under Windows 7 Prof) from a windows command line. An example of result:
(first input to indicate that 2 cases will be treated, followed by the identification number for each case)
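For illustration, a session with two hypothetical residue numbers (15 and 42) would look like this; the tuples follow directly from the +2000 ... +6000 offsets in user_entered:
Please indicate the number of residues to analyze/per monomer:2
Please provide each residue NUMBER and hit ENTER:15
Please provide each residue NUMBER and hit ENTER:42
[('15', '2015', '3015', '4015', '5015', '6015'), ('42', '2042', '3042', '4042', '5042', '6042')]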
However, when the script is run from the PyMOL GUI, I get the following error message: input(): lost sys.stdin
Does anybody know what the problem is? Obviously, I am a very primitive Python user...
I will also take the opportunity to ask something related to this problem. In the original script that works well in PyMOL, before trying to implement the "input" option, I transiently store some data (residue name, type and chain) in the form of a set. I need a set, and not a list, because it automatically removes repetitions. However, when I try to run it from Python (thinking of overcoming the above-mentioned input problem), it stops with the message: name 'close_to_A' is not defined. Yet the set is defined, as shown in the portion of code below, and it does work in PyMOL.
...
# Step 3:
# listing of residues sufficiently close to chainA
# that will be considered in calculation of RMSD during trajectory
cmd.load("%s/%s" %(path3,"Average.pdb"), "Average", quiet=0)
cmd.select("around_A", "Average and chain B+C near_to 5 of chain A")
cmd.select("in_A", "Average and chain A near_to 5 of chain B+C")
close_to_A = set()
cmd.iterate("(around_A)","close_to_A.add((chain,resi,resn))")
cmd.iterate("(in_A)","close_to_A.add((chain,resi,resn))")
cmd.delete("around_A")
cmd.delete("in_A")
...
How shall I define a set in Python 3? Why does it work when run from PyMOL?
I would really appreciate it if you could help me solve these two problems. Thank you in advance,
Best regards,
Luis
Related
I am trying to train ResNet 50 for image classification using this code: https://github.com/mlperf/training/blob/master/image_classification/tensorflow/official/resnet/imagenet_main.py, but during the process I get an error at line 322, seed = int(argv[1]): IndexError: list index out of range. How can I fix this error? I am using TensorFlow 2.0.
Your error message IndexError: list index out of range suggests that the list argv has length < 2 (since index 1 is already out of bounds).
sys.argv contains the list of argument values that you pass on the command line. The first element, argv[0], is always the name of the script you invoke.
In your case this means that you did not pass any arguments.
So looking at the code, it tries to read, as the first command line argument, a seed that it can use to initialise various random number generators (RNGs).
So what you want to do is invoke your script with something like:
python imagenet_main.py r4nd0ms33d
where r4nd0ms33d is some random value.
Sometimes explicitly seeding the RNG with a fixed value can be useful, e.g. when you want your training results to be reproducible.
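As a minimal sketch (assuming TensorFlow 2.x, matching your setup), explicit seeding of the common RNGs looks like this:
import random
import sys

import numpy as np
import tensorflow as tf

seed = int(sys.argv[1])   # the command line seed, as in imagenet_main.py
random.seed(seed)         # Python's built-in RNG
np.random.seed(seed)      # NumPy's RNG
tf.random.set_seed(seed)  # TensorFlow 2.x global seed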
You can play around with this simple Python program:
import sys

if __name__ == '__main__':
    print(sys.argv)
Save it into test.py and see what it prints if you run it as python test.py versus python test.py random versus python test.py random seed.
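The expected output for those three invocations:
$ python test.py
['test.py']
$ python test.py random
['test.py', 'random']
$ python test.py random seed
['test.py', 'random', 'seed']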
I have written a Python 2.7 script that reads a CSV file and then does some standard deviation calculations. It works absolutely fine, but it is very, very slow. A CSV I tried with 100 million lines took around 28 hours to complete. I did some googling and it appears that using the pandas module might make this quicker.
I have posted part of the code below. Since I am a pretty novice Python user, I am unsure whether using pandas would actually help at all, and if it did, whether the function would need to be completely rewritten.
Just some context for the CSV file: it has 3 columns; the first column is an IP address, the second is a URL and the third is a timestamp.
import csv

def parseCsvToDict(filepath):
    with open(filepath) as f:  # was open(csv_file_path); use the function argument
        ip_dict = dict()
        csv_data = csv.reader(f)
        f.next()  # skip header line (Python 2)
        for row in csv_data:
            # Some lines in the csv have more/less than the 3 fields they
            # should have, so this is a cheat to get the script working
            # while ignoring any wrong data
            if len(row) == 3:
                current_ip, URI, current_timestamp = row
                epoch_time = convert_time(current_timestamp)  # convert each time to epoch
                if current_ip not in ip_dict.keys():
                    ip_dict[current_ip] = dict()
                if URI not in ip_dict[current_ip].keys():
                    ip_dict[current_ip][URI] = list()
                ip_dict[current_ip][URI].append(epoch_time)
    return ip_dict
Once the above function has finished, the data is passed to another function that calculates the standard deviation for each IP/URL pair (using numpy.std).
Do you think that using pandas would increase the speed, and would it require a complete rewrite, or is it easy to modify the above code?
The following should work:
import pandas as pd
colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
df = pd.read_csv(filepath, names=colnames)
# Remove incomplete and redundant rows:
df = df[~df.current_timestamp.isnull() & df.dummy.isnull()]
Notice this assumes you have enough RAM. In your code you are already assuming you have enough memory for the dictionary, but the latter may be significantly smaller than the memory used by the above, for two reasons.
If it is because most lines are dropped, then just parse the csv by chunks: the arguments skiprows and nrows are your friends, and then pd.concat (see the sketch below).
If it is because IPs/URLs are repeated, then you will want to transform IPs and URLs from normal columns to indices: parse by chunks as above, and on each chunk do
indexed = df.set_index(["current_IP", "URI"]).sort_index()
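A minimal sketch of the chunked variant, using read_csv's chunksize argument instead of manual skiprows/nrows bookkeeping (the filename and chunk size are placeholders):
import pandas as pd

colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
chunks = []
for chunk in pd.read_csv("access_log.csv", names=colnames, chunksize=1000000):
    # keep only the well-formed 3-field rows, as above
    chunk = chunk[~chunk.current_timestamp.isnull() & chunk.dummy.isnull()]
    chunks.append(chunk.drop("dummy", axis=1))
df = pd.concat(chunks)
indexed = df.set_index(["current_IP", "URI"]).sort_index()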
I expect this will indeed give you a performance boost.
EDIT: ... including a performance boost to the calculation of the standard deviation (hint: df.groupby())
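For example, a sketch of the groupby approach, assuming the timestamps have already been converted to a numeric epoch_time column (both column names are placeholders):
# per-(IP, URI) standard deviation in a single call:
stds = df.groupby(["current_IP", "URI"])["epoch_time"].std()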
I will not be able to give you an exact solution, but here are a couple of ideas.
Based on your data, you read 100000000 / 28 / 60 / 60, i.e. approximately 1000 lines per second. Not really slow, but I believe that just reading such a big file can cause a problem.
So take a look at this performance comparison of how to read a huge file. Basically it suggests that doing this:
file = open("sample.txt")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass  # do something
can give you something like a 3x read boost. I also suggest you try defaultdict instead of your "if key not in dict: create []; then append" pattern (sketched below).
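A minimal sketch of that defaultdict idea, applied to the loop in parseCsvToDict above:
from collections import defaultdict

# two-level mapping of lists: ip_dict[ip][uri] -> list of epoch times
ip_dict = defaultdict(lambda: defaultdict(list))

# inside the CSV loop the membership tests disappear:
ip_dict[current_ip][URI].append(epoch_time)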
And last, not related to Python: working in data analysis, I have found an amazing tool for working with csv/json data. It is csvkit, which allows you to manipulate csv data with ease.
In addition to what Salvador Dali said in his answer: If you want to keep as much of the current code of your script, you may find that PyPy can speed up your program:
“If you want your code to run faster, you should probably just use PyPy.” — Guido van Rossum (creator of Python)
I have created syntax in SPSS that gives me 90 separate iterations of a general linear model, each with slightly different variations of fixed factors and covariates. In the output file, they are all just named "General Linear Model." I have to manually rename each analysis in the output, and I want to find syntax that will add a more specific name to each result to help me identify it among the other 89 results (e.g. "General Linear Model - Males Only: Mean by Gender w/ Weight covariate").
This is an example of one analysis from the syntax:
USE ALL.
COMPUTE filter_$=(Muscle = "BICEPS" & Subj = "S1" & SMU = 1 ).
VARIABLE LABELS filter_$ 'Muscle = "BICEPS" & Subj = "S1" & SMU = 1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
GLM Frequency_Wk6 Frequency_Wk9
Frequency_Wk12 Frequency_Wk16
Frequency_Wk20
/WSFACTOR=Time 5 Polynomial
/METHOD=SSTYPE(3)
/PLOT=PROFILE(Time)
/EMMEANS=TABLES(Time)
/CRITERIA=ALPHA(.05)
/WSDESIGN=Time.
I am looking for syntax to add to this that will name the analysis "S1, SMU1 BICEPS, GLM" - not naming the whole output file, but each analysis within the output, so I don't have to do it one by one. I sometimes have over 200 iterations coming out in a single output file, and renaming them individually within the output file is taking too much time.
Making an assumption that you are exporting the models to Excel (please clarify otherwise).
There is an undocumented command (OUTPUT COMMENT TEXT) that you can utilize here. There is also a custom extension, TEXT, designed to achieve the same, but it would need to be explicitly downloaded via:
Utilities-->Extension Bundles-->Download And Install Extension Bundles--->TEXT
You can use OUTPUT COMMENT TEXT to assign a title/descriptive text just before the output of the GLM model (in the example below I have used FREQUENCIES as an example).
get file="C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\Employee data.sav".
oms /select all /if commands=['output comment' 'frequencies'] subtypes=['comment' 'frequencies']
/destination format=xlsx outfile='C:\Temp\ExportOutput.xlsx' /tag='ExportOutput'.
output comment text="##Model##: This is a long/descriptive title to help me identify the next model that is to be run - jobcat".
freq jobcat.
output comment text="##Model##: This is a long/descriptive title to help me identify the next model that is to be run - gender".
freq gender.
output comment text="##Model##: This is a long/descriptive title to help me identify the next model that is to be run - minority".
freq minority.
omsend tag=['ExportOutput'].
You could use the TITLE command here also, but it is limited to only 60 characters.
You would have to change the OMS tags appropriately if using TITLE or TEXT.
Edit:
Given the OP wants to actually add a title to the left hand pane in the output viewer, a solution for this is as follows (credit to Albert-Jan Roskam for the Python code):
First save the python file "editTitles.py" to a valid Python search path (for example (for me anyway): "C:\ProgramData\IBM\SPSS\Statistics\23\extensions")
#editTitles.py
import tempfile, os, sys
import SpssClient

def _titleToPane():
    """See titleToPane(). This function does the actual job"""
    outputDoc = SpssClient.GetDesignatedOutputDoc()
    outputItemList = outputDoc.GetOutputItems()
    textFormat = SpssClient.DocExportFormat.SpssFormatText
    filename = tempfile.mktemp() + ".txt"
    for index in range(outputItemList.Size()):
        outputItem = outputItemList.GetItemAt(index)
        if outputItem.GetDescription() == u"Page Title":
            outputItem.ExportToDocument(filename, textFormat)
            with open(filename) as f:
                outputItem.SetDescription(f.read().rstrip())
            os.remove(filename)
    return outputDoc

def titleToPane(spv=None):
    """Copy the contents of the TITLE command of the designated output document
    to the left output viewer pane"""
    try:
        outputDoc = None
        SpssClient.StartClient()
        if spv:
            SpssClient.OpenOutputDoc(spv)
        outputDoc = _titleToPane()
        if spv and outputDoc:
            outputDoc.SaveAs(spv)
    except:
        print "Error filling TITLE in Output Viewer [%s]" % sys.exc_info()[1]
    finally:
        SpssClient.StopClient()
Re-start SPSS Statistics and run below as a test:
get file="C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\Employee data.sav".
title="##Model##: jobcat".
freq jobcat.
title="##Model##: gender".
freq gender.
title="##Model##: minority".
freq minority.
begin program.
import editTitles
editTitles.titleToPane()
end program.
The TITLE command will initially add a title to the main output viewer (right hand side), but the Python code will then transfer that text to the left hand pane's output tree structure. As mentioned already, note that TITLE is capped at 60 characters; a warning will be triggered to highlight this as well.
This editTitles.py approach is the closest you are going to get to including a descriptive title identifying each model. Replacing the actual title "General Linear Model" with a custom title would require scripting knowledge and a lot more code; this is a simpler alternative approach. Python integration is required for this to work.
Also consider using:
SPLIT FILE SEPARATE BY <list of filter variables>.
This will automatically produce filter labels in the left hand pane.
This is easy to use for mutually exclusive filters, but even if you have overlapping filters you can re-run it multiple times (with different filters applied each time, to get as close as possible to your desired set of results).
For example:
get file="C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\Employee data.sav".
sort cases by jobcat minority.
split file separate by jobcat minority.
freq educ.
split file off.
Still using bloody OpenOffice Writer to customize my sale_order.rml report.
In my sale order I have 6 order lines with 6 different lead times to delivery. I need to show the maximum of the six values.
After many attempts I have abandoned using the reduce function, as it works erratically or not at all most of the time. I have never seen anything like this.
So I thought I'd give max a try, encapsulating a loop such as:
[[ max(repeatIn(so.order_line.delay,'d')) ]]
My maximum lead time being 20, I would expect to see 20 (yes well that would be too easy, wouldn't it!).
It returns
{'d': 20.0}
At least it contains the value I am after.
But if I try to manipulate this result, it disappears altogether.
I have tried:
int(re.findall(r'[0-9]+', max(repeatIn(so.order_line.delay,'d')))[0])
which works great from the python window, but returns absolutely nothing in OpenERP.
I import re in my sale_order.py file, which I have recompiled into sale_order.pyo:
import time
import re
from datetime import datetime, timedelta
from report import report_sxw

class order(report_sxw.rml_parse):
    def __init__(self, cr, uid, name, context=None):
        super(order, self).__init__(cr, uid, name, context=context)
        # names registered here become available inside the RML template
        self.localcontext.update({
            'time': time,
            'datetime': datetime,
            'timedelta': timedelta,
            're': re,
        })
I have of course restarted the server many times. My test install sits on Windows.
So can anyone tell me what I am doing wrong? I can make it work from Python, but not from OpenOffice Writer!
Thanks for your help!
EDIT 1:
The format
{'d': 20.0}
is, according to Python, a dictionary. Still in Python, to extract the value from a dictionary it is possible to do it like so:
>>> dict={'d': 20.0}
>>> print(dict['d'])
20.0
But how can I transpose this to OpenERP Writer?
I have managed to get the result I wanted by importing functools and declaring the reduce function within the parameters of the sale_order.py file.
I then simply used a combination of the reduce and max functions, and it works exactly as expected.
The correct syntax is as follows:
repeatIn(objects,'o')
reduce(lambda x, y: max(x, y.delay), o.order_line, 0)
Nothing else is required.
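For reference, the sale_order.py change described above would look something like this (a sketch mirroring the localcontext block shown earlier; in Python 2 reduce is also a builtin, but importing it from functools keeps the code forward-compatible):
from functools import reduce

# inside order.__init__, alongside 'time', 're', etc.:
self.localcontext.update({
    'reduce': reduce,
})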
Enjoy!
As part of a pet project of mine, I need to test the performance of various different implementations of my code in Python. I anticipate this to be something I do a lot of, and I want to make the code I write for this purpose as easy to update and modify as possible.
It's still in its infancy at the moment, but I've taken to using strings to manage common setup or testing code, e.g.:
naiveSetup = 'from PerformanceTests.Vectors import NaiveVector\n' \
+ 'left = NaiveVector([1,0,0])\n' \
+ 'right = NaiveVector([0,1,0])'
This allows me to only write the code once, at the expense of making it harder to read and clunky to update.
Is there a better way?
Use triple quotes ("""):
setup_code = """
from PerformanceTests.Vectors import NaiveVector
left = NaiveVector([1,0,0])
right = NaiveVector([0,1,0])
"""
Another interesting method is provided in the docs of timeit:
def test():
    "Stupid test function"
    L = []
    for i in range(100):
        L.append(i)

if __name__ == '__main__':
    from timeit import Timer
    t = Timer("test()", "from __main__ import test")
    print(t.timeit())
Though this isn't suitable for all needs.
Timing code is fine, but it will still leave you guessing what's going on.
To find out what's actually going on, manually pause it a few random times in the debugger, and examine the call stack.
For example, if the code is 30x slower in one implementation than in another, the extra time accounts for 29/30 ≈ 96.7% of the total, so each sample of the stack has a 96.7% chance of falling in that extra time, and you can see why it is being spent.
No guesswork required.