RDKit: convert a Mol file (SDF) from V3000 to V2000

I have a Mol (SDF) file with a few chemicals in V3000 format. How do I convert it to a Mol (SDF) file in V2000 format using RDKit?
Any help is appreciated.

RDKit will write v2000 format by default unless v3000 is specified, so you can just read in the SDF in v3000 format and write to v2000:
from rdkit import Chem

supplier = Chem.SDMolSupplier('v3000.sdf')
writer = Chem.SDWriter('v2000.sdf')
for molecule in supplier:
    writer.write(molecule)
writer.close()
Suppose you wanted to do the opposite:
supplier = Chem.SDMolSupplier('v2000.sdf')
writer = Chem.SDWriter('v3000.sdf')
writer.SetForceV3000(True)
for molecule in supplier:
    writer.write(molecule)
writer.close()
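If you only need a single structure as a string rather than a whole file, the same choice exists on the Mol block writer. A minimal sketch, assuming the file names from above:
# Minimal sketch (file names assumed from the examples above): the forceV3000
# flag on MolToMolBlock plays the same role as SetForceV3000 on the SDWriter.
from rdkit import Chem

mol = Chem.SDMolSupplier('v3000.sdf')[0]          # first record of the SDF
print(Chem.MolToMolBlock(mol))                    # V2000 block by default
print(Chem.MolToMolBlock(mol, forceV3000=True))   # V3000 block on request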

Related

How to change column datatype with pyarrow

I am reading a set of arrow files and am writing them to a parquet file:
import pathlib
from pyarrow import parquet as pq
from pyarrow import feather
import pyarrow as pa

base_path = pathlib.Path('../mydata')
fields = [
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
]
schema = pa.schema(fields)
with pq.ParquetWriter('sample.parquet', schema) as pqwriter:
    for file_path in base_path.glob('*.arrow'):
        table = feather.read_table(file_path)
        pqwriter.write_table(table)
My problem is that the code field in the arrow files is defined with an int8 index instead of int32, and the range of int8 is insufficient. Hence I defined a schema with an int32 index for the code field in the parquet file.
However, writing the arrow table to parquet now complains that the schemas do not match.
How can I change the datatype of the arrow column? I checked the pyarrow API and did not find a way to change the schema. Can this be done without roundtripping to pandas?
The Arrow ChunkedArray has a cast function, but unfortunately it doesn't support this particular conversion:
>>> table['code'].cast(pa.dictionary(pa.int32(), pa.uint64(), ordered=False))
Unsupported cast from dictionary<values=uint64, indices=int8, ordered=0> to dictionary<values=uint64, indices=int32, ordered=0> (no available cast function for target type)
Instead, you can cast to pa.uint64() and then dictionary-encode the result:
>>> table['code'].cast(pa.uint64()).dictionary_encode().type
DictionaryType(dictionary<values=uint64, indices=int32, ordered=0>)
Here's a self-contained example:
import pyarrow as pa

source_schema = pa.schema([
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int8(), pa.uint64(), ordered=False)),
])
source_table = pa.Table.from_arrays([
    pa.array([1, 2, 3], pa.int64()),
    pa.array([1, 2, 1000], pa.dictionary(pa.int8(), pa.uint64(), ordered=False)),
], schema=source_schema)

destination_schema = pa.schema([
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
])
destination_data = pa.Table.from_arrays([
    source_table['value'],
    source_table['code'].cast(pa.uint64()).dictionary_encode(),
], schema=destination_schema)
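To plug this back into the original loop, one option (a sketch only, assuming every arrow file carries the int8-indexed schema from the question) is to re-encode the code column of each table before handing it to the ParquetWriter:
# Sketch: recast 'code' per table so its schema matches destination_schema
# from the example above; base_path and file layout are taken from the question.
import pathlib
from pyarrow import feather
from pyarrow import parquet as pq

base_path = pathlib.Path('../mydata')
with pq.ParquetWriter('sample.parquet', destination_schema) as pqwriter:
    for file_path in base_path.glob('*.arrow'):
        table = feather.read_table(file_path)
        recoded = table.set_column(
            table.schema.get_field_index('code'),
            pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
            table['code'].cast(pa.uint64()).dictionary_encode(),
        )
        pqwriter.write_table(recoded)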

How to protonate a molecule in rdkit?

I'm trying to present the workflow for positive-ion ESI mass spectra, based on the fragmentation of [M+H]+ ions. I want to simulate the ionisation by adding one proton to heteroatoms. For example:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import rdMolDraw2D
from IPython.display import SVG

# read mol
mol = Chem.MolFromSmiles('O=C(O)C1=CC(=NNC2=CC=C(C=C2)C(=O)NCCC(=O)O)C=CC1=O')
# draw the mol
dr = rdMolDraw2D.MolDraw2DSVG(800, 800)
dr.SetFontSize(0.3)
op = dr.drawOptions()
for i in range(mol.GetNumAtoms()):
    op.atomLabels[i] = mol.GetAtomWithIdx(i).GetSymbol() + str(i + 1)
AllChem.Compute2DCoords(mol)
dr.DrawMolecule(mol)
dr.FinishDrawing()
svg = dr.GetDrawingText()
SVG(svg)
I want to add one proton to the N atom labelled #17 and ionize the molecule. How can I achieve this in RDKit?
Are these functions what you are looking for?
atom = mol.GetAtomWithIdx(17)
atom.SetNumExplicitHs(1)
atom.SetFormalCharge(1)
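Putting it together, a minimal sketch that wires those calls into the question's molecule (the atom index is the one from the question and is only illustrative; re-sanitizing afterwards lets RDKit recompute valences):
# Sketch: apply the suggested calls and inspect the protonated species.
from rdkit import Chem

mol = Chem.MolFromSmiles('O=C(O)C1=CC(=NNC2=CC=C(C=C2)C(=O)NCCC(=O)O)C=CC1=O')
atom = mol.GetAtomWithIdx(17)     # the nitrogen you want to protonate
atom.SetNumExplicitHs(1)          # add the proton explicitly
atom.SetFormalCharge(1)           # give the atom a +1 charge
Chem.SanitizeMol(mol)             # recompute valences/aromaticity
print(Chem.MolToSmiles(mol))      # SMILES of the protonated species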

Extract EXIF data from a TIF

I have this code here and I would like to make it simpler by not using both the TIF and the CR2. Basically, I would like to get the exposure time, F-number, ISO and date from the TIF as the variables t, f, S and date, so that I don't have to use the CR2 file. Here is my code so far:
clear all % clear workspace
RGB = imread('IMG_0069.tif');
info = imfinfo('IMG_0069.CR2'); % get Metadata
C = 1; % Constant to adjust image
x = info.DigitalCamera; % get EXIF
t = getfield(x, 'ExposureTime');% save ExposureTime
f = getfield(x, 'FNumber'); % save FNumber
S = getfield(x, 'ISOSpeedRatings');% save ISOSpeedRatings
date = getfield(x,'DateTimeOriginal'); % save DateTimeOriginal
I = rgb2gray(RGB);
You can easily concatenate strings to form the file names:
fname='IMG_XXX';
imread([fname, '.tif']);
imfinfo([fname, '.CR2'])
imfinfo should give you any info encoded in the metadata, but from the comments I can see that your files do not have the information you want.

Incorrect representation of a string in a CSV file

I'm on Win7, Python 2.7.
I have the following string.
Original view:
A. P. Møller Mærsk
UTF-8:
s = 'A. P. M\xc3\xb8ller M\xc3\xa6rsk'
I need to write it to a CSV file.
I try this:
import csv

with open('14.09 Anbefalte aksjer.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow([s])
Got this:
A. P. Møller Mærsk
Then I try to use UnicodeWriter:
import csv
import codecs
from StringIO import StringIO

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

s = 'A. P. M\xc3\xb8ller M\xc3\xa6rsk'.decode('utf8')
with open('14.09 Anbefalte aksjer.csv', 'w') as csvfile:
    writer = UnicodeWriter(csvfile)
    writer.writerow([s])
And again I got:
A. P. Møller Mærsk
I also tried the unicodecsv module. Again:
A. P. Møller Mærsk
What's wrong? How can I write it correctly?
What you see is mojibake: bytes that represent Unicode text in one character encoding are being shown in another (incompatible) character encoding.
If ''.decode('utf8') doesn't raise AttributeError, then it means you are not on Python 3 (despite what your question says). On Python 2, csv doesn't support Unicode directly; you have to encode manually:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import csv

text = "A. P. Møller Mærsk"
with open('14.09 Anbefalte aksjer.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow([text.encode('utf-8')])
Both UnicodeWriter and the unicodecsv module should work as well, as long as text contains uncorrupted data.
Windows tools such as Notepad or Excel assume the default Windows locale's encoding, so for UTF-8 a byte order mark (BOM, U+FEFF) must be written at the start of the file. Python provides an encoding for this: utf-8-sig. Note also that by using #coding:utf8 and saving your source file in UTF-8, you can declare your string directly as a Unicode string. Finally, files for use with the csv module should be opened in 'wb' mode on Python 2.7, or you will see problems writing newlines on Windows.
#coding:utf8
import csv
from StringIO import StringIO
import codecs

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    # Use utf-8-sig encoding here.
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        # Redirect output to a queue
        self.queue = StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

s = u'A. P. Møller Mærsk'  # declare as a Unicode string.
with open('14.09 Anbefalte aksjer.csv', 'wb') as csvfile:
    writer = UnicodeWriter(csvfile)
    writer.writerow([s])
Output:
A. P. Møller Mærsk
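For completeness, the same effect can be had without the UnicodeWriter helper. A minimal sketch for Python 2.7, reusing the filename from the question: write the UTF-8 BOM once, then UTF-8-encode each cell yourself.
#coding:utf8
# Sketch: plain csv.writer plus a manually written BOM.
import csv
import codecs

s = u'A. P. Møller Mærsk'
with open('14.09 Anbefalte aksjer.csv', 'wb') as csvfile:
    csvfile.write(codecs.BOM_UTF8)        # BOM so Excel detects UTF-8
    writer = csv.writer(csvfile)
    writer.writerow([s.encode('utf-8')])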

Byte reverse hex with Python

I have a .bin file, and I simply want to byte-reverse the hex data. Say, for instance, at offset 0x10 it reads AD DE DE C0; I want it to read DE AD C0 DE.
I know there is a simple way to do this, but I am a beginner just learning Python and am trying to make a few simple programs to help me through my daily tasks. I would like to convert the whole file this way, not just offset 0x10.
I will be converting from start offset 0x000000, and the block size/length is 1000000 (hex).
EDIT:
Here is my code; maybe you can tell me where I am messing up.
def main():
    infile = open("file.bin", "rb")
    new_pos = int("0x000000", 16)
    chunk = int("1000000", 16)
    data = infile.read(chunk)
    save(data)

def save(data):
    with open("reversed", "wb") as outfile:
        outfile.write(data)

main()
How would I go about coding it to byte-swap from CDAB to ABCD?
If it helps any, the file is exactly 16 MB.
You can just swap the bytes manually like this:
with open("file.bin", "rb") as infile, open("reversed", "wb") as outfile:
data = infile.read()
for i in xrange(len(data) / 2):
outfile.write(data[i*2+1])
outfile.write(data[i*2])
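An alternative sketch (still Python 2, and assuming an even file length, which holds for an exactly 16 MB file): swap adjacent bytes with slice assignment instead of a per-byte loop, which avoids millions of small writes.
# Sketch: interleave the odd- and even-offset bytes via slice assignment.
with open("file.bin", "rb") as infile:
    data = bytearray(infile.read())

swapped = bytearray(len(data))
swapped[0::2] = data[1::2]   # bytes at odd offsets move to even offsets
swapped[1::2] = data[0::2]   # bytes at even offsets move to odd offsets

with open("reversed", "wb") as outfile:
    outfile.write(swapped)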
