Losing the crs when writing to .gpkg with geopandas - geopandas

When I write my .gpkg I am losing the CRS. I have tried setting the CRS with .set_crs, and adding the CRS when writing the .gpkg (which produces a warning: "fiona._env - WARNING - dataset filename.gpkg does not support layer creation option EPSG").
My code:
for layername in fiona.listlayers(file):
    vector = geopandas.read_file(file, layer=layername)
    vector.set_crs(4326)
    vector.to_file(filename + ".gpkg", layer = layername, driver='GPKG')
or
for layername in fiona.listlayers(file):
    vector = geopandas.read_file(file, layer=layername)
    vector.to_file(filename + ".gpkg", layer = layername, driver='GPKG', epsg=4326)
Neither works.

vector.set_crs(4326) does not work in place by default. You either need to assign it or specify inplace=True.
for layername in fiona.listlayers(file):
    vector = geopandas.read_file(file, layer=layername)
    # vector.set_crs(4326, inplace=True) # one option
    vector = vector.set_crs(4326) # other option
    vector.to_file(filename + ".gpkg", layer = layername, driver='GPKG')
Your second attempt does not work because to_file does not have an epsg keyword. The argument you are trying to use gets lost among the keyword arguments passed on to Fiona and GDAL, which do not recognise it and ignore it.

How can I export layered drawings from drawio to create "animated" slides in beamer?

When preparing lectures, or conference presentations with beamer, I usually use layered drawings. Then for graphics included in consecutive slides ("frames" in beamer), I simply use different sets of layers.
For graphics created in IPE, I have created a dedicated expallviews.lua script.
Unfortunately, for graphics created with diagrams.net locally run as drawio-desktop, no such automated export of various layers exists. The only way is to manually select the visible layers in GUI and then export consecutive drawings to a set of PDF files.
Is there a more convenient method to solve that problem?
The described problem has been reported in issues 405 and 737 in the drawio-desktop repository.
After reviewing those issues, I have found a method that changes the visibility of layers automatically (instead of manually via the GUI) and exports the resulting drawings to a set of PDF files. The proposed method is described in a comment on issue 405. It uses a simple Python script:
#!/usr/bin/python3
"""
This script modifies the visibility of layers in the XML
file with diagram generated by drawio.
It works around the problem of lack of a possibility to export
only the selected layers from the CLI version of drawio.
Written by Wojciech M. Zabolotny 6.10.2022
(wzab01<at>gmail.com or wojciech.zabolotny<at>pw.edu.pl)
The code is published under LGPL V2 license
"""
from lxml import etree as let
import xml.etree.ElementTree as et
import xml.parsers.expat as pe
from io import StringIO
import os
import sys
import shutil
import zlib
import argparse
PARSER = argparse.ArgumentParser()
PARSER.add_argument("--layers", help="Selected layers, \"all\", comma separated list of integers or integer ranges like \"0-3,6,7\"", default="all")
PARSER.add_argument("--layer_prefix", help="Layer name prefix", default="Layer_")
PARSER.add_argument("--outfile", help="Output file", default="output.drawio")
PARSER.add_argument("--infile", help="Input file", default="input.drawio")
ARGS = PARSER.parse_args()
INFILENAME = ARGS.infile
OUTFILENAME = ARGS.outfile
# Find all elements with 'value' starting with the layer prefix.
# Return tuples with the element and the rest of 'value' after the prefix.
def find_layers(el_start):
    res = []
    for el in el_start:
        val = el.get('value')
        if val is not None:
            if val.find(ARGS.layer_prefix) == 0:
                # This is a layer element. Add it, and its name
                # after the prefix to the list.
                res.append((el,val[len(ARGS.layer_prefix):]))
                continue
        # If it is not a layer element, scan its children
        res.extend(find_layers(el))
    return res
# Analyse the list of visible layers, and create the list
# of layers that should be visible. Customize this part
# if you want a more sophisticated method for selection
# of layers.
# Now only "all", comma separated list of integers
# or ranges of integers are supported.
def build_visible_list(layers):
    if layers == "all":
        return layers
    res = []
    for lay in layers.split(','):
        # Is it a range?
        s = lay.find("-")
        if s > 0:
            # This is a range
            first = int(lay[:s])
            last = int(lay[(s+1):])
            res.extend(range(first,last+1))
        else:
            res.append(int(lay))
    return res
def is_visible(layer_tuple,visible_list):
    if visible_list == "all":
        return True
    if int(layer_tuple[1]) in visible_list:
        return True
    # Any other layer is hidden
    return False
try:
    EL_ROOT = et.fromstring(open(INFILENAME,"r").read())
except et.ParseError as perr:
    # Handle the parsing error
    ROW, COL = perr.position
    print(
        "Parsing error "
        + str(perr.code)
        + "("
        + pe.ErrorString(perr.code)
        + ") in column "
        + str(COL)
        + " of the line "
        + str(ROW)
        + " of the file "
        + INFILENAME
    )
    sys.exit(1)
visible_list = build_visible_list(ARGS.layers)
layers = find_layers(EL_ROOT)
for layer_tuple in layers:
    if is_visible(layer_tuple,visible_list):
        print("set "+layer_tuple[1]+" to visible")
        layer_tuple[0].attrib['visible']="1"
    else:
        print("set "+layer_tuple[1]+" to invisible")
        layer_tuple[0].attrib['visible']="0"
# Now write the modified file
t=et.ElementTree(EL_ROOT)
with open(OUTFILENAME, 'w') as f:
    t.write(f, encoding='unicode')
The maintained version of that script, together with a demonstration of its use, is also available in my GitHub repository.
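To make the core idea concrete, here is a minimal, self-contained sketch that toggles the visible attribute on a tiny hand-written XML fragment. The fragment is only a simplified assumption about what an uncompressed drawio file looks like (real files contain much more), and the helper name set_visible_layers is hypothetical, not part of the script above. Unlike find_layers, it simply visits every element instead of stopping at layer elements.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for an uncompressed drawio diagram.
SAMPLE = """<mxGraphModel><root>
<mxCell id="0"/>
<mxCell id="1" value="Layer_0" parent="0"/>
<mxCell id="2" value="Layer_1" parent="0"/>
<mxCell id="3" value="Layer_2" parent="0"/>
</root></mxGraphModel>"""


def set_visible_layers(root, visible, prefix="Layer_"):
    """Set visible="1" on layers whose number is in `visible`, "0" on the rest."""
    states = {}
    for el in root.iter():
        value = el.get("value")
        if value is not None and value.startswith(prefix):
            number = int(value[len(prefix):])
            el.set("visible", "1" if number in visible else "0")
            states[number] = el.get("visible")
    return states


root = ET.fromstring(SAMPLE)
states = set_visible_layers(root, visible={0, 2})
print(states)
```

Calling the helper once per slide with a different set of layer numbers, and writing the tree to a new file each time, gives one input file per frame for the export step.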

How to use entrezpy and Biopython Entrez libraries to access ClinVar data from genomic position of variant

[Disclaimer: I posted this question 3 weeks ago on Biostars and have received no answers yet. I would really like some ideas or discussion that could lead to a solution, so I am also posting it here.
biostars post link: https://www.biostars.org/p/447413/]
For one of my PhD projects, I would like to access all variants in the ClinVar database that are at the same genomic position as the variant in each row of an input GSvar file. The language constraint is Python.
So far I have used the entrezpy module, specifically entrezpy.esearch.esearcher. The entrezpy documentation is at: https://entrezpy.readthedocs.io/en/master/
Following this guide from the entrezpy docs, I retrieve UIDs using the genomic position of a variant: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html In code:
# first get UIDs for clinvar records of the same position
# credits: https://entrezpy.readthedocs.io/en/master/tutorials/esearch/esearch_uids.html
chr = variants["chr"].split("chr")[1]
start, end = str(variants["start"]), str(variants["end"])
es = entrezpy.esearch.esearcher.Esearcher('esearcher', self.entrez_email)
genomic_pos = chr + "[chr]" + " AND " + start + ":" + end # + "[chrpos37]"
entrez_query = es.inquire(
{'db': 'clinvar',
'term': genomic_pos,
'retmax': 100000,
'retstart': 0,
'rettype': 'uilist'}) # 'usehistory': False
entrez_uids = entrez_query.get_result().uids
Then I have used Entrez from BioPython to get the available ClinVar records:
# process each VariationArchive of each UID
handle = Entrez.efetch(db='clinvar', id=current_entrez_uids, rettype='vcv')
clinvar_records = {}
tree = ET.parse(handle)
root = tree.getroot()
This approach works. However, it has two main drawbacks:
entrezpy fills up my log file by recording every interaction with Entrez, which makes the log too big to be read by our hospital collaborator, who is the variant curator.
The entrezpy call entrez_query.get_result().uids returns all UIDs retrieved so far across all requests (one request per variant in the GSvar file), which makes the retrieval space-inefficient: the entrez_uids list grows quickly as I process the variants of a GSvar file. The simple solution I have implemented is to check which UIDs are new in the current request and pass only those to Entrez.efetch(). However, I still need to keep all UIDs seen for previous variants in order to know which ones are new. I do this in code as follows:
# first snippet's first lines go here
entrez_uids = entrez_query.get_result().uids
current_entrez_uids = [uid for uid in entrez_uids if uid not in self.all_entrez_uids_gsvar_file]
self.all_entrez_uids_gsvar_file += current_entrez_uids
Does anyone have suggestions on how to address these two drawbacks?
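One possible way to keep that bookkeeping cheap, sketched here without any Entrez calls: hold the already-seen UIDs in a set instead of a list, so that each membership test takes constant time on average. The class and method names are hypothetical, and the UID values are made up for illustration. It does not remove the need to remember seen UIDs, it only makes the check faster.

```python
class UidTracker:
    """Remember every UID seen so far and report only the new ones."""

    def __init__(self):
        self._seen = set()

    def new_uids(self, uids):
        """Return UIDs not seen before, in their original order."""
        fresh = []
        for uid in uids:
            if uid not in self._seen:
                self._seen.add(uid)
                fresh.append(uid)
        return fresh


tracker = UidTracker()
first = tracker.new_uids([101, 102, 103])        # all new
second = tracker.new_uids([101, 102, 103, 104])  # cumulative list, one new UID
print(first, second)
```

The list returned by new_uids would play the role of current_entrez_uids in the snippet above.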

PyMC: Directly changing an object's name doesn't apply when pulling out traces

Here is a bare bit of code which produces an error:
import pymc
import numpy as np
a = pymc.Normal('a', 1, 1)
b = np.empty(4, dtype=object)
for i in range(4):
    b[i] = 1*a
    b[i].__name__ = 'b_%i'%i
M = pymc.MCMC([a,b])
M.sample(10)
M.trace('b_0') # Causes a KeyError:'b_0'
I don't understand why I get a KeyError: 'b_0' when I try to extract the trace of b_0 and all the other b's. Are the traces just not being saved? If so, is there a way to directly flick some switch to change that without having to create the object using @deterministic?
I looked into it, and apparently the trace wasn't being saved. Also, the "flag variable" for keeping the trace isn't .trace, it's .keep_trace.

TensorFlow: Reading images in queue without shuffling

I have a training set of 614 images which have already been shuffled. I want to read the images in order in batches of 5. Because my labels are arranged in the same order, any shuffling of the images when being read into the batch will result in incorrect labelling.
These are my functions to read and add the images to the batch:
# To add files from queue to a batch:
def add_to_batch(image):
    print('Adding to batch')
    image_batch = tf.train.batch([image],batch_size=5,num_threads=1,capacity=614)
    # Add to summary
    tf.image_summary('images',image_batch,max_images=30)
    return image_batch
# To read files in queue and process:
def get_batch():
    # Create filename queue of images to read
    filenames = [('/media/jessica/Jessica/TensorFlow/StreetView/training/original/train_%d.png' % i) for i in range(1,614)]
    filename_queue = tf.train.string_input_producer(filenames,shuffle=False,capacity=614)
    reader = tf.WholeFileReader()
    key, value = reader.read(filename_queue)
    # Read and process image
    # Image is 500 x 275:
    my_image = tf.image.decode_png(value)
    my_image_float = tf.cast(my_image,tf.float32)
    my_image_float = tf.reshape(my_image_float,[275,500,4])
    return add_to_batch(my_image_float)
This is my function to perform the prediction:
def inference(x):
    < Perform convolution, pooling etc.>
    return y_conv
This is my function to calculate loss and perform optimisation:
def train_step(y_label,y_conv):
    """ Calculate loss """
    # Cross-entropy
    loss = -tf.reduce_sum(y_label*tf.log(y_conv + 1e-9))
    # Add to summary
    tf.scalar_summary('loss',loss)
    """ Optimisation """
    opt = tf.train.AdamOptimizer().minimize(loss)
    return loss
This is my main function:
def main():
    # Training
    images = get_batch()
    y_conv = inference(images)
    loss = train_step(y_label,y_conv)
    # To write and merge summaries
    writer = tf.train.SummaryWriter('/media/jessica/Jessica/TensorFlow/StreetView/SummaryLogs/log_5', graph_def=sess.graph_def)
    merged = tf.merge_all_summaries()
    """ Run session """
    sess.run(tf.initialize_all_variables())
    tf.train.start_queue_runners(sess=sess)
    print "Running..."
    for step in range(5):
        # y_1 = <get the correct labels here>
        # Train
        loss_value = sess.run(train_step,feed_dict={y_label:y_1})
        print "Step %d, Loss %g"%(step,loss_value)
        # Save summary
        summary_str = sess.run(merged,feed_dict={y_label:y_1})
        writer.add_summary(summary_str,step)
    print('Finished')
if __name__ == '__main__':
    main()
When I check my image_summary the images do not seem to be in sequence. Or rather, what is happening is:
Images 1-5: discarded, Images 6-10: read, Images 11-15: discarded, Images 16-20: read etc.
So it looks like I am getting my batches twice, throwing away the first one and using the second one? I have tried a few remedies but nothing seems to work. I feel like I am misunderstanding something fundamental about calling images = get_batch() and sess.run().
Your batch operation is a FIFOQueue, so every time you use its output, it advances the state.
Your first session.run call uses images 1-5 in the computation of train_step; your second session.run call asks for the computation of image_summary, which pulls images 6-10 and uses them in the visualization.
If you want to visualize things without affecting the state of the input, it helps to cache queue values in variables and define your summaries with those variables as inputs rather than depending on the live queue.
(image_batch_live,) = tf.train.batch([image],batch_size=5,num_threads=1,capacity=614)
image_batch = tf.Variable(
    tf.zeros((batch_size, image_size, image_size, color_channels)),
    trainable=False,
    name="input_values_cached")
advance_batch = tf.assign(image_batch, image_batch_live)
So now your image_batch is a static value which you can use both for computing loss and visualization. Between steps you would call sess.run(advance_batch) to advance the queue.
One minor wrinkle with this approach: the default saver will save your image_batch variable to the checkpoint. If you ever change your batch size, the checkpoint restore will fail with a dimension mismatch. To work around this, you would need to specify the list of variables to restore manually and run the initializers for the rest.
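The consuming behaviour can be illustrated with a plain-Python analogy. This is not TensorFlow code: a deque stands in for the input queue, and the hypothetical dequeue_batch function stands in for running the batch op. It only shows why two separate reads see different images, and why reading once and reusing the stored value keeps training and visualization in step.

```python
from collections import deque

# Stand-in for the input queue: image numbers 1..20, in order.
queue = deque(range(1, 21))


def dequeue_batch(size=5):
    """Every call consumes the next `size` items, like running the batch op."""
    return [queue.popleft() for _ in range(size)]


# Live reads: "training" and "summary" each pull their own batch.
train_live = dequeue_batch()    # images 1-5
summary_live = dequeue_batch()  # images 6-10, not the ones used for training

# Cached read: advance once, then reuse the stored value for both purposes.
cached = dequeue_batch()        # images 11-15
train_cached = cached
summary_cached = cached
print(train_live, summary_live, train_cached == summary_cached)
```

In the analogy, assigning to cached plays the role of sess.run(advance_batch): the queue moves only when that step runs.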

Using native Windows FileChooser dialog under PyGTK?

I'm using PyGTK's GtkFileChooserButton, which works but looks very odd on Windows. Is it possible to use the native Windows file chooser dialog instead?
UPDATE
See the comments for possible directions. However, if you decide (like me...) that it isn't worth the effort, this will make the Gtk FileChooser more tolerable:
import ctypes
import ctypes.wintypes
import os
def get_win_my_documents():
    # based on http://stackoverflow.com/questions/3858851/python-get-windows-special-folders-for-currently-logged-in-user
    # and http://stackoverflow.com/questions/6227590/finding-the-users-my-documents-path
    CSIDL_PERSONAL = 5 # My Documents
    # the 2 stackoverflow answers use different values for this constant!
    SHGFP_TYPE_CURRENT = 0 # Get current, not default value
    buf = ctypes.create_unicode_buffer(ctypes.wintypes.MAX_PATH)
    ctypes.windll.shell32.SHGetFolderPathW(None, CSIDL_PERSONAL, None, SHGFP_TYPE_CURRENT, buf)
    if os.path.isdir(buf.value):
        return buf.value
    else:
        # fall back to simple "home" notion
        return os.path.expanduser("~")
...
my_documents = get_win_my_documents()
chooser.set_current_folder(my_documents)
chooser.add_shortcut_folder(my_documents)
# I'm not sure if it's a general solution, but works for me...
downloads = os.path.join(os.path.expanduser("~"), "Downloads")
if os.path.isdir(downloads):
    chooser.add_shortcut_folder(downloads)
