Reading from compressed FASTA bz2 file using skbio

Is it possible to read from a compressed file (e.g., FASTA bz2)? I usually use skbio.sequence.Sequence.read but don't see this option there.
Thanks!

This is possible to do as follows:
import skbio
seq = skbio.io.read("seqs.fna.bz2", format='fasta', compression='bz2', into=skbio.DNA)
I'm using scikit-bio 0.5.0, but this should be possible with earlier versions as well. While I'm explicitly defining the compression type, that's generally not necessary.
The relevant documentation is here and here.
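If scikit-bio is not available, roughly the same result can be had with the standard library alone. The following is only a minimal sketch, not how skbio does it internally: read_fasta_bz2 is a hypothetical helper, the demo data is made up, and it assumes a plain FASTA layout in which header lines start with ">".

```python
import bz2
import os
import tempfile


def read_fasta_bz2(path):
    """Yield (header, sequence) pairs from a bz2-compressed FASTA file."""
    header, chunks = None, []
    # Mode "rt" makes bz2.open decompress and decode to text on the fly.
    with bz2.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Demo on a small temporary file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "seqs.fna.bz2")
    with bz2.open(demo_path, "wt") as out:
        out.write(">seq1 example\nACGT\nTTGA\n>seq2\nGGCC\n")
    records = list(read_fasta_bz2(demo_path))

print(records)
```

Unlike skbio, this returns plain strings and does no alphabet validation, so it is only a fallback for simple cases.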

Related

Conversion between knitr and sweave

This might have been asked before, but until now I couldn't find a really helpful answer for me.
I am using RStudio with knitr, and a colleague I need to cooperate with uses the Sweave format. Is there a good way to convert a script back and forth between these two?
I have already found "Sweave2knitr" and hoped this would have an .rmd as output with all chunks changed (<<>> to {} etc.) but this is not the case. My main problem is that I would also need the option to convert from .rmd back to .rnw so that my colleague can also re-edit my work-over.
Thanks a lot!
To process the code chunks and convert the .Rnw file to .tex, you use the knit() function in the knitr package rather than Sweave().
R -e 'library(knitr);knit("my_file.Rnw")'
Sweave2knitr() is for converting old Sweave-based .Rnw files to the knitr syntax.
In RStudio's options, under "Program defaults", set "Weave Rnw files using" to either Sweave or knitr.
The Rnw format is really LaTeX with some modifications, whereas the Rmd format is Markdown with some modifications. There are two main flavours of Rnw, the one used by Sweave being the original, and the one used by knitr being a modification of it, but they are very similar.
It's not hard to change Sweave flavoured Rnw to knitr flavoured Rnw (that's what Sweave2knitr does), but changing either one to Rmd would require extensive changes, and probably isn't feasible: certainly I'd expect a lot of manual work after the change.
So for your joint work with a co-author, I would recommend that you settle on a single format, and just use that. I would choose Rmd for this: it's much easier for your co-author to learn Markdown than for you to learn LaTeX. (If you already know LaTeX, that might push the choice the other way.)

How do I effectively identify an unknown file format

I want to write a program that parses yum config files. These files look like this:
[google-chrome]
name=google-chrome - 64-bit
baseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64
enabled=1
gpgcheck=1
gpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub
This format looks like it is very easy to parse, but I do not want to reinvent the wheel. If there is an existing library that can generically parse this format, I want to use it.
But how do I find a library for something I cannot name?
The file extension is no help here. The term ".repo" does not yield any general results besides yum itself.
So, please teach me how to fish:
How do I effectively find the name of a file format that is unknown to me?
Identifying an unknown file format can be a pain.
But you have some options. I will start with a very obvious one.
Ask
Showing other people the format is maybe the best way to find out its name.
Someone will likely recognize it. And if no one does, chances are good that
you have a proprietary file format in front of you.
In case of your yum repository file, I would say it is a plain old INI file.
But let's do some more research on this.
Reverse Engineering
Reverse engineering may be your best bet if nobody recognizes your format.
Take the reference implementation and find out what they are using to parse the format.
Luckily, yum is open source. So it is easy to look up.
Let's see what the yum authors use to parse their repo files:
try:
    ini = INIConfig(open(repo.repofile))
except:
    return None
https://github.com/rpm-software-management/yum/blob/master/yum/config.py#L1304
Now the import of this function can be found here:
from iniparse import INIConfig
https://github.com/rpm-software-management/yum/blob/master/yum/config.py#L32
This leads us to a library called iniparse (https://pypi.org/project/iniparse/).
So yum uses an INI parser for its config files.
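If you only need to read such a file from Python, the standard library's configparser handles the same INI syntax. This is a sketch under assumptions: it swaps in configparser for iniparse (which is what yum itself uses), parse_repo is a hypothetical helper, and it does not expand yum variables such as $basearch.

```python
import configparser

REPO_TEXT = """\
[google-chrome]
name=google-chrome - 64-bit
baseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64
enabled=1
gpgcheck=1
gpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub
"""


def parse_repo(text):
    """Return {section: {key: value}} for INI-style repo text."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


repos = parse_repo(REPO_TEXT)
print(repos["google-chrome"]["baseurl"])
```

All values come back as strings, so flags like enabled=1 still need converting by the caller. For a real file, parser.read(path) would replace read_string.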
I will show you how to quickly navigate to these kinds of code passages,
since navigating somewhat large projects can be intimidating.
I use a tool called ripgrep (https://github.com/BurntSushi/ripgrep).
My initial anchors are usually well-known file paths. In the case of yum, I took /etc/yum.repos.d for my initial search:
# assuming you are in the root directory of yum's source code
rg /etc/yum.repos.d yum
yum/config.py
769: reposdir = ListOption(['/etc/yum/repos.d', '/etc/yum.repos.d'])
yum/__init__.py
556: # (typically /etc/yum/repos.d)
This narrows it down to two files. If you go on further with terms like read or parse,
you will quickly find the results you want.
What if you do not have the reference source?
Sometimes you have no access to the source code of a reference implementation, for example when it is closed source.
Try to break the format. Insert some garbage and observe the log files afterwards. If you are lucky, you may find
a helpful error message which might give you hints about the format.
If you feel very brave, you can try to use an actual decompiler as well. This may or may not be illegal and may or may not be a waste of time.
I personally would only do this as a last resort.

How to convert image to integer array? (do not use any non-standard library)

How can I convert image.png or image.bmp to an integer array without using any non-standard library?
Please ignore chunks that are not directly related to image data (IHDR, IEND, etc.).
Thank you very much.
SOLVED: I should use the binary I/O functions in stdio.h to read the image file. Thanks.
If you have to read images into arrays without any image processing libraries you need two things:
You need a means to read files in general.
You need to know the internal structure of the file formats you want to read.
So for PNG, refer to the specification at https://www.w3.org/TR/2003/REC-PNG-20031110/
This document will tell you where to find the image dimensions, pixel data and other features. It's basically a manual for software developers on how to use this standard format properly.
Some image formats will require additional work like decompression.
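To make those steps concrete, here is a rough sketch of walking the PNG chunk structure and decompressing the pixel data. It is written in Python for brevity (the question is about C, where the same steps apply with fread plus an inflate routine) and it is far from a full decoder: it assumes 8-bit grayscale, non-interlaced images whose scanlines all use filter type 0, and it skips CRC verification. The helper names and the 2x2 demo image are made up for illustration.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_chunks(data):
    """Yield (chunk_type, chunk_body) for each chunk in PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length


def png_to_rows(data):
    """Return (width, height, rows) for an 8-bit grayscale PNG."""
    width = height = None
    compressed = b""
    for ctype, body in iter_chunks(data):
        if ctype == b"IHDR":
            # IHDR holds the dimensions and the pixel format.
            width, height, depth, color, _comp, _filt, interlace = struct.unpack(
                ">IIBBBBB", body
            )
            if (depth, color, interlace) != (8, 0, 0):
                raise NotImplementedError("sketch handles 8-bit grayscale only")
        elif ctype == b"IDAT":
            compressed += body  # all IDAT bodies together form one zlib stream
    raw = zlib.decompress(compressed)
    stride = width + 1  # one filter-type byte, then one byte per pixel
    rows = []
    for y in range(height):
        line = raw[y * stride:(y + 1) * stride]
        if line[0] != 0:
            raise NotImplementedError("sketch handles filter type 0 only")
        rows.append(list(line[1:]))
    return width, height, rows


def make_chunk(ctype, body):
    """Build one PNG chunk; used here only to create demo input."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


# A hypothetical 2x2 grayscale image built in memory for the demo.
pixels = [[0, 255], [128, 64]]
scanlines = b"".join(b"\x00" + bytes(row) for row in pixels)
demo_png = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 2, 2, 8, 0, 0, 0, 0))
    + make_chunk(b"IDAT", zlib.compress(scanlines))
    + make_chunk(b"IEND", b"")
)
print(png_to_rows(demo_png))
```

Note that IHDR cannot really be ignored, since it carries the width and height needed to split the decompressed data into rows. Real-world PNGs usually use the other filter types as well, which the specification describes how to reverse.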

File extension for serialized protobuf output

Seems odd that I can't find the answer to this, but what file extension are you supposed to use when storing serialized protobuf output in a file? Just .protobuf? The json equivalent of what I am talking about would be a .json file.
I just use .bin, but there's no actual standard here AFAIK. If protoc -o (which emits a .proto schema in protobuf binary format as a FileDescriptorSet) had taken a directory like all the other output options do, we could have used that as a de-facto answer, but protoc -o is unusual in that it takes a file instead. In an old post on the protobuf group, Kenton Varda (one of the original authors) suggests that the file extension should be implementation specific (meaning: you decide) rather than simply referring to the format: https://groups.google.com/forum/#!topic/protobuf/JWZx9n8CUvw

Spring Batch and JRecord to generate EBCDIC

I am reading a table into an object and I need to generate a passthrough EBCDIC file from it. This is a Spring Batch step. There were some suggestions to use JRecord to write an aggregator and a FlatFileItemWriter.
Any clues?
JRecord is a possible solution. I cannot say whether there is a better solution for you, as I do not
know anything about Spring Batch. This is perhaps more of an extended comment than a pure answer.
JRecord reads / writes files using a File-Schema (or File Description).
Normally this file schema is a Cobol copybook, although it can also be an XML schema. The file schema can also be defined in the program if need be. Given that you want to write EBCDIC files, I would think a Cobol copybook
will be needed at some stage.
JRecord also supports mainframe/Cobol sequential file structures (FB, fixed-width files),
which is what you want.
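To illustrate what a fixed-width EBCDIC record looks like at the byte level, here is a small sketch. It is not JRecord's API: to_ebcdic_record is a hypothetical helper, the field layout is made up, and cp037 is only one of several EBCDIC code pages, so the right one depends on the target system. It also handles text fields only, not packed-decimal (COMP-3) fields, which a copybook-driven tool takes care of.

```python
def to_ebcdic_record(values, widths, codec="cp037"):
    """Pad each value to its field width and encode the record as EBCDIC."""
    fields = []
    for value, width in zip(values, widths):
        text = str(value)
        if len(text) > width:
            raise ValueError(f"{text!r} does not fit in {width} characters")
        # Fixed-width text fields are left-aligned and space-padded here.
        fields.append(text.ljust(width))
    return "".join(fields).encode(codec)


# Hypothetical layout: a 6-character name followed by a 4-character code.
record = to_ebcdic_record(["ABC", "12"], [6, 4])
print(record.hex())
```

Every record has the same length (10 bytes here) and there is no line terminator between records, which is what distinguishes an FB file from an ordinary text file.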
JRecord allows access to fields either by field name or by field index (or field id). Note that record_Type_Index is there to handle files with multiple record types (e.g. header-record, detail-record, footer-record files).
outLine.getFieldValue(record_Type_Index, field_Index).set(...)
or
outLine.getFieldValue("Field-Name").set(...)
Bruce Martin (author of JRecord)
Discussions continued at JRecord forum
https://sourceforge.net/p/jrecord/discussion/678634/thread/2709ab72/?limit=25#c009/8287
