Converting .fasta files to .gff3 files


I'm trying to use Roary to do phylogenetic analysis.
It says "Alternatively you can use ncbi-genome-download to pull down the FASTA files and convert them to GFF3 with Prokka." in https://sanger-pathogens.github.io/Roary/
I already have all the .fasta files I need.
How am I supposed to convert them to .gff3 files?

Fasta files contain nucleotide or peptide sequences (nucleotides in the case of bacterial/archaeal genomes). Files in GFF3 format, on the other hand, contain annotations, a list of intervals corresponding to genes or other genomic features. Optionally, Fasta sequences can be appended to the end of a GFF3 file (separated by a ##FASTA directive). I personally find it abominable to combine GFF3 and Fasta in the same file, but apparently this is required for some software packages.
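To make the layout described above concrete, here is a minimal Python sketch of a GFF3 file with sequences appended after the ##FASTA directive. The function name and the single gene feature in the usage note are hypothetical, chosen only for illustration; real annotations would come from an annotation tool.

```python
def gff3_with_fasta(features, sequences):
    """Build GFF3 text with the sequences appended after a ##FASTA directive.

    features:  iterable of 9-column tuples (seqid, source, type, start, end,
               score, strand, phase, attributes).
    sequences: dict mapping seqid -> nucleotide string.
    """
    lines = ["##gff-version 3"]
    for feature in features:
        # GFF3 feature lines are tab-separated, nine columns each.
        lines.append("\t".join(str(column) for column in feature))
    # Everything after this directive is plain FASTA.
    lines.append("##FASTA")
    for seqid, sequence in sequences.items():
        lines.append(f">{seqid}")
        lines.append(sequence)
    return "\n".join(lines) + "\n"
```

Calling gff3_with_fasta([("contig1", "example", "gene", 1, 9, ".", "+", ".", "ID=gene1")], {"contig1": "ATGAAATAA"}) yields one annotation line followed by the ##FASTA directive and the sequence record.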
You mentioned the following excerpt from the Roary documentation.
Alternatively you can use ncbi-genome-download to pull down the FASTA files and convert them to GFF3 with Prokka.
Unless I'm mistaken, "convert" is the wrong word to use here. Prokka doesn't convert Fasta files to GFF3 files; it takes bacterial/archaeal genome sequences as input and annotates them. How do you do that? Which parameters should you use? Well, @heathobrien is right: you'll need to read the Prokka documentation (and maybe the paper as well).
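As a starting point only, the sketch below builds one Prokka command per Fasta file. It assumes Prokka is installed and that the --outdir and --prefix options are all you need; the function name and the prokka_out directory are hypothetical, and the right parameters for your genomes still have to come from the Prokka documentation.

```python
from pathlib import Path


def prokka_commands(fasta_paths, out_root="prokka_out"):
    """Return one Prokka command (as an argument list) per Fasta file."""
    commands = []
    for fasta in map(Path, fasta_paths):
        sample = fasta.stem  # "strainA" for "strainA.fasta"
        commands.append([
            "prokka",
            "--outdir", str(Path(out_root) / sample),  # one folder per genome
            "--prefix", sample,                        # names the output files
            str(fasta),
        ])
    return commands
```

Each command could then be run with subprocess.run(command, check=True). The annotated GFF3 file should appear in the output directory as the prefix plus .gff, and those .gff files are what Roary takes as input.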

Related

Performing conditional change in .ods file in bash

How can I do a conditional change in an .ods document? I have two columns. One of them stores a string and the second a value. I want to search the document for a particular string that I have, say "xyz". If it matches any of the strings in the first column, I would like 1 to be deducted from the cell in the second column of the same row. The data in the .ods document are in adjoining cells (so separated by a tab?)
As an example, consider the following:
xyz 23
xxy 42
xzz 76
If I have the string "xxy", I would like the bash script to update the .ods file so that it looks like this:
xyz 23
xxy 41
xzz 76
Now, the strings that I am searching for are stored in a separate .txt file. I would like to iterate over all of the strings in the .txt file and repeatedly perform the described operation on the .ods file. There can be cases where there are multiple occurrences of the same string. Any help with this?

This should be a comment, but it's a bit long.
am searching for are stored in a text file.
No. An MS Excel file is not a text file. It's not even a file but rather an embedded filesystem in which content is encapsulated in OLE or, more recently, stored as an XML tree. While there are both OLE and XML parsers available on Unix (I assume you want to run this on Linux/Unix/POSIX since you've tagged this with bash, awk and sed), they only get you access to where the data is stored. You still need a detailed understanding of the file format to be able to make changes. While it may be possible to do this in bash, it would be a lot easier in a dedicated programming language. Several come with libraries for processing Excel files, but they vary in their support for file formats. Alternatively, you could load it in OpenOffice and use its UNO API.
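Following the suggestion to use a dedicated programming language, here is a minimal Python sketch of just the update logic. It assumes the sheet has already been read into (label, value) pairs, for example by exporting it as tab-separated text or through a spreadsheet library; reading and writing the .ods container itself is not shown. The function name is hypothetical, and the sketch assumes that a string listed several times in the .txt file should be deducted once per occurrence.

```python
from collections import Counter


def apply_deductions(rows, search_strings):
    """Subtract 1 from a row's value for each time its label is listed.

    rows:           list of (label, value) pairs from the two columns.
    search_strings: lines of the .txt file (trailing newlines are fine).
    """
    # Count how often each search string occurs, ignoring blank lines.
    hits = Counter(s.strip() for s in search_strings if s.strip())
    # Counter returns 0 for labels that were never searched for.
    return [(label, value - hits[label]) for label, value in rows]
```

With the example rows from the question and the single search string "xxy", the row for xxy goes from 42 to 41 and the other rows are unchanged.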

How to extract specific lines from a huge data file?

I have a very large data file, about 32GB. The file is made up of about 130k lines, each of which mainly contains numbers but also has a few characters.
The task I need to perform is very clear: I have to extract 20 lines and write them to a new text file.
I know the exact line number for each of the 20 lines that I want to copy.
So the question is: how can I extract the content at a specific line number from the large file? I am on Windows. Is there a tool that can do this sort of operation, or do I need to write some code?
If there is no direct way of doing that, I was thinking that a possible approach is to first extract small blocks of the original file (so that each block contains one or more lines to extract) and then use a standard editor to find the lines within each block. In this case, the question would be: how can I split a large file into blocks by line on Windows? I use a tool named HJ-Split, which works very well with large files, but it can only split by size, not by line.

Install[1] Babun Shell (or Cygwin, but I recommend Babun), and then use the sed command as described here: How can I extract a predetermined range of lines from a text file on Unix?
[1] Installing Babun actually just means unzipping it somewhere, so you don't need Administrator rights on the server.
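If installing a Unix shell is not an option, a short script can do the same job as the sed approach. The following is a minimal Python sketch, assuming Python is available on the Windows machine; the function name is hypothetical. It streams the file one line at a time, so the 32GB file is never loaded into memory, and it stops reading once the last wanted line has been copied.

```python
def extract_lines(src_path, dst_path, wanted):
    """Copy the 1-based line numbers in `wanted` from src_path to dst_path.

    Returns the number of lines actually written. `wanted` must not be empty.
    """
    wanted = set(wanted)
    last = max(wanted)
    found = 0
    # Binary mode avoids any encoding or newline translation issues.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for number, line in enumerate(src, start=1):
            if number in wanted:
                dst.write(line)
                found += 1
            if number >= last:
                break  # nothing of interest after this point
    return found
```

For example, extract_lines("data.txt", "subset.txt", [12, 4500, 129000]) would write those three lines, in file order, to subset.txt.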

How to find file that is missing a required string in eclipse/windows?

Is there a tool that can help me flag or list all the files (usually code files like java, xml, sql, etc.) in a directory that do not contain a particular string within the code?
For example: I need a list of all the files in my project that do not contain the word "author" (the text could be arbitrary).
I have seen a similar question here, but it is for OS X and not for the Windows/Eclipse platform.

A simple way would be to make a copy of the tree, find all the files which contain the word, and delete them. The remaining files are the ones which don't contain the word.
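A non-destructive variant of the same idea is to walk the tree and list the files that lack the word, without copying or deleting anything. This is a minimal Python sketch rather than an Eclipse feature; the function name and the default extension list are assumptions for illustration.

```python
import os


def files_missing(root, needle, extensions=(".java", ".xml", ".sql")):
    """Return the sorted paths under `root` whose text lacks `needle`."""
    missing = []
    for folder, _dirs, names in os.walk(root):
        for name in names:
            if not name.endswith(extensions):
                continue  # skip files that are not code files of interest
            path = os.path.join(folder, name)
            # errors="ignore" keeps odd encodings from aborting the scan.
            with open(path, encoding="utf-8", errors="ignore") as handle:
                if needle not in handle.read():
                    missing.append(path)
    return sorted(missing)
```

Calling files_missing("C:/workspace/project", "author") would return the files to fix; the path shown is only a placeholder.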

Diff for 3 binary files

I have 3 binary files. Let's call them file1.bin, file2.bin and file3.bin.
file1.bin and file2.bin have some common parts.
file2.bin and file3.bin have some common parts.
I want to find the common parts between file1.bin and file2.bin that are different between file2.bin and file3.bin.
How would you recommend accomplishing that? I have already dumped the binary files to text files using xxd and then did a 3-way diff using vim -d file1.txt file2.txt file3.txt.
However, vim marks a part as changed in all the files even if it has only changed in one file and remains the same in the other two files. I want that special kind of occurrence to be marked differently.

Perhaps you can use the built-in Unix diff (I think it is part of OSX), but use the --unchanged-group-format option to list the similarities. Do that for file1 and file2. Then do it for file2 and file3. You can then do a regular diff on the two resulting files.
For an idea of how to get the similarities, have a look at this post.
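The same two-step idea (collect what file2 shares with each neighbour, then compare the two results) can be sketched directly on bytes. This is a minimal Python illustration using the standard library's difflib instead of diff; the function names are hypothetical, and SequenceMatcher can be slow on large inputs, so treat it as a sketch for small files.

```python
from difflib import SequenceMatcher


def matched_ranges(a, b):
    """Half-open [start, end) byte ranges of b that also appear in a."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(m.b, m.b + m.size) for m in matcher.get_matching_blocks() if m.size]


def subtract(ranges, holes):
    """Remove every hole from every range (both sorted, non-overlapping)."""
    result = []
    for start, end in ranges:
        cursor = start
        for hole_start, hole_end in holes:
            if hole_end <= cursor or hole_start >= end:
                continue  # hole does not touch what is left of this range
            if hole_start > cursor:
                result.append((cursor, hole_start))
            cursor = max(cursor, hole_end)
        if cursor < end:
            result.append((cursor, end))
    return result


def shared_with_1_not_3(file1, file2, file3):
    """Ranges of file2 that match file1 but have no match in file3."""
    return subtract(matched_ranges(file1, file2), matched_ranges(file3, file2))
```

The inputs are bytes objects, for example from pathlib.Path("file1.bin").read_bytes(). The returned offsets refer to positions in file2.bin.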

The tool that I work on (ECMerge) does that. You just have to diff the 3 binary files; it will present equal portions side by side, with modified bytes placed appropriately in between. No need to get a hex dump first. You can script it in JavaScript to output whatever you like based on the diff results and the bytes in the files (it also works from the command line).

Chromium used bsdiff, then switched to Courgette for binary diffs, as explained on their blog here. You might find useful leads there.

Is there a standard format for describing a flat file?

Is there a standard or open format which can be used to describe the formatting of a flat file? My company integrates many different customer file formats. With an XML file it's easy to get or create an XSD to describe the XML file format. I'm looking for something similar to describe a flat file format (fixed width, delimited, etc.). Stylus Studio uses a proprietary .conv format to do this. That .conv format can be used at runtime to transform an arbitrary flat file to an XML file. I was just wondering if there was a more open or standards-based method for doing the same thing.
I'm looking for one method of describing a variety of flat file formats whether they are fixed width or delimited, so CSV is not an answer to this question.

XFlat:
http://www.infoloom.com/gcaconfs/WEB/philadelphia99/lyons.HTM#N29
http://www.unidex.com/overview.htm
For complex cases (e.g. log files) you may consider a lexical parser.

About selecting existing flat file formats: There is the Comma-separated values (CSV) format. Or, more generally, DSV. But these are not "fixed-width", since there's a delimiter character (such as a comma) that separates individual cells. Note that though CSV is standardized, not everybody adheres to the standard. Also, CSV may be too simple for your purposes, since it doesn't allow a rich document structure.
In that respect, the standardized and only slightly more complex (but thus more useful) formats JSON and YAML are a better choice. Both are supported out of the box by plenty of languages.
Your best bet is to have a look at all languages listed as non-binary in this overview and then determine which works best for you.
About describing flat file formats: This could be very easy or difficult, depending on the format. Though in most cases easier solutions exist, one way that will work in general is to view the file format as a formal grammar, and write a lexer/parser for it. But I admit, that's quite† heavy machinery.
If you're lucky, a couple of advanced regular expressions may do the trick. Most formats will not lend themselves to that, however.‡ If you plan on writing a lexer/parser yourself, I can recommend PLY (Python Lex-Yacc). But many other solutions exist, in many different languages, a lot of them more convenient than the old-school Lex & Yacc. For more, see What parser generator do you recommend?
†: Yes, that may be an understatement.
‡: Even properly describing the email address format is not trivial.

COBOL (whether you like it or not) has a standard format for describing fixed-width record formats in files.
Other file formats, however, are somewhat simpler to describe. A CSV file, for example, is just a list of strings. Often the first row of a CSV file is the column names -- that's the description.
There are examples of using JSON to formulate metadata for text files. This can be applied to JSON files, CSV files and fixed-format files.
Look at http://www.projectzero.org/sMash/1.1.x/docs/zero.devguide.doc/zero.resource/declaration.html
This is IBM's sMash (Project Zero) using JSON to encode metadata. You can easily apply this to flat files.
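To illustrate the general idea of JSON metadata driving a flat-file parser, here is a minimal Python sketch. The descriptor layout (the kind, fields, start, width and delimiter keys) is made up for this example and is not the sMash format; the delimited branch uses a naive split that does not handle quoting.

```python
import json

# Hypothetical descriptor for a fixed-width record, written as JSON.
DESCRIPTOR = json.loads("""
{
  "kind": "fixed",
  "fields": [
    {"name": "id",     "start": 0,  "width": 5},
    {"name": "name",   "start": 5,  "width": 10},
    {"name": "amount", "start": 15, "width": 8}
  ]
}
""")


def parse_record(line, descriptor):
    """Turn one line of a flat file into a dict, guided by the descriptor."""
    fields = descriptor["fields"]
    if descriptor["kind"] == "fixed":
        # Slice each field by position and drop the padding.
        return {f["name"]: line[f["start"]:f["start"] + f["width"]].strip()
                for f in fields}
    if descriptor["kind"] == "delimited":
        values = line.rstrip("\r\n").split(descriptor["delimiter"])
        return dict(zip((f["name"] for f in fields), values))
    raise ValueError(f"unknown kind: {descriptor['kind']!r}")
```

The same parse_record function handles both fixed-width and delimited files, so supporting a new customer format means writing a new descriptor rather than new code.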

At the end of the day, you will probably have to define your own file standard that caters specifically to your storage needs. What I suggest is using xml, YAML or JSON as your internal container for all of the file types you receive. On top of this, you will have to implement some extra validation logic to maintain meta-data such as the column sizes of the fixed width files (for importing from and exporting to fixed width). Alternatively, you can store or link a set of metadata to each file you convert to the internal format.
There may be a standard out there, but it's too hard to create 'one size fits all' solutions for these problems. There are entity relationship management tools out there (Talend, others) that make creating these mappings easier, but you will still need to spend a lot of time maintaining file format definitions and rules.
As for enforcing column width, xml might be the best solution as you can describe the formats using xml schemas (with the length restriction). For YAML or JSON, you may have to write your own logic for this, although I'm sure someone else has come up with a solution.
See XML vs comma delimited text files for further reference.

I don't know if there is any standard or open format to describe a flat file format. But one industry has done this: the banking industry. Financial institutions communicate using standardized messages over a dedicated network called SWIFT. SWIFT messages were originally positional (before SWIFTML, the XMLified version). I don't know if it's a good suggestion, as it's kind of obscure, but maybe you could look at the SWIFT Formatting Guide; it may give you some ideas.
That said, check out Flatworm, a humble flat file parser. I've used it to parse positional and/or CSV files and liked its XML descriptor format. It may be a better suggestion than SWIFT :)

CSV
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. However, if a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
The CSV entry on wikipedia allowed me to find a comparison of data serialization formats that is pretty much what you asked for.
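The quoting rules described above can be seen in action with Python's standard csv module. This is a small sketch with hypothetical helper names: fields containing a comma or a double quote get enclosed in double quotes, and an embedded double quote is doubled.

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one record using the csv module's default dialect."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue()


def from_csv_line(line):
    """Parse one CSV record back into a list of field values."""
    return next(csv.reader(io.StringIO(line, newline="")))
```

For the record ["a,b", 'say "hi"', "plain"], to_csv_line quotes the first two fields and leaves the third bare, and from_csv_line recovers the original three values.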

The only similar thing I know of is Hachoir, which can currently parse 70 file formats:
http://bitbucket.org/haypo/hachoir/wiki/Home
I'm not sure if it really counts as a declarative language, since it's plugin parser based, but it seems to work, and is extensible, which may meet your needs just fine.
As an aside, there are interesting standardised, extensible flat-file FORMATS, such as IFF (Interchange File Format).
