xzgrep and other Compressed Pattern Matching Tools - algorithm

I'm experimenting with compressed pattern matching utilities, or more specifically with searching for text patterns within LZW-compressed text files.
I'm wondering whether the xzgrep Linux utility applies a dedicated algorithm to achieve that, or whether it is simply equivalent to regular decompression followed by grepping and has nothing to do with compressed pattern matching, e.g.:
uncompress LARGE_TEXT_FILE.Z | grep "My Pattern"
Also, are there any other utilities/software that apply compressed pattern-matching algorithms (for LZW-compressed text files, like http://tandem.bu.edu/papers/let.sleeping.files.lie.jcss.1996.pdf), preferably with the source code available?
Thank you!

Related

Is MATLAB performance better than bash scripts for text manipulation?

I have a program (for simulation of some physical systems) that gives very big (more than 1GB) text files as output. I have to extract the desired results (numbers) from this text file. Currently I've written bash scripts for this purpose that, for example, search the text files for some expressions and write the number after that expression in a separate file; e.g.:
grep $EXP | awk '{print $14}' > tmp
Unfortunately, these bash scripts are very time-consuming for large input text files, so I am considering using another language for searching the text files. As there are many scripts that I would have to rewrite, would writing the scripts in MATLAB give me a considerable speed-up?
As a side question, are there better options than MATLAB? (probably compiled languages like C?)
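For comparison with the grep/awk pipeline above, here is a minimal single-pass sketch in Python. The function name extract_field, the sample lines and the marker ENERGY are hypothetical; the default field index 14 mirrors awk '{print $14}'. The pattern is a Python regular expression, which is not identical to grep's syntax, and whether this is any faster than grep plus awk depends on the data and should be measured rather than assumed.

```python
import re
from typing import Iterable, Iterator


def extract_field(lines: Iterable[str], pattern: str, field: int = 14) -> Iterator[str]:
    """Yield the 1-based whitespace-separated field of each line matching pattern.

    Roughly mirrors `grep PATTERN | awk '{print $14}'`, except that lines with
    fewer than `field` columns are skipped instead of producing an empty line.
    """
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            columns = line.split()
            if len(columns) >= field:
                yield columns[field - 1]


if __name__ == "__main__":
    # Hypothetical sample data: the first line has exactly 14 columns.
    sample = [
        "ENERGY " + " ".join(str(i) for i in range(2, 15)),
        "OTHER line that does not match",
        "ENERGY too short",
    ]
    for value in extract_field(sample, r"ENERGY"):
        print(value)
```

Because the function only iterates over its input, an open file object can be passed directly, so a file larger than 1 GB is processed line by line without being loaded into memory.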

Finding RNAs and information in a region

I want to find novel and known RNAs and transcripts in a sequence of about 10 KB. What is the easiest way to start, using bioinformatics tools, if that sequence is not well annotated in the Ensembl and UCSC browsers? Are spliced ESTs and RNA sequencing data an option? I am new to bioinformatics, so your suggestions would be useful to me.
Thanks in advance
I am a bit unclear on what exactly your desired end product or output would look like, but I would suggest doing multiple sequence alignments and looking for those with high scores. Chances are that this 10 KB sequence contains some of those known sequences but that they won't match exactly, so I think you want a program that gives you alignment scores and not just simple matches. I use Perl in combination with Clustal to make alignments. Basically, you will need to make .fasta or .aln files containing both the 10 KB sequence and a known sequence of interest, following those file formats' respective conventions. You can use the GUI version of Clustal if you are not too programming savvy. If you want to use Perl, here is a script I wrote for aligning a whole directory of .fasta files. It can perform many alignments in one fell swoop. NOTE: you must edit the clustal executable path in the last line (system call) to match its location on your computer for this script to function.
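#!/usr/bin/perl
use strict;
use warnings;

# Align every file in a directory with ClustalW, one alignment per input file.
print "Please type the directory containing the protein fasta files to align (end the directory path with a / or this will fail!): ";
my $directory = <STDIN>;
chomp $directory;
opendir(my $dh, $directory) or die "Cannot open $directory: $!";
# Keep regular files only; this also skips the "." and ".." entries.
my @files = grep { -f "$directory$_" } readdir $dh;
closedir $dh;
my $add = "_align.fasta";
foreach my $file (@files) {
    my $infile = "$directory$file";
    (my $fileprefix = $infile) =~ s/\.[^.]+$//;   # strip the file extension
    my $outfile = "$fileprefix$add";
    # Edit the executable path below to match the location of clustalw2 on your computer.
    system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA";
}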
Do you have a linux server or computer or are you relying on web and windows-based programs?
To align RNA-seq reads, people generally use splice read aligners like Tophat, although BLAST would probably work too.
Initially I wrote a long response explaining how to do this in Linux, but I've just realised that Galaxy might be a much easier solution for a beginner. Galaxy is an online bioinformatics tool with a very user-friendly interface; it's particularly designed for beginners. You can sign up and log in at this website: https://main.g2.bx.psu.edu/
There are tutorials on how to do things (see 'Help' menu) but my basic workflow for your experiment would go something like this:
Log into Galaxy
Upload RNA-seq reads, EST reads and 10K genome sequence
In the menu on the left, click to expand "NGS-RNA sequencing", then click "Tophat for Illumina" (assuming your RNA-seq reads are Illumina fastq reads)
Align your RNA-seq reads using Tophat, make sure to select your 10K sequence as the reference genome.
Try aligning your EST reads with one of the programs. I'm not sure how successful this will be: Tophat isn't designed to work with long sequences, so you might have to play around or be a bit creative to get this working.
Use Cufflinks to create annotation for novel gene models, based on your RNA-seq reads and/or EST sequences.
Regarding viewing the output, I'm not sure what is available for a custom reference sequence on Windows, you might have to do a bit of research. For Linux/Mac, I'd recommend IGV.

FindFirstFile Multiple file types

Is it possible to use the Windows API function FindFirstFile to search for multiple file types, e.g. *.txt and *.doc, at the same time?
I tried to separate the patterns with '\0', but it does not work: it searches only for the first pattern (I guess that's because it treats '\0' as the end of the string).
Of course, I can call FindFirstFile with the *.* pattern and then check my patterns, or call it once for every pattern, but I don't like this idea. I will use it only if there are no other solutions.
This is not supported. Run it twice with different wildcards. Or use *.* and filter the result. This is definitely the better choice, wildcards are ambiguous anyway due to support for legacy MS-DOS 8.3 filenames. A wildcard like *.doc will find both .doc and .docx files for example. A filename like longfilename.docx also creates an entry named LONGFI~1.DOC
The MSDN docs mention nothing about FindFirstFile allowing multiple search patterns, hence it doesn't exist.
In this case your best bet is to scan using an open selection (like C:\\some directory\* or *) and then filter based on WIN32_FIND_DATA's cFileName member, using strrchr (or the appropriate Unicode variant) to find the extension. It should run pretty fast for the small set of characters that make up a file extension.
If you know that all the extensions are, say, 3 characters long, you should be able to mask them off as *.??? to speed things up.
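The "enumerate everything, then filter on the extension" approach described above can be sketched independently of the Win32 API. The snippet below is Python rather than C, and the function name filter_by_extension and the sample file names are hypothetical; in a real Win32 program the names would come from the cFileName member of WIN32_FIND_DATA. Splitting at the last dot plays the role of strrchr.

```python
from typing import Iterable, List


def filter_by_extension(names: Iterable[str], extensions: Iterable[str]) -> List[str]:
    """Keep the names whose final extension is in `extensions` (case-insensitive)."""
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    kept = []
    for name in names:
        # rpartition splits at the LAST dot, much like strrchr(name, '.') in C.
        _stem, dot, ext = name.rpartition(".")
        if dot and ext.lower() in wanted:
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical directory listing.
    listing = ["a.txt", "b.DOC", "c.docx", "readme", "LONGFI~1.DOC"]
    print(filter_by_extension(listing, ["txt", "doc"]))
```

Comparing the whole extension keeps c.docx out of the result, which avoids the *.doc versus .docx ambiguity mentioned above.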

very long lines - windows grep character (not a line based) tool

Is there a grep-like tool for Windows where I can restrict the number of characters it outputs for a line in which the searched-for pattern is found?
One of the upstream software systems generates huge text files, which we then feed as input to our system.
Sometimes the input files get corrupted and I need to do a quick textual search to find out whether particular bits of data are missing. To make it even worse, the input file is just one very, very long line of text, and when I use grep or findstr the result of the search is a huge chunk of text.
I am wondering how I can limit the number of characters grep shows before/after the pattern I searched for.
Cheers.
Two things spring to my mind:
Call grep with the --only-matching option so that only the text that matches is emitted. Depending on your regex, this may or may not help.
Write a very simple executable, call it trunc, which reads from stdin line by line and outputs the first n characters of each line to stdout. Then simply pipe the output from grep to trunc.
The latter option is relatively simple. If you didn't want to go the whole hog and produce a proper native exe it could be quite easily achieved with a Perl/Python/Ruby etc. script.
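As a rough illustration of the script route, here is a minimal Python sketch. It is a variation on the two ideas above rather than an exact trunc: instead of cutting each line to its first n characters, it prints a fixed window of characters around every match, which is what the question asks for. The function name matches_with_context and its parameters are hypothetical, and the text is held in memory as a single string, which may be a problem for extremely large files.

```python
import re
from typing import Iterator


def matches_with_context(text: str, pattern: str, before: int = 20, after: int = 20) -> Iterator[str]:
    """Yield each regex match together with up to `before`/`after` surrounding characters."""
    regex = re.compile(pattern)
    for match in regex.finditer(text):
        start = max(0, match.start() - before)       # clamp at the start of the text
        end = min(len(text), match.end() + after)    # clamp at the end of the text
        yield text[start:end]


if __name__ == "__main__":
    # Hypothetical one-line input: 50 x's, the search term, 50 y's.
    one_long_line = "x" * 50 + "NEEDLE" + "y" * 50
    for snippet in matches_with_context(one_long_line, "NEEDLE", before=5, after=5):
        print(snippet)  # prints xxxxxNEEDLEyyyyy
```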

Is there a standard format for describing a flat file?

Is there a standard or open format which can be used to describe the formatting of a flat file? My company integrates many different customer file formats. With an XML file it's easy to get or create an XSD to describe the XML file format. I'm looking for something similar to describe a flat file format (fixed width, delimited etc.). Stylus Studio uses a proprietary .conv format to do this. That .conv format can be used at runtime to transform an arbitrary flat file to an XML file. I was just wondering if there was any more open or standards-based method for doing the same thing.
I'm looking for one method of describing a variety of flat file formats whether they are fixed width or delimited, so CSV is not an answer to this question.
XFlat:
http://www.infoloom.com/gcaconfs/WEB/philadelphia99/lyons.HTM#N29
http://www.unidex.com/overview.htm
For complex cases (e.g. log files) you may consider a lexical parser.
About selecting existing flat file formats: There is the Comma-separated values (CSV) format. Or, more generally, DSV. But these are not "fixed-width", since there's a delimiter character (such as a comma) that separates individual cells. Note that though CSV is standardized, not everybody adheres to the standard. Also, CSV may be too simple for your purposes, since it doesn't allow a rich document structure.
In that respect, the standardized and only slightly more complex (but thus more useful) formats JSON and YAML are a better choice. Both are supported out of the box by plenty of languages.
Your best bet is to have a look at all languages listed as non-binary in this overview and then determine which works best for you.
About describing flat file formats: This could be very easy or difficult, depending on the format. Though in most cases easier solutions exist, one way that will work in general is to view the file format as a formal grammar, and write a lexer/parser for it. But I admit, that's quite† heavy machinery.
If you're lucky, a couple of advanced regular expressions may do the trick. Most formats will not lend themselves to that, however.‡ If you plan on writing a lexer/parser yourself, I can recommend PLY (Python Lex-Yacc). But many other solutions exist, in many different languages, a lot of them more convenient than the old-school Lex & Yacc. For more, see What parser generator do you recommend?
  †: Yes, that may be an understatement.
  ‡: Even properly describing the email address format is not trivial.
COBOL (whether you like it or not) has a standard format for describing fixed-width record formats in files.
Other file formats, however, are somewhat simpler to describe. A CSV file, for example, is just a list of strings. Often the first row of a CSV file is the column names -- that's the description.
There are examples of using JSON to formulate metadata for text files. This can be applied to JSON files, CSV files and fixed-format files.
Look at http://www.projectzero.org/sMash/1.1.x/docs/zero.devguide.doc/zero.resource/declaration.html
This is IBM's sMash (Project Zero) using JSON to encode metadata. You can easily apply this to flat files.
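To make the idea of JSON metadata for a flat file concrete, here is a minimal sketch in Python. The descriptor layout (the keys type, fields, name, start and width) is invented for illustration; it is not sMash's schema or any published standard, and the function name parse_fixed_width and the sample record are likewise hypothetical.

```python
import json
from typing import Dict

# Hypothetical descriptor for a fixed-width record; not a standard format.
DESCRIPTOR_JSON = """
{
  "type": "fixed",
  "fields": [
    {"name": "id",     "start": 0,  "width": 5},
    {"name": "name",   "start": 5,  "width": 10},
    {"name": "amount", "start": 15, "width": 8}
  ]
}
"""


def parse_fixed_width(line: str, descriptor: Dict) -> Dict[str, str]:
    """Slice one fixed-width line into named fields according to the descriptor."""
    record = {}
    for field in descriptor["fields"]:
        start = field["start"]
        raw = line[start:start + field["width"]]
        record[field["name"]] = raw.strip()  # drop the padding spaces
    return record


if __name__ == "__main__":
    descriptor = json.loads(DESCRIPTOR_JSON)
    # 5-character id, 10-character name padded with spaces, 8-character amount.
    print(parse_fixed_width("00042Alice     00012.50", descriptor))
```

A delimited layout could be described in the same style, for example with a type of "delimited" and a delimiter key, but that extension is equally an assumption.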
At the end of the day, you will probably have to define your own file standard that caters specifically to your storage needs. What I suggest is using xml, YAML or JSON as your internal container for all of the file types you receive. On top of this, you will have to implement some extra validation logic to maintain meta-data such as the column sizes of the fixed width files (for importing from and exporting to fixed width). Alternatively, you can store or link a set of metadata to each file you convert to the internal format.
There may be a standard out there, but it's too hard to create 'one size fits all' solutions for these problems. There are entity relationship management tools out there (Talend, others) that make creating these mappings easier, but you will still need to spend a lot of time maintaining file format definitions and rules.
As for enforcing column width, xml might be the best solution as you can describe the formats using xml schemas (with the length restriction). For YAML or JSON, you may have to write your own logic for this, although I'm sure someone else has come up with a solution.
See XML vs comma delimited text files for further reference.
I don't know if there is any standard or open format to describe a flat file format. But one industry has done this: the banking industry. Financial institutions communicate using standardized messages over a dedicated network called SWIFT. SWIFT messages were originally positional (before SWIFTML, the XMLified version). I don't know if it's a good suggestion as it's kinda obscure, but maybe you could look at the SWIFT Formatting Guide; it may give you some ideas.
That said, check out Flatworm, a humble flat file parser. I've used it to parse positional and/or CSV files and liked its XML descriptor format. It may be a better suggestion than SWIFT :)
CSV
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. However, if a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
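The quoting rules just described can be checked with Python's standard csv module. This is only a small sketch: the helper names to_csv_line and from_csv_text are hypothetical, and the line terminator is set to "\n" purely to keep the output easy to read.

```python
import csv
import io
from typing import List


def to_csv_line(fields: List[str]) -> str:
    """Serialise one record; csv.writer quotes fields only where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue()


def from_csv_text(text: str) -> List[List[str]]:
    """Parse CSV text back into a list of records."""
    return list(csv.reader(io.StringIO(text)))


if __name__ == "__main__":
    # One field with a comma, one with double quotes, one plain field.
    line = to_csv_line(["a,b", 'say "hi"', "plain"])
    print(line, end="")          # prints "a,b","say ""hi""",plain
    print(from_csv_text(line))   # round-trips to the original three fields
```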
The CSV entry on wikipedia allowed me to find a comparison of data serialization formats that is pretty much what you asked for.
The only similar thing I know of is Hachoir, which can currently parse 70 file formats:
http://bitbucket.org/haypo/hachoir/wiki/Home
I'm not sure if it really counts as a declarative language, since it's plugin parser based, but it seems to work, and is extensible, which may meet your needs just fine.
As an aside, there are interesting standardised, extensible flat-file FORMATS, such as IFF (Interchange File Format).
