How to transform all the numbers in a text file - bash

I have an XML file that contains, among other things, numbers. Something like:
<things>
<a name="cat">
<vecs>(100,20),(200,40),(50,85)</vecs>
</a>
<b name="dog">
<vecs>(0,10),(5,75)</vecs>
<ratio>85.5</ratio>
</b>
... many more elements and numbers ...
</things>
Unfortunately all of the numbers within <vecs> elements in my file are 4 times larger than they should be. I need to multiply them all by 0.25. Numbers in <ratio> and other elements are fine. So for example the first <vecs> line above should read:
<vecs>(25,5),(50,10),(12.5,21.25)</vecs>
Is there a convenient solution (e.g. UNIX command line tool, bash script, etc.) to processing the file so that I can find all the numbers that live within a particular context (e.g. between <vecs> and </vecs>), perform a mathematical operation on them, and replace the existing numeric text in each instance with the result of the operation?
And no, I'm not asking you to write a whole program to solve this particular problem in detail. I'm wondering if there is an existing tool for such purposes or a clever combination of existing tools that could accomplish the job.

The problem itself is fairly easy, but the input syntax is uncommon enough that you need a general-purpose scripting language to tackle it. For example, in Python you would write something like this:
from __future__ import print_function
import re

def transform(match):
    # scale both coordinates by 0.25; %g drops trailing zeros
    return '(%g,%g)' % (int(match.group(1)) * 0.25,
                        int(match.group(2)) * 0.25)

for line in open('test.xml'):
    if '<vecs>' in line:
        print(re.sub(r'\((\d+),(\d+)\)', transform, line), end='')
    else:
        print(line, end='')
For particular problems like this your best bet is to learn a scripting language and use it to solve them.
If you want to use Unix tools for this kind of thing, sed and awk are your friends.
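To make that concrete: sed alone cannot do arithmetic in a replacement, but awk can. Here is a minimal sketch, assuming every number on a <vecs> line should be scaled (including any that might appear in attributes on the same line):

```shell
# scale_vecs: multiply every number on lines containing <vecs> by 0.25,
# leaving all other lines untouched
scale_vecs() {
    awk '
    /<vecs>/ {
        out = ""; rest = $0
        # repeatedly find the next number, scale it, and keep the text around it
        while (match(rest, /[0-9]+(\.[0-9]+)?/)) {
            out = out substr(rest, 1, RSTART - 1) (substr(rest, RSTART, RLENGTH) * 0.25)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
        next
    }
    { print }
    '
}

printf '<vecs>(100,20),(200,40),(50,85)</vecs>\n' | scale_vecs
# -> <vecs>(25,5),(50,10),(12.5,21.25)</vecs>
```

You would run it as `scale_vecs < input.xml > output.xml`; for anything beyond this line-oriented pattern, a real XML-aware tool or scripting language is safer.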

Related

Use Python to generate input to XeTeX

I am wondering what a good way is to use Python to generate an arithmetic worksheet consisting only of 2-digit numbers squared. Specifically, I want to be able to call a Python program that asks me for parameters such as the range of numbers it can draw on to square and the number of questions I want to generate. Once that is done, the program will generate the numbers, automatically open a .tex file (already with preamble and customizations), and write a loop of questions like this:
\begin{exer}
n^2
\end{exer}
%%%%%Solution%%%%%%
\begin{solution}
n^2=n^2
\end{solution}
for some integer n.
Once it is done writing the .tex file, it will run XeTeX and output a PDF ready to use and print. Any help or suggestions? Python is preferred but not mandatory.
Actually your problem is so simple that it doesn't require any special magic. However, I would suggest you don't append your generated content to the file that already holds your preamble; good practice is to leave that file untouched and include it (you can copy it at generation time, or use TeX's \include).
Now, on to the generation. Python's string formatter is your friend here: use the example you've given as a template, and write the filled-in product to a file on every iteration. Don't forget to escape "{" braces by doubling them, as they are the placeholder symbols the formatter uses.
At the end, (suggestion) you can use the subprocess module to launch XeTeX; depending on your needs, call() is enough, or use Popen().
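A minimal sketch of the whole pipeline follows. The file layout is an assumption (a separate preamble.tex that is \input, never modified), and note the doubled braces that str.format requires around literal TeX braces:

```python
import random
import subprocess  # for launching xelatex once the file is written

# Template for one question; {{ and }} produce literal TeX braces,
# since single braces are str.format placeholders.
EXER = ("\\begin{{exer}}\n"
        "{n}^2\n"
        "\\end{{exer}}\n"
        "%%%%%Solution%%%%%%\n"
        "\\begin{{solution}}\n"
        "{n}^2={sq}\n"
        "\\end{{solution}}\n")

def make_worksheet(path, n_questions=10, lo=10, hi=99):
    """Write a worksheet of squared two-digit numbers to `path`.
    Assumes the preamble lives in its own preamble.tex."""
    with open(path, "w") as tex:
        tex.write("\\input{preamble}\n\\begin{document}\n")
        for _ in range(n_questions):
            n = random.randint(lo, hi)
            tex.write(EXER.format(n=n, sq=n * n))
        tex.write("\\end{document}\n")

# make_worksheet("worksheet.tex", n_questions=20)
# subprocess.call(["xelatex", "worksheet.tex"])  # then print the PDF
```

The prompts for range and question count would just be input() calls feeding lo, hi, and n_questions.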

How to find foreign language used in "C comments"

I have a large source codebase where most of the documentation and source code comments are in English. But one of the minor contributors wrote comments in a different language, spread across various places.
Is there a simple trick that will let me find them ? I imagine first a way to extract all comments from the code and generate a single text file (with possible source file / line number info), then pipe this through some language detection app.
If it matters, I'm on Linux and the current compiler on this project is Clang.
The only thing that comes to mind is to go through all of the code manually and check it yourself. If the other language is a similar one that doesn't use foreign letters, consider using something with a spellchecker: text that isn't recognized will get underlined and be easy to spot.
Other than that, I don't see an easy way to go through with this.
You could write a program that reads the files and prints only the comments to another output file, and then spell-check that file, but this may be a waste of time, as you could easily spot the comments yourself.
If you do make a program for that, however, keep in mind that there are three things to check for:
If comment starts with /*, make sure it stops reading when encountering */
If comment starts with //, only read one line - unless:
If line starting with // ends with \, read next line as well
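The three rules above translate almost directly into a regular expression. Here is a minimal Python sketch (the function name is made up, and comment markers inside string literals are deliberately not handled; a real tool would need a tokenizer):

```python
import re

# rule 1: /* ... stop at the first */   (re.S lets . cross newlines)
# rules 2 and 3: // to end of line, continuing while the line ends in \
COMMENT_RE = re.compile(
    r"/\*.*?\*/"
    r"|//(?:[^\n]*\\\n)*[^\n]*",
    re.S)

def extract_comments(source):
    """Return (line_number, comment_text) pairs from C-like source."""
    return [(source.count("\n", 0, m.start()) + 1, m.group())
            for m in COMMENT_RE.finditer(source)]
```

Piping each extracted comment through a language detector, or just a spell checker as suggested above, is then straightforward.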
While it is possible to detect a language from a string automatically, you need way more words than fit in a usual comment to do so.
Solution: Use your own eyes and your own brain...

How to split a large csv file into multiple files in GO lang?

I am a novice Go programmer trying to learn the language's features. I want to split a large CSV file into multiple files in Go, each file containing the header. How do I do this? I have searched everywhere but couldn't find the right solution. Any help in this regard will be greatly appreciated.
Also, please suggest a good book for reference.
Thank you
Depending on your shell fu this problem might be better suited to common shell utilities, but you specifically mentioned Go.
Let's think through the problem.
How big is this csv file? Are we talking 100 lines or is it 5G ?
If it's smallish I typically use this:
http://golang.org/pkg/io/ioutil/#ReadFile
However, this package also exists:
http://golang.org/pkg/encoding/csv/
Regardless - let's return to the abstraction of the problem. You have a header (which is the first line) and then the rest of the document.
So what we probably want to do (if ignoring csv for the moment) is to read in our file.
Then we want to split the file body by all the newlines in it.
You can use this to do so:
http://golang.org/pkg/strings/#Split
You didn't mention but do you know how many files you want to split by or would you rather split by the line count or byte count? What's the actual limitation here?
Generally it's not going to be file count but if we pretend it is we simply want to divide our line count by our expected file count to give lines/file.
Now we can take slices of the appropriate size and write the file back out via:
http://golang.org/pkg/io/ioutil/#WriteFile
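The steps above (read, split on newlines, regroup with the header, write each group back out) can be sketched like this; the names and chunking policy are assumptions, and os.WriteFile is the modern spelling of ioutil.WriteFile:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// splitCSV divides everything after the header line into nFiles groups of
// roughly equal line count, prepending the header to each group.
func splitCSV(data string, nFiles int) []string {
	lines := strings.Split(strings.TrimRight(data, "\n"), "\n")
	header, body := lines[0], lines[1:]
	per := (len(body) + nFiles - 1) / nFiles // ceiling of lines per file
	var chunks []string
	for start := 0; start < len(body); start += per {
		end := start + per
		if end > len(body) {
			end = len(body)
		}
		chunks = append(chunks, header+"\n"+strings.Join(body[start:end], "\n")+"\n")
	}
	return chunks
}

func main() {
	data := "id,name\n1,a\n2,b\n3,c\n" // stands in for the file you read in
	for i, chunk := range splitCSV(data, 2) {
		name := fmt.Sprintf("part%d.csv", i)
		if err := os.WriteFile(name, []byte(chunk), 0o644); err != nil {
			panic(err)
		}
	}
}
```

For a file too large to hold in memory, you would swap the string handling for a bufio.Scanner loop, but the shape of the solution stays the same.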
A trick I sometimes use to help think these things through is to write down our mission statement.
"I want to split a large csv file into multiple files in go"
Then I start breaking that up into pieces but take the divide/conquer approach - don't try to solve the entire problem in one go - just break it up to where you can think about it.
Also - make gratuitous use of pseudo-code until you can comfortably write the real code itself. Sometimes it helps to just write a short comment inline with how you think the code should flow, then pare it down to the smallest portion you can code and work from there.
By the way - many of the golang.org packages have example links where you can literally run in your browser the example code and cut/paste that to your own local environment.
Also, I know I'll catch some haters with this - but as for books - imo - you are going to learn a lot faster just by trying to get things working rather than reading. Action trumps passivity always. Don't be afraid to fail.
Here is a package that might help. You can set the desired chunk size in bytes and the file will be split into the appropriate number of chunks.

Finding RNAs and information in a region

I want to find novel and known RNAs and transcripts in a sequence of about 10 kb. What is the easiest way to start, using bioinformatics tools, if that sequence is not well annotated in the Ensembl and UCSC browsers? Are spliced ESTs and RNA sequencing data an option? I am new to bioinformatics; your suggestions would be useful to me.
Thanks in advance
I am a bit unclear on what exactly your desired end product or output would look like, but I might suggest doing multiple sequence alignments and looking for those with high scores. Chances are this 10 kb sequence will contain some of those known sequences, but they won't match exactly, so I think you want a program that gives you alignment scores and not just simple matches. I use Perl in combination with Clustal to make alignments. Basically, you will need to make .fasta or .aln files containing both the 10 kb sequence and a known sequence of interest, following those file formats' respective conventions. You can use the GUI version of Clustal if you are not too programming savvy. If you want to use Perl, here is a script I wrote for aligning a whole directory of .fasta files; it can perform many alignments in one fell swoop. NOTE: you must edit the Clustal executable path in the last line (system call) to match its location on your computer for this script to work.
#!/usr/bin/perl
use strict;
use warnings;
print "Please type the directory of protein fasta files to align (end the path with a / or this will fail!): ";
my $directory = <STDIN>;
chomp $directory;
opendir (DIR, $directory) or die $!;
my @files = readdir DIR;
closedir DIR;
my $add = "_align.fasta";
foreach my $file (@files) {
    next unless $file =~ /\.fasta$/;   # skip ., .. and non-fasta entries
    my $infile = "$directory$file";
    (my $fileprefix = $infile) =~ s/\.[^.]+$//;
    my $outfile = "$fileprefix$add";
    system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA";
}
Do you have a linux server or computer or are you relying on web and windows-based programs?
To align RNA-seq reads, people generally use splice-aware read aligners like TopHat, although BLAST would probably work too.
Initially I wrote long response explaining how to do this in Linux but I've just realised that Galaxy might be a much easier solution for a beginner. Galaxy is an online bioinformatics tool with a very user friendly interface; it's particularly designed for beginners. You can sign up and log in at this website: https://main.g2.bx.psu.edu/
There are tutorials on how to do things (see 'Help' menu) but my basic workflow for your experiment would go something like this:
Log into Galaxy
Upload RNA-seq reads, EST reads and 10K genome sequence
In the menu on the left, click to expand "NGS-RNA sequencing", then click "Tophat for Illumina" (assuming your RNA-seq reads are Illumina fastq reads)
Align your RNA-seq reads using Tophat, make sure to select your 10K sequence as the reference genome.
Try aligning your EST reads with one of the programs. I'm not sure how successful this will be, Tophat isn't designed to work with long sequences so you might have to have a bit of a play or be a bit creative to get this working.
Use Cufflinks to create annotation for novel gene models, based on your RNA-seq reads and/or EST sequences.
Regarding viewing the output, I'm not sure what is available for a custom reference sequence on Windows, you might have to do a bit of research. For Linux/Mac, I'd recommend IGV.

What is the best character to use as a delimiter in a custom batch syntax?

I've written a little program to download images to different folders from the web. I want to create a quick and dirty batch file syntax and was wondering what the best delimiter would be for the different variables.
The variables might include urls, folder paths, filenames and some custom messages.
So are there any characters that cannot appear in the first three? Those would be the obvious choice to use as a delimiter. How about the good old comma?
Thanks!
You can use either:
A control character: control characters rarely show up in URLs, paths, or filenames. Tab (\t) is probably the best choice here.
Some combination of characters that is unlikely to occur in your files, e.g. #s#.
Tab is the generally preferred choice, though.
Why not just use something that exists already? There are one or two choices: Perl, Python, Ruby, bash, sh, csh, Groovy, ECMAScript, or, heaven forbid, Windows scripting files.
I can't see what you'd gain by writing yet another batch file syntax.
Tabs. And then expand or compress any tabs found in the text.
Choose a delimiter that has the least chance of collision with the names of any variable that you may have (which precludes #, /, : etc). The comma (,) looks good to me (unless your custom message has a few) or < and > (subject to previous condition).
However, you may also need to 'escape' delimiter characters occurring as part of the variables you want to delimit.
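To make that escaping concrete, here is a minimal Python sketch (function names are made up for illustration) using tab as the delimiter and backslash as the escape character:

```python
def encode(fields, delim="\t"):
    """Join fields with delim, backslash-escaping any delimiter
    or backslash that occurs inside a field."""
    esc = lambda f: f.replace("\\", "\\\\").replace(delim, "\\" + delim)
    return delim.join(esc(f) for f in fields)

def decode(record, delim="\t"):
    """Split a record back into fields, honoring the escapes."""
    fields, cur, i = [], [], 0
    while i < len(record):
        c = record[i]
        if c == "\\" and i + 1 < len(record):  # escaped char: keep literally
            cur.append(record[i + 1])
            i += 2
        elif c == delim:                       # unescaped delimiter: split here
            fields.append("".join(cur))
            cur = []
            i += 1
        else:
            cur.append(c)
            i += 1
    fields.append("".join(cur))
    return fields
```

The round trip decode(encode(fields)) returns the original fields even when a field contains the delimiter or the escape character itself.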
This sounds like a really bad idea. There is no need to create yet another (data-representation) language, there are plenty ones which might fit your needs. In addition to Ruby, Perl, etc., you may want to consider YAML.
Designing good syntax for this sort of thing is difficult and fraught with peril. Does reinventing the wheel ring a bell?
I would use '|'
It's one of the rarest characters.
How about String.fromCharCode(1) ?