One-line program to delete files with few header lines - Windows

This is a follow-up to my earlier question, perl one-liner to keep only desired lines. Here I have many *.fa files in a folder.
Suppose there are three files: 1.fa, 2.fa, and 3.fa.
Their contents are as follows:
1.fa
>djhnk_9
abfgdddcfdafaf
ygdugidg
>kjvk.80
jdsfkdbfdkfadf
>jnck_q2
fdgsdfjghsjhsfddf
>7ytiu98
ihdlfwdfjdlfl]ol
2.fa
>cj76
dkjfhkdjcfhdjk
>67q32
nscvsdkvklsflplsad
>kbvbk
cbjfdikjbfadkjfbka
3.fa
>1290.5
mnzmnvjbsdjb
The lines that start with a > are the headers and the rest are the feature lines.
I want to delete those files that have 3 or fewer header lines. Here, file 2.fa and file 3.fa should be deleted.
As I am working on a Windows system, I would prefer to use a one-line Perl command, something like:
for %%F in ("*.fa") do perl ...
Is there a one-line program for that?

Use a program. "One-liners" are inscrutable, non-portable, and very hard to debug.
This does what you ask. I hope it's clear that I have commented out the unlink call for testing purposes: it would be a pain to regenerate the *.fa files each time.
You will probably want to change '[0-9].fa' to just '*.fa'. I had other files in my own directory that I didn't want to be considered.
use strict;
use warnings 'all';

while ( my $file = glob '[0-9].fa' ) {
    open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
    my $headers = grep /^>/, <$fh>;
    #unlink $file if $headers <= 3;
    print qq{deleting "$file"\n} if $headers <= 3;
}
output
deleting "2.fa"
deleting "3.fa"

Next time, please try to write some code by yourself to solve the problem, and only then come and ask for help. You will learn more if you do that, and we won't feel like you're just asking us to write your code for you.
The problem is very simple though, so here's a solution.
Note that this solution should be considered a quick fix. Borodin suggested a cleaner, easier-to-understand, and more portable way to do this in his answer above.
I would suggest doing this with perl like this:
perl -nE "$count{$ARGV}++ if /^>/; END { unlink grep { $count{$_} <= 3 } keys %count }" *.fa
(For the record, I'm using double quotes " as the delimiters of the code string since you are on Windows; anyone wishing to use this on a Unix system should just change the double quotes " to single quotes '.)
Explanations:
-n surrounds the code with while(<>){...}, which reads the input files line by line, one file after another.
With $count{$ARGV}++ if /^>/ we count the number of headers in each file: $ARGV holds the name of the file currently being read, and /^>/ is true only if the line starts with >, i.e. it is a header line.
Finally (the END { .. } part), we delete (with the unlink function) the files that have 3 headers or fewer: keys %count gives all the file names, and grep { $count{$_} <= 3 } keeps only those with 3 or fewer header lines, which are then passed to unlink. A longhand version of the same logic is sketched below.
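Here is roughly what that one-liner looks like when written out as an ordinary script. This is only an illustrative sketch (the script name count_headers.pl is made up); pass the *.fa files on the command line, just as with the one-liner.
use strict;
use warnings;

my %count;
while (<>) {                     # like -n: read every line of every file named on the command line
    $count{$ARGV}++ if /^>/;     # $ARGV is the name of the file currently being read
}

# like the END block: delete every file that collected 3 or fewer header lines
# (note: a file with no header lines at all never appears in %count, so it is left alone)
unlink grep { $count{$_} <= 3 } keys %count;
Run it as perl count_headers.pl *.fa.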

Related

For loop within a for loop for iterating files of different extensions

Say I have 20 different files. The first 10 files end with .counts.tsv and the rest end with .libsize.tsv. For each .counts.tsv file there is a matching .libsize.tsv file. I would like to use a for loop to select each pair of files and run an R script on those two files.
Here is what I tried:
#!/bin/bash
arti='/home/path/tofiles'
for counts in ${arti}/*__counts.tsv ; do
    for libsize in "$arti"/*__libsize.tsv ; do
        Rscript score.R ${counts} ${libsize}
    done;
done;
The above shell script iterates over the files more than 200 times, whereas I have only 20 files. I need the Rscript to be executed 10 times, once per pair of files. Any suggestions would be appreciated.
I started typing up an answer before seeing your comment that you're only interested in a bash solution; I'm posting it anyway in case someone finds this question in the future and is open to an R-based solution.
If I were approaching this from scratch, I'd probably just use an R function defined in the file that takes the two file names, instead of messing around with system() calls, but this would provide the behavior you desire.
## Get a vector of files matching each extension
counts_names <- list.files(path = ".", pattern = "*.counts.tsv")
libsize_names <- list.files(path = ".", pattern = "*.libsize.tsv")
## Get the root names of the files before the extensions
counts_roots <- gsub(".counts.tsv$", "", counts_names)
libsize_roots <- gsub(".libsize.tsv$", "", libsize_names)
## Get only root names that have both file types
shared_roots <- intersect(libsize_roots, counts_roots)
## Loop through the shared root names and execute an Rscript call based on the two files
for (i in seq_along(shared_roots)) {
    counts_filename <- paste0(shared_roots[[i]], ".counts.tsv")
    libsize_filename <- paste0(shared_roots[[i]], ".libsize.tsv")
    Command <- paste("Rscript score.R", counts_filename, libsize_filename)
    system(Command)
}
Construct the second filename with ${counts%counts.tsv} (strip the trailing counts.tsv from the first filename, then append libsize.tsv).
#!/bin/bash
arti='/home/path/tofiles'
for counts in ${arti}/*__counts.tsv ; do
    libsize="${counts%counts.tsv}libsize.tsv"
    Rscript score.R "${counts}" "${libsize}"
done
EDIT:
Less safe is trying to make it a one-liner. When the filenames contain no spaces or newlines, you can risk an accident with
echo ${arti}/*counts.tsv ${arti}/*.libsize.tsv | xargs -n2 Rscript score.R
and when you feel really lucky (with no files other than those tsv files in $arti) make a bungee jump with
echo ${arti}/* | xargs -n2 Rscript score.R
Have you tried list.files in base R? This will allow you to loop over all the files in the folder.
arti <- '/home/path/tofiles'
for (i in list.files(arti)) {
    # run the script on file i
}
See whether the below helps.
my_list = list.files("./Data")
counts = grep("counts.tsv", my_list, value = T)
libsize = grep("libsize.tsv", my_list, value = T)
for (i in seq(length(counts))) {
    system(paste("Rscript score.R", counts[i], libsize[i]))
}
Finally,
I tried the following and it helped me,
for sam in "$arti"/*__counts.tsv ; do
    filebase=$(basename $sam)
    samples=$(ls -1 ${filebase} | awk -F'[-1]' '{print $1}')
    Rscript score.R ${samples}__counts.tsv ${samples}__libsize.tsv
done;
For someone looking for something similar :)

search&replace on huge txt files

I need a text processing tool that can perform search and replace operations PER LINE on HUGE TEXT FILES (>0.5 GB). It can be either Windows or Linux based. (I don't know if there is anything like a streamreader/writer in Linux, but I have a feeling that it would be the ideal solution. The editors I have tried so far load the whole file into memory.)
Bonus question: a tool that can MERGE two huge text files on a per-line basis, separated with e.g. tabs.
Sounds like you want sed. For example,
sed 's/foo/bar/' < big-input-file > big-output-file
should replace the first occurrence of foo by bar in each line of big-input-file, writing the results to big-output-file.
Bonus answer: I just learned about paste, which seems to be exactly what you want for your bonus question.
'sed' is built into Linux/Unix, and is available for Windows. I believe that it only loads a buffer at a time (not the whole file) -- you might try that.
What would you be trying to do with the merge -- interleaved in some way, rather than just concatenating?
Add: interleave.pl
use strict;
use warnings;

my $B;
open INA, $ARGV[0];
open INB, $ARGV[1];
while (<INA>) {
    print $_;       # a line from the first file
    $B = <INB>;
    print $B;       # the corresponding line from the second file
}
close INA;
close INB;
run: perl interleave.pl fileA fileB > mergedFile
Note that this is a very bare-bones utility. It does not check if the files exist, and it expects that the files have the same number of lines.
I would use Perl for this. It is easy to read a file line by line, it has great search/replace support using regular expressions, and it will let you do the merge as well, since you can make your Perl script aware of both files.
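As a minimal sketch of that idea (the file names and the foo/bar pattern are placeholders; adjust them to your case), the following streams the input one line at a time, so memory use stays small however big the file is:
use strict;
use warnings;

# Placeholders: change the file names and the s/foo/bar/ substitution as needed.
open my $in,  '<', 'big-input-file'  or die "Cannot open input: $!";
open my $out, '>', 'big-output-file' or die "Cannot open output: $!";

while ( my $line = <$in> ) {
    $line =~ s/foo/bar/;      # per-line search and replace
    print {$out} $line;
}

close $in;
close $out;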

Shell scripting - save to a file, so the file will always have the last 10 values added

I found myself quite stumped. I am trying to output data from a script to a file.
However, I need to keep only the last 10 values, so a plain append won't work.
The main script outputs one line, so I save it to a file. I use tail to get the last 10 lines and process them, but then I get to the point where the file is too big, because I keep appending lines to it (the script outputs a line every minute or so, which grows the log quite fast).
I would like to limit what I keep in that file, so I always have only the last 10 lines and discard the rest.
I have thought about different approaches, but they all involve a lot of activity, like creating temp files, deleting the original file, and creating a new file with just the last 10 entries; it feels inelegant and amateurish.
Is there a quick and clean way to manage the file, so I can add lines until I hit 10, and then start deleting the oldest lines as the new ones are added at the bottom?
Maybe things are easier than what I think, and there is a simple solution that I cannot see.
Thanks!
In general, it is difficult to remove data from the start of a file. The only way to do it is to overwrite the file with the tail that you wish to keep. It isn't that ugly to write, though. One fairly reasonable hack is to do:
{ rm file; tail -9 > file; echo line 10 >> file; } < file
This will retain the last 9 lines and add a 10th line. There is a lot of redundancy, so you might like to do something like:
append() { test -f $1 && { rm $1; tail -9 > $1; } < $1; cat >> $1; }
And then invoke it as:
echo 'the new 10th line' | append file
Please note that this hack of redirecting input from the same file that is later used for output is a bit fragile and obscure. It is entirely possible for the script to be interrupted and the file deleted! It would be safer and more maintainable to explicitly use a temporary file, as sketched below.
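For illustration only, here is what the explicit temporary-file approach could look like, written as a small Perl script rather than shell (the file name and the appended line are made-up examples): keep at most the last 10 lines, write them to a scratch file, then rename the scratch file over the original.
use strict;
use warnings;

my $file     = 'file';                 # placeholder: the log file to maintain
my $new_line = "the new 10th line\n";  # placeholder: the line to append

# Read whatever is already there (a missing file is fine: we just start empty).
my @lines;
if ( open my $in, '<', $file ) {
    @lines = <$in>;
    close $in;
}

push @lines, $new_line;
splice @lines, 0, @lines - 10 if @lines > 10;   # keep only the last 10 lines

# Write to a scratch file first, then rename it over the original.
my $tmp = "$file.tmp.$$";
open my $out, '>', $tmp or die "Cannot write $tmp: $!";
print {$out} @lines;
close $out or die "Cannot close $tmp: $!";
rename $tmp, $file or die "Cannot replace $file: $!";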

replace $1 variable in file with 1-10000

I want to create thousands of copies of this one file.
All I need to replace in the file is one variable:
kitename = $1
But I want to do that thousands of times to create thousands of different files.
I'm sure it involves a loop.
People answering people is more effective than a Google search!
Thanks!
I'm not really sure what you are asking here, but the following will create 1000 files named filename.n, each containing the single line "kitename = n", for n = 1 to n = 1000:
for i in {1..1000}
do
echo "kitename = $i" > filename.$i
done
If you have MySQL installed, it comes with a lovely command-line utility called "replace", which does search-and-replace in place across any number of files. Too few people know about this, given that it exists on most Linux boxes everywhere. The syntax is easy:
replace SEARCH_STRING REPLACEMENT -- targetfiles*
If you MUST use sed for this... that's okay too :) The syntax is similar:
sed -i.bak s/SEARCH_STRING/REPLACEMENT/g targetfile.txt
So if you're just using numbers, you'd use something like:
for a in {1..1000}
do
cp inputFile.html outputFile-$a.html
replace kitename $a -- outputFile-$a.html
done
This will produce a bunch of files "outputFile-1.html" through "outputFile-1000.html", with the word "kitename" replaced by the relevant number, inside the file.
But, if you want to read your lines from a file rather than generate them by magic, you might want something more like this (we're not using "for a in $(cat file)" since that splits on words, and I'm assuming here you might have multi-word replacement strings that you'd want to put in):
cat kitenames.txt | while read -r a
do
cp inputFile.html "outputFile-$a.html"
replace kitename "$a" -- "outputFile-$a.html"
done
This will produce a bunch of files like "outputFile-red kite.html" and "outputFile-kite with no string.html", which have the word "kitename" replaced by the relevant name, inside the file.

Why can't my Perl script find the file when I run it from Windows?

I have a Perl script which was built on a Linux platform using Perl 5.8. However, now I am trying to run the script from the command prompt on a Windows platform with the same Perl version.
I am using the command perl rgex.pl, but it gives me a whole chunk of errors which look to me as if they have already been resolved in the script itself. The weird thing is that I am able to run another Perl script without problems, one consisting of simple functions such as print, input, etc.
The Code:
#!/usr/bin/perl
use warnings;
use strict;
use Term::ANSIColor;
my $file = "C:\Documents and Settings\Desktop\logfiles.log";
open LOG, $file or die "The file $file has the error of:\n => $!";
my @lines = <LOG>;
close (LOG);
my $varchar = 0;
foreach my $line ( @lines ) {
    if ( $line =~ m/PLLog/ )
    {
        print("\n\n\n");
        my $coloredText = colored($varchar, 'bold underline red');
        print colored ("POS :: $coloredText\n\n", 'bold underline red');
        $varchar ++;
    }
    print( $line );
}
When I run on the windows command prompt it gives me errors such as:
Unrecognized escape \D passed through at rgex.pl line 7.
=> No such file or directory at rgex.pl line 8.
Please give me some advice on the code. Thanks.
A \ in a Perl string enclosed in double quotes marks the beginning of an escape sequence, like \n for newline or \t for tab. Since you want \ to be treated literally, you need to escape it as \\:
my $file = "C:\\Documents and Settings\\Desktop\\logfiles.log";
Since you are not interpolating any variables in the string it's better to use single quotes:
my $file = 'C:\Documents and Settings\Desktop\logfiles.log';
(Inside single quotes, \ is not special unless the next character is a backslash or single quote.)
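As a tiny demonstration of the difference (just a throwaway snippet using the path from the question), both of the following print the same path, but only the double-quoted form needs the backslashes doubled:
use strict;
use warnings;

# Double quotes: backslashes start escape sequences, so they must be doubled.
my $double = "C:\\Documents and Settings\\Desktop\\logfiles.log";

# Single quotes: backslashes are taken literally.
my $single = 'C:\Documents and Settings\Desktop\logfiles.log';

print "$double\n";
print "$single\n";    # same output as the line above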
These error messages are pretty clear. They tell you exactly which lines the problems are on (unlike some error messages, which tell you the line where Perl first thought "Hey, wait a minute!").
When you run into these sorts of problems, reduce the program to just the problematic lines and start working on them. Start with the first errors, since they often cascade into the errors that you see later.
When you want to check the value that you get, print it to ensure it is what you think it is:
my $file = "C:\\D....";
print "file is [$file]\n";
This would have shown you very quickly that there was a problem with $file, and once you know where the problem is, you're most of the way to solving it.
This is just basic debugging technique.
Also, you're missing quite a bit of the basics, so going through a good Perl tutorial will help you immensely. There are several listed in perlfaq2 or perlbook. Many of the problems that you're having are things that Learning Perl deals with in the first couple of chapters.
