I have a log file that should contain only the latest N entries. It's fine if the file occasionally grows a bit larger than that.
My first attempt is periodically running:
tail -n 20 file.log > file.log
Unfortunately, that just empties the file. I could:
tail -n 20 file.log > .file.log; mv .file.log file.log
However, that seems messy. Is there a better way?
It sounds like you are looking for logrotate.
I agree, logrotate is probably what you need. If you still want a command-line solution, this will get the job done. ex is a line editor; nobody uses line editors interactively anymore, but they are still useful in shell scripts. The syntax is for sh/ksh/bash shells (I think it's the same in the C shell). In the script below, $ moves to the last line, -20 moves back 20 lines, 1,-1d deletes everything from line 1 up to the line before the current one, then w writes the file and q quits.
ex log.001 << HERE
$
-20
1,-1d
w
q
HERE
logrotate, with size=xxx where xxx is the approximate size of 20 lines, and possibly delaycompress to keep the previous log human-readable as well.
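A hypothetical logrotate snippet along those lines (the path and the 1k threshold are placeholders; pick a size that roughly matches 20 lines of your log):
/path/to/file.log {
    size 1k
    rotate 2
    compress
    delaycompress
    missingok
    notifempty
}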
I'm at a bit of a loss on how to proceed with the task of piping log file contents to a csv file based on certain criteria.
Essentially, the problem is something like this:
Write a script that receives http logs (or any arbitrary .log file) via pipe
input, and outputs a summarized csv in the number of hits per url per day.
Example: executing the pipe command
cat access.log|some filter commands|./your_script > summary.csv
creates a text file called summary.csv with the content:
" Action and path. 2015-01-01, 2015-01-02. 2015-01-03
GET /index.php, 34, 53, 65
POST /administrator, 32, 59, 39
..."
and so forth.
The problem I'm facing at the moment is figuring out how to identify and execute specific parts of the piped input command, and apply filters, before feeding the result to the output pipe.
From what I'm familiar with, an array of command parameters (such as "cat", "gedit", ">", "|", etc.) might work, but that leaves the problem of identifying them and executing them the way a pipeline would, instead of just one after the other.
I've searched quite thoroughly, but so far I've found nothing even remotely helpful, aside from the suggestion to divide the pipe command into separate instructions and execute them one by one.
If anyone can suggest an easier and more effective way to do this, or any advice on this particular problem, it'd be much appreciated. Thanks in advance.
Perhaps you need the tee command. You can use it to "fork the pipe", which means an intermediate output file can be written at a specific point in the pipeline. It is very useful when looking for errors.
For example:
cat access.log | some filter commands | tee out01.txt \
| some other filter | tee out02.txt | ./your_script > summary.csv
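As for the summarising step itself, here is a minimal sketch of what ./your_script could look like. It assumes Apache/nginx-style lines on stdin, with a [dd/Mon/yyyy:...] timestamp and a quoted "METHOD /path ..." request, and it relies on GNU awk's asorti to sort the date columns; adjust the field extraction to your actual log format:
#!/usr/bin/env bash
awk '
{
    # the day is inside [...], e.g. [01/Jan/2015:10:00:00 +0000] -> 01/Jan/2015
    if (match($0, /\[[^]:]+/)) day = substr($0, RSTART + 1, RLENGTH - 1); else next
    # the request is inside quotes, e.g. "GET /index.php HTTP/1.1" -> GET /index.php
    if (match($0, /"[A-Z]+ [^" ]+/)) action = substr($0, RSTART + 1, RLENGTH - 1); else next
    hits[action, day]++
    days[day]; actions[action]
}
END {
    n = asorti(days, cols)                      # GNU awk only
    printf "Action and path"
    for (i = 1; i <= n; i++) printf ", %s", cols[i]
    print ""
    for (a in actions) {
        printf "%s", a
        for (i = 1; i <= n; i++) printf ", %d", hits[a, cols[i]] + 0
        print ""
    }
}'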
This is my very first post on Stack Overflow, and I should probably point out that I am EXTREMELY new to a lot of programming. I'm currently a postgraduate student doing projects that involve a lot of coding in various programs, everything from LaTeX to bash, MATLAB, etc.
If you could explain your answers explicitly that would be much appreciated, as I'm trying to learn as I go. I apologise if there is an answer elsewhere that does what I'm trying to do, but I have spent a couple of days looking now.
So to the problem I'm trying to solve: I'm currently using a selection of bioinformatics tools to analyse a range of genomes, and I'm trying to somewhat automate the process.
I have a few sequences with names that look like this for instance (all contained in folders of their own currently as paired files):
SOL2511_S5_L001_R1_001.fastq
SOL2511_S5_L001_R2_001.fastq
SOL2510_S4_L001_R1_001.fastq
SOL2510_S4_L001_R2_001.fastq
...and so on...
I basically wish to automate the process by turning these into variables and passing those variables to each of the programs I use in turn. So for example my idea thus far was to assign them as wildcards, using R1 and R2 (which appear in all the file names, as they represent each strand of DNA) as follows:
#!/bin/bash
seq1=*R1_001*
seq2=*R2_001*
On a rudimentary level this works, as it returns the correct files, so now I pass these variables to my first function which trims the DNA sequences down by a specified amount, like so:
# seqtk is the program suite, trimfq is a function within it,
# and the options -b -e specify how many bases to trim from the beginning and end of
# the DNA sequence respectively.
seqtk trimfq -b 10 -e 20 $seq1 >
seqtk trimfq -b 10 -e 20 $seq2 >
So now my problem is that I wish to append something like "_trim" to the output file name which appears after the >, but I can't find anything online that seems like it will work.
Alternatively, I've been hunting for a script that will take the name of the folder that the files are in, and create a variable for the folder name which I can then give to the functions in question so that all the output files are named correctly for use later on.
Many thanks in advance for any help, and I apologise that this isn't really much of a minimum working example to go on, as I'm only just getting going on all this stuff!
Joe
EDIT
So I modified @ghoti's for loop (it does the job wonderfully, I might add; rep for you :D) and now I prepend trim_ to the filename instead, as the loop as it was before ended up giving me a .fastq.trim extension, which will cause errors later.
Is there any way I can append _trim to the end of the filename, but before the extension?
Explicit is usually better than implicit when matching filenames. Your wildcards may match more than you expect, especially if you have versions of the files with "_trim" appended to the end!
I would be more precise with the wildcards, and use for loops to process the files instead of relying on seqtk to handle multiple files. That way, you can do your own processing on the filenames.
Here's an example:
#!/bin/bash
# Define an array of sequences
sequences=(R1_001 R2_001)
# Step through the array...
for seq in "${sequences[@]}"; do
# Step through the files in this sequence...
for file in SOL*_${seq}.fastq; do
seqtk trimfq -b 10 -e 20 "$file" > "${file}.trim"
done
done
I don't know how your folders are set up, so I haven't addressed that in this script. But the basic idea is that if you want the script to be able to manipulate individual filenames, you need something like a for loop to handle that manipulation on a per-filename basis.
Does this help?
UPDATE:
To put _trim before the extension, replace the seqtk line with the following:
seqtk trimfq -b 10 -e 20 "$file" > "${file%.fastq}_trim.fastq"
This uses Parameter Expansion, which is documented in the Bash man page if you want to read up on it. Basically, ${file%.fastq} takes the $file variable and strips off the .fastq suffix. Then we add your extra text, followed by the suffix again.
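For instance, trying the expansion on one of the filenames from the question:
file=SOL2511_S5_L001_R1_001.fastq
echo "${file%.fastq}_trim.fastq"
# prints SOL2511_S5_L001_R1_001_trim.fastq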
You could also strip an extension using basename(1), but there's no need to call something external when you can use something built in to the shell.
Instead of setting variables with the filenames, you could pipe the output of ls to the command you want to run with these filenames, like this:
ls *R{1,2}_001* | xargs -I# sh -c 'seqtk trimfq -b 10 -e 20 "$1" > "${1}_trim"' -- #
xargs -I# takes each filename produced by ls and substitutes it for # in the sh -c command, so seqtk is run once per file.
I found myself quite stumped. I am trying to output data from a script to a file.
However, I need to keep only the last 10 values, so simply appending won't work.
The main script returns one line, so I save it to a file. I use tail to get the last 10 lines and process them, but eventually the file gets too big, because I keep appending lines to it (the script outputs a line every minute or so, which grows the log quite fast).
I would like to limit how much ends up in that file, so that I always have only the last 10 lines and discard the rest.
I have thought about different approaches, but they all involve a lot of activity, like creating temp files, deleting the original file and creating a new one with just the last 10 entries; it feels inelegant and amateurish.
Is there a quick and clean way to manage the file, so that I can add lines until I hit 10, and then start deleting the oldest lines as new ones are added at the bottom?
Maybe things are easier than what I think, and there is a simple solution that I cannot see.
Thanks!
In general, it is difficult to remove data from the start of a file. The only way to do it is to overwrite the file with the tail that you wish to keep. It isn't that ugly to write, though. One fairly reasonable hack is to do:
{ rm file; tail -9 > file; echo line 10 >> file; } < file
This will retain the last 9 lines and add a 10th line. There is a lot of redundancy, so you might like to do something like:
append() { test -f "$1" && { rm "$1"; tail -9 > "$1"; } < "$1"; cat >> "$1"; }
And then invoke it as:
echo 'the new 10th line' | append file
Please note that this hack of redirecting input from the same file that is written later is a bit fragile and obscure. It is entirely possible for the script to be interrupted and delete the file! It would be safer and more maintainable to explicitly use a temporary file.
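For example, a minimal sketch of that safer variant, using mktemp (the function name append_keep10 and the mktemp call are my additions here, not part of the hack above):
append_keep10() {
    tmp=$(mktemp) || return 1
    # keep the last 9 existing lines (if the file exists), then add the new input from stdin
    { test -f "$1" && tail -n 9 "$1"; cat; } > "$tmp" && mv "$tmp" "$1"
}
echo 'the new 10th line' | append_keep10 file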
I have pretty much no experience with Cygwin & UNIX but need to use it for extracting a large set of data from an even larger set of files...
I had some help yesterday to write this short script, but (after running for ~7-8 hours) the script simply wrote to the same output file 22 times. At least that's what I think happened.
I've now changed the code to this (see below), but it would be really awesome if someone who knows how this is done properly could tell me whether it's likely to work before I waste another 8 hours...
for chr in {1..22}
do
zcat /cygdrive/g/data/really_long_filename$chr | sed '/^#/d' | cut -f1-3 >> db_to_rs_$chr
done
I want it to read files 1..22, remove rows starting with #, and send columns 1 to 3 to a file ending with the same number 1..22.
Yesterday the last part was just ...-f1-3 >> db_to_rs, which I suspect just rewrote that file 22 times?
Help is much appreciated
~L
Yes, the code would work as expected.
When the command ended in ...-f1-3 >> db_to_rs, it essentially appended all the output to the file db_to_rs.
Saying ... >> db_to_rs_$chr creates filenames ending in 1 through 22.
However, note that saying >> would append the output to a file. So if db_to_rs1 already exists, the output would be appended. If you want to create a new file instead, say > instead of >>.
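For instance, the same loop with a truncating redirect, so that each run starts the per-chromosome files from scratch instead of appending to whatever is already there:
for chr in {1..22}
do
    zcat /cygdrive/g/data/really_long_filename$chr | sed '/^#/d' | cut -f1-3 > db_to_rs_$chr
done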
I have a very long file with numbers. Something like output of this perl program:
perl -le 'print int(rand() * 1000000) for 1..10'
but way longer - around hundreds of gigabytes.
I need to split this file into many others. For test purposes, let's assume 100 files, where the output file number is the input number modulo 100.
With normal files, I can do it simply with:
perl -le 'print int(rand() * 1000000) for 1..1000' | awk '{z=$1%100; print > z}'
But I have a problem when I need to compress the split parts. Normally, I could:
... | awk '{z=$1%100; print | "gzip -c - > "z".txt.gz"}'
But when ulimit is configured to allow fewer open files than the number of "partitions", awk breaks with:
awk: (FILENAME=- FNR=30) fatal: can't open pipe `gzip -c - > 60.txt.gz' for output (Too many open files)
This doesn't break with normal file output, as GNU awk is apparently smart enough to recycle file handles.
Do you know of any way (aside from writing my own stream-splitting program, implementing buffering and some sort of pool-of-filehandles management) to handle such a case - that is: splitting into multiple files, where access to the output files is random, and gzipping all output partitions on the fly?
I didn't write it in the question itself, but since the additional information goes together with the solution, I'll write it all here.
So - the problem was on Solaris. Apparently there is a limitation that no program using stdio on Solaris can have more than 256 open filehandles?!
It is described here in detail. The important point is that it's enough to set one environment variable before running my problematic program, and the problem is gone:
export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
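An alternative workaround sketch, if a preload like that is not an option: have awk close each gzip pipe right after writing to it and append to the .gz file, so only one pipe is ever open at a time (concatenated gzip members decompress as a single stream). It is noticeably slower, since gzip is restarted for every line:
... | awk '{
    z = $1 % 100
    cmd = "gzip -c >> " z ".txt.gz"
    print | cmd
    close(cmd)
}'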