Doubts about bash script efficiency - bash

I have to accomplish a relatively simple task: I have an enormous number of files with lines in the following format:
"2014-01-27","07:20:38","data","data","data"
I would like to extract the first two fields, convert them into a Unix epoch timestamp, add 6 hours (because of a time zone difference), and replace the original two columns with the result in milliseconds (Unix epoch, since 1970-01-01, converted to milliseconds).
I have written a script that works, but it is very slow. I need to run it over 150 files with a total line count of more than 5,000,000, and I was wondering if you had any advice on how to make it faster. Here it is:
#!/bin/bash
function format()
{
while read line; do
entire_date=$(echo ${line} | cut -d"," -f1-2);
trimmed_date=$(echo ${entire_date} | sed 's/"//g;s/,/ /g');
seconds=$(date -d "${trimmed_date} + 6 hours" +%s);
millis=$((${seconds} * 1000));
echo ${line} | sed "s/$entire_date/\"$millis\"/g" >> "output"
done < $*
}
format $*

You are spawning a significant number of processes for each input line. Probably half of those could easily be factored away, by quick glance, but I would definitely recommend a switch to Perl or Python instead.
perl -MDate::Parse -pe 'die "$0:$ARGV:$.: Unexpected input $_"
unless s/(?<=^")([^"]+)","([^"]+)(?=")/ (str2time("$1 $2")+6*3600)*1000 /e'
I'd like to recommend Text::CSV but I do not have it installed here, and if you have requirements to not touch the fields after the second at all, it might not be what you need anyway. This is quick and dirty but probably also much simpler than a "proper" CSV solution.
The real meat is the str2time function from Date::Parse, which I imagine will be a lot quicker than repeatedly calling date (ISTR it does some memoization internally so it can do nearby dates quickly). The regex replaces the first two fields with the output; note the /e flag, which allows Perl code to be evaluated in the replacement part. The (?<=^") and (?=") zero-width assertions require these matches to be present but do not include them in the substitution operation. (I originally substituted the enclosing double quotes, but with this change they are retained, as apparently you want to keep them.)
Change the die to a warn if you want the script to continue in spite of errors (maybe redirect standard error to a file then!)
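For comparison, here is a minimal Python sketch of the same transformation. It assumes the timestamps can be treated as UTC (the date -d version in the question uses the machine's local time zone instead), and the names shift_to_millis and convert_lines are made up for illustration.

```python
import csv
import io
from datetime import datetime, timedelta, timezone


def shift_to_millis(date_str, time_str, hours=6):
    """Parse 'YYYY-MM-DD' and 'HH:MM:SS', add `hours`, return epoch milliseconds."""
    parsed = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")
    # Assumption: the input is UTC. Swap in another tzinfo if that is wrong.
    shifted = parsed.replace(tzinfo=timezone.utc) + timedelta(hours=hours)
    return int(shifted.timestamp()) * 1000


def convert_lines(lines):
    """Replace the first two CSV fields of every line with one millisecond field."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(lines):
        writer.writerow([str(shift_to_millis(row[0], row[1]))] + row[2:])
    return out.getvalue()
```

Because an open file is an iterable of lines, convert_lines(open(path)) handles a whole file inside a single process, which is where the speedup over the shell loop comes from.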

I have tried to avoid external commands (except date) to save time. Tests show that it is 4 times faster than your code. (Okay, tripleee's Perl solution is 40 times faster than mine!)
#! /bin/bash
function format()
{
while IFS=, read date0 date1 datas; do
date0="${date0//\"/}"
date1="${date1//\"/}"
seconds=$(date -d "$date0 $date1 + 6 hours" +%s)
echo "\"${seconds}000\",$datas"
done
}
output="output.txt"
# Process each file in argument
for file ; do
format < "$file"
done >| "$output"
exit 0

Using the built-in mktime function in awk (available in gawk). In my test it is faster than Perl.
awk '{t=$2 " " $4;gsub(/[-:]/," ",t);printf "\"%s\",%s\n",(mktime(t)+6*3600)*1000,substr($0,25)}' FS=\" OFS=\" file
Here is the test result.
$ wc -l file
1244 file
$ time awk '{t=$2 " " $4;gsub(/[-:]/," ",t);printf "\"%s\",%s\n",(mktime(t)+6*3600)*1000,substr($0,25)}' FS=\" OFS=\" file > /dev/null
real 0m0.172s
user 0m0.140s
sys 0m0.046s
$ time perl -MDate::Parse -pe 'die "$0:$ARGV:$.: Unexpected input $_"
unless s/(?<=^")([^"]+)","([^"]+)(?=")/ (str2time("$1 $2")+6*3600)*1000 /e' file > /dev/null
real 0m0.328s
user 0m0.218s
sys 0m0.124s

Related

Can I use sed to generate UUIDs inline instead of echo?

This bash snippet works to add a UUID and a tab character (\t) at the start of each line of a file.
while read; do
echo "$(uuidgen | tr 'A-Z' 'a-z')\t$REPLY"
done < sourcefile.tsv > temp_destination.tsv
(Note: the pipe to tr converts the UUIDs to lowercase, because the macOS version of uuidgen produces uppercase.)
Although that performs well for smaller files, it doesn't seem efficient.
sed -i '' "s/^/$(uuidgen | tr 'A-Z' 'a-z')\t/" sourcefile.tsv
Again, using MacOS bash so the '' after the -i flag is required since I don't want a backup file.
I think sed would perform better, but I think I have to have the UUID generation in some sort of loop.
I'm just looking to make this faster and/or perform more efficiently. It's working, but it's pretty slow on a 20,000-line file, and all other attempts by me have stumped me.
EDIT: I tested my bash script just outputting the UUIDs to a while loop without any of the other subprocesses. With my configuration, I can generate about 250-300 per second, so updating a 20,000-line file will take a minimum of 72 seconds just because of the weak link of UUID generation. As described below, using Perl or Python will likely be faster.
EDIT 2: This little Python script kills the bash script. This snippet only does part of what I need, but just for comparison, it generated about 200,000 UUIDs in a second, or 1,000,000 in 5 seconds, compared to the 250-300 per second in the bash subprocess. Wow, what a difference.
#!/usr/bin/env python3
# Generates 1,000,000 UUIDs in about 5 seconds.
import uuid

# Write one UUID per line; the with-block closes the file when done.
with open('lots-of-uuid.txt', 'w') as out:
    for _ in range(1000000):
        print(uuid.uuid4(), file=out)
Did you try something like that:
{
uuidgen | tr 'A-Z' 'a-z'
echo -n "\t"
cat 'sourcefile.tsv'
} > temp_destination.tsv
You may think it is not much different from your "read" version, but it is:
You don't capture the result of uuidgen
cat will probably perform faster than read + $REPLY
Try this out:
while read; do printf "%s\t%s\n" $(uuidgen) "$REPLY"; done < input.tsv > output.tsv
No monkeying around with building strings.
Using sed
$ sed -i '' 's/.*/printf &#;\Luuidgen/e;s/\([^#]*\)#\(.*\)/\2\t\1/' sourcefile.tsv
This might work for you (GNU sed):
sed -i 'h;s/.*/uuidgen/e;s/.*/\L&/;G;s/\n/\t/' file
Make a copy of the current line.
Replace the current line by an evaluated uuidgen command and convert the result to lowercase.
Append the copy and replace the newline by a tab.
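Since the Python snippet in the question only generates the UUIDs, here is a rough sketch of the full task in Python: prefix every line with a fresh UUID and a tab, all in one process. The function names are hypothetical, and UTF-8 input is an assumption.

```python
import uuid


def prepend_uuids(lines):
    """Yield each input line prefixed with a new UUID and a tab."""
    for line in lines:
        # str(uuid.uuid4()) is already lowercase, so no tr step is needed.
        yield f"{uuid.uuid4()}\t{line}"


def convert_file(src_path, dst_path):
    """Stream src_path to dst_path, adding a UUID column at the front."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(prepend_uuids(src))
```

Calling convert_file('sourcefile.tsv', 'temp_destination.tsv') would mirror the file names used in the question.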

More efficient way to loop through lines in shell

I've come to learn that looping through lines in bash by
while read line; do stuff; done <file
Is not the most efficient way to do it. https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice
What is a more time/resource efficient method?
Here's a time'd example using Bash and awk. I have 1 million records in a file:
$ wc -l 1M
1000000 1M
Counting its records with bash, using while read:
$ time while read -r line ; do ((i++)) ; done < 1M ; echo $i
real 0m12.440s
user 0m11.548s
sys 0m0.884s
1000000
Using let "i++" took 15.627 secs (real) and NOPing with do : ; 10.466. Using awk:
$ time awk '{i++}END{print i}' 1M
1000000
real 0m0.128s
user 0m0.128s
sys 0m0.000s
As others have said, it depends on what you're doing.
The reason it's inefficient is that everything runs in its own process. Depending on what you are doing, that may or may not be a big deal.
If what you want to do in the loop is run another shell process, you won't get any gain from eliminating the loop. If you can do what you need without the need for a loop, you could get a gain.
awk? Perl? C(++)? Of course it depends on if you're interested in CPU time or programmer time, and the latter depends on what the programmer is used to using.
The top answer to the question you linked to pretty much explains that the biggest problem is spawning external processes for simple text processing tasks. E.g. running an instance of awk or a pipeline of sed and cut for each single line just to get a part of the string is silly.
If you want to stay in shell, use the string processing parameter expansions (${var#word}, ${var:n:m}, ${var/search/replace} etc.) and other shell features as much as you can. If you see yourself running a set of commands for each input line, it's time to rethink the structure of the script. Most of the text processing commands can process a whole file with one execution, so use that.
A trivial/silly example:
while read -r line; do
x=$(echo "$line" | awk '{print $2}')
somecmd "$x"
done < file
would be better as
awk < file '{print $2}' | while read -r x ; do somecmd "$x" ; done
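The same restructuring can be sketched in Python: one pass over the file extracts the field, and only the command that really has to run per line is spawned. This is an illustration, not a drop-in replacement; run_for_each and its command argument are hypothetical names.

```python
import subprocess


def second_fields(lines):
    """Yield the second whitespace-separated field of each line."""
    for line in lines:
        fields = line.split()
        # Unlike awk '{print $2}', lines with fewer than two fields are skipped.
        if len(fields) >= 2:
            yield fields[1]


def run_for_each(lines, command):
    """Run `command` (a list such as ['somecmd']) once per extracted field."""
    for value in second_fields(lines):
        subprocess.run([*command, value], check=True)
```

The field extraction no longer costs a process per line; only the per-line command does.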
Choose between awk and Perl; both are efficient.

What can I do to speed up this bash script?

The code I have goes through a file and multiplies all the numbers in the first column by a number. The code works, but I think it's somewhat slow. It takes 26.676s (wall time) to go through a file with 2302 lines in it. I'm using a 2.7 GHz Intel Core i5 processor. Here is the code.
#!/bin/bash
i=2
sed -n 1p data.txt > data_diff.txt #outputs the header (x y)
while [ $i -lt 2303 ]; do
NUM=`sed -n "$i"p data.txt | awk '{print $1}'`
SEC=`sed -n "$i"p data.txt | awk '{print $2}'`
NNUM=$(bc <<< "$NUM*0.000123981")
echo $NNUM $SEC >> data_diff.txt
let i=$i+1
done
Honestly, the biggest speedup you can get will come from using a single language that can do the whole task itself. This is mostly because your script invokes 5 extra processes for each line, and invoking extra processes is slow, but also text processing in bash is really not that well optimized.
I'd recommend awk, given that you have it available:
awk '{ print $1*0.000123981, $2 }'
I'm sure you can improve this to skip the header line and print it without modification.
You can also do this sort of thing with Perl, Python, C, Fortran, and many other languages, though it's unlikely to make much difference for such a simple calculation.
Your script runs 4603 separate sed processes, 4602 separate awk processes, and 2301 separate bc processes. If echo were not a built-in then it would also run 2301 echo processes. Starting a process has relatively large overhead. Not so large that you would ordinarily notice it, but you are running over 11000 short processes. The wall time consumption doesn't seem unreasonable for that.
MOREOVER, each sed that you run processes the whole input file anew, selecting from it just one line. This is horribly inefficient.
The solution is to reduce the number of processes you are running, and especially to perform only a single run through the whole input file. A fairly easy way to do that would be to convert to an awk script, possibly with a bash wrapper. That might look something like this:
#!/bin/bash
awk '
NR==1 { print; next }
NR>=2303 { exit }
{ print $1 * 0.000123981, $2 }
' data.txt > data_diff.txt
Note that the line beginning with NR>=2303 artificially stops processing the input file when it reaches the 2303rd line, as your original script does; you could omit that line of the script altogether to let it simply process all the lines, however many there are.
Note, too, that that uses awk's built-in FP arithmetic instead of running bc. If you actually need the arbitrary-precision arithmetic of bc then I'm sure you can figure out how to modify the script to get that.
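If exact decimal arithmetic does matter, one option (my suggestion, not something the answers above use) is Python's decimal module, which stays in a single process like awk but avoids binary floating point. A sketch with a hypothetical function name:

```python
from decimal import Decimal

FACTOR = Decimal("0.000123981")


def scale_first_column(lines, factor=FACTOR):
    """Copy the header line, then multiply column 1 of every other line by factor."""
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return
    yield header
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # format(..., "f") forces plain notation instead of an exponent form.
        fields[0] = format(Decimal(fields[0]) * factor, "f")
        yield " ".join(fields) + "\n"
```

Note that this keeps every column after the first, whereas the original script writes only the first two.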
As an example of how to speed up the bash script (without implying that this is the right solution)
#!/bin/bash
{ IFS= read -r header
echo "$header"
# You can drop the third name "rest" if your input file
# only has two columns.
while read -r num sec rest; do
nnum=$( bc <<< "$num * 0.000123981" )
echo "$nnum $sec"
done
} < data.txt > data_diff.txt
Now you only have one extra call to bc per data line, necessitated by the fact that bash doesn't do floating-point arithmetic. The right answer is to use a single call to a program that can do floating-point arithmetic, as pointed out by David Z.

Fastest way to print a single line in a file

I have to fetch one specific line out of a big file (1,500,000 lines), multiple times in a loop over multiple files. I was asking myself what the best option would be (in terms of performance).
There are many ways to do this. I mainly use these two:
cat ${file} | head -1
or
cat ${file} | sed -n '1p'
I could not find an answer to this: do they both fetch only the first line, or does one of the two (or both) first read the whole file and then fetch row 1?
Drop the useless use of cat and do:
$ sed -n '1{p;q}' file
This will quit the sed script after the line has been printed.
Benchmarking script:
#!/bin/bash
TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')
# files up to a hundred million lines (if you're on a slow machine, decrease!)
for (( j=1; j<=100000000; j=j*10 ))
do
echo "Lines in file: $j"
# create file containing j lines
seq 1 $j > file
# initial read of file
cat file > /dev/null
for comm in {0..3}
do
avg=0
echo
echo ${heading[$comm]}
for (( i=1; i<=$n; i++ ))
do
case $comm in
0)
t=$( { time head -1 file > /dev/null; } 2>&1);;
1)
t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
2)
t=$( { time sed -n '1{p;q}' file > /dev/null; } 2>&1);;
3)
t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
esac
avg=$avg+$t
done
echo "scale=3;($avg)/$n" | bc
done
done
Just save as benchmark.sh and run bash benchmark.sh.
Results:
head -1 file
.001
sed -n 1p file
.048
sed -n '1{p;q}' file
.002
read line < file && echo $line
0
Results from a file with 1,000,000 lines.
So the times for sed -n 1p will grow linearly with the length of the file, but the timing for the other variations will be constant (and negligible), as they all quit after reading the first line.
Note: timings are different from original post due to being on a faster Linux box.
If you are really just getting the very first line and reading hundreds of files, then consider shell builtins instead of external commands: use read, which is a shell builtin in bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, file data is probably not cached in memory. However, if you try a second command on the same file again, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever. They do on Solaris, for example. Or anyway, for several days.
For example, Linux caches everything and the kitchen sink, which is a good performance attribute. But it makes benchmarking problematic if you are not aware of the issue.
All of this caching effect "interference" is both OS and hardware dependent.
So: pick one file and read it with a command. Now it is cached. Run the same test command several dozen times; this samples the effect of the command and child process creation, not your I/O hardware.
This is sed vs. read for 10 iterations of getting the first line of the same file, after reading the file once:
sed: sed '1{p;q}' uopgenl20121216.lis
real 0m0.917s
user 0m0.258s
sys 0m0.492s
read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"
real 0m0.017s
user 0m0.000s
sys 0m0.015s
This is clearly contrived, but does show the difference between builtin performance vs using a command.
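The warm-the-cache-first procedure described above can be written as a small timing helper. This is only a sketch: time_warm is a made-up name, and action stands for whatever callable you want to measure (for example, one that opens a file and reads its first line).

```python
import statistics
import time


def time_warm(action, runs=25):
    """Return the mean wall-clock seconds of `action` over `runs` warm runs."""
    action()  # untimed warm-up run, so the file data is cached afterwards
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)
```

Discarding the first run means the remaining samples measure the command itself rather than the disk.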
If you want to print only 1 line (say the 20th one) from a large file you could also do:
head -20 filename | tail -1
I did a "basic" test with bash and it seems to perform better than the sed -n '1{p;q}' solution above.
Test takes a large file and prints a line from somewhere in the middle (at line 10000000), repeats 100 times, each time selecting the next line. So it selects line 10000000,10000001,10000002, ... and so on till 10000099
$wc -l english
36374448 english
$time for i in {0..99}; do j=$((i+10000000)); sed -n $j'{p;q}' english >/dev/null; done;
real 1m27.207s
user 1m20.712s
sys 0m6.284s
vs.
$time for i in {0..99}; do j=$((i+10000000)); head -$j english | tail -1 >/dev/null; done;
real 1m3.796s
user 0m59.356s
sys 0m32.376s
For printing a line out of multiple files
$wc -l english*
36374448 english
17797377 english.1024MB
3461885 english.200MB
57633710 total
$time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done;
real 0m2.059s
user 0m1.904s
sys 0m0.144s
$time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done;
real 0m1.535s
user 0m1.420s
sys 0m0.788s
How about avoiding pipes?
Both sed and head support the filename as an argument. In this way you avoid passing by cat. I didn't measure it, but head should be faster on larger files as it stops the computation after N lines (whereas sed goes through all of them, even if it doesn't print them - unless you specify the quit option as suggested above).
Examples:
sed -n '1{p;q}' /path/to/file
head -n 1 /path/to/file
Again, I didn't test the efficiency.
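The "stop as soon as the line has been seen" behaviour of head and of sed with q can be mimicked in Python with itertools.islice, which stops pulling lines once it has the one it needs. A sketch with a hypothetical helper name:

```python
from itertools import islice


def nth_line(lines, n):
    """Return line n (1-based) from an iterable of lines, or None if it is too short."""
    if n < 1:
        raise ValueError("n must be 1 or greater")
    # islice consumes only the first n lines and yields the last of them.
    return next(islice(lines, n - 1, n), None)
```

An open file object is an iterable of lines, so nth_line(handle, 20) on an open file reads only as far as line 20.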
I have done extensive testing, and found that, if you want every line of a file:
while IFS=$'\n' read LINE; do
echo "$LINE"
done < your_input.txt
is much, much faster than any other (Bash-based) method out there. All other methods (like sed) read the file each time, at least up to the matching line. If the file is 4 lines long, you will get: 1 -> 1,2 -> 1,2,3 -> 1,2,3,4 = 10 reads, whereas the while loop just maintains a position cursor (based on IFS), so it would only do 4 reads in total.
On a file with ~15k lines, the difference is phenomenal: ~25-28 seconds (sed-based, extracting a specific line each time) versus ~0-1 seconds (while...read-based, reading through the file once).
The above example also shows how to set IFS in a better way to newline (with thanks to Peter from comments below), and this will hopefully fix some of the other issues seen when using while... read ... in Bash at times.
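The read-count argument above can be checked with a short sketch: one helper collects several wanted lines in a single pass, and the other computes how many line reads the restart-from-the-top approach costs. Both names are hypothetical.

```python
def pick_lines(lines, wanted):
    """Return {line_number: text} for the 1-based numbers in `wanted`, in one pass."""
    remaining = set(wanted)
    found = {}
    for number, text in enumerate(lines, start=1):
        if number in remaining:
            found[number] = text
            remaining.discard(number)
            if not remaining:
                break  # nothing left to find, stop reading early
    return found


def reads_if_restarting(n):
    """Line reads needed when each of lines 1..n is fetched by a fresh scan."""
    return n * (n + 1) // 2
```

For the 4-line file in the example, reads_if_restarting(4) gives 10, against 4 reads for the single pass.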
For the sake of completeness you can also use the basic linux command cut:
cut -d $'\n' -f <linenumber> <filename>

What is your latest useful Perl one-liner (or a pipe involving Perl)? [closed]

The one-liner should:
solve a real-world problem
not be extensively cryptic (should be easy to understand and reproduce)
be worth the time it takes to write it (should not be too clever)
I'm looking for practical tips and tricks (complementary examples for perldoc perlrun).
Please see my slides for "A Field Guide To The Perl Command Line Options."
Squid log files. They're great, aren't they? Except by default they have seconds-from-the-epoch as the time field. Here's a one-liner that reads from a squid log file and converts the time into a human readable date:
perl -pe's/([\d.]+)/localtime $1/e;' access.log
With a small tweak, you can make it only display lines with a keyword you're interested in. The following watches for stackoverflow.com accesses and prints only those lines, with a human readable date. To make it more useful, I'm giving it the output of tail -f, so I can see accesses in real time:
tail -f access.log | perl -ne's/([\d.]+)/localtime $1/e,print if /stackoverflow\.com/'
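For readers who do not read Perl, here is roughly what the substitution does, sketched in Python. It assumes the timestamp is the first number on the line and formats it in UTC so the result is reproducible (Perl's localtime uses the local time zone). humanize is a made-up name.

```python
import re
from datetime import datetime, timezone

# First run of digits, optionally followed by a fractional part.
_EPOCH = re.compile(r"\d+(?:\.\d+)?")


def humanize(line, tz=timezone.utc):
    """Replace the first epoch timestamp on the line with a readable date."""
    def to_date(match):
        moment = datetime.fromtimestamp(float(match.group()), tz)
        return moment.strftime("%Y-%m-%d %H:%M:%S")
    return _EPOCH.sub(to_date, line, count=1)
```

count=1 mirrors the Perl s/// without the /g flag: only the first match is rewritten.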
The problem: A media player does not automatically load subtitles because their names differ from those of the corresponding video files.
Solution: Rename all *.srt (files with subtitles) to match the *.avi (files with video).
perl -e'while(<*.avi>) { s/avi$/srt/; rename <*.srt>, $_ }'
CAVEAT: Sorting order of original video and subtitle filenames should be the same.
Here, a more verbose version of the above one-liner:
my @avi = glob('*.avi');
my @srt = glob('*.srt');
for my $i (0..$#avi)
{
my $video_filename = $avi[$i];
$video_filename =~ s/avi$/srt/; # 'movie1.avi' -> 'movie1.srt'
my $subtitle_filename = $srt[$i]; # 'film1.srt'
rename($subtitle_filename, $video_filename); # 'film1.srt' -> 'movie1.srt'
}
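The same pairing logic in Python, as a sketch that only plans the renames instead of performing them, so the result can be inspected first. It relies on the same caveat (both lists must sort into matching order), and plan_renames is a hypothetical name.

```python
from pathlib import Path


def plan_renames(video_names, subtitle_names):
    """Pair sorted subtitles with sorted videos; return (old_name, new_name) tuples."""
    videos = sorted(video_names)
    subtitles = sorted(subtitle_names)
    if len(videos) != len(subtitles):
        raise ValueError("need exactly one subtitle file per video file")
    return [
        # 'movie1.avi' -> 'movie1.srt'
        (subtitle, Path(video).with_suffix(".srt").name)
        for video, subtitle in zip(videos, subtitles)
    ]
```

Each pair can then be passed to os.rename once the plan looks right.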
The common idiom of using find ... -exec rm {} \; to delete a set of files somewhere in a directory tree is not particularly efficient in that it executes the rm command once for each file found. One of my habits, born from the days when computers weren't quite as fast (dagnabbit!), is to replace many calls to rm with one call to perl:
find . -name '*.whatever' | perl -lne unlink
The perl part of the command line reads the list of files emitted* by find, one per line, trims the newline off, and deletes the file using perl's built-in unlink() function, which takes $_ as its argument if no explicit argument is supplied. ($_ is set to each line of input thanks to the -n flag.) (*These days, most find commands do -print by default, so I can leave that part out.)
I like this idiom not only because of the efficiency (possibly less important these days) but also because it has fewer chorded/awkward keys than typing the traditional -exec rm {} \; sequence. It also avoids quoting issues caused by file names with spaces, quotes, etc., of which I have many. (A more robust version might use find's -print0 option and then ask perl to read null-delimited records instead of lines, but I'm usually pretty confident that my file names do not contain embedded newlines.)
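The null-delimited variant mentioned above might look like this in Python. This is a sketch under the assumption that the names come from find with -print0; both function names are made up.

```python
import os


def split_null_delimited(data):
    """Split the bytes emitted by find -print0 into individual file names."""
    return [name for name in data.split(b"\0") if name]


def unlink_all(names):
    """Delete each named file; return how many were actually removed."""
    removed = 0
    for name in names:
        try:
            os.unlink(name)
            removed += 1
        except FileNotFoundError:
            pass  # already gone, nothing to do
    return removed
```

In a script, the input would come from sys.stdin.buffer.read(), so names containing spaces or even newlines survive intact.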
You may not think of this as Perl, but I use ack religiously (it's a smart grep replacement written in Perl) and that lets me edit, for example, all of my Perl tests which access a particular part of our API:
vim $(ack --perl -l 'api/v1/episode' t)
As a side note, if you use vim, you can run all of the tests in your editor's buffers.
For something with more obvious (if simple) Perl, I needed to know how many test programs used our test fixtures in the t/lib/TestPM directory (I've cut down the command for clarity).
ack $(ls t/lib/TestPM/|awk -F'.' '{print $1}'|xargs perl -e 'print join "|" => @ARGV') aggtests/ t -l
Note how the "join" turns the results into a regex to feed to ack.
All one-liners from the answers collected in one place:
perl -pe's/([\d.]+)/localtime $1/e;' access.log
ack $(ls t/lib/TestPM/|awk -F'.' '{print $1}'|xargs perl -e 'print join "|" => @ARGV')
aggtests/ t -l
perl -e'while(<*.avi>) { s/avi$/srt/; rename <*.srt>, $_ }'
find . -name '*.whatever' | perl -lne unlink
tail -F /var/log/squid/access.log | perl -ane 'BEGIN{$|++} $F[6] =~ m{\Qrad.live.com/ADSAdClient31.dll}
&& printf "%02d:%02d:%02d %15s %9d\n", sub{reverse @_[0..2]}->(localtime $F[0]), @F[2,4]'
export PATH=$(perl -F: -ane'print join q/:/, grep { !$c{$_}++ } @F'<<<$PATH)
alias e2d='perl -le "print scalar(localtime(\$ARGV[0]))"'
perl -ple '$_=eval'
perl -00 -ne 'print sort split /^/'
perl -pe'1while+s/\t/" "x(8-pos()%8)/e'
tail -f log | perl -ne '$s=time() unless $s; $n=time(); $d=$n-$s; if ($d>=2) { print qq
($. lines in last $d secs, rate ),$./$d,qq(\n); $. =0; $s=$n; }'
perl -MFile::Spec -e 'print join(qq(\n),File::Spec->path).qq(\n)'
See corresponding answers for their descriptions.
The Perl one-liner I use the most is the Perl calculator
perl -ple '$_=eval'
One of the biggest bandwidth hogs at $work is downloaded web advertising, so I'm looking at the low-hanging fruit waiting to be picked. I've got rid of Google ads; now I have Microsoft in my sights. So I run a tail on the log file and pick out the lines of interest:
tail -F /var/log/squid/access.log | \
perl -ane 'BEGIN{$|++} $F[6] =~ m{\Qrad.live.com/ADSAdClient31.dll}
&& printf "%02d:%02d:%02d %15s %9d\n",
sub{reverse @_[0..2]}->(localtime $F[0]), @F[2,4]'
What the Perl pipe does is to begin by setting autoflush to true, so that any line that is acted upon is printed out immediately. Otherwise the output is chunked up and one receives a batch of lines when the output buffer fills. The -a switch splits each input line on white space and saves the results in the array @F (functionality inspired by awk's capacity to split input records into its $1, $2, $3... variables).
It checks whether the 7th field in the line contains the URI we seek (using \Q to save us the pain of escaping uninteresting metacharacters). If a match is found, it pretty-prints the time, the source IP and the number of bytes returned from the remote site.
The time is obtained by taking the epoch time in the first field and using 'localtime' to break it down into its components (second, minute, hour, day, month, year). It takes a slice of the first three elements returned (second, minute and hour) and reverses the order to get hour, minute and second. This is returned as a three-element list, along with a slice of the third (IP address) and fifth (size) fields from the original @F array. These five arguments are passed to printf, which formats the results.
@dr_pepper
Remove literal duplicates in $PATH:
$ export PATH=$(perl -F: -ane'print join q/:/, grep { !$c{$_}++ } @F'<<<$PATH)
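The grep { !$c{$_}++ } idiom keeps only the first occurrence of each entry. A Python sketch of the same order-preserving de-duplication follows; dedupe_path is a hypothetical name.

```python
import os


def dedupe_path(path, sep=os.pathsep):
    """Drop repeated entries from a PATH-style string, keeping first occurrences."""
    # dict.fromkeys preserves insertion order, so earlier entries win.
    return sep.join(dict.fromkeys(path.split(sep)))
```

Like the Perl version, this compares entries literally, so /bin and /bin/ count as different.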
Print unique clean paths from the %PATH% environment variable (it doesn't touch ../ and the like; replace File::Spec->rel2abs with Cwd::realpath if that is desired). It is not a one-liner, so that it is more portable:
#!/usr/bin/perl -w
use File::Spec;
$, = "\n";
print grep { !$count{$_}++ }
map { File::Spec->rel2abs($_) }
File::Spec->path;
I use this quite frequently to quickly convert epoch times to a useful datestamp.
perl -l -e 'print scalar(localtime($ARGV[0]))'
Make an alias in your shell (single quotes keep the shell from expanding $ARGV when the alias is defined):
alias e2d='perl -le "print scalar(localtime(\$ARGV[0]))"'
Then pass an epoch number to the alias as an argument.
e2d 1219174516
Many programs and utilities on Unix/Linux use epoch values to represent time, so this has proved invaluable for me.
Remove duplicates in path variable:
set path=(`echo $path | perl -e 'foreach(split(/ /,<>)){print $_," " unless $s{$_}++;}'`)
Remove MS-DOS line-endings.
perl -p -i -e 's/\r\n$/\n/' htdocs/*.asp
Extracting Stack Overflow reputation without having to open a web page:
perl -nle "print ' Stack Overflow ' . $1 . ' (no change)' if /\s{20,99}([0-9,]{3,6})<\/div>/;" "SO.html" >> SOscores.txt
This assumes the user page has already been downloaded to file SO.html. I use wget for this purpose. The notation here is for Windows command line; it would be slightly different for Linux or Mac OS X. The output is appended to a text file.
I use it in a BAT script to automate sampling of reputation on the four sites in the family:
Stack Overflow, Server Fault, Super User and Meta Stack Overflow.
In response to Ovid's Vim/ack combination:
I too am often searching for something and then want to open the matching files in Vim, so I made myself a little shortcut some time ago (works in Z shell only, I think):
function vimify-eval; {
if [[ ! -z "$BUFFER" ]]; then
if [[ $BUFFER = 'ack'* ]]; then
BUFFER="$BUFFER -l"
fi
BUFFER="vim \$($BUFFER)"
zle accept-line
fi
}
zle -N vim-eval-widget vimify-eval
bindkey '^P' vim-eval-widget
It works like this: I search for something using ack, like ack some-pattern. I look at the results and if I like them, I press arrow-up to get the ack line again and then press Ctrl + P. What happens then is that Z shell appends "-l" (for listing filenames only) if the command starts with "ack". Then it puts "$(...)" around the command and "vim" in front of it. Then the whole thing is executed.
I often need to see a readable version of the PATH while shell scripting. The following one-liners print every path entry on its own line.
Over time this one-liner has evolved through several phases:
Unix (version 1):
perl -e 'print join("\n",split(":",$ENV{"PATH"}))."\n"'
Windows (version 2):
perl -e "print join(qq(\n),split(';',$ENV{'PATH'})).qq(\n)"
Both Unix/Windows (using q/qq tip from @j-f-sebastian) (version 3):
perl -MFile::Spec -e 'print join(qq(\n), File::Spec->path).qq(\n)' # Unix
perl -MFile::Spec -e "print join(qq(\n), File::Spec->path).qq(\n)" # Windows
One of the most recent one-liners that got a place in my ~/bin:
perl -ne '$s=time() unless $s; $n=time(); $d=$n-$s; if ($d>=2) { print "$. lines in last $d secs, rate ",$./$d,"\n"; $. =0; $s=$n; }'
You would use it against a tail of a log file, and it will print the rate at which lines are being output.
Want to know how many hits per second you are getting on your webservers? tail -f log | this_script.
Get human-readable output from du, sorted by size:
perl -e '%h=map{/.\s/;7x(ord$&&10)+$`,$_}`du -h`;print@h{sort%h}'
Filters a stream of white-space separated stanzas (name/value pair lists),
sorting each stanza individually:
perl -00 -ne 'print sort split /^/'
Network administrators have the tendency to misconfigure "subnet address" as "host address" especially while using Cisco ASDM auto-suggest. This straightforward one-liner scans the configuration files for any such configuration errors.
incorrect usage: permit host 10.1.1.0
correct usage: permit 10.1.1.0 255.255.255.0
perl -ne "print if /host ([\w\-\.]+){3}\.0 /" *.conf
This was tested and used on Windows, please suggest if it should be modified in any way for correct usage.
Expand all tabs to spaces: perl -pe'1while+s/\t/" "x(8-pos()%8)/e'
Of course, this could be done with :set et, :ret in Vim.
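The golfed Perl is easier to follow when spelled out: each tab is replaced by enough spaces to reach the next multiple of 8 columns. Python has this built in as str.expandtabs; the manual version below is only there to show the 8 - pos % 8 arithmetic. Both function names are made up.

```python
def expand_tabs(line, width=8):
    """Expand tabs using the built-in string method."""
    return line.expandtabs(width)


def expand_tabs_manual(line, width=8):
    """Same result for a single line, with the column arithmetic made explicit."""
    out = []
    column = 0
    for char in line:
        if char == "\t":
            pad = width - column % width  # spaces needed to reach the next tab stop
            out.append(" " * pad)
            column += pad
        else:
            out.append(char)
            column += 1
    return "".join(out)
```

The manual version assumes the line contains no newline or carriage return, which would reset the column.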
I have a list of tags with which I identify portions of text. The master list is of the format:
text description {tag_label}
It's important that the {tag_label} are not duplicated. So there's this nice simple script:
perl -ne '($c) = $_ =~ /({.*?})/; print $c,"\n" ' $1 | sort | uniq -c | sort -d
I know that I could do the whole lot in shell or perl, but this was the first thing that came to mind.
Often I have had to convert tabular data into configuration files. For example, network cabling vendors provide the patching record in Excel format, and we have to use that information to create configuration files, i.e.,
Interface, Connect to, Vlan
Gi1/0/1, Desktop, 1286
Gi1/0/2, IP Phone, 1317
should become:
interface Gi1/0/1
description Desktop
switchport access vlan 1286
and so on. The same task reappears in several forms in various administration tasks, where tabular data needs to be prefixed with its field names and transposed to a flat structure. I have seen some DBAs waste a lot of time preparing their SQL statements from Excel sheets. It can be achieved using this simple one-liner. Just save the tabular data in CSV format using your favourite spreadsheet tool and run this one-liner. The field names in the header row get prepended to individual cell values, so you may have to edit it to match your requirements.
perl -F, -lane "if ($.==1) {@keys = @F} else{print @keys[$_].$F[$_] foreach(0..$#F)} "
The caveat is that none of the field names or values should contain any commas. Perhaps this can be further elaborated to catch such exceptions in a one-liner; please improve this if possible.
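One way around the comma caveat, if a one-liner is not a hard requirement, is a real CSV parser. Here is a sketch using Python's csv module, which copes with quoted fields that contain commas. flatten_rows is a hypothetical name, and joining name and value with a single space is my own choice of output format.

```python
import csv


def flatten_rows(lines):
    """Yield 'field_name value' for every cell below the header row."""
    reader = csv.reader(lines, skipinitialspace=True)
    header = next(reader, None)
    if header is None:
        return
    for row in reader:
        for name, value in zip(header, row):
            yield f"{name.strip()} {value.strip()}"
```

For the sample above, the first data row produces "Interface Gi1/0/1", "Connect to Desktop" and "Vlan 1286", which still needs editing into the final configuration syntax.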
Here is one that I find handy when dealing with a collection of compressed log files:
open STATFILE, "zcat $logFile|" or die "Can't open zcat of $logFile" ;
At some time I found that anything I would want to do with Perl that is short enough to be done on the command line with 'perl -e' can be done better, easier and faster with normal Z shell features without the hassle of quoting. E.g. the example above could be done like this:
srt=(*.srt); for foo in *.avi; mv $srt[1] ${foo:r}.srt && srt=($srt[2,-1])
