sorting numerically by first row - bash [duplicate]

I tried to sort these numbers with Unix sort, but it doesn't seem to work:
2e-13
1e-91
2e-13
1e-104
3e-19
9e-99
This is my command:
sort -nr file.txt
What's the right way to do it?

Use -g (long form --general-numeric-sort) instead of -n. The -g option passes the numbers through strtod to obtain their value, and it will recognize this format.
I'm not sure if this is available on all implementations of sort, or just the GNU one.
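For the sample data in the question, that would be (assuming a sort that supports -g, such as GNU sort):
$ sort -gr file.txt
2e-13
2e-13
3e-19
1e-91
9e-99
1e-104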

If your sort doesn't have -g, here's another way:
$ printf "%.200f\n" $(<file) |sort -n |xargs printf "%g\n"
1e-104
9e-99
1e-91
3e-19
2e-13
2e-13
$ sort -g file
1e-104
9e-99
1e-91
3e-19
2e-13
2e-13
$ printf "%.200f\n" `cat file` |sort -n |xargs printf "%g\n"

Just do two things:
Use -g (or --general-numeric-sort) to make sort treat Exp-numbers correctly.
Use LC_ALL=C. sort is very sensitive to the locale if your data may contain language-specific symbols beyond ASCII. Using LC_ALL=C is the universal approach for every case where you use sort: it makes sort treat the input stream as binary, and you won't have any problems.
So the universal solution is:
cat file.txt | LC_ALL=C sort -gr | less
Also I made an alias for sort in my .bashrc file:
alias csort="LC_ALL=C sort"
for more convenient use.

OK, here is a not-fully-tested Python script. Intended usage:
sort_script.py file.txt
Unfortunately I developed this on Windows, and with 2 different versions of Python installed I cannot test it properly. Warning: requires a newer Python 3 (with the with statement and the print() function). Note: the back_to_file flag could be made an optional parameter, although with Unix you can always redirect the output ... and even in Windows you can.
#!/usr/bin/env python3
# Note: requires Python 3
import sys

assert len(sys.argv) == 2

with open(sys.argv[1], 'r') as fin:
    lines = fin.readlines()

lines_sorted = sorted(lines, key=lambda x: float(x))

back_to_file = False  # Change this if needed

if back_to_file:
    with open(sys.argv[1], 'w') as fout:
        fout.writelines(lines_sorted)
else:
    for lns in lines_sorted:
        print(lns, end='')  # Suppress the extra newline

Related

Iterate through list of filenames in order they were created in bash

Parsing the output of ls to iterate through a list of files is bad. So how should I go about iterating through a list of files in the order in which they were created? I browsed several questions here on SO and they all seem to be parsing ls.
The embedded link suggests:
Things get more difficult if you wanted some specific sorting that
only ls can do, such as ordering by mtime. If you want the oldest or
newest file in a directory, don't use ls -t | head -1 -- read Bash FAQ
99 instead. If you truly need a list of all the files in a directory
in order by mtime so that you can process them in sequence, switch to
perl, and have your perl program do its own directory opening and
sorting. Then do the processing in the perl program, or -- worst case
scenario -- have the perl program spit out the filenames with NUL
delimiters.
Even better, put the modification time in the filename, in YYYYMMDD
format, so that glob order is also mtime order. Then you don't need ls
or perl or anything. (The vast majority of cases where people want the
oldest or newest file in a directory can be solved just by doing
this.)
Does that mean there is no native way of doing it in bash? I don't have the liberty to modify the filenames to include the time in them. I need to schedule a script in cron that would run every 5 minutes, generate an array containing all the files in a particular directory ordered by their creation time, perform some actions on the filenames, and move them to another location.
The following worked but only because I don't have funny filenames. The files are created by a server so it will never have special characters, spaces, newlines etc.
files=( $(ls -1tr) )
I can write a perl script that would do what I need but I would appreciate if someone can suggest the right way to do it in bash. Portable option would be great but solution using latest GNU utilities will not be a problem either.
sorthelper=();
for file in *; do
# We need something that can easily be sorted.
# Here, we use "<date><filename>".
# Note that this works with any special characters in filenames
sorthelper+=("$(stat -n -f "%Sm%N" -t "%Y%m%d%H%M%S" -- "$file")"); # Mac OS X only
# or
sorthelper+=("$(stat --printf "%Y %n" -- "$file")"); # Linux only
done;
sorted=();
while read -d $'\0' elem; do
# this strips away the first 14 characters (<date>)
sorted+=("${elem:14}");
done < <(printf '%s\0' "${sorthelper[@]}" | sort -z)
for file in "${sorted[@]}"; do
# do your stuff...
echo "$file";
done;
Other than sort and stat, all commands are actual native Bash commands (builtins)*. If you really want, you can implement your own sort using Bash builtins only, but I see no way of getting rid of stat.
The important parts are read -d $'\0', printf '%s\0' and sort -z. All these commands are used with their null-delimiter options, which means that any filename can be processed safely. Also, the use of double quotes in "$file" and "${anarray[*]}" is essential.
*Many people feel that the GNU tools are somehow part of Bash, but technically they're not. So, stat and sort are just as non-native as perl.
With all of the cautions and warnings against using ls to parse a directory notwithstanding, we have all found ourselves in this situation. If you do find yourself needing sorted directory input, then about the cleanest use of ls to feed your loop is ls -opts | while read -r name; do ... This will handle spaces in filenames, etc., without requiring a reset of IFS, due to the nature of read itself. Example:
ls -1rt | while read -r fname; do # where '1' is ONE not little 'L'
So do look for cleaner solutions avoiding ls, but if push comes to shove, ls -opts can be used sparingly without the sky falling or dragons plucking your eyes out.
Let me add the disclaimer to keep everyone happy: if you like newlines inside your filenames, then do not use ls to populate a loop. If you do not have newlines inside your filenames, there are no other adverse side effects.
Contra: TLDP Bash Howto Intro:
#!/bin/bash
for i in $( ls ); do
echo item: $i
done
It appears that SO users do not know what the use of contra means -- please look it up before downvoting.
You can try using use stat command piped with sort:
stat -c '%Y %n' * | sort -t ' ' -nk1 | cut -d ' ' -f2-
Update: To deal with filenames containing newlines we can use the %N format in stat and, instead of cut, we can use awk like this:
LANG=C stat -c '%Y^A%N' *| sort -t '^A' -nk1| awk -F '^A' '{print substr($2,2,length($2)-2)}'
Use of LANG=C is needed to make sure stat uses single quotes only in quoting file names.
^A is the Control-A character, typed by pressing Control-V and then Control-A.
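If you'd rather not type a literal control character, bash's $'...' quoting can supply the same byte; the pipeline above could equally be written as (a sketch, assuming GNU stat and bash):
LANG=C stat -c $'%Y\001%N' * | sort -t$'\001' -nk1 | awk -F$'\001' '{print substr($2,2,length($2)-2)}'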
How about a solution with GNU find + sed + sort?
As long as there are no newlines in the file name, this should work:
find . -type f -printf '%T@ %p\n' | sort -k 1nr | sed 's/^[^ ]* //'
It may be a little more work to ensure it is installed (it may already be, though), but using zsh instead of bash for this script makes a lot of sense. The filename globbing capabilities are much richer, while still using a sh-like language.
files=( *(oc) )
will create an array whose entries are all the file names in the current directory, but sorted by change time. (Use a capital O instead to reverse the sort order). This will include directories, but you can limit the match to regular files (similar to the -type f predicate to find):
files=( *(.oc) )
find is needed far less often in zsh scripts, because most of its uses are covered by the various glob flags and qualifiers available.
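A usage sketch, assuming the files array from above (in zsh an unquoted array expands to its elements without word splitting, so the loop is safe for odd filenames):
files=( *(.oc) )        # regular files, oldest change time first
for f in $files; do
  print -r -- $f        # replace with your real processing
done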
I've just found a way to do it with bash and ls (GNU).
Suppose you want to iterate through the filenames sorted by modification time (-t):
while read -r fname; do
fname=${fname:1:((${#fname}-2))} # remove the leading and trailing "
fname=${fname//\\\"/\"} # removed the \ before any embedded "
fname=$(echo -e "$fname") # interpret the escaped characters
file "$fname" # replace (YOU) `file` with anything
done < <(ls -At --quoting-style=c)
Explanation
Given some filenames with special characters, this is the ls output:
$ ls -A
filename with spaces .hidden_filename filename?with_a_tab filename?with_a_newline filename_"with_double_quotes"
$ ls -At --quoting-style=c
".hidden_filename" " filename with spaces " "filename_\"with_double_quotes\"" "filename\nwith_a_newline" "filename\twith_a_tab"
So you have to process each filename a little to get the actual one. Recalling:
${fname:1:((${#fname}-2))} # remove the leading and trailing "
# ".hidden_filename" -> .hidden_filename
${fname//\\\"/\"} # removed the \ before any embedded "
# filename_\"with_double_quotes\" -> filename_"with_double_quotes"
$(echo -e "$fname") # interpret the escaped characters
# filename\twith_a_tab -> filename with_a_tab
Example
$ ./script.sh
.hidden_filename: empty
filename with spaces : empty
filename_"with_double_quotes": empty
filename
with_a_newline: empty
filename with_a_tab: empty
As seen, file (or whatever command you use) handles each filename correctly.
Each file has three timestamps:
Access time: the file was opened and read. Also known as atime.
Modification time: the file was written to. Also known as mtime.
Inode modification time: the file's status was changed, such as the file had a new hard link created, or an existing one removed; or if the file's permissions were chmod-ed, or a few other things. Also known as ctime.
None of these represents the time the file was created; that information is not saved anywhere. At file creation time, all three timestamps are initialized, and then each one gets updated appropriately: when the file is read, or written to, or when a file's permissions are chmoded, or a hard link created or destroyed.
So, you can't really list the files according to their file creation time, because the file creation time isn't saved anywhere. The closest match would be the inode modification time.
See the descriptions of the -t, -u, -c, and -r options in the ls(1) man page for more information on how to list files in atime, mtime, or ctime order.
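To see the three timestamps for a single file, or to list a directory by each of them (ls flags as in the man page referenced above; the stat line assumes GNU coreutils, and file.txt is a placeholder name):
stat file.txt    # default output includes Access, Modify and Change times
ls -lt           # sort by mtime, newest first
ls -ltu          # sort by atime
ls -ltc          # sort by ctime (add -r to any of these for oldest first)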
Here's a way using stat with an associative array.
n=0
declare -A arr
for file in *; do
# modified=$(stat -f "%m" "$file") # For use with BSD/OS X
modified=$(stat -c "%Y" "$file") # For use with GNU/Linux
# Ensure stat timestamp is unique
if [[ -n "${arr[$modified]}" ]]; then
modified=${modified}.$n
((n++))
fi
arr[$modified]="$file"
done
files=()
for index in $(IFS=$'\n'; echo "${!arr[*]}" | sort -n); do
files+=("${arr[$index]}")
done
Since sort sorts lines, $(IFS=$'\n'; echo "${!arr[*]}" | sort -n) ensures the indices of the associative array get sorted by setting the field separator in the subshell to a newline.
The quoting at arr[$modified]="${file}" and files+=("${arr[$index]}") ensures that file names with caveats like a newline are preserved.

Sorting floats with exponents with 'sort -g' bash command

I have a file with floats with exponents and I want to sort them. AFAIK 'sort -g' is what I need. But it seems like it sorts floats throwing away all the exponents. So the output looks like this (which is not what I wanted):
$ cat file.txt | sort -g
8.387280091e-05
8.391373668e-05
8.461754562e-07
8.547354437e-05
8.831553093e-06
8.936111118e-05
8.959458896e-07
This brings me to two questions:
Why doesn't 'sort -g' work as I expect it to?
How can I sort my file using bash commands?
The problem is that in some countries the locale settings can mess this up by using , as the decimal separator instead of . at the system level. Check by typing locale in a terminal. There should be an entry
LC_NUMERIC=en_US.UTF-8
If the value is anything else, change it to the above by editing the locale file
sudo gedit /etc/default/locale
That's it. You can also temporarily use this value by doing
LC_ALL=C sort -g file.dat
LC_ALL=C is shorter to write in the terminal, but editing the locale file might not be preferable, as it could alter some other system-wide behavior such as the time format.
Here's a neat trick:
$ sort -te -k2,2n -k1,1n test.txt
8.461754562e-07
8.959458896e-07
8.831553093e-06
8.387280091e-05
8.391373668e-05
8.547354437e-05
8.936111118e-05
The -te splits each number into two fields at the e that separates the mantissa from the exponent. The -k2,2n says to sort by the exponent first, and then -k1,1n sorts by the mantissa.
Works with all versions of the sort command.
Your method is absolutely correct:
cat file.txt | sort -g
If the above code is not working, then try this:
sed 's/\./0000000000000/g' file.txt | sort -g | sed 's/0000000000000/\./g'
Convert '.' to '0000000000000', sort, and then substitute '.' back. I chose '0000000000000' as the replacement to avoid it colliding with anything in the input.
You can choose a different replacement string yourself.

bash: shortest way to get n-th column of output

Let's say that during your workday you repeatedly encounter the following form of columnized output from some command in bash (in my case from executing svn st in my Rails working directory):
? changes.patch
M app/models/superman.rb
A app/models/superwoman.rb
in order to work with the output of your command - in this case the filenames - some sort of parsing is required so that the second column can be used as input for the next command.
What I've been doing is to use awk to get at the second column, e.g. when I want to remove all files (not that that's a typical usecase :), I would do:
svn st | awk '{print $2}' | xargs rm
Since I type this a lot, a natural question is: is there a shorter (thus cooler) way of accomplishing this in bash?
NOTE:
What I am asking is essentially a shell command question even though my concrete example is about my svn workflow. If you feel that workflow is silly and suggest an alternative approach, I probably won't vote you down, but others might, since the question here is really how to get the n-th column of command output in bash, in the shortest manner possible. Thanks :)
You can use cut to access the second field:
cut -f2
Edit:
Sorry, didn't realise that SVN doesn't use tabs in its output, so that's a bit useless. You can tailor cut to the output but it's a bit fragile - something like cut -c 10- would work, but the exact value will depend on your setup.
Another option is something like: sed 's/.\s\+//'
To accomplish the same thing as:
svn st | awk '{print $2}' | xargs rm
using only bash you can use:
svn st | while read a b; do rm "$b"; done
Granted, it's not shorter, but it's a bit more efficient and it handles whitespace in your filenames correctly.
I found myself in the same situation and ended up adding these aliases to my .profile file:
alias c1="awk '{print \$1}'"
alias c2="awk '{print \$2}'"
alias c3="awk '{print \$3}'"
alias c4="awk '{print \$4}'"
alias c5="awk '{print \$5}'"
alias c6="awk '{print \$6}'"
alias c7="awk '{print \$7}'"
alias c8="awk '{print \$8}'"
alias c9="awk '{print \$9}'"
Which allows me to write things like this:
svn st | c2 | xargs rm
Try zsh. It supports global aliases, so you can define X in your .zshrc to be
alias -g X="| cut -d' ' -f2"
then you can do:
cat file X
You can take it one step further and define it for the nth column:
alias -g X2="| cut -d' ' -f2"
alias -g X1="| cut -d' ' -f1"
alias -g X3="| cut -d' ' -f3"
which will output the nth column of file "file". You can do this for grep output or less output, too. This is very handy and a killer feature of the zsh.
You can go one step further and define D to be:
alias -g D="|xargs rm"
Now you can type:
cat file X1 D
to delete all files mentioned in the first column of file "file".
If you know bash, zsh is not much of a change, apart from some new features.
HTH Chris
Because you seem to be unfamiliar with scripts, here is an example.
#!/bin/sh
# usage: svn st | x 2 | xargs rm
col=$1
shift
awk -v col="$col" '{print $col}' "${@--}"
If you save this in ~/bin/x and make sure ~/bin is in your PATH (now that is something you can and should put in your .bashrc), you have the shortest possible command for generally extracting column n: x n.
The script should do proper error checking and bail if invoked with a non-numeric argument or the incorrect number of arguments, etc; but expanding on this bare-bones essential version will be in unit 102.
Maybe you will want to extend the script to allow a different column delimiter. Awk by default parses input into fields on whitespace; to use a different delimiter, use -F ':' where : is the new delimiter. Implementing this as an option to the script makes it slightly longer, so I'm leaving that as an exercise for the reader (a sketch of one possible extension appears after the usage examples below).
Usage
Given a file file:
1 2 3
4 5 6
You can either pass it via stdin (using a useless cat merely as a placeholder for something more useful);
$ cat file | sh script.sh 2
2
5
Or provide it as an argument to the script:
$ sh script.sh 2 file
2
5
Here, sh script.sh is assuming that the script is saved as script.sh in the current directory; if you save it with a more useful name somewhere in your PATH and mark it executable, as in the instructions above, obviously use the useful name instead (and no sh).
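For reference, a minimal, untested sketch of that delimiter extension (the -F handling and defaults here are my own choices, not part of the original script); it accepts an optional -F separator before the column number:
#!/bin/sh
# usage: x [-F sep] column [file...]
# examples: svn st | x 2
#           x -F : 1 /etc/passwd
sep=' '                      # awk's default: split on runs of whitespace
if [ "$1" = "-F" ]; then
  sep=$2
  shift 2
fi
col=$1
shift
awk -F "$sep" -v col="$col" '{ print $col }' "${@--}"
As with the original, trailing filenames are passed straight to awk, and stdin is read when none are given.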
It looks like you already have a solution. To make things easier, why not just put your command in a bash script (with a short name) and just run that instead of typing out that 'long' command every time?
If you are ok with manually selecting the column, you could be very fast using pick:
svn st | pick | xargs rm
Just go to any cell of the 2nd column, press c and then hit enter
Note that the file path does not have to be in the second column of svn st output. For example, if you modify a file and also modify its property, it will be in the third column.
See possible output examples in:
svn help st
Example output:
M wc/bar.c
A + wc/qax.c
I suggest cutting the first 8 characters:
svn st | cut -c8- | while read FILE; do echo whatever with "$FILE"; done
If you want to be 100% sure, and deal with fancy filenames (with whitespace at the end, for example), you need to parse the XML output:
svn st --xml | grep -o 'path=".*"' | sed 's/^path="//; s/"$//'
Of course you may want to use some real XML parser instead of grep/sed.
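If you do reach for a real parser, one possibility is a short Python helper (a sketch, assuming python3 is installed; the path attribute lives on the <entry> elements of svn st --xml output):
svn st --xml | python3 -c 'import sys, xml.etree.ElementTree as ET
for e in ET.parse(sys.stdin.buffer).iter("entry"):
    print(e.get("path"))'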

How can I shuffle the lines of a text file on the Unix command line or in a shell script?

I want to shuffle the lines of a text file randomly and create a new file. The file may have several thousands of lines.
How can I do that with cat, awk, cut, etc?
You can use shuf. On some systems at least (doesn't appear to be in POSIX).
As jleedev pointed out: sort -R might also be an option. On some systems at least; well, you get the picture. It has been pointed out that sort -R doesn't really shuffle but instead sorts items according to their hash value.
[Editor's note: sort -R almost shuffles, except that duplicate lines / sort keys always end up next to each other. In other words: only with unique input lines / keys is it a true shuffle. While it's true that the output order is determined by hash values, the randomness comes from choosing a random hash function - see manual.]
A Perl one-liner would be a simple version of Maxim's solution:
perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < myfile
This answer complements the many great existing answers in the following ways:
The existing answers are packaged into flexible shell functions:
The functions take not only stdin input, but alternatively also filename arguments
The functions take extra steps to handle SIGPIPE in the usual way (quiet termination with exit code 141), as opposed to breaking noisily. This is important when piping the function output to a pipe that is closed early, such as when piping to head.
A performance comparison is made.
POSIX-compliant function based on awk, sort, and cut, adapted from the OP's own answer:
shuf() { awk 'BEGIN {srand(); OFMT="%.17f"} {print rand(), $0}' "$@" |
sort -k1,1n | cut -d ' ' -f2-; }
Perl-based function - adapted from Moonyoung Kang's answer:
shuf() { perl -MList::Util=shuffle -e 'print shuffle(<>);' "$@"; }
Python-based function, adapted from scai's answer:
shuf() { python -c '
import sys, random, fileinput; from signal import signal, SIGPIPE, SIG_DFL;
signal(SIGPIPE, SIG_DFL); lines=[line for line in fileinput.input()];
random.shuffle(lines); sys.stdout.write("".join(lines))
' "$#"; }
See the bottom section for a Windows version of this function.
Ruby-based function, adapted from hoffmanc's answer:
shuf() { ruby -e 'Signal.trap("SIGPIPE", "SYSTEM_DEFAULT");
puts ARGF.readlines.shuffle' "$@"; }
Performance comparison:
Note: These numbers were obtained on a late-2012 iMac with 3.2 GHz Intel Core i5 and a Fusion Drive, running OSX 10.10.3. While timings will vary with OS used, machine specs, awk implementation used (e.g., the BSD awk version used on OSX is usually slower than GNU awk and especially mawk), this should provide a general sense of relative performance.
Input file is a 1-million-lines file produced with seq -f 'line %.0f' 1000000.
Times are listed in ascending order (fastest first):
shuf
0.090s
Ruby 2.0.0
0.289s
Perl 5.18.2
0.589s
Python
1.342s with Python 2.7.6; 2.407s(!) with Python 3.4.2
awk + sort + cut
3.003s with BSD awk; 2.388s with GNU awk (4.1.1); 1.811s with mawk (1.3.4);
For further comparison, the solutions not packaged as functions above:
sort -R (not a true shuffle if there are duplicate input lines)
10.661s - allocating more memory doesn't seem to make a difference
Scala
24.229s
bash loops + sort
32.593s
Conclusions:
Use shuf, if you can - it's the fastest by far.
Ruby does well, followed by Perl.
Python is noticeably slower than Ruby and Perl, and, comparing Python versions, 2.7.6 is quite a bit faster than 3.4.1
Use the POSIX-compliant awk + sort + cut combo as a last resort; which awk implementation you use matters (mawk is faster than GNU awk, BSD awk is slowest).
Stay away from sort -R, bash loops, and Scala.
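If you want to sanity-check these numbers on your own machine, the test input and a single timing run can be reproduced roughly like this (data.txt is just a placeholder name; substitute whichever shuf variant you're testing):
seq -f 'line %.0f' 1000000 > data.txt
time shuf data.txt > /dev/null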
Windows versions of the Python solution (the Python code is identical, except for variations in quoting and the removal of the signal-related statements, which aren't supported on Windows):
For PowerShell (in Windows PowerShell, you'll have to adjust $OutputEncoding if you want to send non-ASCII characters via the pipeline):
# Call as `shuf someFile.txt` or `Get-Content someFile.txt | shuf`
function shuf {
$Input | python -c @'
import sys, random, fileinput;
lines=[line for line in fileinput.input()];
random.shuffle(lines); sys.stdout.write(''.join(lines))
'@ $args
}
Note that PowerShell can natively shuffle via its Get-Random cmdlet (though performance may be a problem); e.g.:
Get-Content someFile.txt | Get-Random -Count ([int]::MaxValue)
For cmd.exe (a batch file):
Save to file shuf.cmd, for instance:
@echo off
python -c "import sys, random, fileinput; lines=[line for line in fileinput.input()]; random.shuffle(lines); sys.stdout.write(''.join(lines))" %*
I use a tiny perl script, which I call "unsort":
#!/usr/bin/perl
use List::Util 'shuffle';
@list = <STDIN>;
print shuffle(@list);
I've also got a NULL-delimited version, called "unsort0" ... handy for use with find -print0 and so on.
PS: Voted up 'shuf' too, I had no idea that was there in coreutils these days ... the above may still be useful if your system doesn't have 'shuf'.
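For illustration only (this is not the author's actual unsort0, just a guess at its shape), a NUL-delimited variant can reuse the same idea; perl's -0 switch sets the input record separator to the NUL character, so it pairs with find -print0 and xargs -0:
find . -print0 | perl -0 -MList::Util=shuffle -e 'print shuffle(<STDIN>);' | xargs -0 printf '%s\n'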
Here is a first try that's easy on the coder but hard on the CPU: it prepends a random number to each line, sorts them, and then strips the random number from each line. In effect, the lines are sorted randomly:
cat myfile | awk 'BEGIN{srand();}{print rand()"\t"$0}' | sort -k1 -n | cut -f2- > myfile.shuffled
here's an awk script
awk 'BEGIN{srand() }
{ lines[++d]=$0 }
END{
while (1){
if (e==d) {break}
RANDOM = int(1 + rand() * d)
if ( RANDOM in lines ){
print lines[RANDOM]
delete lines[RANDOM]
++e
}
}
}' file
output
$ cat file
1
2
3
4
5
6
7
8
9
10
$ ./shell.sh
7
5
10
9
6
8
2
1
3
4
A one-liner for python:
python -c "import random, sys; lines = open(sys.argv[1]).readlines(); random.shuffle(lines); print ''.join(lines)," myFile
And for printing just a single random line:
python -c "import random, sys; print random.choice(open(sys.argv[1]).readlines())," myFile
But see this post for the drawbacks of python's random.shuffle(). It won't work well with many (more than 2080) elements.
A simple awk-based function will do the job:
shuffle() {
awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
}
usage:
any_command | shuffle
This should work on almost any UNIX. Tested on Linux, Solaris and HP-UX.
Update:
Note that the leading zeros (%06d) and the rand() multiplication make it work properly even on systems where sort does not understand numbers: the keys can then be compared in lexicographical order (i.e., a normal string compare).
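A tiny illustration of why the padding matters when the keys end up compared as plain strings (throwaway input, not from the function above):
printf '%s\n' 9 10 | sort      # string compare: 10 sorts before 9
printf '%06d\n' 9 10 | sort    # zero-padded: 000009 sorts before 000010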
Ruby FTW:
ls | ruby -e 'puts STDIN.readlines.shuffle'
A simple and intuitive way would be to use shuf.
Example:
Assume words.txt as:
the
an
linux
ubuntu
life
good
breeze
To shuffle the lines, do:
$ shuf words.txt
which writes the shuffled lines to standard output; so you have to redirect it to an output file, like:
$ shuf words.txt > shuffled_words.txt
One such shuffle run could yield:
breeze
the
linux
an
ubuntu
good
life
A one-liner for Python based on scai's answer, but a) it takes stdin, b) it makes the result repeatable with a seed, and c) it picks out only 200 of all the lines.
$ cat file | python -c "import random, sys;
random.seed(100); print ''.join(random.sample(sys.stdin.readlines(), 200))," \
> 200lines.txt
There's a package that does this very job:
sudo apt-get install randomize-lines
Example:
Create an ordered list of numbers, and save it to 1000.txt:
seq 1000 > 1000.txt
to shuffle it, simply use
rl 1000.txt
If, like me, you came here looking for an alternative to shuf on macOS, use randomize-lines.
Install the randomize-lines (Homebrew) package, which provides an rl command with functionality similar to shuf.
brew install randomize-lines
Usage: rl [OPTION]... [FILE]...
Randomize the lines of a file (or stdin).
-c, --count=N select N lines from the file
-r, --reselect lines may be selected multiple times
-o, --output=FILE
send output to file
-d, --delimiter=DELIM
specify line delimiter (one character)
-0, --null set line delimiter to null character
(useful with find -print0)
-n, --line-number
print line number with output lines
-q, --quiet, --silent
do not output any errors or warnings
-h, --help display this help and exit
-V, --version output version information and exit
This is a python script that I saved as rand.py in my home folder:
#!/bin/python
import sys
import random
if __name__ == '__main__':
with open(sys.argv[1], 'r') as f:
flist = f.readlines()
random.shuffle(flist)
for line in flist:
print line.strip()
On Mac OSX sort -R and shuf are not available so you can alias this in your bash_profile as:
alias shuf='python rand.py'
If you have Scala installed, here's a one-liner to shuffle the input:
ls -1 | scala -e 'for (l <- util.Random.shuffle(io.Source.stdin.getLines.toList)) println(l)'
This bash function has minimal dependencies (only sort and bash):
shuf() {
  # tag each line with a random key, sort, then strip the key
  while IFS= read -r x; do
    printf '%s\n' "$RANDOM"$'\x1f'"$x"
  done | sort |
  while IFS=$'\x1f' read -r x y; do
    printf '%s\n' "$y"
  done
}
In Windows, you may try this batch file to help you shuffle your data.txt. The usage of the batch code is:
C:\> type list.txt | shuffle.bat > maclist_temp.txt
After issuing this command, maclist_temp.txt will contain a randomized list of lines.
Hope this helps.
Not mentioned as of yet:
The unsort util. Syntax (somewhat playlist oriented):
unsort [-hvrpncmMsz0l] [--help] [--version] [--random] [--heuristic]
[--identity] [--filenames[=profile]] [--separator sep] [--concatenate]
[--merge] [--merge-random] [--seed integer] [--zero-terminated] [--null]
[--linefeed] [file ...]
msort can shuffle by line, but it's usually overkill:
seq 10 | msort -jq -b -l -n 1 -c r
Another awk variant:
#!/usr/bin/awk -f
# usage:
# awk -f randomize_lines.awk lines.txt
# usage after "chmod +x randomize_lines.awk":
# randomize_lines.awk lines.txt
BEGIN {
FS = "\n";
srand();
}
{
lines[ rand()] = $0;
}
END {
for( k in lines ){
print lines[k];
}
}

What is your latest useful Perl one-liner (or a pipe involving Perl)? [closed]

The one-liner should:
solve a real-world problem
not be extensively cryptic (should be easy to understand and reproduce)
be worth the time it takes to write it (should not be too clever)
I'm looking for practical tips and tricks (complementary examples for perldoc perlrun).
Please see my slides for "A Field Guide To The Perl Command Line Options."
Squid log files. They're great, aren't they? Except by default they have seconds-from-the-epoch as the time field. Here's a one-liner that reads from a squid log file and converts the time into a human readable date:
perl -pe's/([\d.]+)/localtime $1/e;' access.log
With a small tweak, you can make it only display lines with a keyword you're interested in. The following watches for stackoverflow.com accesses and prints only those lines, with a human readable date. To make it more useful, I'm giving it the output of tail -f, so I can see accesses in real time:
tail -f access.log | perl -ne's/([\d.]+)/localtime $1/e,print if /stackoverflow\.com/'
The problem: a media player does not automatically load subtitles because their names differ from those of the corresponding video files.
Solution: Rename all *.srt (files with subtitles) to match the *.avi (files with video).
perl -e'while(<*.avi>) { s/avi$/srt/; rename <*.srt>, $_ }'
CAVEAT: Sorting order of original video and subtitle filenames should be the same.
Here is a more verbose version of the above one-liner:
my @avi = glob('*.avi');
my @srt = glob('*.srt');
for my $i (0..$#avi)
{
my $video_filename = $avi[$i];
$video_filename =~ s/avi$/srt/; # 'movie1.avi' -> 'movie1.srt'
my $subtitle_filename = $srt[$i]; # 'film1.srt'
rename($subtitle_filename, $video_filename); # 'film1.srt' -> 'movie1.srt'
}
The common idiom of using find ... -exec rm {} \; to delete a set of files somewhere in a directory tree is not particularly efficient in that it executes the rm command once for each file found. One of my habits, born from the days when computers weren't quite as fast (dagnabbit!), is to replace many calls to rm with one call to perl:
find . -name '*.whatever' | perl -lne unlink
The perl part of the command line reads the list of files emitted* by find, one per line, trims the newline off, and deletes the file using perl's built-in unlink() function, which takes $_ as its argument if no explicit argument is supplied. ($_ is set to each line of input thanks to the -n flag.) (*These days, most find commands do -print by default, so I can leave that part out.)
I like this idiom not only because of the efficiency (possibly less important these days) but also because it has fewer chorded/awkward keys than typing the traditional -exec rm {} \; sequence. It also avoids quoting issues caused by file names with spaces, quotes, etc., of which I have many. (A more robust version might use find's -print0 option and then ask perl to read null-delimited records instead of lines, but I'm usually pretty confident that my file names do not contain embedded newlines.)
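That more robust variant might look like the following sketch: -print0 emits NUL-terminated names, -0 tells perl to split records on NUL, and -l chomps the separator before unlink sees the name.
find . -name '*.whatever' -print0 | perl -0lne unlink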
You may not think of this as Perl, but I use ack religiously (it's a smart grep replacement written in Perl) and that lets me edit, for example, all of my Perl tests which access a particular part of our API:
vim $(ack --perl -l 'api/v1/episode' t)
As a side note, if you use vim, you can run all of the tests in your editor's buffers.
For something with more obvious (if simple) Perl, I needed to know how many test programs used our test fixtures in the t/lib/TestPM directory (I've cut down the command for clarity).
ack $(ls t/lib/TestPM/|awk -F'.' '{print $1}'|xargs perl -e 'print join "|" => @ARGV') aggtests/ t -l
Note how the "join" turns the results into a regex to feed to ack.
All one-liners from the answers collected in one place:
perl -pe's/([\d.]+)/localtime $1/e;' access.log
ack $(ls t/lib/TestPM/|awk -F'.' '{print $1}'|xargs perl -e 'print join "|" => @ARGV')
aggtests/ t -l
perl -e'while(<*.avi>) { s/avi$/srt/; rename <*.srt>, $_ }'
find . -name '*.whatever' | perl -lne unlink
tail -F /var/log/squid/access.log | perl -ane 'BEGIN{$|++} $F[6] =~ m{\Qrad.live.com/ADSAdClient31.dll}
&& printf "%02d:%02d:%02d %15s %9d\n", sub{reverse @_[0..2]}->(localtime $F[0]), @F[2,4]'
export PATH=$(perl -F: -ane'print join q/:/, grep { !$c{$_}++ } @F'<<<$PATH)
alias e2d="perl -le \"print scalar(localtime($ARGV[0]));\""
perl -ple '$_=eval'
perl -00 -ne 'print sort split /^/'
perl -pe'1while+s/\t/" "x(8-pos()%8)/e'
tail -f log | perl -ne '$s=time() unless $s; $n=time(); $d=$n-$s; if ($d>=2) { print qq
($. lines in last $d secs, rate ),$./$d,qq(\n); $. =0; $s=$n; }'
perl -MFile::Spec -e 'print join(qq(\n),File::Spec->path).qq(\n)'
See corresponding answers for their descriptions.
The Perl one-liner I use the most is the Perl calculator
perl -ple '$_=eval'
One of the biggest bandwidth hogs at $work is downloaded web advertising, so I'm looking at the low-hanging fruit waiting to be picked. I've got rid of Google ads; now I have Microsoft in my sights. So I run a tail on the log file, and pick out the lines of interest:
tail -F /var/log/squid/access.log | \
perl -ane 'BEGIN{$|++} $F[6] =~ m{\Qrad.live.com/ADSAdClient31.dll}
&& printf "%02d:%02d:%02d %15s %9d\n",
sub{reverse @_[0..2]}->(localtime $F[0]), @F[2,4]'
What the Perl pipe does is begin by setting autoflush to true, so that any line that is acted upon is printed out immediately. Otherwise the output is chunked up and one receives a batch of lines only when the output buffer fills. The -a switch splits each input line on whitespace and saves the results in the array @F (functionality inspired by awk's capacity to split input records into its $1, $2, $3... variables).
It checks whether the 7th field in the line contains the URI we seek (using \Q to save us the pain of escaping uninteresting metacharacters). If a match is found, it pretty-prints the time, the source IP and the number of bytes returned from the remote site.
The time is obtained by taking the epoch time in the first field and using localtime to break it down into its components (second, minute, hour, day, month, year). It takes a slice of the first three elements returned (second, minute and hour) and reverses the order to get hour, minute and second. This three-element list, along with a slice of the third (IP address) and fifth (size) fields from the original @F array, gives the five arguments passed to printf, which formats the results.
@dr_pepper
Remove literal duplicates in $PATH:
$ export PATH=$(perl -F: -ane'print join q/:/, grep { !$c{$_}++ } @F'<<<$PATH)
Print unique clean paths from the %PATH% environment variable (it doesn't touch ../ and the like; replace File::Spec->rel2abs with Cwd::realpath if that is desirable). It is not a one-liner, in order to be more portable:
#!/usr/bin/perl -w
use File::Spec;
$, = "\n";
print grep { !$count{$_}++ }
map { File::Spec->rel2abs($_) }
File::Spec->path;
I use this quite frequently to quickly convert epoch times to a useful datestamp.
perl -l -e 'print scalar(localtime($ARGV[0]))'
Make an alias in your shell:
alias e2d="perl -le \"print scalar(localtime($ARGV[0]));\""
Then pipe an epoch number to the alias.
echo 1219174516 | e2d
Many programs and utilities on Unix/Linux use epoch values to represent time, so this has proved invaluable for me.
Remove duplicates in path variable:
set path=(`echo $path | perl -e 'foreach(split(/ /,<>)){print $_," " unless $s{$_}++;}'`)
Remove MS-DOS line-endings.
perl -p -i -e 's/\r\n$/\n/' htdocs/*.asp
Extracting Stack Overflow reputation without having to open a web page:
perl -nle "print ' Stack Overflow ' . $1 . ' (no change)' if /\s{20,99}([0-9,]{3,6})<\/div>/;" "SO.html" >> SOscores.txt
This assumes the user page has already been downloaded to file SO.html. I use wget for this purpose. The notation here is for Windows command line; it would be slightly different for Linux or Mac OS X. The output is appended to a text file.
I use it in a BAT script to automate sampling of reputation on the four sites in the family:
Stack Overflow, Server Fault, Super User and Meta Stack Overflow.
In response to Ovid's Vim/ack combination:
I too am often searching for something and then want to open the matching files in Vim, so I made myself a little shortcut some time ago (works in Z shell only, I think):
function vimify-eval; {
if [[ ! -z "$BUFFER" ]]; then
if [[ $BUFFER = 'ack'* ]]; then
BUFFER="$BUFFER -l"
fi
BUFFER="vim \$($BUFFER)"
zle accept-line
fi
}
zle -N vim-eval-widget vimify-eval
bindkey '^P' vim-eval-widget
It works like this: I search for something using ack, like ack some-pattern. I look at the results and if I like them, I press arrow-up to get the ack line again and then press Ctrl + P. What happens then is that Z shell appends "-l" (for listing filenames only) if the command starts with "ack". Then it puts "$(...)" around the command and "vim" in front of it. Then the whole thing is executed.
I often need to see a readable version of the PATH while shell scripting. The following one-liners print every path entry on its own line.
Over time this one-liner has evolved through several phases:
Unix (version 1):
perl -e 'print join("\n",split(":",$ENV{"PATH"}))."\n"'
Windows (version 2):
perl -e "print join(qq(\n),split(';',$ENV{'PATH'})).qq(\n)"
Both Unix/Windows (using q/qq tip from #j-f-sebastian) (version 3):
perl -MFile::Spec -e 'print join(qq(\n), File::Spec->path).qq(\n)' # Unix
perl -MFile::Spec -e "print join(qq(\n), File::Spec->path).qq(\n)" # Windows
One of the most recent one-liners that got a place in my ~/bin:
perl -ne '$s=time() unless $s; $n=time(); $d=$n-$s; if ($d>=2) { print "$. lines in last $d secs, rate ",$./$d,"\n"; $. =0; $s=$n; }'
You would use it against a tail of a log file and it will print the rate of lines being output.
Want to know how many hits per second you are getting on your webservers? tail -f log | this_script.
Get human-readable output from du, sorted by size:
perl -e '%h=map{/.\s/;7x(ord$&&10)+$`,$_}`du -h`;print@h{sort%h}'
Filters a stream of white-space separated stanzas (name/value pair lists),
sorting each stanza individually:
perl -00 -ne 'print sort split /^/'
Network administrators have the tendency to misconfigure "subnet address" as "host address" especially while using Cisco ASDM auto-suggest. This straightforward one-liner scans the configuration files for any such configuration errors.
incorrect usage: permit host 10.1.1.0
correct usage: permit 10.1.1.0 255.255.255.0
perl -ne "print if /host ([\w\-\.]+){3}\.0 /" *.conf
This was tested and used on Windows, please suggest if it should be modified in any way for correct usage.
Expand all tabs to spaces: perl -pe'1while+s/\t/" "x(8-pos()%8)/e'
Of course, this could be done with :set et, :ret in Vim.
I have a list of tags with which I identify portions of text. The master list is of the format:
text description {tag_label}
It's important that the {tag_label} are not duplicated. So there's this nice simple script:
perl -ne '($c) = $_ =~ /({.*?})/; print $c,"\n" ' $1 | sort | uniq -c | sort -d
I know that I could do the whole lot in shell or perl, but this was the first thing that came to mind.
Often I have had to convert tabular data into configuration files. For example, network cabling vendors provide the patching record in Excel format and we have to use that information to create configuration files. i.e.,
Interface, Connect to, Vlan
Gi1/0/1, Desktop, 1286
Gi1/0/2, IP Phone, 1317
should become:
interface Gi1/0/1
description Desktop
switchport access vlan 1286
and so on. The same task reappears in several forms in various administration tasks where tabular data needs to be prepended with its field names and transposed to a flat structure. I have seen some DBAs waste a lot of time preparing their SQL statements from Excel sheets. It can be achieved using this simple one-liner. Just save the tabular data in CSV format using your favourite spreadsheet tool and run this one-liner. The field names in the header row get prepended to the individual cell values, so you may have to edit it to match your requirements.
perl -F, -lane "if ($.==1) {@keys = @F} else{print @keys[$_].$F[$_] foreach(0..$#F)} "
The caveat is that none of the field names or values should contain any commas. Perhaps this can be further elaborated to catch such exceptions in a one-liner; please improve this if possible.
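One way to cope with embedded commas is to lean on a CSV parser instead of -F,. A sketch using the Text::CSV module (a CPAN module, not core Perl; patching.csv is a placeholder filename), mirroring the header-prepending logic above:
perl -MText::CSV -le '
  my $csv  = Text::CSV->new({ binary => 1 });
  my @keys = @{ $csv->getline(\*STDIN) };        # header row
  while (my $row = $csv->getline(\*STDIN)) {     # data rows
      print $keys[$_] . $row->[$_] for 0 .. $#$row;
  }' < patching.csv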
Here is one that I find handy when dealing with a collection compressed log files:
open STATFILE, "zcat $logFile|" or die "Can't open zcat of $logFile" ;
At some point I found that anything I would want to do with Perl that is short enough to be done on the command line with 'perl -e' can be done better, easier and faster with normal Z shell features, without the hassle of quoting. E.g. the example above could be done like this:
srt=(*.srt); for foo in *.avi; mv $srt[1] ${foo:r}.srt && srt=($srt[2,-1])
