Compare some specific columns of lines within a file using bash script - bash

I want to compare the 2nd and 4th columns of lines in a file. In detail, line 1 with 2,3,4...N, then line 2 with 3,4,5...N, and so on.
I have written a script, it worked but running so long, over 30 minutes.
Let the number of lines is 1733 with header, my code is:
for line1 in {2..1733}; do \
for line2 in {$((line1+1))..1733}; do \
i7_diff=$(cmp -bl \
<(sed -n "${line1}p" Lab_primers.tsv | cut -f 2) \
<(sed -n "${line2}p" Lab_primers.tsv | cut -f 2) | wc -l);
i5_diff=$(cmp -bl \
<(sed -n "${line1}p" Lab_primers.tsv | cut -f 4) \
<(sed -n "${line2}p" Lab_primers.tsv | cut -f 4) | wc -l);
if [ $i7_diff -lt 3 ]; then
if [ $i5_diff -lt 3 ]; then
echo $(sed -n "${line1}p" Lab_primers.tsv)"\n" >> primer_collision.txt
echo $(sed -n "${line2}p" Lab_primers.tsv)"\n\n" >> primer_collision.txt
fi;
fi;
done
done
I used nested for loops then using sed to print exactly the $line, next using cut to extract the desired column. Finally, the cmp and wc command to count the number of differences of two columns of a pair lines.
If meeting the condition (both 2nd and 4th columns of pair of lines have the number of differences less than 3), the code will print a pair lines to output file.
Here is an excerpt of the input (it has 1733 lines):
I7_Index_ID index I5_Index_ID index2 primer
D703 CGCTCATT D507 ACGTCCTG 27
D704 GAGATTCC D507 ACGTCCTG 28
D701 ATTACTCG D508 GTCAGTAC 29
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
D705 ATTCAGAA D504 TCAGAGCC 45
D706 GAATTCGT D504 TCAGAGCC 46
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
D715 GGTGATGA D510 AACCTCTC 103
D716 AACCTACG D510 AACCTCTC 104
i787 TGCTTCCA i593 ATCGTCTC R35
Then the expected output is:
D703 CGCTCATT D507 ACGTCCTG 27
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
i787 TGCTTCCA i593 ATCGTCTC R35
My question is what the better code to deal with it, how to reduce the running time?
Thank you for your help!

You could start to sort by fields 2 and 4.
Then, no need for double loop: if a pair exist, they should be adjacent.
sort -k 2,2 -k 4,4 myfile.txt
Then, we need to print only bunch of consecutive lines that share the same 2 and 4 fields.
first=yes
sort -k 2,2 -k 4,4 test.txt | while read l
do
fields=(${l})
new2=${fields[1]}
new4=${fields[3]} # Fields 2 and 4, bash-way
if [[ "$new2" = "$old2" ]] && [[ "$new4" = "$old4" ]]
then
if [[ $first ]]
then
# first time we print something for this series: we need
# to also print the previous line (the first of the series)
echo; echo "$oldl"
# But if the next line is identical (series of 3, no need to repeat this line)
first=
fi
echo "$l"
else
# This line is not identical to the previous. So nothing to print
# If the next one is identical to this one, then, this one will
# be the first of its series
first=yes
fi
old2=$new2
old4=$new4
oldl="${l}"
done
One frustrating thing: uniq -D almost does all the job we did in this script. Except that it is unable to filter on specific lines.
But we could also rewrites lines so that uniq can work.
Not very fluent in awk (if I were, I am pretty sure awk could do the uniq work for me), but well
sort -k 2,2 -k 4,4 test.txt | awk '{print $0" "$2" "$4}' | uniq -D -f 5 | awk '{printf "%-12s %-9s %-12s %-9s %s\n",$1,$2,$3,$4,$5}'
does the job.
sort sort the lines by fields 2 and 4. awk add to the end of each line a copy of fields 2 and 4. Which then make uniq usable, since uniq is able to ignore N first fields. So here, we use uniq ignoring the 5 1st fields, that is working only on the copies of fields 2 and 4. With -D uniq display only duplicate lines.
Then, the last awk remove the copies of field we don't need anymore.

Related

while loops in parallel with input from splited file

I am stuck on that. So I have this while-read loop within my code that is taking so long and I would like to run it in many processors. But, I'd like to split the input file and run 14 loops (because I have 14 threads), one for each splited file, in parallel. Thing is that I don't know how to tell the while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file in 14 files and run 14 while loops in parallel, one for each splited file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But neither I was able to tell which file would be the input to the loop so parallel would understand "take each of those xa* files and place into individual loops in parallel"
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequences alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800Kb) with DNA sequence acessions, reads it line-by-line and look for each correspondent full taxonomy for those acessions in a databank (~45000 lines). Then, it does a few sums/divisions. Output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreads to thousands. Only columns 1 and 2 matters here. Column1 is the name of the sequence and col2 is the name of all sequences it matched in the databnk. It looks like that:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies correspondent to the DNA sequences. Each taxonomy is in one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, given sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output whis has all the accessions to sequences that it matched. What the loop does is that it finds the taxonomies for all those sequences and calculates the "statistics" I mentionded previously. Code does it for all the DNA-sequences-accesions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing so it takes about ~12h running in one processor for doing it to all the 45000 DNA sequence accessions. The, I would like to split input_file and do it in all the processors I have (14) because it would the time spend in that.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
That said I think your real problem is spawning too many programs: If you rewrite doit in Perl or Python I will expect you will see a major speedup.
As an alternative I threw together a quick test.
#! /bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits and if I nest 3 loops of that I get 36^3 or 46,656 lines which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for You, I am not familiar with parallel instead using native bash spawning processes &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait

How to sort lines based on specific part of their value?

When I run the following command:
command list -r machine-a-.* | sort -nr
It gives me the following result:
machine-a-9
machine-a-8
machine-a-72
machine-a-71
machine-a-70
I wish to sort these lines based on the number at the end, in descending order.
( Clearly sort -nr doesn't work as expected. )
You just need the -t and -k options in the sort.
command list -r machine-a-.* | sort -t '-' -k 3 -nr
-t is the separator used to separate the fields.
By giving it the value of '-', sort will see given text as:
Field 1 Field 2 Field 3
machine a 9
machine a 8
machine a 72
machine a 71
machine a 70
-k is specifying the field which will be used for comparison.
By giving it the value 3, sort will sort the lines by comparing the values from the Field 3.
Namely, these strings will be compared:
9
8
72
71
70
-n makes sort treat the fields for comparison as numbers instead of strings.
-r makes sort to sort the lines in reverse order(descending order).
Therefore, by sorting the numbers from Field 3 in reverse order, this will be the output:
machine-a-72
machine-a-71
machine-a-70
machine-a-9
machine-a-8
Here is an example of input to sort:
$ cat 1.txt
machine-a-9
machine-a-8
machine-a-72
machine-a-71
machine-a-70
Here is our short program:
$ cat 1.txt | ( IFS=-; while read A B C ; do echo $C $A-$B-$C; done ) | sort -rn | cut -d' ' -f 2
Here is its output:
machine-a-72
machine-a-71
machine-a-70
machine-a-9
machine-a-8
Explanation:
$ cat 1.txt \ (put contents of file into pipe input)
| ( \ (group some commands)
IFS=-; (set field separator to "-" for read command)
while read A B C ; (read fields in 3 variables A B and C every line)
do echo $C $A-$B-$C; (create output with $C in the beggining)
done
) \ (end of group)
| sort -rn \ (reverse number sorting)
| cut -d' ' -f 2 (cut-off first unneeded anymore field)

Get a percentage of randomly chosen lines from a text file

I have a text file (bigfile.txt) with thousands of rows. I want to make a smaller text file with 1 % of the rows which are randomly chosen. I tried the following
output=$(wc -l bigfile.txt)
ds1=$(0.01*output)
sort -r bigfile.txt|shuf|head -n ds1
It give the following error:
head: invalid number of lines: ‘ds1’
I don't know what is wrong.
Even after you fix your issues with your bash script, it cannot do floating point arithmetic. You need external tools like Awk which I would use as
randomCount=$(awk 'END{print int((NR==0)?0:(NR/100))}' bigfile.txt)
(( randomCount )) && sort -r file | shuf | head -n "$randomCount"
E.g. Writing a file with with 221 lines using the below loop and trying to get random lines,
tmpfile=$(mktemp /tmp/abc-script.XXXXXX)
for i in {1..221}; do echo $i; done >> "$tmpfile"
randomCount=$(awk 'END{print int((NR==0)?0:(NR/100))}' "$tmpfile")
If I print the count, it would return me a integer number 2 and using that on the next command,
sort -r "$tmpfile" | shuf | head -n "$randomCount"
86
126
Roll a die (with rand()) for each line of the file and get a number between 0 and 1. Print the line if the die shows less than 0.01:
awk 'rand()<0.01' bigFile
Quick test - generate 100,000,000 lines and count how many get through:
seq 1 100000000 | awk 'rand()<0.01' | wc -l
999308
Pretty close to 1%.
If you want the order random as well as the selection, you can pass this through shuf afterwards:
seq 1 100000000 | awk 'rand()<0.01' | shuf
On the subject of efficiency which came up in the comments, this solution takes 24s on my iMac with 100,000,000 lines:
time { seq 1 100000000 | awk 'rand()<0.01' > /dev/null; }
real 0m23.738s
user 0m31.787s
sys 0m0.490s
The only other solution that works here, heavily based on OP's original code, takes 13 minutes 19s.

Get variable value by finding keyword in unix environment

In UNIX environment, I have a file.txt that contains following details:
Data recording started:
0001100 Matched at 412090
0001101 Mismatched at 414798
0001102 Matched at 420007
0001103 Mismatched at 420015
Job completed
How do I can get the first Matched value by searching "Matched" (line 2) word and also for the first "Mismatched" (line 3)
Find the difference between them and store as a variable, "dif"
The result is Matched minus Mismatched, so it cannot find the data by specify line number, i.e. find line 3 last integers minus line 2 last integers, because the mismatched may come at first like following:
Data recording started:
0001100 Mismatched at 412090
0001101 Matched at 414798
0001102 Mismatched at 420007
0001103 Matched at 420015
Job completed
One way:
echo $((
$(grep Matched input | head -1 | sed 's/.*at //')
- $(grep Mismatched input | head -1 | sed 's/.*at //')
))
or using only sed:
echo $((
$(sed -n 's/.*Matched.*at //p' input | head -1)
- $(sed -n 's/.*Mismatched.*at //p' input | head -1)
))
Output
-2708
We can use grep -m 1 to kick away head.
dif=$((
$(grep -m 1 'Matched' a.txt | sed 's/.*at \([0-9]*\).*/\1/')
- $(grep -m 1 'Mismatched' a.txt | sed 's/.*at \([0-9]*\).*/\1/')
))
echo $dif

How do I pick random unique lines from a text file in shell?

I have a text file with an unknown number of lines. I need to grab some of those lines at random, but I don't want there to be any risk of repeats.
I tried this:
jot -r 3 1 `wc -l<input.txt` | while read n; do
awk -v n=$n 'NR==n' input.txt
done
But this is ugly, and doesn't protect against repeats.
I also tried this:
awk -vmax=3 'rand() > 0.5 {print;count++} count>max {exit}' input.txt
But that obviously isn't the right approach either, as I'm not guaranteed even to get max lines.
I'm stuck. How do I do this?
This might work for you:
shuf -n3 file
shuf is one of GNU coreutils.
If you have Python accessible (change the 10 to what you'd like):
python -c 'import random, sys; print("".join(random.sample(sys.stdin.readlines(), 10)).rstrip("\n"))' < input.txt
(This will work in Python 2.x and 3.x.)
Also, (again change the 10 to the appropriate value):
sort -R input.txt | head -10
If jot is on your system, then I guess you're running FreeBSD or OSX rather than Linux, so you probably don't have tools like rl or sort -R available.
No worries. I had to do this a while ago. Try this instead:
$ printf 'one\ntwo\nthree\nfour\nfive\n' > input.txt
$ cat rndlines
#!/bin/sh
# default to 3 lines of output
lines="${1:-3}"
# default to "input.txt" as input file
input="${2:-input.txt}"
# First, put a random number at the beginning of each line.
while read line; do
printf '%8d%s\n' $(jot -r 1 1 99999999) "$line"
done < "$input" |
sort -n | # Next, sort by the random number.
sed 's/^.\{8\}//' | # Last, remove the number from the start of each line.
head -n "$lines" # Show our output
$ ./rndlines input.txt
two
one
five
$ ./rndlines input.txt
four
two
three
$
Here's a 1-line example that also inserts the random number a little more cleanly using awk:
$ printf 'one\ntwo\nthree\nfour\nfive\n' | awk 'BEGIN{srand()} {printf("%8d%s\n", rand()*10000000, $0)}' | sort -n | head -n 3 | cut -c9-
Note that different versions of sed (in FreeBSD and OSX) may require the -E option instead of -r to handle ERE instead or BRE dialect in the regular expression if you want to use that explictely, though everything I've tested works with escapted bounds in BRE. (Ancient versions of sed (HP/UX, etc) might not support this notation, but you'd only be using those if you already knew how to do this.)
This should do the trick, at least with bash and assuming your environment has the other commands available:
cat chk.c | while read x; do
echo $RANDOM:$x
done | sort -t: -k1 -n | tail -10 | sed 's/^[0-9]*://'
It basically outputs your file, placing a random number at the start of each line.
Then it sorts on that number, grabs the last 10 lines, and removes that number from them.
Hence, it gives you ten random lines from the file, with no repeats.
For example, here's a transcript of it running three times with that chk.c file:
====
pax$ testprog chk.c
} else {
}
newNode->next = NULL;
colm++;
====
pax$ testprog chk.c
}
arg++;
printf (" [%s] n", currNode->value);
free (tempNode->value);
====
pax$ testprog chk.c
char tagBuff[101];
}
return ERR_OTHER;
#define ERR_MEM 1
===
pax$ _
sort -Ru filename | head -5
will ensure no duplicates. Not all implementations of sort have the -R option.
To get N random lines from FILE with Perl:
perl -MList::Util=shuffle -e 'print shuffle <>' FILE | head -N
Here's an answer using ruby if you don't want to install anything else:
cat filename | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
for example, given a file (dups.txt) that looks like:
1 2
1 3
2
1 2
3
4
1 3
5
6
6
7
You might get the following output (or some permutation):
cat dups.txt| ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
4
6
5
1 2
2
3
7
1 3
Further example from the comments:
printf 'test\ntest1\ntest2\n' | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
test1
test
test2
Of course if you have a file with repeated lines of test you'll get just one line:
printf 'test\ntest\ntest\n' | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
test

Resources