Slow speed with gawk for multiple edits to the same file - bash

I run a test environment where I created 40,000 test files with a lorem ipsum generator. The files are between 200 KB and 5 MB in size. I want to modify lots of random files: I will change 5% of the lines by deleting 2 lines and inserting 1 line with a base64 string.
The problem is that this procedure needs too much time per file. I tried to fix it by copying the test file to RAM and changing it there, but I see a single thread that uses only one full core, and gawk shows the most CPU work. I'm looking for solutions, but I can't find the right advice. I think gawk could do this in one step, but for big files I get a too-long argument string when I check against "getconf ARG_MAX".
How can I speed this up?
zeilen=$(wc -l < testfile$filecount.txt);
durchlauf=$(($zeilen/20))
zeilen=$((zeilen-2))
for (( c=1; c<=durchlauf; c++ ))
do
    zeile=$(shuf -i 1-$zeilen -n 1);
    zeile2=$((zeile+1))
    zeile3=$((zeile2+1))
    string=$(base64 /dev/urandom | tr -dc '[[:print:]]' | head -c 230)
    if [[ $c -eq 1 ]]
    then
        gawk -v n1="$zeile" -v n2="$zeile2" -v n3="$zeile3" -v s="$string" 'NR==n1{next;print} \
            NR==n2{next; print} NR==n3{print s}1' testfile$filecount.txt > /mnt/RAM/tempfile.tmp
    else
        gawk -i inplace -v n1="$zeile" -v n2="$zeile2" -v n3="$zeile3" -v s="$string" 'NR==n1{next; print} \
            NR==n2{next; print} NR==n3{print s}1' /mnt/RAM/tempfile.tmp
    fi
done

Assumptions:
generate $durchlauf (a number) random line numbers; we'll refer to a single number as n ...
delete lines numbered n and n+1 from the input file and in their place ...
insert $string (a randomly generated base64 string)
this list of random line numbers must not have any consecutive line numbers
As others have pointed out you want to limit yourself to a single gawk call per input file.
New approach:
generate $durchlauf (count) random numbers (see gen_numbers() function)
generate $durchlauf (count) base64 strings (we'll reuse Ed Morton's code)
paste these 2 sets of data into a single input stream/file
feed 2 files to gawk ... the paste result and the actual file to be modified
we won't be able to use gawk's -i inplace so we'll use an intermediate tmp file
when we find a matching line in our input file we'll 1) insert the base64 string and then 2) skip/delete the current/next input lines; this should address the issue where two random numbers differ by just 1
One idea to ensure we do not generate consecutive line numbers:
break our set of line numbers into ranges, eg, 100 lines split into 5 ranges => 1-20 / 21-40 / 41-60 / 61-80 / 81-100
reduce the end of each range by 1, eg, 1-19 / 21-39 / 41-59 / 61-79 / 81-99
use $RANDOM to generate numbers between each range (this tends to be at least a magnitude faster than comparable shuf calls)
We'll use a function to generate our list of non-consecutive line numbers:
gen_numbers () {
    max=$1          # $zeilen     eg, 100
    count=$2        # $durchlauf  eg, 5
    interval=$(( max / count ))   # eg, 100 / 5 = 20
    for (( start=1; start<max; start=start+interval ))
    do
        end=$(( start + interval - 2 ))
        out=$(( ( RANDOM % interval ) + start ))
        [[ $out -gt $end ]] && out=${end}
        echo ${out}
    done
}
Sample run:
$ zeilen=100
$ durchlauf=5
$ gen_numbers ${zeilen} ${durchlauf}
17
31
54
64
86
Demonstration of the paste/gen_numbers/base64/tr/gawk idea:
$ zeilen=300
$ durchlauf=3
$ paste <( gen_numbers ${zeilen} ${durchlauf} ) <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' )
This generates:
74 7VFhnDN4J...snip...rwnofLv
142 ZYv07oKMB...snip...xhVynvw
261 gifbwFCXY...snip...hWYio3e
Main code:
tmpfile=$(mktemp)
while/for loop ... # whatever OP is using to loop over list of input files
do
zeilen=$(wc -l < "testfile${filecount}".txt)
durchlauf=$(( $zeilen/20 ))
awk '
    # process 1st file (ie, the paste/gen_numbers/base64/tr/gawk output)
    FNR==NR { ins[$1]=$2            # store the base64 string for line number $1 in the ins[] array
              del[$1]=del[($1)+1]   # referencing del[$1+1] creates that key too, so both line n and n+1 are flagged for deletion
              next
    }
    # process 2nd file (the actual testfile)
    FNR in ins { print ins[FNR] }   # insert the base64 string before the lines being deleted
    ! (FNR in del)                  # print the current line only if it is not flagged for deletion
' <( paste <( gen_numbers ${zeilen} ${durchlauf} ) <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' )) "testfile${filecount}".txt > "${tmpfile}"
# the last line with line continuations for readability:
#' <( paste \
# <( gen_numbers ${zeilen} ${durchlauf} ) \
# <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' ) \
# ) \
#"testfile${filecount}".txt > "${tmpfile}"
mv "${tmpfile}" "testfile${filecount}".txt
done
Simple example of awk code in action:
$ cat orig.txt
line1
line2
line3
line4
line5
line6
line7
line8
line9
$ cat paste.out # simulated output from paste/gen_numbers/base64/tr/gawk
1 newline1
5 newline5
$ awk '...' paste.out orig.txt
newline1
line3
line4
newline5
line7
line8
line9

I don't know what the rest of your script is doing, but below will give you an idea of how to vastly improve its performance.
Instead of this which calls base64, tr, head, and awk on each iteration of the loop with all of the overhead that implies:
for (( c=1; c<=3; c++ ))
do
string=$(base64 /dev/urandom | tr -dc '[[:print:]]' | head -c 230)
echo "$string" | awk '{print "<" $0 ">"}'
done
<nSxzxmRQc11+fFnG7ET4EBIBUwoflPo9Mop0j50C1MtRoLNjb43aNTMNRSMePTnGub5gqDWeV4yEyCVYC2s519JL5OLpBFxSS/xOjbL4pkmoFqOceX3DTmsZrl/RG+YLXxiLBjL//I220MQAzpQE5bpfQiQB6BvRw64HbhtVzHYMODbQU1UYLeM6IMXdzPgsQyghv1MCFvs0Nl4Mez2Zh98f9+472c6K+44nmi>
<9xfgBc1Y7P/QJkB6PCIfNg0b7V+KmSUS49uU7XdT+yiBqjTLcNaETpMhpMSt3MLs9GFDCQs9TWKx7yXgbNch1p849IQrjhtZCa0H5rtCXJbbngc3oF9LYY8WT72RPiV/gk4wJrAKYq8/lKYzu0Hms0lHaOmd4qcz1hpzubP7NuiBjvv16A8T3slVG1p4vwxa5JyfgYIYo4rno219ba/vRMB1QF9HaAppdRMP32>
<K5kNgv9EN1a/c/7eatrivNeUzKYolCrz5tHE2yZ6XNm1aT4ZZq3OaY5UgnwF8ePIpMKVw5LZNstVwFdVaNvtL6JreCkcO+QtebsCYg5sAwIdozwXFs4F4hZ/ygoz3DEeMWYgFTcgFnfoCV2Rct2bg/mAcJBZ9+4x9IS+JNTA64T1Zl+FJiCuHS05sFIsZYBCqRADp2iL3xcTr913dNplqUvBEEsW1qCk/TDwQh>
you should write this which only calls each tool once and so will run orders of magnitude faster:
$ base64 /dev/urandom | tr -dc '[[:print:]]' |
gawk -v RS='.{230}' '{print "<" RT ">"} NR==3{exit}'
<X0If1qkQItVLDOmh2BFYyswBgKFZvEwyA+WglyU0BhqWHLzURt/AIRgL3olCWZebktfwBU6sK7N3nwK6QV2g5VheXIY7qPzkzKUYJXWvgGcrIoyd9tLUjkM3eusuTTp4TwNY6E/z7lT0/2oQrLH/yZr2hgAm8IXDVgWNkICw81BRPUqITNt3VqmYt/HKnL4d/i88F4QDE0XgivHzWAk6OLowtmWAiT8k1a0Me6>
<TqCyRXj31xsFcZS87vbA50rYKq4cvIIn1oCtN6PJcIsSUSjG8hIhfP8zwhzi6iC33HfL96JfLIBcLrojOIkd7WGGXcHsn0F0XVauOR+t8SRqv+/t9ggDuVsn6MsY2R4J+mppTMB3fcC5787u0dO5vO1UTFWZG0ZCzxvX/3oxbExXb8M54WL6PZQsNrVnKtkvllAT/s4mKsQ/ojXNB0CTw7L6AvB9HU7W2x+U3j>
<ESsGZlHjX/nslhJD5kJGsFvdMp+PC5KA+xOYlcTbc/t9aXoHhAJuy/KdjoGq6VkP+v4eQ5lNURdyxs+jMHqLVVtGwFYSlc61MgCt0IefpgpU2e2werIQAsrDKKT1DWTfbH1qaesTy2IhTKcEFlW/mc+1en8912Dig7Nn2MD8VQrGn6BzvgjzeGRqGLAtWJWkzQjfx+74ffJQUXW4uuEXA8lBvbuJ8+yQA2WHK5>

@mark-fuso, wow, that's incredibly fast! But there is a mistake in the script. The file grows in size a little bit, which is something I have to avoid. I think if two random line numbers ($durchlauf) follow each other, then one line is not deleted. Honestly, I don't completely understand what your command is doing, but it works very well. I think for such a task I need more bash experience.
Sample output:
64
65
66
gOf0Vvb9OyXY1Tjb1r4jkDWC4VIBpQAYnSY7KkT1gl5MfnkCMzUmN798pkgEVAlRgV9GXpknme46yZURCaAjeg6G5f1Fc7nc7AquIGnEER>
AFwB9cnHWu6SRnsupYCPViTC9XK+fwGkiHvEXrtw2aosTGAAFyu0GI8Ri2+NoJAvMw4mv/FE72t/xapmG5wjKpQYsBXYyZ9YVV0SE6c6rL>
70
71


Foreach command with table [duplicate]

I have a huge tab-separated file formatted like this
X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11
I would like to transpose it in an efficient way using only bash commands (I could write a ten-or-so-line Perl script to do that, but it should be slower to execute than the native bash functions). So the output should look like
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
I thought of a solution like this
cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done
But it's slow and doesn't seem the most efficient solution. I've seen a solution for vi in this post, but it's still over-slow. Any thoughts/suggestions/brilliant ideas? :-)
awk '
{
    for (i=1; i<=NF; i++) {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' file
output
$ more file
0 1 2
3 4 5
6 7 8
9 10 11
$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11
Performance against Perl solution by Jonathan on a 10000 lines file
$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2
$ wc -l < file
10000
$ time perl test.pl file >/dev/null
real 0m0.480s
user 0m0.442s
sys 0m0.026s
$ time awk -f test.awk file >/dev/null
real 0m0.382s
user 0m0.367s
sys 0m0.011s
$ time perl test.pl file >/dev/null
real 0m0.481s
user 0m0.431s
sys 0m0.022s
$ time awk -f test.awk file >/dev/null
real 0m0.390s
user 0m0.370s
sys 0m0.010s
EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).
Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator which the OP had originally asked for so it'd handle empty fields and it coincidentally pretties-up the output a bit for this particular case.
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
The above solutions will work in any awk (except old, broken awk of course - there YMMV).
The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
ENDFILE {
    print ""
    if (ARGIND < NF) {
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
$ awk -f tst.awk file
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
which uses almost no memory but reads the input file once per number of fields on a line so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line and it uses GNU awk for ENDFILE and ARGIND but any awk can do the same with tests on FNR==1 and END.
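For reference, a sketch of what that portable variant might look like (untested here; it assumes a named input file, the same number of fields on every line, and an awk that lets you append to ARGV while it runs, which POSIX permits; the file name tst_posix.awk is just illustrative):
$ cat tst_posix.awk
BEGIN { FS=OFS="\t" }
FNR==1 {                        # a new pass over the file is starting
    if (pass) print ""          # close the output row built by the previous pass
    pass++
    if (pass < NF) {            # queue another pass while fields remain
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
{ printf "%s%s", (FNR>1 ? OFS : ""), $pass }
END { print "" }
$ awk -f tst_posix.awk file
Like the gawk version, it re-reads the file once per column, trading speed for memory.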
awk
Gawk version which uses arrays of arrays:
tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?RS:FS),a[i][j]}' "${1+FS=$1}";}
Plain awk version which uses multidimensional arrays (this was about twice as slow in my benchmark):
tp(){ awk '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++)for(j=1;j<=NR;j++)printf"%s"(j==NR?RS:FS),a[i,j]}' "${1+FS=$1}";}
macOS comes with a version of Brian Kernighan's nawk from 2007 which doesn't support arrays of arrays.
To use space as a separator without collapsing sequences of multiple spaces, use FS='[ ]'.
rs
rs is a BSD utility which also comes with macOS, but it should be available from package managers on other platforms. It is named after the reshape function in APL.
Use sequences of spaces and tabs as column separator:
rs -T
Use tab as column separator:
rs -c -C -T
Use comma as column separator:
rs -c, -C, -T
-c changes the input column separator and -C changes the output column separator. A lone -c or -C sets the separator to tab. -T transposes rows and columns.
Do not use -t instead of -T, because it automatically selects the number of output columns so that the output lines fill the width of the display (which is 80 characters by default but which can be changed with -w).
When an output column separator is specified using -C, an extra column separator character is added to the end of each row, but you can remove it with sed:
$ seq 4|paste -d, - -|rs -c, -C, -T
1,3,
2,4,
$ seq 4|paste -d, - -|rs -c, -C, -T|sed s/.\$//
1,3
2,4
rs -T determines the number of columns based on the number of columns on the first row, so it produces the wrong result when the first line ends with one or more empty columns:
$ rs -c, -C, -T<<<$'1,\n3,4'
1,3,4,
R
The t function transposes a matrix or dataframe:
Rscript -e 'write.table(t(read.table("stdin",sep=",",quote="",comment.char="")),sep=",",quote=F,col.names=F,row.names=F)'
If you replace Rscript -e with R -e, then it echoes the code that is being run to STDOUT, and it also results in the error ignoring SIGPIPE signal if the R command is followed by a command like head -n1 which exits before it has read the whole STDIN.
quote="" can be removed if the input doesn't contain double quotes or single quotes, and comment.char="" can be removed if the input doesn't contain lines that start with a hash character.
For a big input file, fread and fwrite from data.table are faster than read.table and write.table:
$ seq 1e6|awk 'ORS=NR%1e3?FS:RS'>a
$ time Rscript --no-init-file -e 'write.table(t(read.table("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real 0m1.061s
user 0m0.983s
sys 0m0.074s
$ time Rscript --no-init-file -e 'write.table(t(data.table::fread("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real 0m0.599s
user 0m0.535s
sys 0m0.048s
$ time Rscript --no-init-file -e 'data.table::fwrite(t(data.table::fread("a")),sep=" ",col.names=F)'>t/b
x being coerced from class: matrix to data.table
real 0m0.375s
user 0m0.296s
sys 0m0.073s
jq
tp(){ jq -R .|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]';}
jq -R . prints each input line as a JSON string literal, -s (--slurp) creates an array for the input lines after parsing each line as JSON, and -r (--raw-output) outputs the contents of strings instead of JSON string literals. The / operator is overloaded to split strings.
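As a quick sanity check, the same seq/paste pattern used for the Ruby version below should give the following, assuming the tp function above is defined in your shell:
$ seq 4 | paste - - | tp
1	2
3	4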
Ruby
ruby -e'STDIN.map{|x|x.chomp.split(",",-1)}.transpose.each{|x|puts x*","}'
The -1 argument to split disables discarding empty fields at the end:
$ ruby -e'p"a,,".split(",")'
["a"]
$ ruby -e'p"a,,".split(",",-1)'
["a", "", ""]
Function form:
$ tp(){ ruby -e's=ARGV[0];STDIN.map{|x|x.chomp.split(s==" "?/ /:s,-1)}.transpose.each{|x|puts x*s}' -- "${1-$'\t'}";}
$ seq 4|paste -d, - -|tp ,
1,3
2,4
The function above uses s==" "?/ /:s because when the argument to the split function is a single space, it enables awk-like special behavior where strings are split based on contiguous runs of spaces and tabs:
$ ruby -e'p" a \tb ".split(" ",-1)'
["a", "b", ""]
$ ruby -e'p" a \tb ".split(/ /,-1)'
["", "a", "", "\tb", ""]
A Python solution:
python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output
The above is based on the following:
import sys
for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
    print(' '.join(c))
This code does assume that every line has the same number of columns (no padding is performed).
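If the rows can have different numbers of columns, a variant along these lines pads the short rows with empty strings (a sketch, Python 3 only, using itertools.zip_longest):
python -c "import sys, itertools; print('\n'.join(' '.join(c) for c in itertools.zip_longest(*(l.split() for l in sys.stdin if l.strip()), fillvalue='')))" < input > output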
Have a look at GNU datamash which can be used like datamash transpose.
A future version will also support cross tabulation (pivot tables)
Here is how you would do it with space separated columns:
datamash transpose -t ' ' < file > transposed_file
the transpose project on sourceforge is a coreutil-like C program for exactly that.
gcc transpose.c -o transpose
./transpose -t input > output #works with stdin, too.
Pure BASH, no additional process. A nice exercise:
declare -a array=( )   # we build a 1-D-array
read -a line < "$1"    # read the headline
COLS=${#line[@]}       # save number of columns
index=0
while read -a line ; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"
for (( ROW = 0; ROW < COLS; ROW++ )); do
    for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
        printf "%s\t" ${array[$COUNTER]}
    done
    printf "\n"
done
GNU datamash is perfectly suited for this problem with only one line of code and potentially arbitrarily large filesize!
datamash -W transpose infile > outfile
There is a purpose built utility for this,
GNU datamash utility
apt install datamash
datamash transpose < yourfile
Taken from this site, https://www.gnu.org/software/datamash/ and http://www.thelinuxrain.com/articles/transposing-rows-and-columns-3-methods
Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.
#!/bin/perl -w
#
# SO 1729824
use strict;
my(%data); # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
    my(@row) = split /\s+/;
    my($colnum) = 0;
    foreach my $val (@row)
    {
        $data{$rownum}{$colnum++} = $val;
    }
    $rownum++;
    $maxcol = $colnum if $colnum > $maxcol;
}
my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
    for (my $row = 0; $row < $maxrow; $row++)
    {
        printf "%s%s", ($row == 0) ? "" : "\t",
            defined $data{$row}{$col} ? $data{$row}{$col} : "";
    }
    print "\n";
}
With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk - 0.026s vs 0.042s. Neither is likely to be a problem.
Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8 on a file containing 10,000 lines with 5 columns per line:
Osiris JL: time gawk -f tr.awk xxx > /dev/null
real 0m0.367s
user 0m0.279s
sys 0m0.085s
Osiris JL: time perl -f transpose.pl xxx > /dev/null
real 0m0.138s
user 0m0.128s
sys 0m0.008s
Osiris JL: time awk -f tr.awk xxx > /dev/null
real 0m1.891s
user 0m0.924s
sys 0m0.961s
Osiris-2 JL:
Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.
Assuming all your rows have the same number of fields, this awk program solves the problem:
{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}
In words, as you loop over the rows, for every field f grow a ':'-separated string col[f] containing the elements of that field. After you are done with all the rows, print each one of those strings in a separate line. You can then substitute ':' for the separator you want (say, a space) by piping the output through tr ':' ' '.
Example:
$ echo -e "1 2 3\n4 5 6"
1 2 3
4 5 6
$ echo -e "1 2 3\n4 5 6" | awk '{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}' | tr ':' ' '
1 4
2 5
3 6
If you have sc installed, you can do:
psc -r < inputfile | sc -W% - > outputfile
I normally use this little awk snippet for this requirement:
awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
max=(max<NF?NF:max)}
END {for (i=1; i<=max; i++)
{for (j=1; j<=NR; j++)
printf "%s%s", a[i,j], (j==NR?RS:FS)
}
}' file
This just loads all the data into a bidimensional array a[line,column] and then prints it back as a[column,line], so that it transposes the given input.
This needs to keep track of the maximum amount of columns the initial file has, so that it is used as the number of rows to print back.
A hackish perl solution can be like this. It's nice because it doesn't load the whole file in memory, writes intermediate temp files, and then uses the all-wonderful paste
#!/usr/bin/perl
use warnings;
use strict;
my $counter;
open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
while (my $line = <INPUT>) {
    chomp $line;
    my @array = split ("\t",$line);
    open OUTPUT, ">temp$." or die ("unable to open output file!");
    print OUTPUT join ("\n",@array);
    close OUTPUT;
    $counter=$.;
}
close INPUT;
# paste files together
my $execute = "paste ";
foreach (1..$counter) {
    $execute.="temp$_ ";
}
$execute.="> $ARGV[1]";
system $execute;
The only improvement I can see to your own example is using awk which will reduce the number of processes that are run and the amount of data that is piped between them:
/bin/rm output 2> /dev/null
cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do
awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
done >> output
Some *nix standard util one-liners, no temp files needed. NB: the OP wanted an efficient fix (i.e. faster), and the top answers are usually faster than this answer. These one-liners are for those who like *nix software tools, for whatever reasons. In rare cases (e.g. scarce IO & memory), these snippets can actually be faster than some of the top answers.
Call the input file foo.
If we know foo has four columns:
for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done
If we don't know how many columns foo has:
n=$(head -n 1 foo | wc -w)
for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done
xargs has a size limit and therefore would do incomplete work with a long file. The size limit is system dependent, e.g.:
{ timeout '.01' xargs --show-limits ; } 2>&1 | grep Max
Maximum length of command we could actually use: 2088944
tr & echo:
for f in 1 2 3 4; do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo; done
...or if the # of columns are unknown:
n=$(head -n 1 foo | wc -w)
for f in $(seq 1 $n); do
cut -d ' ' -f $f foo | tr '\n' ' ' ; echo
done
Using set, which like xargs, has similar command line size based limitations:
for f in 1 2 3 4 ; do set - $(cut -d ' ' -f $f foo) ; echo $@ ; done
I used fgm's solution (thanks fgm!), but needed to eliminate the tab characters at the end of each row, so modified the script thus:
#!/bin/bash
declare -a array=( )   # we build a 1-D-array
read -a line < "$1"    # read the headline
COLS=${#line[@]}       # save number of columns
index=0
while read -a line; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"
for (( ROW = 0; ROW < COLS; ROW++ )); do
    for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
        printf "%s" ${array[$COUNTER]}
        if [ $COUNTER -lt $(( ${#array[@]} - $COLS )) ]
        then
            printf "\t"
        fi
    done
    printf "\n"
done
I was just looking for a similar bash transpose but with support for padding. Here is the script I wrote based on fgm's solution, which seems to work. If it can be of help...
#!/bin/bash
declare -a array=( )   # we build a 1-D-array
declare -a ncols=( )   # we build a 1-D-array containing number of elements of each row
SEPARATOR="\t";
PADDING="";
MAXROWS=0;
index=0
indexCol=0
while read -a line; do
    ncols[$indexCol]=${#line[@]};
    ((indexCol++))
    if [ ${#line[@]} -gt ${MAXROWS} ]
    then
        MAXROWS=${#line[@]}
    fi
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"
for (( ROW = 0; ROW < MAXROWS; ROW++ )); do
    COUNTER=$ROW;
    for (( indexCol=0; indexCol < ${#ncols[@]}; indexCol++ )); do
        if [ $ROW -ge ${ncols[indexCol]} ]
        then
            printf "$PADDING"
        else
            printf "%s" ${array[$COUNTER]}
        fi
        if [ $((indexCol+1)) -lt ${#ncols[@]} ]
        then
            printf "$SEPARATOR"
        fi
        COUNTER=$(( COUNTER + ncols[indexCol] ))
    done
    printf "\n"
done
I was looking for a solution to transpose any kind of matrix (nxn or mxn) with any kind of data (numbers or data) and got the following solution:
Row2Trans=number1
Col2Trans=number2
for ((i=1; $i <= Row2Trans; i++));do
for ((j=1; $j <=Col2Trans ; j++));do
awk -v var1="$i" -v var2="$j" 'BEGIN { FS = "," } ; NR==var1 {print $((var2)) }' $ARCHIVO >> Column_$i
done
done
paste -d',' `ls -mv Column_* | sed 's/,//g'` >> $ARCHIVO
If you only want to grab a single (comma delimited) line $N out of a file and turn it into a column:
head -$N file | tail -1 | tr ',' '\n'
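For example, with N=2 and a small throwaway comma-delimited file (illustrative data only):
$ printf 'a,b,c\nd,e,f\n' > file
$ head -2 file | tail -1 | tr ',' '\n'
d
e
f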
Not very elegant, but this "single-line" command solves the problem quickly:
cols=4; for((i=1;i<=$cols;i++)); do \
awk '{print $'$i'}' input | tr '\n' ' '; echo; \
done
Here cols is the number of columns, where you can replace 4 by head -n 1 input | wc -w.
Another awk solution, limited only by the amount of memory you have.
awk '{ for (i=1; i<=NF; i++) RtoC[i]= (RtoC[i]? RtoC[i] FS $i: $i) }
END{ for (i in RtoC) print RtoC[i] }' infile
This joins each same field-number position together, and in END it prints the result: the first row ends up in the first column, the second row in the second column, etc.
Will output:
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
#!/bin/bash
aline="$(head -n 1 file.txt)"
set -- $aline
colNum=$#
#set -x
while read line; do
set -- $line
for i in $(seq $colNum); do
eval col$i="\"\$col$i \$$i\""
done
done < file.txt
for i in $(seq $colNum); do
eval echo \${col$i}
done
another version with set eval
Here is a Bash one-liner that is based on simply converting each line to a column and paste-ing them together:
echo '' > tmp1; \
cat m.txt | while read l ; \
do paste tmp1 <(echo $l | tr -s ' ' \\n) > tmp2; \
cp tmp2 tmp1; \
done; \
cat tmp1
m.txt:
0 1 2
4 5 6
7 8 9
10 11 12
creates tmp1 file so it's not empty.
reads each line and transforms it into a column using tr
pastes the new column to the tmp1 file
copies result back into tmp1.
PS: I really wanted to use io-descriptors but couldn't get them to work.
Another bash variant
$ cat file
XXXX col1 col2 col3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11
Script
#!/bin/bash
I=0
while read line; do
i=0
for item in $line; { printf -v A$I[$i] $item; ((i++)); }
((I++))
done < file
indexes=$(seq 0 $i)
for i in $indexes; {
J=0
while ((J<I)); do
arr="A$J[$i]"
printf "${!arr}\t"
((J++))
done
echo
}
Output
$ ./test
XXXX row1 row2 row3 row4
col1 0 3 6 9
col2 1 4 7 10
col3 2 5 8 11
I'm a little late to the game but how about this:
cat table.tsv | python -c "import pandas as pd, sys; pd.read_csv(sys.stdin, sep='\t').T.to_csv(sys.stdout, sep='\t')"
or zcat if it's gzipped.
This is assuming you have pandas installed in your version of python
Here's a Haskell solution. When compiled with -O2, it runs slightly faster than ghostdog's awk and slightly slower than Stephan's thinly wrapped c python on my machine for repeated "Hello world" input lines. Unfortunately GHC's support for passing command line code is non-existent as far as I can tell, so you will have to write it to a file yourself. It will truncate the rows to the length of the shortest row.
transpose :: [[a]] -> [[a]]
transpose = foldr (zipWith (:)) (repeat [])
main :: IO ()
main = interact $ unlines . map unwords . transpose . map words . lines
An awk solution that stores the whole array in memory:
awk '$0!~/^$/{ i++;
split($0,arr,FS);
for (j in arr) {
out[i,j]=arr[j];
if (maxr<j){ maxr=j} # max number of output rows.
}
}
END {
maxc=i # max number of output columns.
for (j=1; j<=maxr; j++) {
for (i=1; i<=maxc; i++) {
printf( "%s:", out[i,j])
}
printf( "%s\n","" )
}
}' infile
But we may "walk" the file as many times as output rows are needed:
#!/bin/bash
maxf="$(awk '{ if (mf<NF) mf=NF } END{ print mf }' infile)"
rowcount=$maxf
for (( i=1; i<=rowcount; i++ )); do
awk -v i="$i" -F " " '{printf("%s\t ", $i)}' infile
echo
done
Which (for a low count of output rows) is faster than the previous code.
A oneliner using R...
cat file | Rscript -e "d <- read.table(file('stdin'), sep=' ', row.names=1, header=T); write.table(t(d), file=stdout(), quote=F, col.names=NA) "
I've used the two scripts below to do similar operations before. The first is in awk, which is a lot faster than the second, which is in "pure" bash. You might be able to adapt them to your own application.
awk '
{
for (i = 1; i <= NF; i++) {
s[i] = s[i]?s[i] FS $i:$i
}
}
END {
for (i in s) {
print s[i]
}
}' file.txt
declare -a arr
while IFS= read -r line
do
i=0
for word in $line
do
[[ ${arr[$i]} ]] && arr[$i]="${arr[$i]} $word" || arr[$i]=$word
((i++))
done
done < file.txt
for ((i=0; i < ${#arr[@]}; i++))
do
echo ${arr[i]}
done
Simple 4 line answer, keep it readable.
col="$(head -1 file.txt | wc -w)"
for i in $(seq 1 $col); do
awk '{ print $'$i' }' file.txt | paste -s -d "\t"
done

Random line using sed

I want to select a random line with sed. I know shuf -n and sort -R | head -n do the job, but for shuf you have to install coreutils, and the sort solution isn't optimal on large data:
Here is what I tested:
echo "$var" | shuf -n1
which gives the optimal solution, but I'm afraid about portability;
that's why I want to try it with sed.
var="Hi
i am a student
learning scripts"
output:
i am a student
output:
Hi
It must be random.
It depends greatly on what you want your pseudo-random probability distribution to look like. (Don't try for random, be content with pseudo-random. If you do manage to generate a truly random value, go collect your Nobel Prize.) If you just want a uniform distribution (e.g., each line has equal probability of being selected), then you'll need to know a priori how many lines are in the file. Getting that distribution is not quite so easy as allowing the earlier lines in the file to be slightly more likely to be selected, and since that's easy, we'll do that. Assuming that the number of lines is less than 32769, you can simply do:
N=$(wc -l < input-file)
sed -n -e $((RANDOM % N + 1))p input-file
-- edit --
After thinking about it for a bit, I realize you don't need to know the number of lines, so you don't need to read the data twice. I haven't done a rigorous analysis, but I believe that the following gives a uniform distribution:
awk 'BEGIN{srand()} rand() < 1/NR { out=$0 } END { print out }' input-file
-- edit --
Ed Morton suggests in the comments that we should be able to invoke rand() only once. That seems like it ought to work, but doesn't seem to. Curious:
$ time for i in $(seq 400); do awk -v seed=$(( $(date +%s) + i)) 'BEGIN{srand(seed); r=rand()} r < 1/NR { out=$0 } END { print out}' input; done | awk '{a[$0]++} END { for (i in a) print i, a[i]}' | sort
1 205
2 64
3 37
4 21
5 9
6 9
7 9
8 46
real 0m1.862s
user 0m0.689s
sys 0m0.907s
$ time for i in $(seq 400); do awk -v seed=$(( $(date +%s) + i)) 'BEGIN{srand(seed)} rand() < 1/NR { out=$0 } END { print out}' input; done | awk '{a[$0]++} END { for (i in a) print i, a[i]}' | sort
1 55
2 60
3 37
4 50
5 57
6 45
7 50
8 46
real 0m1.924s
user 0m0.710s
sys 0m0.932s
var="Hi
i am a student
learning scripts"
mapfile -t array <<< "$var" # create array from $var
echo "${array[$RANDOM % ${#array[@]}]}"
echo "${array[$RANDOM % ${#array[@]}]}"
Output (e.g.):
learning scripts
i am a student
See: help mapfile
This seems to be the best solution for large input files:
awk -v seed="$RANDOM" -v max="$(wc -l < file)" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file
as it uses standard UNIX tools, it's not restricted to files that are 32,769 lines long or less, it doesn't have any bias towards either end of the input, it'll produce different output even if called twice in 1 second, and it exits immediately after the target line is printed rather than continuing to the end of the input.
Update:
Having said the above, I have no explanation for why a script that calls rand() once per line and reads every line of input is about twice as fast as a script that calls rand() once and exits at the first matching line:
$ seq 100000 > file
$ time for i in $(seq 500); do
awk -v seed="$RANDOM" -v max="$(wc -l < file)" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file;
done > o3
real 1m0.712s
user 0m8.062s
sys 0m9.340s
$ time for i in $(seq 500); do
awk -v seed="$RANDOM" 'BEGIN{srand(seed)} rand() < 1/NR{ out=$0 } END { print out}' file;
done > o4
real 0m29.950s
user 0m9.918s
sys 0m2.501s
They both produced very similar types of output:
$ awk '{a[$0]++} END { for (i in a) print i, a[i]}' o3 | awk '{sum+=$2; max=(NR>1&&max>$2?max:$2); min=(NR>1&&min<$2?min:$2)} END{print NR, sum, min, max}'
498 500 1 2
$ awk '{a[$0]++} END { for (i in a) print i, a[i]}' o4 | awk '{sum+=$2; max=(NR>1&&max>$2?max:$2); min=(NR>1&&min<$2?min:$2)} END{print NR, sum, min, max}'
490 500 1 3
Final Update:
Turns out it was calling wc that (unexpectedly to me at least!) was taking most of the time. Here's the improvement when we take it out of the loop:
$ time { max=$(wc -l < file); for i in $(seq 500); do awk -v seed="$RANDOM" -v max="$max" 'BEGIN{srand(seed); n=int(rand()*max)+1} NR==n{print; exit}' file; done } > o3
real 0m24.556s
user 0m5.044s
sys 0m1.565s
so the solution where we call wc up front and rand() once is faster than calling rand() for every line as expected.
On the bash shell, first initialize the seed to the line count cubed, or a value of your choice:
$ i=;while read a; do let i++;done<<<$var; let RANDOM=i*i*i
$ let l=$RANDOM%$i+1 ;echo -e $var |sed -En "$l p"
If you move your data to varfile:
$ echo -e $var >varfile
$ i=;while read a; do let i++;done<varfile; let RANDOM=i*i*i
$ let l=$RANDOM%$i+1 ;sed -En "$l p" varfile
Put the last line inside a loop if you need repeated picks, e.g. for((c=0;c<9;c++)) { ;}
Using GNU sed and bash; no wc or awk:
f=input-file
sed -n $((RANDOM%($(sed = $f | sed '2~2d' | sed -n '$p')) + 1))p $f
Note: The three seds in $(...) are an inefficient way to fake wc -l < $f. Maybe there's a better way -- using only sed of course.
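One possibly simpler sed-only way to get the count: the = command prints the current input line number, so with -n and the $ address it prints just the total. A sketch using that instead of the three nested seds:
f=input-file
sed -n $((RANDOM % $(sed -n '$=' "$f") + 1))p "$f"
The same $RANDOM caveat applies (files of at most 32768 lines).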
Using shuf:
$ echo "$var" | shuf -n 1
Output:
Hi

Replace hex char by a different random one (awk possible?)

I have a MAC address and I need to replace only one hex char (at a very specific position) with a different random one (it must be different from the original). I have it done this way using xxd and it works:
#!/bin/bash
mac="00:00:00:00:00:00" #This is a PoC mac address obviously :)
different_mac_digit=$(xxd -p -u -l 100 < /dev/urandom | sed "s/${mac:10:1}//g" | head -c 1)
changed_mac=${mac::10}${different_mac_digit}${mac:11:6}
echo "${changed_mac}" #This echo stuff like 00:00:00:0F:00:00
The problem for my script is that using xxd means another dependency... I want to avoid it (not all Linux distributions include it by default). I have another workaround for this using the hexdump command, but with it I'm in the same situation... My script already has a mandatory awk dependency, so... can this be done using awk? I need an awk master here :) Thanks.
Something like this may work with seed value from $RANDOM:
mac="00:00:00:00:00:00"
awk -v seed=$RANDOM 'BEGIN{ FS=OFS=":"; srand(seed) } {
    s="0"
    while ((s = sprintf("%x", rand() * 16)) == substr($4, 2, 1));
    $4 = substr($4, 1, 1) s
} 1' <<< "$mac"
00:00:00:03:00:00
Inside the while loop we keep generating until the hex digit is not equal to substr($4, 2, 1), which is the 2nd char of the 4th column; the assignment after the loop then writes the new digit into $4.
You don't need xxd or hexdump. /dev/urandom also generates bytes that match the encodings of the digits and letters used to represent hexadecimal numbers, therefore you can just use
old="${mac:10:1}"
different_mac_digit=$(tr -dc 0-9A-F < /dev/urandom | tr -d "$old" | head -c1)
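Dropped back into the OP's script, the whole thing might look like this (a sketch reusing the OP's variable names; the echoed value is just one possible result):
mac="00:00:00:00:00:00"
old="${mac:10:1}"
different_mac_digit=$(tr -dc 0-9A-F < /dev/urandom | tr -d "$old" | head -c1)
changed_mac=${mac::10}${different_mac_digit}${mac:11}
echo "${changed_mac}"    # e.g. 00:00:00:0F:00:00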
Of course, you can replace your whole script with an awk script too. The following GNU awk script will replace the 11th symbol of each line with a random hexadecimal symbol different from the old one. With <<< macaddress we can feed macaddress to its stdin without having to use echo or something like that.
awk 'BEGIN { srand(); pos=11 } {
old=strtonum("0x" substr($0,pos,1))
new=(old + 1 + int(rand()*15)) % 16
print substr($0,1,pos-1) sprintf("%X",new) substr($0,pos+1)
}' <<< 00:00:00:00:00:00
The trick here is to add a random number between 1 and 15 (both inclusive) to the digit to be modified. If we end up with a number greater than 15 we wrap around using the modulo operator % (16 becomes 0, 17 becomes 1, and so on). That way the resulting digit is guaranteed to be different from the old one.
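A quick way to convince yourself of that wrap-around argument is to enumerate the worst case, old digit 0xF: the candidates (15+1) through (15+15), taken mod 16, cover 0-E and never F. An illustrative check:
$ awk 'BEGIN { old=15; for (k=1; k<=15; k++) printf "%X ", (old+k)%16; print "" }'
0 1 2 3 4 5 6 7 8 9 A B C D E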
However, the same approach would be shorter if written completely in bash.
mac="00:00:00:00:00:00"
old="${mac:10:1}"
(( new=(16#"$old" + 1 + RANDOM % 15) % 16 ))
printf %s%X%s\\n "${mac::10}" "$new" "${mac:11}"
"One-liner" version:
mac=00:00:00:00:00:00
printf %s%X%s\\n "${mac::10}" "$(((16#${mac:10:1}+1+RANDOM%15)%16))" "${mac:11}"
bash has printf builtin and a random function (if you trust it):
different_mac_digit() {
new=$1
while [[ $new = $1 ]]; do
new=$( printf "%X" $(( RANDOM%16 )) )
done
echo $new
}
Invoke with the character to be replaced as argument.
Another awk:
$ awk -v n=11 -v s=$RANDOM ' # set n to char # you want to replace
BEGIN { FS=OFS="" }{ # each char is a field
srand(s)
while((r=sprintf("%x",rand()*16))==$n);
$n=r
}1' <<< $mac
Output:
00:00:00:07:00:00
or oneliner:
$ awk -v n=11 -v s=$RANDOM 'BEGIN{FS=OFS=""}{srand(s);while((r=sprintf("%x",rand()*16))==$n);$n=r}1' <<< $mac
$ mac="00:00:00:00:00:00"
$ awk -v m="$mac" -v p=11 'BEGIN{srand(); printf "%s%X%s\n", substr(m,1,p-1), int(rand()*16), substr(m,p+1)}'
00:00:00:01:00:00
$ awk -v m="$mac" -v p=11 'BEGIN{srand(); printf "%s%X%s\n", substr(m,1,p-1), int(rand()*16), substr(m,p+1)}'
00:00:00:0D:00:00
And to ensure you get a different digit than you started with:
$ awk -v mac="$mac" -v pos=11 'BEGIN {
    srand()
    new = old = toupper(substr(mac,pos,1))
    while (new==old) {
        new = sprintf("%X", int(rand()*16))
    }
    print substr(mac,1,pos-1) new substr(mac,pos+1)
}'
00:00:00:0D:00:00

Replace each } with a }\n in a huge (12GB) file which consists of 1 line?

I have a log file (from a customer). 18 Gigs. All contents of the file are on 1 line.
I want to read the file in logstash, but I get problems because of memory. The file is read line by line, but unfortunately it is all on 1 line.
I tried to split the file into lines so that logstash can process it (the file has a simple JSON format, no nested objects). I wanted to have each JSON object on its own line, splitting at } by replacing it with }\n:
sed -i 's/}/}\n/g' NonPROD.log.backup
But sed gets killed - I assume also because of memory. How can I resolve this? Can I let sed manipulate the file using chunks of data other than lines? I know that by default sed reads line by line.
The following uses only functionality built into the shell:
#!/bin/bash
# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
# ...and print that content followed by '}' and a newline.
printf '%s}\n' "$piece"
done
# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"
If you have logstash configured to read from a TCP port (using 14321 as an arbitrary example below), you can run the script (saved as, say, thescript) with thescript <NonPROD.log.backup >"/dev/tcp/127.0.0.1/14321" or similar, and there you are -- without needing to have double your original input file's space available on disk, as the other answers given thus far require.
With GNU awk for RT:
$ printf 'abc}def}ghi\n' | awk -v RS='}' '{ORS=(RT?"}\n":"")}1'
abc}
def}
ghi
with other awks:
$ printf 'abc}def}ghi\n' | awk -v RS='}' -v ORS='}\n' 'NR>1{print p} {p=$0} END{printf "%s",p}'
abc}
def}
ghi
I decided to test all of the currently posted solutions for functionality and execution time using an input file generated by this command:
awk 'BEGIN{for(i=1;i<=1000000;i++)printf "foo}"; print "foo"}' > file1m
and here's what I got:
1) awk (both awk scripts above had similar results):
time awk -v RS='}' '{ORS=(RT?"}\n":"")}1' file1m
Got expected output, timing =
real 0m0.608s
user 0m0.561s
sys 0m0.045s
2) shell loop:
$ cat tst.sh
#!/bin/bash
# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
# ...and print that content followed by '}' and a newline.
printf '%s}\n' "$piece"
done
# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"
$ time ./tst.sh < file1m
Got expected output, timing =
real 1m52.152s
user 1m18.233s
sys 0m32.604s
3) tr+sed:
$ time tr '}' '\n' < file1m | sed 's/$/}/'
Did not produce the expected output (Added an undesirable } at the end of the file), timing =
real 0m0.577s
user 0m0.468s
sys 0m0.078s
With a tweak to remove that final undesirable }:
$ time tr '}' '\n' < file1m | sed 's/$/}/; $s/}//'
real 0m0.718s
user 0m0.670s
sys 0m0.108s
4) fold+sed+tr:
$ time fold -w 1000 file1m | sed 's/}/}\n\n/g' | tr -s '\n'
Got expected output, timing =
real 0m0.811s
user 0m1.137s
sys 0m0.076s
5) split+sed+cat:
$ cat tst2.sh
mkdir tmp$$
pwd="$(pwd)"
cd "tmp$$"
split -b 1m "${pwd}/${1}"
sed -i 's/}/}\n/g' x*
cat x*
rm -f x*
cd "$pwd"
rmdir tmp$$
$ time ./tst2.sh file1m
Got expected output, timing =
real 0m0.983s
user 0m0.685s
sys 0m0.167s
You can run it through tr, then put the end bracket back on at the end of each line:
$ cat NonPROD.log.backup | tr '}' '\n' | sed 's/$/}/' > tmp$$
$ wc -l NonPROD.log.backup tmp$$
0 NonPROD.log.backup
43 tmp10528
43 total
(My test file only had 43 brackets.)
You could:
Split the file to say 1M chunks using split -b 1m file.log
Process all the files sed 's/}/}\n/g' x*
... and redirect the output of sed to combine them back to a single piece
The drawback of this is the doubled storage space.
another alternative with fold
$ fold -w 1000 long_line_file | sed 's/}/}\n\n/g' | tr -s '\n'

Sort a text file by line length including spaces

I have a CSV file that looks like this
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
I need to sort it by line length including spaces. The following command doesn't
include spaces; is there a way to modify it so it will work for me?
cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'
Answer
cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-
Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:
cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-
In both cases, we have solved your stated problem by moving away from awk for your final cut.
Lines of matching length - what to do in the case of a tie:
The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.
(Those who want more control of sorting these ties might look at sort's --key option.)
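For example, a sketch that breaks length ties alphabetically instead of preserving input order, using sort's key syntax (testfile as above):
awk '{ print length, $0 }' testfile | sort -k1,1n -k2 | cut -d" " -f2-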
Why the question's attempted solution fails (awk line-rebuilding):
It is interesting to note the difference between:
echo "hello   awk   world" | awk '{print}'
echo "hello   awk   world" | awk '{$1="hello"; print}'
They yield respectively
hello   awk   world
hello awk world
The relevant section of (gawk's) manual only mentions as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc) when you change one field. I guess it's not crazy behaviour. It has this:
"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"
$1 = $1 # force record to be reconstituted
print $0 # or whatever else with $0
"This forces awk to rebuild the record."
Test input including some lines of equal length:
aa A line with MORE spaces
bb The very longest line in the file
ccb
9 dd equal len. Orig pos = 1
500 dd equal len. Orig pos = 2
ccz
cca
ee A line with some spaces
1 dd equal len. Orig pos = 3
ff
5 dd equal len. Orig pos = 4
g
The AWK solution from neillb is great if you really want to use awk and it explains why it's a hassle there, but if what you want is to get the job done quickly and don't care what you do it in, one solution is to use Perl's sort() function with a custom comparison routine to iterate over the input lines. Here is a one-liner:
perl -e 'print sort { length($a) <=> length($b) } <>'
You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect) or just give the filename to perl as another argument and let it open the file.
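For instance, the filename form (using the testfile name from the answer above):
perl -e 'print sort { length($a) <=> length($b) } <>' testfile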
In my case I needed the longest lines first, so I swapped out $a and $b in the comparison.
Benchmark results
Below are the results of a benchmark across solutions from other answers to this question.
Test method
10 sequential runs on a fast machine, averaged
Perl 5.24
awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)
Results
Caleb's perl solution took 11.2 seconds
my perl solution took 11.6 seconds
neillb's awk solution #1 took 20 seconds
neillb's awk solution #2 took 23 seconds
anubhava's awk solution took 24 seconds
Jonathan's awk solution took 25 seconds
Fritz's bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.
Another perl solution
perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
Try this command instead:
awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-
Pure Bash:
declare -a sorted
while read line; do
if [ -z "${sorted[${#line}]}" ] ; then # does line length already exist?
sorted[${#line}]="$line" # element for new length
else
sorted[${#line}]="${sorted[${#line}]}\n$line" # append to lines with equal length
fi
done < data.csv
for key in ${!sorted[*]}; do # iterate over existing indices
echo -e "${sorted[$key]}" # echo lines with equal length
done
Python Solution
Here's a Python one-liner that does the same, tested with Python 3.9.10 and 2.7.18. It's about 60% faster than Caleb's perl solution, and the output is identical (tested with a 300MiB wordlist file with 14.8 million lines).
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
Benchmark:
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
real 0m5.308s
user 0m3.733s
sys 0m1.490s
perl -e 'print sort { length($a) <=> length($b) } <>'
real 0m8.840s
user 0m7.117s
sys 0m2.279s
The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding UUOC).
awk '{ printf "%d:%s\n", length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]*://'
The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:
awk '{ print length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]* //'
I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):
awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-
With POSIX Awk:
{
    c = length
    m[c] = m[c] ? m[c] RS $0 : $0
}
END {
    for (c in m) print m[c]
}
Example
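A hedged sample run: the grouping itself is POSIX, but plain awk leaves the for (c in m) traversal order unspecified, so forcing ascending length order here relies on gawk's PROCINFO["sorted_in"]:
$ printf 'ccc\na\nbb\ndd\n' |
  gawk '{ c = length; m[c] = m[c] ? m[c] RS $0 : $0 }
        END { PROCINFO["sorted_in"] = "@ind_num_asc"; for (c in m) print m[c] }'
a
bb
dd
ccc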
1) Pure awk solution. Let's suppose that line length cannot be more than 1024; then
cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'
2) One-liner bash solution assuming all lines have just 1 word, but it can be reworked for any case where all lines have the same number of words:
LINES=$(cat filename); for k in $LINES; do printf "$k "; echo $k | wc -L; done | sort -k2 | head -n 1 | cut -d " " -f1
using Raku (formerly known as Perl6)
~$ cat "BinaryAve.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};'
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
To reverse the sort, add .reverse in the middle of the chain of method calls--immediately after .sort(). Here's code showing that .chars includes spaces:
~$ cat "number_triangle.txt" | raku -e 'given lines() {.map(*.chars).say};'
(1 3 5 7 9 11 13 15 17 19 0)
~$ cat "number_triangle.txt"
1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6 7
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 0
Here's a time comparison between awk and Raku using a 9.1MB txt file from Genbank:
~$ time cat "rat_whole_genome.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};' > /dev/null
real 0m1.308s
user 0m1.213s
sys 0m0.173s
~$ #awk code from neillb
~$ time cat "rat_whole_genome.txt" | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2- > /dev/null
real 0m1.189s
user 0m1.170s
sys 0m0.050s
HTH.
https://raku.org
Here is a multibyte-compatible method of sorting lines by length. It requires:
wc -m is available to you (macOS has it).
Your current locale supports multi-byte characters, e.g., by setting LC_ALL=UTF-8. You can set this either in your .bash_profile, or simply by prepending it before the following command.
testfile has a character encoding matching your locale (e.g., UTF-8).
Here's the full command:
cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-
Explaining part-by-part:
l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes of a copy of each line in awk variable l and double-escapes every ' so the line can safely be echoed as a shell command (\047 is a single-quote in octal notation).
cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
close(cmd); ← close the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
sub(/ */, "", c); ← trims white space from the character count value returned by wc.
{ print c, $0 } ← prints the line's character count value, a space, and the original line.
| sort -ns ← sorts the lines (by prepended character count values) numerically (-n), and maintaining stable sort order (-s).
| cut -d" " -f2- ← removes the prepended character count values.
It's slow (only 160 lines per second on a fast Macbook Pro) because it must execute a sub-command for each line.
Alternatively, just do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).
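A minimal sketch of that gawk-only route (assuming a multibyte-aware gawk and a UTF-8 locale, e.g. en_US.UTF-8, so that length() counts characters rather than bytes):
LC_ALL=en_US.UTF-8 gawk '{ print length($0), $0 }' testfile | sort -ns | cut -d" " -f2-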
Revisiting this one. This is how I approached it (count length of LINE and store it as LEN, sort by LEN, keep only the LINE):
cat test.csv | while read -r LINE; do LEN=$(echo "${LINE}" | wc -c); echo "${LEN} ${LINE}"; done | sort -k 1,1n | cut -d ' ' -f 2-
