Sorting groups of lines - bash

Say I have this list:
sharpest
    tool
    in
    the
    shed
im
    not
    the
How can I order alphabetically by the non-indented lines and preserve groups of lines? The above should become:
im
    not
    the
sharpest
    tool
    in
    the
    shed
Similar questions exist here and here but I can't seem to make them work for my example.
Hopeful ideas so far
Maybe I could use grep -n somehow, as it gives me the line numbers? I was thinking to first get the line numbers, then order. I guess I'd somehow need to calculate a line range before ordering, and then from there fetch the range of lines somehow. Can't even think how to do this however!
sed ranges look promising too, but same deal; sed 1,2p and further examples here.
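For what it's worth, the grep -n idea from the question can be made to work. Below is a rough sketch (the file name ip.txt and the indented group lines are assumptions): grep -n lists the starting line of every group, sort orders the groups by key, and sed prints each line range. It is slower than the awk answers but uses only grep, sort, sed, and a tiny awk.

```shell
# Create the sample input (group lines indented under their key)
cat > ip.txt <<'EOF'
sharpest
    tool
    in
    the
    shed
im
    not
    the
EOF

# 1) grep -n lists the line number of every non-indented key, e.g. "1:sharpest"
# 2) sort -t: -k2,2 orders the keys alphabetically, keeping their line numbers
# 3) each group ends one line before the next key (or at end of file),
#    so sed -n can print the whole range
grep -n '^[^[:blank:]]' ip.txt |
sort -t: -k2,2 |
while IFS=: read -r start key; do
    end=$(grep -n '^[^[:blank:]]' ip.txt | awk -F: -v s="$start" '$1 > s {print $1 - 1; exit}')
    [ -n "$end" ] || end='$'   # last group runs to the end of the file
    sed -n "${start},${end}p" ip.txt
done
```

The inner grep/awk pass finds where the next group starts; for the last group there is no next key, so the sed range runs to `$` (end of file).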

If perl is okay:
$ perl -0777 -ne 'print sort split /\n\K(?=\S)/' ip.txt
im
    not
    the
sharpest
    tool
    in
    the
    shed
-0777 slurps the entire file, so this solution is not suitable if the input is too big
split /\n\K(?=\S)/ builds an array by splitting wherever a newline is followed by a non-whitespace character
sort then sorts that array

You can use this asort function in a single gnu awk command:
awk '{if (/^[^[:blank:]]/) {k=$1; keys[++i]=k} else arr[k] = arr[k] $0 RS}
END{n=asort(keys); for (i=1; i<=n; i++) printf "%s\n%s", keys[i], arr[keys[i]]}' file
im
    not
    the
sharpest
    tool
    in
    the
    shed
Alternative solution using awk + sort:
awk 'FNR==NR{if (/^[^[:blank:]]/) k=$1; else arr[k] = arr[k] $0 RS; next}
{printf "%s\n%s", $1, arr[$1]}' file <(grep '^[^[:blank:]]' file | sort)
im
    not
    the
sharpest
    tool
    in
    the
    shed
Edit: POSIX compliance (the sorted key list is fed to awk as its second input, replacing the process substitution):
#!/bin/sh
grep '^[^[:blank:]]' file | sort |
awk 'FNR==NR{if (/^[^[:blank:]]/) k=$1; else arr[k] = arr[k] $0 RS; next}
{printf "%s\n%s", $1, arr[$1]}' file -

With single GNU awk command:
awk 'BEGIN{ PROCINFO["sorted_in"] = "#ind_str_asc" }
/^[^[:space:]]+/{ k = $1; a[k]; next }
{ a[k] = (a[k]? a[k] ORS : "")$0 }
END{ for(i in a) print i ORS a[i] }' file
The output:
im
    not
    the
sharpest
    tool
    in
    the
    shed

awk one-liner
$ awk '/^\w/{k=$1; a[k]=k; next} {a[k]=a[k] RS $0} END{ n=asorti(a,b); for(i=1; i<=n; i++) print a[b[i]] }' file
im
    not
    the
sharpest
    tool
    in
    the
    shed

Related

combining numbers from multiple text files using bash

I'm struggling to combine some data from the txt files generated by my Jenkins job.
Each file contains a single line; this is what each file looks like:
testsuite name="mytest" cars="201" users="0" bus="0" bike="0" time="116.103016"
What I've managed to do so far is extract the numbers from each txt file:
awk '/<testsuite name=/{print $3, $4, $5, $6}' my-output*.txt
The results are:
cars="193" users="2" bus="0" bike="0"
cars="23" users="2" bus="10" bike="7"
cars="124" users="2" bus="5" bike="0"
cars="124" users="2" bus="0" bike="123"
Now I have an arbitrary number of files like these:
my-output1.txt
my-output2.txt
my-output7.txt
my-output*.txt
I would like a single command, like the one above, that sums the values across all the files to produce the following result:
cars=544 users=32 bus=12 bike=44
Is there a way to do that with a single command line?
Using awk
$ cat script.awk
BEGIN {
FS="[= ]"
} {
gsub(/"/,"")
for (i=1;i<NF;i++)
if ($i=="cars") cars+=$(i+1)
else if($i=="users") users+=$(i+1);
else if($i=="bus") bus+=$(i+1);
else if ($i=="bike")bike+=$(i+1)
} END {
print "cars="cars,"users="users,"bus="bus,"bike="bike
}
To run the script, you can use:
$ awk -f script.awk my-output*.txt
Or, as an ugly one-liner:
$ awk -F"[= ]" '{gsub(/"/,"");for (i=1;i<NF;i++) if ($i=="cars") cars+=$(i+1); else if($i=="users") users+=$(i+1); else if($i=="bus") bus+=$(i+1); else if ($i=="bike")bike+=$(i+1)}END{print"cars="cars,"users="users,"bus="bus,"bike="bike}' my-output*.txt
1st solution: With your shown samples, please try the following awk code, which uses the match function. Since awk can read multiple files within a single program, and your files are in .txt format, you can pass the .txt glob to the awk program itself.
Written and tested in GNU awk, using the match function's capturing-group capability to create/store values into an array to be used later in the program.
awk -v s1="\"" '
match($0,/[[:space:]]+(cars)="([^"]*)" (users)="([^"]*)" (bus)="([^"]*)" (bike)="([^"]*)"/,tempArr){
temp=""
for(i=2;i<=8;i+=2){
temp=tempArr[i-1]
values[i]+=tempArr[i]
indexes[i-1]=temp
}
}
END{
for(i in values){
val=(val?val OFS:"") (indexes[i-1]"=" s1 values[i] s1)
}
print val
}
' *.txt
Explanation:
At the start of the GNU awk program, a variable named s1 is set to " for later use in the program.
The match function is used in the main body of the awk program.
The regex [[:space:]]+(cars)="([^"]*)" (users)="([^"]*)" (bus)="([^"]*)" (bike)="([^"]*)" (explained at the end of this post) creates 8 capture groups for later use.
Once the condition matches, a for loop runs over only the even indices, to pick up the required values.
The values array, indexed by i, accumulates the tempArr values, where tempArr is the array created by the match function.
Similarly, the indexes array stores only the key names.
In the END block, the program traverses the values array and prints entries from the indexes and values arrays as required.
Explanation of regex:
[[:space:]]+ ##Matching spaces 1 or more occurrences here.
(cars)="([^"]*)" ##Matching cars=" till next occurrence of " here.
(users)="([^"]*)" ##Matching spaces followed by users=" till next occurrence of " here.
(bus)="([^"]*)" ##Matching spaces followed by bus=" till next occurrence of " here.
(bike)="([^"]*)" ##Matching spaces followed by bike=" till next occurrence of " here.
2nd solution: GNU awk only, using the power of the RT and RS variables. This also makes sure the values appear in the output in the same order in which they appeared in the input.
awk -v s1="\"" -v RS='[[:space:]][^=]*="[^"]*"' '
RT{
gsub(/^ +|"/,"",RT)
num=split(RT,arr,"=")
if(arr[1]!="time" && arr[1]!="name"){
if(!(arr[1] in values)){
indexes[++count]=arr[1]
}
values[arr[1]]+=arr[2]
}
}
END{
for(i=1;i<=count;i++){
val=(val?val OFS:"") (indexes[i]"=" s1 values[indexes[i]] s1)
}
print val
}
' *.txt
You may use this awk solution:
awk '{
for (i=1; i<=NF; ++i)
if (split($i, a, /=/) == 2) {
gsub(/"/, "", a[2])
sums[a[1]] +=a[2]
}
}
END {
for (i in sums) print i "=" sums[i]
}' file*
bus=15
cars=464
users=8
bike=130
I found a way to do it, though it's a bit long:
awk '/<testsuite name=/{print $3, $4, $5, $6}' my-output*.xml | sed -e 's/[^0-9]/ /g' -e 's/^ *//g' -e 's/ *$//g' | tr -s ' ' | awk '{cars+=$1; users+=$2; bus+=$3; bike+=$4} END{print "cars=" cars " users=" users " bus=" bus " bike=" bike}'
M. Nejat Aydin's answer was a good fit:
awk -F '[ "=]+' '/testsuite name=/{ cars+=$5; users+=$7; buses+=$9; bikes+=$11 } END{ print "cars="cars, "users="users, "buses="buses, "bikes="bikes }' my-output*.xml

awk FS vs FPAT puzzle and counting words but not blank fields

Suppose I have the file:
$ cat file
This, that;
this-that or this.
(Punctuation at the line end is not always there...)
Now I want to count words (with words being defined as one or more ascii case-insensitive letters.) In typical POSIX *nix you could do:
sed -nE 's/[^[:alpha:]]+/ /g; s/ $//p' file | tr ' ' "\n" | tr '[:upper:]' '[:lower:]' | sort | uniq -c
1 or
2 that
3 this
With grep you can shorten that a bit to only match what you define as a word:
grep -oE '[[:alpha:]]+' file | tr '[:upper:]' '[:lower:]' | sort | uniq -c
# same output
With GNU awk, you can use FPAT to replicate matching only what you want (ignore sorting...):
gawk -v FPAT="[[:alpha:]]+" '
{for (i=1;i<=NF;i++) {seen[tolower($i)]++}}
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
3 this
1 or
2 that
Now trying to replicate in POSIX awk I tried:
awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
2
3 this
1 or
2 that
Note the 2 with a blank at the top. It comes from the blank fields produced by the ; at the end of line 1 and the . at the end of line 2. If you delete the punctuation at the ends of the lines, the issue goes away.
You can partially fix it (for all but the last line) by setting RS="" in the awk, but still get a blank field with the last (only) line.
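To see the partial RS="" fix concretely, here is a quick check (a sketch, using the two-line sample from above): in paragraph mode the whole file becomes one record, so the trailing punctuation yields a single empty field instead of one per line.

```shell
# Same counting loop as above, but with RS="" (paragraph mode).
# The ; and . now leave only one empty field at the end of the single
# record, so seen[""] ends up as 1 (instead of 2 without RS="").
printf 'This, that;\nthis-that or this.\n' > file
awk 'BEGIN{RS=""; FS="[^[:alpha:]]+"}
     { for (i=1;i<=NF;i++) seen[tolower($i)]++ }
     END {print "empty fields counted: " seen[""]}' file
```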
I can also fix it this way:
awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) if ($i) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
Which seems a little less than straightforward.
Is there an idiomatic fix I am missing to make POSIX awk act similarly to GNU awk's FPAT solution here?
This should work in POSIX/BSD or any version of awk:
awk -F '[^[:alpha:]]+' '
{for (i=1; i<=NF; ++i) ($i != "") && ++count[tolower($i)]}
END {for (e in count) printf "%4s %s\n", count[e], e}' file
1 or
3 this
2 that
By using -F '[^[:alpha:]]+' we are splitting fields on any non-alpha character.
The ($i != "") condition makes sure that only non-empty fields are counted in count.
With POSIX awk, I'd use match and the builtin RSTART and RLENGTH variables:
# countwords.awk
{
s = $0
while (match(s, /[[:alpha:]]+/)) {
word = substr(s, RSTART, RLENGTH)
count[tolower(word)]++
s = substr(s, RSTART+RLENGTH)
}
}
END {
for (word in count) print count[word], word
}
$ awk -f countwords.awk file
1 or
3 this
2 that
Works with the default BSD awk on my Mac.
With your shown samples, please try the following awk code. Written and tested in GNU awk; this applies if you are OK with an RS-based approach.
awk -v RS='[[:alpha:]]+' '
RT{
val[tolower(RT)]++
}
END{
for(word in val){
print val[word], word
}
}
' Input_file
Explanation: Simply put, this uses awk's RS variable to set the record separator to [[:alpha:]]+. The main program then creates an array val whose indexes are the values of the RT variable, counting the occurrences of each index. The END block traverses the array and prints each index with its count.
Using RS instead:
$ gawk -v RS="[^[:alpha:]]+" ' # [^a-zA-Z] or something for some awks
$0 { # remove possible leading null string
a[tolower($0)]++
}
END {
for(i in a)
print i,a[i]
}' file
Output:
this 3
or 1
that 2
Tested successfully on gawk and Mac awk (version 20200816) and on mawk and busybox awk using [^a-zA-Z]
With GNU awk using patsplit() and a second array for counting, you can try this:
awk 'patsplit($0, a, /[[:alpha:]]+/) {for (i in a) b[ tolower(a[i]) ]++} END {for (j in b) print b[j], j}' file
3 this
1 or
2 that

How to sort ROW in a line in BASH

Most sorting available in bash or Linux terminal commands is about sorting a field (column). I couldn't figure out how to sort a row of three numbers, e.g. "1, 3, 2". I want it ordered left to right from small to large, like "1,2,3", or vice versa.
So the input would be like line="5, 3, 10". After sorting, the output will be sorted_line="3,5,10".
Any tips? Thanks.
Note that asort works in gawk, not in general awk. So here is another solution for a file, a.txt:
gawk -F, '{split($0, w); s=""; for(i=1; i<=asort(w); i++) s=s w[i] ","; print s }' a.txt | sed 's/,$//'
The sample file a.txt is:
1,5,7,2
8,1,3,4
9,7,8,2
The result:
1,2,5,7
1,3,4,8
2,7,8,9
This is one way:
echo "6 5,4,9 1,3 2,10,7 8" | awk '{ split($0,arr,"(,| )") ; asort(arr); exit; } END{ for ( i=1; i <= length(arr) ; i++ ) { print arr[i]} }'
I am using a regex as the delimiter, so the input can be comma- or space-separated.
Hope it helps!
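If gawk's asort isn't available, a single row can also be sorted with only standard tools (tr, sort, paste); a small sketch, where the variable names are just illustrative:

```shell
# Sort one comma-separated row numerically without gawk:
# split on commas, sort numerically, then join the lines back with commas.
line="5, 3, 10"
sorted_line=$(printf '%s\n' "$line" | tr -d ' ' | tr ',' '\n' | sort -n | paste -s -d ',' -)
echo "$sorted_line"   # 3,5,10
```

paste -s -d ',' - joins the sorted lines back into one row without leaving a trailing separator.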

sort string with delimiter as string in unix

I have some data in the following format::
Info-programNumber!/TvSource/11100001_233a_32c0/13130^Info-channelName!5 USA^Info-Duration!1575190^Info-programName!CSI: ab cd
Delimiter = Info-
I tried to sort the string based on the delimiter in ascending order. But none of my solutions are working.
Expected Result:
Info-channelName!5 USA^Info-Duration!1575190^Info-programName!CSI: ab cd^Info-programNumber!/TvSource/11100001_233a_32c0/13130
Is there any command that will allow me to do this or do i need to write an awk script to iterate over the string and sort it?
Temporarily split the info into multiple lines so you can sort:
tr '^' '\n' < file | sort | paste -s -d '^' -
(paste rejoins the sorted lines with ^ without leaving a trailing separator, which a second tr '\n' '^' would.)
Note: if you have multiple entries, you have to write a loop that processes the input line by line. With huge datasets this is probably too slow, in which case pick a programming language; but you were asking about the shell...
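The per-line loop mentioned above could look like this (a sketch; the file name data.txt and its contents are made up for illustration):

```shell
# Sort the ^-separated fields of every line independently:
# read each line, split on ^, sort, and join the pieces back with ^.
printf 'Info-b!2^Info-a!1\nInfo-d!4^Info-c!3\n' > data.txt   # sample input (made up)
while IFS= read -r line; do
    printf '%s\n' "$line" | tr '^' '\n' | sort | paste -s -d '^' -
done < data.txt
```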
Can be done in awk itself:
awk -F "^" '{OFS="^"; for (i=1; i<=NF; i++) a[i]=$i}
END {n=asort(a, b); for(i=1; i<=n; i++) printf("%s%s", b[i], FS); print ""}' file

Convert tallies to relative probabilities

Background
Create a probability lexicon based on a CSV file of words and tallies. This is a prelude to a text segmentation problem, not a homework problem.
Problem
Given a CSV file with the following words and tallies:
aardvark,10
aardwolf,9
armadillo,9
platypus,5
zebra,1
Create a file with probabilities relative to the largest tally in the file:
aardvark,1
aardwolf,0.9
armadillo,0.9
platypus,0.5
zebra,0.1
Where, for example, aardvark,1 is calculated as aardvark,10/10 and platypus,0.5 is calculated as platypus,5/10.
Question
What is the most efficient way to implement a shell script to create the file of relative probabilities?
Constraints
Neither the words nor the numbers are in any order.
No major programming language (such as Perl, Ruby, Python, Java, C, Fortran, or Cobol).
Standard Unix tools such as awk, sed, or sort are welcome.
All probabilities must be relative to the highest probability in the file.
The words are unique, the numbers are not.
The tallies are natural numbers.
Thank you!
awk 'BEGIN{max=0;OFS=FS=","} $NF>max{max=$NF}NR>FNR {print $1,($2/max) }' file file
No need to read the file twice:
awk 'BEGIN {OFS = FS = ","} {a[$1] = $2} $2 > max {max=$2} END {for (w in a) print w, a[w]/max}' inputfile
If you need the output sorted by word:
awk ... | sort
or
awk 'BEGIN {OFS = FS = ","} {a[$1] = $2; ind[j++] = $1} $2 > max {max=$2} END {n = asort(ind); for (i=1; i<=n; i++) print ind[i], a[ind[i]]/max}' inputfile
If you need the output sorted by probability:
awk ... | sort -t, -k2,2n -k1,1
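Putting the single-pass awk together with the probability sort, a quick end-to-end check on the sample data (the file name inputfile is as above):

```shell
# Build the sample tally file, compute probabilities relative to the
# largest tally in one pass, then sort by probability (word breaks ties).
printf 'aardvark,10\naardwolf,9\narmadillo,9\nplatypus,5\nzebra,1\n' > inputfile
awk 'BEGIN {OFS = FS = ","} {a[$1] = $2} $2 > max {max = $2}
     END {for (w in a) print w, a[w]/max}' inputfile |
sort -t, -k2,2n -k1,1
```

This prints zebra,0.1 first and aardvark,1 last.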
This is not error-proof, but something like this should work:
#!/bin/bash
INPUT=data.csv
OUTPUT=tally.csv
DIGITS=1
OLDIFS=$IFS
IFS=,
maxval=0 # Assuming all $val are positive
while read name val
do
    if (( val > maxval )); then maxval=$val; fi
done < "$INPUT"
# Start with an empty $OUTPUT
: > "$OUTPUT"
while read name val
do
    tally=$(echo "scale=$DIGITS; $val / $maxval" | bc)
    # bc prints values below 1 without a leading zero (e.g. ".5"), so add it back
    case $tally in .*) tally=0$tally ;; esac
    echo "$name,$tally" >> "$OUTPUT"
done < "$INPUT"
IFS=$OLDIFS
Borrowed from this question, and various googling.
