Sort subgroups of lines with command-line tools - bash

I've been trying to find a way of sorting this with standard command-line tools (bash, awk, sort, etc.), but can't find a way apart from using Perl or similar.
Any hint?
Input data
header1
3
2
5
1
header2
5
1
3
.....
.....
Output data
header1
1
2
3
5
header2
1
....
Thanks

Assumes sections are separated by blank lines and the header doesn't necessarily contain the string "header". Leaves the sections in the original order so the sort is stable. Reads from stdin, displays on stdout.
#!/bin/bash
function read_section() {
    while read LINE && [ "$LINE" ]; do echo "$LINE"; done
}
function sort_section() {
    read HEADER && (echo "$HEADER"; sort; echo)
}
while read_section | sort_section; do :; done
Or as a one-liner:
cat test.txt | while (while read LINE && [ "$LINE" ]; do echo "$LINE"; done) | (read HEADER && (echo "$HEADER"; sort; echo)); do :; done
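For reference, assuming the input really does have a blank line after each section and the script above is saved as sort_sections.sh (a hypothetical name), a run would look something like this; the trailing blank line after each section comes from the echo in sort_section:
$ ./sort_sections.sh < test.txt
header1
1
2
3
5

header2
1
3
5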

Try this:
mark@ubuntu:~$ cat /tmp/test.txt
header1
3
2
5
1
header2
5
1
3
mark@ubuntu:~$ cat /tmp/test.txt | awk '/header/ {colname=$1; next} {print colname, "," , $0}' | sort | awk '{if ($1 != header) {header = $1; print header} print $3}'
header1
1
2
3
5
header2
1
3
5
To get rid of the blank lines, I guess you can add a "| grep -v '^$'" at the end...
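Spelled out in full, with the filter appended (split across lines just for readability):
cat /tmp/test.txt \
  | awk '/header/ {colname=$1; next} {print colname, ",", $0}' \
  | sort \
  | awk '{if ($1 != header) {header = $1; print header} print $3}' \
  | grep -v '^$'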

Use awk to prefix the header to each number line.
Sort the resulting file.
Remove the prefix to return the file to its original format, as sketched below.
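A minimal sketch of that decorate/sort/undecorate approach, assuming the header lines match /header/ and the values are numeric:
awk '/header/ {h=$0; next} {print h "\t" $0}' input.txt \
  | sort -t$'\t' -k1,1 -k2,2n \
  | awk -F'\t' '$1 != prev {print $1; prev=$1} {print $2}'
Note that this orders the sections by header name rather than preserving their original order; the bash version above keeps them stable.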

With GNU awk you can use its internal sort functions; note that RS="" (paragraph mode) relies on blank lines separating the sections.
awk 'BEGIN{ RS="" }
{
    print $1
    for(i=2; i<=NF; i++){
        a[i]=$i
    }
    b=asort(a,d)
    for(i=1; i<=b; i++){
        print d[i]
    }
    delete d
    delete a
}' file
output
# more file
header1
3
2
5
1
header2
5
1
3
# ./test.sh
header1
1
2
3
5
header2
1
3
5

Related

column of data to be separated into a rows based on a common value and its subsequent data

I have a column of data in a test1.txt file that looks like this:
sys@hostname:/tmp/ cat -n test1.txt
1 row1.txt
2 1234
3 2331
4 2238
5 row2.txt
6 2773
7 6673
I would like to have this data converted into two rows based on a common value, maybe *.txt, leveraging grep, awk, or sed for this if possible.
The data in test1.txt will change as more data is added, so grep -A3/-B3 or grep -A2/-B2 will not work, since the values will change and the command will be automated.
The end result should display:
sys@hostname:/tmp/ cat -n text2.txt
1 row1.txt,1234,2331,2238
2 row2.txt,2773,6673
I tried a number of variations leveraging a for loop but could not manage to get it to work.
If you want an awk solution, here it is:
#!/bin/bash
cat >test1.txt <<"EnDoFiNpUt"
row1.txt
1234
2331
2238
row2.txt
2773
6673
EnDoFiNpUt
awk 'BEGIN{ first=1 ; }{
    if( $0 != "" ){
        if( $1 ~ /^row/ ){
            if( first == 1 ){
                printf("%s", $0 ) ;
            }else{
                printf("\n%s", $0 ) ;
            } ;
            first=0 ;
        }else{
            printf(",%s", $0 );
        } ;
    } ;
}END{
    print "" ;
}' test1.txt
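Run against the sample test1.txt created above, that prints:
row1.txt,1234,2331,2238
row2.txt,2773,6673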
For a sed-based solution, you have to get rid of the stray "," at the end of the last line (the \n in the replacement assumes GNU sed):
cat test1.txt | tr '\n' ',' | sed 's+,$++' | sed 's+\,row+\nrow+g'

Print first few and last few lines of file through a pipe with "..." in the middle

Problem Description
This is my file
1
2
3
4
5
6
7
8
9
10
I would like to send the cat output of this file through a pipe and receive this
% cat file | some_command
1
2
...
9
10
Attempted solutions
Here are some solutions I've tried, with their output
% cat temp | (head -n2 && echo '...' && tail -n2)
1
2
...
% cat temp | tee >(head -n3) >(tail -n3) >/dev/null
1
2
3
8
9
10
# I don't know how to get the ...
% cat temp | sed -e 1b -e '$!d'
1
10
% cat temp | awk 'NR==1;END{print}'
1
10
# Can only get 2 lines
An awk:
awk -v head=2 -v tail=2 'FNR==NR && FNR<=head
FNR==NR && cnt++==head {print "..."}
NR>FNR && FNR>(cnt-tail)' file file
Or if a single pass is important (and memory allows), you can use perl:
perl -0777 -lanE 'BEGIN{$head=2; $tail=2;}
END{say join("\n", @F[0..$head-1],("..."),@F[-$tail..-1]);}' file
Or, an awk that is one pass:
awk -v head=2 -v tail=2 'FNR<=head
{lines[FNR]=$0}
END{
print "..."
for (i=FNR-tail+1; i<=FNR; i++) print lines[i]
}' file
Or, nothing wrong with being a caveman direct like:
head -2 file; echo "..."; tail -2 file
Any of these prints:
1
2
...
9
10
In terms of efficiency, here are some stats.
For small files (i.e., less than 10 MB or so) all of these take less than 1 second, and the 'caveman' approach takes 2 ms.
I then created a 1.1 GB file with seq 99999999 >file
The two-pass awk: 50 seconds
One-pass perl: 10 seconds
One-pass awk: 29 seconds
'Caveman': 2 ms
You may consider this awk solution (file{,} is brace expansion that passes the same filename twice, giving awk its two passes):
awk -v top=2 -v bot=2 'FNR == NR {++n; next} FNR <= top || FNR > n-bot; FNR == top+1 {print "..."}' file{,}
1
2
...
9
10
Two single pass sed solutions:
sed '1,2b
3c\
...
N
$!D'
and
sed '1,2b
3c\
...
$!{h;d;}
H;g'
Assumptions:
as OP has stated, a solution must be able to work with a stream from a pipe
the total number of lines coming from the stream is unknown
if the total number of lines is less than the sum of the head/tail offsets then we'll print duplicate lines (we can add more logic if OP updates the question with more details on how to address this situation)
A single-pass awk solution that implements a queue in awk to keep track of the most recent N lines; the queue allows us to limit awk's memory usage to just N lines (as opposed to loading the entire input stream into memory, which could be problematic when processing a large volume of lines/data on a machine with limited available memory):
h=2 t=3
cat temp | awk -v head=${h} -v tail=${t} '
{ if (NR <= head) print $0
lines[NR % tail] = $0
}
END { print "..."
if (NR < tail) i=0
else i=NR
do { i=(i+1)%tail
print lines[i]
} while (i != (NR % tail) )
}'
This generates:
1
2
...
8
9
10
Demonstrating the overlap issue:
$ cat temp4
1
2
3
4
With h=3;t=3 the proposed awk code generates:
$ cat temp4 | awk -v head=${h} -v tail=${t} '...'
1
2
3
...
2
3
4
Whether or not this is the 'correct' output will depend on OP's requirements.
I suggest with bash:
(head -n 2; echo "..."; tail -n 2) < file
Output:
1
2
...
9
10

Line differences with element location in shell script

Input:
file1.txt
abc 1 2 3 4
file2.txt
abc 1 2 5 6
Expected output:
difference is
3
5
at location 3
I am able to track the differences using:
comm -3 file1.txt file2.txt | uniq -c | awk '{print $4}' | uniq
But not able to track the element location.
Could you guys please suggest the shell script to track the element location?
With perl, and Path::Class from CPAN for convenience
perl -MPath::Class -MList::Util=first -e '
@f1 = split " ", file(shift)->slurp;
@f2 = split " ", file(shift)->slurp;
$idx = first {$f1[$_] ne $f2[$_]} 0..$#f1;
printf "difference is\n%s\n%s\nat index %d\n", $f1[$idx], $f2[$idx], $idx;
' file{1,2}.txt
difference is
3
5
at index 3
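If a pure awk sketch is preferred, something along these lines should work (assuming, as in the example, each file holds a single whitespace-separated line; the index is printed 0-based to match the Perl answer):
awk 'NR==FNR {for (i=1; i<=NF; i++) a[i]=$i; next}
     {for (i=1; i<=NF; i++)
        if (a[i] != $i) {
            printf "difference is\n%s\n%s\nat location %d\n", a[i], $i, i-1
            exit
        }}' file1.txt file2.txt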

Adding column values from multiple different files

I have ~100 files and I would like to do an arithmetical operation (e.g. sum them up) on the second column of the files, such that the first-row value of one file is added to the first-row value of the next file, and so on for all rows of column 2 in each file.
In my actual files I have ~30 000 rows so any kind of manual manipulation with the rows is not possible.
fileA
1 1
2 100
3 1000
4 15000
fileB
1 7
2 500
3 6000
4 20000
fileC
1 4
2 300
3 8000
4 70000
output:
1 12
2 900
3 15000
4 105000
I used the script below and ran it as script.sh listofnames.txt (all the files have the same name but live in different directories, so I refer to them via $line from the file with the list of directory names). This gives me a syntax error, and I am looking for another way to define the "sum".
while IFS='' read -r line || [[ -n "$line" ]]; do
awk '{"'$sum'"+=$3; print $1,$2,"'$sum'"}' ../$line/file.txt >> output.txt
echo $sum
done < "$1"
$ paste fileA fileB fileC | awk '{sum=0; for (i=2;i<=NF;i+=2) sum+=$i; print $1, sum}'
1 12
2 900
3 15000
4 105000
or if you wanted to do it all in awk:
$ awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR;i++) print key[i], sum[i]}' fileA fileB fileC
1 12
2 900
3 15000
4 105000
If you have a list of directories in a file named "foo" and every file you're interested in in every directory is named "bar" then you can do:
IFS=$'\n' files=( $(awk '{print $0 "/bar"}' foo) )
cmd "${files[@]}"
where cmd is awk or paste or anything else you want to run on those files. Look:
$ cat foo
abc
def
ghi klm
$ IFS=$'\n' files=( $(awk '{print $0 "/bar"}' foo) )
$ awk 'BEGIN{ for (i=1;i<ARGC;i++) print "<" ARGV[i] ">"; exit}' "${files[@]}"
<abc/bar>
<def/bar>
<ghi klm/bar>
So if your files are all named file.txt and your directory names are stored in listofnames.txt then your script would be:
IFS=$'\n' files=( $(awk '{print $0 "/file.txt"}' listofnames.txt) )
followed by whichever of these you prefer:
paste "${files[@]}" | awk '{sum=0; for (i=2;i<=NF;i+=2) sum+=$i; print $1, sum}'
awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR;i++) print key[i], sum[i]}' "${files[@]}"
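Putting it together for the layout described in the question (directory names in listofnames.txt, each directory containing a file.txt), a sketch of the whole thing might look like:
#!/bin/bash
# Build the list of <dir>/file.txt paths (IFS=$'\n' so directory names may contain spaces).
IFS=$'\n' files=( $(awk '{print $0 "/file.txt"}' listofnames.txt) )

# Sum column 2 across all the files, keyed by row number, and save the result.
awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR; i++) print key[i], sum[i]}' \
    "${files[@]}" > output.txt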

Using grep and awk together

I have a file (A.txt) with 4 columns of numbers and another file with 3 columns of numbers (B.txt). I need to solve the following problems:
Find all lines in A.txt whose 3rd column has a number that appears any where in the 3rd column of B.txt.
Assume that I have many files like A.txt in a directory and I need to run this for every file in that directory.
How do I do this?
You should never see someone using grep and awk together because whatever grep can do, you can also do in awk:
Grep and Awk
grep "foo" file.txt | awk '{print $1}'
Using Only Awk:
awk '/foo/ {print $1}' file.txt
I had to get that off my chest. Now to your problem...
Awk is a programming language that assumes a single loop through all the lines in a set of files, and that isn't quite what you want here. Instead, you want to treat B.txt as a special file and loop through your other files. That normally calls for something like Python or Perl. (Older versions of bash didn't handle associative arrays, so those versions of bash won't work.) However, slitvinov looks like he found an answer.
Here's a Perl solution anyway:
use strict;
use warnings;
use feature qw(say);
use autodie;
my $b_file = shift;
open my $b_fh, "<", $b_file;
#
# This tracks the values in "B"
#
my %valid_lines;
while ( my $line = <$b_fh> ) {
chomp $line;
my @array = split /\s+/, $line;
$valid_lines{$array[2]} = 1; # Third column
}
close $b_fh;
#
# This handles the rest of the files
#
while ( my $line = <> ) { # The rest of the files
chomp $line;
my @array = split /\s+/, $line;
next unless exists $valid_lines{$array[2]}; # Next unless field #3 was in b.txt too
say $line;
}
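Saved as, say, filter.pl (a hypothetical name), the script above takes B.txt as its first argument, followed by the A files:
perl filter.pl B.txt A*.txt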
Here is an example. Create the following files and run
awk -f c.awk B.txt A*.txt
c.awk
FNR==NR {                 # first file on the command line (B.txt): remember each value in column 3
s[$3]
next
}
$3 in s {                 # remaining files: print lines whose column 3 appeared in B.txt
print FILENAME, $0
}
A1.txt
1 2 3
1 2 6
1 2 5
A2.txt
1 2 3
1 2 6
1 2 5
B.txt
1 2 3
1 2 5
2 1 8
The output should be:
A1.txt 1 2 3
A1.txt 1 2 5
A2.txt 1 2 3
A2.txt 1 2 5
