Bash split and sort by two columns in same file - bash

For my file which looks like this:
AABBCC 10 5 CCAABB 100
BBCCAA 4 2 AABBCC 50
CCAABB 16 8 BBCCAA 20
... I am trying to sort columns 4 and 5, by matching column 4 to column 1.
Ideally this would return:
AABBCC 10 5 AABBCC 50
BBCCAA 4 2 BBCCAA 20
CCAABB 16 8 CCAABB 100
I have tried using sort, however as far as I'm aware it doesn't have a utility to sort within files.
Any help would be greatly appreciated!

awk solution:
awk 'NR==FNR{ a[$4]=$5; next }$1 in a{ print $1,$2,$3,$1,a[$1] }' file1 OFS="\t" file1
The output:
AABBCC 10 5 AABBCC 50
BBCCAA 4 2 BBCCAA 20
CCAABB 16 8 CCAABB 100
You may pipe to sort at the end: ... | sort
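The one-liner above reads the same file twice: the first pass stores column 5 under its column-4 key, and the second pass prints each row with the value stored for its column-1 key. Below is a commented sketch of that idea, assuming whitespace-separated fields and unique keys in column 4. The function name realign and the temporary file are illustrative only.

```shell
# realign FILE: hypothetical helper, not part of the original answer.
realign() {
  awk '
    NR == FNR { val[$4] = $5; next }              # pass 1: store column 5 under its column-4 key
    $1 in val { print $1, $2, $3, $1, val[$1] }   # pass 2: print the row with the matching value
  ' "$1" "$1" | sort
}

# Sample data from the question, written to a temporary file.
realign_demo=$(mktemp)
printf '%s\n' 'AABBCC 10 5 CCAABB 100' \
              'BBCCAA 4 2 AABBCC 50' \
              'CCAABB 16 8 BBCCAA 20' > "$realign_demo"
realign_out=$(realign "$realign_demo")
rm -f "$realign_demo"
printf '%s\n' "$realign_out"
```

Rows whose column-1 key never appears in column 4 are silently dropped by the `$1 in val` guard, which may or may not be what you want.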

join -t $'\t' -o 1.1,1.2,1.3,2.1,2.2 <(cut -f1-3 file.tsv | sort -k 1,1) <(cut -f4- file.tsv | sort -k 1,1) | sort
Cut the original file, then join on the first field of both. We need to specify the full join fields in -o to preserve the first column, or join will eat it.

With bash and GNU paste:
With temporary files for illustration:
cut -f 1-3 file | sort > file_1to3
cut -f 4-5 file | sort > file_4to5
paste -d '\t' file_1to3 file_4to5
Without temporary files:
paste -d '\t' <(cut -f 1-3 file | sort) <(cut -f 4-5 file | sort)
Output:
AABBCC 10 5 AABBCC 50
BBCCAA 4 2 BBCCAA 20
CCAABB 16 8 CCAABB 100
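Note that the cut/sort/paste approach only lines up correctly when columns 1 and 4 hold the same keys. Here is a small sketch of a check you could run first, assuming a tab-separated file. The helper name keys_match is hypothetical.

```shell
# keys_match FILE: hypothetical check that the sorted keys of column 1
# and column 4 are identical, so that paste pairs the right rows.
keys_match() {
  [ "$(cut -f 1 "$1" | sort)" = "$(cut -f 4 "$1" | sort)" ]
}

# Tab-separated copy of the sample data from the question.
keys_demo=$(mktemp)
printf 'AABBCC\t10\t5\tCCAABB\t100\nBBCCAA\t4\t2\tAABBCC\t50\nCCAABB\t16\t8\tBBCCAA\t20\n' > "$keys_demo"
if keys_match "$keys_demo"; then keys_ok=yes; else keys_ok=no; fi
rm -f "$keys_demo"
echo "keys match: $keys_ok"
```

If the check fails, the awk or join answers are the safer choice because they match rows by key rather than by position.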

Related

awk length is counting +1

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length.
Here is my code:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
Here is the output:
...
1799 5
427 4
81 3
1 2
My problem is that awk's length counts one extra character for each word in my file. The right output should have been:
1799 4
427 3
81 2
1 1
I checked my file and it does not contain any space after the word:
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...
So I guess awk is counting the newline as a character, despite the fact it is not supposed to.
Is there any solution? Or something I'm doing wrong?
I'm going to venture a guess: your awk expects Unix-style newlines (LF), but your dico.txt has Windows-style line endings (CR+LF). That would easily give you the +1 on all lengths.
I took your four words:
$ cat dico.txt
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
And ran your line:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
1 11
1 10
1 8
1 7
So far so good. Now the same, but dico.txt with windows newlines:
$ cat dico.txt | todos > dico_win.txt
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
1 12
1 11
1 9
1 8
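If the file does turn out to have CR+LF line endings, one possible fix is to strip the trailing carriage return inside awk before measuring. This is a sketch, not part of the original answer, and the helper name line_lengths is made up for illustration.

```shell
# line_lengths [FILE...]: hypothetical helper; removes one trailing CR
# from each line, then prints the length of what is left.
line_lengths() {
  awk '{ sub(/\r$/, ""); print length($0) }' "$@"
}

# Two words from the question, with Windows-style line endings.
lengths_crlf=$(printf 'ABAISSA\r\nABAISSABLE\r\n' | line_lengths)
printf '%s\n' "$lengths_crlf"
```

The original pipeline then becomes `line_lengths dico.txt | sort -nr | uniq -c`. Converting the file once with a tool such as dos2unix, if it is installed, would work as well.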

Shell scripting: how to sort array in file txt

I have a file txt
for example:
11 23 4 9
5 2 17 25
and the output that I want is:
2 4 5 9
11 17 23 25
Sort the numbers in the file with sort -un:
tr ' ' '\n' < file.txt | sort -un | tr '\n' ' '
$ tr -s ' ' '\n' <file | sort -n | paste -d ' ' - - - -
2 4 5 9
11 17 23 25
tr changes all spaces into newlines, and its -s option squeezes runs of consecutive newlines into one. This creates a stream of numbers, one per line, which is then sorted numerically and pasted into four space-separated columns.
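The paste command above hard-codes four columns. Here is a sketch that takes the column count from the first line instead, assuming space-separated, non-negative integers and the same number of fields on every line. The function name sort_grid is hypothetical.

```shell
# sort_grid FILE: hypothetical helper; sorts all numbers in FILE and
# prints them back with as many columns as the first line had.
sort_grid() {
  # $(( )) strips any padding that wc puts around the count.
  cols=$(( $(head -n 1 "$1" | wc -w) ))
  # One number per line, sorted, then regrouped cols at a time by xargs (via echo).
  tr -s ' ' '\n' < "$1" | sort -n | xargs -n "$cols"
}

# Sample data from the question.
grid_demo=$(mktemp)
printf '11 23 4 9\n5 2 17 25\n' > "$grid_demo"
grid_out=$(sort_grid "$grid_demo")
rm -f "$grid_demo"
printf '%s\n' "$grid_out"
```

Because xargs hands the numbers to echo, this sketch is only suitable for plain numeric tokens.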
This gawk code works for any number of columns:
awk '{for(x=1;x<=NF;x++)a[++i]=$x}
END{asort(a,b)
for(x=1;x<=i;x++)printf "%s%s",b[x],(x%NF==0?RS:FS)}' file

Sorting tab delimited numbers by column with pure bash script.

I'm stuck on some homework. The assignment is to accept an input file and compute some statistics on its values. The user may specify whether to calculate the statistics by row or by column. The shell script must be pure bash, so I can't use awk, sed, perl, python, etc.
sample input:
1 1 1 1 1 1 1
39 43 4 3225 5 2 2
6 57 8 9 7 3 4
3 36 8 9 14 4 3
3 4 2 1 4 5 5
6 4 4814 7 7 6 6
I can't figure out how to sort and process the data by column. My code for processing the rows works fine.
# CODE FOR ROWS
while read -r line; do
echo $(printf "%d\n" $line | sort -n) | tr ' ' \\t > sorted.txt
....
#I perform the stats calculations
# for row line by working with the temp file sorted.txt
done
How could I process this data by column? I've never worked with shell script so I've been staring at this for hours.
If you wanted to analyze by columns you'll need the cols value first (number of columns). head -n 1 gives you the first row, and NF counts the number of fields, giving us the number of columns.
cols=$(head -n 1 input.txt | awk '{print NF}');
Then you can use cut with the '\t' delimiter to grab every column from input.txt, and run it through sort -n, as you did in your original post.
$ for i in $(seq 1 "$cols"); do cut -f"$i" -d$'\t' input.txt; done | sort -n > output.txt
For rows, you can use the shell built-in printf with the format specifier %d for integers. The sort command works on lines of input, so we replace spaces ' ' with newlines \n using the tr command:
$ cat input.txt | while read line; do echo $(printf "%d\n" $line); done | tr ' ' '\n' | sort -n > output.txt
Now take the output file to gather our statistics:
Min: cat output.txt | head -n 1
Max: cat output.txt | tail -n 1
Sum: (courtesy of Dimitre Radoulov): cat output.txt | paste -sd+ - | bc
Mean: (courtesy of porges): cat output.txt | awk '{ total += $1 } END { print total/NR }'
Median: (courtesy of maxschlepzig): cat output.txt | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Histogram: cat output.txt | uniq -c
8 1
3 2
4 3
6 4
3 5
4 6
3 7
2 8
2 9
1 14
1 36
1 39
1 43
1 57
1 3225
1 4814
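The commands above rely on awk, cut and sort, which the question rules out. A pure-bash sketch of per-column statistics could look like the following. It assumes bash (arrays and `$'...'` quoting), a rectangular file of tab- or space-separated integers without leading zeros, and integer division for the mean. The function name column_stats is hypothetical.

```shell
# column_stats FILE: hypothetical helper using only bash built-ins.
column_stats() {
  local file=$1 i rows=0
  local -a row sum min max
  while IFS=$' \t' read -r -a row; do
    [ "${#row[@]}" -gt 0 ] || continue          # skip blank lines
    for i in "${!row[@]}"; do
      if [ -z "${min[i]+set}" ]; then           # first value seen in this column
        sum[i]=0; min[i]=${row[i]}; max[i]=${row[i]}
      fi
      sum[i]=$(( sum[i] + row[i] ))
      if (( row[i] < min[i] )); then min[i]=${row[i]}; fi
      if (( row[i] > max[i] )); then max[i]=${row[i]}; fi
    done
    rows=$(( rows + 1 ))
  done < "$file"
  for i in "${!sum[@]}"; do
    printf 'col %d: min=%d max=%d sum=%d mean=%d\n' \
      "$(( i + 1 ))" "${min[i]}" "${max[i]}" "${sum[i]}" "$(( sum[i] / rows ))"
  done
}

# Small made-up input: three rows, two tab-separated columns.
stats_demo=$(mktemp)
printf '1\t2\n3\t4\n5\t9\n' > "$stats_demo"
stats_out=$(column_stats "$stats_demo")
rm -f "$stats_demo"
printf '%s\n' "$stats_out"
```

A median would additionally need the values of each column sorted, which in pure bash means writing a small sort loop. That part is left out of this sketch.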

Add column to csv file

I have two files, and I need to take the last column of one file and append it to the other.
file1
1 2 3
1 2 3
1 2 3
file2
5 5
5 5
5 5
Initial proposal
#!/usr/bin/env bash
column=$(awk '{print $(NF)}' $file1)
paste -d',' $file2 < $column
Expected result
file2
5 5 3
5 5 3
5 5 3
But this script does not work yet.
Note: I do not know how many columns the file has, so I need a more generic solution.
You can use this paste command:
paste -d " " file2 <(awk '{print $NF}' file1)
5 5 3
5 5 3
5 5 3
To append last column of file1 to file2:
paste -d " " file2 <(rev file1 | cut -d " " -f 1 | rev)
Output:
5 5 3
5 5 3
5 5 3
To paste the second column of file 1 to file 2:
while read line; do
read -u 3 c1 c2 c3;
echo $line $c2;
done < file2 3< file1
You can use Perl too:
$ paste -d ' ' file2.txt <(perl -lne 'print $1 if m/(\S+)\s*$/' file1.txt)
5 5 3
5 5 3
5 5 3
Or grep:
$ paste -d ' ' file2.txt <(grep -Eo '(\S+)\s*$' file1.txt)
5 5 3
5 5 3
5 5 3
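The answers above all pair the two files line by line with paste. The same result can be sketched in a single awk call that reads file1 first and remembers its last field per line number. This assumes both files have the same number of lines, and the function name append_last_column is hypothetical.

```shell
# append_last_column FILE1 FILE2: hypothetical helper; prints each line
# of FILE2 followed by the last field of the same-numbered line of FILE1.
append_last_column() {
  awk '
    NR == FNR { last[FNR] = $NF; next }   # FILE1: remember the last field of each line
    { print $0, last[FNR] }               # FILE2: append the remembered field
  ' "$1" "$2"
}

# Sample data from the question.
f1_demo=$(mktemp); f2_demo=$(mktemp)
printf '1 2 3\n1 2 3\n1 2 3\n' > "$f1_demo"
printf '5 5\n5 5\n5 5\n' > "$f2_demo"
append_out=$(append_last_column "$f1_demo" "$f2_demo")
rm -f "$f1_demo" "$f2_demo"
printf '%s\n' "$append_out"
```

Since `$NF` is always the last field, the number of columns in file1 does not need to be known in advance.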

How to grep two column from a single file

cat Error00
4 0 375
4 2001 21
4 2002 20
cat Error01
4 0 465
4 2001 12
4 2002 40
4 2016 1
I want output as below
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
I am using the script below. The problem is that I can't grep on two fields, because the key contains a space.
Please suggest how to get around this.
keylist=$(awk '{print $1,$2'} Error0[0-1] | sort | uniq)
for key in ${keylist} ; do
echo ${key}
val_a=$(grep "^${key}" Error00 | awk '{print $3}') ;val_a=${val_a:---}
val_b=$(grep "^${key}" Error01 | awk '{print $1,$2}') ; val_b=${val_b:--- --}
echo $key ${val_a} >>testreport
done
I am getting the output below:
4 375 465
0
4 21 12
2001
4 20 20
2002
4 - 1
2016
A single awk one liner can handle this easily:
awk 'FNR==NR{a[$1,$2]=$3;next}{print $1,$2,(a[$1,$2]?a[$1,$2]:"-"),$3}' err0 err1
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
For formatted output you can use printf instead of print, as Jonathan Leffler suggests:
printf "%s %-6s %-6s %s\n",$1,$2,(a[$1,$2]?a[$1,$2]:"-"),$3
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
However a general solution is to use column -t for a nice table output:
awk '{....}' err0 err1 | column -t
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
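Two edge cases are worth knowing about with the one-liner above: a stored value of 0 is treated as missing by the `a[...]?...:"-"` test, and keys that occur only in the first file are never printed. Here is a sketch of a variant that covers both, assuming space-separated input with the key in columns 1 and 2. The function name merge_errors is hypothetical.

```shell
# merge_errors FILE_A FILE_B: hypothetical helper; full outer join on
# columns 1 and 2, printing "-" where one side has no value.
merge_errors() {
  awk '
    NR == FNR { a[$1 FS $2] = $3; next }             # first file: store value by key
    {
      key = $1 FS $2
      print key, ((key in a) ? a[key] : "-"), $3     # membership test, so 0 is kept
      seen[key] = 1
    }
    END {
      for (key in a)                                 # keys present only in the first file
        if (!(key in seen)) print key, a[key], "-"
    }
  ' "$1" "$2" | sort -k1,1n -k2,2n
}

# Sample data from the question.
e0_demo=$(mktemp); e1_demo=$(mktemp)
printf '4 0 375\n4 2001 21\n4 2002 20\n' > "$e0_demo"
printf '4 0 465\n4 2001 12\n4 2002 40\n4 2016 1\n' > "$e1_demo"
merge_out=$(merge_errors "$e0_demo" "$e1_demo")
rm -f "$e0_demo" "$e1_demo"
printf '%s\n' "$merge_out"
```

The final sort is there because awk does not guarantee any order for `for (key in a)`.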
grep is not really the right tool for this job. You can either play with awk or Perl (or Python, or …), or you can use join. However, join only joins on a single column at a time, and you appear to need to join on two columns. So, we're going to have to massage the data so that it will work with join. I'm about to assume you're using bash and so have process substitution available. You can do the job without, but it is fiddlier and involves temporary files (and traps to clean them up, etc).
The key to the join will be to replace the blank between the first two columns with a colon (or any other convenient character — control-A would work fine too), then join the files on column 1 with a replacement character. The inputs must be sorted; the output must have the colon replaced with a blank.
$ join -o 0,1.2,2.2 -a 1 -a 2 -e '-' \
> <(sed 's/  */:/' Error00 | sort) \
> <(sed 's/  */:/' Error01 | sort) |
> sed 's/:/ /'
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
$
The 's/  */:/' operation (a blank followed by zero or more blanks) replaces the first sequence of one or more blanks with a colon; the input data has two blanks between the 4 and the 0 in the first line of Error00. The input to join must be in sorted order of the joining field, here the first field. The output is the join field, the second column of Error00 and the second column of Error01 (remembering that means the second column after the first two have been fused by the colon). If there's an unmatched line in the first file, generate an output line (-a 1); ditto for the second file; and for the missing fields, insert a dash (-e '-'). The final sed removes the colon that was added.
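To see what the key-fusing step does on its own, here is a small sketch. It uses a POSIX character class rather than literal blanks so that the pattern survives copy-and-paste. The function name fuse_key is hypothetical.

```shell
# fuse_key: hypothetical helper; replaces the first run of one or more
# blanks on each line with a colon, turning two key columns into one.
fuse_key() {
  sed 's/[[:blank:]][[:blank:]]*/:/'
}

# First line has two blanks after the 4, as in Error00.
fused=$(printf '4  0 375\n4 2001 21\n' | fuse_key)
printf '%s\n' "$fused"
```

After this step each line has two fields, the fused key and the value, which is the shape join expects.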
If you want the data formatted, pipe it through awk.
$ join -o 0,1.2,2.2 -a 1 -a 2 -e '-' \
> <(sed 's/  */:/' Error00 | sort) \
> <(sed 's/  */:/' Error01 | sort) |
> sed 's/:/ /' |
> awk '{printf("%s %-6s %-6s %s\n", $1, $2, $3, $4)}'
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
$

Resources