uniq -c in one column - shell

Imagine we have a text file like the following:
Input:
a1 D1
b1 D1
c1 D1
a1 D2
a1 D3
c1 D3
I want to count the number of times each element in the first column appears while also keeping the information provided by the second column in some way. Two possible output formats are shown below, but any coherent alternative is also acceptable:
Possible output 1:
3 a1 D1,D2,D3
1 b1 D1
2 c1 D1,D3
Possible output 2:
3 a1 D1
1 b1 D1
2 c1 D1
3 a1 D2
3 a1 D3
2 c1 D3
How can I do this? I guess a combination such as sort -k 1 input | uniq -c <keep col2>, or perhaps awk, would do it, but I was not able to write anything that works. All answers are welcome.

I would use GNU AWK for this task in the following way. Let the content of file.txt be
a1 D1
b1 D1
c1 D1
a1 D2
a1 D3
c1 D3
then
awk 'FNR==NR{arr[$1]+=1;next}{print arr[$1],$0}' file.txt file.txt
gives output
3 a1 D1
1 b1 D1
2 c1 D1
3 a1 D2
3 a1 D3
2 c1 D3
Explanation: this is a two-pass solution (note that file.txt is passed twice). The first pass counts the occurrences of each first-column value and stores the counts in the array arr; the second pass prints the stored count followed by the whole line.
(tested in GNU Awk 5.0.1)
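When the input arrives on a pipe and cannot be read twice, a one-pass variant is possible. The following is a minimal sketch rather than part of the answer above: it buffers every line in memory (so it assumes the input fits in RAM), and the function name count_first_column is made up for illustration.

```shell
# count_first_column [FILE...]
# One-pass variant: buffer each line, tally its first field, print at the end.
count_first_column() {
    awk '
        { cnt[$1]++; key[NR] = $1; line[NR] = $0 }   # tally the key and remember the line
        END {
            for (i = 1; i <= NR; i++)                # replay the lines in input order
                print cnt[key[i]], line[i]
        }
    ' "$@"
}

printf 'a1 D1\nb1 D1\nc1 D1\na1 D2\na1 D3\nc1 D3\n' | count_first_column
```

The two-pass version uses less memory on large files; this one trades memory for the ability to read from standard input.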

Using any awk:
$ awk '
{
vals[$1] = ($1 in vals ? vals[$1] "," : "") $2
cnts[$1]++
}
END {
for (key in vals) {
print cnts[key], key, vals[key]
}
}
' file
3 a1 D1,D2,D3
1 b1 D1
2 c1 D1,D3

Related

awk or sed command for columns and rows selection from multiple files

Looking for a command for the following task:
I have three files, each with two columns, as seen below.
I would like to create file4 with four columns.
The output should resemble a merge-sorted version of file1, file2 and file3: the first column is the sorted key, the second column is the second column of file1, the third column is the second column of file2, and the fourth column is the second column of file3.
The entries in columns 2 to 4 should not be sorted themselves but should stay matched to the key in the first column of the original files.
I tried an intersection in Linux, but it did not give the desired output.
Any help will be appreciated. Thanks in advance!!
$ cat -- file1
A1 B5
A10 B2
A3 B15
A15 B6
A2 B10
A6 B19
$ cat -- file2
A10 C4
A4 C8
A6 C5
A3 C10
A12 C14
A15 C18
$ cat -- file3
A3 D1
A22 D9
A20 D3
A10 D5
A6 D10
A21 D11
$ cat -- file4
col1 col2 col3 col4
A1 B5
A2 B10
A3 B15 C10 D1
A4 C8
A6 B19 C5 D10
A10 B2 C4 D5
A12 C14
A15 B6 C18
A20 D3
A21 D11
A22 D9
Awk + Bash version:
( echo "col1, col2, col3, col4" &&
awk 'ARGIND==1 { a[$1]=$2; allkeys[$1]=1 } ARGIND==2 { b[$1]=$2; allkeys[$1]=1 } ARGIND==3 { c[$1]=$2; allkeys[$1]=1 }
END{
for (k in allkeys) {
print k", "a[k]", "b[k]", "c[k]
}
}' file1 file2 file3 | sort -V -k1,1 ) | column -t -s ','
Pure Bash version:
declare -A a
while read key value; do a[$key]="${a[$key]:-}${a[$key]:+, }$value"; done < file1
while read key value; do a[$key]="${a[$key]:-, }${a[$key]:+, }$value"; done < file2
while read key value; do a[$key]="${a[$key]:-, , }${a[$key]:+, }$value"; done < file3
(echo "col1, col2, col3, col4" &&
for i in "${!a[@]}"; do
echo "$i, ${a[$i]}"
done | sort -V -k1,1) | column -t -s ','
For an explanation of "${a[$key]:-, , }${a[$key]:+, }$value", see Shell Parameter Expansion in the Bash manual.
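The two expansions used above can be tried in isolation. This is a small sketch with a hypothetical helper append_padded that takes the current cell content and the new value as arguments instead of reading an associative array:

```shell
# append_padded CURRENT VALUE
# ${1:-, , } supplies two empty columns when CURRENT is empty or unset;
# ${1:+, }   adds a separator only when CURRENT already has content.
append_padded() {
    printf '%s\n' "${1:-, , }${1:+, }$2"
}

append_padded ""         "D1"   # prints: , , D1
append_padded "B15, C10" "D1"   # prints: B15, C10, D1
```

The first call models a key that appears only in file3, the second a key already filled from file1 and file2.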
Using GNU Awk:
gawk '{ a[$1] = substr($1, 2); b[$1, ARGIND] = $2 }
END {
PROCINFO["sorted_in"] = "@val_num_asc"
for (i in a) {
t = i
for (j = 1; j <= ARGIND; ++j)
t = t OFS b[i, j]
print t
}
}' file{1..3} | column -t
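ARGIND and PROCINFO["sorted_in"] are GNU Awk extensions. As a rough sketch for other awks, the file index can be derived from FNR == 1 instead, assuming none of the input files is empty. The function name merge_columns and the --- placeholder for missing values are choices made here for illustration; sorting is left to an external sort.

```shell
# merge_columns FILE...
# Prints "key v1 v2 ..." with one value column per input file, unsorted.
merge_columns() {
    awk '
        FNR == 1 { nfiles++ }                 # a new input file starts here
        { keys[$1]; val[$1, nfiles] = $2 }    # remember the key and its value per file
        END {
            for (k in keys) {
                row = k
                for (i = 1; i <= nfiles; i++)
                    row = row " " (((k, i) in val) ? val[k, i] : "---")
                print row
            }
        }' "$@"
}
```

It could then be used as merge_columns file1 file2 file3 | sort -V | column -t, where sort -V is the GNU version sort already used in the other answers.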
There is a simple tool called join that allows you to perform this operation:
#!/usr/bin/env bash
cut -d ' ' -f1 file{1,2,3} | sort -k1,1 -u > ftmp
for f in file1 file2 file3; do
mv -- ftmp file4
join -a1 -e "---" -o auto file4 <(sort -k1,1 "$f") > ftmp
done
sort -k1,1V ftmp > file4
cat file4
This outputs
A1 B5 --- ---
A2 B10 --- ---
A3 B15 C10 D1
A4 --- C8 ---
A6 B19 C5 D10
A10 B2 C4 D5
A12 --- C14 ---
A15 B6 C18 ---
A20 --- --- D3
A21 --- --- D11
A22 --- --- D9
I used --- to indicate an empty field. If you want to pretty print this, you have to re-parse it with awk or anything else.
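One way to do that re-parsing, as a sketch only: blank out the --- placeholders and pad every field to a fixed width. The width of 6 and the function name pretty_print are assumptions chosen to fit the sample data, not something join provides.

```shell
# pretty_print [FILE...]
# Replaces the "---" placeholder with blanks and aligns fields in 6-character columns.
pretty_print() {
    awk '{
        line = ""
        for (i = 1; i <= NF; i++)
            line = line sprintf("%-6s", ($i == "---" ? "" : $i))
        sub(/ +$/, "", line)    # drop the padding after the last visible field
        print line
    }' "$@"
}

printf 'A1 B5 --- ---\nA4 --- C8 ---\n' | pretty_print
```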
This might work for you (GNU sed and sort):
s=''; for f in file{1,2,3}; do s="$s\t"; sed -E "s/\s+/$s/" $f; done |
sort -V |
sed -Ee '1i\col1\tcol2\tcol3\tcol4' -e ':a;N;s/^((\S+\t).*\S).*\n\2\t+/\1\t/;ta;P;D'
Replace spaces by tabs and insert the number of tabs between the key and value depending on which file is being processed.
Sort the output by key column order.
Coalesce each line with its key and print the result.

Bash: Output columns from array consisting of two columns

Problem
I am writing a bash script and I have an array, where each value consists of two columns. It looks like this:
for i in "${res[@]}"; do
echo "$i"
done
#Stream1
0 a1
1 b1
2 c1
4 d1
6 e1
#Stream2
0 a2
1 b2
3 c2
4 d2
9 f2
...
I would like to combine the output from this array into a larger table, and multiplex the indices. Furthermore, I would like to format the top row by inserting comment #Sec.
I would like the result to be something like this:
#Sec Stream1 Stream2
0 a1 a2
1 b1 b2
2 c1
3 c2
4 d1 d2
6 e1
9 f2
Inserting #Sec and removing the # in front of the Stream keyword is not necessary, but desired if not too difficult.
Tried Solutions
I have tried piping to column and awk, but have not been able to produce the desired results.
EDIT
res is an array in a bash script. It is quite large, so I will only provide a short selection. Running echo "$( typeset -p res)" produces the following output:
declare -a res='([1]="#Stream1
0 3072
1 6144
2 5120
3 1024
5 6144
..." [2]="#Stream2
0 3072
1 5120
2 4096
3 3072
53 3072
55 1024
57 2048")'
As for the 'result', my initial intention was to assign the resulting table to a variable and use it in another awk script to calculate the moving averages for specified indices, and plot the results. This will be done for ~20 different files. However I am open to other solutions.
The number of streams may vary from 10 to 50. Each stream having from 100 to 300 rows.
You may use this awk solution:
cat tabulate.awk
NF == 1 {
h = h OFS substr($1, 2)
++numSec
next
}
{
keys[$1]
map[$1,numSec] = $2
}
END {
print h
for (k in keys) {
printf "%s", k
for (i=1; i<=numSec; ++i)
printf "\t%s", map[k,i]
print ""
}
}
Then use it as:
awk -v OFS='\t' -v h='#Sec' -f tabulate.awk file
#Sec Stream1 Stream2
0 a1 a2
1 b1 b2
2 c1
3 c2
4 d1 d2
6 e1
9 f2
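The order in which for (k in keys) visits the indices is unspecified, so the rows are not guaranteed to appear in ascending order on every awk. The sketch below is a variation on the script above that sorts the indices numerically with a small insertion sort in the END block; the function name tabulate_sorted and the array sorted are made up for this example.

```shell
# tabulate_sorted [FILE...]
# Same table as tabulate.awk, with the index column in ascending numeric order.
tabulate_sorted() {
    awk -v OFS='\t' -v h='#Sec' '
        NF == 1 { h = h OFS substr($1, 2); ++numSec; next }   # header line such as "#Stream1"
        { keys[$1]; map[$1, numSec] = $2 }
        END {
            print h
            n = 0
            for (k in keys) sorted[++n] = k
            for (i = 2; i <= n; i++) {                        # insertion sort, numeric
                v = sorted[i]
                for (j = i - 1; j >= 1 && sorted[j] + 0 > v + 0; j--)
                    sorted[j + 1] = sorted[j]
                sorted[j + 1] = v
            }
            for (i = 1; i <= n; i++) {
                row = sorted[i]
                for (s = 1; s <= numSec; s++) row = row OFS map[sorted[i], s]
                print row
            }
        }' "$@"
}

printf '#Stream1\n0 a1\n2 c1\n#Stream2\n0 a2\n1 b2\n' | tabulate_sorted
```

Inside the bash script, the array could presumably be fed to it with printf '%s\n' "${res[@]}" | tabulate_sorted, since each array element already holds newline-separated rows.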

Awk - Control when my $# variables are expanded to merge two files with variable number of columns

My bash script is calling an awk script that nicely merges two files
mapfieldfile1=1
mapfieldfile2=2
awk -v FS="\t" 'BEGIN {OFS="\t"}
FNR==NR{hash1['"\$${mapfieldfile2}"']=$1 FS $3 FS $4 FS $5 FS $6;next}
('"\$${mapfieldfile1}"' in hash1){ print $0, hash1['"\$${mapfieldfile1}"']}' file2 file1
However, I want a more general version where I don't have to hardcode the columns that I want to print; I simply want to print everything but my id column. Replacing $1 FS $3 FS $4 FS $5 FS $6 with $0 "almost" does the job, except that it repeats the id column. I have been trying to dynamically create a string similar to $1 FS $3 FS $4 FS $5 FS $6, but I am getting the literal strings $1 $3 $4 $5 $6 in the merged file instead of their expanded values. There are also smaller side effects: I am adding a tab in the middle and losing some headers. Below are the code and example files.
I would like to find the solution to my merge and also understand what I am doing wrong and why my variables are not expanding.
I appreciate any help!
mapfieldfile1=1
mapfieldfile2=2
awk -v FS="\t" 'BEGIN {OFS="\t";strfields=""}
FNR==NR{for(i=1;i<=NF;i++) if(i!='"${mapfieldfile2}"') {strfields=strfields" "FS" $"i};
hash1['"\$${mapfieldfile2}"']=strfields;strfields="";next}
('"\$${mapfieldfile1}"' in hash1){print $0, hash1['"\$${mapfieldfile1}"']}' file2 file1
$cat file1
sampleid s1 s2 s3 s4
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
$cat file2
a0 sampleid a1 a2 a3 a4
a0 1 a a a a4
a0 2 b b b a4
a0 3 c c c a4
a0 5 e e e a4
$cat first_code_result.txt (good one!)
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
$cat second_code_result.txt
sampleid s1 s2 s3 s4 $1 $3 $4 $5 $6
1 1 1 1 1 $1 $3 $4 $5 $6
2 2 2 2 2 $1 $3 $4 $5 $6
3 3 3 3 3 $1 $3 $4 $5 $6
Try this (untested):
awk -v mf1="$mapfieldfile1" -v mf2="$mapfieldfile2" '
BEGIN {FS=OFS="\t"}
FNR==NR{sub(/\t[^\t]+/,""); hash1[$mf2]=$0; next}
($mf1 in hash1){ print $0, hash1[$mf1]}
' file2 file1
Don't let shell variables expand within awk scripts; pass them in with -v instead, and use a regexp to remove fields from the record. I don't know why the script you haven't shown us is printing $3 etc. literally, but you must be including them in a string. You'd have to post that script for help debugging it.
Check where mf1 vs mf2 should appear, I got confused reading your scripts.
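To illustrate the "including them in a string" point: in awk, "$" followed by a concatenated number is just text, whereas $n with an unquoted $ is a field reference. A tiny sketch (the function name show_field_vs_string is invented for the demonstration):

```shell
# show_field_vs_string N
# Contrasts a string that merely looks like a field reference with a real one.
show_field_vs_string() {
    printf 'x\ty\tz\n' | awk -F'\t' -v n="$1" '{
        print "$" n    # string concatenation: prints the characters $ and the number
        print $n       # field reference: prints the content of field n
    }'
}

show_field_vs_string 2
```

With the argument 2 this prints the literal text $2 on the first line and y on the second, which is the same effect seen in second_code_result.txt.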
EDIT - I had to tweak it as above I was deleting $2 before using it:
$ awk -v mf1="1" -v mf2="2" '
BEGIN {FS=OFS="\t"}
FNR==NR{key=$mf2; sub(/\t[^\t]+/,""); hash1[key]=$0; next}
($mf1 in hash1){ print $0, hash1[$mf1]}
' file2 file1
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
Note that the sub() above relies on the key field being $2 and FS being a tab. If you need a more general solution let us know.
Here's a version that'll do what you want for any key field values and will work in any awk, it just requires the FS to be a tab or some other fixed string (i.e. not a regexp):
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
key = $mf2
val = ""
nf = 0
for (i=1; i<=NF; i++) {
if (i != mf2) {
val = (nf++ ? val FS : "") $i
}
}
hash1[key] = val
next
}
$mf1 in hash1 { print $0, hash1[$mf1] }
$ awk -v mf1="1" -v mf2="2" -f tst.awk file2 file1
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
if your files are sorted already, the default output of join is what you want
$ join -t$'\t' -11 -22 file1 file2
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
or, after prettying with column
$ join -t$'\t' -11 -22 file1 file2 | column -t
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4

Only display the largest number in head

I use sort -r | head and get output like this:
8 a1
8 a2
5 a3
5 a4
4 a5
4 a6
4 a7
4 a8
4 a9
4 a0
What can I do to make the output like this:
8 a1
8 a2
so that only the lines with the largest number in the first column show up?
There are several ways to do it; here is one using awk. Since the input is already sorted, you only need to print the lines whose first field matches the first value, by piping the headed list into something like
awk 'BEGIN{maxval=0}; (maxval==0) {maxval=$1}; ($1==maxval) {print $0}'
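A slightly different sketch of the same idea, assuming the input is already sorted in descending order: take the value from the first line and stop reading as soon as a different value appears. It also copes with a largest value of 0, which the maxval==0 test above would mishandle. The function name top_ties is hypothetical.

```shell
# top_ties [FILE...]
# Prints the leading run of lines that share the first line's first field.
top_ties() {
    awk 'NR == 1 { max = $1 }      # the first line carries the largest value
         $1 != max { exit }        # stop at the first smaller value
         { print }' "$@"
}

printf '8 a1\n8 a2\n5 a3\n5 a4\n4 a5\n' | top_ties
```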

How to replace pairs of strings in two files to identical IDs?

[Update2] As often happens, the scope of the task expanded quite a bit as I understood it better. The obsolete parts are crossed out, and you will find the updated explanation below. [/Update2]
I have a pair of rather large log files with very similar content, except that some strings are different between the two. A couple of examples:
UnifiedClassLoader3#19518cc | UnifiedClassLoader3#d0357a
JBossRMIClassLoader#13c2d7f | JBossRMIClassLoader#191777e
That is, wherever the first file contains UnifiedClassLoader3#19518cc, the second contains UnifiedClassLoader3#d0357a, and so on. [Update] There are about 40 distinct pairs of such identifiers.[/Update]
UnifiedClassLoader3#19518cc | UnifiedClassLoader3#d0357a
JBossRMIClassLoader#13c2d7f | JBossRMIClassLoader#191777e
Logi18n#177060f | Logi18n#12ef4c6
LogFactory$1#15e3dc4 | LogFactory$1#2942da
That is, wherever the first file contains UnifiedClassLoader3#19518cc, the second contains UnifiedClassLoader3#d0357a, and so on. Note that all these strings are inside long lines of text, and they appear in many rows, intermixed with each other. There are about 4000 distinct pairs of such identifiers, and the size of each file is about 34 MB. So performance became an issue as well.
I want to replace these with identical IDs so that I can spot the really important differences between the two files. I.e. I want to replace all occurrences of both UnifiedClassLoader3#19518cc in file1 and UnifiedClassLoader3#d0357a in file2 with UnifiedClassLoader3#1; all occurrences of both Logi18n#177060f in file1 and Logi18n#12ef4c6 in file2 with Logi18n#2 etc. The counters 1 and 2 are arbitrary choices - the only requirement is that there is a one to one mapping between the old and new strings (i.e. the same string is always replaced by the same value and no different strings are replaced by the same value).
Using the Cygwin shell, so far I managed to list all different identifiers occurring in one of the files with
grep -o -e 'ClassLoader[0-9]*#[0-9a-f][0-9a-f]*' file1.log | sort | uniq
grep -o -e '[A-Z][A-Za-z0-9]*\(\$[0-9][0-9]*\)*#[0-9a-f][0-9a-f]*' file1.log | sort | uniq
However, now the original order is lost, so I don't know which is the pair of which ID in the other file. With grep -n I can get the line number, so the sort would preserve the order of appearance, but then I can't weed out the duplicate occurrences. Unfortunately grep can not print only the first match of a pattern.
I figured I could save the list of identifiers produced by the above command into a file, then iterate over the patterns in the file with grep -n | head -n 1, concatenate the results and sort them again. The result would be something like
2 ClassLoader3#19518cc
137 ClassLoader#13c2d7f
563 ClassLoader3#1267649
...
Then I could (using sed itself) massage this into a sed command like
sed -e 's/ClassLoader3#19518cc/ClassLoader3#2/g' \
-e 's/ClassLoader#13c2d7f/ClassLoader#137/g' \
-e 's/ClassLoader3#1267649/ClassLoader3#563/g' \
file1.log > file1_processed.log
and similarly for file2.
However, before I start, I would like to verify that my plan is the simplest possible working solution to this.
Is there any flaw in this approach? Is there a simpler way?
I think this does the trick, or at least comes close
#!/bin/sh
for PREFIX in file1 file2
do
cp ${PREFIX}.log /tmp/filter.$$.txt
FILE_MAP=`egrep -o -e 'ClassLoader[0-9a-f]*#[0-9a-f]+' ${PREFIX}.log | uniq | egrep -n .`
for MAP in `echo $FILE_MAP`
do
NUMBER=`echo $MAP | cut -d : -f 1`
WORD=`echo $MAP | cut -d : -f 2`
sed -e s/$WORD/ClassLoader#$NUMBER/g /tmp/filter.$$.txt > ${PREFIX}_processed.log
cp ${PREFIX}_processed.log /tmp/filter.$$.txt
done
rm /tmp/filter.$$.txt
done
Let me know if you have questions on how it works and why.
Here's my test data and the output
file1.log:
A1
UnifiedClassLoader3#a45bc1
A2
UnifiedClassLoader3#a45bc1
A3
UnifiedClassLoader3#a45bc1
A4
JBossRMIClassLoader#bc450a
A5
JBossRMIClassLoader#bc450a
A6
JBossRMIClassLoader#bc450a
B1
UnifiedClassLoader3#a45bc2
B2
UnifiedClassLoader3#a45bc2
B3
UnifiedClassLoader3#a45bc2
B4
JBossRMIClassLoader#bc450b
B5
JBossRMIClassLoader#bc450b
B6
JBossRMIClassLoader#bc450b
C1
UnifiedClassLoader3#a45bc3
C2
UnifiedClassLoader3#a45bc3
C3
UnifiedClassLoader3#a45bc3
C4
JBossRMIClassLoader#bc450c
C5
JBossRMIClassLoader#bc450c
C6
JBossRMIClassLoader#bc450c
file2.log (Similar patterns except the "C" set repeats the "A" set)
A1
UnifiedClassLoader3#d0357a
A2
UnifiedClassLoader3#d0357a
A3
UnifiedClassLoader3#d0357a
A4
JBossRMIClassLoader#191777e
A5
JBossRMIClassLoader#191777e
A6
JBossRMIClassLoader#191777e
B1
UnifiedClassLoader3#d0357b
B2
UnifiedClassLoader3#d0357b
B3
UnifiedClassLoader3#d0357b
B4
JBossRMIClassLoader#191777f
B5
JBossRMIClassLoader#191777f
B6
JBossRMIClassLoader#191777f
C1
UnifiedClassLoader3#d0357a
C2
UnifiedClassLoader3#d0357a
C3
UnifiedClassLoader3#d0357a
C4
JBossRMIClassLoader#191777e
C5
JBossRMIClassLoader#191777e
C6
JBossRMIClassLoader#191777e
And after processing you get file1_processed.log
A1
UnifiedClassLoader#1
A2
UnifiedClassLoader#1
A3
UnifiedClassLoader#1
A4
JBossRMIClassLoader#2
A5
JBossRMIClassLoader#2
A6
JBossRMIClassLoader#2
B1
UnifiedClassLoader#3
B2
UnifiedClassLoader#3
B3
UnifiedClassLoader#3
B4
JBossRMIClassLoader#4
B5
JBossRMIClassLoader#4
B6
JBossRMIClassLoader#4
C1
UnifiedClassLoader#5
C2
UnifiedClassLoader#5
C3
UnifiedClassLoader#5
C4
JBossRMIClassLoader#6
C5
JBossRMIClassLoader#6
C6
JBossRMIClassLoader#6
and file2_processed.log
A1
UnifiedClassLoader#1
A2
UnifiedClassLoader#1
A3
UnifiedClassLoader#1
A4
JBossRMIClassLoader#2
A5
JBossRMIClassLoader#2
A6
JBossRMIClassLoader#2
B1
UnifiedClassLoader#3
B2
UnifiedClassLoader#3
B3
UnifiedClassLoader#3
B4
JBossRMIClassLoader#4
B5
JBossRMIClassLoader#4
B6
JBossRMIClassLoader#4
C1
UnifiedClassLoader#1
C2
UnifiedClassLoader#1
C3
UnifiedClassLoader#1
C4
JBossRMIClassLoader#2
C5
JBossRMIClassLoader#2
C6
JBossRMIClassLoader#2
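With roughly 4000 identifiers and 34 MB per file, running sed once per identifier may become slow. As an alternative sketch, not the method used above, a single awk pass can number identifiers in order of first appearance. The function name normalize_ids and the simplified pattern (an uppercase-initial name, #, then hex digits, without the $1 suffix case) are assumptions; like the script above, it only yields matching IDs in both files if the identifiers first appear in the same order in each.

```shell
# normalize_ids [FILE...]
# Rewrites every Name#hex identifier to Name#N, where N is its first-seen rank.
normalize_ids() {
    awk '
    {
        out = ""
        rest = $0
        while (match(rest, /[A-Z][A-Za-z0-9]*[#][0-9a-f]+/)) {
            id = substr(rest, RSTART, RLENGTH)
            if (!(id in seen)) seen[id] = ++n        # assign the next free number
            name = id
            sub(/[#].*/, "", name)                   # keep the part before the hash
            out = out substr(rest, 1, RSTART - 1) name "#" seen[id]
            rest = substr(rest, RSTART + RLENGTH)    # continue after this match
        }
        print out rest
    }' "$@"
}

printf 'x UnifiedClassLoader3#a45bc1 y JBossRMIClassLoader#bc450a\nUnifiedClassLoader3#a45bc1\nUnifiedClassLoader3#a45bc2\n' | normalize_ids
```

Each file would be processed separately, for example normalize_ids file1.log > file1_processed.log, so that the numbering restarts at 1 for the second file.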
