Merging many files based on matching column - bash

I have many files (I posted 5 as an example).
If a key has no match in one of the files, then 0 should be appended in the output.
file1
1001 1 2
1002 1 2
1003 3 5
1004 6 7
1005 8 9
1009 2 3
file2
1002 7
1003 8
file3
1001 5
1002 3
file4
1002 10
1004 60
1007 4
file5
1001 102
1003 305
1005 809
output desired
1001 1 2 0 5 0 102
1002 1 2 7 3 10 0
1003 3 5 8 0 0 305
1004 6 7 0 0 60 0
1005 8 9 0 0 0 809
1007 0 0 0 0 4 0
1009 2 3 0 0 0 0
Using the code below I can merge two files, BUT how do I merge all of them?
awk 'FNR==NR{a[$1]=$2;next}{print $0,a[$1]?a[$1]:"0"}' file2 file1
1001 1 2 0
1002 1 2 7
1003 3 5 8
1004 6 7 0
1005 8 9 0
Thanks in advance

GNU Join to the rescue!
$ join -a1 -a2 -e '0' -o auto file1 file2 \
| join -a1 -a2 -e '0' -o auto - file3 \
| join -a1 -a2 -e '0' -o auto - file4 \
| join -a1 -a2 -e '0' -o auto - file5
The options -a1 and -a2 tell join to also print unpairable lines from file 1 and file 2, and -e '0' tells it to fill the missing fields with a zero. The output format is given with -o auto, which simply takes all fields.
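To see what gets chained, the first join on its own already produces the merge of file1 and file2 (output reconstructed from the sample data above):
$ join -a1 -a2 -e '0' -o auto file1 file2
1001 1 2 0
1002 1 2 7
1003 3 5 8
1004 6 7 0
1005 8 9 0
1009 2 3 0
Each further join in the pipeline then appends one more column.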
With a large number of files you cannot keep extending the pipeline by hand, but you can use a simple for loop:
out=output
tmp=$(mktemp)
rm -f "$out"; touch "$out"    # start from an empty output file
for file in f*; do
    join -a1 -a2 -e0 -o auto "$out" "$file" > "$tmp"
    mv "$tmp" "$out"
done
cat "$out"
or if you really like the pipeline:
pipeline="cat /dev/null"
for file in f*; do pipeline="$pipeline | join -a1 -a2 -e0 -o auto - $file"; done
eval "$pipeline"
Very much of interest here: Is there a limit on how many pipes I can use?
Remark: the usage of auto is extremely useful in this case, but it is not part of the POSIX standard; it is a GNU extension that ships with the GNU coreutils. A pure POSIX version reads a bit more cumbersomely:
$ join -a1 -a2 -e '0' -o 0 1.2 1.3 2.2 file1 file2 \
| join -a1 -a2 -e '0' -o 0 1.2 1.3 1.4 2.2 - file3 \
| join -a1 -a2 -e '0' -o 0 1.2 1.3 1.4 1.5 2.2 - file4 \
| join -a1 -a2 -e '0' -o 0 1.2 1.3 1.4 1.5 1.6 2.2 - file5
More information in man join.
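One assumption worth spelling out: join expects both inputs to be sorted on the join field (the sample files already are). If yours are not, a sketch like this could sort them in place first:
for f in file*; do
    sort -k1,1 -o "$f" "$f"    # join requires the key column to be sorted
done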

With GNU awk for true multi-dimensional arrays and sorted_in:
$ cat tst.awk
FNR==1 { numCols = colNr }                     # new input file: remember how many output columns exist so far
{
    key = $1
    for (i=2; i<=NF; i++) {
        colNr = numCols + i - 1                # global output column for this field
        val = $i
        lgth = length(val)
        vals[key][colNr] = val
        wids[colNr] = (lgth > wids[colNr] ? lgth : wids[colNr])    # track widest value per column
    }
}
END {
    numCols = colNr
    PROCINFO["sorted_in"] = "@ind_num_asc"     # print the keys in ascending numeric order
    for (key in vals) {
        printf "%s", key
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%*d", OFS, wids[colNr], vals[key][colNr]
        }
        print ""
    }
}
$ awk -f tst.awk file*
1001 1 2 0 5 0 102
1002 1 2 7 3 10 0
1003 3 5 8 0 0 305
1004 6 7 0 0 60 0
1005 8 9 0 0 0 809
1007 0 0 0 0 4 0
1009 2 3 0 0 0 0

Using GNU awk
awk '
NR>FNR && FNR==1{
    colcount+=cols
}
{
    for(i=2;i<=NF;i++){
        rec[$1][colcount+i-1]=$i
    }
}
{
    cols=NF-1
}
END{
    colcount++
    for(ind in rec){
        printf "%s%s",ind,OFS
        for(i=1;i<=colcount;i++){
            printf "%s%s",rec[ind][i]?rec[ind][i]:0,OFS
        }
        print ""
    }
}' file{1..5} | sort -k1 | column -t
Output
1001 1 2 0 5 0 102
1002 1 2 7 3 10 0
1003 3 5 8 0 0 305
1004 6 7 0 0 60 0
1005 8 9 0 0 0 809
1006 0 0 0 0 0 666
Note: this will work for the case mentioned here and for any type of values.
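One caveat of my own, not part of the answer above: the ternary rec[ind][i]?rec[ind][i]:0 prints a plain 0 whenever the stored value is itself 0 or empty, which is harmless for the data shown but could matter for other values. Testing for key existence is safer, e.g. replacing that printf with:
printf "%s%s",((i in rec[ind]) ? rec[ind][i] : 0),OFS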

Related

Add lines with 0 for missing values in a datatable

I have a dataset counting occurrences of bins, for instance:
1 10
2 15
3 1
5 50
8 990
As you can see, I am missing bins in the first column. As I want to plot this data, I'm looking for a way to add those missing values, with a 0 in the second column, e.g. if I know my bins go up to 10:
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
I'm looking for a unix/bash solution as it fits my pipeline and my files are rather big, but maybe R is more suited for this?
EDIT: Thanks to karafaka sir, adding a solution which will capture the very first line's missing bins too.
awk -v value=10 '$1-prev>1{while(++prev<$1){print prev,"0"}} {prev=$1;print} END{if(prev<value){while(++prev<=value){print prev,"0"}}}' Input_file
Let's say following is the Input_file:
cat Input_file
3 10
4 15
7 1
9 50
19 990
Then after running above code we will get following output.
1 0
2 0
3 10
4 15
5 0
6 0
7 1
8 0
9 50
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 990
Could you please try the following.
awk -v value=10 'prev && $1-prev>1{while(++prev<$1){print prev,"0"}} {prev=$1;print} END{if(prev<value){while(++prev<=value){print prev,"0"}}}' Input_file
Adding a non-one-liner form of the solution too now.
awk -v value=10 '
prev && $1-prev>1{
    while(++prev<$1){
        print prev,"0"
    }
}
{
    prev=$1
    print
}
END{
    if(prev<value){
        while(++prev<=value){
            print prev,"0"
        }
    }
}' Input_file
We can combine seq and awk to make the task easier:
awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$1 FS 0}' file <(seq 10)
You can do this as well:
awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$0}' f <(seq -f '%g 0' 10)
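Side note (mine): seq -f '%g 0' simply pre-builds the zero-filled rows that awk then overrides wherever real data exists, for example:
$ seq -f '%g 0' 3
1 0
2 0
3 0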
Test with your data:
kent$ cat f
1 10
2 15
3 1
5 50
8 990
kent$ awk 'NR==FNR{a[$1]=$0;next}{print $1 in a?a[$1]:$1 FS 0}' f <(seq 10)
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
Using Bash and join:
$ join -a 1 --nocheck-order -e 0 -o 1.1,2.2 <(seq 10) file
Output:
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
another awk
$ awk -v mx=10 '{while(++k<$1) print k,0}1;
END {while(k++<mx) print k,0}' file
this will fill the first records if missing as well.
$ awk '{n[$1]=$2} END{for (i=1;i<=10;i++) print i,n[i]+0}' file
1 10
2 15
3 1
4 0
5 50
6 0
7 0
8 990
9 0
10 0
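If you would rather not hard-code the 10, the same idea can take the upper bound as a variable (a minimal sketch):
awk -v mx=10 '{n[$1]=$2} END{for (i=1;i<=mx;i++) print i,n[i]+0}' file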

Count and percentage

Using columns 4 and 2, the code below creates a report like the output shown further down. My code works fine, but I believe it can be made shorter :).
I have a doubt about the split part:
CNTLM = split ("20,30,40,60", LMT)
It works, but it would be better to use exactly the values "10,20,30,40" that appear in column 4.
4052538693,2910,04-May-2018-22,10
4052538705,2910,04-May-2018-22,10
4052538717,2910,04-May-2018-22,10
4052538729,2911,04-May-2018-22,20
4052538741,2911,04-May-2018-22,20
4052538753,2912,04-May-2018-22,20
4052538765,2912,04-May-2018-22,20
4052538777,2914,04-May-2018-22,10
4052538789,2914,04-May-2018-22,10
4052538801,2914,04-May-2018-22,30
4052539029,2914,04-May-2018-22,20
4052539041,2914,04-May-2018-22,20
4052539509,2915,04-May-2018-22,30
4052539521,2915,04-May-2018-22,30
4052539665,2915,04-May-2018-22,30
4052539677,2915,04-May-2018-22,10
4052539689,2915,04-May-2018-22,10
4052539701,2916,04-May-2018-22,40
4052539713,2916,04-May-2018-22,40
4052539725,2916,04-May-2018-22,40
4052539737,2916,04-May-2018-22,40
4052539749,2916,04-May-2018-22,40
4052539761,2917,04-May-2018-22,10
4052539773,2917,04-May-2018-22,10
Here is the code I use to get the desired output.
printf " Code 10 20 30 40 Total\n" > header
dd=`cat header | wc -L`
awk -F"," '
BEGIN {CNTLM = split ("20,30,40,60", LMT)
cmdsort = "sort -nr"
DASHES = sprintf ("%0*d", '$dd', _)
gsub (/0/, "-", DASHES)
}
{for (IX=1; IX<=CNTLM; IX++) if ($4 <= LMT[IX]) break
CNT[$2,IX]++
COLTOT[IX]++
LNC[$2]++
TOT++
}
END {
print DASHES
for (l in LNC)
{printf "%5d", l | cmdsort
for (IX=1; IX<=CNTLM; IX++) {printf "%9d", CNT[l,IX]+0 | cmdsort
}
printf " = %6d" RS, LNC[l] | cmdsort
}
close (cmdsort)
print DASHES
printf "Total"
for (IX=1; IX<=CNTLM; IX++) printf "%9d", COLTOT[IX]+0
printf " = %6d" RS, TOT
print DASHES
printf "PCT "
for (IX=1; IX<=CNTLM; IX++) printf "%9.1f", COLTOT[IX]/TOT*100
printf RS
print DASHES
}
' file
The output I got:
Code 10 20 30 40 Total
----------------------------------------------------
2917 2 0 0 0 = 2
2916 0 0 0 5 = 5
2915 2 0 3 0 = 5
2914 2 2 1 0 = 5
2912 0 2 0 0 = 2
2911 0 2 0 0 = 2
2910 3 0 0 0 = 3
----------------------------------------------------
Total 9 6 4 5 = 24
----------------------------------------------------
PCT 37.5 25.0 16.7 20.8
----------------------------------------------------
I would appreciate it if the code can be improved.
without the header and cosmetics...
$ awk -F, '{a[$2,$4]++; k1[$2]; k2[$4]}
END{for(r in k1)
{printf "%5s", r;
for(c in k2) {k1[r]+=a[r,c]; k2[c]+=a[r,c]; printf "%10d", OFS a[r,c]+0}
printf " =%7d\n", k1[r]};
printf "%5s", "Total";
for(c in k2) {sum+=k2[c]; printf "%10d", k2[c]}
printf " =%7d", sum}' file | sort -nr
2917 2 0 0 0 = 2
2916 0 0 0 5 = 5
2915 2 0 3 0 = 5
2914 2 2 1 0 = 5
2912 0 2 0 0 = 2
2911 0 2 0 0 = 2
2910 3 0 0 0 = 3
Total 9 6 4 5 = 24
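One caveat I would add: for (c in k2) does not guarantee that the 10/20/30/40 columns come out in ascending numeric order (it merely happens to here). With GNU awk the traversal order can be pinned by putting this line at the top of the END block:
PROCINFO["sorted_in"] = "@ind_num_asc"    # GNU awk only: iterate array indices in numeric order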

How to find sum of elements in column inside of a text file (Bash)

I have a log file with lots of unnecessary information. The only important part of that file is a table which describes some statistics. My goal is to have a script which will accept a column name as argument and return the sum of all the elements in the specified column.
Example log file:
.........
Skipped....
........
WARNING: [AA[409]: Some bad thing happened.
--- TOOL_A: READING COMPLETED. CPU TIME = 0 REAL TIME = 2
--------------------------------------------------------------------------------
----- TOOL_A statistics -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
AAA 885 0 0 0 0
AAAA2 1 0 2 0 0
AAAA4 0 0 2 0 0
AAAA8 0 0 2 0 0
AAAA16 0 0 2 0 0
AAAA1 0 0 2 0 0
AAAA8 0 0 23 0 0
AAAAAAA4 0 0 18 0 0
AAAA2 0 0 14 0 0
AAAAAA2 0 0 21 0 0
AAAAA4 0 0 23 0 0
AAAAA1 0 0 47 0 0
AAAAAA1 2 0 26 0
NOTE: Some notes
......
Skipped ......
The expected usage: script.sh Attr1
Expected output:
888
I've tried to find something with sed/awk but failed to figure out a solution.
tldr;
$ cat myscript.sh
#!/bin/sh
logfile=${1}
attribute=${2}
field=$(grep -o "NAME.\+${attribute}" ${logfile} | wc -w)
sed -nre '/NAME/,/NOTE/{/NAME/d;/NOTE/d;s/\s+/\t/gp;}' ${logfile} | \
cut -f${field} | \
paste -sd+ | \
bc
$ ./myscript.sh mylog.log Attr3
182
Explanation:
assign command-line arguments ${1} and ${2} to the logfile and attribute variables, respectively.
with wc -w, count the number of words in the header text matched from NAME through ${attribute}; this word count is the field index, assigned to field (see the worked example after this list)
with sed
suppress automatic printing (-n) and enable extended regular expressions (-r)
find lines between the NAME and NOTE lines, inclusive
delete the lines that match NAME and NOTE
translate each contiguous run of whitespace to a single tab and print the result
cut using the field index
paste all numbers as an infix summation
evaluate the infix summation via bc
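To make that field lookup concrete, here is how it should resolve for the sample log and Attr3 (using the mylog.log name from the transcript above):
$ grep -o "NAME.\+Attr3" mylog.log | wc -w
4
so cut -f4 then selects the Attr3 column of the tab-separated rows.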
Quick and dirty (without any other spec)
awk -v CountCol=2 '/^[^[:blank:]]/ && NF == 6 { S += $( CountCol) } END{ print S + 0 }' YourFile
with column name
awk -v ColName='Attr1' '
    /^[[:blank:]]/  && NF == 6 { for (i=1; i<=NF; i++) { if ($i == ColName) CountCol = i } }
    /^[^[:blank:]]/ && NF == 6 && CountCol { S += $(CountCol) }
    END { print S + 0 }
    ' YourFile
You should add a header/trailer filter to avoid noisy lines (a flag would suit this perfectly), but lacking info about the structure to set such a flag, I use a simple field count (assuming text fields evaluate to 0, so they do not change the sum when counted).
$ awk -v col='Attr3' '/NAME/{for (i=1;i<=NF;i++) f[$i]=i} col in f{sum+=$(f[col]); if (!NF) {print sum+0; exit} }' file
182

Subtract corresponding lines

I have two files, file1.csv
3 1009
7 1012
2 1013
8 1014
and file2.csv
5 1009
3 1010
1 1013
In the shell, I want to subtract the count in the first column in the second file from that in the first file, based on the identifier in the second column. If an identifier is missing in the second column, the count is assumed to be 0.
The result would be
-2 1009
-3 1010
7 1012
1 1013
8 1014
The files are huge (several GB). The second columns are sorted.
How would I do this efficiently in the shell?
Assuming that both files are sorted on second column:
$ join -j2 -a1 -a2 -oauto -e0 file1 file2 | awk '{print $2 - $3, $1}'
-2 1009
-3 1010
7 1012
1 1013
8 1014
join will join sorted files.
-j2 will join one second column.
-a1 will print records from file1 even if there is no corresponding row in file2.
-a2 Same as -a1 but applied for file2.
-oauto is in this case the same as -o 0,1.1,2.1, which will print the join field and then the remaining columns from file1 and file2.
-e0 will insert 0 instead of an empty column. This works with -a1 and -a2.
The output from join is three columns like:
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
This is piped to awk, which subtracts column three from column two and reformats the output.
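If your join lacks -o auto, an explicit field list should behave the same here (my sketch, mirroring the POSIX remark in the first answer on this page; 0 denotes the join field):
$ join -j2 -a1 -a2 -o 0,1.1,2.1 -e0 file1 file2 | awk '{print $2 - $3, $1}'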
$ awk 'NR==FNR { a[$2]=$1; next }
{ a[$2]-=$1 }
END { for(i in a) print a[i],i }' file1 file2
7 1012
1 1013
8 1014
-2 1009
-3 1010
It reads the first file into memory, so you should have enough memory available. If you don't have the memory, I would maybe sort -k2 the files first, then sort -m (merge) them and continue with that output:
$ sort -m -k2 -k3 <(sed 's/$/ 1/' file1|sort -k2) <(sed 's/$/ 2/' file2|sort -k2) # | awk ...
3 1009 1
5 1009 2 # previous $2 = current $2 -> subtract
3 1010 2 # previous $2 =/= current and current $3=2 print -$3
7 1012 1
2 1013 1 # previous $2 =/= current and current $3=1 print prev $2
1 1013 2
8 1014 1
(I'm out of time for now, maybe I'll finish it later)
EDIT by Ed Morton
Hope you don't mind me adding what I was working on rather than posting my own extremely similar answer, feel free to modify or delete it:
$ cat tst.awk
{ split(prev,p) }                               # p[] holds the previous line's fields
$2 == p[2] {                                    # same id as the previous line: subtract
    print p[1] - $1, p[2]
    prev = ""
    next
}
p[2] != "" {                                    # previous id had no partner: print it on its own
    print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
}
{ prev = $0 }
END {
    if (prev != "") {                           # flush a trailing unpaired line, if any
        split(prev,p)
        print (p[3] == 1 ? p[1] : 0-p[1]), p[2]
    }
}
$ sort -m -k2 <(sed 's/$/ 1/' file1) <(sed 's/$/ 2/' file2) | awk -f tst.awk
-2 1009
-3 1010
7 1012
1 1013
8 1014
Since the files are sorted¹, you can merge them line-by-line with the join utility in coreutils:
$ join -j2 -o auto -e 0 -a 1 -a 2 41144043-a 41144043-b
1009 3 5
1010 0 3
1012 7 0
1013 2 1
1014 8 0
All those options are required:
-j2 says to join based on the second column of each file
-o auto says to make every row have the same format, beginning with the join key
-e 0 says that missing values should be substituted with zero
-a 1 and -a 2 include rows that are absent from one file or another
the filenames (I've used names based on the question number here)
Now that we have a stream of output in that format, we can do the subtraction on each line. I used this GNU sed command to transform the above output into a dc program:
sed -re 's/.*/c& -n[ ]np/'
This takes the three values on each line and rearranges them into a dc command for the subtraction, then executes it. For example, the first line becomes (with spaces added for clarity)
c 1009 3 5 -n [ ]n p
which subtracts 5 from 3, prints it, then prints a space, then prints 1009 and a newline, giving
-2 1009
as required.
We can then pipe all these lines into dc, giving us the output file that we want:
$ join -o auto -j2 -e 0 -a 1 -a 2 41144043-a 41144043-b \
> | sed -e 's/.*/c& -n[ ]np/' \
> | dc
-2 1009
-3 1010
7 1012
1 1013
8 1014
¹ The sorting needs to be consistent with LC_COLLATE locale setting. That's unlikely to be an issue if the fields are always numeric.
TL;DR
The full command is:
join -o auto -j2 -e 0 -a 1 -a 2 "$file1" "$file2" | sed -e 's/.*/c& -n[ ]np/' | dc
It works a line at a time, and starts only the three processes you see, so should be reasonably efficient in both memory and CPU.
Assuming this is a CSV with blank separation; if the separator is a ",", use the argument -F ','.
awk 'FNR==NR {Inits[$2]=$1; ids[$2]++; next}
{Discounts[$2]=$1; ids[$2]++}
END { for (id in ids) print Inits[ id] - Discounts[ id] " " id}
' file1.csv file2.csv
For memory issues (this could be done in one series of pipes, but I prefer to use a temporary file):
awk 'FNR==NR{print;next}{print -1 * $1 " " $2}' file1 file2 \
| sort -k2 \
> file.tmp
awk 'Last != $2 {                        # new id: flush the previous total
    if (NR != 1) print Result " " Last
    Last = $2; Result = $1
    next
    }
Last == $2 { Result += $1; next }
END { print Result " " Last }
' file.tmp
rm file.tmp

Format output in bash

I have output in bash that I would like to format. Right now my output looks like this:
1scom.net 1
1stservicemortgage.com 1
263.net 1
263.sina.com 1
2sahm.org 1
abac.com 1
abbotsleigh.nsw.edu.au 1
abc.mre.gov.br 1
ableland.freeserve.co.uk 1
academicplanet.com 1
access-k12.org 1
acconnect.com 1
acconnect.com 1
accountingbureau.co.uk 1
acm.org 1
acsalaska.net 1
adam.com.au 1
ada.state.oh.us 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
aecom.yu.edu 1
aecon.com 1
aetna.com 1
agedwards.com 1
ahml.info 1
The problem with this is none of the numbers on the right line up. I would like them to look like this:
1scom.net 1
1stservicemortgage.com 1
263.net 1
263.sina.com 1
2sahm.org 1
Would there be any way to make them look like this without knowing exactly how long the longest domain is? Any help would be greatly appreciated!
The code that outputted this is:
grep -E -o -r "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" $ARCHIVE | sed 's/.*@//' | uniq -ci | sort | sed 's/^ *//g' | awk ' { t = $1; $1 = $2; $2 = t; print; } ' > temp2
ALIGNMENT:
Just use cat with the column command and that's it:
cat /path/to/your/file | column -t
For more details on the column command, refer to http://manpages.ubuntu.com/manpages/natty/man1/column.1.html
EDITED:
View file in terminal:
column -t < /path/to/your/file
(as noted by anishsane)
Export to a file:
column -t < /path/to/your/file > /output/file
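If column is not available, a two-pass awk sketch (assuming GNU awk for the * width specifier, which is also used elsewhere on this page) can compute the widest domain and pad to it; the same file is passed twice, once to measure and once to print:
awk 'NR==FNR { if (length($1) > w) w = length($1); next }   # pass 1: find the widest domain
     { printf "%-*s %s\n", w, $1, $2 }                      # pass 2: left-justify to that width
' /path/to/your/file /path/to/your/file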
