Merging two files column- and row-wise in bash

I would like to merge two files, column- and row-wise, but am having difficulty doing so with bash. Here is what I would like to do.
File1:
1 2 3
4 5 6
7 8 9
File2:
2 3 4
5 6 7
8 9 1
Expected output file:
1/2 2/3 3/4
4/5 5/6 6/7
7/8 8/9 9/1
This is just an example. The actual files are two 1000x1000 data matrices.
Any thoughts on how to do this? Thanks!

Or use paste + awk
paste file1 file2 | awk '{ n=NF/2; for(i=1; i<=n; i++) printf "%s/%s ", $i, $(i+n); printf "\n"; }'
Note that this script adds a trailing space after the last value. This can be avoided with a more complicated awk script or by piping the output through an additional command, e.g.
paste file1 file2 | awk '{ n=NF/2; for(i=1; i<=n; i++) printf "%s/%s ", $i, $(i+n); printf "\n"; }' | sed 's/ $//'
An awk solution without the additional sed, thanks to Jonathan Leffler. (I knew it was possible but was too lazy to think it through.)
awk '{ n=NF/2; pad=""; for(i=1; i<=n; i++) { printf "%s%s/%s", pad, $i, $(i+n); pad=" "; } printf "\n"; }'

paste + perl version that works with an arbitrary number of columns without having to hold an entire file in memory:
paste file1.txt file2.txt | perl -MList::MoreUtils=pairwise -lane '
my @a = @F[0 .. (@F/2 - 1)]; # The values from file1
my @b = @F[(@F/2) .. $#F]; # The values from file2
print join(" ", pairwise { "$a/$b" } @a, @b); # Merge them together again'
It uses the non-standard but useful List::MoreUtils module; install through your OS package manager or favorite CPAN client.
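For example, either of the following usually suffices (the Debian/Ubuntu package name shown is the common one, but names vary by distribution):
# Debian/Ubuntu package (name may differ on other distributions):
sudo apt-get install liblist-moreutils-perl
# or, via CPAN:
cpan List::MoreUtils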

Assumptions:
no blank lines in files
both files have the same number of rows
both files have the same number of fields
no idea how many rows and/or fields we'll have to deal with
One awk solution:
awk '
# first file (FNR==NR):
FNR==NR { for ( i=1 ; i<=NF ; i++) # loop through fields
{ line[FNR,i]=$(i) } # store field in array; array index = row number (FNR) + field number (i)
next # skip to next line in file
}
# second file:
{ pfx="" # init printf prefix as empty string
for ( i=1 ; i<=NF ; i++) # loop through fields
{ printf "%s%s/%s", # print our results:
pfx, line[FNR,i], $(i) # prefix, corresponding field from file #1, "/", current field
pfx=" " # prefix for rest of fields in this line is a space
}
printf "\n" # append linefeed on end of current line
}
' file1 file2
NOTES:
remove comments to declutter code
memory usage will climb as the size of the matrix increases (probably not an issue for the smallish fields and the OP's comment about a 1000 x 1000 matrix)
The above generates:
1/2 2/3 3/4
4/5 5/6 6/7
7/8 8/9 9/1

Related

Shell command to sum up numbers across similar lines of text in a file

I have a file with thousands of lines, each containing a number followed by a line of text. I'd like to add up the numbers for the lines whose text is similar. I'd like unique lines to be output as well.
For example:
25 cup of coffee
75 sign on the dotted
28 take a test
2 take a test
12 cup of coffee
The output would be:
37 cup of coffee
75 sign on the dotted
30 take a test
Any suggestions how this could be achieved in unix shell?
I looked at Shell command to sum integers, one per line? but this is about summing up a column of numbers across all lines in a file, not across similar text lines only.
There is no need for multiple processes and pipes; awk alone is more than capable of handling the entire job (and will be orders of magnitude faster on large files). Simply append fields 2 through NF into a string and use that string as the index for summing the field-1 numbers in an array. Then, in the END section, output the contents of the array. Presuming your data is stored in file, you could do:
awk '{
for (i=2; i<=NF; i++)
str = str " " $i
a[str] += $1
str=""
}
END {
for (i in a) print a[i], i
}' file
Above, the first for loop simply appends all fields from 2-NF in str, a[str] += $1 sums the values in field 1 into array a using str as an index. That ensures the values for similar lines are summed. In the END section, you simply loop over each element of the array outputting the element value (the sum) and then the index (original str for fields 2-NF).
Example Use/Output
Just take what is above, select it, and then middle-mouse paste it into a command line in the directory where your file is located (change the name of file to your data file name)
$ awk '{
> for (i=2; i<=NF; i++)
> str = str " " $i
> a[str] += $1
> str=""
> }
> END {
> for (i in a) print a[i], i
> }' file
30 take a test
37 cup of coffee
75 sign on the dotted
If you want the lines sorted in a different order, just add | sort [options] after the filename to pipe the output to sort. For example for output in the order you show, you would use | sort -k 2 and the output would be:
37 cup of coffee
75 sign on the dotted
30 take a test
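Put together, that is the same script with the sort appended:
awk '{
for (i=2; i<=NF; i++)
str = str " " $i
a[str] += $1
str=""
}
END {
for (i in a) print a[i], i
}' file | sort -k 2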
Preserving Original Order Of Strings
Pursuant to your comment about preserving the original order in which the lines of text appear in your input file, you can keep a second array where the strings are stored in the order they are seen, using a sequential index. In the example below, the o array (order array) stores each unique string (fields 2-NF) and the variable n is used as a counter. A loop over the array checks whether the string has already been stored, and if so, next is used to avoid storing it again and to jump to the next input record. In END, a for (i = 0; i < n; i++) loop then outputs the information from both arrays in the order the strings were first seen in the original file, e.g.
awk -v n=0 '{
for (i=2; i<=NF; i++)
str = str " " $i
a[str] += $1
for (i = 0; i < n; i++)
if (o[i] == str) {
str=""
next;
}
o[n++] = str;
str=""
}
END {
for (i = 0; i < n; i++) print a[o[i]], o[i]
}' file
Output
37 cup of coffee
75 sign on the dotted
30 take a test
Here is a simple awk script that does the task:
script.awk
{ # for each input line
inpText = substr($0, length($1)+2); # read the input text after 1st field
inpArr[inpText] = inpArr[inpText] + 0 + $1; # accumulate the 1st field in array
}
END { # post processing
for (i in inpArr) { # for each element in inpArr
print inpArr[i], i; # print the sum and the key
}
}
input.txt
25 cup of coffee
75 sign on the dotted
28 take a test
2 take a test
12 cup of coffee
running:
awk -f script.awk input.txt
output:
75 sign on the dotted
37 cup of coffee
30 take a test
Using datamash is relatively succinct. First use sed to change the first space to a tab (for this job datamash needs one, and only one, tab separator), then use -s -g2 to sort groups by the 2nd field (i.e. "cup" etc.), then use sum 1 to add up the first-column numbers per group, and it's done. Well, not quite -- the number column ends up in the 2nd field, so reverse moves it back to the 1st field:
sed 's/ /\t/' file | datamash -s -g2 sum 1 | datamash reverse
Output:
37 cup of coffee
75 sign on the dotted
30 take a test
You can do the following (assume the name of the file is file.txt):
for key in $(sort -k2 -u file.txt | cut -d ' ' -f2)
do
cat file.txt|grep $key | awk '{s+=$1} END {print $2 "\t" s}'
done
Explanation:
1. get all unique keys (cup of coffee, sign on the dotted, take a test):
sort -k2 -u file.txt | cut -d ' ' -f2
2. grep all lines with unique key from the file:
cat file.txt | grep $key
3. Sum the lines using awk where $1=number column and $2 = key
awk '{s+=$1} END {print $2 "\t" s}'
Put everything in a for loop and iterate over the unique keys.
Note: If a key can be a sub-string of another key, for example "coffee" and "cup of coffee", you will need to change step 2 to grep with a regex, for example as sketched below.
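A sketch of such a regex-based step 2 (this assumes GNU grep, and that $key should match starting right after the leading count):
# Anchor $key right after the count so "coffee" will not also match
# lines whose text is "cup of coffee":
grep -E "^[0-9]+ +$key( |\$)" file.txt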
you mean something like this?
#!/bin/bash
# define a dictionary
declare -A dict
# loop over all lines
while read -r line; do
# read first word as value and the rest as text
IFS=' ' read value text <<< "$line"
# use 'text' as key, get value for 'text', default 0
[ ${dict[$text]+exists} ] && dictvalue="${dict[$text]}" || dictvalue=0
# sum value
value=$(( $dictvalue + value ))
# save new value in dictionary
dict[$text]="$value"
done < data.txt
# loop over dictionary, print sum and text
for key in "${!dict[@]}"; do
printf "%s %s\n" "${dict[$key]}" "$key"
done
output
37 cup of coffee
75 sign on the dotted
30 take a test
Another version based on the same logic as the answer by @David above.
Changes: it omits the loops to speed up the process.
awk '
{
text=substr($0, index($0,$2))
if(!(text in text_sums)){ texts[i++]=text }
text_sums[text]+=$1
}
END {
for (i in texts) print text_sums[texts[i]],texts[i]
}' input.txt
Explanation:
substr returns the string starting at field 2, i.e. the text part.
The array texts stores each text under an integer index, if it is not already present in the text_sums array.
text_sums keeps adding field 1 for the corresponding text.
The reason for a separate array that stores the text values under consecutive integer indexes is to preserve the order of the values (texts) when they are accessed in that same consecutive order.
See Array Intro
The footnote there says:
The ordering will vary among awk implementations, which typically use hash tables to store array elements and values.
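If GNU awk is acceptable, the traversal order of that END loop can be pinned down explicitly instead of relying on the implementation default; a gawk-only sketch of the same script:
gawk '
{
text=substr($0, index($0,$2))
if(!(text in text_sums)){ texts[i++]=text }
text_sums[text]+=$1
}
END {
PROCINFO["sorted_in"]="@ind_num_asc" # gawk-specific: walk texts[] in numeric index order
for (i in texts) print text_sums[texts[i]],texts[i]
}' input.txt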

How to grep for multiple word occurrences from multiple files and list them grouped as rows and columns

Hello: Need your help to count word occurrences from multiple files and output them as row and columns. I searched the site for a similar reference but could not locate, hence posting it here.
Setup:
I have 2 files with the following
[a.log]
id,status
1,new
2,old
3,old
4,old
5,old
[b.log]
id,status
1,new
2,old
3,new
4,old
5,new
Results required
The result I require, preferably using the command line only, is:
file count(new) count(old)
a.log 1 4
b.log 3 2
Script
The script below gives me the count for a single word across multiple files.
I am stuck trying to get results for multiple words. Please help.
grep -cw "old" *.log
You can get this output using gnu-awk that accepts comma separated word to be searched in a command line argument:
awk -v OFS='\t' -F, -v wrds='new,old' 'BEGIN{n=split(wrds, a, /,/); for(i=1; i<=n; i++) b[a[i]]=a[i]} FNR==1{next} $2 in b{freq[FILENAME][$2]++} END{printf "%s", "file" OFS; for(i=1; i<=n; i++) printf "count(%s)%s", a[i], (i==n?ORS:OFS); for(f in freq) {printf "%s", f OFS; for(i=1; i<=n; i++) printf "%s%s", freq[f][a[i]], (i==n?ORS:OFS)}}' a.log b.log | column -t
Output:
file count(new) count(old)
a.log 1 4
b.log 3 2
PS: column -t was only used for formatting the output in tabular format.
Readable awk:
awk -v OFS='\t' -F, -v wrds='new,old' 'BEGIN {
n = split(wrds, a, /,/) # split input words list by comma with int index
for(i=1; i<=n; i++) # store words in another array with key as words
b[a[i]]=a[i]
}
FNR==1 {
next # skip first row from all the files
}
$2 in b {
freq[FILENAME][$2]++ # store filename and word frequency in 2-dimensional array
}
END { # print formatted result
printf "%s", "file" OFS
for(i=1; i<=n; i++)
printf "count(%s)%s", a[i], (i==n?ORS:OFS)
for(f in freq) {
printf "%s", f OFS
for(i=1; i<=n; i++)
printf "%s%s", freq[f][a[i]], (i==n?ORS:OFS)
}
}' a.log b.log
I think you're looking for something like this, but it's not entirely clear what your objectives are (if you're going for efficiency, for example, this isn't particularly efficient)...
for file in *.log; do
    printf '%s\t' "$file"
    for word in "new" "old"; do
        printf '%s\t' "$(grep -cw "$word" "$file")"
    done
    printf '\n'
done
(for readability, I simplified the first line, but this doesn't work if there are spaces in the filenames -- the proper solution is to change the first line to read find . -maxdepth 1 -iname "*.log" | while read -r file; do)
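A sketch of that space-tolerant variant (it still assumes filenames without embedded newlines):
# Read filenames from find instead of globbing, so spaces survive.
find . -maxdepth 1 -iname "*.log" | while read -r file; do
    printf '%s\t' "$file"
    for word in new old; do
        printf '%s\t' "$(grep -cw "$word" "$file")"
    done
    printf '\n'
done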
for c in a b ; do egrep -o "new|old" $c.log | sort | uniq -c > $c.luc; done
Get rid of the header lines with grep, then sort and count.
join -1 2 -2 2 a.luc b.luc
> new 1 3
> old 4 2
Placing a new header is left as an exercise for the reader. Is there a flip command for unix/linux/bash to transpose a table, or whatever you would call it?
Handling empty cells is left as an exercise too, but possible with join.
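On the flip question: if GNU datamash (used in other answers on this page) is available, its transpose operation swaps rows and columns; for the space-separated join output above, something like this should work:
# -t ' ' tells datamash the fields are space-separated rather than tab-separated
join -1 2 -2 2 a.luc b.luc | datamash -t ' ' transpose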
Without real multi-dimensional array support, this will count all values in field 2, not just "new"/"old". The header and the number of columns are dynamic with the number of distinct values as well.
$ awk -F, 'NR==1 {fs["file"]}
FNR>1 {c[FILENAME,$2]++; fs[FILENAME]; ks[$2];
c["file",$2]="count("$2")"}
END {for(f in fs)
{printf "%s", f;
for(k in ks) printf "%s", OFS c[f,k];
printf "\n"}}' file{1,2} | column -t
file count(new) count(old)
file1 1 4
file2 3 2
Awk solution:
awk 'BEGIN{
FS=","; OFS="\t"; print "file","count(new)","count(old)";
f1=ARGV[1]; f2=ARGV[2] # get filenames
}
FNR==1{ next } # skip the 1st header line
NR==FNR{ c1[$2]++; next } # accumulate occurrences of the 2nd field in 1st file
{ c2[$2]++ } # accumulate occurrences of the 2nd field in 2nd file
END{
print f1, c1["new"], c1["old"];
print f2, c2["new"], c2["old"]
}' a.log b.log
The output:
file count(new) count(old)
a.log 1 4
b.log 3 2

Merging word counts with Bash and Unix

I made a Bash script that extracts words from a text file with grep and sed, then sorts them with sort and counts the repetitions with wc, then sorts again by frequency. The example output looks like this:
12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy
Now I'd like to merge all words with the same frequency into one line, like this:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
Is there any way to do that with Bash and the standard Unix toolset? Or would I have to write a script / program in some more sophisticated scripting language?
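For reference, the question's extraction script is not shown, but a pipeline of roughly this shape (a sketch, not the original; words.txt is a placeholder input name) would produce input in that format:
grep -oE '[[:alpha:]]+' words.txt |
tr '[:upper:]' '[:lower:]' |
sort | uniq -c | sort -rn |
awk '{print $1, $2}' # strip the leading padding that uniq -c adds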
With awk:
$ echo "12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy" | awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} END {for (e in cnt) print e, cnt[e]} ' | sort -nr
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
You can do something similar with Bash 4 associative arrays. awk is easier and POSIX though. Use that.
Explanation:
awk splits the line apart by the separator in FS, in this case the default of horizontal whitespace;
$1 is the first field of the count - use that to collect items with the same count in an associative array keyed by the count with cnt[$1];
cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2 is a ternary assignment: if cnt[$1] has no value, just assign the second field $2 to it (the right-hand side of :). If it does have a previous value, concatenate $2 onto it, separated by the value of OFS (the left-hand side of :);
At the end, print out the value of the associative array.
Since awk associative arrays are unordered, you need to sort again by the numeric value of the first column. gawk can sort internally, but it is just as easy to call sort. The input to awk does not need to be sorted, so you can eliminate that part of the pipeline.
If you want the digits to be right-justified (as you have in your example):
$ awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2}
END {for (e in cnt) printf "%3s %s\n", e, cnt[e]} '
If you want gawk to sort numerically by descending values, you can add PROCINFO["sorted_in"]="@ind_num_desc" prior to traversing the array:
$ gawk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2}
END {PROCINFO["sorted_in"]="@ind_num_desc"
for (e in cnt) printf "%3s %s\n", e, cnt[e]} '
With a single GNU awk expression (without the sort pipeline):
awk 'BEGIN{ PROCINFO["sorted_in"]="@ind_num_desc" }
{ a[$1]=(a[$1])? a[$1]" "$2:$2 }END{ for(i in a) print i,a[i]}' file
The output:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
Bonus alternative solution using GNU datamash tool:
datamash -W -g1 collapse 2 <file
The output (comma-separated collapsed fields):
12 the
7 code,with,add
5 quite
3 do,well
1 quick,can,pick,easy
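If the space-separated form shown in the question is preferred, the commas can simply be translated back (safe here because the collapsed words themselves contain no commas):
datamash -W -g1 collapse 2 <file | tr ',' ' '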
awk:
awk '{a[$1]=a[$1] FS $2}!b[$1]++{d[++c]=$1}END{while(i++<c)print d[i],a[d[i]]}' file
sed:
sed -r ':a;N;s/(\b([0-9]+).*)\n\s*\2/\1/;ta;P;D'
You start with sorted data, so you only need a new line when the first field changes.
echo "12 the
7 code
7 with
7 add
5 quite
3 do
3 well
1 quick
1 can
1 pick
1 easy" |
awk '
{
if ($1==last) {
printf(" %s",$2)
} else {
last=$1;
printf("%s%s",(NR>1?"\n":""),$0)
}
}; END {print ""}'
next time you find yourself trying to manipulate text with a combination of grep and sed and shell and..., stop and just use awk instead - the end result will be clearer, simpler, more efficient, more portable, etc...
$ cat file
It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness.
$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
for (i=1; i<NF; i++) {
word2cnt[tolower($i)]++
}
}
END {
for (word in word2cnt) {
cnt = word2cnt[word]
cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
printf "%3d %s\n", cnt, word
}
for (cnt in cnt2words) {
words = cnt2words[cnt]
# printf "%3d %s\n", cnt, words
}
}
$
$ awk -f tst.awk file | sort -rn
4 was
4 the
4 of
4 it
2 times
2 age
1 worst
1 wisdom
1 foolishness
1 best
$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
for (i=1; i<NF; i++) {
word2cnt[tolower($i)]++
}
}
END {
for (word in word2cnt) {
cnt = word2cnt[word]
cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
# printf "%3d %s\n", cnt, word
}
for (cnt in cnt2words) {
words = cnt2words[cnt]
printf "%3d %s\n", cnt, words
}
}
$
$ awk -f tst.awk file | sort -rn
4 it was of the
2 age times
1 best worst wisdom foolishness
Just uncomment whichever printf line you like in the above script to get whichever type of output you want. The above will work in any awk on any UNIX system.
Using miller's nest verb:
mlr -p nest --implode --values --across-records -f 2 --nested-fs ' ' file
Output:
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy

Average of first ten numbers of text file using bash

I have a file of two columns. The first column is dates and the second contains a corresponding number. The two columns are separated by a comma. I want to take the average of the first three numbers and print it to a new file, then do the same for the 2nd-4th numbers, then the 3rd-5th, and so on. For example:
File1
date1,1
date2,1
date3,4
date4,1
date5,7
Output file
2
2
4
Is there any way to do this using awk or some other tool?
Input
akshay@db-3325:/tmp$ cat file.txt
date1,1
date2,1
date3,4
date4,1
date5,7
akshay@db-3325:/tmp$ awk -v n=3 -v FS=, '{
x = $2;
i = NR % n;
ma += (x - q[i]) / n;
q[i] = x;
if(NR>=n)print ma;
}' file.txt
2
2
4
Or the one below, which is useful for plotting since it keeps the reference axis (in your case the date) at the center of each averaging window:
Script
akshay@db-3325:/tmp$ cat avg.awk
BEGIN {
m=int((n+1)/2)
}
{L[NR]=$2; sum+=$2}
NR>=m {d[++i]=$1}
NR>n {sum-=L[NR-n]}
NR>=n{
a[++k]=sum/n
}
END {
for (j=1; j<=k; j++)
print d[j],a[j] # remove d[j], if you just want values only
}
Output
akshay@db-3325:/tmp$ awk -v n=3 -v FS=, -v OFS=, -f avg.awk file.txt
date2,2
date3,2
date4,4
$ awk -F, '{a[NR%3]=$2} (NR>=3){print (a[0]+a[1]+a[2])/3}' file
2
2
4
A little math trick here: $2 is stored in a[NR%3] for each record, so the three slots are updated cyclically, and a[0]+a[1]+a[2] is always the sum of the last 3 numbers.
Updated based on the helpful feedback from Ed Morton.
Here's a quick and dirty script to do what you've asked for. It doesn't have much flexibility in it, but you can easily figure out how to extend it.
To run it, save it into a file and execute it as an awk script, either with a shebang line or by calling awk -f.
// {
Numbers[NR]=$2;
if ( NR >= 3 ) {
printf("%i\n", (Numbers[NR] + Numbers[NR-1] + Numbers[NR-2])/3)
}
}
BEGIN {
FS=","
}
Explanation:
Line 1: Match all lines; "/" is the match operator and in this case we have an empty match, which means "do this thing on every line".
Line 3: Use the record number (NR) as the key and store the value from column 2.
Line 4: If we have read 3 or more values from the file...
Line 5: ...do the maths and print the result as an integer.
BEGIN block: Change the field separator to a comma ",".

Add leading zeroes to awk variable

I have the following awk command within a "for" loop in bash:
awk -v pdb="$pdb" 'BEGIN {file = 1; filename = pdb"_" file ".pdb"}
/ENDMDL/ {getline; file ++; filename = pdb"_" file ".pdb"}
{print $0 > filename}' < ${pdb}.pdb
This reads a series of files with the name $pdb.pdb and splits them in files called $pdb_1.pdb, $pdb_2.pdb, ..., $pdb_21.pdb, etc. However, I would like to produce files with names like $pdb_01.pdb, $pdb_02.pdb, ..., $pdb_21.pdb, i.e., to add padding zeros to the "file" variable.
I have tried without success using printf in different ways. Help would be much appreciated.
Here's how to create leading zeros with awk:
# echo 1 | awk '{ printf("%02d\n", $1) }'
01
# echo 21 | awk '{ printf("%02d\n", $1) }'
21
Replace the 2 in %02d with the total number of digits you need (including the leading zeros).
Replace file in the filename assignment with sprintf("%02d", file).
Or replace the whole assignment with filename = sprintf("%s_%02d.pdb", pdb, file).
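Applied to the command from the question, that change looks roughly like this:
awk -v pdb="$pdb" 'BEGIN {file = 1; filename = sprintf("%s_%02d.pdb", pdb, file)}
/ENDMDL/ {getline; file++; filename = sprintf("%s_%02d.pdb", pdb, file)}
{print $0 > filename}' < "${pdb}.pdb"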
This does it without resorting to printf, which is expensive. The first parameter is the string to pad, the second is the total length after padding.
echo 722 8 | awk '{ for(c = 0; c < $2; c++) s = s"0"; s = s$1; print substr(s, 1 + length(s) - $2); }'
If you know in advance the length of the result string, you can use a simplified version (say 8 is your limit):
echo 722 | awk '{ s = "00000000"$1; print substr(s, 1 + length(s) - 8); }'
The result in both cases is 00000722.
Here is a function that left or right-pads values with zeroes depending on the parameters: zeropad(value, count, direction)
function zeropad(s,c,d) {
if(d!="r")
d="l" # l is the default and fallback value
return sprintf("%" (d=="l"? "0" c:"") "d" (d=="r"?"%0" c-length(s) "d":""), s,"")
}
{ # test main
print zeropad($1,$2,$3)
}
Some tests:
$ cat test
2 3 l
2 4 r
2 5
a 6 r
The test:
$ awk -f program.awk test
002
2000
00002
000000
It's not fully battle-tested, so strange parameters may yield strange results.
