Reading delimited values from a file into an array variable - bash

I want to read data.txt, which contains a 2x2 matrix of numbers delimited by tabs, like this:
0.5 0.1
0.3 0.2
Is there any way to read this file in bash then store it into an array then process it a little then export it to a file again? Like for example in matlab:
a=dlmread('data.txt') //read file to array variable a
for i=1:2
for j=1:2
b[i][j]=a[i][j]+100
end
end
dlmwrite(b,'data2.txt') //exporting array value b to data2.txt

If the extent of your processing is something simple like adding 100 to every entry, a simple awk command like this might work:
awk '{ for(i = 1; i <= NF - 1; i++) { printf("%.1f%s", $i + 100, OFS); } printf("%.1f%s", $NF+100, ORS); }' < matrix.txt
This just loops through each row and adds 100. It's possible to do more complex operations too, but if you really want to process matrices there are better tools (like python+numpy or octave).
It's also possible to use bash arrays, but to do any of the operations you'd have to use an external program anyway, since bash doesn't handle floating point arithmetic.
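As a minimal sketch of the read, process, write cycle from the question, awk can hold the matrix in its own array, since bash cannot do the floating-point addition itself. The file names data.txt and data2.txt come from the question; the array name a, the variable ncol, and the sample input written by printf are assumptions for illustration.

```shell
# Sample input (assumed): the 2x2 tab-delimited matrix from the question.
printf '0.5\t0.1\n0.3\t0.2\n' > data.txt

awk 'BEGIN { FS = OFS = "\t" }
     # Read: store every entry as a[row, column].
     { for (j = 1; j <= NF; j++) a[NR, j] = $j; ncol = NF }
     END {
       # Process and write: add 100 to each entry, row by row.
       for (i = 1; i <= NR; i++)
         for (j = 1; j <= ncol; j++)
           printf "%g%s", a[i, j] + 100, (j < ncol ? OFS : ORS)
     }' data.txt > data2.txt

cat data2.txt
```

The END block mirrors the nested MATLAB loops, so a different per-element operation only needs the expression a[i, j] + 100 changed.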


How can I find both identical and similar strings in a particular field in a text file in Linux?

My apologies ahead of time - I'm not sure that there is an answer for this one using only Linux command-line fu. Please note I am not a programmer, but I have been playing around with bash and python a bit over the last few years.
I have a large text file with rows and columns that resemble the following (note - fields are separated with tabs):
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
3078 Copland 2017GENERAL 07/07/17 Confirmed
3890 Bartok FOODS 09/11/17 Confirmed
5440 Alphapha 00B1106IMNH 01/09/18 Queued
What I want to do is find and output only those rows where the third field is either identical OR similar to another in the list. I don't really care whether the other fields are similar or not, but they should all be included in the output. By similar, I mean no more than [n] characters are different in that particular field (for example, no more than 3 characters are different). So the output I would want would be:
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
5440 Alphapha 00B1106IMNH 01/09/18 Queued
The line beginning 1074 has a third field that differs by 3 characters with 5440, so both of them are included. 3430 and 3431 are included because they are exactly identical. 3078 and 3890 are eliminated because they are not similar.
Through googling the forums I've managed to piece together this rather longish pipeline to be able to find all of the instances where field 3 is exactly identical:
cat inputfile.txt | awk 'BEGIN { OFS=FS="\t" } {if (count[$3] > 1) print $0; else if (count[$3] == 1) { print save[$3]; print $0; } else save[$3] = $0; count[$3]++; }' > outputfile.txt
I must confess I don't really understand awk all that well; I'm just copying and adapting from the web. But that seemed to work great at finding exact duplicates (i.e., it would output only 3430 and 3431 above). But I have no idea how to approach trying to find strings that are not identical but that differ in no more than 3 places.
For instance, in my example above, it should match 1074 and 5440 because they would both fit the pattern:
??B1106?MNH
But I would want it to be able to match also any other random pattern of matches, as long as there are no more than three differences, like this:
20?7G?N?RAL
These differences could be arbitrarily in any position.
The reason for needing this is we are trying to find a way to automatically find typographical errors in a serial-number-like field. There might be a mis-key, or perhaps a letter "O" replaced with a number "0", or the like.
So... any ideas? Thanks for the help!
you can use this script
$ more hamming.awk
function hamming(x,y,xs,ys,min,max,h) {
if(x==y) return 0;
else {
nx=split(x,xs,"");
mx=split(y,ys,"");
min=nx<mx?nx:mx;
max=nx<mx?mx:nx;
for(i=1;i<=min;i++) if(xs[i]!=ys[i]) h++;
return h+(max-min);
}
}
BEGIN {FS=OFS="\t"}
NR==FNR {
if($3 in a) nrs[NR];
for(k in a)
if(hamming(k,$3)<4) {
nrs[NR];
nrs[a[k]];
}
a[$3]=NR;
next
}
FNR in nrs
usage
$ awk -f hamming.awk file{,}
It's a double-scan algorithm that finds the Hamming distance (the one you described) between keys. Note that it's an O(n^2) algorithm, so it may not be suitable for very large data sets. However, I'm not sure any other algorithm can do better.
NB: An additional note based on a comment I missed in the post. This algorithm compares the keys character by character, so displacements won't be identified. For example, 123 and 23 will give a distance of 3.
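To make the distance rule concrete, here is a small standalone sketch of the same position-by-position comparison. The shell function name hamming and its two-argument interface are assumptions for illustration; it uses substr rather than split so that it does not depend on splitting on an empty separator.

```shell
# hamming A B: print the number of differing positions, counting any
# length difference as extra mismatches (the same rule as hamming.awk).
hamming() {
  awk -v x="$1" -v y="$2" 'BEGIN {
    nx = length(x); ny = length(y)
    min = nx < ny ? nx : ny
    max = nx < ny ? ny : nx
    h = max - min                      # unmatched tail counts as mismatches
    for (i = 1; i <= min; i++)
      if (substr(x, i, 1) != substr(y, i, 1)) h++
    print h
  }'
}

hamming OOB11061MNH 00B1106IMNH   # the two similar keys from the question
hamming 123 23                    # a shifted key: every position mismatches
```

Both calls print 3, which shows why a dropped or inserted character looks as bad as three separate typos under this measure.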
Levenshtein distance aka "edit distance" suits your task best. Perl script below requires installing a module Text::Levenshtein (for debian/ubuntu do: sudo apt install libtext-levenshtein-perl).
use Text::Levenshtein qw(distance);
$maxdist = shift;
@ll = (<>);
@k = map {
$k = (split /\t/, $_)[2];
# $k =~ s/O/0/g;
$k; # return the key itself (needed once the line above is uncommented)
} @ll;
for ($i = 0; $i < @ll; ++$i) {
for ($j = 0; $j < @ll; ++$j) {
if ($i != $j and distance($k[$i], $k[$j]) < $maxdist) {
print $ll[$i];
last;
}
}
}
Usage:
perl lev.pl 3 inputfile.txt > outputfile.txt
The algorithm is the same O(n^2) as in @karakfa's post, but matching is more flexible.
Also note the commented line # $k =~ s/O/0/g;. If you uncomment it, then all O's in key will become 0's, which will fix keys damaged by O->0 transformation. When working with damaged data I always use small rules like this to fix data gradually, refining rules from run to run, to the point where data is almost perfect and fuzzy match is no longer needed.
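If installing the Perl module is not an option, the edit distance itself can be sketched in plain awk with the standard dynamic-programming table. This is an illustration, not the module's implementation; the names lev, levenshtein, and min3 are hypothetical.

```shell
# lev A B: print the Levenshtein (edit) distance between two strings.
lev() {
  awk -v a="$1" -v b="$2" '
    function min3(x, y, z) { return x < y ? (x < z ? x : z) : (y < z ? y : z) }
    function levenshtein(s, t,    m, n, i, j, d, cost) {
      m = length(s); n = length(t)
      for (i = 0; i <= m; i++) d[i, 0] = i      # delete all of s
      for (j = 0; j <= n; j++) d[0, j] = j      # insert all of t
      for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
          cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
          d[i, j] = min3(d[i-1, j] + 1,         # deletion
                         d[i, j-1] + 1,         # insertion
                         d[i-1, j-1] + cost)    # substitution or match
        }
      return d[m, n]
    }
    BEGIN { print levenshtein(a, b) }'
}

lev OOB11061MNH 00B1106IMNH   # three substitutions
lev 123 23                    # one deletion
```

Unlike the character-by-character comparison, a shifted key such as 123 versus 23 scores 1 here, because a single deletion lines the two strings up.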

bash awk moving average with skipping

I am trying to calculate a moving average over a data set. In addition, I want it to skip a few data points each time the averaging 'window' moves. For example, if my data set is a column from 1 to 20 and my average window is 5, then the current calculation is the average of (1-5), (2-6), (3-7), (4-8).....
But I want to skip a few data points each time the window moves, say I want to skip 2. Then the new averages will be (1-5), (4-8), (6-10), (8-12)......
Here is the current awk file I am using, can anyone help me edit it so that I can skip a few data each time the window moves? I want to change the skip size and window size as well. Thank you very much!
#!/bin/awk
BEGIN {
N=5 # the window size
}
{
n[NR]=$1 # store the value in an array
}
NR>=N { # for records where NR >= N
x=0 # reset the sum variable
delete n[NR-N] # delete the one out the window of N
for(i in n) # all array elements
x+=n[i] # ... must be summed
print x/N # print the row from the beginning of window
}
I think your ranges are not well specified, but what you want to achieve can be done with parallel windowing, as below:
awk '{sum[1]+=$1}
!(NR%5){print NR-4"-"NR, sum[1]/5; sum[1]=0}
NR>3{sum[4]+=$1}
NR>3 && !((NR-3)%5){print NR-4"-"NR, sum[4]/5; sum[4]=0}' <(seq 15)
will give the output below. You can remove the printed ranges, which are there for debugging.
1-5 3
4-8 6
6-10 8
9-13 11
11-15 13
To make the window size and skip count variable:
awk -v w=5 -v s=3 'function pr(x) {print (NR-w+1)"-"NR, sum[x]/w; sum[x]=0}
{sum[1]+=$1}
NR>s {sum[s+1]+=$1}
!(NR%w) {pr(1)}
NR>s && !((NR-s)%w){pr(s+1)}' file
The first window always starts at 1, and the second window starts at s+1. This can be generalized to more than 2 windows as well; perhaps you can find someone to do it...
I see that you want to print the MA every K ticks instead of at every tick (K=1). So you could add the condition NR%K==0 before printing in your existing code.
But it would be better to keep an array of N elements and overwrite them instead of deleting, using NR%N as the array index. This way, when K is not 1 and you skip calculating the MA on some ticks, you avoid having to work out how many elements to delete, etc.
awk -v n=5 -v k=2 '{ a[NR%n]=$0 }
NR>=n && (NR-n)%k==0 { s=0; for (i in a) s+=a[i]; print NR ":\t" s/n }' file
Update: the condition is now (NR-n)%k==0, so that output always starts from the first tick where the MA is calculated (that is, NR=n).
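A sketch that wraps the overwrite-in-place idea in a reusable function and also prints the range each average covers, which makes the effect of the step easy to check. The function name moving_avg, its argument order (window, then step), and the output format are assumptions for illustration.

```shell
# moving_avg WINDOW STEP: read one number per line on stdin and print
# "first-last average" for every STEP-th window of WINDOW values.
moving_avg() {
  awk -v w="$1" -v s="$2" '
    { buf[NR % w] = $1 }                       # overwrite the oldest slot
    NR >= w && (NR - w) % s == 0 {             # first full window, then every s lines
      sum = 0
      for (i in buf) sum += buf[i]
      printf "%d-%d %g\n", NR - w + 1, NR, sum / w
    }'
}

seq 1 15 | moving_avg 5 3
```

With a window of 5 and a step of 3 over 1 to 15, the windows start at lines 1, 4, 7, and 10, so the start of each window always advances by exactly the step.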

Using awk to interpolate data based on if statement

I am trying to automate a data collection process by using awk to search the file for a certain pattern and plug values into the linear interpolation formula. The data in question tracks time versus position, and I need to interpolate the time at which position equals zero. Example:
100 0.5
200 0.2
300 -0.3
400 -0.7
Then, my interpolation looks like this:
interpolated_time = 200 + (0 - 0.2) * (300 - 200) / (-0.3 - 0.2)
I am going to write the script in bash and use bc calculator for the arithmetic. However, I am inexperienced with using awk and cannot figure out how to correctly search the file.
I want to do something like
awk '{if ($2 > 0) #add another statement to test if $2 < 0 on next line#}'
# If test is successful, store entries in variables or an array
The interpolation may need to be performed multiple times in one file. I may need to output all values in question to an array, and then input the paired indexes into the interpolation formula. (i.e. indices [1,2] [3,4] [5,6] are paired together for separate interpolations)
I know that awk works on a line-by-line test loop, but I don't know if there is a way to incorporate the previous or next line in the test (perhaps something like
next
or
getline
?)
Any suggestions or comments would be greatly appreciated!
This will give you the result 240:
awk '{if(p2>0 && $2<0) print p1-p2*($1-p1)/($2-p2); p1=$1; p2=$2}' file
It doesn't handle the case where 0 is already in the data set, and it assumes the transition is from positive to negative.
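A hedged sketch that covers the two cases the one-liner leaves out: crossings in either direction and samples that sit exactly on zero. The function name zero_crossings is hypothetical, and the two-column time/position layout is taken from the question.

```shell
# zero_crossings [FILE]: print every time at which the position (column 2)
# is zero, interpolating linearly between consecutive samples.
zero_crossings() {
  awk '{
    if ($2 == 0)
      print $1                                   # sample lies exactly on zero
    else if (NR > 1 && p2 * $2 < 0)              # sign change in either direction
      print p1 + (0 - p2) * ($1 - p1) / ($2 - p2)
    p1 = $1; p2 = $2                             # keep this line for the next one
  }' "$@"
}

printf '100 0.5\n200 0.2\n300 -0.3\n400 -0.7\n' | zero_crossings
```

Because the previous line is kept in p1 and p2, there is no need for getline or an intermediate array; every crossing in the file is printed as it is found.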

In a column of numbers, find the closest value to some target value

Let's say I have some numerical data in columns, something like
11.100000 36.829657 6.101642
11.400000 36.402069 5.731998
11.700000 35.953025 5.372652
12.000000 35.482082 5.023737
12.300000 34.988528 4.685519
12.600000 34.471490 4.358360
12.900000 33.930061 4.042693
13.200000 33.363428 3.738985
13.500000 32.770990 3.447709
13.800000 32.152473 3.169312
I also have a single target value and a column index. Given this set of data, I want to find the closest value to the target value in the column with the specified index.
For example, If my target value is 11.6 in column 1, then the script should output 11.7. If there are two numbers equidistant from the target value, then the higher value should be output.
I have a feeling that awk has the necessary functionality to do this, but any solution that works in a bash script is welcome.
try this:
awk -v c=2 -v t=35 'NR==1{d=$c-t;d=d<0?-d:d;v=$c;next}{m=$c-t;m=m<0?-m:m}m<d{d=m;v=$c}END{print v}' file
The -v c=2 and -v t=35 values can be dynamic: they are the column index (c) and your target value (t). In the line above, the parameters are column 2 and target 35. They could also be shell variables.
The output of the line above for the given input data is:
kent$ awk -v c=2 -v t=35 'NR==1{d=$c-t;d=d<0?-d:d;v=$c;next}{m=$c-t;m=m<0?-m:m}m<d{d=m;v=$c}END{print v}' f
34.988528
kent$ awk -v c=1 -v t=11.6 'NR==1{d=$c-t;d=d<0?-d:d;v=$c;next}{m=$c-t;m=m<0?-m:m}m<d{d=m;v=$c}END{print v}' f
11.700000
EDIT
If there are two numbers equidistant from the target value, then the higher value should be output
The code above didn't check this requirement. The version below should work (note that asort() is a gawk extension):
awk -v c=1 -v t=11.6 '{a[NR]=$c}END{
asort(a);d=a[NR]-t;d=d<0?-d:d;v = a[NR]
for(i=NR-1;i>=1;i--){
m=a[i]-t;m=m<0?-m:m
if(m<d){
d=m;v=a[i]
}
}
print v
}' file
test:
kent$ awk -v c=1 -v t=11.6 '{a[NR]=$c}END{
asort(a);d=a[NR]-t;d=d<0?-d:d;v = a[NR]
for(i=NR-1;i>=1;i--){
m=a[i]-t;m=m<0?-m:m
if(m<d){
d=m;v=a[i]
}
}
print v
}' f
11.700000
Short explanation:
I won't explain each line of code, just the idea behind it:
first, read all elements in the given column and save them in an array
sort the array
take the last element of the array (the greatest number), assign it to the variable v, and save the absolute difference between it and the target in d
loop from the second-to-last element down to the first; if the absolute difference between the element and the target is less than d, overwrite d with that difference and save the current element in v
after the loop, print v, which is the answer
Some notes:
there is room to optimize the logic, e.g. we don't have to loop through the whole array: once a new diff is greater than d, we can stop the loop
due to the sort, this algorithm is O(n log n). In fact, this problem can be solved in O(n). If your input data were huge and hit a worst case (e.g. your column has values in the range 500-99999999999, but your target is 1), you may want to avoid the sort. But I assume performance is not an issue for you.
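The O(n) variant mentioned in the notes can be sketched as a single pass that keeps the best distance so far and prefers the higher value on a tie. The function name closest and its argument order are assumptions; it avoids asort, so it does not need gawk.

```shell
# closest COLUMN TARGET [FILE]: print the value in COLUMN nearest to TARGET;
# if two values are equally near, print the higher one.
closest() {
  col=$1; target=$2; shift 2
  awk -v c="$col" -v t="$target" '{
    d = $c - t; if (d < 0) d = -d                # absolute distance
    if (NR == 1 || d < best || (d == best && $c + 0 > v + 0)) {
      best = d; v = $c
    }
  }
  END { if (NR > 0) print v }' "$@"
}

printf '11.100000 36.829657\n11.400000 36.402069\n11.700000 35.953025\n' | closest 1 11.6
```

The tie test uses an exact floating-point comparison, so the caveat about comparing floats for equality applies to inputs where the two distances are not exactly representable.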
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
@ARGV == 2 or die "Usage: closest column value < input\n";
my ($column, $target) = (shift, shift);
my $closest;
while (<>) {
my $value = (split)[$column - 1];
if ($. == 1
or abs($closest - $target) > abs($target - $value)
or abs($closest - $target) == abs($target - $value)
&& $value > $closest) {
$closest = $value;
}
}
print $closest, "\n";
Note that using float == float might not work (What Every Computer Scientist Should Know About Floating-Point Arithmetic). You might need something like abs(abs($closest - $target) - abs($target - $value)) < 1e-14.
Let's try another way, although Kent's answer must be shorter and sharper :)
awk -vc=1 -vv=13.6 '
BEGIN{l=$c; ld=99}
{d=($c-v>=0) ? ($c-v) : v-$c; if (d <= ld) {ld=d; l=$c}}
END{print l}' file
We provide the c (column) and v (value) parameters at the beginning.
Then we keep track of the closest value l and the lowest distance ld. For each value we calculate its distance d to the target, and if it is lower than or equal to the previous ld, we save the new minimum in ld and the value in l. Finally we print l. Note that ld starts at 99, so this assumes at least one value lies within 99 of the target.
The d=($c-v>=0) ? ($c-v) : v-$c expression stores the distance as an absolute value: if $c-v is negative, its sign is flipped. It is based on the value=(condition) ? if yes : else structure.
Tests
$ awk -vc=2 -vv=13.6 'BEGIN{l=$c; ld=99} {d=($c-v>=0) ? ($c-v) : v-$c; if (d <= ld) {ld=d; l=$c}} END{print l}' file
32.152473
$ awk -vc=3 -vv=10.6 'BEGIN{l=$c; ld=99} {d=($c-v>=0) ? ($c-v) : v-$c; if (d <= ld) {ld=d; l=$c}} END{print l}' file
6.101642

Calculating CRC in awk

Has anyone implemented the POSIX 1003.2 compliant CRC algorithm (as output by cksum) in awk/gawk? I need to do a checksum on an in-memory string (not the whole file), and shelling out to call cksum is slow and expensive.
My overall need is to generate a numerical checksum that fits within 10 digits or fewer. Other hash/CRC functions could work too; does anyone have anything handy?
A Google search and a scan of awk.info turned up nothing interesting.
EDIT:
I ended up using the external cksum command, but caching the results into an awk associative array. Performance was good enough and I didn't need to reinvent the wheel.
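The caching approach described in the edit might look roughly like the sketch below. The names cached_cksum, crc, shquote, and cache are hypothetical, and it assumes one string per input line; the only external command used is cksum itself, fed through printf.

```shell
# cached_cksum: for each input line print "CRC<TAB>line", calling the
# external cksum only the first time a given string is seen.
cached_cksum() {
  awk '
    # Quote s for the shell: wrap in single quotes (\047) and escape embedded ones.
    function shquote(s) { gsub(/\047/, "\047\\\047\047", s); return "\047" s "\047" }
    function crc(s,    cmd, out, f) {
      if (!(s in cache)) {                 # shell out only on a cache miss
        cmd = "printf %s " shquote(s) " | cksum"
        cmd | getline out
        close(cmd)
        split(out, f, " ")
        cache[s] = f[1]                    # first field of cksum output is the CRC
      }
      return cache[s]
    }
    { print crc($0) "\t" $0 }'
}

printf 'a\nabc\na\n' | cached_cksum
```

The second "a" in the sample input is answered from the array, so the cost of starting a process is paid once per distinct string rather than once per line.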
gawk/awk implementation of crc32 (compatible with the POSIX cksum command)
Here is an awk (gawk) implementation of crc32.
Notice that we use T and X as our lookup tables. T is used for the crc32_table, and X is used to look up the integer value of each data byte.
If you wish, you can compute the crc32_table at runtime; however, it was a bit slow on startup, so there is a tradeoff between small code size with slow startup and reasonably fast crc32 calculation with large code size. I would recommend the version with a crc_table, as the increase in code size was well justified when the speeds were compared.
If you have a version of awk that does not support and(), xor(), compl(), lshift(), rshift(), then do not forget to load the bitwise operation libs.
BEGIN{
# Initialize CRC32 table
T[0]=0x00000000;
T[1]=0x04c11db7;T[2]=0x09823b6e;T[3]=0x0d4326d9;T[4]=0x130476dc;T[5]=0x17c56b6b;
T[6]=0x1a864db2;T[7]=0x1e475005;T[8]=0x2608edb8;T[9]=0x22c9f00f;T[10]=0x2f8ad6d6;
T[11]=0x2b4bcb61;T[12]=0x350c9b64;T[13]=0x31cd86d3;T[14]=0x3c8ea00a;T[15]=0x384fbdbd;
T[16]=0x4c11db70;T[17]=0x48d0c6c7;T[18]=0x4593e01e;T[19]=0x4152fda9;T[20]=0x5f15adac;
T[21]=0x5bd4b01b;T[22]=0x569796c2;T[23]=0x52568b75;T[24]=0x6a1936c8;T[25]=0x6ed82b7f;
T[26]=0x639b0da6;T[27]=0x675a1011;T[28]=0x791d4014;T[29]=0x7ddc5da3;T[30]=0x709f7b7a;
T[31]=0x745e66cd;T[32]=0x9823b6e0;T[33]=0x9ce2ab57;T[34]=0x91a18d8e;T[35]=0x95609039;
T[36]=0x8b27c03c;T[37]=0x8fe6dd8b;T[38]=0x82a5fb52;T[39]=0x8664e6e5;T[40]=0xbe2b5b58;
T[41]=0xbaea46ef;T[42]=0xb7a96036;T[43]=0xb3687d81;T[44]=0xad2f2d84;T[45]=0xa9ee3033;
T[46]=0xa4ad16ea;T[47]=0xa06c0b5d;T[48]=0xd4326d90;T[49]=0xd0f37027;T[50]=0xddb056fe;
T[51]=0xd9714b49;T[52]=0xc7361b4c;T[53]=0xc3f706fb;T[54]=0xceb42022;T[55]=0xca753d95;
T[56]=0xf23a8028;T[57]=0xf6fb9d9f;T[58]=0xfbb8bb46;T[59]=0xff79a6f1;T[60]=0xe13ef6f4;
T[61]=0xe5ffeb43;T[62]=0xe8bccd9a;T[63]=0xec7dd02d;T[64]=0x34867077;T[65]=0x30476dc0;
T[66]=0x3d044b19;T[67]=0x39c556ae;T[68]=0x278206ab;T[69]=0x23431b1c;T[70]=0x2e003dc5;
T[71]=0x2ac12072;T[72]=0x128e9dcf;T[73]=0x164f8078;T[74]=0x1b0ca6a1;T[75]=0x1fcdbb16;
T[76]=0x018aeb13;T[77]=0x054bf6a4;T[78]=0x0808d07d;T[79]=0x0cc9cdca;T[80]=0x7897ab07;
T[81]=0x7c56b6b0;T[82]=0x71159069;T[83]=0x75d48dde;T[84]=0x6b93dddb;T[85]=0x6f52c06c;
T[86]=0x6211e6b5;T[87]=0x66d0fb02;T[88]=0x5e9f46bf;T[89]=0x5a5e5b08;T[90]=0x571d7dd1;
T[91]=0x53dc6066;T[92]=0x4d9b3063;T[93]=0x495a2dd4;T[94]=0x44190b0d;T[95]=0x40d816ba;
T[96]=0xaca5c697;T[97]=0xa864db20;T[98]=0xa527fdf9;T[99]=0xa1e6e04e;T[100]=0xbfa1b04b;
T[101]=0xbb60adfc;T[102]=0xb6238b25;T[103]=0xb2e29692;T[104]=0x8aad2b2f;T[105]=0x8e6c3698;
T[106]=0x832f1041;T[107]=0x87ee0df6;T[108]=0x99a95df3;T[109]=0x9d684044;T[110]=0x902b669d;
T[111]=0x94ea7b2a;T[112]=0xe0b41de7;T[113]=0xe4750050;T[114]=0xe9362689;T[115]=0xedf73b3e;
T[116]=0xf3b06b3b;T[117]=0xf771768c;T[118]=0xfa325055;T[119]=0xfef34de2;T[120]=0xc6bcf05f;
T[121]=0xc27dede8;T[122]=0xcf3ecb31;T[123]=0xcbffd686;T[124]=0xd5b88683;T[125]=0xd1799b34;
T[126]=0xdc3abded;T[127]=0xd8fba05a;T[128]=0x690ce0ee;T[129]=0x6dcdfd59;T[130]=0x608edb80;
T[131]=0x644fc637;T[132]=0x7a089632;T[133]=0x7ec98b85;T[134]=0x738aad5c;T[135]=0x774bb0eb;
T[136]=0x4f040d56;T[137]=0x4bc510e1;T[138]=0x46863638;T[139]=0x42472b8f;T[140]=0x5c007b8a;
T[141]=0x58c1663d;T[142]=0x558240e4;T[143]=0x51435d53;T[144]=0x251d3b9e;T[145]=0x21dc2629;
T[146]=0x2c9f00f0;T[147]=0x285e1d47;T[148]=0x36194d42;T[149]=0x32d850f5;T[150]=0x3f9b762c;
T[151]=0x3b5a6b9b;T[152]=0x0315d626;T[153]=0x07d4cb91;T[154]=0x0a97ed48;T[155]=0x0e56f0ff;
T[156]=0x1011a0fa;T[157]=0x14d0bd4d;T[158]=0x19939b94;T[159]=0x1d528623;T[160]=0xf12f560e;
T[161]=0xf5ee4bb9;T[162]=0xf8ad6d60;T[163]=0xfc6c70d7;T[164]=0xe22b20d2;T[165]=0xe6ea3d65;
T[166]=0xeba91bbc;T[167]=0xef68060b;T[168]=0xd727bbb6;T[169]=0xd3e6a601;T[170]=0xdea580d8;
T[171]=0xda649d6f;T[172]=0xc423cd6a;T[173]=0xc0e2d0dd;T[174]=0xcda1f604;T[175]=0xc960ebb3;
T[176]=0xbd3e8d7e;T[177]=0xb9ff90c9;T[178]=0xb4bcb610;T[179]=0xb07daba7;T[180]=0xae3afba2;
T[181]=0xaafbe615;T[182]=0xa7b8c0cc;T[183]=0xa379dd7b;T[184]=0x9b3660c6;T[185]=0x9ff77d71;
T[186]=0x92b45ba8;T[187]=0x9675461f;T[188]=0x8832161a;T[189]=0x8cf30bad;T[190]=0x81b02d74;
T[191]=0x857130c3;T[192]=0x5d8a9099;T[193]=0x594b8d2e;T[194]=0x5408abf7;T[195]=0x50c9b640;
T[196]=0x4e8ee645;T[197]=0x4a4ffbf2;T[198]=0x470cdd2b;T[199]=0x43cdc09c;T[200]=0x7b827d21;
T[201]=0x7f436096;T[202]=0x7200464f;T[203]=0x76c15bf8;T[204]=0x68860bfd;T[205]=0x6c47164a;
T[206]=0x61043093;T[207]=0x65c52d24;T[208]=0x119b4be9;T[209]=0x155a565e;T[210]=0x18197087;
T[211]=0x1cd86d30;T[212]=0x029f3d35;T[213]=0x065e2082;T[214]=0x0b1d065b;T[215]=0x0fdc1bec;
T[216]=0x3793a651;T[217]=0x3352bbe6;T[218]=0x3e119d3f;T[219]=0x3ad08088;T[220]=0x2497d08d;
T[221]=0x2056cd3a;T[222]=0x2d15ebe3;T[223]=0x29d4f654;T[224]=0xc5a92679;T[225]=0xc1683bce;
T[226]=0xcc2b1d17;T[227]=0xc8ea00a0;T[228]=0xd6ad50a5;T[229]=0xd26c4d12;T[230]=0xdf2f6bcb;
T[231]=0xdbee767c;T[232]=0xe3a1cbc1;T[233]=0xe760d676;T[234]=0xea23f0af;T[235]=0xeee2ed18;
T[236]=0xf0a5bd1d;T[237]=0xf464a0aa;T[238]=0xf9278673;T[239]=0xfde69bc4;T[240]=0x89b8fd09;
T[241]=0x8d79e0be;T[242]=0x803ac667;T[243]=0x84fbdbd0;T[244]=0x9abc8bd5;T[245]=0x9e7d9662;
T[246]=0x933eb0bb;T[247]=0x97ffad0c;T[248]=0xafb010b1;T[249]=0xab710d06;T[250]=0xa6322bdf;
T[251]=0xa2f33668;T[252]=0xbcb4666d;T[253]=0xb8757bda;T[254]=0xb5365d03;T[255]=0xb1f740b4;
# Init raw data to int lookup table
for(i=0;i<=255;i++)X[sprintf("%c",i)]=i;
}
Then calculate the crc32
# Limit var size to 32bit
function u32(v){return and(v,0xffffffff)}
{
# Lets try with $0 as buf in this example.
buf = $0;
# Step 1) Start CRC32 calculation.
len = 0; #// Total size.
crc=u32( 0 ); #// Initial seed. POSIX compatible crc32 uses 0
# Step 2) Repeat this for as many bufs as necessary; we assume "buf" contains data.
A[0]=split(buf,A,"");
len += A[0]
for(i=1;i<=A[0];i++)crc=u32(xor(u32(lshift(crc,8)),T[u32(and(xor(rshift(crc,24),X[A[i]]),0xFF))]));
# Step 3) End CRC32 calculation. Calculate the total size of buf read, and write into CRC
while(len){crc=u32(xor(u32(lshift(crc,8)),T[u32(and(xor(rshift(crc,24),and(len,0xFF)),0xFF))]));len=rshift(len,8);}
crc=u32(compl(crc));
print "crc=["crc"]";
}
Result
crc=[4294967295] <-- ""
crc=[1220704766] <-- "a"
crc=[1219131554] <-- "abc"
crc=[3644109718] <-- "message digest"
All crc32 values match, exactly like how cksum command does it :-)
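For the runtime-table option mentioned at the start of this answer, here is a sketch of how the 256 entries could be generated instead of listed. It assumes the usual most-significant-bit-first construction with the polynomial 0x04c11db7 (the value of T[1] in the table); the names crc_table and xor32 are hypothetical, and XOR is done with plain arithmetic so that no gawk bit functions are needed.

```shell
# crc_table: print "index hexvalue" for all 256 table entries.
crc_table() {
  awk '
    # Bitwise XOR of two 32-bit values using only arithmetic.
    function xor32(a, b,    r, bit, i) {
      r = 0; bit = 1
      for (i = 0; i < 32; i++) {
        if (a % 2 != b % 2) r += bit
        a = int(a / 2); b = int(b / 2); bit *= 2
      }
      return r
    }
    BEGIN {
      poly = 79764919                      # 0x04c11db7 in decimal
      top = 2147483648                     # 2^31, the most significant bit
      for (n = 0; n < 256; n++) {
        c = n * 16777216                   # put the byte in the top 8 bits
        for (k = 0; k < 8; k++)            # shift left; fold in poly when the top bit falls off
          c = (c >= top) ? xor32((c - top) * 2, poly) : c * 2
        # Print as two 16-bit halves so large values format safely.
        printf "%d %04x%04x\n", n, int(c / 65536), c % 65536
      }
    }'
}

crc_table | head -n 4
```

The first lines should line up with T[0] to T[3] in the listed table; the arithmetic XOR is slow, which fits the remark about startup cost.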
Since cksum uses a large table, it's probably impractical to re-implement it in AWK. You might be able to calculate it on the fly without using a table, but that's likely to be slower than calling cksum.
References:
POSIX
GNU cksum source
Translating it from C to AWK should be fairly trivial, however, if someone were so inclined.
By the way, gawk has coprocesses:
gawk 'BEGIN {
cmd="cksum"
print "hello" |& cmd
close(cmd, "to")
while (cmd |& getline a > 0)
print a
close(cmd)
}'
