How to calculate Adler32 checksum for zip in Bash?

I need to get the Adler32 checksum and store it in a variable in bash.
It will be used in an automated script, so it would be best if no additional app/library that the user would need to install is required.
Is it possible to get this value using common/basic Bash commands?

This is monumentally slow (about 60,000 times slower than C), but it shows that yes, it is possible.
#!/bin/bash
sum1=1
sum2=0
while LANG=C IFS= read -r -d '' -n 1 ch; do   # read one byte at a time (NUL reads as empty, giving 0 below)
printf -v val '%d' "'$ch"                     # ordinal value of that byte
(( val = val < 0 ? val + 256 : val, sum1 = (sum1 + val) % 65521, sum2 = (sum2 + sum1) % 65521 ))
done
(( adler = sum1 + 65536 * sum2 ))
echo $adler
Hopefully someone who actually knows bash could vastly improve on this.
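For example, if the script above is saved as adler32.sh (a name chosen here just for illustration), it reads its data from stdin, so it can be run against a file and captured into a variable:
# hypothetical usage of the script above
adler=$(bash adler32.sh < archive.zip)
echo "$adler"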

Maybe this solution? Note that zlib.adler32 needs the file's bytes, not its name (and in Python 3 a str argument raises a TypeError), so read the file first:
python -c "import sys, zlib; print(zlib.adler32(open(sys.argv[1], 'rb').read()))" "$file"
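To store the result in a variable, as the question asks, command substitution should work; a small sketch (assuming python is on the PATH and $file names an existing file):
file="archive.zip"   # hypothetical input file
adler=$(python -c "import sys, zlib; print(zlib.adler32(open(sys.argv[1], 'rb').read()))" "$file")
echo "$adler"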

I tried two Adler32 bash functions:
one with an ord dictionary (associative array) and one with printf.
I also tried some bit shifting, like
instead of sum1=(sum1+val)%65521 -> temp=(sum1+val), sum1=((temp>>16)*15 + (temp&65535))%65521
but I wasn't able to improve it a lot; perhaps somebody knows a faster one.
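For reference, that folding trick works because 2^16 mod 65521 = 15, so x = (x>>16)*65536 + (x&65535) ≡ (x>>16)*15 + (x&65535) (mod 65521). A minimal sketch of a correct reduction helper (my own illustration, not from the post above):
# mod65521: fold the high 16 bits down until x fits in 16 bits,
# then subtract 65521 at most once
mod65521 () {
    local -i x=$1
    while (( x >> 16 )); do
        (( x = (x >> 16) * 15 + (x & 65535) ))
    done
    (( x >= 65521 )) && (( x -= 65521 ))
    printf '%d' "$x"
}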
The last function is an awk function; it is the fastest, and it also works on files.
#!/bin/bash
a=$'Hello World'; b=""
for ((i=0;i<1000;i++)); do b+=$a; done
#-- building associative array ord byte character array
declare -Ai ordCHAR=()
for ((i=1;i<256;i++)); do printf -v hex "%x" $i; printf -v char "\x$hex"; ordCHAR[$char]=$i; done
unset hex char i
#-- building associative array ord byte character array -- END
#-- with dictionary
function adler32_A ()
{
local char; local -i sum1=1 sum2=0 val
LC_ALL=C; while read -rN 1 char; do
val=${ordCHAR[$char]};
((sum1=(sum1+val) % 65521, sum2 = (sum2 + sum1) % 65521 ))
done <<< "$1"
#-- remove the trailing 0A (\n) that the here string appends
(( sum2-=sum1, sum2<0 ? sum2+=65521 :0, sum1-=val, sum1<0 ? sum1+=65521 :0 ));
printf "%08x" $(( (sum2 << 16) + sum1 ))
LC_ALL=""
}
#-- with printf
function adler32_B ()
{
local char; local -i sum1=1 sum2=0 val
LC_ALL=C; while read -rN 1 char;
do
printf -v val '%d' "'$char"
(( sum1 = (sum1 + val) % 65521, sum2 = (sum2 + sum1) % 65521 ))
done <<< "$1"
#-- remove the trailing 0A (\n) that the here string appends
(( sum2-=sum1, sum2<0 ? sum2+=65521 :0, sum1-=val, sum1<0 ? sum1+=65521 :0 ));
printf "%x" $((sum1 + 65536 * sum2 ))
LC_ALL=""
}
#-- call adler32_awk [text STR] [evaluate text as path bINT]
function adler32_awk ()
{
local -i bPath=$2;
awk -b \
' BEGIN {RS="^$"; bPath='"$bPath"'; for (i=0;i<256;i++) charOrdARR[sprintf("%c",i)]=i; A=1; B=0;}
{
recordSTR=substr($0,1,length($0)-1); if (bPath) {getline byte_data < recordSTR; close(recordSTR);} else byte_data=recordSTR;
l=length(byte_data); for (i=1;i<=l;i++) {
A+=charOrdARR[substr(byte_data,i,1)]; if (A>65520) A-=65521;
B+=A; if (B>65520) B-=65521;}
printf "%x", lshift(B,16)+A; }
' <<< "$1"
}
time adler32_A "$b"
time adler32_B "$b"
#-- adler 32 of file -> adler32_awk "/home/.../your file" 1
time adler32_awk "$b"

Related

Generic "append to file if not exists" function in Bash

I am trying to write a util function in a bash script that can take a multi-line string and append it to the supplied file if it does not already exist.
This works fine using grep if the pattern does not contain \n.
if grep -qF "$1" "$2"
then
return 1
else
echo "$1" >> "$2"
fi
Example usage
append 'sometext\nthat spans\n\tmutliple lines' ~/textfile.txt
I am on macOS, by the way, which has presented some problems, since some of the solutions I've seen posted elsewhere are very Linux-specific. I'd also like to avoid installing any other tools to achieve this if possible.
Many thanks
If the files are small enough to slurp into a Bash variable (you should be OK up to a megabyte or so on a modern system), and don't contain NUL (ASCII 0) characters, then this should work:
IFS= read -r -d '' contents <"$2"
if [[ "$contents" == *"$1"* ]]; then
return 1
else
printf '%s\n' "$1" >>"$2"
fi
In practice, the speed of Bash's built-in pattern matching might be more of a limitation than the ability to slurp the file contents.
See the accepted, and excellent, answer to Why is printf better than echo? for an explanation of why I replaced echo with printf.
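The short version: echo may interpret its argument, printf will not. A quick illustration (any data that looks like an echo option shows the problem):
s='-n'
echo "$s"            # prints nothing: -n is taken as an option by bash's echo
printf '%s\n' "$s"   # reliably prints -n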
Using awk:
awk '
BEGIN {
n = 0 # length of pattern in lines
m = 0 # number of matching lines
}
NR == FNR {
pat[n++] = $0
next
}
{
if ($0 == pat[m])
m++
else if (m > 0 && $0 == pat[0])
m = 1
else
m = 0
}
m == n {
exit
}
END {
if (m < n) {
for (i = 0; i < n; i++)
print pat[i] >>FILENAME
}
}
' - "$2" <<EOF
$1
EOF
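Assuming the snippet above is placed in the body of the append function from the question (so $1 is the multi-line pattern and $2 the target file), usage would look like:
append "$(printf 'sometext\nthat spans\n\tmultiple lines')" ~/textfile.txt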
If necessary, one would need to properly escape any regex metacharacters inside FS/OFS:
jot 7 9 |
{m,g,n}awk 'BEGIN { FS = OFS = "11\n12\n13\n"
_^= RS = (ORS = "") "^$" } _<NF || ++NF'
9
10
11
12
13
14
15
jot 7 -2 | (... awk stuff ...)
-2
-1
0
1
2
3
4
11
12
13

Java String.hashCode() implementation in Bash

I am trying to implement the String.hashCode() function in Bash. I couldn't figure out the bug.
This is my sample implementation:
function hashCode(){ #similar function to java String.hashCode()
foo=$1
echo $foo
h=0
for (( i=0; i<${#foo}; i++ )); do
val=$(ord ${foo:$i:1})
echo $val
if ((31 * h + val > 2147483647))
then
h=$((-2147483648 + (31 * h + val) % 2147483648 ))
elif ((31 * h + val < -2147483648))
then
h=$(( 2147483648 - ( 31 * h + val) % 2147483648 ))
else
h=$(( 31 * h + val))
fi
done
printf %d $h
}
function ord() { # ASCII to int conversion
LC_CTYPE=C printf %d "'$1"
}
Java function looks like this
public int hashCode() {
int h = hash;
if (h == 0 && value.length > 0) {
char val[] = value;
for (int i = 0; i < value.length; i++) {
h = 31 * h + val[i];
}
hash = h;
}
return h;
}
Expected output for string "__INDEX_STAGING_DATA__0_1230ee6d-c37a-46cf-821c-55412f543fa6" is "1668783629" but the output is -148458597
Note: it has to handle Java int overflow and underflow.
Vinujan, your code is working for the purpose of hashing a given string using the algorithm you have included. You do not need the ord function as you can cause the literal conversion to ASCII value with printf -v val "%d" "'${foo:$i:1}" (unless you need the LC_CTYPE=C for character set differences).
For example, with just minor tweaks to your code, it will hash the string "hello" properly:
#!/bin/bash
function hashCode() {
local foo="$1"
local -i h=0
for ((i = 0; i < ${#foo}; i++)); do
printf -v val "%d" "'${foo:$i:1}" # val is ASCII val
if ((31 * h + val > 2147483647)) # hash scheme
then
h=$((-2147483648 + (31 * h + val) % 2147483648 ))
elif ((31 * h + val < -2147483648))
then
h=$(( 2147483648 - ( 31 * h + val) % 2147483648 ))
else
h=$(( 31 * h + val))
fi
done
printf "%d" $h # final hashCode in decimal
}
hash=$(hashCode "$1")
printf "\nhashCode: 0x%02x (%d decimal)\n" $hash $hash
Example Use/Output
$ bash hashcode.sh hello
hashCode: 0x5e918d2 (99162322 decimal)
Where you appear to have problems is in the hashing algorithm itself. For example, a longer string like "password" will result in your scheme returning a negative 64-bit value that looks suspect, e.g.:
$ bash hashcode.sh password
hashCode: 0xffffffffb776462d (-1216985555 decimal)
This may be your intended hash, I have nothing to compare the algorithm against. Look things over, and if you still have problems, edit your question and describe exactly what problems/error/etc. you are getting when you run the script and add that output to your question.
Edit of Hash Function for Better Behavior
Without an algorithm to implement, the only thing I can do is reformulate the algorithm you provided to be better behaved when the calculations exceed INT_MAX/INT_MIN. Looking at your existing algorithm, it appeared to make the problems worse as large numbers were encountered, rather than smoothing the values to ensure they remained within the bounds.
Frankly, it looked like you had omitted subtracting INT_MIN or adding INT_MAX to h before reducing the value modulo 2147483648 when it exceeded/fell below those limits (e.g. you forgot the parentheses around the subtraction and addition). Simply adding that to the hash algorithm seemed to produce better behavior and your desired output.
I also save the result of your hash calculation in hval, so that it is not computed multiple times each loop, e.g.
function hashCode() {
local foo="$1"
local -i h=0
for ((i = 0; i < ${#foo}; i++)); do
printf -v val "%d" "'${foo:$i:1}" # val is ASCII val
hval=$((31 * h + val))
if ((hval > 2147483647)) # hash scheme
then
h=$(( (hval - 2147483648) % 2147483648 ))
elif ((hval < -2147483648))
then
h=$(( (hval + 2147483648) % 2147483648 ))
else
h=$(( hval ))
fi
done
printf "%d" $h # final hashCode in decimal
}
New Values
Note the hash for "hello" remains the same (as you would expect), but the value for "password" is now better behaved and returns what looks like would be expected, instead of some sign-extended 64-bit value. E.g.,
$ bash hashcode2.sh hello
hashCode: 0x5e918d2 (99162322 decimal)
$ bash hashcode2.sh password
hashCode: 0x4889ba9b (1216985755 decimal)
And note, it does produce your expected output:
$ bash hashcode2.sh "__INDEX_STAGING_DATA__0_1230ee6d-c37a-46cf-821c-55412f543fa6"
hashCode: 0x63779e0d (1668783629 decimal)
Let me know if that is more what you were attempting to do.
I got a lean solution:
hashCode() {
o=$1
h=0
for j in $(seq 1 ${#o})
do
a=$((j-1))
c=${o:$a:1}
v=$(echo -n "$c" | od -d)
i=${v:10:3}
h=$((31 * $h + $i ))
# echo -n a $a c $c i $i h $h
h=$(( (2**31-1) & $h ))
# echo -e "\t"$h
done
echo $h
}
which was wrong. :) The error was in my clever bitwise XOR, (2**31-1) ^ $h; a bitwise AND seems a bit wiser: (2**31-1) & $h
This might be condensed to:
hashCode() {
o=$1
h=0
for j in $(seq 1 ${#o})
do
v=$(echo -n "${$o:$((j-1)):1}" | od -d)
h=$(( (31 * $h + ${v:10:3}) & (2**31-1) ))
done
echo $h
}
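For comparison, since bash arithmetic is 64-bit, Java's 32-bit signed wrap-around can also be emulated directly by masking and sign-extending; a minimal sketch (my own, not from the answers above):
# to_int32: keep the low 32 bits, then sign-extend, reproducing Java int overflow
to_int32 () {
    local -i x=$(( $1 & 0xFFFFFFFF ))
    (( x >= 0x80000000 )) && (( x -= 0x100000000 ))
    printf '%d' "$x"
}
# e.g. inside the hash loop: h=$(to_int32 $(( 31 * h + val )))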

How to merge files line by line in bash

My files look like
file0 file1 file2
a 1 ##
a 1 ##
b 2 ##
b 2 ##
and I want to merge these files lines by lines, so it should look like
merged file
a
a
1
1
##
##
b
b
2
2
##
##
I mean, choose some lines for each file and merge them into one file.
I tried the bash script below.
touch ini.dat
n=2
linenum=$(wc -l < file0)
iter=$((linenum/n))
for i in $(seq 0 1 $iter)
do
for j in $(seq 0 1 2)
do
awk 'NR > '$(($i*$n))' && NR <= '$((($i+1)*$n))'' file"$j" > tmp
cat ini.dat tmp > tmpp
cp tmpp ini.dat
rm tmpp
done
done
It works fine, but takes too much time. Is there any efficient way?
Limiting Factors
Your script had two flaws which made it slow:
A lot of files were created and copied. Especially the ... > tmp; cat ini.dat tmp > tmpp; cp tmpp ini.dat could have been written as ... >> ini.dat.
To read the i-th line of a file, the script has to scan that file from the beginning until the i-th line is reached. If done repeatedly for i = 1, 2, 3, ..., n this takes O(n²) in total. Reading the whole file once (O(n)) into an array and accessing the lines by indices (O(1)) only takes O(n); see the mapfile sketch below.
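For example, mapfile (a bash 4+ builtin) performs exactly that single O(n) read:
# slurp the whole file into an array once; each line is then an O(1) lookup
mapfile -t lines < file0
echo "${lines[3]}"   # the fourth line (indices are zero-based)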
Pure Bash Solution
The following bash script does the job a bit faster. linesPerBlock corresponds to the parameter n from your script. The script will print as many blocks as possible. That is:
Once the shortest input file has been printed, the script terminates. Remaining lines from longer files will not be printed.
If the shortest input file's number of lines is not divisible by n, the last lines (fewer than n) will be omitted.
#! /bin/bash
files=(file{0..2})
linesPerBlock=2
starts=(0)
maxLines=9223372036854775807 # bash's max. number
for i in "${!files[#]}"; do
lineCount="$(wc -l < "${files[i]}")"
(( lineCount < maxLines )) && (( maxLines = lineCount ))
(( starts[i+1] = starts[i] + maxLines ))
mapfile -t -O "${starts[i]}" -n "$maxLines" lines < "${files[i]}"
done
for (( b = 0; b < maxLines / linesPerBlock; ++b )); do
for f in "${!files[#]}"; do
start="${starts[f]}"
for (( i = 0; i < linesPerBlock; ++i )); do
echo "${lines[start + b*linesPerBlock + i]}"
done
done
done > outputFile
This awk should do the job and will be much quicker than your shell script:
awk 'fn != FILENAME {
fn = FILENAME
n = 1
}
NF {
a[FILENAME,n++] = $0
}
END {
for(i=0; i<(n-1)/2; i++) {
for(j=1; j<ARGC; j++)
printf "%s\n%s\n", a[ARGV[j],i*2+1], a[ARGV[j],i*2+2];
print ""
}
}' file{0..2}
a
a
1
1
##
##
b
b
2
2
##
##
In a single line:
awk 'fn != FILENAME{fn=FILENAME; n=1} NF{a[FILENAME,n++]=$0} END{for(i=0; i<(n-1)/2; i++) { for(j=1; j<ARGC; j++) printf "%s\n%s\n", a[ARGV[j],i*2+1], a[ARGV[j],i*2+2]; print "" } }' file{0..2}
Here is another awk, which does not cache all the contents:
paste file{0..2} | awk -v n=2 '
function pr() {for(j=1;j<=NF;j++)
for(i=0;i<n;i++) print a[i,j]}
{for(j=1;j<=NF;j++) a[c+0,j]=$j; c++}
!(NR%n) {pr(); delete a; c=0}
END {pr()}'
If the number of lines is not divisible by n, it will fill up with empty lines.
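For the sample files above, the paste step joins corresponding lines with tabs before awk re-splits them into blocks; a quick look at the intermediate data:
paste file0 file1 file2
# a	1	##
# a	1	##
# b	2	##
# b	2	##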

Using Shell tools (sed | awk... etc) to compute max, min and average field values from a given sample.dat file

I have a sample.dat file which contains experiment values for 10 different
fields, recorded over time. Using sed, awk or any other shell tool, I need to write a script that reads in the sample.dat file and for each field computes the max, min and average.
sample.dat
field1:experiment1: 10.0
field2:experiment1: 12.5
field1:experiment2: 5.0
field2:experiment2: 14.0
field1:experiment3: 18.0
field2:experiment3: 3.5
Output
field1: MAX = 18.0, MIN = 5.0, AVERAGE = 11.0
field2: MAX = 14.0, MIN = 3.5, AVERAGE = 10.0
awk -F: '
{
sum[$1]+=$3;
if(!($1 in min) || (min[$1]>$3))
min[$1]=$3;
if(!($1 in max) || (max[$1]<$3))
max[$1]=$3;
count[$1]++
}
END {
for(element in sum)
printf("%s: MAX=%.1f, MIN=%.1f, AVARAGE=%.1f\n",
element,max[element],min[element],sum[element]/count[element])
}' sample.dat
Output
field1: MAX=18.0, MIN=5.0, AVERAGE=11.0
field2: MAX=14.0, MIN=3.5, AVERAGE=10.0
Here is a Perl solution I made (substitute the file name for whatever file you use):
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max min sum);
open( my $fh, "<", "sample.dat" ) or die $!;
my %fields;
while (<$fh>) {
chomp;
$_ =~ s/\s+//g;
my @line = split ":";
push @{ $fields{ $line[0] } }, $line[2];
}
close($fh);
foreach ( keys %fields ) {
print "$_: MAX="
. max #{ $fields{$_} };
print ", MIN="
. min #{ $fields{$_} };
print ", AVERAGE="
. ( (sum #{ $fields{$_} }) / #{ $fields{$_} } ) . "\n";
}
In bash with bc:
#!/bin/bash
declare -A min
declare -A max
declare -A avg
declare -A avgCnt
while read -r line; do
key="${line%%:*}"
value="${line##*: }"
if [ -z "${max[$key]}" ]; then
max[$key]="$value"
min[$key]="$value"
avg[$key]="$value"
avgCnt[$key]=1
else
larger=`echo "$value > ${max[$key]}" | bc`
smaller=`echo "$value < ${min[$key]}" | bc`
avg[$key]=`echo "$value + ${avg[$key]}" | bc`
((avgCnt[$key]++))
if [ "$larger" -eq "1" ]; then
max[$key]="$value"
fi
if [ "$smaller" -eq "1" ]; then
min[$key]="$value"
fi
fi
done < "$1"
for i in "${!max[#]}"
do
average=`echo "${avg[$i]} / ${avgCnt[$i]}" | bc`
echo "$i: MAX = ${max[$i]}, MIN = ${min[$i]}, AVERAGE = $average"
done
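Assuming the script is saved as, say, stats.sh (a hypothetical name), it takes the data file as its argument and should print one MAX/MIN/AVERAGE line per field, as in the expected output above:
bash stats.sh sample.dat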
You can make use of this Python (2) code:
from collections import defaultdict
d = defaultdict(list)
[d[(line.split(":")[0])].append(float(line.split(":")[2].strip("\n "))) for line in open("sample.dat")]
for f in d: print f, ": MAX=", max(d[f]),", MIN=", min(d[f]),", AVG=", sum(d[f])/float(len(d[f]))
You can use gnu-R for something like this:
echo "1" > foo
echo "2" >> foo
cat foo \
| r -e \
'
f <- file("stdin")
open(f)
v <- read.csv(f,header=F)
write(max(v),stdout())
'
2
For summary statistics,
cat foo \
| r -e \
'
f <- file("stdin")
open(f)
v <- read.csv(f,header=F)
write(summary(v),stdout())
'
# Max, Min, Mean, median, quartiles, deviation, etc.
...
And in json:
... | r -e \
'
library(rjson)
f <- file("stdin")
open(f)
v <- read.csv(f,header=F)
json_summary <- toJSON(summary(v))
write(json_summary,stdout())
'
# same stats
| jq '.Max'
# for maximum
If you are using the Linux command-line environment, then you probably don't want to reinvent wheels; you want to stay vectorized and have clean code that is easy to read and develop, and which performs some standard, composable function.
In this case, you don't need an object-oriented language (using Python will tend to induce interface and code bloat, plus iterations with Google, pip, and conda depending on the libs you need, and type conversions you have to code by hand), you don't need verbose syntax, and you probably need to deal with dataframes/vectors/rows/columns of numerical data by default.
You probably also want scripts that can float around your particular machine without issues. If you are on Linux, that probably means gnu-R. Install dependencies via apt-get.

Edit functions in bash - awk

I have a function like this...
function size {
export FILENAME=$1
export SIZE=$(du -sb $FILENAME | awk '{ print $1 }')
awk 'BEGIN{x = ENVIRON["SIZE"]
split("Byte KiloByte MegaByte GigaByte TeraByte PetaByte ExaByte ZettaByte YottaByte", type)
for(i=8; y < 1; i--)
y = x / (2**(10*i))
print y " " type[i+2]
}'
}
size "/home/foo.bar" # 1 MegaByte
How can I get the output of: print y " " type[i+2]
into the variable SIZE_FILE?
My test: SIZE_FILE=${print y " " type[i+2]} # error :-(
Thank you very much
The $( expr ) construct will save the result of evaluating "expr" into a variable:
theDate=$(date)
You can also use backticks, but I think the $() is more readable:
theDate=`date`
So for your scripts, you'll use:
function size {
export FILENAME="$1"
SIZE=$(du -sb "$FILENAME" | awk '{ print $1 }')
export FILE_SIZE=$(awk -v x="$SIZE" 'BEGIN{
split("Byte KiloByte MegaByte GigaByte TeraByte PetaByte ExaByte ZettaByte YottaByte", type)
for(i=8; y < 1; i--)
y = x / (2**(10*i))
print y " " type[i+2]
}')
echo $FILE_SIZE
}
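With the function defined (sourced into the current shell), the value can then be picked up either from the exported FILE_SIZE variable or by capturing the function's output; a short sketch:
size "/home/foo.bar"                 # prints e.g. "1 MegaByte"
SIZE_FILE=$FILE_SIZE                 # exported by the function above
SIZE_FILE=$(size "/home/foo.bar")    # or capture the output directly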
You can do this without awk, which is more suited for processing text files.
function size () {
# Non-environment variables should be lowercased
# Always quote parameter expansions, in case they contain spaces
local filename="$1"
# Simpler way to get the file size in bytes
local size=$(stat -c%s "$filename")
# You could put all the units in an array, but we'll keep it simple.
for unit in Byte KiloByte MegaByte GigaByte TeraByte PetaByte ExaByte ZettaByte YottaByte; do
echo "$size $unit"
(( size /= 1024 ))
done
}
sizes=$(size "$myfile")
