Piped input for `bc` division generates random numbers - bash

I've got two files formatted in this way:
File1:
word token occurrence
File2:
token occurrence
What I want is a third file with this output:
word token occurrence1/occurrence2
This is my code:
while read token pos count
do
    #get pos counts
    poscount=$(grep "^$pos" $2 | cut -f 2)
    #calculate probability
    prob=$(echo "scale=5;$count / $poscount" | bc -l)
    #print token, pos-tag & probability
    echo -e "$token\t$pos\t$prob"
done < $1
The problem is that my output is something like this:
- : .25000
: : .75000
' '' 1.00000
0 CD .00396
1000 CD .00793
13 CD .00793
13th JJ .00073
36
29
16 CD .00396
17 CD .00396
There are lines containing only numbers, and I don't know where they come from; they are not in the input files.
Why do these numbers appear? Is there a way to remove those lines?
Thanks in advance!

Method using paste, cut, & dc:
echo "5 k $(paste file[12] | cut -f 3,5) / p" | dc | \
paste file1 - | cut --complement -f 3
Method using bash, paste & dc:
paste <(join -1 2 file1 -2 1 file2 -o 1.1,1.2) \
<(echo "5 k $(join -1 2 file1 -2 1 file2 -o 1.3,2.2) / p" | dc)

Related

Slow speed with gawk for multiple edits the same file

I run a test environment where I created 40,000 test files with a lorem-ipsum generator. The files are between 200 KB and 5 MB in size. I want to modify lots of random files: I will change 5% of the lines by deleting 2 lines and inserting 1 line with a base64 string.
The problem is that this procedure takes too much time per file. I tried to work around it by copying the test file to RAM and changing it there, but I see a single thread that uses only one full core, with gawk doing most of the CPU work. I'm looking for solutions, but I can't find the right advice. I think gawk could do this in one step, but for big files I end up with a string that is too long when I check against "getconf ARG_MAX".
How can I speed this up?
zeilen=$(wc -l < testfile$filecount.txt);
durchlauf=$(($zeilen/20))
zeilen=$((zeilen-2))
for (( c=1; c<=durchlauf; c++ ))
do
    zeile=$(shuf -i 1-$zeilen -n 1);
    zeile2=$((zeile+1))
    zeile3=$((zeile2+1))
    string=$(base64 /dev/urandom | tr -dc '[[:print:]]' | head -c 230)
    if [[ $c -eq 1 ]]
    then
        gawk -v n1="$zeile" -v n2="$zeile2" -v n3="$zeile3" -v s="$string" 'NR==n1{next;print} \
            NR==n2{next; print} NR==n3{print s}1' testfile$filecount.txt > /mnt/RAM/tempfile.tmp
    else
        gawk -i inplace -v n1="$zeile" -v n2="$zeile2" -v n3="$zeile3" -v s="$string" 'NR==n1{next; print} \
            NR==n2{next; print} NR==n3{print s}1' /mnt/RAM/tempfile.tmp
    fi
done
Assumptions:
generate $durchlauf (a number) random line numbers; we'll refer to a single number as n ...
delete lines numbered n and n+1 from the input file and in their place ...
insert $string (a randomly generated base64 string)
this list of random line numbers must not have any consecutive line numbers
As others have pointed out, you want to limit yourself to a single gawk call per input file.
New approach:
generate $durchlauf (count) random numbers (see gen_numbers() function)
generate $durchlauf (count) base64 strings (we'll reuse Ed Morton's code)
paste these 2 sets of data into a single input stream/file
feed 2 files to gawk ... the paste result and the actual file to be modified
we won't be able to use gawk's -i inplace so we'll use an intermediate tmp file
when we find a matching line in our input file we'll 1) insert the base64 string and then 2) skip/delete the current/next input lines; this should address the issue where two of the random numbers differ by 1
One idea to ensure we do not generate consecutive line numbers:
break our set of line numbers into ranges, eg, 100 lines split into 5 ranges => 1-20 / 21-40 / 41-60 / 61-80 / 81-100
reduce the end of each range by 1, eg, 1-19 / 21-39 / 41-59 / 61-79 / 81-99
use $RANDOM to generate numbers within each range (this tends to be at least an order of magnitude faster than comparable shuf calls)
We'll use a function to generate our list of non-consecutive line numbers:
gen_numbers () {
max=$1 # $zeilen eg, 100
count=$2 # $durchlauf eg, 5
interval=$(( max / count )) # eg, 100 / 5 = 20
for (( start=1; start<max; start=start+interval ))
do
end=$(( start + interval - 2 ))
out=$(( ( RANDOM % interval ) + start ))
[[ $out -gt $end ]] && out=${end}
echo ${out}
done
}
Sample run:
$ zeilen=100
$ durchlauf=5
$ gen_numbers ${zeilen} ${durchlauf}
17
31
54
64
86
Demonstration of the paste/gen_numbers/base64/tr/gawk idea:
$ zeilen=300
$ durchlauf=3
$ paste <( gen_numbers ${zeilen} ${durchlauf} ) <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' )
This generates:
74 7VFhnDN4J...snip...rwnofLv
142 ZYv07oKMB...snip...xhVynvw
261 gifbwFCXY...snip...hWYio3e
Main code:
tmpfile=$(mktemp)

while/for loop ...                 # whatever OP is using to loop over list of input files
do
    zeilen=$(wc -l < "testfile${filecount}".txt)
    durchlauf=$(( $zeilen/20 ))

    awk '
    # process 1st file (ie, the paste/gen_numbers/base64/tr/gawk output)
    FNR==NR { ins[$1]=$2           # store base64 string in ins[] array, indexed by line number
              del[$1]=del[($1)+1]  # note line numbers n and n+1 for deletion (referencing del[n+1] creates that entry, too)
              next
            }
    # process 2nd file
    FNR in ins     { print ins[FNR] }  # insert base64 string
    ! (FNR in del)                     # if current line number not in del[] array then print the line
    ' <( paste <( gen_numbers ${zeilen} ${durchlauf} ) <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' )) "testfile${filecount}".txt > "${tmpfile}"

    # the last line with line continuations for readability:
    #' <( paste \
    #        <( gen_numbers ${zeilen} ${durchlauf} ) \
    #        <( base64 /dev/urandom | tr -dc '[[:print:]]' | gawk -v max="${durchlauf}" -v RS='.{230}' '{print RT} FNR==max{exit}' ) \
    #   ) \
    #"testfile${filecount}".txt > "${tmpfile}"

    mv "${tmpfile}" "testfile${filecount}".txt
done
Simple example of awk code in action:
$ cat orig.txt
line1
line2
line3
line4
line5
line6
line7
line8
line9
$ cat paste.out # simulated output from paste/gen_numbers/base64/tr/gawk
1 newline1
5 newline5
$ awk '...' paste.out orig.txt
newline1
line3
line4
newline5
line7
line8
line9
I don't know what the rest of your script is doing, but the following will give you an idea of how to vastly improve its performance.
Instead of this, which calls base64, tr, head, and awk on each iteration of the loop, with all of the overhead that implies:
for (( c=1; c<=3; c++ ))
do
    string=$(base64 /dev/urandom | tr -dc '[[:print:]]' | head -c 230)
    echo "$string" | awk '{print "<" $0 ">"}'
done
<nSxzxmRQc11+fFnG7ET4EBIBUwoflPo9Mop0j50C1MtRoLNjb43aNTMNRSMePTnGub5gqDWeV4yEyCVYC2s519JL5OLpBFxSS/xOjbL4pkmoFqOceX3DTmsZrl/RG+YLXxiLBjL//I220MQAzpQE5bpfQiQB6BvRw64HbhtVzHYMODbQU1UYLeM6IMXdzPgsQyghv1MCFvs0Nl4Mez2Zh98f9+472c6K+44nmi>
<9xfgBc1Y7P/QJkB6PCIfNg0b7V+KmSUS49uU7XdT+yiBqjTLcNaETpMhpMSt3MLs9GFDCQs9TWKx7yXgbNch1p849IQrjhtZCa0H5rtCXJbbngc3oF9LYY8WT72RPiV/gk4wJrAKYq8/lKYzu0Hms0lHaOmd4qcz1hpzubP7NuiBjvv16A8T3slVG1p4vwxa5JyfgYIYo4rno219ba/vRMB1QF9HaAppdRMP32>
<K5kNgv9EN1a/c/7eatrivNeUzKYolCrz5tHE2yZ6XNm1aT4ZZq3OaY5UgnwF8ePIpMKVw5LZNstVwFdVaNvtL6JreCkcO+QtebsCYg5sAwIdozwXFs4F4hZ/ygoz3DEeMWYgFTcgFnfoCV2Rct2bg/mAcJBZ9+4x9IS+JNTA64T1Zl+FJiCuHS05sFIsZYBCqRADp2iL3xcTr913dNplqUvBEEsW1qCk/TDwQh>
you should write this, which calls each tool only once and so will run orders of magnitude faster:
$ base64 /dev/urandom | tr -dc '[[:print:]]' |
gawk -v RS='.{230}' '{print "<" RT ">"} NR==3{exit}'
<X0If1qkQItVLDOmh2BFYyswBgKFZvEwyA+WglyU0BhqWHLzURt/AIRgL3olCWZebktfwBU6sK7N3nwK6QV2g5VheXIY7qPzkzKUYJXWvgGcrIoyd9tLUjkM3eusuTTp4TwNY6E/z7lT0/2oQrLH/yZr2hgAm8IXDVgWNkICw81BRPUqITNt3VqmYt/HKnL4d/i88F4QDE0XgivHzWAk6OLowtmWAiT8k1a0Me6>
<TqCyRXj31xsFcZS87vbA50rYKq4cvIIn1oCtN6PJcIsSUSjG8hIhfP8zwhzi6iC33HfL96JfLIBcLrojOIkd7WGGXcHsn0F0XVauOR+t8SRqv+/t9ggDuVsn6MsY2R4J+mppTMB3fcC5787u0dO5vO1UTFWZG0ZCzxvX/3oxbExXb8M54WL6PZQsNrVnKtkvllAT/s4mKsQ/ojXNB0CTw7L6AvB9HU7W2x+U3j>
<ESsGZlHjX/nslhJD5kJGsFvdMp+PC5KA+xOYlcTbc/t9aXoHhAJuy/KdjoGq6VkP+v4eQ5lNURdyxs+jMHqLVVtGwFYSlc61MgCt0IefpgpU2e2werIQAsrDKKT1DWTfbH1qaesTy2IhTKcEFlW/mc+1en8912Dig7Nn2MD8VQrGn6BzvgjzeGRqGLAtWJWkzQjfx+74ffJQUXW4uuEXA8lBvbuJ8+yQA2WHK5>
@mark-fuso, wow, that's incredibly fast! But there is a mistake in the script: the file grows in size a little, which is something I have to avoid. I think if two random line numbers ($durchlauf) follow each other, then one line is not deleted. Honestly, I don't completely understand what your command is doing, but it works very well. I think for such a task I need more bash experience.
Sample output:
64
65
66
gOf0Vvb9OyXY1Tjb1r4jkDWC4VIBpQAYnSY7KkT1gl5MfnkCMzUmN798pkgEVAlRgV9GXpknme46yZURCaAjeg6G5f1Fc7nc7AquIGnEER>
AFwB9cnHWu6SRnsupYCPViTC9XK+fwGkiHvEXrtw2aosTGAAFyu0GI8Ri2+NoJAvMw4mv/FE72t/xapmG5wjKpQYsBXYyZ9YVV0SE6c6rL>
70
71

Line differences with element location in shell script

Input:
file1.txt
abc 1 2 3 4
file2.txt
abc 1 2 5 6
Expected output:
differences is
3
5
at location 3
I am able to track the differences using:
comm -3 file1.txt file2.txt | uniq -c | awk '{print $4}' | uniq
But I am not able to track the element location.
Could you please suggest a shell script that tracks the element location?
With perl, and Path::Class from CPAN for convenience
perl -MPath::Class -MList::Util=first -e '
    @f1 = split " ", file(shift)->slurp;
    @f2 = split " ", file(shift)->slurp;
    $idx = first {$f1[$_] ne $f2[$_]} 0..$#f1;
    printf "difference is\n%s\n%s\nat index %d\n", $f1[$idx], $f2[$idx], $idx;
' file{1,2}.txt
difference is
3
5
at index 3
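For comparison, the same field-by-field check can be sketched in awk (assuming, as in the example, that each file holds a single whitespace-separated line; the index is printed 0-based to match the perl output):
awk '
    NR==FNR { for (i=1; i<=NF; i++) a[i] = $i; next }   # remember the fields of file1
    {
        for (i=1; i<=NF; i++)                            # compare field by field against file2
            if (a[i] != $i) {
                printf "difference is\n%s\n%s\nat index %d\n", a[i], $i, i-1
                exit
            }
    }
' file1.txt file2.txt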

how to find maximum and minimum values of a particular column using AWK [duplicate]

I'm using awk to deal with a simple .dat file, which contains several lines of data and each line has 4 columns separated by a single space.
I want to find the minimum and maximum of the first column.
The data file looks like this:
9 30 8.58939 167.759
9 38 1.3709 164.318
10 30 6.69505 169.529
10 31 7.05698 169.425
11 30 6.03872 169.095
11 31 5.5398 167.902
12 30 3.66257 168.689
12 31 9.6747 167.049
4 30 10.7602 169.611
4 31 8.25869 169.637
5 30 7.08504 170.212
5 31 11.5508 168.409
6 31 5.57599 168.903
6 32 6.37579 168.283
7 30 11.8416 168.538
7 31 -2.70843 167.116
8 30 47.1137 126.085
8 31 4.73017 169.496
The commands I used are as follows.
min=`awk 'BEGIN{a=1000}{if ($1<a) a=$1 fi} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>a) a=$1 fi} END{print a}' mydata.dat`
However, the output is min=10 and max=9.
(Similar commands return the correct minimum and maximum of the second column.)
Could someone tell me where I was wrong? Thank you!
Awk guesses the type.
String "10" is less than string "4" because character "1" comes before "4".
Force a type conversion, using addition of zero:
min=`awk 'BEGIN{a=1000}{if ($1<0+a) a=$1} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>0+a) a=$1} END{print a}' mydata.dat`
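To see the string-vs-numeric difference directly, here is a quick standalone check (not part of the original answer):
$ awk 'BEGIN { print ("10" < "4"), (10 < 4) }'
1 0
The first comparison is done on strings and is true; the second is numeric and is false.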
a non-awk answer:
cut -d" " -f1 file |
sort -n |
tee >(echo "min=$(head -1)") \
> >(echo "max=$(tail -1)")
That tee command is perhaps a bit too clever. tee duplicates its stdin stream to the files named as arguments, and it also streams the same data to stdout. I'm using process substitutions to filter the streams.
The same thing can be done (with less flourish) to extract the first and last lines of a stream of data:
cut -d" " -f1 file | sort -n | sed -n '1s/^/min=/p; $s/^/max=/p'
or
cut -d" " -f1 file | sort -n | {
read line
echo "min=$line"
while read line; do max=$line; done
echo "max=$max"
}
Your problem was simply that in your script you had:
if ($1<a) a=$1 fi
and that final fi is not part of awk syntax, so it is treated as a variable; a=$1 fi is then string concatenation, and that TELLS awk that a contains a string, not a number, hence the string comparison instead of a numeric one in $1<a.
More importantly, in general, never start with a guessed value for max/min; just use the first value read as the seed. Here's the correct way to write the script:
$ cat tst.awk
BEGIN { min = max = "NaN" }
{
min = (NR==1 || $1<min ? $1 : min)
max = (NR==1 || $1>max ? $1 : max)
}
END { print min, max }
$ awk -f tst.awk file
4 12
$ awk -f tst.awk /dev/null
NaN NaN
$ a=( $( awk -f tst.awk file ) )
$ echo "${a[0]}"
4
$ echo "${a[1]}"
12
If you don't like NaN, pick whatever you'd prefer to print when the input file is empty.
Late, but here is a shorter command that makes no initial assumption:
awk '(NR==1){Min=$1;Max=$1};(NR>=2){if(Min>$1) Min=$1;if(Max<$1) Max=$1} END {printf "The Min is %d ,Max is %d",Min,Max}' FileName.dat
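Run against the sample mydata.dat shown above (passed in place of FileName.dat), that prints:
The Min is 4 ,Max is 12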
A very straightforward solution (if it's not compulsory to use awk):
Find Min --> sort -n -r numbers.txt | tail -n1
Find Max --> sort -n -r numbers.txt | head -n1
You can use a combination of sort, head, and tail to get the desired output, as shown above.
(PS: if you want to extract the first column, or any other column, you can use the cut command, e.g. to extract the first column: cut -d " " -f 1 sample.dat)
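Putting the PS together with the min/max commands for the first column of the sample mydata.dat above (a small sketch, not part of the original answer):
min=$(cut -d " " -f 1 mydata.dat | sort -n | head -n1)
max=$(cut -d " " -f 1 mydata.dat | sort -n | tail -n1)
echo "min=$min max=$max"    # prints: min=4 max=12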
#minimum
cat your_data_file.dat | sort -nk3,3 | head -1
#this will find the minimum of column 3
#maximum
cat your_data_file.dat | sort -nk3,3 | tail -1
#this will find the maximum of column 3
#to find in column 2, use -nk2,2
#assign to a variable and use
min_col=`cat your_data_file.dat | sort -nk3,3 | head -1 | awk '{print $3}'`

Merge columns cut & cat

I have file.txt with 3 columns.
1 A B
2 C D
3 E F
I want to append #2 and #3 to #1 (all the #1#2 pairs first, then the #1#3 pairs). The result should look like this:
1A
2C
3E
1B
2D
3F
I am doing this by
cut -f 1,2 > tmp1
cut -f 1,3 > tmp2
cat *tmp * > final_file
But I am getting repeated lines! If I check the final output with:
cat * | sort | uniq -d
there are plenty of repeated lines and there are none in the primary file.
Can anyone suggest another way of doing this? I believe the one I am trying to use is too complex, and that's why I am getting such weird output.
pzanoni@vicky:/tmp$ cat file.txt
1 A B
2 C D
3 E F
pzanoni@vicky:/tmp$ cut -d' ' -f1,2 file.txt > result
pzanoni@vicky:/tmp$ cut -d' ' -f1,3 file.txt >> result
pzanoni@vicky:/tmp$ cat result
1 A
2 C
3 E
1 B
2 D
3 F
I'm using bash
Preserves the order with one pass through the file:
awk '
    {print $1 $2; pass2 = pass2 sep $1 $3; sep = "\n"}
    END {print pass2}
' file.txt
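With the file.txt above, this prints the requested order:
1A
2C
3E
1B
2D
3F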
The reason this (cat tmp* * > final_file) is wrong:
I assume *tmp was a typo
I assume at this point the directory only contains "tmp1" and "tmp2"
Look at how those wildcards will be expanded:
tmp* expands to "tmp1" and "tmp2"
* also expands to "tmp1" and "tmp2"
So your command line becomes cat tmp1 tmp2 tmp1 tmp2 > final_file and hence you get all the duplicated lines.
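A quick way to see what the shell will actually run (assuming the directory contains only tmp1 and tmp2):
$ echo cat tmp* *
cat tmp1 tmp2 tmp1 tmp2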
cat file.txt | awk '{print $1 $2 "\n" $1 $3};'

Resources