AWK: subset randomly and without replacement a string in every row of a file - random

So I need to subset 10 characters from all strings in a particular column of a file, randomly and without repetition (i.e. I want to avoid drawing a character from any given index more than once).
For the sake of simplicity, let's say I have the following string:
ABCDEFGHIJKLMN
For which I should obtain, for example, this result:
DAKLFCHGBI
Notice that no letter occurs twice, which means that no position is extracted more than once.
For this other string:
CCCCCCCCCCCCGG
Analogously, I should never find more than two "G" characters in the output (otherwise it would mean that a "G" character has been sampled more than once), e.g.:
CCGCCCCCCC
Or, in other words, I want to shuffle all characters from each string, and keep the first 10. This can be easily achieved in bash using:
echo "ABCDEFGHIJKLMN" | fold -w1 | shuf -n10 | tr -d '\n'
However, since I need to perform this many times on dozens of files with over a hundred thousand lines each, this is way too slow. So looking around, I've arrived at the following awk code, which seems to work fine whenever the strings are passed to it one by one, e.g.:
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' <(echo "ABCDEFGHIJKLMN")
But when I input the following file with a string on each row, awk hangs and the output gets truncated on the second line:
echo "ABCDEFGHIJKLMN" > file.txt
echo "CCCCCCCCCCCCGG" >> file.txt
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' file.txt
This other version of the code which samples characters from the string with repetition works fine, so it looks like the issue lies in the part which populates the N array, but I'm not proficient in awk so I'm a bit stuck:
awk '{srand(); len=length($1); for(i=1;i<=10;i++) {k=int(rand()*len)+1; printf "%s", substr($1,k,1)} print ""}'
Anyone can help?
In case this matters: my actual file is more complex than the examples provided here, with several other columns, and unlike the ones in this example, its strings may have different lengths.
Thanks in advance for your time :)
EDIT:
As mentioned in the comments, I managed to make it work by removing the N array (so that it resets before processing each row):
awk 'BEGIN{srand()} {len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} split("", N); print ""}' file.txt
Do note however that if the string in $1 is shorter than 10, this will get stuck in an infinite loop, so make sure that all strings are always longer than the subset target size. The alternative solution provided by Andre Wildberg in the comments doesn't carry this issue.

I would harness GNU AWK for this task following way, let file.txt content be
ABCDEFGHIJKLMN
CCCCCCCCCCCCGG
then
awk 'function comp_func(i1, v1, i2, v2){return rand()-0.5}BEGIN{FPAT=".";PROCINFO["sorted_in"]="comp_func"}{s="";patsplit($0,arr);for(i in arr){s = s arr[i]};print substr(s,1,10)}' file.txt
might give output
NGLHCKEIMJ
CCCCCCCCGG
Explanation: I use custom Array Traversal Control function which does randomly decides which element should be considered greater. -0.5 is used as rand() gives values from 0 to 1. For each line array arr is populated by characters of line, then traversed in random order to create s string which are characters shuffled, then substr used to get first 10 characters. You might elect to add counter which will terminate for loop if you have very long lines in comparison to number of characters to select.
(tested in GNU Awk 5.0.1)

Iteratively construct a substring of the remaining letters.
Tested with
awk version 20121220
GNU Awk 4.2.1, API: 2.0
GNU Awk 5.2.1, API 3.2
mawk 1.3.4 20200120
% awk -v size=10 'BEGIN{srand()} {n=length($0); a=$0; x=0;
for(i=1; i<=n; i++){x++; na=length(a); rnd = int(rand() * na + 1)
printf("%s", substr(a, rnd, 1))
a=substr(a, 1, rnd - 1)""substr(a, rnd + 1, na)
if(x >= size){break}}
print ""}' file.txt
CJFMKHNDLA
CGCCCCCCCC
In consecutive iterative runs remember to check if srand works the way you expect in your version of awk. If in doubt use $RANDOM or, better, /dev/urandom.

if u don't need to be strictly within awk, then jot makes it super easy :
say you want 20 random characters between
"A" (ascii 65) and "N" (ascii 78), inc. repeats of same chars
jot -s '' -c -r 20 65 78
ANNKECLDMLMNCLGDIGNL

Related

Listing skipped numbers in a large txt file using bash

I need to find a way to display the missing numbers from a large txt file. It's a web graph that has 875,713 vertices. However, when I sort the file the largest number that is displayed at the end is 916,427. So there are some numbers not being used for vertex index. Is there a bash command I could use to do this?
I found this after searching around some other threads but I'm not entirely sure if its correct:
awk 'NR != $1 { for (i = prev + 1; i < $1; i++) {print i} } { prev = $1 + 1 }' file
Assuming the 'number' of each vertex is in the first column, you can use:
awk '{a[$1]} END{for(i = 1; i <= 916427; i++){if(!(i in a)){print i}}}' file
E.g.
# create some example data and remove "10"
seq 916427 | sed '10d' > test.txt
head test.txt
1
2
3
4
5
6
7
8
9
11
awk '{a[$1]} END { for (i = 1; i <= 916427; i++) { if (!(i in a)) {print i}}}' test.txt
10
If you don't want to store the array in memory (otherwise #jared_mamrot solution would work), you can use
awk 'NR==1 {p=$1; next} {for (i=p+1; i<$1; i++) {print i}; p=$1}' < <( sort -n file)
which sorts the file first.
Just because you tagged your question bash, I'll provide a bash solution. :)
# sample data as jared suggested, with 10 removed...
seq 916427 | sed '10d' > test.txt
# read sample data into an array...
mapfile -t a < test.txt
# reverse the $a array into $b
for i in "${a[#]}"; do b[$i]=1; done
# step through list of possible numbers, testing if each one is an index of $b
for ((i=1; i<${a[((${#a[#]}-1))]}; i++)); do [[ -z ${b[i]} ]] && echo $i; done
The line noise in the last line (${a[((${#a[#]}-1))]}) simply means "the value of the last array element", and the -1 is there because without instructions otherwise, mapfile starts numbering things at zero.
This takes a little longer to run than awk, because awk is awesome. But it runs in bash without calling any external tools. Aside from the ones generating our sample data, of course!
Note that the last line verifies $b array membership with a string comparison. You might get a very slight performance increase by doing a math comparison instead ((( ${b[i]} )) || echo $i) but the improvement would be so small that it's not even worth mentioning. Oh dang.
Note also that both this and the awk solution involve creating very large arrays in memory, then stepping through those arrays. Be careful of your memory, and don't waste array space with unnecessary data. You will probably want to pull just your indices out of your original dataset for this comparison, rather than loading everything into a bash or awk array.

how to round the output in shell?

For our webshop we get from the manufacturers a csv file (automatically updated) with product data.
Some manufacturers use prices without Tax and some within.
I want to change prices with a shell script to add 21% TAX and round it to nearest .95 or .50
For example I get a sheet:
sku|ean|name|type|price_excl_vat|price
EU-123|123123123123|Product name|simple|24.9900
I use this code:
sed -i "1 s/price/price_excl_vat/" inputfile
awk '{FS="|"; OFS="|"; if (NR<=1) {print $0 "|price"} else {print $0 "|" $5*1.21}}' inputfile > outputfile
the output is:
sku|ean|name|type|price_excl_vat|price
EU-123|123123123123|Product name|simple|24.9900|30.2379
How do I round it to the correct price like below ?
sku|ean|name|type|price_excl_vat|price
EU-123|123123123123|Product name|simple|24.9900|29.95
awk to the rescue!
awk 'BEGIN {FS=OFS="|"}
$NF==$NF+0 {a=$NF*1.21;
r=a-int(a);
if (r<0.225) a=a-r-0.05;
else if (r<0.725) a=a-r+0.50;
else a=a-r+0.95;
$(NF+1)=a} 1'
note that in your example the nearest number for 30.2379 will be 30.50 Perhaps you want to round down?
To round down instead of the nearest, and with a variable price column. The new computed value will be appended to the end of the row.
awk 'BEGIN {FS=OFS="|"; k=5}
$k==$k+0 {a=$k*1.21;
r=a-int(a);
if (r<0.50) a=a-r-0.05;
else if (r<0.95) a=a-r+0.50;
else a=a-r+0.95;
$(NF+1)=a} 1'
awk '#define field separator in and out
BEGIN{FS=OFS="|"}
# add/modify a 6th field for price label if missing on header only
NR==1 && NF == 6 { $6 = "price"; print; next}
NR==1 && NF == 5 { $6 = "price"; print; next}
# add price with tva rounded to 0.01 if missing
NF == 5 { $6 = int( $5 * 121 ) / 100 }
# print the line (modified or not, ex empty lines) [7 is just a *not 0*)
7
' inputfile \
> outputfile
self documented
not sure about your sed for header becasue sample show already a header with price so take the one you want
Not knowing what you're program looks like, it makes it difficult to give you more information.
However, both awk and bash have the printf command. This command can be used for rounding floating point numbers. (Yes, Bash is integer arithmetic, but it can pretend a number is a decimal number).
I gave you the link for the C printf command because the one for Bash doesn't include the formatting codes. Read it and weep because the documentation is a bit dense, and if you've never used printf before, it can be quite difficult to understand. Fortunately, an example will bring things to light:
$ foo="23.42532"
$ printf "%2.2f\n", $foo
$ 23.43 #All rounded for you!
The f means it's a floating point number. The % tells you that this is the beginning of a formatting sequence. The 2.2 means you want 2 digits on the left side of the decimal and two digits on the right. If you said %4.2f, it would make sure there's enough room for four digits on the left side of the decimal, and left pad the number with spaces. The \n on the end is the New Line character.
Fortunately, although printf can be hard to understand at first, it's pretty much the same in almost all programming languages. It's in awk, Perl, Python, C, Java, and many more languages. And, if the information you need isn't in printf, try the documentation on sprintf which is like printf, but prints the formatted text into a string.
The best documentation I've seen is in the Perl sprintf documentation because it gives you plenty of examples.

Bash: arithmetic addressed by line number and column

I have normally done this with Excel, but as I am trying to learn bash, I'd like to ask for advice here on how to do so. My input file resembles:
# s0 legend "1001"
# s1 legend "1002"
#target G0.S0
#type xy
2.0 -1052.7396157664
2.5 -1052.7330560932
3.0 -1052.7540013664
3.5 -1052.7780321236
4.0 -1052.7948229060
4.5 -1052.8081313831
5.0 -1052.8190310613
&
#target G0.S1
#type xy
2.0 -1052.5384564253
2.5 -1052.7040374678
3.0 -1052.7542803612
3.5 -1052.7781686744
4.0 -1052.7948927247
4.5 -1052.8081704241
5.0 -1052.8190543049
&
where the above only shows two data sets: s0 and s1. In reality I have 17 data sets and will combine them arbitrarily. By combine, I mean I would like to:
For two data sets, extract the second column of each separately.
Subtract these two columns row by row.
Multiply the difference by a constant, $C.
Note: $C multiplies very small numbers and the only way I could get it to not divide by zero was to take a massive scale.
Edit: After requests, I was apparently not entirely clear what I was going for. Take for example:
set0
2 x
3 y
4 z
set1
2 r
3 s
4 t
I also have defined a constant C.
I would like to perform the following operation:
C*(r - x)
C*(s - y)
C*(t - z)
I will be doing this for sets > 1, up to 16, for example (set 10) minus (set 0). Therefore, I need the flexibility to target a value based on its line number and column number, and preferably acting over a range of line numbers to make it efficient.
So far this works:
C=$(echo "scale=45;x=(small numbers)*(small numbers); x" | bc -l)
sed -n '5,11p' input.in | cut -c 5-20 > tmp1.in
sed -n '15,21p' input.in | cut -c 5-20 > tmp2.in
pr -m -t -s tmp1.in tmp2.in > tmp3.in
awk '{printf $2-$1 "\n"}' tmp3.in > tmp4.in
but the multiplication failed:
awk '{printf "%11.2f\n", "$C"*$1 }' tmp4.in > tmp5.in
returning:
0.00
0.00
0.00
0.00
0.00
0.00
0.00
I have a feeling the whole thing can be accomplished more elegantly with awk. I also tried this:
for (( i=0; i<=6; i++ ))
do
n=5+$i
m=10+n
awk 'NR==n{a=$2};NR==m{b=$2} {printf "%d\n", $b-$a}' input.in > temp.in
done
but all I get in temp.in is a long column of 0s.
I also tried
awk 'NR==5,NR==11{a=$2};NR==15,NR==21{b=$2} {printf "%d\n", $b-$a}' input.in > temp.in
but got the error
awk: (FILENAME=input.in FNR=20) fatal: attempt to access field -1052
Any idea how to formulate this with awk, and if that doesn't work, then why I cannot multiply with awk above? Thank you!
this does the math in one go
$ awk -v c=1 '/^&/ {s++}
s==1 {a[$1]=$2}
s==3 {print $1,a[$1],$2,c*(a[$1]-$2)}
/#type/ {s++}' file
2.0 -1052.7396157664 -1052.5384564253 -0.201159
2.5 -1052.7330560932 -1052.7040374678 -0.0290186
3.0 -1052.7540013664 -1052.7542803612 0.000278995
3.5 -1052.7780321236 -1052.7781686744 0.000136551
4.0 -1052.7948229060 -1052.7948927247 6.98187e-05
4.5 -1052.8081313831 -1052.8081704241 3.9041e-05
5.0 -1052.8190310613 -1052.8190543049 2.32436e-05
you can remove the decorations and add print formatting easily. The magic numbers 1=g1 and 3=2*g2-1 correspond to data groups 1 and 2 as the order presented in the data file, can be converted to awk variables as well.
The counter s keeps track of whether you're in a set or not, Odd numbers correspond to sets and even numbers between sets. The increment is done both at the start pattern and end pattern. The order of increment statements were set in such a way they, they are not printed following the pattern (unset first, print set values, reset last}. You can change the order and observe the effects.
This might be what you're looking for:
$ cat tst.awk
/^[#&]/ { lineNr=0; next }
{
++lineNr
if (lineNr in prev) {
print $1, c * ($2 - prev[lineNr])
}
prev[lineNr] = $2
}
$ awk -v c=100000 -f tst.awk file
2.0 20115.9
2.5 2901.86
3.0 -27.8995
3.5 -13.6551
4.0 -6.98187
4.5 -3.9041
5.0 -2.32436
In your first try, you should replace that line:
awk '{printf "%11.2f\n", "$C"*$1 }' tmp4.in > tmp5.in
with that one:
awk -v C=$C '{printf "%11.2f\n", C*$1 }' tmp4.in > tmp5.in
You are mixing notations of bash shell with notation with awk.
in shell you define variable without $, and you use them with $.
Here you are in awk script, there is no $ to use variables. Yet there are some special variables : $1 $2 ...
You have put single quote ' around your awk script, so the shell variables cant be used. I mean you have written $C, but the shell can not see it inside single-quote. That is why you have to write awk -v C=$C so that the shell variable $C is transferred to an awk variable called C.
In your other tries with awk, we can see such errors also. Now I think you'll make it.

Removing repeated pairs from a very big text file

I have a very big text file (few GB) that has the following format:
1 2
3 4
3 5
3 6
3 7
3 8
3 9
File is already sorted and double lines were removed. There are repeated pairs like '2 1', '4 3' reverse order that I want to remove. Does anybody have any solution to do it in a very resource limited environments, in BASH, AWK, perl or any similar languages? I can not load the whole file and loop between the values.
You want to remove lines where the second number is less than the first?
perl -i~ -lane'print if $F[0] < $F[1]' file
Possible solution:
Scan the file
For any pair where the second value is less than the first, swap the two numbers
Sort the pairs again by first then second number
Remove duplicates
I'm still thinking about more efficient solution in terms of disk sweeps, but this is a basic naive approach
For each value, perform a binary search on the file on the hard drive, without loading it into memory. Delete the duplicate if you see it. Then do a final pass that removes all instances of two or more \n.
Not exactly sure if this works / if it's any good...
awk '{ if ($2 > $1) print; else print $2, $1 }' hugetext | sort -nu -O hugetext
You want remove duplicates considering 1 2 and 2 1 to be the same?
< file.in \
| perl -lane'print "#F[ $F[0] < $F[1] ? (0,1,0,1) : (1,0,0,1) ]"' \
| sort -n \
| perl -lane'$t="#F[0,1]"; print "#F[2,3]" if $t ne $p; $p=$t;' \
> file.out
This can handle arbitrarily large files.
Here's a general O(n) algorithm to do this in 1 pass (no loops or sorting required):
Start with an empty hashset as your blacklist (a set is a map with just keys)
Read file one line at a time.
For each line:
Check to see this pair is in your blacklist already.
If so, ignore it.
If not, append it to your result file; and also add the swapped value to the blacklist (e.g., if you just read "3 4", and "4 3" to the blacklist)
This takes O(n) time to run, and O(n) storage for the blacklist. (No additional storage for the result if you manipulate the file as r/w to remove lines as you check them against the blacklist)
perl -lane '
END{
print for sort {$a<=>$b} keys %h;
}
$key = $F[0] < $F[1] ? "$F[0] $F[1]" : "$F[1] $F[0]";
$h{$key} = "";
' file.txt
Explanations :
I sort the current line in numeric order
I make the hash key variable $key by concatenating first and second value with a space
I defined the $hash{$key} to nothing
At the end, I print all the keys sorted in numeric order.
A hash key is uniq by nature, so no duplicate.
You just need to use Unix redirections to create a new file.

Shell Scripting and use of gawk along with arithmetic operations

I have a tab delimited file and I want to perform some mathematical calculations on the columns present in file.
let the file name be sndf and $tag have some integer value, I want to first find difference between the values of column 3 and 2 then divide column 4 value with the value in $tag again divide the resultant with the difference in values of column 3 and 2 and final result is multiplied by 100.
cat $sndf | gawk '{for (i = 1; i <= NF; i += 1) {
printf "%f\t" $3 -$2 "\t", (((($4/"'$tag'")/($3-$2)))*100);
} printf "\n"}'>normal_wrt_region
the command is writing answer 4 times instead of one time to the output file..... can u all suggest an improvement?
thank you
SOLUTION: Dear all, I have solved the problem, thank you all for reading the problem and investing your time.
the command is writing answer 4 times instead of one time to the output file, can u all suggest an improvement?
Don't use the for loop if you don't need one?
cat $sndf | gawk '{ printf "%f\t" $3 -$2 "\t", (((($4/"'$tag'")/($3-$2)))*100) }'

Resources