Can someone explain this behavior to me?
$ echo "A_B_C_D" | awk '{split($0,a,"_"); for (i in a) {print i,a[i]}}'
2 B
3 C
4 D
1 A
Same with
$ awk '{split("A_B_C_D",a,"_"); for (i in a) {print i,a[i]}}' empty
2 B
3 C
4 D
1 A
Where empty is a file with one line.
However, this works:
$ echo "A_B_C_D" | awk '{n=split($0,a,"_"); for (i=1;i<=n;i++) {print i,a[i]}}'
1 A
2 B
3 C
4 D
Thanks
man awk and look up the in operator. If you want to control the output order using the in operator, you can do so with GNU awk by populating PROCINFO["sorted_in"]. See http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal for details.
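For example (a minimal sketch, assuming GNU awk 4.0 or later, where PROCINFO["sorted_in"] is available):
$ echo "A_B_C_D" | gawk '{split($0,a,"_"); PROCINFO["sorted_in"]="@ind_num_asc"; for (i in a) print i, a[i]}'
1 A
2 B
3 C
4 D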
Thanks to @fedorqui, the question was already asked here:
And here is the best answer, which satisfies my curiosity:
From 8. Arrays in awk --> 8.5 Scanning All Elements of an Array in the
GNU Awk user's guide (gawk 3.1.0) when referring to the for (value in array)
syntax:
The order in which elements of the array are accessed by this statement
is determined by the internal arrangement of the array
elements within awk and cannot be controlled or changed. This can lead
to problems if new elements are added to array by statements in the
loop body; it is not predictable whether or not the for loop will
reach them. Similarly, changing var inside the loop may produce
strange results. It is best to avoid such things.
It's like a Python dictionary when looping over dict.keys(), isn't it?
So I need to subset 10 characters from all strings in a particular column of a file, randomly and without repetition (i.e. I want to avoid drawing a character from any given index more than once).
For the sake of simplicity, let's say I have the following string:
ABCDEFGHIJKLMN
For which I should obtain, for example, this result:
DAKLFCHGBI
Notice that no letter occurs twice, which means that no position is extracted more than once.
For this other string:
CCCCCCCCCCCCGG
Analogously, I should never find more than two "G" characters in the output (otherwise it would mean that a "G" character has been sampled more than once), e.g.:
CCGCCCCCCC
Or, in other words, I want to shuffle all characters from each string, and keep the first 10. This can be easily achieved in bash using:
echo "ABCDEFGHIJKLMN" | fold -w1 | shuf -n10 | tr -d '\n'
However, since I need to perform this many times on dozens of files with over a hundred thousand lines each, this is way too slow. So looking around, I've arrived at the following awk code, which seems to work fine whenever the strings are passed to it one by one, e.g.:
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' <(echo "ABCDEFGHIJKLMN")
But when I input the following file with a string on each row, awk hangs and the output gets truncated on the second line:
echo "ABCDEFGHIJKLMN" > file.txt
echo "CCCCCCCCCCCCGG" >> file.txt
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' file.txt
This other version of the code, which samples characters from the string with repetition, works fine, so it looks like the issue lies in the part that populates the N array. But I'm not proficient in awk, so I'm a bit stuck:
awk '{srand(); len=length($1); for(i=1;i<=10;i++) {k=int(rand()*len)+1; printf "%s", substr($1,k,1)} print ""}'
Can anyone help?
In case this matters: my actual file is more complex than the examples provided here, with several other columns, and unlike the ones in this example, its strings may have different lengths.
Thanks in advance for your time :)
EDIT:
As mentioned in the comments, I managed to make it work by removing the N array (so that it resets before processing each row):
awk 'BEGIN{srand()} {len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} split("", N); print ""}' file.txt
Do note, however, that if the string in $1 is shorter than 10, this will get stuck in an infinite loop, so make sure all strings are at least as long as the subset target size. The alternative solution provided by Andre Wildberg in the comments doesn't have this issue.
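For reference, a minimal sketch of that guard, capping the number of draws at the string length so short strings emit all their characters instead of hanging:
awk 'BEGIN{srand()} {
  len = length($1)
  lim = (len < 10 ? len : 10)   # never draw more positions than the string has
  for (i = 1; i <= lim; ) {
    k = int(rand() * len) + 1
    if (!(k in N)) { N[k]; printf "%s", substr($1, k, 1); i++ }
  }
  split("", N)                  # reset the seen-positions set for the next line
  print ""
}' file.txt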
I would harness GNU AWK for this task in the following way. Let file.txt content be
ABCDEFGHIJKLMN
CCCCCCCCCCCCGG
then
awk 'function comp_func(i1, v1, i2, v2){return rand()-0.5}BEGIN{FPAT=".";PROCINFO["sorted_in"]="comp_func"}{s="";patsplit($0,arr);for(i in arr){s = s arr[i]};print substr(s,1,10)}' file.txt
might give output
NGLHCKEIMJ
CCCCCCCCGG
Explanation: I use a custom array traversal control function which randomly decides which element should be considered greater; -0.5 is used because rand() gives values from 0 to 1. For each line, the array arr is populated with the line's characters, then traversed in random order to build the string s of shuffled characters, and substr is used to keep the first 10 of them. You might elect to add a counter that terminates the for loop early if your lines are very long compared to the number of characters to select (see the sketch below).
(tested in GNU Awk 5.0.1)
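For the record, a sketch of that early-exit counter, also seeding the PRNG in BEGIN since comp_func calls rand():
awk 'function comp_func(i1, v1, i2, v2) { return rand() - 0.5 }
BEGIN { srand(); FPAT = "."; PROCINFO["sorted_in"] = "comp_func" }
{
  s = ""; n = 0
  patsplit($0, arr)                        # one array element per character
  for (i in arr) { s = s arr[i]; if (++n >= 10) break }
  print s
}' file.txt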
Iteratively construct a substring of the remaining letters.
Tested with
awk version 20121220
GNU Awk 4.2.1, API: 2.0
GNU Awk 5.2.1, API 3.2
mawk 1.3.4 20200120
% awk -v size=10 'BEGIN{srand()} {n=length($0); a=$0; x=0;
for(i=1; i<=n; i++){x++; na=length(a); rnd = int(rand() * na + 1)
printf("%s", substr(a, rnd, 1))
a=substr(a, 1, rnd - 1)""substr(a, rnd + 1, na)
if(x >= size){break}}
print ""}' file.txt
CJFMKHNDLA
CGCCCCCCCC
In consecutive runs, remember to check whether srand works the way you expect in your version of awk. If in doubt, seed it yourself, e.g. from $RANDOM or, better, /dev/urandom.
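A sketch of that idea: srand() with no argument seeds from the time of day in seconds, so two runs within the same second can produce identical output. Seeding from the shell avoids this:
seed=$RANDOM                               # bash; or: seed=$(od -An -N2 -tu2 /dev/urandom)
awk -v seed="$seed" 'BEGIN{srand(seed); print rand()}'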
If you don't need to stay strictly within awk, then jot makes it super easy.
Say you want 20 random characters between "A" (ASCII 65) and "N" (ASCII 78), including repeats of the same characters:
jot -s '' -c -r 20 65 78
ANNKECLDMLMNCLGDIGNL
I have normally done this with Excel, but as I am trying to learn bash, I'd like to ask for advice here on how to do so. My input file resembles:
# s0 legend "1001"
# s1 legend "1002"
#target G0.S0
#type xy
2.0 -1052.7396157664
2.5 -1052.7330560932
3.0 -1052.7540013664
3.5 -1052.7780321236
4.0 -1052.7948229060
4.5 -1052.8081313831
5.0 -1052.8190310613
&
#target G0.S1
#type xy
2.0 -1052.5384564253
2.5 -1052.7040374678
3.0 -1052.7542803612
3.5 -1052.7781686744
4.0 -1052.7948927247
4.5 -1052.8081704241
5.0 -1052.8190543049
&
where the above only shows two data sets: s0 and s1. In reality I have 17 data sets and will combine them arbitrarily. By combine, I mean I would like to:
For two data sets, extract the second column of each separately.
Subtract these two columns row by row.
Multiply the difference by a constant, $C.
Note: $C is computed by multiplying very small numbers, and the only way I could keep bc from returning zero was to use a very large scale.
Edit: After requests, it seems I was not entirely clear about what I was going for. Take for example:
set0
2 x
3 y
4 z
set1
2 r
3 s
4 t
I also have defined a constant C.
I would like to perform the following operation:
C*(r - x)
C*(s - y)
C*(t - z)
I will be doing this for sets > 1, up to 16, for example (set 10) minus (set 0). Therefore, I need the flexibility to target a value based on its line number and column number, and preferably acting over a range of line numbers to make it efficient.
So far this works:
C=$(echo "scale=45;x=(small numbers)*(small numbers); x" | bc -l)
sed -n '5,11p' input.in | cut -c 5-20 > tmp1.in
sed -n '15,21p' input.in | cut -c 5-20 > tmp2.in
pr -m -t -s tmp1.in tmp2.in > tmp3.in
awk '{printf $2-$1 "\n"}' tmp3.in > tmp4.in
but the multiplication failed:
awk '{printf "%11.2f\n", "$C"*$1 }' tmp4.in > tmp5.in
returning:
0.00
0.00
0.00
0.00
0.00
0.00
0.00
I have a feeling the whole thing can be accomplished more elegantly with awk. I also tried this:
for (( i=0; i<=6; i++ ))
do
n=5+$i
m=10+n
awk 'NR==n{a=$2};NR==m{b=$2} {printf "%d\n", $b-$a}' input.in > temp.in
done
but all I get in temp.in is a long column of 0s.
I also tried
awk 'NR==5,NR==11{a=$2};NR==15,NR==21{b=$2} {printf "%d\n", $b-$a}' input.in > temp.in
but got the error
awk: (FILENAME=input.in FNR=20) fatal: attempt to access field -1052
Any idea how to formulate this with awk? And if that doesn't work, why can't I multiply with awk above? Thank you!
This does the math in one go:
$ awk -v c=1 '/^&/ {s++}
s==1 {a[$1]=$2}
s==3 {print $1,a[$1],$2,c*(a[$1]-$2)}
/#type/ {s++}' file
2.0 -1052.7396157664 -1052.5384564253 -0.201159
2.5 -1052.7330560932 -1052.7040374678 -0.0290186
3.0 -1052.7540013664 -1052.7542803612 0.000278995
3.5 -1052.7780321236 -1052.7781686744 0.000136551
4.0 -1052.7948229060 -1052.7948927247 6.98187e-05
4.5 -1052.8081313831 -1052.8081704241 3.9041e-05
5.0 -1052.8190310613 -1052.8190543049 2.32436e-05
You can remove the decorations and add print formatting easily. The magic numbers 1 (= 2*g1 - 1, for group g1 = 1) and 3 (= 2*g2 - 1, for group g2 = 2) correspond to data groups 1 and 2 in the order they appear in the data file, and can be converted to awk variables as well; see the sketch below.
The counter s keeps track of whether you're in a set or not: odd values correspond to sets, even values to the gaps between sets. The increment happens both at the start pattern and at the end pattern. The increment statements are ordered so that the pattern lines themselves are not printed (unset first, print set values, reset last). You can change the order and observe the effects.
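A sketch of that parameterization (hypothetical g1/g2 variables; group g is read while s == 2*g - 1):
$ awk -v c=1 -v g1=1 -v g2=2 '
    /^&/          {s++}
    s == 2*g1 - 1 {a[$1] = $2}
    s == 2*g2 - 1 {print $1, a[$1], $2, c * (a[$1] - $2)}
    /#type/       {s++}' file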
This might be what you're looking for:
$ cat tst.awk
/^[#&]/ { lineNr=0; next }
{
++lineNr
if (lineNr in prev) {
print $1, c * ($2 - prev[lineNr])
}
prev[lineNr] = $2
}
$ awk -v c=100000 -f tst.awk file
2.0 20115.9
2.5 2901.86
3.0 -27.8995
3.5 -13.6551
4.0 -6.98187
4.5 -3.9041
5.0 -2.32436
In your first try, you should replace this line:
awk '{printf "%11.2f\n", "$C"*$1 }' tmp4.in > tmp5.in
with this one:
awk -v C=$C '{printf "%11.2f\n", C*$1 }' tmp4.in > tmp5.in
You are mixing bash shell notation with awk notation.
In the shell you define a variable without $, and you use it with $.
Here you are in an awk script: there is no $ for using variables. There are, however, the special variables $1, $2, ..., which refer to fields.
You have put single quotes ' around your awk script, so shell variables can't be used inside it. You wrote $C, but the shell cannot see it inside the single quotes. That is why you have to write awk -v C=$C, so that the shell variable $C is passed into an awk variable called C.
Your other awk attempts show the same kind of error. Now I think you'll manage.
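A two-line demonstration of the difference:
$ C=3.14
$ awk -v C="$C" 'BEGIN { print C * 2 }'    # C is an awk variable
6.28
$ awk 'BEGIN { print "$C" * 2 }'           # "$C" is just a string, numerically 0
0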
Suppose I have a file as follows (a sorted, unique list of integers, one per line):
1
3
4
5
8
9
10
I would like the following output (i.e. the missing integers in the list):
2
6
7
How can I accomplish this within a bash terminal (using awk or a similar solution, preferably a one-liner)?
Using awk you can do this:
awk '{for(i=p+1; i<$1; i++) print i} {p=$1}' file
2
6
7
Explanation:
{p = $1}: the variable p holds the value from the previous record.
{for ...}: we loop from p+1 up to (but excluding) the current row's value and print every number in between, which are exactly the missing values.
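One subtlety worth noting: p is uninitialized (numerically 0) for the first record, so the loop also prints any integers between 1 and the first value. For example:
$ printf '3\n5\n' | awk '{for(i=p+1; i<$1; i++) print i} {p=$1}'
1
2
4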
Using seq and grep:
seq $(head -n1 file) $(tail -n1 file) | grep -vwFf file -
seq creates the full sequence, and grep removes from it the lines that exist in the file.
perl -nE 'say for $a+1 .. $_-1; $a=$_'
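This is the same sliding idea as the awk answer: for each line, say prints every integer from one past the previous value ($a+1) up to one below the current value ($_-1), and $a then remembers the current value. Run it with the file as an argument:
perl -nE 'say for $a+1 .. $_-1; $a=$_' file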
Calling no external program (if filein contains the list of numbers):
#!/bin/bash
i=0
while read num; do
while (( ++i<num )); do
echo $i
done
done <filein
To adapt choroba's clever answer for my own use case, I needed my sequence to deal with zero-padded numbers.
The -w switch to seq is the magic here - it automatically pads the first number with the necessary number of zeroes to keep it aligned with the second number:
-w, --equal-width equalize width by padding with leading zeroes
My integers go from 0 to 9999, so I used the following:
seq -w 0 9999 | grep -vwFf "file.txt"
...which finds the missing integers in a sequence from 0000 to 9999. Or to put it back into the more universal solution in choroba's answer:
seq -w $(head -n1 "file.txt") $(tail -n1 "file.txt") | grep -vwFf "file.txt"
I didn't personally find the - in his answer necessary, but there may be use cases that make it so.
Using Raku (formerly known as Perl_6)
raku -e 'my @a = lines.map: *.Int; say @a.Set (^) @a.minmax.Set;'
Sample Input:
1
3
4
5
8
9
10
Sample Output:
Set(2 6 7)
I'm sure there's a Raku solution similar to @JJoao's clever Perl5 answer, but in thinking about this problem my mind naturally turned to Set operations.
The code above reads lines into the @a array, mapping each line so that elements in the @a array are Ints, not strings. In the second statement, @a.Set converts the array to a Set on the left-hand side of the (^) operator. Also in the second statement, @a.minmax.Set converts the array to a second Set on the right-hand side of the (^) operator, but this time, because the minmax operator is used, all Int elements from the min to the max are included. Finally, the (^) symbol is the symmetric set-difference (infix) operator, which finds the difference.
To get an unordered whitespace-separated list of missing integers, replace the above say with put. To get a sequentially-ordered list of missing integers, add the explicit sort below:
~$ raku -e 'my @a = lines.map: *.Int; .put for (@a.Set (^) @a.minmax.Set).sort.map: *.key;' file
2
6
7
The advantage of all Raku code above is that finding "missing integers" doesn't require a "sequential list" as input, nor is the input required to be unique. So hopefully this code will be useful for a wide variety of problems in addition to the explicit problem stated in the Question.
OTOH, Raku is a Perl-family language, so TMTOWTDI. Below, a @a.minmax range is created and grepped so that none of the elements of @a are returned (a none junction):
~$ raku -e 'my @a = lines.map: *.Int; .put for @a.minmax.grep: none @a;' file
2
6
7
https://docs.raku.org/language/setbagmix
https://docs.raku.org/type/Junction
https://raku.org
I have a file with the same pattern several times.
Something like:
time
12:00
12:32
23:22
time
10:32
1:32
15:45
I want to print the lines after the pattern (time in the example) into several files. The number of lines after the pattern is constant.
I found I can get the first part of my question with awk '/time/ {x=NR+3;next}(NR<=x){print}' filename
But I have no idea how to output each chunk into different files.
EDIT
My files are a bit more complex than my original question.
They have the following format.
4
gen
C -4.141000 -0.098000 0.773000
H -4.528000 -0.437000 -0.197000
H -4.267000 0.997000 0.808000
H -4.777000 -0.521000 1.563000
4
gen
C -4.414000 -0.398000 4.773000
H -4.382000 -0.455000 -4.197000
H -4.267000 0.973000 2.808000
H -4.333000 -0.000000 1.636000
I want to print the lines after
4
gen
EDIT 2
My expected output is x files, where x is the number of occurrences of the pattern.
From my second example, I want two files:
C -4.141000 -0.098000 0.773000
H -4.528000 -0.437000 -0.197000
H -4.267000 0.997000 0.808000
H -4.777000 -0.521000 1.563000
and
C -4.414000 -0.398000 4.773000
H -4.382000 -0.455000 -4.197000
H -4.267000 0.973000 2.808000
H -4.333000 -0.000000 1.636000
You can use this awk command:
awk '/time/{close(out); out="output" ++i; next} {print > out}' file
This awk command builds a variable out from the fixed prefix output and a counter i that is incremented every time we see a time line. All subsequent lines are redirected to this output file. It is good practice to close these file handles to avoid leaking file handles.
PS: If you want the time line in the output as well, remove next from the above command.
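With the time sample above, this produces (the time lines themselves are skipped by next):
$ awk '/time/{close(out); out="output" ++i; next} {print > out}' file
$ head output1 output2
==> output1 <==
12:00
12:32
23:22

==> output2 <==
10:32
1:32
15:45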
The revised "4/gen" requirements are somewhat ambiguous, but the following script (which is just a variant of @anubhava's) conforms to those that are given and can easily be modified to deal with various edge cases:
awk '
/^ *4 *$/ {close(out); out=0; next}
/^ *gen *$/ && out==0 {out = "output." ++i; next}
out {print > out} '
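Saved as, say, split.awk and run as awk -f split.awk file, this creates output.1 and output.2 holding the two coordinate blocks from the second example.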
I found another answer from anubhava here: How to print 5 consecutive lines after a pattern in file using awk.
With head and a for loop, I solved my problem:
for i in {1..23}; do grep -A25 "gen" allTemp | tail -n 25 > xyz$i; head -n -27 allTemp > allTemp2; cp allTemp2 allTemp; done
I know that the file allTemp has 23 occurrences of gen.
head removes the lines already printed to xyz$i, as well as the two lines I don't want, and writes the result to allTemp2.
I have a tab-delimited file and I want to perform some mathematical calculations on its columns.
Let the file name be $sndf and let $tag hold some integer value. I want to first find the difference between the values of columns 3 and 2, then divide the value of column 4 by the value in $tag, then divide that result by the difference between columns 3 and 2, and finally multiply by 100.
cat $sndf | gawk '{for (i = 1; i <= NF; i += 1) {
printf "%f\t" $3 -$2 "\t", (((($4/"'$tag'")/($3-$2)))*100);
} printf "\n"}'>normal_wrt_region
The command writes the answer 4 times instead of once to the output file. Can you suggest an improvement?
Thank you
SOLUTION: Dear all, I have solved the problem. Thank you all for reading it and investing your time.
The command writes the answer 4 times instead of once to the output file; can you suggest an improvement?
Don't use the for loop if you don't need one?
cat $sndf | gawk '{ printf "%f\t" $3 -$2 "\t", (((($4/"'$tag'")/($3-$2)))*100) }'
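For what it's worth, passing the shell variable with -v and keeping the printf format string self-contained avoids both the quoting gymnastics and the accidental concatenation of $3 - $2 into the format string (a sketch; adjust the printed columns to taste):
awk -v tag="$tag" '{ printf "%f\t%f\n", $3 - $2, ($4 / tag) / ($3 - $2) * 100 }' "$sndf" > normal_wrt_region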