Hello, I use the following awk script to split a file:
BEGIN{body=0}
!body && /^\/\/$/ {body=1}
body && /^\[/ {print > "first_"FILENAME}
body && /^pos/{$1="";print > "second_"FILENAME}
body && /^[01]+/ {print > "third_"FILENAME}
body && /^\[[0-9]+\]/ {
    print > "first_"FILENAME
    print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}
The file looks like this:
header
//
SeqT: {"POS-s":174.683, "time":0.0130084}
SeqT: {"POS-s":431.49, "time":0.0221447}
[2.04545e+2]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);
absolute:
gthcont: 5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
01010010101010101010101010101011111100011
1111010010010101010101010111101000100000
00000000000000011001100101010010101011111
The problem is that in the fourth rule, print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME, a number in scientific notation (with an e) does not get through. It works only as long as the number is written without the exponent. How can I change the awk so it also captures numbers written like 2.7e+7?
The problem is you're trying to match E notation when your regex is looking for integers only.
Instead of:
/^\[[0-9]+\]/
use something like:
/^\[[0-9]+(\.[0-9]+(e[+-]?[0-9]+)?)?\]/
This will match positive integers, floats, and E notation wrapped in square brackets at the start of the line.
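Applied to the splitting script, only the fourth rule's pattern needs to change; a minimal sketch, with everything else kept as in the question:

body && /^\[[0-9]+(\.[0-9]+(e[+-]?[0-9]+)?)?\]/ {
    print > "first_"FILENAME
    print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}

With this, the line starting [2.04545e+2] writes 2.04545e+2 to the fourth file, and [29] still writes 29.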
I have a shell script that gives me a text file output in the following format:
OUTPUT.TXT
FirmA
58
FirmB
58
FirmC
58
FirmD
58
FirmE
58
This output is good, i.e. a YES that my job completed as expected, since the value for all of the firms is 58.
So I used to count the occurrences of '58' in this text file to automatically tell a RESULT job that everything worked out well.
Now there seems to be a bug due to which the output sometimes comes out like below:
OUTPUT.TXT
FirmA
58
FirmB
58
FirmC
61
FirmD
58
FirmE
61
which impacts my count (only 3 counts of 58 instead of the expected 5), and hence my RESULT job states that it FAILED, i.e. a NO.
But actually the job has worked fine as long as the value stays within 58 to 61 for each firm.
So how can I check that the value is >=58 and <=61 for each of these five firms, and treat that as having worked as expected?
My simple one-liner to check the count in the OUTPUT.TXT file:
grep -cow 58 "OUTPUT.TXT"
Try Awk for simple jobs like this. You can learn enough in an hour to solve these problems yourself easily.
awk '(NR % 2 == 0) && ($1 < 58 || $1 > 61)' OUTPUT.TXT
This checks every second line (the value lines) and prints any which are not in the range 58 to 61.
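If you only need a single YES/NO for the RESULT job, here is a minimal sketch using the exit status (assuming the two-line firm/value layout shown above):

awk '(NR % 2 == 0) && ($1 < 58 || $1 > 61) { bad = 1 }
END { exit bad }' OUTPUT.TXT && echo YES || echo NO

Exit status 0 means every value was in range.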
It would not be hard to extend the script to remember the string from the previous line. In fact, let's do that.
awk '(NR % 2 == 1) { firm = $0; next }
($1 < 58 || $1 > 61) { print NR ":" firm, $0 }' OUTPUT.TXT
You might also want to check how many you get of each. But let's just make a separate script for that.
awk '(NR % 2 == 0) { ++a[$1] }
END { for (k in a) print k, a[k] }' OUTPUT.TXT
The Stack Overflow awk tag info page has links to learning materials etc.
I have a big file whose entries are like this:
Input:
1113
1113456
11134567
12345
1734
123
194567
From these entries, I need to find the minimum set of prefixes that can represent all the entries.
Expected output:
1113
123
1734
194567
If we have 1113 then there is no need to keep 1113456 or 11134567.
Things I have tried:
I can use grep -v ^123, compare with the input file, and store the unique results in the output file. If I use a while loop, I don't know how I can delete the entries from the input file itself.
I will assume that the input file is:
790234
790835
795023
79788
7985904
7902713
791
7987
7988
709576
749576
7902712
790856
79780
798599
791453
791454
791455
791456
791457
791458
791459
791460
You can use
awk '!(prev && $0~prev){prev = "^" $0; print}' <(sort file)
Returns
709576
749576
790234
7902712
7902713
790835
790856
791
795023
79780
79788
7985904
798599
7987
7988
How does it work? First it sorts the file lexicographically (1 < 10 < 2). Then it keeps the current minimal prefix and checks whether the next lines match it. If they do, they are skipped. If a line doesn't, the minimal prefix is updated and the line is printed.
Let's say the input is:
71
82
710
First it orders the lines, and the input becomes (lexicographic sort: 71 < 710 < 82):
71
710
82
The first line is printed because the awk variable prev is not set, so the condition !(prev && $0~prev) is true. prev becomes ^71. On the next row, 710 matches the regexp ^71, so the line is skipped and prev stays ^71. On the next row, 82 does not match ^71, the condition !(prev && $0~prev) is true again, the line is printed, and prev is set to ^82.
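Running the same command on that three-line input therefore prints only the two minimal prefixes:

$ awk '!(prev && $0~prev){prev = "^" $0; print}' <(sort file)
71
82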
You may use this awk command:
awk '{
  # keep n if the current (sorted) line still starts with it, else reset it
  n = (n != "" && index($1, n) == 1 ? n : $1)
}
p != n {
  # print each minimal prefix once, when it first changes
  print p = n
}' <(sort file)
1113
123
1734
194567
$ awk 'NR==1 || (index($0,n)!=1){n=$0; print}' <(sort file)
1113
123
1734
194567
I have a file containing durations with an 's' part, and I need to convert that part into 'ms' by multiplying by 1000. I am unable to do it. Please help me.
file.txt
First launch 1
App: +1s170ms
First launch 2
App: +186ms
First launch 3
App: +1s171ms
First launch 4
App: +1s484ms
First launch 5
App: +1s227ms
First launch 6
App: +204ms
First launch 7
App: +1s180ms
First launch 8
App: +1s177ms
First launch 9
App: +1s183ms
First launch 10
App: +1s155ms
My code:
awk 'BEGIN { FS="[: ]+"}
/:/ && $2 ~/ms$/{vals[$1]=vals[$1] OFS $2+0;next}
END {
for (key in vals)
print key vals[key]
}' file.txt
Expected output:
App 1170 186 1171 1484 1227 204 1180 1177 1183 1155
Output Coming:
App 1 186 1 1 1 204 1 1 1 1
How can I convert the 's' part into 'ms' when a value like 1s170ms appears?
What I will try to do here is explain it a bit generically and then apply it to your case.
Question: I have a string of the form 123a456b7c8d where the numbers are numeric integral values of any length and the letters are corresponding units. I also have conversion factors to convert from unit a,b,c,d to unit f. How can I convert this to a single quantity of unit f?
Example: from 1s183ms to 1183ms
Strategy:
per string, create a set of key-value pairs 'a' => 123, 'b' => 456, 'c' => 7 and 'd' => 8
multiply each value by the correct conversion factor
add the numbers together
Assume we use awk and the key-value pairs are stored in array a with the key as an index.
Extract key-value pairs from str:
function extract(str,a,  t,k,v) {
    delete a; t = str
    while (t != "") {
        v = t+0                          # numeric value at the front of t
        match(t, /[a-zA-Z]+/)            # locate the unit that follows it
        k = substr(t, RSTART, RLENGTH)   # the unit, e.g. "s" or "ms"
        t = substr(t, RSTART+RLENGTH)    # continue after the unit
        a[k] = v                         # store unit => value
    }
    return
}
Convert and sum: here we assume we have an array f which contains the conversion factors:
function convert(a,f,  t,k) {
    t = 0; for (k in a) t += a[k] * f[k]
    return t
}
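For instance, for the string 1s183ms, extract fills a["s"]=1 and a["ms"]=183, and with f["s"]=1000 and f["ms"]=1, convert returns 1*1000 + 183*1 = 1183.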
The full code (for the OP's example):
# set conversion factors
BEGIN { f["s"] = 1000; f["ms"] = 1 }
# print first word
BEGIN { printf "App:" }
# extract string and print
/^App/ { extract($2, a); printf OFS "%dms", convert(a, f) }
END { printf ORS }
which outputs:
App: 1170ms 186ms 1171ms 1484ms 1227ms 204ms 1180ms 1177ms 1183ms 1155ms
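If you want exactly the OP's expected output (numbers without units), the same idea collapses into a shorter script. A sketch, assuming each App: line carries a single +XsYms or +Yms value in the second field:

awk '/^App:/ {
    ms = 0
    if (match($2, /[0-9]+s/))  ms += substr($2, RSTART, RLENGTH-1) * 1000
    if (match($2, /[0-9]+ms/)) ms += substr($2, RSTART, RLENGTH-2)
    printf "%s%d", (n++ ? OFS : "App "), ms
}
END { print "" }' file.txt

which prints

App 1170 186 1171 1484 1227 204 1180 1177 1183 1155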
perl -n -e '$s=0; ($s)=/(\d+)s/; ($ms)=/(\d+)ms/;
s/^(\w+):/push @{$vals{$1}}, $ms+$s*1000/e;
eof && print "$_: @{$vals{$_}}\n" for keys %vals;' file
perl -n doesn't print anything as it loops through the input.
$s and $ms are set from those fields. $s is reset to zero first so a line without an s part doesn't carry over the previous value.
s///e stuffs the %vals hash with a list of numbers in ms for each key (App, in this case).
eof && executes the subsequent code after the end of the file.
print "$_: #{$vals{$_}}\n" for keys %vals is printing the %vals hash as the OP wants.
App: 1170 186 1171 1484 1227 204 1180 1177 1183 1155
I have a large file that I would like to break into chunks by field 2. Field 2 ranges in value from about 0 to about 250 million.
1 10492 rs55998931 C T 6 7 3 3 - 0.272727272727273 0.4375
1 13418 . G A 6 1 2 3 DDX11L1 0.25 0.0625
1 13752 . T C 4 4 1 3 DDX11L1 0.153846153846154 0.25
1 13813 . T G 1 4 0 1 DDX11L1 0.0357142857142857 0.2
1 13838 rs200683566 C T 1 4 0 1 DDX11L1 0.0357142857142857 0.2
I want field 2 to be broken up into intervals of 50,000, but overlapping by 2,000. For example, the first three awk commands would look like:
awk '$1=="1" && $2>=0 && $2<=50000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.0kb.50kb
awk '$1=="1" && $2>=48000 && $2<=98000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.48kb.98kb
awk '$1=="1" && $2>=96000 && $2<=146000{print$0}' Highalt.Lowalt.allelecounts.filteredformissing.freq > chr1.96kb.146kb
I know that there's a way I can do this using a for loop with variables like i and j. Can someone help me out?
awk '$1=="1"{n=int($2/48000); print>("chr1." (48*n) "kb." (48*n+50) "kb");n--; if (n>=0 && $2/1000<=48*n+50) print>("chr1." (48*n) "kb." (48*n+50) "kb");}' Highalt.Lowalt.allelecounts.filteredformissing.freq
Or spread out over multiple lines:
awk '$1=="1"{
n=int($2/48000)
print>("chr1." (48*n) "kb." (48*n+50) "kb")
n--
if (n>=0 && $2/1000<=48*n+50)
print>("chr1." (48*n) "kb." (48*n+50) "kb")
}' Highalt.Lowalt.allelecounts.filteredformissing.freq
How it works
$1=="1"{
This selects all lines whose first field is 1. (You didn't mention this in the text, but your code applied this restriction.)
n=int($2/48000)
This computes which bucket the line belongs in. Buckets start every 48,000 but are 50,000 wide, which produces the 2,000 overlap.
print>("chr1." (48*n) "kb." (48*n+50) "kb")
This writes the line to the appropriate file.
n--
This decrements the bucket number.
if (n>=0 && $2/1000<=48*n+50) print>("chr1." (48*n) "kb." (48*n+50) "kb")
If this line also fits within the overlapping range of the previous bucket, then write it to that bucket also.
}
This closes the group started by selecting $1=="1".
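If you would rather generate the windows with the explicit i/j loop the question mentions, here is a shell sketch; note it rereads the big file once per window, so the single-pass awk above is far more efficient:

#!/bin/bash
# Sketch: 50kb windows starting every 48kb (2kb overlap), up to 250 million.
for ((i=0; i<=250000000; i+=48000)); do
    j=$((i + 50000))
    awk -v lo="$i" -v hi="$j" '$1=="1" && $2>=lo && $2<=hi' \
        Highalt.Lowalt.allelecounts.filteredformissing.freq > "chr1.$((i/1000))kb.$((j/1000))kb"
done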
I have a tab-delimited file, and each time the first column contains a number rather than a character I want to add a certain decimal amount (1.5) to a running sum, printing the running total for every row from the first to the last.
I have example file which look like this:
It has 8 rows
1st-Column 2nd-Column
a ship
1 name
b school
c book
2 blah
e blah
3 ...
6 ...
Now I want my script to read line by line and, whenever it finds a number in the first column, add 1.5 to the sum and print the running total for every row, like this:
0
1.5
1.5
1.5
3
3
4.5
9
my script is:
#!/bin/bash
for c in {1..8}
do
awk 'NR==$c { if (/^[0-9]/) sum+=1.5} END {print sum }' file
done
but I don't get any output!
Thanks for your help in advance.
The last item in your expected output appears to be incorrect. If it is, then you can do:
$ awk '$1~/^[[:digit:]]+$/{sum+=1.5}{print sum+0}' file
0
1.5
1.5
1.5
3
3
4.5
6
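An aside on why the original loop printed nothing useful: inside single quotes the shell never expands $c, so awk evaluates $c with its own unset variable c, which is field $0; NR==$0 is never true for these lines, so sum stays unset and each of the eight awk runs prints only a blank line from END.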
use warnings;
use strict;
my $sum = 0;
while (<DATA>) {
my $data = (split)[0]; # 1st column
$sum += 1.5 if $data =~ /^\d+$/;
print "$sum\n";
}
__DATA__
a ship
1 name
b school
c book
2 blah
e blah
3 ...
6 ...
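A usage note: to run this against the file on disk instead of the inline __DATA__ block, change while (<DATA>) to while (<>) and pass the file name, e.g. perl sum.pl file, where sum.pl is a hypothetical name for the script above.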
Why not just use awk:
awk '{if (/^[0-9]+[[:blank:]]/) sum+=1.5} {print sum+0 }' file
Edited to simplify based on jaypal's answer, to bound the number, and to work with both tabs and spaces.
How about
perl -lane 'next unless $F[0]=~/^\d+$/; $c+=1.5; END{print $c}' file
Or
awk '$1~/^[0-9]+$/{c+=1.5}END{print c}' file
These only produce the final sum, as your script would have done. If you want to show the numbers as they grow, use:
perl -lane 'BEGIN{$c=0}$c+=1.5 if $F[0]=~/^\d+$/; print "$c"' file
Or
awk 'BEGIN{c=0}{if($1~/^[0-9]+$/){c+=1.5}{print c}}' file
I'm not sure if you're multiplying the first field by 1.5 or adding 1.5 to a sum every time there's any number in $1 and ignoring the contents of the line otherwise. Here's both in awk, using your sample data as the contents of "file."
$ awk '$1~/^[0-9]+$/{val=$1*1.5}{print val+0}' file
0
1.5
1.5
1.5
3
3
4.5
9
$ awk '$1~/^[0-9]+$/{sum+=1.5}{print sum+0}' file
0
1.5
1.5
1.5
3
3
4.5
6
Or, here you go in ksh (ksh93 does floating-point arithmetic natively; stock bash does not, so this won't port to bash as-is), assuming the data is on STDIN:
#!/usr/bin/ksh
sum=0
while read a b
do
[[ "$a" == +([0-9]) ]] && (( sum += 1.5 ))
print $sum
done
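A usage sketch, assuming the script is saved as check.ksh (a hypothetical name) and given the sample file on STDIN:

./check.ksh < file

It prints the running sum after each line, ending with 6.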