I get some monthly stats in the format below.
What I need is the smallest and largest value in each column.
I already use awk to get the table out of a much larger file using this script
awk 'c-->3;/By Day/{c=35; print}' file1.txt
and get the output:
By Day:
Separate user logon counts-(max sessions)-(external counts)-(lock actions):
2013/04/07 - 6 ( 6) ( 37) ( 0)
2013/04/08 - 190 ( 70) (6528) ( 30)
2013/04/09 - 185 ( 68) (5986) ( 29)
2013/04/10 - 213 ( 85) (5571) ( 36)
2013/04/11 - 189 ( 82) (5410) ( 35)
2013/04/12 - 165 ( 69) (5130) ( 25)
2013/04/13 - 16 ( 15) ( 662) ( 0)
2013/04/14 - 20 ( 14) (1016) ( 2)
2013/04/15 - 160 ( 64) (6770) ( 39)
2013/04/16 - 205 ( 96) (5978) ( 25)
2013/04/17 - 197 ( 83) (5816) ( 37)
2013/04/18 - 167 ( 78) (5554) ( 38)
2013/04/19 - 152 ( 71) (5479) ( 29)
2013/04/20 - 18 ( 10) ( 578) ( 1)
2013/04/21 - 11 ( 7) (1018) ( 2)
2013/04/22 - 193 ( 74) (6931) ( 30)
2013/04/23 - 176 ( 66) (6184) ( 23)
2013/04/24 - 192 ( 74) (5891) ( 26)
2013/04/25 - 188 ( 79) (5575) ( 28)
2013/04/26 - 170 ( 75) (5513) ( 26)
2013/04/27 - 17 ( 12) ( 597) ( 0)
2013/04/28 - 17 ( 10) (1021) ( 0)
2013/04/29 - 193 ( 79) (6786) ( 38)
2013/04/30 - 217 ( 87) (6094) ( 36)
2013/05/01 - 185 ( 82) (5706) ( 32)
2013/05/02 - 188 ( 76) (5602) ( 29)
2013/05/03 - 167 ( 63) (5149) ( 21)
2013/05/04 - 22 ( 14) ( 634) ( 1)
2013/05/05 - 21 ( 14) ( 728) ( 1)
2013/05/06 - 2 ( 8) ( 46) ( 0)
Can I edit the awk script to sort by a set column and only display the sorted column and the first column?
The correct way to print the line containing "By Day" and the subsequent 35 lines is:
awk '/By Day/{c=36} c&&c--' file1.txt
Now, post some representative input (and no, we do NOT need it to be 35 lines - make it 5 or less) and the expected output from that input and we can take a look at what you want to do next.
I see from a comment that you want to print the 3 lines before "By Day" too. That on its own would be:
awk '
/By Day/ {
    for (i=0; i<3; i++) {
        j = (NR+i) % 3
        if (j in buf) {
            print buf[j]
        }
    }
}
{ buf[NR%3] = $0 }
' file
so you can combine those as:
awk -v pre=3 -v post=35 '
/By Day/ {
    for (i=0; i<pre; i++) {
        j = (NR+i) % pre
        if (j in buf) {
            print buf[j]
        }
    }
    c = post + 1
}
{ buf[NR%pre] = $0 }
c&&c--
' file
I'm pretty sure your script is programming by coincidence. As it stands, you decrement the variable c and test whether it's greater than 3 on every line of the input; depending on the result, the line is printed, since printing is awk's default action. The second block matches lines containing By Day, but your input contains only a single match. As written, c is initialized to 0 and only ever decremented, meaning the condition c-->3 is never true, and therefore this script will print nothing with the current input!?
awk 'c-->3;/By Day/{c=35; print}' file1.txt
You should post the original file to get help on how to rewrite this script.
Ignoring your awk script and taking your current input I would remove the brackets and use sort. For example to do numerical sort on the fifth column:
$ sed 's/[()]//g' file | sort -nk5 | awk '{print $1,$5}'
Separate sessions-external
2013/04/07 37
2013/05/06 46
2013/04/20 578
2013/04/27 597
2013/05/04 634
2013/04/13 662
2013/05/05 728
2013/04/14 1016
2013/04/21 1018
2013/04/28 1021
2013/04/12 5130
2013/05/03 5149
2013/04/11 5410
2013/04/19 5479
2013/04/26 5513
2013/04/18 5554
2013/04/10 5571
2013/04/25 5575
2013/05/02 5602
2013/05/01 5706
2013/04/17 5816
2013/04/24 5891
2013/04/16 5978
2013/04/09 5986
2013/04/30 6094
2013/04/23 6184
2013/04/08 6528
2013/04/15 6770
2013/04/29 6786
2013/04/22 6931
Edit:
The simplest way to print the 35 lines after the match and the 3 lines before it, if you have GNU grep:
grep -A35 -B3 'By Day' file
Then pipe to sort, using the numeric sort option -n, specify the column with -k, and use cut or awk to grab only the columns you want.
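Combined into one pipeline (a sketch; sorting on column 5 and the file name report.txt are assumptions, and the sample data here is made up):

```shell
# tiny made-up input standing in for the real report
cat > report.txt <<'EOF'
some header
By Day:
2013/04/07 - 6 ( 6) ( 37) ( 0)
2013/04/08 - 190 ( 70) (6528) ( 30)
2013/05/06 - 2 ( 8) ( 46) ( 0)
EOF

# extract the section, strip parentheses, sort numerically on column 5,
# then keep only the date and the sorted column
grep -A35 -B3 'By Day' report.txt | sed 's/[()]//g' | sort -nk5 | awk '{print $1, $5}'
```

The header lines have no fifth field, so they sort to the top of the numeric sort; the data rows come out in ascending order of column 5.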
Related
I have a file containing two columns e.g.
10 25
26 38
40 62
85 65
88 96
97 8
I want first column to contain all minimum values and second column containing all maximum values. Something like this:
10 25
26 38
40 62
65 85
88 96
8 97
Hope this helps.
awk '{
if ($2 < $1)
    print $2, $1;
else
    print $1, $2;
}' filename.txt
Make sure columns are space separated
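The same swap can also be written without the if, using awk's ternary operator (just a sketch on made-up data):

```shell
# min goes in column 1, max in column 2
printf '40 62\n85 65\n97 8\n' |
awk '{print ($1 < $2 ? $1 : $2), ($1 > $2 ? $1 : $2)}'
```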
Using Python, it is straightforward:
values = [
(10, 25),
(26, 38),
(40, 62),
(85, 65),
(88, 96),
(97, 8),
]
result = [(min(v), max(v)) for v in values]
You get:
[(10, 25), (26, 38), (40, 62), (65, 85), (88, 96), (8, 97)]
Using bash… I don't know:
python -c "<your command here>"
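For what it's worth, one way to fill that in (a sketch; assumes python3 is on the PATH and the pairs arrive on stdin):

```shell
# read "a b" pairs from stdin, print "min max" for each
printf '10 25\n85 65\n97 8\n' | python3 -c '
import sys
for line in sys.stdin:
    a, b = map(int, line.split())
    print(min(a, b), max(a, b))
'
```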
I have tried this
#!/bin/sh
echo -n "Enter a file name > "
read name
if ["$1" -gt "$2"];
then
awk ' { t= $1; $1 = $2; $2=t ; print; } ' $name
fi
exit 0;
I made this awk command in a shell script to count total occurrences of the $4 and $5.
awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l
awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat tc.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l
The output is a number (####) in the shell. But I want to get rid of the > ag.txt && cat ag.txt | wc -l part and instead get output in the shell like AG = ####.
This is input format:
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
I want output like this, in the shell or in a file, counting only these occurrences and not other patterns:
AG 2
CT 1
TC 1
TA 1
Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
n++ increments a counter whenever the condition is matched.
The magic condition END is true after the last line of input has been processed.
Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you might want to confirm whether you really need -F" ". By default, awk splits on whitespace. This option would only be required if your fields contain embedded tabs, I think.
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
The first statement runs on every line, and increments a counter that is an entry in an array (a[]) whose key is built from $4 and $5.
In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
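If a predictable order matters, one option (a sketch) is to leave the script alone and pipe its output through sort:

```shell
# same pair-counting idea as above, on a tiny made-up sample
printf '>seq1 284 284 A G\n>seq1 266 266 C T\n>seq1 194 194 A G\n' |
awk '{a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' | sort
```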
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record "valid" records. The condition on the counter simply verifies that both $4 and $5 contain members of the array. All else works the same.
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
Much easier to read that way.
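To run it stand-alone you'd save it, mark it executable, and point it at the data, or just use awk -f (count_pairs.awk is a hypothetical name):

```shell
# write the script body to a file (name is arbitrary)
cat > count_pairs.awk <<'EOF'
BEGIN { v["G"]; v["A"]; v["C"]; v["T"] }
$4 in v && $5 in v { a[$4 $5]++ }
END { for (p in a) printf("%s %d\n", p, a[p]) }
EOF

# made-up sample: one AG, one non-pair, one CT
printf '>seq1 1 1 A G\n>seq1 1 1 T -\n>seq1 1 1 C T\n' | awk -f count_pairs.awk
```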
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
This only counts pairs that already have array indices, which were created (with empty values) in the BEGIN block.
The parentheses in the increment condition are not required, and are included only for clarity.
Just count them all then print the ones you care about:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0
If that's desirable, check if other solutions do the same.
Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:
$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1
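Note that the {2} interval expression needs a reasonably modern awk (older gawk wanted --re-interval); a portable spelling of the same single-upper-case-letter check would be something like:

```shell
# keep only rows where $4 and $5 concatenate to exactly two upper-case letters
printf 'x 1 1 A G\nx 1 1 T -\nx 1 1 T AAA\n' |
awk '$4 $5 ~ /^[[:upper:]][[:upper:]]$/ {cnt[$4 $5]++} END {for (i in cnt) print i, cnt[i]}'
```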
I have a daily file output on a Linux system like the one below, and was wondering: is there a way to group the data in 30-minute increments based on $1, averaging $3 and summing $4 $5 $6 $7 $8, via a shell script using awk/gawk or something similar?
04:04:13 04:10:13 2.13 36 27 18 18 0
04:09:13 04:15:13 2.37 47 38 13 34 0
04:14:13 04:20:13 2.19 57 37 23 33 1
04:19:13 04:25:13 2.43 43 35 13 30 0
04:24:13 04:30:13 2.29 48 40 19 28 1
04:29:13 04:35:13 2.33 56 42 16 40 0
04:34:13 04:40:13 2.21 62 47 30 32 0
04:39:13 04:45:13 2.25 44 41 19 25 0
04:44:13 04:50:13 2.20 65 50 32 33 0
04:49:13 04:55:13 2.47 52 38 16 36 0
04:54:13 05:00:13 2.07 72 54 40 32 0
04:59:13 05:05:13 2.35 53 41 19 34 0
so basically this hour of data would result in something like this:
04:04:13-04:29:13 2.29 287 219 102 183 2
04:34:13-04:59:13 2.25 348 271 156 192 0
this is what I have gotten so far using awk to search between the time frames, but I think there is an easier way to get the grouping done without awking each 30-minute interval
awk '$1>=from&&$1<=to' from="04:00:00" to="04:30:00" | awk '{ total += $3; count++ } END { print total/count }' | awk '{printf "%0.2f\n", $1}'
awk '$1>=from&&$1<=to' from="04:00:00" to="04:30:00" | awk '{ sum+=$4} END {print sum}'
This should do what you want:
{
    split($1, times, ":");
    i = 2 * times[1];
    if (times[2] >= 30) i++;
    if (!start[i] || $1 < start[i]) start[i] = $1;
    if (!end[i] || $1 > end[i]) end[i] = $1;
    count[i]++;
    for (col = 3; col <= 8; col++) {
        data[i, col] += $col;
    }
}
END {
    for (i = 0; i < 48; i++) {
        if (start[i]) {
            data[i, 3] = data[i, 3] / count[i];
            printf("%s-%s %.2f", start[i], end[i], data[i, 3]);
            for (col = 4; col <= 8; col++) {
                printf(" " data[i, col]);
            }
            print "";
        }
    }
}
As you can see, I divide the day into 48 half-hour intervals and place the data into one of these bins depending on the time in the first column. After the input has been exhausted, I print out all bins that are not empty.
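The bin-index arithmetic can be checked on its own (a sketch; the times are made up):

```shell
# hour*2, plus 1 when minutes >= 30, gives the half-hour bin index (0..47)
printf '04:04:13\n04:34:13\n23:59:59\n' |
awk -F: '{ i = 2*$1 + ($2 >= 30 ? 1 : 0); print $0, i }'
```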
Personally, I would do this in Python or Perl. In awk, arrays are not ordered (well, in gawk you could use asorti to sort the array...), which makes printing ordered buckets more work.
Here is the outline:
Read input
Convert the time stamp to seconds
Add to an ordered (or sortable) associative array of the data elements in buckets of the desired time frame (or, just keep running totals).
After the data is read, process as you wish.
Here is a Python version of that:
#!/usr/bin/python3
from collections import OrderedDict
import fileinput

times = []
interval = 30 * 60
od = OrderedDict()
for line in fileinput.input():
    li = line.split()
    # convert HH:MM:SS to seconds since midnight
    secs = sum(x * y for x, y in zip([3600, 60, 1], map(int, li[0].split(":"))))
    times.append([secs, [li[0], float(li[2])] + [int(x) for x in li[3:]]])

current = times[0][0]
for t, li in times:
    if t - current < interval:
        od.setdefault(current, []).append(li)
    else:
        current = t
        od.setdefault(current, []).append(li)

for s, LoL in od.items():
    avg = sum(e[1] for e in LoL) / len(LoL)
    sums = [sum(e[i] for e in LoL) for i in range(2, 7)]
    print("{}-{} {:.3} {}".format(LoL[0][0], LoL[-1][0], avg, ' '.join(map(str, sums))))
Running that on your example data:
$ ./ts.py ts.txt
04:04:13-04:29:13 2.29 287 219 102 183 2
04:34:13-04:59:13 2.26 348 271 156 192 0
The advantage is that you can easily change the interval, and a similar technique can handle timestamps that span more than a day.
If you really want awk you could do:
awk 'BEGIN{ interval=30*60 }
function fmt(){
line=sprintf("%s-%s %.2f %i %i %i %i %i", ls, $1, sums[3]/count,
sums[4], sums[5], sums[6], sums[7], sums[8])
}
{
split($1,a,":")
secs=a[1]*3600+a[2]*60+a[3]
if (NR==1) {
low=secs
ls=$1
count=0
for (i=3; i<=8; i++)
sums[i]=0
}
for (i=3; i<=8; i++){
sums[i]+=$i
}
count++
if (secs-low<interval) {
fmt()
}
else {
print line
low=secs
ls=$1
count=1
for (i=3; i<=8; i++)
sums[i]=$i
}
}
END{
fmt()
print line
}' file
04:04:13-04:29:13 2.29 287 219 102 183 2
04:34:13-04:59:13 2.26 348 271 156 192 0
I would like to add two additional conditions to the code I have: print '+' only if, in File2, field 5 is greater than 35 and field 7 is also greater than 90.
Code:
while read -r line
do
grep -q "$line" File2.txt && echo "$line +" || echo "$line -"
done < File1.txt
Input file 1:
HAPS_0001
HAPS_0002
HAPS_0005
HAPS_0006
HAPS_0007
HAPS_0008
HAPS_0009
HAPS_0010
Input file 2 (tab-delimited):
Query DEG_ID E-value Score %Identity %Positive %Matching_Len
HAPS_0001 protein:plasmid:149679 3.00E-67 645 45 59 91
HAPS_0002 protein:plasmid:139928 4.00E-99 924 34 50 85
HAPS_0005 protein:plasmid:134646 3.00E-98 915 38 55 91
HAPS_0006 protein:plasmid:111988 1.00E-32 345 33 54 86
HAPS_0007 - - 0 0 0 0
HAPS_0008 - - 0 0 0 0
HAPS_0009 - - 0 0 0 0
HAPS_0010 - - 0 0 0 0
Desired output (tab-delimited):
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -
Thanks!
This should work:
$ awk '
BEGIN {FS = OFS = "\t"}
NR==FNR {if($5>35 && $7>90) a[$1]++; next}
{print (($1 in a) ? $0 FS "+" : $0 FS "-")}' f2 f1
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -
join file1.txt <( tail -n +2 file2.txt) | awk '
$2 = ($5 > 35 && $7 > 90)?"+":"-" { print $1, $2 }'
You don't care about the second field in the output, so overwrite it with the appropriate sign. (Since the assigned value, "+" or "-", is a non-empty string, the pattern is true for every line and the print action always runs.)
How can I add 6 more single-space-separated columns to a file?
The input file that looks like this:
-11.160574
...
-11.549076
-12.020907
...
-12.126601
...
-11.93235
...
-8.297653
Where ... represents 50 more lines of numbers.
The output I want is this:
-11.160574 1 1 1 1 1 14
...
-11.549076 51 51 1 1 1 14
-12.020907 1 1 2 2 1 14
...
-12.126601 51 51 2 2 1 14
...
-11.93235 1 1 51 51 1 14
...
-8.297653 51 51 51 51 1 14
The 2nd and 3rd columns loop from 1 to 51.
The 4th and 5th columns also loop from 1 to 51, but at the level above (they advance once per full cycle of the inner pair).
The last two are constant columns of 1 and 14.
Use a loop to read the file line-by-line and maintain counters to keep track of the field numbers as shown below:
#!/bin/bash
field1=1
field2=1
while IFS= read -r line
do
echo "$line $field1 $field1 $field2 $field2 1 14"
(( field1++ ))
if (( $field1 == 52 )); then
field1=1
(( field2++ ))
fi
done < file
Here you go, an awk script:
{
mod = 51
a = (NR - 1) % mod + 1
b = int((NR - 1) / mod) + 1
c = 1
d = 14
print $0,a,a,b,b,c,d
}
Run it with something like awk -f the-script.awk in-file.txt. Or make it executable and add #!/usr/bin/awk -f at the top, and you can run it directly without typing awk -f.