Basically, I've two columns. First one stands for the users and the second one for the time they've spent on the server. So I'd like to sum for each client, how many minutes did he spend on the server.
user1 21:03
user2 19:55
user3 20:09
user1 18:57
user1 19:09
user3 21:05
user4 19:57
Let's say that I've this. I know how to split but there's one problem. Whenever I do awk -F: '{print $1} it prints users and the first parameter of the time (the number before :), and when I do awk -F: '{print $2} it prints only the numbers after :. After all of the sum, I'd like to get something like
user1 59:09
user2 19:55
user3 41:14
user4 19:57
Here's a possible solution:
perl -ne '/^(\S+) (\d\d):(\d\d)$/ or next; $t{$1} += $2 * 60 + $3; END { printf "%s %02d:%02d\n", $_, $t{$_} / 60, $t{$_} % 60 for sort keys %t }'
Or with better formatting:
perl -ne '
/^(\S+) (\d\d):(\d\d)$/ or next;
$t{$1} += $2 * 60 + $3;
END {
printf "%s %02d:%02d\n", $_, $t{$_} / 60, $t{$_} % 60
for sort keys %t;
}
'
We loop over all input lines (-n). We make sure every line matches the pattern \S+ \d\d:\d\d (i.e. a sequence of 1 or more non-space characters, a space, two digits, a colon, two digits) or else we skip it.
We accumulate the number of seconds per user in the hash %t. The keys are the user names, the values are the numbers.
At the end we print the contents of %t in a nicely formatted way.
this is an awk solution
cat 1.txt | awk '{a[$1]+=substr($2,0,2)*60+substr($2,4)} END {for(i in a) printf("%s %02d:%02d\n", i,a[i]/60,a[i]%60)}'
user1 59:09
user2 19:55
user3 41:14
user4 19:57
first construct an array with index=$1, and the value = convert the time to an integer by minutes * 60 + seconds
{a[$1]+=substr($2,0,2)*60+substr($2,4)}
then print the array in the desired format which converts integer to a mi:ss format.
printf("%s %02d:%02d\n", i,a[i]/60,a[i]%60)
If you want to use awk (and assuming the duration is always hh:mm, though their sizes can be arbitrary), the following will do the trick:
{
split($2, flds, ":") # Get hours and minutes.
mins[$1] += flds[1] * 60 + flds[2] # Add to initially zero array item.
}
END {
for (key in mins) { # For each key in array.
printf "%s %d:%02d\n", # Output specific format.
key, # Key, hours, and minutes.
mins[key] / 60,
mins[key] % 60
}
}
That's the expanded, readable variant, the compressed one is shown in the following transcript, along with the output as expected:
pax> awk '{split($2,flds,":");mins[$1] += flds[1] * 60 + flds[2]}END{for(key in mins){printf "%s %d:%02d\n",key,mins[key]/60,mins[key]%60}}' testprog.in
user1 59:09
user2 19:55
user3 41:14
user4 19:57
Just keep in mind you haven't specified the input format whenever a user entry has more than 24 hours. If it goes something like 25:42, the script will work as is.
If instead it decides to break out days (into something like 1:01:42 rather than 25:42), you'll need to adjust how the minutes are calculated. This can be done relatively easily (including the possibility for minutes-only entries) by checking the flds array size with (in the main body of the script, the non-END bit):
num = split($2, flds, ":")
if (num == 1) { add = flds[1] }
else if (num == 2) { add = flds[1] * 60 + flds[2] }
else { add = flds[1] * 1440 + flds[2] * 60 + flds[3] }
mins[$1] += add
Related
I need small help related to Unix shell script using awk.
I have a file like below:
139341 8.61248 python_dev ntoma2 r 07/17/2017 07:27:43 gpuml#acepd1641.udp.finco.com 1
139342 8.61248 python_val ntoma2 r 07/17/2017 07:27:48 gpuml#acepd1611.udp.finco.com 1
139652 8.61248 python_dev ntoma2 r 07/17/2017 10:55:57 gpuml#acepd1671.udp.finco.com 1
Which is space separated. I need to get 1st col and 4th col which are job-id and user-name(ntoma2 in this case) based on 6th col (which is date in date formate - mm/dd/yyyy), older than 7days. Compare 6th column with current date and I need to get cols which are older than 7days.
I have below one to get Job id and user name of older than 7 days:
cat filename.txt | awk -v dt="$(date "--date=$(date) -7 day" +%m/%d/%Y)" -F" " '/qw/{ if($6<dt) print $4,":",$1 }' >> ./longRunningJob.$$
Also i have another command to get email ids like below using user-name (from the above 4th col):
/ccore/pbis/bin/enum-members "adsusers" | grep ^UNIX -B3 | grep <User-Name> -B2 | grep UPN | awk '{print $2}'
I need to combined above 2 commands and need to send a report to every user as like below:
echo "Hello <User Name>, There is a long running job which is of job-id: <job-id> more than 7days, so please kill the job or let us know if we can help. Thank you!" | mailx -s "Long Running Job"
NOTE: if user name repeated, all the list should go in one email.
I am not sure how can i combine these 2 and send email to user, can some one please help me?
Thank you in advance!!
Vasu
You can certainly do this in awk -- easier in gawk because of date support.
Just to give you an outline of how to do this, I wrote this in Ruby:
$ cat file
139341 8.61248 python_dev ntoma2 r 07/10/2017 07:27:43 gpuml#acepd1641.udp.finco.com 1
139342 8.61248 python_val ntoma2 r 07/09/2017 07:27:48 gpuml#acepd1611.udp.finco.com 1
139652 8.61248 python_dev ntoma2 r 07/17/2017 10:55:57 gpuml#acepd1671.udp.finco.com 1
$ ruby -lane 'BEGIN{ require "date"
jobs=Hash.new { |h,k| h[k]=[] }
users=Hash.new()
pn=7.0
}
t=DateTime.parse("%s %s" % [$F[5].split("/").rotate(-1).join("-"), $F[6]])
ti_days=(DateTime.now-t).to_f
ts="%d days, %d hours, %d minutes and %d seconds" % [60,60,24]
.reduce([ti_days*86400]) { |m,o| m.unshift(m.shift.divmod(o)).flatten }
users[$F[3]]=$F[7]
jobs[$F[3]] << "Job: %s has been running %s" % [$F[0], ts] if (DateTime.now-t).to_f > pn
END{
jobs.map { |id, v|
w1,w2=["is a","job"]
w1,w2=["are","jobs"] if v.length>1
s="Hello #{id}, There #{w1} long running #{w2} running more than the policy of #{pn.to_i} days. Please kill the #{w2} or let us know if we can help. Thank you!\n\t" << v.join("\n\t")
puts "#{users[id]} \n#{s}"
# s is the formated email address and body. You take it from here...
}
}
' /tmp/file
gpuml#acepd1671.udp.finco.com
Hello ntoma2, There are long running jobs running more than the policy of 7 days. Please kill the jobs or let us know if we can help. Thank you!
Job: 139341 has been running 11 days, 9 hours, 28 minutes and 44 seconds
Job: 139342 has been running 12 days, 9 hours, 28 minutes and 39 seconds
I got the Solution, but there is a bug in it, here is the solution:
!#/bin/bash
{ qstat -u \*; /ccore/pbis/bin/enum-members "adsusers"; } | awk -v dt=$(date "--date=$(date) -7 day" +%m/%d/%Y) '
/^User obj/ {
F2 = 1
FS = ":"
T1 = T2 = ""
next
}
!F2 {
if (NR < 3) next
if ($5 ~ "qw" && $6 < dt) JID[$4] = $1 "," JID[$4]
next
}
/^UPN/ {T1 = $2
}
/^Display/ {T2 = $2
}
/^Alias/ {gsub (/ /, _, $2)
EM[$2] = T1
DN[$2] = T2
}
END {for (j in JID) {print "echo -e \"Hello " DN[j] " \\n \\nJob(s) with job id(s): " JID[j] " executing more than last 7 days, hence request you to take action, else job(s) will be killed in another 1 day \\n \\n Thank you.\" | mailx -s \"Long running job for user: " DN[j] " (" j ") and Job ID(s): " JID[j] "\" " EM[j]
}
}
' | sh
The bug in the above code is -- the if condition of date compare (as shown below) is is not working as expected, i am really not sure how to compare the $6 and the variable dt (both of format mm/dd/yyyy). I think i should use either mkdate() or something else. can some one please help?
if ($5 ~ "qw" && $6 < dt)
Thank you!!
Vasu
I am getting execution time of various processes in a file from their respective log files. The file with execution time looks similar to following (it may have hundreds of entries)
1:00:01.11
2:2.20
1.02
The first line is hours:minutes:seconds, the second line is minutes:seconds and, the third is just seconds.
I want to sum all entries to come to a total execution time. How can I achieve this in bash? If not bash, then can you provide me some examples from other scripting language to sum timestamps?
To complement Matt Jacob's elegant perl solution with a (POSIX-compliant) awk solution:
awk -F: '{ n=0; for(i=NF; i>=1; --i) secs += $i * 60 ^ n++ } END { print secs }' file
With the sample input, this outputs (the sum of all time spans in seconds):
3724.33
See the section below for how to format this value as a time span, similar to the input (01:02:04.33).
Explanation:
-F: splits the input lines into fields by :, so that the resulting fields ($1, $2, ...) represent the hour, minute, and seconds components individually.
n=0; for(i=NF; i>=1; --i) secs += $i * 60 ^ n++ enumerates the fields in reverse order (first seconds, then minutes, then hours, if defined; NF is the number of fields) and multiplies each field with the appropriate multiple of 60 to yield an overall value in seconds, stored in variable secs, cumulatively across lines.
END { print secs } is executed after all lines have been processed and simply prints the cumulative value in seconds.
Formatting the output as a time span:
Custom output formatting must be used:
awk -F: '
{ n=0; for(i=NF; i>=1; --i) secs += $i * 60 ^ n++ }
END {
hours = int(secs / 3600)
minutes = int((secs - hours * 3600) / 60)
secs = secs % 60
printf "%02d:%02d:%05.2f\n", hours, minutes, secs
}
' file
The above yields (the equivalent of 3724.33 seconds):
01:02:04.33
The END { ... } block splits the total number of seconds accumulated in secs back into hours, minutes, and seconds, and outputs the result with appropriate formatting of the components using printf.
The reason that utilities such as date and GNU awk's (nonstandard) date-formatting functions cannot be used to format the output is twofold:
The standard time format specifier %H wraps around at 24 hours, so if the cumulative time span exceeds 24 hours, the output would be incorrect.
Fractional seconds would be lost (the granularity of Unix time stamps is whole seconds).
The mostly-readable full perl script:
use strict;
use warnings;
my $seconds = 0;
while (<DATA>) {
my #fields = reverse(split(/:/));
for my $i (0 .. $#fields) {
$seconds += $fields[$i] * 60 ** $i;
}
}
print "$seconds\n";
__DATA__
1:00:01.11
2:2.20
1.02
Or, the barely-readable one-liner version:
$ perl -F: -wane '#F = reverse(#F); $seconds += $F[$_] * 60 ** $_ for 0 .. $#F; END { print "$seconds\n" }' times.log
Output:
3724.33
In both cases, we're splitting each line on the H:M:S separator : and then reversing the array so that we can process from right-to-left. To get the total time in seconds, we can rely on a neat trick where we multiply each field by powers of 60.
If you want the result in H:M:S format instead of raw seconds, strftime() from the POSIX core module makes it easy:
use POSIX qw(strftime);
print strftime('%H:%M:%S', gmtime($seconds)), "\n";
Output:
01:02:04
I work in telecoms and regularly need to expand number ranges.
For example, 6121234567X [note that there are 10 numbers preceeding the X] is shorthand for:
61212345670
61212345671
61212345672....... etc (a 10 number range)
and 612123456X [note that there are only 9 numbers preceeding the X] is shorthand for
61212345600
61212345601....... etc (a 100 number range)
So I need a grep command that...
reads how many characters in the line preceeding the X (to determine how many suffixes)
writes the appropriate amount of lines (10, 100, or 100) with ascending suffixes
hopefully removes the original line
Below is the Python script that does it, file-name is the expected first argument. Example usage: python script.py file.in > file.out
#!/usr/bin/env python
import sys
def generate(pattern):
p = pattern.lower().find('x')
ret = ""
for i in range(10**(10-p+1)):
ret += pattern[:p] + str(i).zfill(10-p+1) + " "
return ret
if __name__ == "__main__":
if len(sys.argv) <= 1:
print("Filename needed!")
else:
with open(sys.argv[1]) as f:
for ln in f:
print(generate(ln.rstrip()))
You can do this in awk quite quickly:
awk -v val=$a -v max=10
'BEGIN {
gsub("X","",val)
items=max - length(val)
for (i=0; i<=10^items; i++)
print val*(10^items)+i
}'
This works as an example. To do the same reading from a file, you just need to play with $1 (first field of the field) instead of val and move all the code from BEGIN into the main block.
Explanation
-v val=$a -v max=10 pass parameters: $a is the variable containing the string on the form 12345678X AND max contains the maximum amount of digits the number will have (10 in your case).
BEGIN {} perform all these actions [before/without] reading a file.
gsub("X","",val) remove X from val.
items=max - length(val) count the size of the variable without the X.
for (i=0; i<=10^items; i++) print val*(10^items)+i loop from 0 to 10^remaining_size. This means from 0 to 10 or from 0 to 100... depending on the result of 10 - size without X.
Test
With 9 as maximum:
$ a=12345678X
$ awk -v val=$a -v max=9 'BEGIN {gsub("X","",val); items=max - length(val); for (i=0; i<=10^items; i++) print val*(10^items)+i}'
123456780
123456781
123456782
123456783
123456784
123456785
123456786
123456787
123456788
123456789
123456790
echo 6121234567X | perl -nE 'm/(.*)X/;
say $1. $_ foreach (0..10**(11-length $1)-1)'
61212345670
61212345671
61212345672
61212345673
61212345674
61212345675
61212345676
61212345677
61212345678
61212345679
It's a little uglier to get the zero padded format:
echo 611234567X | perl -wne 'm/(.*)X/; $b=$1; $r=11 - length $b;
$fmt="%0" . $r . "s\n";
printf "$b$fmt", $_ foreach (0..10**$r-1) '
Okay, I have two files: one is baseline and the other is a generated report. I have to validate a specific string in both the files match, it is not just a single word see example below:
.
.
name os ksd
56633223223
some text..................
some text..................
My search criteria here is to find unique number such as "56633223223" and retrieve above 1 line and below 3 lines, i can do that on both the basefile and the report, and then compare if they match. In whole i need shell script for this.
Since the strings above and below are unique but the line count varies, I had put it in a file called "actlist":
56633223223 1 5
56633223224 1 6
56633223225 1 3
.
.
Now from below "Rcount" I get how many iterations to be performed, and in each iteration i have to get ith row and see if the word count is 3, if it is then take those values into variable form and use something like this
I'm stuck at the below, which command to be used. I'm thinking of using AWK but if there is anything better please advise. Here's some pseudo-code showing what I'm trying to do:
xxxxx=/root/xxx/xxxxxxx
Rcount=`wc -l $xxxxx | awk -F " " '{print $1}'`
i=1
while ((i <= Rcount))
do
record=_________________'(Awk command to retrieve ith(1st) record (of $xxxx),
wcount=_________________'(Awk command to count the number of words in $record)
(( i=i+1 ))
done
Note: record, wcount values are later printed to a log file.
Sounds like you're looking for something like this:
#!/bin/bash
while read -r word1 word2 word3 junk; do
if [[ -n "$word1" && -n "$word2" && -n "$word3" && -z "$junk" ]]; then
echo "all good"
else
echo "error"
fi
done < /root/shravan/actlist
This will go through each line of your input file, assigning the three columns to word1, word2 and word3. The -n tests that read hasn't assigned an empty value to each variable. The -z checks that there are only three columns, so $junk is empty.
I PROMISE you you are going about this all wrong. To find words in file1 and search for those words in file2 and file3 is just:
awk '
NR==FNR{ for (i=1;i<=NF;i++) words[$i]; next }
{ for (word in words) if ($0 ~ word) print FILENAME, word }
' file1 file2 file3
or similar (assuming a simple grep -f file1 file2 file3 isn't adequate). It DOES NOT involve shell loops to call awk to pull out strings to save in shell variables to pass to other shell commands, etc, etc.
So far all you're doing is asking us to help you implement part of what you think is the solution to your problem, but we're struggling to do that because what you're asking for doesn't make sense as part of any kind of reasonable solution to what it sounds like your problem is so it's hard to suggest anything sensible.
If you tells us what you are trying to do AS A WHOLE with sample input and expected output for your whole process then we can help you.
We don't seem to be getting anywhere so let's try a stab at the kind of solution I think you might want and then take it from there.
Look at these 2 files "old" and "new" side by side (line numbers added by the cat -n):
$ paste old new | cat -n
1 a b
2 b 56633223223
3 56633223223 c
4 c d
5 d h
6 e 56633223225
7 f i
8 g Z
9 h k
10 56633223225 l
11 i
12 j
13 k
14 l
Now lets take this "actlist":
$ cat actlist
56633223223 1 2
56633223225 1 3
and run this awk command on all 3 of the above files (yes, I know it could be briefer, more efficient, etc. but favoring simplicity and clarity for now):
$ cat tst.awk
ARGIND==1 {
numPre[$1] = $2
numSuc[$1] = $3
}
ARGIND==2 {
oldLine[FNR] = $0
if ($0 in numPre) {
oldHitFnr[$0] = FNR
}
}
ARGIND==3 {
newLine[FNR] = $0
if ($0 in numPre) {
newHitFnr[$0] = FNR
}
}
END {
for (str in numPre) {
if ( str in oldHitFnr ) {
if ( str in newHitFnr ) {
for (i=-numPre[str]; i<=numSuc[str]; i++) {
oldFnr = oldHitFnr[str] + i
newFnr = newHitFnr[str] + i
if (oldLine[oldFnr] != newLine[newFnr]) {
print str, "mismatch at old line", oldFnr, "new line", newFnr
print "\t" oldLine[oldFnr], "vs", newLine[newFnr]
}
}
}
else {
print str, "is present in old file but not new file"
}
}
else if (str in newHitFnr) {
print str, "is present in new file but not old file"
}
}
}
.
$ awk -f tst.awk actlist old new
56633223225 mismatch at old line 12 new line 8
j vs Z
It's outputing that result because the 2nd line after 56633223225 is j in file "old" but Z in file "new" and the file "actlist" said the 2 files had to be common from one line before until 3 lines after that pattern.
Is that what you're trying to do? The above uses GNU awk for ARGIND but the workaround is trivial for other awks.
Use the below code:
awk '{if (NF == 3) { word1=$1; word2=$2; word3=$3; print "Words are:" word1, word2, word3} else {print "Line", NR, "is having", NF, "Words" }}' filename.txt
I have given the solution as per the requirement.
awk '{ # awk starts from here and read a file line by line
if (NF == 3) # It will check if current line is having 3 fields. NF represents number of fields in current line
{ word1=$1; # If current line is having exact 3 fields then 1st field will be assigned to word1 variable
word2=$2; # 2nd field will be assigned to word2 variable
word3=$3; # 3rd field will be assigned to word3 variable
print word1, word2, word3} # It will print all 3 fields
}' filename.txt >> output.txt # THese 3 fields will be redirected to a file which can be used for further processing.
This is as per the requirement, but there are many other ways of doing this but it was asked using awk.
I have a large datafile in the following format below:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code below but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!( valueArray[i] in duplicateArray))
{
duplicateArray[j] = valueArray[i];
j++;
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (duplicateArray[j]) {
printf duplicateArray[j] ",";
}
}
printf "\t";
print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves from having to iterate over both arrays in a loop within a loop.
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three so I added an if to keep it from printing a null value which would result in ",WDR78," being printed if the if weren't there.
* In reality all arrays in AWK are associative.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!(valueArray[i] in duplicateArray))
{
duplicateArray[valueArray[i]] = 1
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (j) # prevents printing an extra comma
{
printf j ",";
}
}
printf "\t";
print $3
delete duplicateArray # for non-gawk, use split("", duplicateArray)
}'
Perl:
perl -F'\t' -lane'
$F[1] = join ",", grep !$_{$_}++, split ",", $F[1];
print join "\t", #F; %_ = ();
' infile
awk:
awk -F'\t' '{
n = split($2, t, ","); _2 = x
split(x, _) # use delete _ if supported
for (i = 0; ++i <= n;)
_[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
$2 = _2
}-3' OFS='\t' infile
The line 4 in the awk script is used to preserve the original order of the values in the second field after filtering the unique values.
Sorry, I know you asked about awk... but Perl makes this much more simple:
$ perl -n -e ' #t = split(/\t/);
%t2 = map { $_ => 1 } split(/,/,$t[1]);
$t[1] = join(",",keys %t2);
print join("\t",#t); ' knownGeneFromUCSC.txt
Pure Bash 4.0 (one associative array):
declare -a part # parts of a line
declare -a part2 # parts 2. column
declare -A check # used to remember items in part2
while read line ; do
part=( $line ) # split line using whitespaces
IFS=',' # separator is comma
part2=( ${part[1]} ) # split 2. column using comma
if [ ${#part2[#]} -gt 1 ] ; then # more than 1 field in 2. column?
check=() # empty check array
new2='' # empty new 2. column
for item in ${part2[#]} ; do
(( check[$item]++ )) # remember items in 2. column
if [ ${check[$item]} -eq 1 ] ; then # not yet seen?
new2=$new2,$item # add to new 2. column
fi
done
part[1]=${new2#,} # remove leading comma
fi
IFS=$'\t' # separator for the output
echo "${part[*]}" # rebuild line
done < "$infile"