bash - append specific column of multiple files to a new file

I have many files in a folder. All files have the same format:
file A:
090722 12:15 - 12:45 2342
090722 12:30 - 13:00 73
090722 12:45 - 13:15 543
...
file B:
090722 12:15 - 12:45 874
090722 12:30 - 13:00 32
090722 12:45 - 13:15 2543
...
and so on... The first part is always the same and should be printed only once.
I'd like to get output like this:
090722 12:15 - 12:45 2342 874 values_fileC values_fileD ...
090722 12:30 - 13:00 73 32 values_fileC values_fileD ...
090722 12:45 - 13:15 543 2543 values_fileC values_fileD ...
...
I've tried something like:
paste files* > final.txt
That works fine, but I don't know how to keep only the columns with the values from each file.
Some ideas that failed:
paste files* | awk '{ print $5 }' > final.txt
for f in files*; do cat $f | awk '{print $5}'; done > final.txt

Try this:
awk -F' ' '{a[$1" "$2" "$3" "$4]=a[$1" "$2" "$3" "$4]"\t"$5}END{for(i in a) print i a[i]}' file*
Output:
090722 12:15 - 12:45 2342 874
090722 12:45 - 13:15 543 2543
090722 12:30 - 13:00 73 32
Update:
awk -F' ' '{a[$1" "$2" "$3" "$4]=a[$1" "$2" "$3" "$4]"\t"$5}END{for(i in a) print i a[i]}' file* | sort -t " " -k 2,2n
Output:
090722 12:15 - 12:45 2342 874
090722 12:30 - 13:00 73 32
090722 12:45 - 13:15 543 2543

One option would be to use awk to combine the fields:
awk '{
key = $1 FS $2 FS $3 FS $4; if (NR == FNR) a[NR] = key; out[key] = out[key] FS $5
} END { for(i = 1; i <= FNR; ++i) print a[i], out[a[i]] }' file*
Laid out as a script (which you could run with awk -f script.awk file*):
{
key = $1 FS $2 FS $3 FS $4 # build key using first four fields
if (NR == FNR) a[NR] = key # record the order in which keys appear
out[key] = out[key] FS $5 # build output array using fifth field
}
END {
# loop through and print keys, values
for(i = 1; i <= FNR; ++i) print a[i], out[a[i]]
}
This makes the assumption that each file contains the same number of records.
I can think of two ways to achieve a fixed width output. If you're sure that the values being combined will only vary in length within the range of one tab stop, then the simplest solution is just to use a \t instead of FS in this line:
out[key] = out[key] "\t" $5 # build output array using fifth field
Otherwise you could use sprintf to pad each value to a length of your choice:
out[key] = out[key] sprintf("%6s", $5)
You can left-align the fields using -6 instead of 6.
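For example, the whole command with padded values might look like this (a sketch, assuming a width of 6 is enough for your longest value; note the final print concatenates, since the padding already supplies the separator):
awk '{
key = $1 FS $2 FS $3 FS $4
if (NR == FNR) a[NR] = key
out[key] = out[key] sprintf("%6s", $5) # pad each value to 6 characters
} END { for(i = 1; i <= FNR; ++i) print a[i] out[a[i]] }' file*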


Get the longest logon time of a given user using awk

My task is to write a bash script, using awk, to find the longest logon time of a given user ("still logged in" does not count), and print the month, day, IP and logon time in minutes.
Sample input: ./scriptname.sh username1
Content of last username1:
username1 pts/ IP Apr 2 .. .. .. .. (00.03)
username1 pts/ IP Apr 3 .. .. .. .. (00.13)
username1 pts/ IP Apr 5 .. .. .. .. (12.00)
username1 pts/ IP Apr 9 .. .. .. .. (12.11)
Sample output:
Apr 9 IP 731
(note: 12 hours and 11 minutes is in total 731 minutes)
I have written this script, but a bunch of errors pop up, and I am really confused:
#!/bin/bash
usr=$1
last $usr | grep -v "still logged in" | awk 'BEGIN {max=-1;}
{
h=substr($10,2,2);
min=substr($10,5,2) + h/60;
}
(max < min){
max = min;
}
END{
maxh=max/60;
maxmin=max-maxh;
($maxh == 0 && $maxmin >=10){
last $usr | grep "00:$maxmin" | awk '{print $5," ",$6," ", $3," ",$maxmin}'
exit 1
}
($maxh == 0 $$ $maxmin < 10){
last $usr | grep "00:0$maxmin" | awk '{print $5," ",$6," ",$3," ",$maxmin}'
exit 1
}
($maxh < 10 && $maxmin == 0){
last $usr | grep "0$maxh:00" | awk '{print $5," ",$6," ",$3," ",$maxmin}'
exit 1
}
($maxh < 10 && $maxmin < 10){
last $usr | grep "0$maxh:0$maxmin" | awk '{print $5," ",$6," ",$3," ",$maxmin}'
exit 1
}
($maxh >= 10 && $maxmin < 10){
last $usr | grep "$maxh:0$maxmin" | awk '{print $5," ",$6," ",$3," ",$maxmin}'
exit 1
}
($maxh >=10 && $maxmin >= 10){
last $usr | grep "$maxh:$maxmin" | awk '{print $5," ",$6," ",$3," ",$maxmin}'
exit 1
}
}'
A bit of explanation of how I imagined this would work:
After the initialization, I want to find the (hh.mm) column of the last $usr output, save the hours and minutes of every line, and find the biggest number of minutes, i.e. the longest logon time.
After I have found the longest logon time (in minutes, stored in the variable max), I have to reformat the minutes back to hh:mm to be able to grep for it: I run the last command again, searching only for the line(s) that contain the max logon time, and print all of the needed information in the month day IP logon-time-in-minutes format, using another awk.
Errors I get when running this code: a bunch of syntax errors whenever I try to use grep and awk inside the original awk.
awk is not shell. You can't directly call tools like last, grep and awk from awk any more than you could call them directly from a C program.
Using any awk in any shell on every Unix box, and assuming that if multiple rows have the max time you'd want all of them printed, and that if no timestamped rows are found you'd want something like No matching records printed (both are easy to tweak if not; just tell us your requirements for those cases and include them in the example in your question):
last username1 |
awk '
/still logged in/ {
next
}
{
split($NF,t,/[().]/)
cur = (t[2] * 60) + t[3]
}
cur >= max {
out = ( cur > max ? "" : out ORS ) $4 OFS $5 OFS $3 OFS cur
max = cur
}
END {
print (out ? out : "No matching records")
}
'
Apr 9 IP 731
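To wire this into the requested bash script form, a sketch (the script name and argument handling follow the question's ./scriptname.sh username1 usage):
#!/bin/bash
# Usage: ./scriptname.sh username1
usr=$1
last "$usr" | awk '
/still logged in/ { next } # sessions that do not count
{ split($NF,t,/[().]/); cur = (t[2] * 60) + t[3] } # (hh.mm) -> total minutes
cur >= max { out = (cur > max ? "" : out ORS) $4 OFS $5 OFS $3 OFS cur; max = cur }
END { print (out ? out : "No matching records") }
'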
If gnu-awk is available, you might use match() with a pattern that has 2 capture groups for the numbers in the last field, and print the format that you want in the END block.
In this example, file contains the sample content and the last column holds the logon duration:
awk '
match($NF, /\(([0-9]+)\.([0-9]+)\)/, a) {
hm = (a[1] * 60) + a[2]
if(hm > max) {max = hm; line = $0;}
}
END {
n = split(line,a,/[[:space:]]+/)
print a[3], a[4], a[5], max
}
' file
Output
IP Apr 9 731
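If you'd rather have the month day IP minutes order from the question, swapping the fields in the final print should do it (a one-line tweak):
print a[4], a[5], a[3], max # prints: Apr 9 IP 731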
Testing the last command on my machine (Red Hat Linux 7.8), I got the following output:
user0022 pts/1 10.164.240.158 Sat Apr 25 19:32 - 19:47 (00:14)
user0022 pts/1 10.164.243.80 Sat Apr 18 22:31 - 23:31 (1+01:00)
user0022 pts/1 10.164.243.164 Sat Apr 18 19:21 - 22:05 (02:43)
user0011 pts/0 10.70.187.1 Thu Nov 21 15:26 - 18:37 (03:10)
user0011 pts/0 10.70.187.1 Thu Nov 7 16:21 - 16:59 (00:38)
astukals pts/0 10.70.187.1 Mon Oct 7 19:10 - 19:13 (00:03)
reboot system boot 3.10.0-957.10.1. Mon Oct 7 22:09 - 14:30 (156+17:21)
astukals pts/0 10.70.187.1 Mon Oct 7 18:56 - 19:08 (00:12)
reboot system boot 3.10.0-957.10.1. Mon Oct 7 21:53 - 19:08 (-2:-44)
IT pts/0 10.70.187.1 Mon Oct 7 18:50 - 18:53 (00:03)
IT tty1 Mon Oct 7 18:48 - 18:49 (00:00)
user0022 pts/1 30.30.30.168 Thu Apr 16 09:43 - 14:54 (05:11)
user0022 pts/1 30.30.30.59 Wed Apr 15 11:48 - 04:59 (17:11)
user0022 pts/1 30.30.30.44 Tue Apr 14 19:03 - 04:14 (09:11)
I found that the time format is DD+HH:MM, where the DD+ part appears only when DD is not zero.
I also found that there are additional technical users (IT, system, reboot) that need to be filtered out.
Suggested solution:
last | awk 'BEGIN {FS="[ ()+:]*"}
/reboot|system|still/{next}
{ print $5 OFS $6 OFS $3 OFS $(NF-1) + ($(NF-2) * 60) + ($(NF-3) * 60 * 24)}
' | sort -nk 4 | tail -1
Result:
Apr 15 30.30.30.59 85991

shell script for extracting line of file using awk

I want the selected lines of each file to be printed in the output file side by side, separated by spaces. Here is what I have done so far:
for file in SAC*
do
awk 'FNR==2 {print $4}' $file >>exp
awk 'FNR==3 {print $4}' $file >>exp
awk 'FNR==4 {print $4}' $file >>exp
awk 'FNR==5 {print $4}' $file >>exp
awk 'FNR==7 {print $4}' $file >>exp
awk 'FNR==8 {print $4}' $file >>exp
awk 'FNR==24 {print $0}' $file >>exp
done
My output is:
XV
AMPY
BHZ
2012-08-15T08:00:00
2013-12-31T23:59:59
I want the output to be:
XV AMPY BHZ 2012-08-15T08:00:00 2013-12-31T23:59:59
First, the test data (only 9 rows, though):
$ cat file
1 2 3 14
1 2 3 24
1 2 3 34
1 2 3 44
1 2 3 54
1 2 3 64
1 2 3 74
1 2 3 84
1 2 3 94
Then the awk. No need for that for loop in the shell; awk can handle multiple files:
$ awk '
BEGIN {
ORS=" "
a[2];a[3];a[4];a[5];a[7];a[8] # list of records for which $4 should be output
}
FNR in a { print $4 } # output the $4s
FNR==9 { printf "%s\n",$0 } # replace 9 with 24
' file file # ... # the files you want to process (SAC*)
24 34 44 54 74 84 1 2 3 94
24 34 44 54 74 84 1 2 3 94
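Applied to the original question's files, a usage sketch (assuming the wanted records really are 2-5, 7, 8 and 24, and appending to exp as before):
awk '
BEGIN { ORS=" "; a[2];a[3];a[4];a[5];a[7];a[8] }
FNR in a { print $4 } # collect $4 from the selected lines
FNR==24 { printf "%s\n",$0 } # line 24 closes each record
' SAC* >> exp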

awk Count number of occurrences

I made these awk commands in a shell script to count the total occurrences of each $4 and $5 pair.
awk -F" " '{if($4=="A" && $5=="G") {print NR"\t"$0}}' file.txt > ag.txt && cat ag.txt | wc -l
awk -F" " '{if($4=="C" && $5=="T") {print NR"\t"$0}}' file.txt > ct.txt && cat ct.txt | wc -l
awk -F" " '{if($4=="T" && $5=="C") {print NR"\t"$0}}' file.txt > tc.txt && cat ta.txt | wc -l
awk -F" " '{if($4=="T" && $5=="A") {print NR"\t"$0}}' file.txt > ta.txt && cat ta.txt | wc -l
The output is #### (a number) in the shell. But I want to get rid of the > ag.txt && cat ag.txt | wc -l part and instead get output in the shell like AG = ####.
This is the input format:
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 185 185 T - 24 100 10 14 10 14
>seq1 194 194 T C 24 100 12 12 12 12
>seq1 185 185 T AAA 24 100 10 14 10 14
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
I want output like this in the shell or in a file, counting only these exact patterns and no others:
AG 2
CT 1
TC 1
TA 1
Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
n++ increments a counter whenever the condition is matched.
The magic condition END is true after the last line of input has been processed.
Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you might want to confirm whether you really need -F" ". By default, awk splits on whitespace. This option would only be required if your fields contain embedded tabs, I think.
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
The first statement runs on every line, and increments a counter in an array (a[]) whose key is built from $4 and $5.
In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
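If you do need a deterministic order, one option (a sketch) is simply to pipe the result through sort:
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file | sort
AG 2
CT 1
TA 1
TC 1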
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record the "valid" values. The condition on the counter simply verifies that both $4 and $5 are members of the array. All else works the same.
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
Much easier to read that way.
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
a[$4 $5]++
}
END {
for (p in a)
printf("%s %d\n", p, a[p])
}
This only counts pairs that already have array indices, which are initialized as null in the BEGIN block.
The parentheses in the increment condition are not required, and are included only for clarity.
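To run either version stand-alone, a usage sketch (count_pairs.awk is a hypothetical file name):
chmod +x count_pairs.awk
./count_pairs.awk file.txt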
Just count them all then print the ones you care about:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
Note that this will produce a count of zero for any of your target pairs that don't appear in your input, e.g. if you want a count of "XY"s too:
$ awk '{cnt[$4$5]++} END{split("AG CT TC TA XY",t); for (i=1;i in t;i++) print t[i], cnt[t[i]]+0}' file
AG 2
CT 1
TC 1
TA 1
XY 0
If that's desirable, check if other solutions do the same.
Actually, this might be what you REALLY want, just to make sure $4 and $5 are single upper case letters:
$ awk '$4$5 ~ /^[[:upper:]]{2}$/{cnt[$4$5]++} END{for (i in cnt) print i, cnt[i]}' file
TA 1
AG 2
TC 1
CT 1

pick up files based on dates in ksh script

I have this list of files. Now I have to pick the latest file based on some conditions:
3679 Jul 21 23:59 belk_rpo_error_**po9324892**_07212014.log
0 Jul 22 23:59 belk_rpo_error_**po9324892**_07222014.log
3679 Jul 23 23:59 belk_rpo_error_**po9324892**_07232014.log
22 Jul 22 06:30 belk_rpo_error_**po9324267**_07012014.log
0 Jul 20 05:50 belk_rpo_error_**po9999992**_07202014.log
411 Jul 21 06:30 belk_rpo_error_**po9999992**_07212014.log
742 Jul 21 07:30 belk_rpo_error_**po9999991**_07212014.log
0 Jul 23 2014 belk_rpo_error_**po9999991**_07232014.log
For a PARTICULAR Order_No (marked with ** **):
If the latest file is 0 kB, then we discard it (and the rest of the files with the same Order_No as well).
If the latest file is non-zero, then I take it (only the latest one).
Then append the contents to a txt file.
My expected output would be ::
411 Jul 21 06:30 belk_rpo_error_**po9999992**_07212014.log
3679 Jul 23 23:59 belk_rpo_error_**po9324892**_07232014.log
22 Jul 22 06:30 belk_rpo_error_**po9324267**_07012014.log
I am at my wits' end here. I can't seem to figure out how to compare dates in Unix. Any help is very appreciated.
You can try something like:
touch test.txt
for var in ` find . ! -empty -exec ls -r {} \;`
do
cat $var>>test.txt
done
untested
use stat to emit date (epoch time), size and filename.
use awk to filter out zero-length files and extract order number.
sort by order number and date
awk to pick up the last filename for each order number
stat -c $'%Y\t%s\t%n' *.log |
awk -F'\t' -v OFS='\t' '
$2 > 0 {
split($3, a, /_/)
print a[4], $1, $3
}' |
sort -t $'\t' -k1,1 -k2,2n |
awk -F'\t' '
NR > 1 && $1 != prev_order {print filename}
{filename = $3; prev_order = $1}
END {print filename}
'
The sort command might be wrong: in order to group by order number, you might need to sort first by file time, then by order number.
If I understand your question, the resulting files need to be concatenated and appended to a file. If the above pipeline is working OK, then pipe it into xargs cat >> something.log, as in the sketch below.
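Assembled, the whole thing might look like this (a sketch; something.log is just a placeholder name):
stat -c $'%Y\t%s\t%n' *.log |
awk -F'\t' -v OFS='\t' '
$2 > 0 { # skip zero-length files
split($3, a, /_/) # order number is the 4th _-separated part
print a[4], $1, $3
}' |
sort -t $'\t' -k1,1 -k2,2n |
awk -F'\t' '
NR > 1 && $1 != prev_order {print filename}
{filename = $3; prev_order = $1}
END {print filename}
' |
xargs cat >> something.log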

Shell script to find common values and write in particular pattern with subtraction math to range pattern

Shell script to get common values in two files, write them in a pattern to a new file, and also have the first value of each range pattern reduced by 1.
$ cat file1
2
3
4
6
7
8
10
12
13
16
20
21
22
23
27
30
$ cat file2
2
3
4
8
10
12
13
16
20
21
22
23
27
Script that works:
awk 'NR==FNR{x[$1]=1} NR!=FNR && x[$1]' file1 file2 | sort -n | awk 'NR==1 {s=l=$1; next} $1!=l+1 {if(l == s) print l; else print s ":" l; s=$1} {l=$1} END {if(l == s) print l; else print s ":" l; s=$1}'
Script out:
2:4
8
10
12:13
16
20:23
27
Desired output:
1:4
8
10
11:13
16
19:23
27
Similar to sputnick's, except using comm to find the intersection of the file contents. Note that comm needs lexically sorted input (hence the process substitutions), and the sort -n afterwards restores numeric order.
comm -12 <(sort file1) <(sort file2) |
sort -n |
awk '
function print_range() {
if (start != prev)
printf "%d:", start-1
print prev
}
FNR==1 {start=prev=$1; next}
$1 > prev+1 {print_range(); start=$1}
{prev=$1}
END {print_range()}
'
1:4
8
10
11:13
16
19:23
27
Try doing this:
awk 'NR==FNR{x[$1]=1} NR!=FNR && x[$1]' file1 file2 |
sort -n |
awk 'NR==1 {s=l=$1; next}
$1!=l+1 {if(l == s) print l; else print s -1 ":" l; s=$1}
{l=$1}
END {if(l == s) print l; else print s -1 ":" l; s=$1}'
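With the numeric sort this should produce the desired output:
1:4
8
10
11:13
16
19:23
27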
