comparing two files with common file name but want output from the both if matched, how using awk? - shell

I want to compare filenames of Today.txt with Main.txt.
If there is match, then print all 6 columns of matched file of Main.txt and Time column of Today.txt with a new file say matched.txt.
and the files which are not matched with Main.txt, then list the filename and time of TODAY.txt in a new file say unmatched.txt
NOTE: Plus sign(+) indicates files are from inprogress directory,sometimes filenames are appended with "+".
Main.txt
date filename timestamp space count status
Nov 4 +CHCK01_20161104.txt 06:39 2.15M 17153 on_time
Nov 4 TRIPS11_20161104.txt 09:03 0.00M 24 On_Time
Nov 4 AR02_20161104.txt 09:31 0.00M 7 On_Time
Nov 4 AR01_20161104.txt 09:31 0.04M 433 On_Time
Today.txt
filename time
CHCK01_20161104.txt 06:03
CHCK05_20161104.txt 11:10
CHCK09_20161104.txt 21:46
AR01_20161104.txt 09:36
AR02_20161104.txt 09:36
ifs01_20161104.txt 21:16
TRIPS11_20161104.txt 09:16
Required Output:
matched.txt
Nov 4 +CHCK01_20161104.txt 06:39 06:03 2.15M 17153 on_time
Nov 4 TRIPS11_20161104.txt 09:03 09:16 0.00M 24 On_Time
Nov 4 AR02_20161104.txt 09:31 09:36 0.00M 7 On_Time
Nov 4 AR01_20161104.txt 09:31 09:36 0.04M 433 On_Time
unmatched.txt
CHCK05_20161104.txt 11:10
CHCK09_20161104.txt 21:46
ifs01_20161104.txt 21:16
Below command gives me proper output except the 2nd column output of Today.txt, kindly help me on this please ?
awk 'FNR==1{next}
NR==FNR{a[$1]=$2; next}
{k=$3; sub(/^\+/,"",k)} k in a{print; delete a[k]}
END{for(k in a) print k,a[k] > "unmatched.txt"}' Today.txt Main.txt > matched.txt
Thanks a lot in advance !

Assuming that the filenames in today.txt occur only once in the main.tx, you can write an awk script as
$ cat prog.awk
NR==FNR && NR>1{file[$1]=$2; next}
$3 in file{$4 = $4" "file[$3]; print > "matched.txt"; delete file[$3]}
END{for (filename in file) print filename, file[filename] > "unmatched.txt"}
$ awk -F"[ +]*" -f prog.awk today main
-F"[ +]*" Set the field separator as space or +. This will be use full to remove the + from the filenames.

Related

I dont want to print repeated lines based on column 6 and 7

I don't want to print repeated lines based on column 6 and 7. sort -u does not seem to help
cat /tmp/testing :-
-rwxrwxr-x. 1 root root 52662693 Feb 27 13:11 /home/something/bin/proxy_exec
-rwxrwxr-x. 1 root root 27441394 Feb 27 13:12 /home/something/bin/keychain_exec
-rwxrwxr-x. 1 root root 45570820 Feb 27 13:11 /home/something/bin/wallnut_exec
-rwxrwxr-x. 1 root root 10942993 Feb 27 13:12 /home/something/bin/log_exec
-rwxrwxr-x. 1 root root 137922408 Apr 16 03:43 /home/something/bin/android_exec
When I try cat /tmp/testing | sort -u -k 6,6 -k 7,7 I get :-
-rwxrwxr-x. 1 root root 137922408 Apr 16 03:43 /home/something/bin/android_exec
-rwxrwxr-x. 1 root root 52662693 Feb 27 13:11 /home/something/bin/proxy_exec
Desired output is below, as that is the only file different from others based on month and date column
-rwxrwxr-x. 1 root root 137922408 Apr 16 03:43 /home/something/bin/android_exec
[not] to print repeated lines based on column 6 and 7 using awk, you could:
$ awk '
++seen[$6,$7]==1 { # count seen instances
keep[$6,$7]=$0 # keep first seen ones
}
END { # in the end
for(i in seen)
if(seen[i]==1) # the ones seen only once
print keep[i] # get printed
}' file # from file or pipe your ls to the awk
Output for given input:
-rwxrwxr-x. 1 root root 137922408 Apr 16 03:43 /home/something/bin/android_exec
Notice: All standard warnings against parsing ls output still apply.
tried on gnu sed
sed -E '/^\s*(\S+\s+){5}Feb\s+27/d' testing
tried on gnu awk
awk 'NR==1{a=$6$7;next} a!=$6$7{print}' testing

How to get a filename list with ncftp?

So I tried
ncftpls -l
which gives me a list
-rw-r--r-- 1 100 ftpgroup 3817084 Jan 29 15:50 1548773401.tar.gz
-rw-r--r-- 1 100 ftpgroup 3817089 Jan 29 15:51 1548773461.tar.gz
-rw-r--r-- 1 100 ftpgroup 3817083 Jan 29 15:52 1548773521.tar.gz
-rw-r--r-- 1 100 ftpgroup 3817085 Jan 29 15:53 1548773582.tar.gz
-rw-r--r-- 1 100 ftpgroup 3817090 Jan 29 15:54 1548773642.tar.gz
But all I want is to check the timestamp (which is the name of the tar.gz)
How to only get the timestamp list ?
As requested, all I wanted to do is delete old backups, so awk was a good idea (at least it was effective) even it wasn't the right params. My method to delete old backup is probably not the best but it works
ncftpls *authParams* | (awk '{match($9,/^[0-9]+/, a)}{ print a[0] }') | while read fileCreationDate; do
VALIDITY_LIMIT="$((`date +%s`-600))"
a=$VALIDITY_LIMIT
b=$fileCreationDate
if [ $b -lt $a ];then
deleteFtpFile $b
fi
done;
You can use awk to only display the timestamps from the output like so:
ncftpls -l | awk '{ print $5 }'

awk: Group by and then Sort by sub strings of a string

Assuming we have following files:
-rw-r--r-- 1 user group 120 Aug 17 18:27 A.txt
-rw-r--r-- 1 user group 155 May 12 12:28 A.txt
-rw-r--r-- 1 user group 155 May 10 21:14 A.txt
-rw-rw-rw- 1 user group 700 Aug 15 17:05 B.txt
-rw-rw-rw- 1 user group 59 Aug 15 10:02 B.txt
-rw-r--r-- 1 user group 180 Aug 15 09:38 B.txt
-rw-r--r-- 1 user group 200 Jul 2 17:09 C.txt
-rw-r--r-- 1 user group 4059 Aug 9 13:58 D.txt
Considering only HH:MM in timestamp (ie ignoring date/day part of timestamp), I want to sort this listing to pick maximum and minimum timestamp for each file name.
So we want to group by last column and get min & max HH:MM.
Please assume that filename duplicates are allowed in my input data.
In awk code, I particularly got stuck to group by and then sort by HH first and then MM.
Output we are expecting is in format:
Filename | Min HHMM | Max HHMM
A.txt 12:28 21:14
C.txt 17:09 17:09
..
(or any other output format giving this details is good)
Can you please help..TIA
Try:
awk '{if ($8<min[$9] || !min[$9])min[$9]=$8; if ($8>max[$9])max[$9]=$8} END{for (f in min)print f,min[f],max[f]}' file | sort
Example
$ cat file
-rw-r--r-- 1 user group 120 Aug 17 18:27 A.txt
-rw-r--r-- 1 user group 155 May 12 12:28 A.txt
-rw-r--r-- 1 user group 155 May 10 21:14 A.txt
-rw-rw-rw- 1 user group 700 Aug 15 17:05 B.txt
-rw-rw-rw- 1 user group 59 Aug 15 10:02 B.txt
-rw-r--r-- 1 user group 180 Aug 15 09:38 B.txt
-rw-r--r-- 1 user group 200 Jul 2 17:09 C.txt
-rw-r--r-- 1 user group 4059 Aug 9 13:58 D.txt
$ awk '{if ($8<min[$9] || !min[$9])min[$9]=$8; if ($8>max[$9])max[$9]=$8} END{for (f in min)print f,min[f],max[f]}' file | sort
A.txt 12:28 21:14
B.txt 09:38 17:05
C.txt 17:09 17:09
D.txt 13:58 13:58
Warning
Your input looks like it was produced by ls. If that is so, be aware that the output of ls has a myriad of peculiarities and compatibility issues. The authors of ls recommend against parsing the output of ls.
How the code works
awk implicitly loops over every line of input. This code uses two associative arrays. min keeps track of the minimum time for each file name. max keeps track of the maximum.
if ($8<min[$9] || !min[$9])min[$9]=$8
This updates min if the time, $8, in the time for the current line is less than the previously seen time for this filename, $9.
if ($8>max[$9])max[$9]=$8
This updates max if the time, $8, in the time for the current line is greater than the previously seen time for this filename, $9.
END{for (f in min)print f,min[f],max[f]}
This prints out the results for each file name.
sort
This sorts the output into a cosmetically pleasing form.
similar awk
$ awk '{k=$9;v=$8} # set key (k), value (v)
!(k in min){min[k]=max[k]=v} # initial value for min/max
min[k]>v{min[k]=v} # set min
max[k]<v{max[k]=v} # set max
END{print "Filename | Min HHMM | Max HHMM";
for(k in min) print k,min[k],max[k] | "sort"}' file
Filename | Min HHMM | Max HHMM
A.txt 12:28 21:14
B.txt 09:38 17:05
C.txt 17:09 17:09
D.txt 13:58 13:58
note that printing header and piping data to sort in awk keeps the header in the first line.
$ cat > test.awk
BEGIN {
min["\x00""Filename"]="Min_HHMM"OFS"Max_HHMM" # set header in min[], preceded by NUL
} # to place on top when ordering (HACK)
!($9 in min)||min[$9]>$8 { # if candidate smaller than current min
min[$9]=$8 # set new min
}
max[$9]<$8 {
max[$9]=$8 # set new max
}
END {
PROCINFO["sorted_in"]="#ind_str_asc" # set array scanning order for for loop
for(i in min)
print i,min[i],max[i]
}
$ awk -f test.awk file
Filename Min_HHMM Max_HHMM
A.txt 12:28 21:14
B.txt 09:38 17:05
C.txt 17:09 17:09
D.txt 13:58 13:58
The BEGIN hack can be replaced by a static print in the beginning of END block:
print "Filename"OFS"Min_HHMM"OFS"Max_HHMM";

How to extract date from filename with extenstion using shell script

I tried to extract date from filenames for first two rows only with extension .log
ex: filenames are as follows
my_logFile.txt contains
abc20140916_1.log
abhgg20140914_1.log
abf20140910_1.log
log.abc_abc20140909_1
The code I tried:
awk '{print substr($1,length($1)-3,4)}' my_logFile.txt
But getting op as:
.log
.log
.log
Need op as:
20140916
20140914
*****revised query*
I have a txt file containing n number of log files. Each line in txt file is like this.
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 abc20140405_1.log
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 aghtff20140404_1.log
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 log.pqrs20140403_1
I need to extract date out of file names from only first two rows. Here the filename has varying number of char before date.
The op should beL
20140405
20140404
Will this work to you?
$ head -2 file | grep -Po ' [a-z]+\K[0-9]+(?=.*\.log$)'
20140405
20140404
Explanation
head -2 file gets the first two lines of the file.
grep -Po ' [a-z]+\K[0-9]+(?=.*\.log$)' gets the set of digits in between a block of (space + a-z letters) and (.log + end of line).
try this,
cut -f9 -d " " <file> | grep -o -E "[0-9]{8}"
worked on my machine,
[root#giam20 ~]# cat sample.txt
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 abc20140405_1.log
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 aghtff20140404_1.log
-rw-rw-rw- 1 abchost abchost 241315175 Apr 16 10:45 log.pqrs20140403_1
[root#giam20 ~]# cut -f9 -d " " sample.txt | grep -o -E "[0-9]{8}"
20140405
20140404
20140403

pick up files based on dates in ksh script

I have this list of files . Now I will have to pick the latest file based on some condition
3679 Jul 21 23:59 belk_rpo_error_**po9324892**_07212014.log
0 Jul 22 23:59 belk_rpo_error_**po9324892**_07222014.log
3679 Jul 23 23:59 belk_rpo_error_**po9324892**_07232014.log
22 Jul 22 06:30 belk_rpo_error_**po9324267**_07012014.log
0 Jul 20 05:50 belk_rpo_error_**po9999992**_07202014.log
411 Jul 21 06:30 belk_rpo_error_**po9999992**_07212014.log
742 Jul 21 07:30 belk_rpo_error_**po9999991**_07212014.log
0 Jul 23 2014 belk_rpo_error_**po9999991**_07232014.log
For a PATRICULAR Order_No(Marked with ** **)
If the latest file is 0 kB then we will discard it (rest of the files with same Order_no as well)
if the latest file is non Zero then I will take it.(Only the latest one)
Then append the contents in a txt file .
My expected output would be ::
411 Jul 21 06:30 belk_rpo_error_**po9999992**_07212014.log
3679 Jul 23 23:59 belk_rpo_error_**po9324892**_07232014.log
22 Jul 22 06:30 belk_rpo_error_**po9324267**_07012014.log
I am at my wits end here. I cant seem to figure out how to compare dates in Unix. Any help is very appreciated.
You can try something like:
touch test.txt
for var in ` find . ! -empty -exec ls -r {} \;`
do
cat $var>>test.txt
done
untested
use stat to emit date (epoch time), size and filename.
use awk to filter out zero-length files and extract order number.
sort by order number and date
awk to pick up the last filename for each order number
stat -c $'%Y\t%s\t%n' *.log |
awk -F'\t' -v OFS='\t' '
$2 > 0 {
split($3, a, /_/)
print a[4], $1, $3
}' |
sort -t $'\t' -k1,1 -k2,2n |
awk -F'\t' '
NR > 1 && $1 != prev_order {print filename}
{filename = $3; prev_order = $1}
END {print filename}
'
The sort command might be wrong: In order to group by order number, you might need to sort first by file time then by order number.
If I understand your question, the resulting files need to be concatenated and appended to a file. If the above pipeline is working OK, then pipe into | xargs cat >> something.log

Resources