Cut and sort delimited dates from stdout via pipe - bash

I am trying to split some strings from stdout to extract the dates from them, but there are two cases:
full.20201004T033103Z.vol93.difftar.gz
full.20201007T033103Z.vol94.difftar.gz
Which should produce 20201007T033103Z, the date nearest to now (the newest).
Or:
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200929T033103Z.to.20200908T033103Z.vol10.difftar.gz
These should yield the second date (the one after .to.), not the first, and print only the newest date: 20200908T033103Z
What I tried:
cat dates_file | awk -F '.to.' 'NF > 1 {print $2}' | cut -d\. -f1 | sort -r -t- -k3.1,3.4 -k2,2 | head -1
This only works for the second case and doesn't cover the first; I am also not sure about the date-sorting logic.
Here is some sample data:
full.20201004T033103Z.vol93.difftar.gz
full.20201004T033103Z.vol94.difftar.gz
full.20201004T033103Z.vol95.difftar.gz
full.20201004T033103Z.vol96.difftar.gz
full.20201004T033103Z.vol97.difftar.gz
full.20201004T033103Z.vol98.difftar.gz
full.20201004T033103Z.vol99.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.manifest
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol10.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol11.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol12.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol13.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol14.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol15.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol16.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol17.difftar.gz

To get the most recent date from your sample data you can use this awk:
awk '{
  sub(/^(.*\.to|[^.]+)\./, "")   # strip everything up to ".to." (or the leading label and dot)
  gsub(/\..+$|[TZ]/, "")         # drop the trailing fields and the T/Z markers
}
$0 > max {                       # keep the largest timestamp seen so far
  max = $0
}
END {
  print max
}' file
20201004033103
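Since the timestamps are ISO-8601-style, they also compare correctly as plain strings, so if you want to keep the original 20201007T033103Z format, a sed/sort sketch of the same idea works too (assuming GNU or BSD sed for -E):
sed -E 's/^(.*\.to|[^.]+)\.//; s/\..*$//' dates_file | sort -r | head -1
The first substitution strips everything through .to. (or through the leading full./inc. label), the second drops everything after the date, and the reverse sort puts the newest first.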

Related

Split a large gz file into smaller ones filtering and distributing content

I have an 81G gzip file which uncompresses to 254G. I want to implement a bash script which takes the gzip file and splits it on the basis of the first column. The first column has values ranging between 1 and 10. I want to split it into 10 subfiles, where all rows whose first-column value is 1 go into one file, all rows whose value is 2 go into a second file, and so on. While doing that I don't want to include column 3 and column 5 in the new subfiles. The file is tab separated. For example:
col_1 col_2. col_3. col_4. col_5. col_6
1. 7464 sam. NY. 0.738. 28.9
1. 81932. Dave. NW. 0.163. 91.9
2. 162. Peter. SD. 0.7293. 673.1
3. 7193. Ooni GH. 0.746. 6391
3. 6139. Jess. GHD. 0.8364. 81937
3. 7291. Yeldish HD. 0.173. 1973
The file above should result in three different gzipped files, with col_3 and col_5 removed from each of the new subfiles. What I did was:
#!/bin/bash
#SBATCH --partition normal
#SBATCH --mem-per-cpu 500G
#SBATCH --time 12:00:00
#SBATCH -c 1
awk -F, '{print > $1".csv.gz"}' file.csv.gz
But this is not producing the desired result, and I don't know how to remove col_3 and col_5 from the new subfiles.
As I said, the gzip file is 81G, so I am looking for an efficient solution. Insights will be appreciated.
You have to decompress and recompress; to get rid of columns 3 and 5, you could use GNU cut like this:
gunzip -c infile.gz \
| cut --complement -f3,5 \
| awk '{ print | "gzip > " $1 "csv.gz" }'
Or you could get rid of the columns in awk:
gunzip -c infile.gz \
| awk -v OFS='\t' '{ print $1, $2, $4, $6 | "gzip > " $1 "csv.gz" }'
Something like
zcat input.csv.gz | cut -f1,2,4,6- | awk '{ print | ("gzip -c > " $1 "csv.gz") }'
Uncompress the file, remove fields 3 and 5, save to the appropriate compressed file based on the first column.
Robustly and portably with any awk, if the file is always sorted by the first field as shown in your example:
gunzip -c infile.gz |
awk '
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
otherwise:
gunzip -c infile.gz |
awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' |
sort -k1,1n -k3,3 -k2,2n |
cut -f3- |
awk '
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047"$1".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
The first awk adds two fields at the front: a 0-or-1 flag so the header line sorts before the rest during the sort phase, and the input line number so that lines with the same original first-field value keep their input order. We then sort on those keys, cut away the two added fields, and use awk to robustly and portably create the separate output files, ensuring that each output file starts with a copy of the header. Closing each output file as we go means the script works with any awk for any number of output files, and works efficiently even for a large number of output files with GNU awk. It also quotes each output file name inside the gzip command line to avoid globbing, word splitting, and filename expansion.
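To see what the decorate stage feeds to sort, you can run just the first two steps on a toy input (a sketch: the leading 0/1 flag forces the header line first, and the line number keeps ties in input order):
$ printf 'h\nb\na\n' | awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' | sort -k1,1n -k3,3 -k2,2n
0	1	h
1	3	a
1	2	b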

AWK: To print data of a file in sorted order of result obtained from columns

I have an input file that looks somewhat like this:
PlayerId,Name,Score1,Score2
1,A,40,20
2,B,30,10
3,C,25,28
I want to write an awk command that checks for players with a sum of scores greater than 50 and outputs the PlayerId and PlayerName in sorted order of their total score.
When I try the following:
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k5
It does not work and seemingly sorts them on the basis of their ids.
1 A
3 C
Whereas the correct output I'm expecting is (since Player A has a score sum of 60, C has a score sum of 53, and we want the output sorted in ascending order):
3 C
1 A
In addition to this, what confuses me a bit is that when I try to sort on the basis of score1, i.e. column 3, but intend to print only the corresponding ids and names, it doesn't work either.
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k3
And outputs :
1 A
3 C
But if $3, the column the data is being sorted on, is included in the print:
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50)print $1,$2,$3}' | sort -k3
it produces the correct output (but includes the unwanted score1 field in the display):
3 C 25
1 A 40
But what if one wants to print only the id and name fields?
Actually I'm new to awk commands, and probably I'm not using the sort command correctly. It would be really helpful if someone could explain.
I think this is what you're trying to do:
$ awk 'BEGIN{FS=","} {sum=$3+$4} sum>50{print sum,$1,$2}' file |
sort -k1,1n | cut -d' ' -f2-
3 C
1 A
You have to print the sum so you can sort by it and then the cut removes it.
If you wanted the header output too then it'd be:
$ awk 'BEGIN{FS=","} {sum=$3+$4} (NR==1) || (sum>50){print (NR>1),sum,$1,$2}' file |
sort -k1,1n -k2,2n | cut -d' ' -f3-
PlayerId Name
3 C
1 A
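If you have GNU awk, the sort can instead happen inside awk via PROCINFO["sorted_in"], with no external sort or cut; a minimal sketch (assuming the totals are unique, since tied sums would share an array index):
$ gawk -F, 'NR>1 && $3+$4 > 50 { out[$3+$4] = $1 " " $2 }  # index kept rows by total score
  END {
    PROCINFO["sorted_in"] = "@ind_num_asc"                 # visit indices in ascending numeric order
    for (s in out) print out[s]
  }' file
3 C
1 A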
If you outsource the sorting, you need to carry the auxiliary sort value along and cut it out later; some of the complication here comes from preserving the header.
$ awk -F, 'NR==1 {print s "\t" $1 FS $2; next}
(s=$3+$4)>50 {print s "\t" $1 FS $2 | "sort -n" }' file | cut -f2
PlayerId,Name
3,C
1,A

awk or shell command to count occurrence of value in 1st column based on values in 4th column

I have a large file with records like the ones below:
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
I want to find the number of people (names in col 1) that have both apple and oranges. The command should use as little memory as possible and should be fast. Any help appreciated!
Output:
awk/sed file => 2 (jon and tom)
Using awk is pretty easy:
awk -F, \
'$4 == "apple" { apple[$1]++ }
$4 == "oranges" { orange[$1]++ }
END { for (name in apple) if (orange[name]) print name }' data
It produces the required output on the sample data file:
jon
tom
Yes, you could squish all the code onto a single line, and shorten the names, and otherwise obfuscate the code.
Another way to do this avoids the END block:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && orange[$1]) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && apple[$1]) print $1 }' data
When it encounters an apple entry for the first time for a given name, it checks to see if the name also (already) has an entry for oranges and prints it if it has; likewise and symmetrically, if it encounters an orange entry for the first time for a given name, it checks to see if the name also has an entry for apple and prints it if it has.
As noted by Sundeep in a comment, it could use in:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && $1 in orange) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && $1 in apple) print $1 }' data
The first answer could also use in in the END loop.
Note that all these solutions could be embedded in a script that would accept data from standard input (a pipe or a redirected file); they have no need to read the input file twice. You'd replace data with "$@" to process file names if they're given, or standard input if no file names are specified. This flexibility is worth preserving when possible.
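For example, a minimal wrapper along those lines (hypothetical name fruits.sh) built around the second solution:
#!/bin/sh
# Prints names that have both apple and oranges entries.
# Reads the named files, or standard input if none are given.
awk -F, '
  $4 == "apple"   { if (apple[$1]++  == 0 && $1 in orange) print $1 }
  $4 == "oranges" { if (orange[$1]++ == 0 && $1 in apple)  print $1 }
' "$@"
Then ./fruits.sh data and cat data | ./fruits.sh behave identically.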
With awk
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){print $1}' ip.txt ip.txt
jon
tom
This processes the input twice:
In the first pass, add the key to array a if the last field is apple (-F, sets , as the input field separator).
In the second pass, check whether the last field is oranges and the first field is a key of array a.
To print only the number of matches:
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){c++} END{print c}' ip.txt ip.txt
2
Further reading: idiomatic awk, for details on two-file processing and awk idioms.
I did a workaround using only the grep and comm commands.
grep "apple" file | cut -d"," -f1 | sort > file1
grep "orange" file | cut -d"," -f1 | sort > file2
comm -12 file1 file2 > "names.having.both.apple&orange"
comm -12 shows only the common names between the 2 files.
Solution from Jonathan also worked.
For the input:
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
the command:
sed -n "/apple\|oranges/p" inputfile | cut -d"," -f1 | uniq -d
will output a list of people with both apples and oranges:
jon
tom
Edit after comment: For an input file where lines are not ordered by the 1st column and where each person can have two or more repeated fruits, like:
jon,1,2,apple
fred,1,2,apple
fred,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
jon,1,2,oranges
tom,1,2,apple
mary,1,2,apple
tom,1,2,oranges
This command will work:
sed -n "/\(apple\|oranges\)$/ s/,.*,/,/p" inputfile | sort -u | cut -d, -f1 | uniq -d

Unix: Get the latest entry from the file

I have a file with names and dates. I want to keep only the entry with the latest date for each name. How do I do it?
for example:
>cat user.txt
"a","03-May-13
"b","13-May-13
"a","13-Aug-13
"a","13-May-13
I am using the command sort -u user.txt, which gives the following output:
"a","11-May-13
"a","13-Aug-13
"a","13-May-13
"b","13-May-13
but I want the following output.
"a","13-Aug-13
"b","13-May-13
Can someone help?
Thanks.
Try this:
sort -t, -k2 user.txt | awk -F, '{a[$1]=$2}END{for(e in a){print e, a[e]}}' OFS=","
Explanation:
Sort the entries by the date field in ascending order and pipe the sorted result to awk, which uses the first field as a key; only the last entry among those with an identical key is kept and finally output.
EDIT
Okay, so the entries can't be sorted lexicographically; the date needs to be converted to a timestamp so it can be compared numerically. Use the following:
awk -F",\"" '{ cmd=" date --date " $2 " +%s "; cmd | getline ts; close(cmd); print ts, $0, $2}' user.txt | sort -k1 | awk -F"[, ]" '{a[$2]=$3}END{for(e in a){print e, a[e]}}' OFS=","
If you are using macOS, use gdate instead:
awk -F",\"" '{ cmd=" gdate --date " $2 " +%s "; cmd | getline ts; close(cmd); print ts, $0, $2}' user.txt | sort -k1 | awk -F"[, ]" '{a[$2]=$3}END{for(e in a){print e, a[e]}}' OFS=","
I think you need to sort by year, month and day. Can you try this?
awk -F"\"" '{print $2"-"$4}' data.txt | sort -t- -k4 -k3M -k2 | awk -F- '{kv[$1]=$2"-"$3"-"$4}END{for(k in kv){print k,kv[k]}}'
For me this is doing the job. I am sorting on the month and then applying the logic that @neevek used. So far I have been unable to find a case that fails this, but I am not sure whether it is a foolproof solution.
sort -t- -k2 -M user1.txt | awk -F, '{a[$1]=$2}END{for(e in a){print e, a[e]}}' OFS=","
Can someone tell me if this solution has any issues?
How about this?
grep `cut -d'"' -f4 user.txt | sort -t- -k 3 -k 2M -k 1n | tail -1` user.txt
Explaining: extract the dates (the fourth field when cutting on the " delimiter), sort them by year, month and day, take the latest with tail -1, and then grep for that date in the original file.
Edit: fixed to sort by month.

Find the first or third column

The following command is working as expected. What I need to find is the thread id, which is available in the first or third column.
# tail -1000 general.log | grep Connect | egrep -v "(abc|slave_user)"
2856057 Connect root@localhost on
111116 5:14:01 2856094 Connect root@localhost on
If the line starts with a date, select the third column (i.e. 2856094); otherwise select the first column (i.e. 2856057).
Expected output:
2856057
2856094
Another way to look at it is that you always take the fourth column when counting from the right:
awk '{ print $(NF-3) }'
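For instance, plugged into the pipeline from the question:
tail -1000 general.log | grep Connect | egrep -v "(abc|slave_user)" | awk '{ print $(NF-3) }'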
Otherwise, if the date is really the only reliable indicator, try this:
awk -v Date=$(date "+%y%m%d") '$1 == Date { print $3; next } { print $1 }'
If your data really is that regular (i.e. all the columns are fixed width), then you could use cut:
tail -1000 general.log | grep Connect | egrep -v "(abc|slave_user)" | cut -c17-23
This might work for you:
tail -1000 general.log | sed -e '/abc\|slave_user/d;/ Connect.*/!d;s///;s/.* //'
Use the awk built-in variable NF to get the number of fields. If it equals 6, print the 3rd column, else print the 1st column.
awk 'NF==6{ print $3;next } { print $1 }' INPUT_FILE
Without knowing the format of the file, maybe try:
$ tail -1000 general.log | grep Connect | egrep -v "(abc|slave_user)" | awk '{if ($3 == "root@localhost"){print $1;}else{print $3}}'
Or maybe this, which is simpler:
$ awk '/Connect/ {if ($3 == "root@localhost"){print $1;}else{print $3}}' general.log
I tried. If I'm wrong, or there is a better way, I too will learn it in time. :)
Maybe this, using int()?
$ awk '/Connect/ {if (!int($3)){print $1;}else{print $3}}' general.log
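An explicit regex test is another option; a small sketch that treats a purely numeric third field as the thread id:
$ awk '/Connect/ { if ($3 ~ /^[0-9]+$/) print $3; else print $1 }' general.log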
