Select Records From File by Date - shell

I have a file format like:-
10077083725 06-OCT-17 32 06-OCT-17
10077083725 09-OCT-17 35 09-OCT-17
I want to select records (around 1 million) based on a date condition on column 4, for example dates within the last 2 months. I see 2 solutions:-
1) Convert the date to a format like 20171006, and then do a simple string comparison using awk.
2) Directly do date comparisons.
Can you suggest which one is better? Also, I was unable to find a solution for the second using shell scripts, so any tips there would be useful.

Here is an awk solution for Mac OS X with GNU coreutils installed. To use it on Linux, change gdate to date. My answer is a sort of combination of both your solutions:
# cat tst.awk
BEGIN{d=conv(d)}
d <= conv($4)
func conv(str) {
cmd = "gdate -d '" str "' +'%Y%m%d'"
res = ((cmd | getline line) > 0 ? line : "")
close(cmd)
return res
}
Use as input:
# cat file
10077083725 06-OCT-17 32 06-OCT-17
10077083725 09-OCT-17 35 09-OCT-17
then:
# awk -v d="-9 days" -f tst.awk file
10077083725 09-OCT-17 35 09-OCT-17
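Calling gdate once per record gets slow at around a million lines. As a rough sketch of option 1 from the question, the conversion can also be done inside awk itself. The function name select_since is made up for illustration, and the sketch assumes that every two-digit year means 20xx and that the cutoff is passed in as YYYYMMDD:

```shell
# select_since CUTOFF  (hypothetical helper, not part of the answer above)
# Prints the records on stdin whose 4th field (DD-MON-YY) is on or after CUTOFF.
select_since() {
    awk -v cutoff="$1" '
    BEGIN {
        # Build a lookup table: month abbreviation to two-digit month number.
        n = split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", name, " ")
        for (i = 1; i <= n; i++) num[name[i]] = sprintf("%02d", i)
    }
    {
        split($4, part, "-")     # part[1]=DD, part[2]=MON, part[3]=YY
        # Assumption: all years are 20xx. The key sorts as a plain string.
        key = "20" part[3] num[toupper(part[2])] part[1]
        if (key >= cutoff) print
    }'
}
```

With GNU date, a cutoff for "the last 2 months" could be produced with something like select_since "$(date -d '-2 months' +%Y%m%d)" < file.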

Related

Change names of a columns using a mapping file

I have a file with 3 columns like this:
NC_0001 10 x
NC_0001 11 x
NC_0002 90 y
I want to change the names of the first column using another file .txt that contains the conversion, it's like:
NC_0001 1
NC_0001 1
NC_0002 2
...
So finally I should have:
1 10 x
1 11 x
2 90 y
How can I do that?
P.S. the first file is very huge (50 GB) so I must use a unix command like awk.
awk -f script.awk map_file data_file
NR == FNR { # for the first file
tab[$1] = $2 # create a k/v of the colname and rename value
}
NR != FNR { # for the second file
$1 = tab[$1] # set first column equal to the map value
print # print
}
As a one-liner
awk 'NR==FNR{t[$1]=$2} NR!=FNR{$1=t[$1];print}' map_file data_file
If possible, you should split the large data file and run this command on each partition in parallel, then concatenate the results.
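One thing to watch with the script above: a name that is missing from the map file ends up with an empty first column. A small variation, sketched here as a shell function with the made-up name rename_col1, leaves such names untouched instead:

```shell
# rename_col1 MAP_FILE DATA_FILE  (hypothetical helper name)
# Replaces column 1 of DATA_FILE using the two-column MAP_FILE.
rename_col1() {
    awk '
    NR == FNR { map[$1] = $2; next }   # first file: remember old name and new name
    ($1 in map) { $1 = map[$1] }       # second file: rename only when a mapping exists
    { print }                          # unmapped names pass through unchanged
    ' "$1" "$2"
}
```

Whether unmapped names should pass through or be dropped depends on the data; dropping them would just mean printing inside the ($1 in map) block instead.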

Compare two timestamp columns and if difference is greater than 1 hour, trigger email alert(bash)

I have a file that looks like this:
user1,135.4,MATLAB,server1,14:53:59,15:54:28
user2,3432,Solver_HF+,server1,14:52:01,14:54:28
user3,3432,Solver_HF+,server1,14:52:01,15:54:14
user4,3432,Solver_HF+,server1,14:52:01,14:54:36
I want to run a comparison between the last two columns and if the difference is greater than an hour(such as lines 1 and 3) it will trigger something like this:
echo "individual line from file" | mail -s "subject" email@site.com
I was trying to come up with a possible solution using awk, but I'm still fairly new to linux and couldn't quite figure out something that worked.
The following awk script may be what you want (gensub and mktime require GNU awk):
awk 'BEGIN{FS=","}
{a="2019 01 01 " gensub(":"," ","g",$5);
b="2019 01 01 " gensub(":"," ","g",$6);
c = int((mktime(b)-mktime(a))/60)}
{if (c >= 60){system("echo \"" $0 "\" | mail -s \"subject\" email@site.com")}}' your_filename
Then put the script into crontab or another trigger, for example:
*/5 * * * * awk_scripts.sh
If you only want to check newly added lines, tail -n on the file may be more useful than cat.
Here you go: (using gnu awk due to mktime)
awk -F, '{
split($(NF-1),t1,":");
split($NF,t2,":");
d1=mktime("0 0 0 "t1[1]" "t1[2]" "t1[3]" 0");
d2=mktime("0 0 0 "t2[1]" "t2[2]" "t2[3]" 0");
if (d2-d1>3600) print $0}' file
user1,135.4,MATLAB,server1,14:53:59,15:54:28
user3,3432,Solver_HF+,server1,14:52:01,15:54:14
This uses comma as the field separator to get the second-to-last and last fields.
split breaks the two fields into the arrays t1 and t2 to get hours, minutes and seconds.
mktime converts these to seconds.
Do the math and print only lines where the difference is more than 3600 seconds.
This can then be piped to other commands.
See how the time functions are used in GNU awk: https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html
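Since both timestamps are plain HH:MM:SS, the difference can also be computed with simple arithmetic, which avoids the dependency on gawk's mktime. This is only a sketch: the function name over_an_hour is invented here, and it assumes both times fall on the same day (no wrap past midnight):

```shell
# over_an_hour [FILE...]  (hypothetical helper name)
# Prints the CSV lines whose last two fields (HH:MM:SS) differ by more than 3600 s.
over_an_hour() {
    awk -F, '
    # Convert HH:MM:SS to seconds since midnight; p is a local array.
    function secs(t,    p) {
        split(t, p, ":")
        return p[1] * 3600 + p[2] * 60 + p[3]
    }
    secs($NF) - secs($(NF - 1)) > 3600
    ' "$@"
}
```

The output could then be fed line by line to mail, for example with a while read loop around the mail command from the question.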

How to sum 2 hours to date column in csv file

I have a CSV file consisting of 2 columns, name and date in 24-hour format:
Name, log_date
John, 11/29/2017 23:00
I want to add 2 hours to the log date so that the date and time become:
John, 11/30/2017 01:00
I tried the command below, but with no success:
awk - F 'NR>1{$4+=(2/24);}1' OFS="," IN.csv > OUT.csv
I get the below output
2017.08
in values of the log date column
So please help
You need a language that has datetime arithmetic. Perl for example:
perl -MTime::Piece -F'/,\s*/' -slane '
$datetime = Time::Piece->strptime($F[1], $fmt);
$F[1] = ($datetime + 7200)->strftime($fmt);
print join ", ", @F
' -- -fmt="%m/%d/%Y %H:%M" <<END
John, 11/29/2017 11:00
END
John, 11/29/2017 13:00
Given your input, there's no way to indicate that the time is 11 PM. How are you supposed to know that?
Below is a one-liner in Python. This is not really usable code, but I believe it gives you the idea of using one-liners. This one-liner can be made simpler still.
python -c "s=r'John, 11/29/2017 13:00';
print(s.replace(s.split(\" \")[-1].split(\":\")[0],str(int(s.split(\" \")[-1].split(\":\")[0])+2)));";
Output
John, 11/29/2017 15:00
However, this will not roll over the date: for example 23+2 = 25, which should become 01:00 on the next day.
All you're looking for is documented here.
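Using space as the field separator (this needs gawk for mktime and strftime):
NR > 1 { # skip the header line, which has no date to convert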
split($2,D,"/")
split($3,H,":")
# format for mktime is "YYYY MM DD HH MM SS [DST]"
d = D[3] " " D[1] " " D[2]" " H[1] " " H[2] " 00"
t=mktime(d)
t = t + 7200 # add two hours
$2 = strftime("%m/%d/%Y",t)
$3 = strftime("%H:%M",t)
}1
awk -F',' '{if(NR>1){printf("%s, ", $1);system("date -d \"+2 hours " $2 "\" +\"%m/%d/%Y %H:%M\"")}else{print $0}}' IN.csv > OUT.csv
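For comparison, here is a sketch that does the arithmetic on epoch seconds, which sidesteps any ambiguity in how date parses a relative offset next to a time. It assumes GNU date (for -d and the @epoch form) and the two-column "Name, date" layout from the question; the function name add_two_hours is made up. It starts two date processes per row, so it is only reasonable for small files:

```shell
# add_two_hours  (hypothetical helper name; assumes GNU date)
# Reads "Name, MM/DD/YYYY HH:MM" CSV on stdin, copies the header line,
# and prints every other row with the timestamp shifted by 7200 seconds.
add_two_hours() {
    IFS= read -r header && printf '%s\n' "$header"
    while IFS=, read -r name stamp; do
        stamp=${stamp# }                          # drop the space after the comma
        epoch=$(date -d "$stamp" +%s) || continue # rows date cannot parse are skipped
        printf '%s, %s\n' "$name" "$(date -d "@$((epoch + 7200))" +'%m/%d/%Y %H:%M')"
    done
}
```

It would be used as add_two_hours < IN.csv > OUT.csv.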

using paste command in a loop

I am using Fedora, and bash to do some text manipulation with the files I have. I am trying to combine a large number of files, each one with two columns of data. From these files, I want to extract the data on the 2nd column of the files, and put it in a single file. Previously, I used the following script:
paste 0_0.dat 0_6.dat 0_12.dat | awk '{print $1, $2, $4}' >0.dat
But this gets painfully hard as the number of files grows -- I am trying to do this with 100 files. So I looked through the web to see if there's a simple way to achieve this, but came up empty-handed.
I'd like to invoke a 'for' loop, if possible -- for example,
for i in $(seq 0 6 600)
do
paste 0_0.dat | awk '{print $2}'>>0.dat
done
but this does not work, of course, with paste command.
Please let me know if you have any recommendations on how to do what I'm trying to do ...
DATA FILE #1 looks like below (delimited by a space)
-180 0.00025432
-179 0.000309643
-178 0.000189226
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
178 0.0023454268
179 0.002352534
180 0.001504992
DATA FILE #2
-180 0.0002352
-179 0.000423452
-178 0.00019304
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
178 0.0023454268
179 0.002352534
180 0.001504992
First column goes from -180 to 180, with increment of 1.
DESIRED
(n is the # of columns; and # of files)
-180 0.00025432 0.00025123 0.000235123 0.00023452 0.00023415 ... n
-179 0.000223432 0.0420504 0.2143450 0.002345123 0.00125235 ... n
.
.
.
-1 2E-5
0 1.4E-6
1 0.00000
.
.
.
179 0.002352534 ... n
180 0.001504992 ... n
Thanks,
join can get you your desired result.
join <(sort -r file1) <(sort -r file2)
Test:
[jaypal:~/Temp] cat file1
-180 0.00025432
-179 0.000309643
-178 0.000189226
[jaypal:~/Temp] cat file2
-180 0.0005524243
-179 0.0002424433
-178 0.0001833333
[jaypal:~/Temp] join <(sort -r file1) <(sort -r file2)
-180 0.00025432 0.0005524243
-179 0.000309643 0.0002424433
-178 0.000189226 0.0001833333
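join itself accepts exactly two input files, so for more than two the joins have to be chained one after another. With the find command, the following only works when exactly two files match -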
find . -type f -name "file*" -exec join '{}' +
How about this:
paste "$@" | awk '{ printf("%s", $1);
for (i = 2; i <= NF; i += 2)
printf(" %s", $i); printf "\n";
}'
This assumes that you don't run into a limit with paste (check how many open files it can have). The "$@" notation means 'all the arguments given, exactly as given'. The awk script simply prints $1 from each line of pasted output, followed by the even-numbered columns; followed by a newline. It doesn't validate that the odd-numbered columns all match; it would perhaps be sensible to do so, and you could code a vaguely similar loop to do so in awk. It also doesn't check that the number of fields on this line is the same as the number on the previous line; that's another reasonable check. But this does do the whole job in one pass over all the files - for an essentially arbitrary list of files.
I have 100 input files -- how do I use this code to open up these files?
You put my original answer in a script 'filter-data'; you invoke the script with the 101 file names generated by seq. The paste command pastes all 101 files together; the awk command selects the columns you are interested in.
filter-data $(seq --format="0_%g.dat" 0 6 600)
The seq command with the format will list you 101 file names; these are the 101 files that will be pasted.
You could even do without the filter-data script:
paste $(seq --format="0_%g.dat" 0 6 600) | awk '{ printf("%s", $1);
for (i = 2; i <= NF; i += 2)
printf(" %s", $i); printf "\n";
}'
I'd probably go with the more general script as the main script, and if need be I'd create a 'one-liner' that invokes the main script with the specific set of arguments currently of interest.
The other key point which might be a stumbling block: paste is not limited to 2 files only; it can paste as many files as you can have open (give or take about 3).
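The validation mentioned above (checking that the odd-numbered columns all agree with the first one) could look roughly like this. It is a sketch only: merge_columns is a made-up name, and it assumes space-separated two-column files as in the question:

```shell
# merge_columns FILE...  (hypothetical helper name)
# Pastes the files side by side and prints column 1 followed by every
# value column, stopping with an error if the key columns disagree.
merge_columns() {
    paste -d' ' "$@" | awk '
    {
        out = $1
        for (i = 2; i <= NF; i += 2) {
            if ($(i - 1) != $1) {      # key column of this file vs. the first key
                print "line " NR ": key mismatch in column " (i - 1) > "/dev/stderr"
                exit 1
            }
            out = out " " $i           # append the value column
        }
        print out
    }'
}
```

For the files in the question this would be called as merge_columns $(seq --format="0_%g.dat" 0 6 600) > 0.dat.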
Based on my assumptions that you see in the comments above, you don't need paste. Try this
awk '{
arr[$1] = arr[$1] "\t" $2 };
END {for (x=-180;x<=180;x++) print x "\t" arr[x]
}' *.txt \
| sort -n
Note that we just read all of the values into an array keyed on the first field, appending each new value to the entry for its $1 key. After all data has been read in, the END section prints out each key and its values. The output is tab-separated; change "\t" to ":" or "|", or ... shudder "," if you need to. Change the *.txt to whatever your filespec is.
Be aware that all Unix command lines have limitations to the number and size (length of filenames, not the data inside), of filenames that can be processed in 1 invocation. Let us know if you get error messages about that.
The pipe to sort ensures that data is sorted by column1.
With my test data, the output was
-178 0.0001892261 0.0001892262 0.0001892263 0.000189226
-179 0.0003096431 0.0003096432 0.0003096433 0.000309643
-180 0.000254321 0.000254322 0.000254323 0.00025432
178 0.0001892261 0.0001892262 0.0001892263 0.000189226
179 0.0003096431 0.0003096432 0.0003096433 0.000309643
180 0.000254321 0.000254322 0.000254323 0.00025432
Based on 4 files of input.
I hope this helps.
This might work for you:
echo *.dat | sed 's/\S*/<(cut -f2 &)/2g;s/^/paste /' | bash >all.dat

Humanized dates with awk?

I have this awk script that runs through a file and counts every occurrence of a given date. The date format in the original file is the standard date format, like this: Thu Mar 5 16:46:15 EST 2009 I use awk to throw away the weekday, time, and timezone, and then do my counting by pumping the dates into an associative array with the dates as indices.
In order to get the output to be sorted by date, I converted the dates to a different format that I could sort with bash sort.
Now, my output looks like this:
Date Count
03/05/2009 2
03/06/2009 1
05/13/2009 7
05/22/2009 14
05/23/2009 7
05/25/2009 7
05/29/2009 11
06/02/2009 12
06/03/2009 16
I'd really like the output to have more human readable dates, like this:
Mar 5, 2009
Mar 6, 2009
May 13, 2009
May 22, 2009
May 23, 2009
May 25, 2009
May 29, 2009
Jun 2, 2009
Jun 3, 2009
Any suggestions for a way I could do this? If I could do this on the fly when I output the count values that would be best.
UPDATE:
Here's my solution incorporating ghostdog74's example code:
grep -i "E[DS]T 2009" original.txt | awk '{printf "%s %2.d, %s\r\n",$2,$3,$6}' >dates.txt #outputs dates for counting
date -f dates.txt +'%Y %m %d' | awk ' #reformat dates as YYYYMMDD for future sort
{++total[$0]} #pump dates into associative array
END {
for (item in total) printf "%s\t%s\r\n", item, total[item] #output dates as yyyy mm dd with counts
}' | sort -t \t | awk ' #send to sort, then to cleanup
BEGIN {printf "%s\t%s\r\n","Date","Count"}
{t=$1" "$2" "$3" 0 0 0" #cleanup using example by ghostdog74
printf "%s\t%2.d\r\n",strftime("%b %d, %Y",mktime(t)),$4
}'
rm dates.txt
Sorry this looks so messy. I've tried to put clarifying comments in.
Use awk's sort and date's stdin to greatly simplify the script
Date will accept input from stdin so you can eliminate one pipe to awk and the temporary file. You can also eliminate a pipe to sort by using awk's array sort and as a result, eliminate another pipe to awk. Also, there's no need for a coprocess.
This script uses date for the monthname conversion which would presumably continue to work in other languages (ignoring the timezone and month/day order issues, though).
The end result looks like "grep|date|awk". I have broken it into separate lines for readability (it would be about half as big if the comments were eliminated):
grep -i "E[DS]T 2009" original.txt |
date -f - +'%Y %m %d' | #reformat dates as YYYYMMDD for future sort
awk '
BEGIN { printf "%s\t%s\r\n","Date","Count" }
{ ++total[$0] } #pump dates into associative array
END {
idx=1
for (item in total) {
d[idx]=item;idx++ # copy the array indices into the contents of a new array
}
c=asort(d) # sort the contents of the copy
for (i=1;i<=c;i++) { # use the contents of the copy to index into the original
printf "%s\t%2.d\r\n",strftime("%b %e, %Y",mktime(d[i]" 0 0 0")),total[d[i]]
}
}'
I get testy when I see someone using grep and awk (and sed, cut, ...) in a pipeline. Awk can fully handle the work of many utilities.
Here's a way to clean up your updated code to run in a single instance of awk (well, gawk), and using sort as a co-process:
gawk '
BEGIN {
IGNORECASE = 1
}
function mon2num(mon) {
return(((index("JanFebMarAprMayJunJulAugSepOctNovDec", mon)-1)/3)+1)
}
/ E[DS]T [[:digit:]][[:digit:]][[:digit:]][[:digit:]]/ {
month=$2
day=$3
year=$6
date=sprintf("%4d%02d%02d", year, mon2num(month), day)
total[date]++
human[date] = sprintf("%3s %2d, %4d", month, day, year)
}
END {
sort_coprocess = "sort"
for (date in total) {
print date |& sort_coprocess
}
close(sort_coprocess, "to")
print "Date\tCount"
while ((sort_coprocess |& getline date) > 0) {
print human[date] "\t" total[date]
}
close(sort_coprocess)
}
' original.txt
If you are using gawk:
awk 'BEGIN{
s="03/05/2009"
m=split(s,date,"/")
t=date[3]" "date[1]" "date[2]" 0 0 0" # mktime wants "YYYY MM DD HH MM SS"; s is MM/DD/YYYY
print strftime("%b %d",mktime(t))
}'
The above is just an example; since you did not show your actual code, I cannot incorporate it into your code.
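If gawk is not available, the same MM/DD/YYYY conversion can be sketched with a month lookup table in plain awk. The function name humanize_dates is invented for illustration, and it assumes the input has only "date count" data lines (no header row):

```shell
# humanize_dates  (hypothetical helper name)
# Turns "MM/DD/YYYY count" lines on stdin into "Mon D, YYYY<TAB>count".
humanize_dates() {
    awk '
    BEGIN { split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", mon, " ") }
    {
        split($1, d, "/")                 # d[1]=month, d[2]=day, d[3]=year
        # Adding 0 turns "03" into 3 so it can index the month table;
        # %d drops the leading zero from the day.
        printf "%s %d, %d\t%s\n", mon[d[1] + 0], d[2], d[3], $2
    }'
}
```

Because the input is already sorted by the numeric date, the humanized output stays in the same order.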
Why don't you prepend your awk-date to the original date? This yields a sortable key, but is human readable.
(Note: to sort right, you should make it yyyymmdd)
If needed, cut can remove the prepended column.
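A possible shape for that idea, counting straight from the original "Thu Mar 5 16:46:15 EST 2009" lines: each output row gets a yyyymmdd key in front, sort orders the rows, and cut drops the key again. This is a sketch with an invented function name (count_by_date), and it assumes the six-field date layout shown in the question:

```shell
# count_by_date [FILE...]  (hypothetical helper name)
# Counts lines per day and prints "Mon D, YYYY<TAB>count" in date order.
count_by_date() {
    awk '
    BEGIN {
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
        for (i = 1; i <= n; i++) num[m[i]] = i
    }
    # Fields: weekday, month, day, time, zone, year.
    $5 ~ /^E[DS]T$/ && ($2 in num) {
        key = sprintf("%04d%02d%02d", $6, num[$2], $3)   # sortable yyyymmdd
        count[key]++
        human[key] = sprintf("%s %d, %d", $2, $3, $6)
    }
    END {
        for (k in count) print k "\t" human[k] "\t" count[k]
    }' "$@" | sort | cut -f2-    # sort on the key, then strip it
}
```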
Gawk has strftime(). You can also call the date command to format them (man). Linux Forums gives some examples.

Resources