Bash script to extract entries from log file based on dates specified in another file? - bash

I've got a pretty big comma-delimited CSV log file (>50000 rows, let's call it file1.csv) that looks something like this:
field1,field2,MM-DD-YY HH:MM:SS,field4,field5...
...
field1,field2,07-29-10 08:04:22.7,field4,field5...
field1,field2,07-29-10 08:04:24.7,field4,field5...
field1,field2,07-29-10 08:04:26.7,field4,field5...
field1,field2,07-29-10 08:04:28.7,field4,field5...
field1,field2,07-29-10 08:04:30.7,field4,field5...
...
As you can see, there is a field in the middle that is a time stamp.
I also have a file (let's call it file2.csv) that has a short list of times:
timestamp,YYYY,MM,DD,HH,MM,SS
20100729180031,2010,07,29,18,00,31
20100729180039,2010,07,29,18,00,39
20100729180048,2010,07,29,18,00,48
20100729180056,2010,07,29,18,00,56
20100729180106,2010,07,29,18,01,06
20100729180115,2010,07,29,18,01,15
What I would like to do is to extract only the lines in file1.csv that have times specified in file2.csv.
How do I do this with a bash script? Since file1.csv is quite large, efficiency would also be a concern. I've done very simple bash scripts before, but really don't know how to deal with this. Perhaps some implementation of awk? Or is there another way?
P.S. Complication 1: I manually spot checked some of the entries in both files to make sure they would match, and they do. There just needs to be a way to remove (or ignore) the extra ".7" at the end of the seconds ("SS") field in file1.csv.
P.P.S. Complication 2: It turns out the entries in file1.csv are all separated by about two seconds, and sometimes the time stamps in file2.csv fall right in between two of the entries in file1.csv! Is there a way to find the closest match in this case?

Taking advantage of John's answer, you could sort and join the files, printing just the columns you want (or all of them, if that's the case). Please take a look below (note that I'm assuming you're on a UNIX such as Solaris, where nawk may be faster than awk, and where gawk, which would make this even easier, may not be available):
# John's nice code
awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv > times.list
# Sorting times.list file to prepare for the join
sort times.list -o times.list
# Sorting file1.csv
sort -t, -k3,3 file1.csv -o file1.csv
# Finally joining files and printing the rows that match the times
join -t, -1 3 -2 1 -o 1.1 1.2 1.3 1.4 1.5......1.50 file1.csv times.list
One particular advantage of this method is that you can adapt it to several different cases, such as a different column order, or cases where the key columns are not concatenated. That would be very hard to do with grep (with or without regular expressions).
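For instance, if the timestamp happened to live in a different column of file1.csv (say column 5, a purely hypothetical layout), only the sort key and the join field would need to change:
# Hypothetical variant: timestamp in column 5 of file1.csv instead of column 3
sort -t, -k5,5 file1.csv -o file1.csv
join -t, -1 5 -2 1 file1.csv times.list
# (add an -o field list, as in the original command, if you need the columns back in their original order)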

If you have GNU awk (gawk), you can use this technique.
In order to match the nearest times, one approach would be to have awk print two lines for each line in file2.csv, then use that with grep -f as in John Kugelman's answer. The second line will have one second added to it.
awk -F, 'NR>1 {$1=""; print strftime("%m-%d-%y %H:%M:%S", mktime($0));
print strftime("%m-%d-%y %H:%M:%S", mktime($0) + 1)}' file2.csv > times.list
grep -f times.list file1.csv
This illustrates a couple of different techniques.
skip record number one to skip the header (matching on the header text would actually be more robust)
instead of dealing with each field individually, $1 is emptied and strftime creates the output in the desired format
mktime converts a string in the format "yyyy mm dd hh mm ss" (the -F, and the assignment to $1 remove the commas) to a number of seconds since the epoch, and we add 1 to it for the second line
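As a quick sanity check (a worked example using the first data row of file2.csv shown in the question), the script prints two lines per input row: the converted time and the same time plus one second:
# Input row: 20100729180031,2010,07,29,18,00,31
07-29-10 18:00:31
07-29-10 18:00:32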

One approach is to use awk to convert the timestamps in file2.csv to file1.csv's format, then use grep -f to search through file1.csv. This should be quite fast as it will only make one pass through file1.csv.
awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv > times.list
grep -f times.list file1.csv
You could combine this all into one line if you wish:
grep -f <(awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv) file1.csv
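One optional refinement, not part of the original answer: since the lines in times.list are plain text rather than patterns that rely on regular-expression metacharacters, you could pass -F so grep does fixed-string matching, which is usually a bit faster on a large file:
grep -F -f times.list file1.csv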

Related

Is there a way to treat a single column of integers as an array in order to extract certain digits?

I am trying to treat a series of integers as an array in order to extract the "columns" of interest.
My data after extracting a column of integers looks something like:
01010101010
10101010101
00100111100
10111100000
01011000100
If I'm only interested in the 1st, 4th, and 11th integers, I'd like the output to look like this:
010
101
000
110
010
This problem is hard to describe in words, so I'm sorry for the lack of clarity. I've tried a number of suggestions, but many things such as awk's substr() are unable to skip positions (such as the 1st, 4th, and 11th positions here).
You can use the cut command:
cut -c 1,4,11 file
-c selects only characters.
or using (gnu) awk:
awk '{print $1 $4 $11}' FS= file
FS is the field separator, which is set to the empty string in order to capture every single character.
With GNU awk, which can use the empty string as a field separator, you could do:
awk -F '' '{print $1, $4, $11}' OFS='' infile
You could also try the following awk:
awk '{print substr($0,1,1) substr($0,4,1) substr($0,11,1)}' Input_file
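As a quick sanity check, all three commands (the empty-FS variants with GNU awk, as noted) produce the output requested in the question when run on the sample data; here the input file is assumed to be named file:
$ cut -c 1,4,11 file
010
101
000
110
010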

Compare column1 in File with column1 in File2, output {Column1 File1} that does not exist in file 2

Below is my file 1 content:
123|yid|def|
456|kks|jkl|
789|mno|vsasd|
and this is my file 2 content
123|abc|def|
456|ghi|jkl|
789|mno|pqr|
134|rst|uvw|
The only thing I want to compare in File 1 based on File 2 is column 1. Based on the files above, the output should only be:
134|rst|uvw|
Line-to-line comparisons are not the answer, since columns 2 and 3 contain different things; only column 1 contains the exact same thing in both files.
How can I achieve this?
Currently I'm using this in my code:
#sort FILEs first before comparing
sort $FILE_1 > $FILE_1_sorted
sort $FILE_2 > $FILE_2_sorted
for oid in $(cat $FILE_1_sorted |awk -F"|" '{print $1}');
do
echo "output oid $oid"
#for every oid in FILE 1, compare it with oid FILE 2 and output the difference
grep -v diff "^${oid}|" $FILE_1 $FILE_2 | grep \< | cut -d \ -f 2 > $FILE_1_tmp
You can do this in Awk very easily!
awk 'BEGIN{FS=OFS="|"}FNR==NR{unique[$1]; next}!($1 in unique)' file1 file2
Awk works by processing input lines one at a time, and there are special clauses which Awk provides, BEGIN{} and END{}, which enclose actions to be run before and after the processing of the file.
So the part BEGIN{FS=OFS="|"} is run before the file processing happens, and FS and OFS are special variables in Awk which stand for the input and output field separators. Since you have provided a file that is delimited by |, you need to parse it by setting FS="|", and to print it back with |, you set OFS="|".
The main part of the command comes after the BEGIN clause. The part FNR==NR processes only the first file argument given on the command line, because NR keeps track of line numbers across both files combined while FNR counts lines only within the current file, so the two are equal only while the first file is being read. So for each $1 in the first file, the value is hashed into the array called unique, and then when the second file is processed, the part !($1 in unique) prints only those lines of the second file whose $1 value is not in the hashed array.
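As a quick check, running the command on the sample files from the question prints exactly the requested row, i.e. the one whose first column appears only in file 2:
$ awk 'BEGIN{FS=OFS="|"}FNR==NR{unique[$1]; next}!($1 in unique)' file1 file2
134|rst|uvw|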
Here is another one-liner that uses join, sort and grep:
join -t"|" -j 1 -a 2 <(sort -t"|" -k1,1 file1) <(sort -t"|" -k1,1 file2) |\
grep -E -v '.*\|.*\|.*\|.*\|'
join does two things here. It pairs all lines from both files with matching keys and, with the -a 2 option, also prints the unmatched lines from file2.
Since join requires input files to be sorted, we sort them.
Finally, grep removes every line containing four or more | delimiters, i.e. the joined (matched) lines, leaving only the lines from file2 that had no match in file1.

Compare 2 csv files and delete rows - Shell

I have 2 CSV files. One has several columns, the other is just one column with domains. Simplified data from these files would be
file1.csv:
John,example.org,MyCompany,Australia
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
file2.csv:
example.org
google.es
mysite.uk
The output should be
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
I have tried this solution
grep -v -f file2.csv file1.csv >output-file
Found here
http://www.unix.com/shell-programming-and-scripting/177207-removing-duplicate-records-comparing-2-csv-files.html
But since there is no explanation whatsoever about how the script works, and I suck at shell, I cannot tweak it to make it work for me.
A solution for this would be highly appreciated; a solution with some explanation would be awesome! :)
EDIT:
I have tried the line that was supposed to work, but for some reason it does not. Here is the output from my terminal. What's wrong with this?
Desktop $ cat file1.csv ; echo
John,example.org,MyCompany,Australia
Lenny ,domain.com,OtherCompany,US
Martha,mysite.com,ThirCompany,US
Desktop $ cat file2.csv ; echo
example.org
google.es
mysite.uk
Desktop $ grep -v -f file2.csv file1.csv
John,example.org,MyCompany,Australia
Lenny ,domain.com,OtherCompany,US
Martha,mysite.com,ThirCompany,US
Why doesn't grep remove the line
John,example.org,MyCompany,Australia
The line you posted works just fine.
$ grep -v -f file2.csv file1.csv
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
And here's an explanation. grep will search for a given pattern in a given file and print all lines that match. The simplest example of usage is:
$ grep John file1.csv
John,example.org,MyCompany,Australia
Here we used a simple literal pattern, but you can also use regular expressions (basic, extended, and even Perl-compatible ones).
To invert the logic, and print only the lines that do not match, we use the -v switch, like this:
$ grep -v John file1.csv
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
To specify more than one pattern, you can use the option -e pattern multiple times, like this:
$ grep -v -e John -e Lenny file1.csv
Martha,site.com,ThirdCompany,US
However, if there is a larger number of patterns to check for, we might use the -f file option, which reads all patterns from the specified file.
So when we combine all of those, reading patterns from a file with -f and inverting the matching logic with -v, we get the lines you need.
One in awk:
$ awk -F, 'NR==FNR{a[$1];next}($2 in a==0)' file2 file1
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
Explained:
$ awk -F, ' # using awk, comma-separated records
NR==FNR { # process the first file, file2
a[$1] # hash the domain to a
next # proceed to next record
}
($2 in a==0) # process file1, if domain in $2 not in a, print the record
' file2 file1 # file order is important

Unix cut: Print same Field twice

Say I have file - a.csv
ram,33,professional,doc
shaym,23,salaried,eng
Now I need this output (please don't ask me why)
ram,doc,doc,
shaym,eng,eng,
I am using cut command
cut -d',' -f1,4,4 a.csv
But the output remains
ram,doc
shaym,eng
That means cut can only print a field one time. I need to print the same field twice or n times.
Why do I need this ? (Optional to read)
Ah. It's a long story. I have a file like this
#,#,-,-
#,#,#,#,#,#,#,-
#,#,#,-
I have to convert this to
#,#,-,-,-,-,-
#,#,#,#,#,#,#,-
#,#,#,-,-,-,-
Here each '#' and '-' refers to different numerical data. Thanks.
You can't print the same field twice. cut prints a selection of fields (or characters or bytes) in order. See Combining 2 different cut outputs in a single command? and Reorder fields/characters with cut command for some very similar requests.
The right tool to use here is awk, if your CSV doesn't have quotes around fields.
awk -F , -v OFS=, '{print $1, $4, $4}'
If you don't want to use awk (why? what strange system has cut and sed but no awk?), you can use sed (still assuming that your CSV doesn't have quotes around fields). Match the first four comma-separated fields and select the ones you want in the order you want.
sed -e 's/^\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/\1,\4,\4/'
$ sed 's/,.*,/,/; s/\(,.*\)/\1\1,/' a.csv
ram,doc,doc,
shaym,eng,eng,
What this does:
Replace everything between the first and last comma with just a comma
Repeat the last ",something" part and tack on a comma. Voilà!
Assumptions made:
You want the first field, then twice the last field
No escaped commas within the first and last fields
Why do you need exactly this output? :-)
using perl:
perl -F, -ane 'chomp($F[3]);$a=$F[0].",".$F[3].",".$F[3];print $a."\n"' your_file
using sed:
sed 's/\([^,]*\),.*,\(.*\)/\1,\2,\2/g' your_file
As others have noted, cut doesn't support field repetition.
You can combine cut and sed, for example if the repeated element is at the end:
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/&&,/'
Output:
ram,doc,doc,
shaym,eng,eng,
Edit
To make the repetition variable, you could do something like this (assuming you have coreutils available):
n=10
rep=$(seq $n | sed 's:.*:\&:' | tr -d '\n')
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/'"$rep"',/'
Output:
ram,doc,doc,doc,doc,doc,doc,doc,doc,doc,doc,
shaym,eng,eng,eng,eng,eng,eng,eng,eng,eng,eng,
I had the same problem, but instead of listing all the columns in the awk print, I just used this (to duplicate the 2nd column):
awk -v OFS='\t' '$2=$2"\t"$2' # for tab-delimited files
For CSVs you can just use
awk -F , -v OFS=, '$2=$2","$2'
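As a quick illustration (reusing the first row of a.csv from the question above), this duplicates the 2nd column in place:
$ echo 'ram,33,professional,doc' | awk -F , -v OFS=, '$2=$2","$2'
ram,33,33,professional,doc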

awk script to delete single record, not just group of records

I have an awk command that outputs entries absent from $NEWFILE but found in $OLDFILE:
awk -F "|" 'NR==FNR{a[$4]++}!a[$4]' $NEWFILE $OLDFILE > $OUTFILE
This command works great when all entries for an entity sharing a unique identifier are not found in $NEWFILE. However, it fails when only one entry for the entity, but not all, has been removed from $NEWFILE.
Anyone have a suggestion about how I can tweak this awk command to output all the entries absent from $NEWFILE but found in $OLDFILE, regardless of whether all the entries for an entity are removed?
Sample data: newfile, oldfile
Short and sweet: Use diff. You can diff oldfile newfile | grep '^< ' | cut -b3- to limit the output to what you want.
If I understand you correctly, this is what you want
awk -F "|" 'NR==FNR{a[$1 $2 $3 $4]++}!a[$1 $2 $3 $4]' NEWFILE OLDFILE > OUTFILE
Since NEWFILE doesn't have the URLs present in OLDFILE, the unique row identifier is the composite of the first four fields; for the same reason, a simple diff won't do.
awk is a line-by-line interpreter, which is the reason only one line is removed while the others stay in place. You can do two things:
If you can, filter with an expression which is common to the lines.
For each line of oldfile, run a loop which checks it against newfile and does the filtering for you (a minimal sketch of this follows below).
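A minimal, unoptimized sketch of the second idea, assuming |-delimited files, that the first four fields identify a record, and the variable names used in the question (this will be far slower than the awk or join approaches on large files):
# Keep a line from $OLDFILE only if its first four fields do not appear
# anywhere in $NEWFILE (fixed-string match, no regular expressions)
while IFS= read -r line; do
    key=$(printf '%s\n' "$line" | cut -d'|' -f1-4)
    grep -qF -- "$key" "$NEWFILE" || printf '%s\n' "$line"
done < "$OLDFILE" > "$OUTFILE"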
Must you use awk? May we simply employ join instead, which is really what you're doing here, no?
$ join -v2 -t'|' -j4 <(sort -t'|' -k4 newfile) <(sort -t'|' -k4 oldfile) | tee outfile
P-1-01541|22|Professor|University of Alabama at Birmingham|http://www.uab.edu/
P-1-01541|22|Short-Term Scholar|University of Alabama at Birmingham|http://www.uab.edu/
This of course assumes you're joining on column 4, and, like any rudimentary join, the input must be sorted first.
