Split Laravel Log Files by Date - bash

I've inherited a Laravel system with a single large log file that is currently around 17 GB in size. I'm now rotating future log files monthly; however, I need to split the existing log by month.
The date is formatted as yyyy-mm-dd hh:mm:ss ("[2018-06-28 13:32:05]"). Does anybody know how I could perform the split using only bash scripting (e.g. through use of awk, sed, etc.)?
The input file name is laravel.log. I'd like the output files to have a format such as laravel-2018-06.log.
Help much appreciated.

Since the information you provide is a bit sparse, I will go with the following assumptions:
each log-entry is a single line
somewhere there is always one string of the form [yyyy-mm-dd hh:mm:ss]; if there are more, we take the first.
your log-file is sorted in time.
The regex which matches your date (with doubled backslashes, because it will be embedded in an awk string below) is:
\\[[0-9]{4}(-[0-9]{2}){2} ([0-9]{2}:){2}[0-9]{2}\\]
or a bit less strict
\\[[-:0-9 ]{19}\\]
So we can use this in combination with match(s, ere) to extract the desired string:
awk 'BEGIN{ ere="\\[[0-9]{4}(-[0-9]{2}){2} ([0-9]{2}:){2}[0-9]{2}\\]" }
     { match($0, ere); fname = "laravel-" substr($0, RSTART+1, 7) ".log" }
     fname != oname { close(oname); oname = fname }
     { print > oname }' laravel.log
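If some entries span multiple lines (stack traces are common in Laravel logs), a small variation of the above (a sketch, untested against your data) only updates the file name on lines that carry a timestamp, so continuation lines keep going to the current month's file:
awk 'BEGIN{ ere="\\[[0-9]{4}(-[0-9]{2}){2} ([0-9]{2}:){2}[0-9]{2}\\]" }
     match($0, ere) { fname = "laravel-" substr($0, RSTART+1, 7) ".log" }
     fname != oname { close(oname); oname = fname }
     oname          { print > oname }' laravel.log
The last pattern, oname, simply skips any lines that appear before the first timestamp has been seen.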
As you say that your file is a bit on the large side, you might want to test this first on a subset which covers a couple of months.
$ head -10000 laravel.log > laravel.head.log
$ awk '{...}' laravel.head.log
$ md5sum laravel.head.log
$ cat laravel-*-*.log | md5sum
If the md5sums don't match, you might have a problem.
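You can also check assumptions 1 and 2 directly on the subset by counting the lines that carry no timestamp at all; if this prints anything other than 0, some lines would be mis-assigned (a quick sanity check, nothing more):
$ grep -vcE '\[[0-9]{4}(-[0-9]{2}){2} ([0-9]{2}:){2}[0-9]{2}\]' laravel.head.log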

Related

How to display the latest line based on the file's name or the line's position in bash

I have a tricky question about how to keep the latest log data, as my server posted it twice.
This is the result after I grep from my folder (I have tons of data; this is just a simplified example):
...
20150630-201427.csv:20150630,CFIIASU,233,96.21786,0.44644,
20150630-201427.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150630-201427.csv:20150630,CFIIASU_CN,68,102.19569,0.10692
20150630-201427.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150630-201427.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150630-201427.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
...
The data actually comes from many csv files; I picked only two of them to make the example. Here is some explanation:
the example comes from two files, 20150630-201427.csv and 20150701-151654.csv, and each line has 5 columns which correspond to date, dataname, data_column1, data_column2, data_column3.
these lines have the same data date 20150630 and the same dataname (CFIIASU, CFIIASU_AU, etc.), but the numbers in the fourth and fifth columns (which are data_column2 and data_column3) are different.
How could I keep the data from 20150701-151654.csv, based on the file's name and the data date, and apply that to my whole data set?
To make it clearer: I'd like to keep the lines of "the latest csv", and the latest csv corresponds to the file's name, which in this example is 20150701. But with my whole data set I have to handle so many 20xxxxxx.csv files that I can't check them one by one.
For the example, the result should end up like this:
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
Thanks in advance.
Your question isn't clear but it sounds like this might be what you're trying to do (print all lines from the last csv mentioned in the input file):
$ tac file | awk -F':' 'NR>1 && $1!=prev{exit} {print; prev=$1}' | tac
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
or maybe this (print the last line seen for every 20150630,CFIIASU etc. pair in the input file):
$ tac file | awk -F'[:,]' '!seen[$2,$3]++' | tac
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
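If you'd rather avoid the double tac, a two-pass awk does the same job (a sketch, assuming your data is in a file named file and that the csv names sort chronologically, which yyyymmdd-hhmmss names do): the first pass finds the latest csv name, the second pass prints only its lines.
$ awk -F':' 'NR==FNR { if ($1 > latest) latest = $1; next } $1 == latest' file file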

grep listing false duplicates

I have the following data containing a subset of record numbers, formatted like so:
>head pilot.dat
AnalogPoint,206407
AnalogPoint,2584
AnalogPoint,206292
AnalogPoint,206278
AnalogPoint,206409
AnalogPoint,206410
AnalogPoint,206254
AnalogPoint,206266
AnalogPoint,206408
AnalogPoint,206284
I want to compare the list of entries to another subset file called "disps.dat" to find duplicates, which is formatted in the same way:
>head disps.dat
StatusPoint,280264
StatusPoint,280266
StatusPoint,280267
StatusPoint,280268
StatusPoint,280269
StatusPoint,280335
StatusPoint,280336
StatusPoint,280334
StatusPoint,280124
I used the command:
grep -f pilot.dat disps.dat > duplicate.dat
However, the output file "duplicate.dat" is listing records that exist in the second file "disps.dat", but do not exist in the first file.
(Note, both files are big, so the samples shown above don't have duplicates, but I do expect, and have confirmed, at least 10-12k duplicates in total.)
> head duplicate.dat
AnalogPoint,208106
AnalogPoint,208107
StatusPoint,1235220
AnalogPoint,217270
AnalogPoint,217271
AnalogPoint,217272
AnalogPoint,217273
AnalogPoint,217274
AnalogPoint,217275
AnalogPoint,217277
> grep "AnalogPoint,208106" pilot.dat
>
I tested the above command with a smaller sample of data (10 records), also formatted the same, and the results were fine, so I'm a little confused about why it fails on the larger run.
I also tried feeding it in as a string with -F, thinking that the "," comma might be the source of the issue. Right now, I am feeding the data through a 'for' loop and echoing each line, which is executing very, very slowly, but at least it will help me rule out the regex possibility.
The -x or -w option is needed to do an exact match.
-x matches the entire line, while -w matches whole words only (the match can't be bordered by other word characters), which works in my case because it stops a trailing number from matching inside a longer number.
The issue is that a record in the first file such as:
"AnalogPoint,1"
would end up flagging records in the second file like:
"AnalogPoint,10"
"AnalogPoint,123"
"AnalogPoint,100200"
And so on.
Thanks to @Barmar for pointing out my issue.
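For the record, a sketch of the fixed command: -F treats the patterns as fixed strings (so the comma needs no escaping) and -w stops "AnalogPoint,1" from matching inside "AnalogPoint,10":
grep -Fwf pilot.dat disps.dat > duplicate.dat
If every record occupies a whole line in both files, grep -Fxf pilot.dat disps.dat is stricter still, since -x requires the entire line to match.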

Create CSV from specific columns in another CSV using shell scripting

I have a CSV file with several thousand lines, and I need to take some of the columns in that file to create another CSV file to use for import to a database.
I'm not in shape with shell scripting anymore; can anyone point me in the right direction?
I have a bash script to read the source file, but when I try to print the columns I want to a new file, it just doesn't work.
while IFS=, read symbol tr_ven tr_date sec_type sec_name name
do
    echo "$name,$name,$symbol" >> output.csv
done < test.csv
Above is the code I have. Out of the 6 columns in the original file, I want to build a CSV with "column6, column6, column1".
The test CSV file is like this:
Symbol,Trading Venue,Trading Date,Security Type,Security Name,Company Name
AAAIF,Grey Market,22/01/2015,Fund,,Alternative Investment Trust
AAALF,Grey Market,22/01/2015,Ordinary Shares,,Aareal Bank AG
AAARF,Grey Market,22/01/2015,Ordinary Shares,,Aluar Aluminio Argentino S.A.I.C.
What am I doing wrong with my script? Or, is there an easier - and faster - way of doing this?
Edit
These are the real headers:
Symbol,US Trading Venue,Trading Date,OTC Tier,Caveat Emptor,Security Type,Security Class,Security Name,REG_SHO,Rule_3210,Country of Domicile,Company Name
I'm trying to get the last column, which is number 12, but it always comes up empty.
The snippet looks fine and works for me; maybe you have some weird characters in the file, or it is coming from a DOS environment (use dos2unix to "clean" it!). Also, you can make use of read -r to prevent strange behaviour with backslashes.
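For completeness, here is the loop with read -r applied and the redirection moved after done, so output.csv is opened once instead of once per line (a sketch of the same approach against the six-column test.csv, not a new one):
while IFS=, read -r symbol tr_ven tr_date sec_type sec_name name
do
    echo "$name,$name,$symbol"
done < test.csv >> output.csv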
But let's see how awk can solve this even faster:
awk 'BEGIN{FS=OFS=","} {print $6,$6,$1}' test.csv >> output.csv
Explanation
BEGIN{FS=OFS=","} this sets the input and output field separators to the comma. Alternatively, you can pass the input separator on the command line as -F',' (or -F,), or as a variable with -v FS=','; OFS can likewise be set with -v OFS=','.
{print $6,$6,$1} prints the 6th field twice and then the 1st one. Note that with print, each comma-separated argument you give is joined with the OFS that was set earlier; here, a comma.
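Given the 12-column header added in the edit, presumably only the field numbers change: the last column there is number 12, so, if you still want Company Name twice followed by Symbol, something like:
awk 'BEGIN{FS=OFS=","} {print $12,$12,$1}' test.csv >> output.csv
Note that a plain FS="," split assumes no field contains an embedded, quoted comma; the sample rows shown don't, but it's worth checking on the real file.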

Need a quick way of removing partial duplicates from a log

I'm using a bash script to grep out some lines from a log file. The basic format of this log file is:
field1: value1, field2=value2, field3=value3,
field4=value4,value5,value6, field5=value7
Sometimes there will be lines in which field1: value1 is identical, but some of the other information is either the same or different. I'd like to filter those lines out, so that I only grep out the first instance of anything that has the same "field1: value1" tuple.
I'd prefer a nice command-line one-liner if you can find something especially simple. I definitely want to keep it in the bash script. This is on linux, so we've got all the command-line tools available.
Thanks!
Using awk:
awk -F, '!arr[$1]++ { print }' LOGFILE
The awk program uses an array to keep a count of the number of times a particular "field1: value1" string is seen, but only prints the incoming line the first time.
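Since you mentioned you're already grepping the lines out, the dedup drops straight into that pipeline; a sketch, where 'PATTERN' is just a placeholder for whatever you currently grep for:
grep 'PATTERN' LOGFILE | awk -F, '!arr[$1]++'
(The { print } in the answer above is optional: printing the line is awk's default action when a pattern is true, so the sketch drops it.)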

display consolidated list of numbers from a CSV using BASH

I was sent a large list of URLs in an Excel spreadsheet, each unique according to a certain GET variable in the string (whose value is a number ranging from 5 to 7 digits in length). I have to run some queries on our databases based on those numbers, and don't want to go through the hundreds of entries weeding out the numbers one by one. What bash commands can be used to parse out the number from each line (it's the only number in each line) and consolidate the result down to one line with all the numbers, comma separated?
A sample (shortened) listing of the CSV spreadsheet includes:
http://www.domain.com/view.php?fDocumentId=123456
http://www.domain.com/view.php?fDocumentId=223456
http://www.domain.com/view.php?fDocumentId=323456
http://www.domain.com/view.php?fDocumentId=423456
DocumentId=523456
DocumentId=623456
DocumentId=723456
DocumentId=823456
....
...
The change of format was intentional, as they decided to simply reduce it down to the variable name and value after a few rows. The change of the GET variable from fDocumentId to just DocumentId was also intentional. Ideal output would look similar to:
123456,223456,323456,423456,523456,623456,723456,823456
EDIT: my apologies, I did not notice that half way through the list they decided to change things around; there are entries that, when saved as CSV, appear as:
"DocumentId=098765 COMMENT, COMMENT"
DocumentId=898765 COMMENT
DocumentId=798765- COMMENT
"DocumentId=698765- COMMENT, COMMENT"
With several other entries that look similar to any of the above rows. COMMENT can be any single string of (upper-case) characters, no longer than 3 characters per COMMENT.
Assuming the variable is always on its own, and last on the line, how about just taking whatever is to the right of the =?
sed -r "s/.*=([0-9]+)$/\1/" testdata | paste -sd","
EDIT: Ok, with the new information, you'll have to edit the regex a bit:
sed -r "s/.*f?DocumentId=([0-9]+).*/\1/" testdata | paste -sd","
Here the digits following DocumentId= (or fDocumentId=) are captured. Works for the data you've presented so far, at least.
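Another way to sidestep the comments and quoting entirely is to pull out only the DocumentId=NNN pairs and strip the name afterwards (a sketch; input.csv stands in for your saved file, and the pattern also catches the tail end of fDocumentId=):
grep -oE 'DocumentId=[0-9]+' input.csv | cut -d= -f2 | paste -sd,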
Even simpler :)
cat file.csv | cut -d "=" -f 2 | xargs
If you're not completely committed to bash, the Swiss Army Chainsaw will help:
perl -ne '{$_=~s/.*=//; $_=~s/ .*//; $_=~s/-//; chomp $_ ; print "$_," }' < YOUR_ORIGINAL_FILE
That cuts everything up to and including an =, then everything after a space, then removes any dashes. Run on the above input, it returns
123456,223456,323456,423456,523456,623456,723456,823456,098765,898765,798765,698765,
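If the trailing comma is unwanted, one option (a sketch) is simply to trim it afterwards:
perl -ne '{$_=~s/.*=//; $_=~s/ .*//; $_=~s/-//; chomp $_ ; print "$_," }' < YOUR_ORIGINAL_FILE | sed 's/,$//'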
