Performance issues with bash script

I have written a bash script that is responsible for 'collapsing' a log file. Given a log file of the format:
21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line
message
that may continue
several lines
21 Oct 2017 12:38:07 [DEBUG] Single line message
the script collapses it so that each message occupies a single line, with continuation lines joined by a separator character:
21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line; message; that may continue; several lines
21 Oct 2017 12:38:07 [DEBUG] Single line message
The following bash script achieves this goal, but at an excruciatingly slow pace: a 500 MB input log may take 30 minutes on an 8-core, 32 GB machine.
while read -r line; do
    if [ -z "$line" ]; then
        BUFFER+=$LINE_SEPARATOR
        continue
    fi
    POSSIBLE_DATE=$(cut -c1-11 <<< "$line")
    if [ "$PREV_DATE" == "$POSSIBLE_DATE" ]; then # Usually date won't change, big comparison saving.
        if [ -n "$BUFFER" ]; then
            echo "$BUFFER"
            BUFFER=""
        fi
        BUFFER+="$line"
    elif [[ "$POSSIBLE_DATE" =~ ^[0-3][0-9]\ [A-Za-z]{3}\ 2[0-9]{3} ]]; then # Valid date.
        PREV_DATE="$POSSIBLE_DATE"
        if [ -n "$BUFFER" ]; then
            echo "$BUFFER"
            BUFFER=""
        fi
        BUFFER+="$line"
    else
        BUFFER+="$line"
    fi
done
Any ideas how I can optimize this script? It doesn't appear as though the regex is the bottleneck (my first optimization), as that condition is now rarely hit.
Most of the lines in the log file are single-line messages, so it's mostly just a straight comparison of the first 11 characters; that doesn't seem like it should be so computationally expensive.
Thanks.

Using awk
It will be much faster, as it won't spawn a separate process for every input line.
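(The dominant cost in your bash loop is the $(cut ...) command substitution, which forks a new process on every iteration. Even staying in bash, the built-in substring expansion avoids that fork:
POSSIBLE_DATE=${line:0:11}   # pure-bash substring of the first 11 characters, no external process
awk, however, avoids the per-line loop overhead altogether.)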
$ awk '/^[0-9]/{if (buf) print buf; buf=$0; next} {buf=buf "; " $0} END{if (buf) print buf}' file
21 Oct 2017 12:38:03 [DEBUG] Single line message
21 Oct 2017 12:38:05 [DEBUG] Multi line; message; that may continue; several lines
21 Oct 2017 12:38:07 [DEBUG] Single line message
/^[0-9]/{if (buf) print buf; buf=$0; next} : a line starting with a digit begins a new record, so print the buffered record (if any) and start a new buffer with the current line
{buf=buf "; " $0} : any other line is a continuation, so append it to the buffer after the "; " separator
END{if (buf) print buf} : at end of input, print the last buffered record
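The same program, spread out for readability (purely cosmetic):
awk '
    /^[0-9]/ { if (buf) print buf    # date line: flush the previous record
               buf = $0; next }      # and start a new one
             { buf = buf "; " $0 }   # continuation line: append with separator
    END      { if (buf) print buf }  # flush the final record
' file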

You can use sed:
sed ':B;/^[0-9][0-9]* /N;/\n[0-9][0-9]* /!{s/\n/; /;bB};h;s/\n.*//p;x;s/.*\n//;tB' infile
You can adjust the regex '[0-9][0-9]* ' to your needs.
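Roughly how it works, command by command (each cycle builds one output record in the pattern space):
:B : label to loop back to
/^[0-9][0-9]* /N : while the pattern space starts a record (leading digits), append the next input line
/\n[0-9][0-9]* /!{s/\n/; /;bB} : if that appended line does not start a new record, turn the embedded newline into "; " and loop
h;s/\n.*//p : otherwise save both parts, trim off the new record's first line, and print the finished record
x;s/.*\n//;tB : restore the saved text, keep only the new record's first line, and loop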

Related

Printing the same contiguous lines only once using shell/awk

I have an input as below:
Sep 9 09:22:11
Hello
Hello
Sep 9 10:23:11
Hello
Hello
Hello
Sep 10 11:23:11
I expect the output as below (each run of identical contiguous lines is replaced by a single line):
Sep 9 09:22:11
Hello
Sep 9 10:23:11
Hello
Sep 10 11:23:11
Could anyone help me solve this quickly using shell or awk?
Using awk you can do this:
awk '$0 != prev; {prev=$0}' file
Sep 9 09:22:11
Hello
Sep 9 10:23:11
Hello
Sep 10 11:23:11
Command Breakup:
$0 != prev; # if previous line is not same as current then print it
{prev=$0} # store current line in a variable called prev
To remove repeats of lines, use uniq:
uniq File
With your sample input, for example:
$ uniq File
Sep 9 09:22:11
Hello
Sep 9 10:23:11
Hello
Sep 10 11:23:11
Although its name may imply that uniq concerns itself with unique lines, it does not: it looks for adjacent repeated lines and, by default, removes the repeats.
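If you also want to know how many times each line repeated, uniq -c prefixes every output line with its count:
$ uniq -c File
      1 Sep 9 09:22:11
      2 Hello
      1 Sep 9 10:23:11
      3 Hello
      1 Sep 10 11:23:11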
Just because you asked for shell too (though the given answers are all better solutions):
last=''
while IFS= read -r line
do  if [[ "$line" == "$last" ]]
    then continue
    else echo "$line"
         last="$line"
    fi
done < infile
This is simple, clear, and likely slower than either awk or uniq.

gawk - suppress output of matched lines

I'm running into an issue where gawk prints unwanted output. I want to find lines in a file that match an expression, test to see if the information in the line matches a certain condition, and then print the line if it does. I'm getting the output that I want, but gawk is also printing every line that matches the expression rather than just the lines that meet the condition.
I'm trying to search through files containing dates and times for certain actions to be executed. I want to show only lines that contain times in the future. The dates are formatted like so:
text... 2016-01-22 10:03:41 more text...
I tried using sed to just print all lines starting with ones that had the current hour, but there is no guarantee that the file contains a line with that hour, (plus there is no guarantee that the lines all have any particular year, month, day etc.) so I needed something more robust. I decided trying to convert the times into seconds since epoch, and comparing that to the current systime. If the conversion produces a number greater than systime, I want to print that line.
Right now it seems like gawk's mktime() function is the key to this. Unfortunately, it requires input in the following format:
yyyy mm dd hh mm ss
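For example (the exact value is timezone-dependent; this matches the date +%s example further down):
$ gawk 'BEGIN { print mktime("2016 01 22 10 03 41") }'
1453475021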
I'm currently searching a test file (called timecomp) for a regular expression matching the date format.
Edit: the test file only contains a date and time on each line, no other text.
I used sed to replace the date separators (i.e. /, -, and :) with a space, and then piped the output to a gawk script called stime using the following statement:
sed -e 's/[-://_]/ /g' timecomp | gawk -f stime
Here is the script
# stime
BEGIN { tsec=systime(); }
/.*20[1-9][0-9] [0-1][1-9] [0-3][0-9] [0-2][0-9] [0-6][0-9] [0-6][0-9]/ {
    if (tsec < mktime($0))
        print "\t" $0 # the tab is just to differentiate the desired output from the other lines that are being printed.
}
$1
Right now this is getting the basic information that I want, but it is also printing every line that matches the original expression, rather than just the lines containing a time in the future. Sample output:
2016 01 22 13 23 20
2016 01 22 14 56 57
2016 01 22 15 46 46
2016 01 22 16 32 30
2016 01 22 18 56 23
2016 01 22 18 56 23
2016 01 22 22 22 28
2016 01 22 22 22 28
2016 01 22 23 41 06
2016 01 22 23 41 06
2016 01 22 20 32 33
How can I print only the lines in the future?
Note: I'm doing this on a Mac, but I want it to be portable to Linux because I'm ultimately making this for some tasks I have to do at work.
I'd like to accomplish this in one script, rather than requiring the sed statement to reformat the dates, but I'm running into other issues that probably require a different question, so I'm sticking with this for now.
Any help would be greatly appreciated! Thanks!
Answered: I had a $1 on the last line of my script, and that was the cause of the additional output. (A bare $1 acts as a pattern with no action: it is true whenever the first field is non-empty and non-zero, and a pattern without an action prints the matching line by default.)
Instead of awk, this is an (almost) pure Bash solution (note that it relies on GNU date for the -d option; on a Mac, BSD date won't accept it, but GNU coreutils provides it, e.g. as gdate):
#!/bin/bash
# Regex for time string
re='[0-9]{4}-[0-9]{2}-[0-9]{2} ([0-9]{2}:){2}[0-9]{2}'
# Current time, in seconds since epoch
now=$(date +%s)
while IFS= read -r line; do
    # Match time string; skip lines that don't contain one
    [[ $line =~ $re ]] || continue
    time_string="${BASH_REMATCH[0]}"
    # Convert time string to seconds since epoch
    time_secs=$(date -d "$time_string" +%s)
    # If time is in the future, print line
    if (( time_secs > now )); then
        echo "$line"
    fi
done < <(grep 'pattern' "$1")
This takes advantage of the Coreutils date formatting to convert a date to seconds since epoch for easy comparison of two dates:
$ date
Fri, Jan 22, 2016 11:23:59 PM
$ date +%s
1453523046
And the -d option takes a date string as input:
$ date -d '2016-01-22 10:03:41' +%s
1453475021
The script does the following:
Filter the input file with grep (for lines containing a generic pattern, but could be anything)
Loop over lines containing pattern
Match the line with a regex that matches the date/time string yyyy-mm-dd hh:mm:ss and extract the match
Convert the time string to seconds since epoch
Compare that value to the time in $now, which is the current date/time in seconds since epoch
If the time from the logfile is in the future, print the line
For an example input file like this one
text 2016-01-22 10:03:41 with time in the past
more text 2016-01-22 10:03:41 matching pattern but in the past
other text 2017-01-22 10:03:41 in the future matching pattern
some text 2017-01-23 10:03:41 in the future but not matching
blahblah 2022-02-22 22:22:22 pattern and also in the future
the result is
$ date
Fri, Jan 22, 2016 11:36:54 PM
$ ./future_time logfile
other text 2017-01-22 10:03:41 in the future matching pattern
blahblah 2022-02-22 22:22:22 pattern and also in the future
This is what I have working now. It works for a few different date formats and on the actual files that have more than just the date and time. The default format that it works for is yyyy/mm/dd, but it takes an argument to specify a mm/dd/yyyy format if needed.
BEGIN { tsec=systime(); dtstr=""; dt[1]="" }
/.*[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
    cur=$0
    if ( fm=="mdy" ) {
        match($0,/[0-1][1-9][-_\/][0-3][0-9][-_\/]20[1-9][0-9]/) # mm dd yyyy
        section=substr($0,RSTART,RLENGTH)
        split(section, dt, "[-_//]")
        dtstr=dt[3] " " dt[1] " " dt[2]
        gsub(/[0-1][1-9][-\/][0-3][0-9][-\/]20[1-9][0-9]/, dtstr, cur)
    }
    gsub(/[-_:/,]/, " ", cur)
    match(cur,/20[1-9][0-9] [0-1][1-9] [0-3][0-9][[:space:] ]*[0-2][0-9] [0-6][0-9] [0-6][0-9]/)
    arr=mktime(substr(cur,RSTART,RLENGTH))
    if ( tsec < arr)
        print $0
}
I'll be adding more format options as I find more formats, but this works for all the different files I've tested so far. If they have a mm/dd/yyyy format, you call it with:
gawk -f stime fm=mdy filename
I plan on adding an option to specify the time window that you want to see, but this is an excellent start. Thank you guys again; this is going to drastically simplify a few tasks at work (I basically have to retrieve a great deal of data, often under time pressure depending on the situation).

Collect info from multiple lines

I need to extract certain info from multiple lines (5 lines for every transaction) and produce the output as a CSV file. These lines come from a maillog wherein every transaction has its own transaction id. Here's one sample transaction:
Nov 17 00:15:19 server01 sm-mta[14107]: tAGGFJla014107: from=<sender@domain>, size=2447, class=0, nrcpts=1, msgid=<201511161615.tAGGFJla014107@server01>, proto=ESMTP, daemon=MTA, tls_verify=NONE, auth=NONE, relay=[100.24.134.19]
Nov 17 00:15:19 server01 flow-control[6033]: tAGGFJla014107 accepted
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - virus.McAfee: CLEAN - Declaration for Shared Parental Leave Allocation System
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - mtaqid=tAGGFJla014107, msgid=<201511161615.tAGGFJla014107@server01>, from=<sender@domain>, size=2488, to=<recipient@domain>, relay=[100.24.134.19], disposition=Deliver
Nov 17 00:15:20 server01 sm-mta[14240]: tAGGFJla014107: to=<recipient@domain>, delay=00:00:01, xdelay=00:00:01, mailer=smtp, pri=122447, relay=relayserver.domain. [100.91.20.1], dsn=2.0.0, stat=Sent (tAGGFJlR021747 Message accepted for delivery)
What I tried: I merged these 5 lines into 1 line and used awk to parse each column; unfortunately, the column count is not uniform.
I'm looking into getting the date/time (line 1, columns 1-3), sender, recipient, and subject (line 3, words after "CLEAN -" to the end of line)
Preferably sed or awk in bash.
Thanks!
Explanation: file is your file.
The script initializes id and block to empty strings. On the first line, id takes the value of field 6 (the transaction id; a trailing colon is stripped so it also matches the lines that lack one) and the line starts block. After that, every line containing id is appended to block; when a line doesn't match, block is printed and both variables are reinitialized from that line. The END block prints the last transaction.
awk '{cur=$6; sub(/:$/,"",cur)}
     {if (id=="") {id=cur; block=$0}
      else {if ($0~id) block=block " " $0
            else {print block; block=$0; id=cur}}}
     END {if (block) print block}' file
Then you're going to have to process each line of the output.
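Alternatively, if CSV is the end goal, a single gawk pass can pull the four fields straight from the raw log, one row per transaction. A rough sketch, assuming each transaction's lines appear in the order shown in the sample (the output filename and field layout here are just for illustration):
gawk '
    /sm-mta/ && /from=</ {                 # line 1: date/time and sender
        dt = substr($0, 1, 15)
        match($0, /from=<[^>]*>/)
        from = substr($0, RSTART+6, RLENGTH-7)
    }
    /CLEAN - / {                           # line 3: subject is the text after "CLEAN - "
        subject = $0
        sub(/.*CLEAN - /, "", subject)
    }
    /sm-mta/ && /to=</ {                   # line 5: recipient; emit the CSV row
        match($0, /to=<[^>]*>/)
        to = substr($0, RSTART+4, RLENGTH-5)
        printf "%s,%s,%s,\"%s\"\n", dt, from, to, subject
    }
' maillog > transactions.csv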
There are many ways to approach this. Here is one example calling a simple script and passing the log filename as the first argument. It will parse the requested data and save the data separated into individual variables. It simply prints the results at the end.
#!/bin/bash
[ -r "$1" ] || { ## validate input file readable
    printf "error: invalid argument, file not readable '%s'\n" "$1"
    exit 1
}

while read -r line; do
    ## set date and sender from the line containing from=<
    if grep -q 'from=<' <<<"$line"; then
        dt=$(cut -c -15 <<<"$line")
        from=$(grep -o 'from=<[a-zA-Z0-9]*@[a-zA-Z0-9]*>' <<<"$line")
        sender=${from##*<}
        sender=${sender%>*}
    fi
    ## search each line for CLEAN (subject)
    if grep -q 'CLEAN' <<<"$line"; then
        subject=$(grep -o 'CLEAN.*$' <<<"$line")
        subject="${subject#*CLEAN - }"
    fi
    ## search line for to=< (recipient)
    if grep -q 'to=<' <<<"$line"; then
        to=$(grep -o 'to=<[a-zA-Z0-9]*@[a-zA-Z0-9]*>' <<<"$line")
        to=${to##*<}
        to=${to%>*}
    fi
done < "$1"
printf " date : %s\n from : %s\n to : %s\n subject: \"%s\"\n" \
"$dt" "$sender" "$to" "$subject"
Input
$ cat dat/mail.log
Nov 17 00:15:19 server01 sm-mta[14107]: tAGGFJla014107: from=<sender@domain>, size=2447, class=0, nrcpts=1, msgid=<201511161615.tAGGFJla014107@server01>, proto=ESMTP, daemon=MTA, tls_verify=NONE, auth=NONE, relay=[100.24.134.19]
Nov 17 00:15:19 server01 flow-control[6033]: tAGGFJla014107 accepted
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - virus.McAfee: CLEAN - Declaration for Shared Parental Leave Allocation System
Nov 17 00:15:19 server01 MM: [Jilter Processor 21 - Async Jilter Worker 9 - 127.0.0.1:51698-tAGGFJla014107] INFO user.log - mtaqid=tAGGFJla014107, msgid=<201511161615.tAGGFJla014107@server01>, from=<sender@domain>, size=2488, to=<recipient@domain>, relay=[100.24.134.19], disposition=Deliver
Nov 17 00:15:20 server01 sm-mta[14240]: tAGGFJla014107: to=<recipient@domain>, delay=00:00:01, xdelay=00:00:01, mailer=smtp, pri=122447, relay=relayserver.domain. [100.91.20.1], dsn=2.0.0, stat=Sent (tAGGFJlR021747 Message accepted for delivery)
Output
$ bash parsemail.sh dat/mail.log
date : Nov 17 00:15:19
from : sender@domain
to : recipient@domain
subject: "Declaration for Shared Parental Leave Allocation System"
Note: if your from/sender is not always going to be in the first line, you can simply move those lines out from under the test clause. Let me know if you have any questions.

Bash script assistance with renaming file using existing parts of filename

I'm looking for help with a bash script to do some renaming of files for me. I don't know much about bash scripting, and what I have read is overwhelming. It's a lot to know/understand for the limited applications I will probably have.
In Dropbox, my media files are named something like:
Photo Jul 04, 5 49 44 PM.jpg
Video Jun 22, 11 21 00 AM.mov
I'd like them to be renamed in the following format: 2015-07-04 1749.ext
Some difficulties:
The script has to determine if AM or PM to put in the correct 24-hour format
The year is not specified; it is safe to assume the current year
The date, minute and second have a leading zero, but the hour does not; therefore the position after the hour is not absolute
Any assistance would be appreciated. FWIW, I'm running MacOS.
Mac OS X
This uses awk to reformat the date string:
for f in *.*
do
new=$(echo "$f" | awk -F'[ .]' '
BEGIN {
split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",month)
for (i in month) {
nums[month[i]]=i
}
}
$(NF-1)=="PM" {$4+=12;}
{printf "%s 2015-%02i-%02i %02i%02i.%s",$1,nums[$2],$3,$4,$5,$8;}
')
mv "$f" "$new"
done
After the above was run, the files are now named:
$ ls -1 *.*
Photo 2015-07-04 1749.jpg
Video 2015-06-22 1121.mov
The above was tested on GNU awk but I don't believe that I have used any GNU-specific features.
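One caveat: 12 o'clock is an edge case the script doesn't handle; 12 AM should become hour 00 and 12 PM should stay 12. If your filenames can contain such times, the single PM rule could be replaced with two (a sketch, untested against your data):
$(NF-1)=="PM" && $4<12  {$4+=12}
$(NF-1)=="AM" && $4==12 {$4=0}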
GNU/Linux
GNU date has a handy feature for interpreting human-style date strings:
for f in *.*
do
prefix=${f%% *}
ext=${f##*.}
datestr=$(date -d "$(echo "$f" | sed 's/[^ ]* //; s/[.].*//; s/ /:/3; s/ /:/3; s/,//')" '+%F %H%M')
mv "$f" "$prefix $datestr.$ext"
done
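To see what the sed pipeline feeds to date, trace one name through it:
$ echo "Photo Jul 04, 5 49 44 PM.jpg" | sed 's/[^ ]* //; s/[.].*//; s/ /:/3; s/ /:/3; s/,//'
Jul 04 5:49:44 PM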
Here is an example of the script in operation:
$ ls -1 *.*
Photo Jul 04, 5 49 44 PM.jpg
Video Jun 22, 11 21 00 AM.mov
$ bash script
$ ls -1 *.*
Photo 2015-07-04 1749.jpg
Video 2015-06-22 1121.mov
While this isn't a simple parse-and-reformat for date, it isn't that difficult either. The bash string tools of parameter expansion/substring removal are all you need to parse the pieces of the date out of the filename into a form that date can consume; date -d is then used to generate a new date string suitable for a filename (see String Manipulation).
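If parameter expansion is unfamiliar, here is a quick illustration of the two forms used below (## strips the longest matching prefix, %% the longest matching suffix):
fn="Photo Jul 04, 5 49 44 PM.jpg"
echo "${fn##*.}"   # jpg    -- strip everything up to and including the last dot
echo "${fn%% *}"   # Photo  -- strip everything from the first space on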
Note: the following presumes the Dropbox filenames are in the format you specified; it doesn't care what the first part of the name or the extension is, as long as the rest matches that format. Here is an example of properly isolating the pieces of the filename needed to generate a date in the specified format.
Further, all spaces have been removed from the new filename. While you originally showed a space between the day and hours, I won't demonstrate the poor practice of putting spaces in filenames; they have been replaced with '_' and '-':
#!/bin/bash
# Photo Jul 04, 5 49 44 PM.jpg
# Video Jun 22, 11 21 00 AM.mov
# fn="Photo Jul 04, 5 49 44 PM.jpg"
fn="Video Jun 22, 11 21 00 AM.mov"
ext=${fn##*.} # determine extension
prefix=${fn%% *} # determine prefix (Photo or Video)
datestr=${fn%.${ext}} # remove extension from filename
datestr=${datestr#${prefix} } # remove prefix from datestr
day=${datestr%%,*} # isolate Month and date in day
ampm=${datestr##* } # isolate AM/PM in ampm
datestr=${datestr% ${ampm}} # remove ampm from datestr
timestr=${datestr##*, } # isolate time in timestr
timestr=$(tr ' ' ':' <<<"$timestr") # translate spaces to ':' using a herestring
cmb="$day $timestr $ampm" # combined date in a form date -d understands (AM/PM included)
## create date/time string for filename
datetm=$(date -d "$cmb" '+%Y%m%d-%H%M')
newfn="${prefix}_${datetm}.${ext}"
## example moving of file to new name
# (assumes you handle the path correctly)
printf "mv '%s' %s\n" "$fn" "$newfn"
# mv "$fn" "$newfn" # uncomment to actually rename
exit 0
Example/Output
$ bash dateinfname.sh
mv 'Video Jun 22, 11 21 00 AM.mov' Video_20150622-1121.mov

Merging CSV files : Appending instead of merging

So basically I want to merge a couple of CSV files. I'm using the following script to do that:
paste -d , *.csv > final.txt
This has worked for me in the past, but this time it doesn't: it puts the data next to each other rather than below each other. For instance, two files containing records in the following format
CreatedAt ID
Mon Jul 07 20:43:47 +0000 2014 4.86249E+17
Mon Jul 07 19:58:29 +0000 2014 4.86238E+17
Mon Jul 07 19:42:33 +0000 2014 4.86234E+17
when merged give
CreatedAt ID CreatedAt ID
Mon Jul 07 20:43:47 +0000 2014 4.86249E+17 Mon Jul 07 18:25:53 +0000 2014 4.86215E+17
Mon Jul 07 19:58:29 +0000 2014 4.86238E+17 Mon Jul 07 17:19:18 +0000 2014 4.86198E+17
Mon Jul 07 19:42:33 +0000 2014 4.86234E+17 Mon Jul 07 15:45:13 +0000 2014 4.86174E+17
Mon Jul 07 15:34:13 +0000 2014 4.86176E+17
Would anyone know the reason behind this, or what I can do to force the records to merge below each other?
Assuming that all the CSV files have the same format and start with the same header,
you can write a little script like the following to append all the files into one, keeping only a single copy of the header.
#!/bin/bash
OutFileName="X.csv" # Fix the output name
i=0 # Reset a counter
for filename in ./*.csv; do
    if [ "$filename" != "$OutFileName" ]; then # Avoid recursion
        if [[ $i -eq 0 ]]; then
            head -1 "$filename" > "$OutFileName" # Copy header if it is the first file
        fi
        tail -n +2 "$filename" >> "$OutFileName" # Append from the 2nd line of each file
        i=$(( i + 1 )) # Increase the counter
    fi
done
Notes:
The head -1 (or head -n 1) command prints the first line of a file (the head).
The tail -n +2 command prints the tail of a file starting from line number 2 (+2).
The test [ ... ] is used to exclude the output file from the input list.
The output file is rewritten each time the script runs.
The command cat a.csv b.csv > X.csv can simply be used to append a.csv and b.csv into a single file (but it copies the header twice).
The paste command pastes the files side by side; if one file has blank lines (or the files differ in length) you can get output like the one you reported above.
The -d , option asks paste to separate the pasted fields with a comma, which is not what you want for appending files of the format you reported above.
The cat command instead concatenates files and prints them to standard output, i.e. it writes one file after the other (see the small demonstration below).
Refer to man head or man tail for the syntax of the individual options (some versions allow head -1, others insist on head -n 1)...
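A tiny demonstration of the difference between paste and cat, using two throwaway files:
$ printf 'h\n1\n' > a.csv; printf 'h\n2\n' > b.csv
$ paste -d , a.csv b.csv    # side by side
h,h
1,2
$ cat a.csv b.csv           # one below the other
h
1
h
2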
Alternative simple answer, this as combine_csv.sh:
#!/bin/bash
{ head -n 1 "$1" && tail -q -n +2 "$@"; }
can be used like this:
pattern="my*filenames*.csv"
combine_csv.sh ${pattern} > result.csv
Thank you so much @wahwahwah.
I used your script to make a Nautilus action, but it works correctly only with these changes:
#!/bin/bash
for last; do true; done # the loop leaves the last argument in $last (the target directory)
OutFileName="$last/RESULT_`date +"%d-%m-%Y"`.csv" # Fix the output name
i=0 # Reset a counter
for filename in "$last/"*".csv"; do
    if [ "$filename" != "$OutFileName" ]; then # Avoid recursion
        if [[ $i -eq 0 ]]; then
            head -1 "$filename" > "$OutFileName" # Copy header if it is the first file
        fi
        tail -n +2 "$filename" >> "$OutFileName" # Append from the 2nd line of each file
        i=$(( i + 1 )) # Increase the counter
    fi
done
