Parsing entry name from a log - bash

Writing bash parsing scripts is my own personal nightmare, so here I am.
The server log format is below:
197 INFO Thu Mar 27 10:10:32 2014
seq_1_1..JobControl (DSWaitForJob): Waiting for job job_1_1_1 to finish
198 INFO Thu Mar 27 10:10:36 2014
seq_1_1..JobControl (DSWaitForJob): Job job_1_1_1 has finished, status = 3 (Aborted)
199 WARNING Thu Mar 27 10:10:36 2014
seq_1_1..JobControl (#job_1_1_1): Job job_1_1_1 did not finish OK, status = 'Aborted'
From here I need to parse out the string which follows the format:
Job job_name has finished, status = 3 (Aborted)
So from the output above I should get: job_1_1_1
What would the script for that look like if I get this server log as a certain command output?
Thanks xx

Using grep -P:
grep -oP '\w+(?= has finished, status = 3)' file
job_1_1_1
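Since the log arrives as command output rather than as a file, the same pattern can read standard input. A minimal sketch: print_log is a hypothetical stand-in for the real command, the -P option assumes GNU grep built with PCRE support, and the sed variant is offered as a fallback where -P is unavailable.

```shell
#!/bin/bash
# print_log is a hypothetical stand-in for the command that emits the server log.
print_log() {
cat <<'EOF'
197 INFO Thu Mar 27 10:10:32 2014
seq_1_1..JobControl (DSWaitForJob): Waiting for job job_1_1_1 to finish
198 INFO Thu Mar 27 10:10:36 2014
seq_1_1..JobControl (DSWaitForJob): Job job_1_1_1 has finished, status = 3 (Aborted)
199 WARNING Thu Mar 27 10:10:36 2014
seq_1_1..JobControl (#job_1_1_1): Job job_1_1_1 did not finish OK, status = 'Aborted'
EOF
}

# GNU grep: print only the word that precedes " has finished, status = 3".
aborted_jobs_grep() {
    grep -oP '\w+(?= has finished, status = 3)'
}

# Fallback without -P: capture the job name with a sed back-reference.
aborted_jobs_sed() {
    sed -n 's/.*Job \([^ ]*\) has finished, status = 3 (Aborted).*/\1/p'
}

print_log | aborted_jobs_grep
print_log | aborted_jobs_sed
```

In real use, replace print_log with the actual command and keep whichever filter suits the system.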

Related

Printing end-of-line field in awk causing formatting issues

I have a log file that contains output from
/bin/df -h| /usr/bin/grep p_log|/usr/bin/awk -v date="$(date)" '{print date,$4,$5}'
which is later sent out using mailx. It arrives in my PC's Outlook as desired with a line per entry and displays with a line end in cat -A:
Wed Mar 16 10:29:01 EDT 2022 291G 95%$
Wed Mar 16 11:29:01 EDT 2022 290G 95%$
Wed Mar 16 12:29:02 EDT 2022 290G 95%$
Adding an additional field to the awk - $6 happens to be the last field in the df output - still displays the same with cat:
Wed Mar 16 11:29:01 EDT 2022 290G 95%$
Wed Mar 16 12:29:02 EDT 2022 290G 95%$
Wed Mar 16 13:29:01 EDT 2022 290G 95% /.p_log$
Wed Mar 16 14:29:02 EDT 2022 290G 95% /.p_log$
But lines are now concatenated when read in Windows/Outlook:
Wed Mar 16 10:29:01 EDT 2022 291G 95%
Wed Mar 16 11:29:01 EDT 2022 290G 95%
Wed Mar 16 12:29:02 EDT 2022 290G 95%
Wed Mar 16 13:29:01 EDT 2022 290G 95% /.p_log Wed Mar 16 14:29:02 EDT 2022 290G 95% /.p_log
I found another post that explains that cat -e (which I have tried, and which is included in -A) "displays Unix line endings (\n or LF) as $ and Windows line endings (\r\n or CRLF) as ^M$". Why, then, are two lines that show the same control characters in cat displayed differently in Windows? And how best to get the line feed back when printing $6 without messing up the formatting of the log? I presume there are more hidden control characters that cat -A does not display, i.e. that 'all' does not actually mean all.
Further testing: There are header and footer lines - all ending in the same "$"- that do not get concatenated. I tried attaching the content from the end of one of the concatenated lines to a header line and that would indicate that it's the "/" that's causing the problem, but only for mailx.
Looks like I've been barking up the wrong tree; not sure if I should delete this question and open a new one for mailx?
Adding two spaces to the start of each line - https://stackoverflow.com/a/22098987/8823709 - resolved the issue. I don't know why, but it did.
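The two-space workaround can be applied in the awk step itself. A minimal sketch, assuming the df -h column layout from the question ($4, $5 and $6 being available space, use percentage and mount point); format_p_log is a hypothetical helper name and the sample df line is illustrative, not real output.

```shell
#!/bin/bash
# Prefix each report line with two spaces, the workaround described above.
# Reads `df -h`-style lines on stdin; the timestamp is passed in as $1.
format_p_log() {
    awk -v date="$1" '/p_log/ { printf "  %s %s %s %s\n", date, $4, $5, $6 }'
}

# Sample line standing in for real `df -h` output (values are illustrative).
echo '/dev/mapper/vg-p_log 5.5T 5.2T 290G 95% /.p_log' |
    format_p_log "Wed Mar 16 13:29:01 EDT 2022"
```

With the real data this would be called as /bin/df -h | format_p_log "$(date)", which also folds the separate grep into the awk pattern.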

Shell Script to prepare unformatted data

I have a text file, TEST.txt, which contains the unformatted data below:
0411 14:30:00 INF[baag.reporting.main.Logss.ExecuteLogsRunnable] Executing cron report Freigabe 14:30 for cron job Freigabe 14:30 for TRE_ClientServiceGroup#TEST.fs, Businesspartner#TEST.fs
0411 14:30:02 INF[baag.reporting.main.Logss.ExecuteLogsRunnable] Freigaben had no results
0411 14:30:02 INF[baag.reporting.main.Logss.ExecuteLogsRunnable] Freigabe 14:30 NOT sent to TRE_ClientServiceGroup#TEST.fs, Businesspartner#TEST.fs since all reports were empty and empty reports should not be send
0411 17:03:14 INF[baag.reporting.db.DataSourceMapFactory] Datasource [itraderdbint] has been added to datasource map
0411 17:03:14 INF[baag.reporting.db.DataSourceMapFactory] Datasource [otc_sv2599] has been added to datasource map
0411 17:03:14 INF[baag.reporting.db.DataSourceMapFactory] Datasource [qlp_devp] has been added to datasource map
0411 17:03:15 INF[baag.reporting.main.Logss.QuarzLogsManager] Added Trigger for QUARTZ that fires next on Tue Apr 13 08:00:00 CEST 2021 for Logs Compliance MAR Crossingprüfung/Frontrunning DI-FR
0411 17:03:15 INF[baag.reporting.main.Logss.QuarzLogsManager] Added Trigger for QUARTZ that fires next on Tue Apr 13 08:20:00 CEST 2021 for Logs Compliance OR Umsatzstatistik DI-FR
0411 17:03:15 INF[baag.reporting.main.Logss.QuarzLogsManager] Added Trigger for QUARTZ that fires next on Mon Apr 12 08:20:00 CEST 2021 for Logs Compliance OR Umsatzstatistik MO
Now I want to create a shell script that converts this unformatted data into the format below and writes it to, for example, PreparedFile.txt. I want to separate the fields with a pipe character (|). The first part is the date and time, which I want as one complete string. The second part always starts with INF[ and ends with ] (or we can take the whole token without spaces starting from INF[); this is my second field. The third part is the remainder of the line, which is my third field. I also want to add a header line that makes clear what each field value means:
DATE_FORMAT|ROW_EXECUTE|ROW_VALUE
0411 14:30:00|INF[baag.reporting.main.Logss.ExecuteLogsRunnable]|Executing cron report Freigabe 14:30 for cron job Freigabe 14:30 for TRE_ClientServiceGroup#TEST.fs, Businesspartner#TEST.fs
0411 14:30:02|INF[baag.reporting.main.Logss.ExecuteLogsRunnable]|Freigaben had no results
0411 14:30:02|INF[baag.reporting.main.Logss.ExecuteLogsRunnable]|Freigabe 14:30 NOT sent to TRE_ClientServiceGroup#TEST.fs, Businesspartner#TEST.fs since all reports were empty and empty reports should not be send
0411 17:03:14|INF[baag.reporting.db.DataSourceMapFactory]|Datasource [itraderdbint] has been added to datasource map
0411 17:03:14|INF[baag.reporting.db.DataSourceMapFactory]|Datasource [otc_sv2599] has been added to datasource map
0411 17:03:14|INF[baag.reporting.db.DataSourceMapFactory]|Datasource [qlp_devp] has been added to datasource map
0411 17:03:15|INF[baag.reporting.main.Logss.QuarzLogsManager]|Added Trigger for QUARTZ that fires next on Tue Apr 13 08:00:00 CEST 2021 for Logs Compliance MAR Crossingprüfung/Frontrunning DI-FR
0411 17:03:15|INF[baag.reporting.main.Logss.QuarzLogsManager]|Added Trigger for QUARTZ that fires next on Tue Apr 13 08:20:00 CEST 2021 for Logs Compliance OR Umsatzstatistik DI-FR
0411 17:03:15|INF[baag.reporting.main.Logss.QuarzLogsManager]|Added Trigger for QUARTZ that fires next on Mon Apr 12 08:20:00 CEST 2021 for Logs Compliance OR Umsatzstatistik MO
I am very new to shell scripting and don't know if this is possible with a shell script.
@Symonds
This response addresses your comment asking for a header line and further explanation.
To add the header, use echo to create PreparedFile.txt first, then use the >> operator to append to the file. You can copy the complete code to a file named Script.sh and then run it using bash Script.sh
#!/bin/bash
echo "DATE_FORMAT|ROW_EXECUTE|ROW_VALUE" > PreparedFile.txt
cat TEST.txt | sed 's/ /|/2' | sed 's/] /]|/1' >> PreparedFile.txt
As for the explanation you asked for: you can chain commands using the pipe symbol |. The sed command lets you substitute occurrences of a regular expression with a replacement. In the first sed after the cat command, I use s/ /|/2, which replaces the second occurrence of a space with |. The second sed, s/] /]|/1, replaces the first occurrence of "] " with "]|". You can read more about the sed command usage here.
You can use the shell script below and see if it helps. It uses sed and a combination of pipes to replace first the second occurrence of a space and then the space after the closing square bracket.
cat TEST.txt | sed 's/ /|/2' | sed 's/] /]|/1' > PreparedFile.txt
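The same transformation can also be written as a single sed expression with capture groups, which makes the three fields explicit. This is only a sketch under the assumption that every line has the shape "date time INF[...] message"; prepare_log is a hypothetical function name, and -E (extended regular expressions) is assumed to be supported by the local sed.

```shell
#!/bin/bash
# Convert "MMDD HH:MM:SS INF[...] message" lines into DATE|SOURCE|MESSAGE.
# Reads stdin, writes stdout, and emits the header line requested in the question.
prepare_log() {
    echo "DATE_FORMAT|ROW_EXECUTE|ROW_VALUE"
    # Group 1: date and time, group 2: the INF[...] token, group 3: rest of the line.
    sed -E 's/^([^ ]+ [^ ]+) ([^ ]+\]) (.*)$/\1|\2|\3/'
}

# Two lines from the question, used here as sample input.
printf '%s\n' \
  '0411 14:30:02 INF[baag.reporting.main.Logss.ExecuteLogsRunnable] Freigaben had no results' \
  '0411 17:03:14 INF[baag.reporting.db.DataSourceMapFactory] Datasource [itraderdbint] has been added to datasource map' \
  | prepare_log
```

For the real file this would be run as prepare_log < TEST.txt > PreparedFile.txt. Lines that do not match the assumed shape pass through unchanged.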

bash variable doubles in value - why?

I have a simple shell script set up to capture images every X seconds. For some reason the value of X seems to double each time through the loop.
#!/bin/bash
# basic setup for time-lapse
SECONDS=1
while true
do
DATE=$(date +"%Y-%m-%d_%H%M%S")
filename=${DATE}_img.jpg
# fswebcam -r 1280x720 --timestamp "%a %Y-%b-%d %H:%M (%Z)" /home/pi/JPGS/$filename
date
echo "pausing for ${SECONDS} seconds"
sleep $SECONDS
date
echo "====="
done
This is the output I get. The value of SECONDS is not manipulated inside the loop, so I'm confused about what is happening here. Also, the original interval was 30 seconds; I changed it to 1 second for testing purposes, and the date calls are for testing/debugging too.
Sun Mar 3 17:51:57 CST 2019
pausing for 1 seconds
Sun Mar 3 17:51:58 CST 2019
=====
Sun Mar 3 17:51:58 CST 2019
pausing for 2 seconds
Sun Mar 3 17:52:00 CST 2019
=====
Sun Mar 3 17:52:00 CST 2019
pausing for 4 seconds
Sun Mar 3 17:52:04 CST 2019
=====
Sun Mar 3 17:52:04 CST 2019
pausing for 8 seconds
Sun Mar 3 17:52:12 CST 2019
=====
Sun Mar 3 17:52:12 CST 2019
pausing for 16 seconds
Sun Mar 3 17:52:28 CST 2019
=====
Sun Mar 3 17:52:28 CST 2019
pausing for 32 seconds
Sun Mar 3 17:53:00 CST 2019
=====
Sun Mar 3 17:53:00 CST 2019
pausing for 64 seconds
Sun Mar 3 17:54:04 CST 2019
=====
Sun Mar 3 17:54:04 CST 2019
pausing for 128 seconds
What am I missing here?
This is running on a Raspberry Pi.
Pick a different name for $SECONDS.
$SECONDS is a built-in shell variable. It expands to the number of seconds since the shell was started.
From the Bash manual:
'SECONDS'
This variable expands to the number of seconds since the shell was
started. Assignment to this variable resets the count to the value
assigned, and the expanded value becomes the value assigned plus the
number of seconds since the assignment.
$SECONDS is actually a special Bash Variable for timing the number of seconds a script has been running. Because it's a timer, it increments automatically every second without the script doing anything. Just change the variable name to something else and you should be fine.
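The doubling follows from the manual excerpt: after SECONDS=1 the first sleep lasts 1 second, by which time SECONDS has grown to 2, so the next sleep lasts 2 seconds and leaves it at 4, and so on. A minimal sketch of the loop with the interval under a name bash does not reserve; INTERVAL and MAX_SHOTS are illustrative names that are not in the original script, and the loop is bounded only so that the sketch terminates.

```shell
#!/bin/bash
# Time-lapse loop with the interval stored in a variable bash does not manage.
# INTERVAL and MAX_SHOTS are illustrative names, not from the original script.
INTERVAL=1
MAX_SHOTS=2   # bounded here so the sketch ends; the original uses `while true`

shot=0
while [ "$shot" -lt "$MAX_SHOTS" ]
do
    stamp=$(date +"%Y-%m-%d_%H%M%S")
    filename="${stamp}_img.jpg"
    # fswebcam -r 1280x720 --timestamp "%a %Y-%b-%d %H:%M (%Z)" "/home/pi/JPGS/$filename"
    echo "pausing for ${INTERVAL} seconds (next file: ${filename})"
    sleep "$INTERVAL"
    shot=$((shot + 1))
done
```

Because INTERVAL is an ordinary variable, every pass reports and sleeps for the same value.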

SHELL SCRIPT: Save egrep results into a Variable

Hi, I am trying to save my egrep results into a variable and loop over them with a foreach.
However, I keep getting the following error with each of the following variants:
#!/bin/sh
RESULT1=$(egrep 'Begin|End' $SYNCLOG)
RESULT2=egrep 'Begin|End' $SYNCLOG
RESULT3="egrep 'Begin|End' $SYNCLOG"
Error
./test.sh: syntax error at line 24: `RESULT=$' unexpected
I am trying to get my egrep results to be saved into the variable.
The egrep will return the following results
File 2:Begin - Date :Fri Jan 10 22:44:47 SGT 2014
File 2:End - Date :Fri Jan 10 22:47:06 SGT 2014
File 3:Begin - Date : Tue Jan 11 22:32:54 SGT 2014
File 3:End - Date : Tue Jan 11 22:34:43 SGT 2014
File 4:Begin - Date : Wed Jan 12 22:46:15 SGT 2014
File 4:End - Date : Wed Jan 12 22:48:23 SGT 2014
File 5:Begin - Date : Thu Jan 13 22:30:31 SGT 2014
File 5:End - Date : Thu Jan 13 22:32:51 SGT 2014
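The problem is the combination of this sh shebang:
#!/bin/sh
and the use of $(...). That syntax is POSIX command substitution, but a legacy Bourne shell installed as /bin/sh (as on older Solaris systems) does not understand it, which would match the error shown.
To fix it, you can change the shebang to use bash instead:
#!/bin/bash
Or else use the backtick form of command substitution, which /bin/sh accepts: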
RESULT1=`egrep 'Begin|End' $SYNCLOG`
It seems you have backticks somewhere on line 24. Please paste your whole script. The excerpt above, i.e.
RESULT1=$(egrep 'Begin|End' $SYNCLOG)
should work.
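Once the capture works, the "foreach" part of the question can be done with a while read loop. A minimal sketch, assuming a shell that supports $(...): the temporary sample file and its "copying..." line are made up so that the example runs on its own, and grep -E is used as the equivalent of egrep, which recent GNU grep releases report as obsolescent.

```shell
#!/bin/bash
# SYNCLOG normally points at the real log; a temporary sample is created here
# so the sketch runs on its own (the "copying..." line is made up).
SYNCLOG=$(mktemp)
cat > "$SYNCLOG" <<'SAMPLE'
File 2:Begin - Date :Fri Jan 10 22:44:47 SGT 2014
copying...
File 2:End - Date :Fri Jan 10 22:47:06 SGT 2014
SAMPLE

# Capture all matching lines in one variable; quote "$SYNCLOG" against odd paths.
RESULT=$(grep -E 'Begin|End' "$SYNCLOG")

# Iterate over the matches one line at a time.
count=0
while IFS= read -r line
do
    count=$((count + 1))
    echo "match $count: $line"
done <<RESULTS
$RESULT
RESULTS

rm -f "$SYNCLOG"
```

Feeding the loop from a here-document rather than a pipe keeps count visible after the loop, because the loop body then runs in the current shell.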

hadoop multiline mixed records

I would like to parse logfiles produced by fidonet mailer binkd, which are multi-line and much worse - mixed: several instances can write into one logfile, for example:
27 Dec 16:52:40 [2484] BEGIN, binkd/1.0a-545/Linux -iq /tmp/binkd.conf
+ 27 Dec 16:52:40 [2484] session with 123.45.78.9 (123.45.78.9)
- 27 Dec 16:52:41 [2484] SYS BBSName
- 27 Dec 16:52:41 [2484] ZYZ First LastName
- 27 Dec 16:52:41 [2484] LOC City, Country
- 27 Dec 16:52:41 [2484] NDL 115200,TCP,BINKP
- 27 Dec 16:52:41 [2484] TIME Thu, 27 Dec 2012 21:53:22 +0600
- 27 Dec 16:52:41 [2484] VER binkd/0.9.6a-173/Win32 binkp/1.1
+ 27 Dec 16:52:43 [2484] addr: 2:1234/56.78#fidonet
- 27 Dec 16:52:43 [2484] OPT NDA CRYPT
+ 27 Dec 16:52:43 [2484] Remote supports asymmetric ND mode
+ 27 Dec 16:52:43 [2484] Remote requests CRYPT mode
- 27 Dec 16:52:43 [2484] TRF 0 0
*+ 27 Dec 16:52:43 [1520] done (from 2:456/78#fidonet, OK, S/R: 0/0 (0/0 bytes))*
+ 27 Dec 16:52:43 [2484] Remote has 0b of mail and 0b of files for us
+ 27 Dec 16:52:43 [2484] pwd protected session (MD5)
- 27 Dec 16:52:43 [2484] session in CRYPT mode
+ 27 Dec 16:52:43 [2484] done (from 2:1234/56.78#fidonet, OK, S/R: 0/0 (0/0 bytes))
So the logfile is not only multi-line with an unpredictable number of lines per session; several records can also be interleaved, e.g. session 1520 finishes in the middle of session 2484.
What would be the right direction in hadoop to parse such a file? Or shall I just parse line-by-line and then merge them somehow into a record later and write those records into a SQL database using another set of jobs later on?
Thanks.
The right direction for Hadoop would be to develop your own input format whose record reader reads the input line by line and produces logical records.
You can actually do this in the mapper as well, which might be a bit simpler. The drawback is that this is not the standard way to package such code for Hadoop, so it is less reusable.
The other direction you mentioned is not "natural" for Hadoop, in my view. Specifically, why use the complicated (and expensive) shuffle machinery to join together several lines that are already in hand?
First of all, parsing the file is not what you are trying to do; you are trying to extract some information from your data.
In your case you can consider a multi-step MR job where the first MR job essentially (partially) sorts your input by session_id (do some filtering? Some aggregation? Multiple reducers?) and then a reducer or the next MR job does the actual calculation.
Without an explanation of what you are trying to extract from your log files, it is hard to give a more definitive answer.
Also, if your data is small, maybe you can process it without the MR machinery at all?
