hadoop multiline mixed records - hadoop

I would like to parse logfiles produced by the fidonet mailer binkd. They are multi-line and, much worse, mixed: several instances can write into one logfile. For example:
27 Dec 16:52:40 [2484] BEGIN, binkd/1.0a-545/Linux -iq /tmp/binkd.conf
+ 27 Dec 16:52:40 [2484] session with 123.45.78.9 (123.45.78.9)
- 27 Dec 16:52:41 [2484] SYS BBSName
- 27 Dec 16:52:41 [2484] ZYZ First LastName
- 27 Dec 16:52:41 [2484] LOC City, Country
- 27 Dec 16:52:41 [2484] NDL 115200,TCP,BINKP
- 27 Dec 16:52:41 [2484] TIME Thu, 27 Dec 2012 21:53:22 +0600
- 27 Dec 16:52:41 [2484] VER binkd/0.9.6a-173/Win32 binkp/1.1
+ 27 Dec 16:52:43 [2484] addr: 2:1234/56.78#fidonet
- 27 Dec 16:52:43 [2484] OPT NDA CRYPT
+ 27 Dec 16:52:43 [2484] Remote supports asymmetric ND mode
+ 27 Dec 16:52:43 [2484] Remote requests CRYPT mode
- 27 Dec 16:52:43 [2484] TRF 0 0
*+ 27 Dec 16:52:43 [1520] done (from 2:456/78#fidonet, OK, S/R: 0/0 (0/0 bytes))*
+ 27 Dec 16:52:43 [2484] Remote has 0b of mail and 0b of files for us
+ 27 Dec 16:52:43 [2484] pwd protected session (MD5)
- 27 Dec 16:52:43 [2484] session in CRYPT mode
+ 27 Dec 16:52:43 [2484] done (from 2:1234/56.78#fidonet, OK, S/R: 0/0 (0/0 bytes))
So the logfile is not only multi-line with an unpredictable number of lines per session, but several records can also be interleaved: here session 1520 finished in the middle of session 2484.
What would be the right direction in Hadoop for parsing such a file? Or should I just parse it line by line, merge the lines into records somehow, and write those records into a SQL database using another set of jobs later on?
Thanks.

The right direction for Hadoop is to develop your own input format whose record reader reads the input line by line and produces logical records.
You can also do this in the mapper, which might be a bit simpler. The drawback is that this is not the standard way to package such code for Hadoop, so it is less reusable.
The other direction you mentioned is not "natural" for Hadoop, in my view. Specifically, why use the complicated (and expensive) shuffle machinery to join several lines that are already in hand?
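As a rough illustration of the grouping logic such a record reader (or mapper) would need, here is a minimal sketch in plain Python. It does not use any Hadoop API. The regular expression, the function name group_sessions, and the rule that a line starting with "done (" closes a session are assumptions drawn from the sample log above, not from binkd documentation.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the sample log: optional marker,
# timestamp, process id in brackets, then the message.
LINE_RE = re.compile(
    r"^(?:[+\-!?] )?"
    r"(\d{1,2} \w{3} \d{2}:\d{2}:\d{2}) "
    r"\[(\d+)\] "
    r"(.*)$"
)

def group_sessions(lines):
    """Yield (pid, [messages]) each time a session's 'done' line is seen."""
    open_sessions = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue  # skip lines that do not look like log entries
        _timestamp, pid, msg = m.groups()
        open_sessions[pid].append(msg)
        if msg.startswith("done ("):  # assumption: this line ends a session
            yield pid, open_sessions.pop(pid)

sample = [
    "+ 27 Dec 16:52:40 [2484] session with 123.45.78.9 (123.45.78.9)",
    "- 27 Dec 16:52:43 [2484] TRF 0 0",
    "+ 27 Dec 16:52:43 [1520] done (from 2:456/78#fidonet, OK, S/R: 0/0 (0/0 bytes))",
    "+ 27 Dec 16:52:43 [2484] done (from 2:1234/56.78#fidonet, OK, S/R: 0/0 (0/0 bytes))",
]
records = list(group_sessions(sample))
for pid, msgs in records:
    print(pid, len(msgs))
```

Because lines are buffered per process id, the interleaved session 1520 is emitted as its own record without disturbing session 2484.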

First of all, parsing the file is not your actual goal; you are trying to extract some information from your data.
In your case you can consider a multi-step MR job, where the first MR job essentially (partially) sorts your input by session_id (possibly with some filtering, some aggregation, or multiple reducers) and then the reducer or the next MR job does the actual calculation.
Without an explanation of what you are trying to extract from your log files, it is hard to give a more definitive answer.
Also, if your data is small, maybe you can process it without the MR machinery at all?
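The key-by-session idea can be sketched without Hadoop at all. The following Python snippet only simulates the map, shuffle, and reduce phases in memory. The names map_line and reduce_session, the summary fields, and the use of the bracketed process id as the session key are assumptions made for illustration.

```python
import re
from itertools import groupby

ENTRY_RE = re.compile(r"\[(\d+)\] (.*)$")

def map_line(line):
    """Map step: emit (session key, message), or None for unparsable lines."""
    m = ENTRY_RE.search(line)
    return (m.group(1), m.group(2)) if m else None

def reduce_session(pid, messages):
    """Reduce step: summarise all messages that share one session key."""
    return {
        "pid": pid,
        "lines": len(messages),
        "finished": any(msg.startswith("done (") for msg in messages),
    }

lines = [
    "+ 27 Dec 16:52:40 [2484] session with 123.45.78.9 (123.45.78.9)",
    "- 27 Dec 16:52:43 [2484] TRF 0 0",
    "+ 27 Dec 16:52:43 [1520] done (from 2:456/78#fidonet, OK, S/R: 0/0 (0/0 bytes))",
    "+ 27 Dec 16:52:43 [2484] done (from 2:1234/56.78#fidonet, OK, S/R: 0/0 (0/0 bytes))",
]

pairs = [kv for kv in map(map_line, lines) if kv is not None]
# Stand-in for the shuffle: a stable sort by key keeps per-session line order.
pairs.sort(key=lambda kv: kv[0])
summaries = [
    reduce_session(pid, [msg for _, msg in group])
    for pid, group in groupby(pairs, key=lambda kv: kv[0])
]
print(summaries)
```

What the reduce step should compute depends on the information you actually want to extract; the line count and the finished flag are placeholders.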

Related

How to perform date operations in bash

Basically, I want to take a time and a day of the week in UTC+8 and adjust the datetime to a given UTC offset, within bash. I don't have any code to show because, honestly, I'm not sure how to start attempting this in the first place.
(I'm writing a custom script for a friend who lives in UTC+8 and want to make the input as easy as possible for them. Basically, they give it a time in their timezone and a day of the week, and it tells them what date and time that will be in a different timezone, for an overarching purpose.)
For reference, look at section 1 of the manual page for "date":
In your shell, just type: man 1 date
or see the online man page:
https://man7.org/linux/man-pages/man1/date.1.html
One way is to parse the date into the number of seconds since epoch (since 1970), and then convert that number of seconds into the format you want:
For example:
$ date +%s --date='2022-12-27 11:30:17 +08'
1672111817
$ date +%c --date='@1672111817'
Mon 26 Dec 2022 10:30:17 PM EST
Or you could convert to ISO format and then back to local time:
$ date -Iseconds --date='TZ="GMT" 2022-12-22 11:33:44 +08'
2022-12-21T22:33:44-05:00
$ date --date='2022-12-21T22:33:44-05:00'
Wed 21 Dec 2022 10:33:44 PM EST
I hope this helps you get started with some ideas for converting to/from different timezones.
Also, to help with user input, you can show the current month calendar using cal
$ cal
   December 2022
Su Mo Tu We Th Fr Sa
             1  2  3
 4  5  6  7  8  9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31
$ date --date='Fri 08:30'
Fri 30 Dec 2022 08:30:00 AM EST
In the above example, I specified "Fri 08:30" which gets set to the next Friday at 08:30 in the morning for my local timezone.
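To make the arithmetic explicit, here is the same "next given weekday at a given time, then shift to another offset" logic as a small sketch. It is written in Python rather than bash, purely as an illustration. The function name convert_weekday_time and its parameters are hypothetical, the offsets are treated as fixed (no daylight saving), and the current time is passed in explicitly so the result is reproducible.

```python
from datetime import datetime, timedelta, timezone

def convert_weekday_time(weekday, hour, minute, dst_offset_hours, now):
    """Return the next `weekday` at hour:minute in UTC+8, shown in another offset.

    weekday uses Python's convention: 0 = Monday ... 6 = Sunday.
    """
    src = timezone(timedelta(hours=8))                 # the friend's UTC+8
    dst = timezone(timedelta(hours=dst_offset_hours))  # target fixed offset
    local_now = now.astimezone(src)
    days_ahead = (weekday - local_now.weekday()) % 7
    candidate = (local_now + timedelta(days=days_ahead)).replace(
        hour=hour, minute=minute, second=0, microsecond=0
    )
    if candidate <= local_now:
        candidate += timedelta(days=7)  # that time already passed this week
    return candidate.astimezone(dst)

# Fixed "current time" so the example is reproducible.
now = datetime(2022, 12, 27, 3, 30, 17, tzinfo=timezone.utc)
result = convert_weekday_time(4, 8, 30, -5, now)  # Friday 08:30 in UTC+8
print(result.isoformat())  # 2022-12-29T19:30:00-05:00
```

In a real script the now argument would be datetime.now(timezone.utc); the same steps can be reproduced in bash by combining the date invocations shown above.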

Printing end-of-line field in awk causing formatting issues

I have a log file that contains output from
/bin/df -h| /usr/bin/grep p_log|/usr/bin/awk -v date="$(date)" '{print date,$4,$5}'
which is later sent out using mailx. It arrives in my PC's Outlook as desired with a line per entry and displays with a line end in cat -A:
Wed Mar 16 10:29:01 EDT 2022 291G 95%$
Wed Mar 16 11:29:01 EDT 2022 290G 95%$
Wed Mar 16 12:29:02 EDT 2022 290G 95%$
Adding an additional field to the awk - $6 happens to be the last field in the df output - still displays the same with cat:
Wed Mar 16 11:29:01 EDT 2022 290G 95%$
Wed Mar 16 12:29:02 EDT 2022 290G 95%$
Wed Mar 16 13:29:01 EDT 2022 290G 95% /.p_log$
Wed Mar 16 14:29:02 EDT 2022 290G 95% /.p_log$
But lines are now concatenated when read in Windows/Outlook:
Wed Mar 16 10:29:01 EDT 2022 291G 95%
Wed Mar 16 11:29:01 EDT 2022 290G 95%
Wed Mar 16 12:29:02 EDT 2022 290G 95%
Wed Mar 16 13:29:01 EDT 2022 290G 95% /.p_log Wed Mar 16 14:29:02 EDT 2022 290G 95% /.p_log
I found another post that explains that cat -e (which I have tried, and which is encompassed by -A) "displays Unix line endings (\n or LF) as $ and Windows line endings (\r\n or CRLF) as ^M$". Why, then, are two lines that display the same control characters in cat displayed differently in Windows, and how best can I get the line feed back when printing $6 without messing up the formatting of the log? I presume there are more hidden control characters that cat -A does not display, i.e. that 'all' does not actually mean all.
Further testing: There are header and footer lines - all ending in the same "$"- that do not get concatenated. I tried attaching the content from the end of one of the concatenated lines to a header line and that would indicate that it's the "/" that's causing the problem, but only for mailx.
Looks like I've been barking up the wrong tree; not sure if I should delete this question and open a new one for mailx?
Adding two spaces to the start of each line - https://stackoverflow.com/a/22098987/8823709 - resolved the issue. I don't know why, but it did.

Command to sort 3 letters month in AIX

I have a file containing these values:
Sep 17 11:07 2016
Jan 03 20:33 2018
Apr 14 11:53 2015
Dec 28 07:28 2017
Aug 10 11:55 2011
Dec 25 17:53 2017
Other flavors of UNIX have the sort -M option, but I have had no luck with it in AIX.
Please help with sorting by the 3-letter month. Thanks in advance!
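Where sort -M is not available, one possible workaround is to map each month abbreviation to its number and sort on that key. The sketch below does this in Python; it assumes a Python interpreter is available on the machine, and MONTH_INDEX and sort_by_month are illustrative names. Like sort -M, it orders by month only, so lines with the same month keep their original relative order.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
MONTH_INDEX = {name: number for number, name in enumerate(MONTHS, start=1)}

def sort_by_month(lines):
    """Sort lines by the 3-letter month in their first field (stable sort)."""
    return sorted(lines, key=lambda line: MONTH_INDEX[line.split()[0]])

values = [
    "Sep 17 11:07 2016",
    "Jan 03 20:33 2018",
    "Apr 14 11:53 2015",
    "Dec 28 07:28 2017",
    "Aug 10 11:55 2011",
    "Dec 25 17:53 2017",
]
sorted_values = sort_by_month(values)
for line in sorted_values:
    print(line)
```

To order by full date instead, the key could be extended to a tuple of year, month number, day, and time.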

What does [143x40] mean in the output of tmux list-sessions?

I have 4 tmux sessions present. When I use
tmux list-sessions
It shows the sessions with some numbers in the brackets. That is:
t128_1: 1 windows (created Thu Jul 19 12:20:44 2018) [71x38]
t128_2: 1 windows (created Thu Jul 19 12:20:54 2018) [71x38]
t3: 1 windows (created Thu Jul 19 12:19:59 2018) [143x40]
t6: 1 windows (created Thu Jul 19 12:20:27 2018) [71x38]
What does the number [AxB] mean? And why does the t3 session have a different value than the others? Thanks for any explanation.
That's the size of the terminal (143 columns, 40 rows) the last time a client attached to the session.

How do I print the date and time of specific lines that contain a certain keyword using Ruby?

In other words, I need Ruby to read from the attached log file and report the dates and times of the lines that contain the keyword MAY_DAY. I am able to print out all the information, but I don't have the slightest idea how to print out only the specific entries.
I am an uber noob and find Ruby extremely difficult to understand. I appreciate all help and respectful criticism. Thanks
test.txt
Oct 15 12:54:01 WHERE IS THE LOVIN MAY_DAY
Oct 16 23:15:44 WHAT THE HECK CAN I DO ABOUT IT HUMP_DAY
Oct 16 14:16:09 I LOVE MY BABY GIRL MAY_DAY
Oct 16 08:25:18 CAN WAIT UNTIL MY BABY RECOVERS CRYSTAL_WIFE
Oct 18 17:48:38 I HOPE HE STOP MESSING WITH THESE FOOLISH CHILDREN TONY_SMITH
Oct 19 05:17:58 GAME TIME GO HEAD AND GET ME MAY_DAY
Oct 20 10:23:33 GAMESTOP IS WHERE ITS AT GAME_DAY
Oct 21 03:54:27 WHAT IS GOING ON WITH MY LUNCH HUNGRY_MAN
RestartMonitor.rb
class RestartMonitor
  counter = 1
  begin
    file = File.new("test.txt", "r")
    # Print every line of the file, prefixed with a running counter
    while (line = file.gets)
      puts "#{counter}: #{line}"
      counter = counter + 1
    end
    file.close
  end
end
When I run the file I get the following results:
Oct 15 12:54:01 WHERE IS THE LOVIN MAY_DAY
Oct 16 23:15:44 WHAT THE HECK CAN I DO ABOUT IT HUMP_DAY
Oct 16 14:16:09 I LOVE MY BABY GIRL MAY_DAY
Oct 16 08:25:18 CAN WAIT UNTIL MY BABY RECOVERS CRYSTAL_WIFE
Oct 18 17:48:38 I HOPE HE STOP MESSING WITH THESE FOOLISH CHILDREN TONY_SMITH
Oct 19 05:17:58 GAME TIME GO HEAD AND GET ME MAY_DAY
Oct 20 10:23:33 GAMESTOP IS WHERE ITS AT GAME_DAY
Oct 21 03:54:27 WHAT IS GOING ON WITH MY LUNCH HUNGRY_MAN
When I run the code, I would like it to display only the dates and times of the lines that have the keyword MAY_DAY, so the output should be:
Oct 15 12:54:01
Oct 16 14:16:09
Oct 19 05:17:58
One way to do it is to iterate over the lines in the file and print the start of each line that contains the keyword:
File.foreach("test.txt") do |line|
  if line.include?('MAY_DAY')
    puts line[0..14]  # the first 15 characters hold the date and time
  end
end
Since the date information (which is what you want output) appears in the same position and is the same length in every line, we don't bother doing any parsing of the text for the output - just spit out the first 15 characters.
I'm tempted to try to compress all of this into a single regular expression, but this ought to work. Obviously, you could do something other than print out the date within the conditional, and if you wanted to work with it as a date, you could pass it to DateTime.parse() (just remember to require 'date' first).
