Sequential log file timestamp check - bash

I've got a large log file and I need to confirm that its timestamped entries are in chronological sequence. I know how to read line 1 and extract the timestamp. I then need to compare it against lines 2 through the last line: if any of them are earlier than line 1, print the whole line, and continue until the last line. Then read line 2 and repeat the above for lines 3 through the last line, then read line 3 and repeat for lines 4 through the last line, and so on. The outer loop reading from the log file is no problem, but how do I read the same file again starting at line n+1, where n is the line the outer loop is reading? I.e., if the outer loop has read line 10, how do I get the inner loop to read the file starting at line 11? The log files have tens of thousands of lines and I have several dozen of them to process, so speed is important.
Log file line format is:
Sep 17 16:09:51 2014 blah blah blah…
Sep 17 16:09:52 2014 blah blah blah…
Sep 17 16:09:52 2014 blah blah blah…
Sep 17 15:11:10 2014 blah blah blah…
Sep 17 16:11:10 2014 blah blah blah…
I'm trying to detect entries like line 4.
I can switch to Perl if it's faster.
Should I read the log file into an array to make the inner loop's relative read position easy, or will the file size make the array prohibitively large?
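For what it's worth, the nested loop isn't needed to find lines like sample line 4: a single pass that remembers the latest timestamp seen so far prints exactly those lines that are earlier than some preceding line, once each. A minimal awk sketch, assuming the exact "Sep 17 16:09:51 2014" layout shown above ("logfile" stands in for your file name):

awk 'BEGIN {
    # map month names to zero-padded numbers so keys compare as strings
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
    for (i = 1; i <= 12; i++) mon[m[i]] = sprintf("%02d", i)
}
{
    # rebuild "Sep 17 16:09:51 2014" as the sortable key "2014-09-17 16:09:51"
    key = $4 "-" mon[$1] "-" sprintf("%02d", $2) " " $3
    if (key < max) print        # earlier than a timestamp already seen
    else max = key
}' logfile

Since this is one linear pass per file, tens of thousands of lines and several dozen files are no problem, and nothing has to be held in an array.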

Related

Hadoop streaming job create huge temp files

I was trying to run a Hadoop job to do word shingling, and all my nodes soon went into an unhealthy state because their storage was used up.
Here is my mapper part:
#!/usr/bin/env python
import sys

shingle = 5

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # emit every 5-character shingle with a count of 1
    for i in range(0, len(line) - shingle + 1):
        print('%s\t%s' % (line[i:i + shingle], 1))
As I understand it, that 'print' generates a temp file on each node, which occupies storage space. If I take a text file as an example:
cat README.txt | ./shingle_mapper.py >> temp.txt
I can see the size of the original and temp file:
-rw-r--r-- 1 root root 1366 Nov 13 02:46 README.txt
-rw-r--r-- 1 root root 9744 Nov 14 01:43 temp.txt
The temp file is over 7 times the size of the input file, so I guess this is why each of my nodes used up all its storage.
My question is: do I understand the temp files correctly? If so, is there any better way to reduce their size (adding additional storage is not an option for me)?
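For what it's worth, the roughly 7x factor follows directly from the mapper: with shingle = 5, each emitted record is 8 bytes (a 5-character shingle, a tab, a "1", and a newline), and a line of n characters emits n - 4 records against n + 1 input bytes. A rough sanity check of the expected blow-up, run against the same README.txt:

awk '{ in_b += length($0) + 1
       if (length($0) >= 5) out_b += (length($0) - 4) * 8 }
     END { printf "expected expansion: %.1fx\n", out_b / in_b }' README.txt

For typical ~60-character lines this lands near 7-8x, consistent with 9744 / 1366 ≈ 7.1x, so the intermediate output really is this large by construction; shrinking it means emitting fewer or smaller records (for example, aggregating counts per shingle inside the mapper before printing).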

How to print lines extracted from a log file within a specified time range?

I'd like to fetch results, let's say from 2017-12-19 19:14 till the end of that day, from a log file that looks like this:
/var/opt/MarkLogic/Logs/ErrorLog_1.txt:2017-12-19 19:14:00.723 Info: Saving /var/opt/MarkLogic/Forests/Meters/00001829
/var/opt/MarkLogic/Logs/ErrorLog_1.txt:2017-12-19 19:14:01.134 Info: Saved 9 MB at 22 MB/sec to /var/opt/MarkLogic/Forests/Meters/00001829
/var/opt/MarkLogic/Logs/ErrorLog_1.txt:2017-12-19 19:14:01.376 Info: Merging 19 MB from /var/opt/MarkLogic/Forests/Meters/0000182a and /var/opt/MarkLogic/Forests/Meters/00001829 to /var/opt/MarkLogic/Forests/Meters/0000182c, timestamp=15137318408510140
/var/opt/MarkLogic/Logs/ErrorLog_1.txt:2017-12-19 19:14:02.585 Info: Merged 18 MB in 1 sec at 15 MB/sec to /var/opt/MarkLogic/Forests/Meters/0000182c
/var/opt/MarkLogic/Logs/ErrorLog_1.txt:2017-12-19 19:14:05.200 Info: Deleted 15 MB at 337 MB/sec /var/opt/MarkLogic/Forests/Meters/0000182a
/var/opt/MarkLogic/Logs/ErrorLog_1.txt:2017-12-19 19:14:05.202 Info: Deleted 9 MB at 4274 MB/sec /var/opt/MarkLogic/Forests/Meters/00001829
I am new to Unix and only familiar with the grep command. I tried the command below:
date="2017-12-19 [19-23]:[14-59]"
echo "$date"
grep "$date" $root_path_values
but it throws an "invalid range end" error. Any solution? The date will come from a variable, so it will be unpredictable; please don't write a command that only fits this example. $root_path_values is a sequence of error files like errorLog.txt, errorLog_1.txt, errorLog_2.txt and so on.
I'd like to fetch results, let's say from 2017-12-19 19:14 till the end of that day … The date will come from a variable …
This is not a job for regular expressions. Since the timestamp has a sensible form, we can simply compare it as a whole, e.g.:
start='2017-12-19 19:14'
end='2017-12-20'
awk -v start="$start" -v end="$end" 'start <= $0 && $0 < end' ErrorLog_1.txt
Try this regexp (it covers 19:14-19:59 and then any time in hours 20-23):
egrep '2017-12-19 (19:(1[4-9]|[2-5][0-9])|2[0-3]:[0-5][0-9])' path/to/your/file
If you need the pattern in a variable:
#!/bin/bash
date="2017-12-19 (19:(1[4-9]|[2-5][0-9])|2[0-3]:[0-5][0-9])"
egrep "${date}" path/to/your/file
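Since $root_path_values names several files, the awk approach above extends naturally to all of them; a hypothetical sketch that also prints grep-style file prefixes (FILENAME is awk's built-in name of the current input file):

# $root_path_values expands, unquoted, to the list of error files as in the question
awk -v start="$start" -v end="$end" \
    'start <= $0 && $0 < end { print FILENAME ":" $0 }' $root_path_values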

VIM visual block sort numerically using Vissort

I need help using the Vissort plugin to visual-block sort numerically.
The documentation states that I can use the 'VSO n' command to set the sort to 'n' for numeric, but I'm having no success.
I'm on a Windows machine.
The example below only has one column, but in the real world I need to be able to sort numerically on any column within a text file. For now my workaround is using '!gsort.exe -k 10 -n' to sort by the 10th column.
After using 'VSO n' and running Vissort, this is how my list is sorted:
1
11
13
15
17
19
2
21
23
25
27
29
3
31
33
35
37
39
You can use GNU sort. First select the visual block, then:
:'<,'>!sort -k 1 -n
Apparently, the :VSO option only applies to the :Vissort command, not to :'<,'>B sort.
So, either of these should work:
:VSO n
:'<,'>Vissort
or
:'<,'>B sort n
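If the plugin keeps misbehaving, Vim's built-in :sort also covers the any-column case mentioned in the question: a pattern makes it skip leading text, and the n flag sorts on the first number after the match. A sketch for the 10th-column case (the pattern skips nine whitespace-separated fields):
:'<,'>sort n /\(\S\+\s\+\)\{9}/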

Hadoop Mapreduce: TextInputFormat: Meaning of position

I am trying to understand the doc, which says: "The TextInputFormat works as An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text."
What does "position" mean? Does it mean the line number in the file?
Given data in a file
dobbs 2007 20 18 15
dobbs 2008 22 20 12
doctor 2007 545525 366136 57313
doctor 2008 668666 446034 72694
Would it produce a map input like this?
(1, "dobbs 2007 20 18 15")
(2, "dobbs 2008 22 20 12")
(3, "doctor 2007 545525 366136 57313")
(4, "doctor 2008 668666 446034 72694")
In TextInputFormat, keys are the byte offset in the file, i.e. the distance from the beginning of the file to the start of the line.
For the first line, the offset (key) will be 0.
For the second line, the offset will be the length of the first line, including its line terminator; for the third line, it will be the combined length of the first two lines; and so on.
No, it will not produce the map input you expect. Assuming each word is separated by a single space, it will be something like:
(0,dobbs 2007 20 18 15)
(20,dobbs 2008 22 20 12)
(40,doctor 2007 545525 366136 57313)
(72,doctor 2008 668666 446034 72694)
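A quick way to see these keys outside Hadoop is to print each line's running byte offset; a sketch assuming one-byte '\n' line terminators ("data.txt" is a stand-in name):

awk 'BEGIN { off = 0 }
     { print off "\t" $0; off += length($0) + 1 }' data.txt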

Use cat to combine mp3 files based on filename

I have a large number of downloaded radio programs that consist of 4 mp3 files each. The files are named like so:
Show Name - Nov 28 2011 - Hour 1.mp3
Show Name - Nov 28 2011 - Hour 2.mp3
Show Name - Nov 28 2011 - Hour 3.mp3
Show Name - Nov 28 2011 - Hour 4.mp3
Show Name - Nov 29 2011 - Hour 1.mp3
Show Name - Nov 29 2011 - Hour 2.mp3
Show Name - Nov 29 2011 - Hour 3.mp3
Show Name - Nov 29 2011 - Hour 4.mp3
Show Name - Nov 30 2011 - Hour 1.mp3
and so on...
I have used the cat command to join the files with great success by moving the four files of the same date into a folder and using the wildcard:
cat *.mp3 > example.mp3
The files are all the same bitrate, sampling rate, etc.
What I would like to do is run a script that looks at the file names and combines hours 1-4 of each date, naming the output file accordingly: just the show name and the date, dropping the 'Hour 1'.
I looked around and found a number of scripts that move files around based on their names, but I'm not adept enough at bash scripting to understand the methods used and adapt them to my needs.
I'm using Ubuntu 14.04.
Many thanks in advance
You can use a bash for loop to find each distinct date name and then construct the expected mp3 names from that.
Because your files have spaces in their names and my solution uses globbing, you'll also have to edit your Internal Field Separator to ignore spaces for the duration of the script.
SAVEIFS=$IFS
IFS=$'\n\b'    # split words on newlines only, so "Nov 28 2011" stays one token
# list the distinct "Mon DD YYYY" parts (fields 4-6 of each file name)
for mdy in $(ls *.mp3 | cut -d' ' -f4,5,6 | sort -u); do
    cat *"${mdy}"*.mp3 > "showName_${mdy}_full.mp3"
done
IFS=$SAVEIFS
This won't alert you if some hours are missing for some particular date. It'll just join together whatever's there for that date.
Note: The comment pointing out that cat probably won't work for these files is spot on. The resulting file will probably be corrupted. You probably want to use something like mencoder or ffmpeg instead. (Check out this thread.)
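For instance, a hypothetical ffmpeg version of the loop body above, using its concat demuxer so the streams are copied rather than re-encoded ("list.txt" is a scratch file; assumes no single quotes in the file names):

# build the list format the concat demuxer expects: file '<name>'
printf "file '%s'\n" *"${mdy}"*.mp3 > list.txt
ffmpeg -f concat -safe 0 -i list.txt -c copy "showName_${mdy}_full.mp3"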
