I need to extract some information from a log file using a shell script (bash). A line from the log file usually looks like this:
2009-10-02 15:41:13,796| some information
Occasionally, such a line is followed by a few more lines giving details about the event. These additional lines do not have a specific format (in particular they don't start with a timestamp).
I know how to use grep to filter the file based on keywords and expressions. What I'm having trouble with is that sometimes I need to look at specific intervals only, for example only at the events which happened during the last X minutes. I'm not experienced with shell scripting, and given the complexity of the time format this looks like a rather difficult task to me. On the other hand, I imagine this is not an unusual need, so I'm wondering: are there tools which can make this easier, or can you give me some hints on how to tackle the problem?
gawk -F"[-: ]" 'BEGIN{
fivemin = 60 * 60 * 5 #last 5 min
now=systime()
difference=now - fivemin
}
/^20/{
yr=$1
mth=$2
day=$3
hr=$4
min=$5
sec=$5
t1=mktime(yr" "mth" "day" "hr" "min" "sec)
if ( t1 >= difference) {
print
}
}' file
Basically what I'm having trouble with is that sometimes I need to look at specific intervals only.
You could use date to convert the date signature for you with the %s parameter:
%s seconds since 1970-01-01 00:00:00 UTC
With it we can make a small demonstration:
#!/bin/bash

timespan_seconds=300   # 5 minutes

# Epoch seconds for a fixed timestamp and for "now" (GNU date).
time_specified=$(date +"%s" -d "2010-08-25 14:54:40")
time_now=$(date +"%s")
time_diff=$(( time_now - timespan_seconds ))

if [ "$time_specified" -ge "$time_diff" ]; then
    echo "Time is within range"
fi
Note that this doesn't address future time.
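Applied to the original log question, the same trick can drive awk: compute the cutoff once with date, then let gawk compare each timestamped line against it. A minimal sketch, assuming GNU date and gawk, with logfile as a placeholder name:

#!/bin/bash
# Cutoff: everything newer than five minutes ago (GNU date).
cutoff=$(date -d '5 minutes ago' +%s)
gawk -v cutoff="$cutoff" '
/^20/ {                            # timestamped lines start with the year
    split($1, d, "-")              # "2009-10-02"   -> d[1], d[2], d[3]
    split($2, t, "[:,]")           # "15:41:13,796" -> t[1] ... t[4]
    ts = mktime(d[1] " " d[2] " " d[3] " " t[1] " " t[2] " " t[3])
    show = (ts >= cutoff)          # remember the decision...
}
show                               # ...so detail lines follow their event
' logfile

This also keeps the non-timestamped detail lines attached to the event they belong to, since the show flag only changes on timestamped lines.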
You might want to take a look at my Python program which extracts data from log files based on a range of times. The specification of dates is not yet implemented (it is designed to look at roughly the most recent 24 hours). The time format that it expects (e.g. Jan 14 04:10:13) looks a little different than what you want, but that could be adapted. I haven't tested it with non-timestamped lines, but it should print everything within the specified range of times.
This will give you some usage information:
timegrep.py --help
I want to separate data by weeks. The week is stated in a specific field on every line, and I would like to know how to use grep, cut, or anything else relevant on JUST that field, while still keeping the rest of the data on each line. The information has to be piped in via | because that's how the rest of my program works.
As the output gets processed, it should look something like this:
asset.14548.extension 0
asset.40795.extension 0
asset.98745.extension 1
I want to sort those names by their week number while still keeping the asset name in my output, because the number of times each asset shows up gets counted. My problem is that I can't make my program smart enough to take just the "1" from the week number while ignoring the "1" inside the asset name.
UPDATE
The closest answer I found was
grep "^.........................$week" ;
That's good, but it relies on every string being the same length. Is there a way to have it start from the right instead of the left? If so, that would answer my question.
^ tells grep to start matching from the left, and each . matches any single character, ignoring whatever is in that position.
I found what I was looking for in some documentation. Anchor matches!
grep "$week$" file
would output this if $week was 0
asset.14548.extension 0
asset.40795.extension 0
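For a match that keys on the field itself rather than on the end of the line, awk can compare just the week column. A sketch, assuming the week number is the last whitespace-separated field:

awk -v week=0 '$NF == week' file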
I couldn't find my exact question or a closely similar question with a simple answer, so hopefully it helps the next person scratching their head on this.
I have a bash script that I use to call a java class and I pass two arguments to this java class. The first argument ($1) is the string that I pass and it contains someone's name. The second argument ($2) is the previous month as a two digit number (also passed in by the user).
So the java class is called like this:
java -DCONFIG_DIR=... com.example.myapp.grades.gradingProcess $1 $2
However, now, I don't want the user to pass in the second argument and instead, I want the script to determine the month.
Can I do something like this?
month=`date +'%m' -d 'last month'`
java -DCONFIG_DIR=... com.example.myapp.grades.gradingProcess $1 $month
And when I run my script, it'll be something like this: ./myscript.sh 'John'
and not pass in a two-digit month since I'm already doing it inside the script?
Or is that not the correct way to go about it?
Sorry if this seems like an elementary question, I'm still trying to get used to bash scripts.
Thank you!
If you are looking for how to supply a default value in the shell, there is an operator for that.
month=${2-$(date -d 'last month' +%m)}
java -stuff "$1" "$month"
Now, if there was a value in $2, month will be set to that; otherwise, the default will be used. The notation ${variable-value} supplies the value of variable or, if it is unset, the text value. (There is also ${variable:-value} which produces value if variable is set but empty as well.)
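A quick demonstration of the difference between the two forms (a sketch you can paste into an interactive shell):

unset var
echo "${var-default}"    # -> default   (var is unset)
echo "${var:-default}"   # -> default   (unset counts as empty too)
var=""
echo "${var-default}"    # -> (empty)   var is set, so the - form keeps it
echo "${var:-default}"   # -> default   the :- form also replaces empty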
(This could even be inlined into the java command line, though using a variable to break it up is probably better for legibility:
java -stuff "$1" "${2-$(date -d 'last month' +%m)}"
Notice also how you basically always put user-supplied variables in double quotes.)
A few thoughts: time zone, leading zero, and leveraging Java rather than bash.
Time zone
Determining the current month requires a current date. Determining the current date requires a time zone.
For any given moment the date varies around the globe. For example, a few minutes after midnight in Paris is a new day while still “yesterday” in Montréal.
When you do not specify a time zone, a current default time zone is implicitly applied. Be aware that the default can change at any moment, and relying on it means relying on an externality outside your direct control; your results will vary with it.
So around the end/beginning of a month you could get the wrong month number, depending on the time zone in play.
Leading zero can mean octal in Java
The %m you are using produces two digits. Single digit month numbers will have a leading zero. Ex: 09
Be aware that in some situations in Java a number with a leading zero is interpreted as an octal number (base 8) rather than a decimal number (base 10).
Let Java do the work
I suggest it makes more sense to let the Java class do the work of determining the previous month. The bash line should be passing the desired/expected time zone rather than the month, if anything.
ZoneId zoneId = ZoneId.of( "America/Montreal" ) ;
int previousMonthNumber = LocalDate.now( zoneId ).minusMonths( 1 ).getMonthValue() ;
Tip: In Java, even better to use an object from the Month enum rather than a mere integer. Makes your code more self-documenting, type-safe, and guarantees valid values.
Month month = LocalDate.now( zoneId ).minusMonths( 1 ).getMonth() ;
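On the bash side, the script would then forward a zone ID instead of a month number. A sketch; the class and -DCONFIG_DIR come from the question, while the zone value is only an example:

#!/bin/bash
zone="America/Montreal"   # example; use the zone your data belongs to
java -DCONFIG_DIR=... com.example.myapp.grades.gradingProcess "$1" "$zone"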
I am trying to find a solution (Korn shell script) for splitting a long line of text into a multi-line paragraph. The script will run on AIX 5.3.
The text is at most 255 characters long and is read from an Oracle table column of type VARCHAR2.
I would like to split it into up to 10 lines of minimum 20 and maximum 30 characters per line, while ensuring that words don't get split across two lines.
So far I have managed to split within the SQL query itself using multiple SUBSTR calls, but that does not keep a word from being split across two lines, hence my hope that this can be solved within the shell script.
So for example, if the input variable is
Life is not about searching for the things that could be found. It's about letting the unexpected happen. And finding things you never searched for. Sometimes, the best way to move ahead in life is to admit that you've enough.
Output should be
Life is not about searching for
the things that could be found.
It's about letting the unexpected
happen. And finding things you
never searched for. Sometimes, the
best way to move ahead in life is
to admit that you've enough.
I'd appreciate it if someone could guide me. Can this be achieved using sed or awk, or something else?
How about this?
echo "Life is not about searching for the things that could be \
found. It's about letting the unexpected happen. And finding things \
you never searched for. Sometimes, the best way to move ahead in life \
is to admit that you've enough" |
fmt -w 30
Result:
Life is not about searching
for the things that could be
found. It's about letting
the unexpected happen.
And finding things you never
searched for. Sometimes,
the best way to move ahead
in life is to admit that
you've enough
One way using awk:
awk '{for(i=1;i<=NF;i++){printf("%s%s",$i,i%6?" ":"\n")}}'
Test:
$ echo "$line" | awk '{for(i=1;i<=NF;i++){printf("%s%s",$i,i%6?" ":"\n")}}'
Life is not about searching for
the things that could be found.
It's about letting the unexpected happen.
And finding things you never searched
for. Sometimes, the best way to
move ahead in life is to
admit that you've enough.
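Counting six words per line only approximates the 20-30 character requirement. If fmt is not available, a greedy width-based wrap can be written directly in awk; a sketch, breaking before 30 characters without splitting words:

awk -v width=30 '{
    line = ""
    for (i = 1; i <= NF; i++) {
        # start a new line when adding the next word would overflow
        if (line != "" && length(line) + 1 + length($i) > width) {
            print line
            line = $i
        } else {
            line = (line == "" ? $i : line " " $i)
        }
    }
    if (line != "") print line
}'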
Don't you guys know about "man"?
man fmt
gives a page; near the top it shows:
/usr/bin/fmt [ -Width ] [ File ... ]
thus:
fmt -20 < /etc/motd
*******************************************************************************
* * Welcome to AIX
Version
6.1!
*
* * Please see the
README file in
/usr/lpp/bos for
information
pertinent to *
* this release of
the AIX Operating
System.
*
*******************************************************************************
I am currently working on a script that processes and combines several different files. For one part, I need to find the difference between two times in order to determine the total amount of time that someone has worked. The times themselves are in the following format:
34:18:00,40:26:00,06:08:00
with the first one being the start time, the second the end time, and the third the total time. Although this example is correct, some entries need to be double-checked and corrected (their total time does not match the start/end times). I have found several solutions in other posts, but most of them also handle dates (mostly using awk), and since I am not experienced with awk I am not sure how to remove the date handling from those examples. I have also heard that I could convert the times to Unix epoch time, but I was curious whether there are other ways to accomplish this. Thanks!
Something like this might help you:
#!/bin/bash

# Convert HH:MM:SS to a number of seconds. The 10# prefix forces
# base 10 so that "08" and "09" aren't treated as invalid octal.
time2seconds() {
    local a=( ${1//:/ } )
    echo $(( 10#${a[0]}*3600 + 10#${a[1]}*60 + 10#${a[2]} ))
}

# Convert a number of seconds back to HH:MM:SS.
seconds2time() {
    printf "%.2d:%.2d:%.2d" $(($1/3600)) $((($1/60)%60)) $(($1%60))
}
IFS=, read start stop difference <<< "34:18:00,40:26:00,06:08:00"
printf "Start=%s\n" "$start"
printf "Stop=%s\n" "$stop"
printf "Difference=%s (given in file: %s)\n" $(seconds2time $(($(time2seconds $stop)-$(time2seconds $start)))) "$difference"
Output is:
Start=34:18:00
Stop=40:26:00
Difference=06:08:00 (given in file: 06:08:00)
Note: nothing here checks whether the times are in a valid format; I don't know how reliable your data are.
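To apply the same check to a whole file of such records, the two helpers can be reused in a read loop. A sketch; times.csv is a hypothetical file with one start,stop,total triple per line:

while IFS=, read -r start stop total; do
    got=$(seconds2time $(( $(time2seconds "$stop") - $(time2seconds "$start") )))
    if [ "$got" != "$total" ]; then
        printf "Bad total: %s,%s,%s (computed: %s)\n" "$start" "$stop" "$total" "$got"
    fi
done < times.csv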
On an AIX machine without Perl I need to filter records that are considered duplicates if they have the same id and were registered within a period of four hours.
I implemented this filter using AWK and it works pretty well, but I need a much faster solution:
# Generate the list of duplicates
awk 'BEGIN {
FS=","
}
/OK/ {
old[$8] = f[$8];
f[$8] = mktime($4, $3, $2, $5, $6, $7);
x[$8]++;
}
/OK/ && x[$8] > 1 && f[$8] - old[$8] <= 4*3600 {
    print
}' file
Any suggestions? Are there ways to improve the environment (preloading the file or someting like that)?
The input file is already sorted.
With the corrections suggested by jj33 I made a new version with better treatment of dates, while still keeping it simple enough to incorporate more operations:
awk 'BEGIN {
FS=",";
SECSPERMINUTE=60;
SECSPERHOUR=3600;
SECSPERDAY=86400;
split("0 31 59 90 120 151 181 212 243 273 304 334", DAYSTOMONTH, " ");
split("0 366 731 1096 1461 1827 2192 2557 2922 3288 3653 4018 4383 4749 5114 5479 5844 6210 6575 6940 7305", DAYSTOYEAR, " ");
}
/OK/ {
old[$8] = f[$8];
f[$8] = mktime($4, $3, $2, $5, $6, $7);
x[$8]++;
}
/OK/ && x[$8] > 1 && f[$8] - old[$8] <= 4*SECSPERHOUR {
    print
}
function mktime(y, m, d, hh, mm, ss) {
    # days from Jan 1 to the start of month m, plus a leap day after February
    d2m = DAYSTOMONTH[ m ];
    if ( ( m > 2 ) && ( ((y % 4 == 0) && (y % 100 != 0)) || (y % 400 == 0) ) ) {
        d2m = d2m + 1;
    }
    # days from 2000-01-01 to Jan 1 of year y
    d2y = DAYSTOYEAR[ y - 1999 ];
    return ss + (mm*SECSPERMINUTE) + (hh*SECSPERHOUR) + (d*SECSPERDAY) + (d2m*SECSPERDAY) + (d2y*SECSPERDAY);
}
'
This sounds like a job for an actual database. Even something like SQLite could probably help you reasonably well here. The big problem I see is your definition of "within 4 hours". That's a sliding window problem, which means you can't simply quantize all the data to 4 hour segments... you have to compute all "nearby" elements for every other element separately. Ugh.
If your data file contains all your records (i.e. it includes records that do not have duplicate ids within the file), you could pre-process it and produce a file that only contains records with duplicate ids.
If this is the case, that would reduce the size of the file you need to process with your AWK program.
How is the input file sorted? Like cat file | sort, or sorted on one specific field, or on multiple fields? If multiple, which fields and in what order? It appears the hour fields use a 24-hour clock, not 12, right? Are all the date/time fields zero-padded (would 9am be "9" or "09")?
Without taking performance into account, it looks like your code has problems with month boundaries, since it assumes all months are 30 days long. Take the two dates 2008-05-31/12:00:00 and 2008-06-01/12:00:00. Those are 24 hours apart, but your code produces the same time code for both (63339969600).
I think you would need to consider leap years. I didn't do the math, but I think that during a leap year, with February hard-coded to 28 days, a comparison of noon on 2/29 and noon on 3/1 would result in the same duplicate timestamp as before. It looks like you didn't implement it quite that way, but the way you did implement it, I think you still have the problem, just between 12/31 of $leapyear and 1/1 of $leapyear+1.
I think you might also get some collisions during time changes, if your code has to handle time zones that observe them.
The file doesn't really seem to be sorted in any useful way. I'm guessing that field $1 is some sort of status (the "OK" you're checking for). So it's sorted by record status, then by DAY, then MONTH, YEAR, HOURS, MINUTES, SECONDS. If it was year,month,day I think there could be some optimizations there. Still might be but my brain's going in a different direction right now.
If there are a small number of duplicate keys in proportion to total number of lines, I think your best bet is to reduce the file your awk script works over to just duplicate keys (as David said). You could also preprocess the file so the only lines present are the /OK/ lines. I think I would do this with a pipeline where the first awk script only prints the lines with duplicate IDs and the second awk script is basically the one above but optimized to not look for /OK/ and with the knowledge that any key present is a duplicate key.
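That reduction could be a simple two-pass awk filter. A sketch; it reads the file twice rather than using a true pipeline, and assumes the id sits in comma-separated field 8:

# Pass 1 counts each id; pass 2 prints only lines whose id occurs
# more than once.
awk -F, 'NR==FNR { count[$8]++; next } count[$8] > 1' infile.txt infile.txt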
If you know ahead of time that all or most lines will have repeated keys, it's probably not worth messing with. I'd bite the bullet and write it in C. Tons more lines of code, much faster than the awk script.
On many unixen, you can get sort to sort by a particular column, or field. So by sorting the file by the ID, and then by the date, you no longer need to keep the associative array of when you last saw each ID at all. All the context is there in the order of the file.
On my Mac, which has GNU sort, it's:
sort -k 8 < input.txt > output.txt
to sort on the ID field. You can sort on a second field too, by saying (e.g.) 8,3 instead, but ONLY two fields. So a unix-style time_t timestamp might not be a bad idea in the file: it's easy to sort, and saves you all those date calculations. Also (again, at least in GNU awk), there is a mktime function that builds the time_t for you from the components.
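With the field layout from the script above ($8 is the id; $4, $3, $2, $5, $6, $7 are year, month, day, hour, minute, second), a GNU sort invocation might look like this (a sketch; note -t, for the comma delimiter):

sort -t, -k8,8 -k4,4n -k3,3n -k2,2n -k5,5n -k6,6n -k7,7n input.txt > sorted.txt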
#AnotherHowie, I thought the whole preprocessing could be done with sort and uniq. The problem is that the OP's data seem to be comma delimited and (Solaris 8's) uniq doesn't allow you any way to specify the record separator, so there wasn't a super clean way to do the preprocessing using standard unix tools. I don't think it would be any faster, so I'm not going to look up the exact options, but you could do something like:
cut -d, -f8 <infile.txt | sort | uniq -d | xargs -i grep {} infile.txt >outfile.txt
That's not very good because it executes grep once for every line containing a duplicate key. You could probably massage the uniq output into a single regexp to feed to grep, but the benefit would only be known if the OP posts the expected ratio of lines containing suspected duplicate keys to total lines in the file.
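That massaging could look something like this (a sketch, assuming GNU tools and ids that contain no regex metacharacters):

# Collect the duplicated ids, join them into one alternation, and scan
# the file in a single grep pass.
dups=$(cut -d, -f8 infile.txt | sort | uniq -d | paste -sd'|' -)
grep -E "(^|,)($dups)(,|\$)" infile.txt > outfile.txt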