Grep large number of patterns from a huge log file - shell

I have a shell script that is invoked every hour via a cron job to search through the Asterisk logs and give me the unique IDs of calls that ended with cause 31.
while read ref
do
cat sample.log | grep "$ref" | grep 'got hangup request, cause 31' | grep -o 'C-[0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z]' >> cause_temp.log
done < callref.log
The issue is that the while loop is too slow, and for accuracy I have included four while loops like the one above to perform various checks.
The callref.log file consists of call identifier values; every hour it has about 50-90 thousand values, and the script takes about 45-50 minutes to complete the execution and email me the report.
It would be of great help if I could cut down the execution time of the loops. Since sample.log is about 20 GB and the file is opened and searched once per loop iteration, I figured that the while loop is the bottleneck here.
I have done some research and found some useful links, such as
Link 1 Link 2
but I could not implement the suggested solutions, or did not know how to. Any suggestion would be helpful. Thanks.
Since sample.log contains sensitive information I cannot share any real logs, but below are some sample log lines which I got from the internet.
Dec 16 18:02:04 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"503"<sip:503@192.168.1.107>' failed for '192.168.1.137' - Wrong password
Dec 16 18:03:13 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"502"<sip:502@192.168.1.107>' failed for '192.168.1.137' - Wrong password
Dec 16 18:04:49 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"1737245082"<sip:1737245082@192.168.1.107>' failed for '192.168.1.137' - Username/auth name mismatch
Dec 16 18:04:49 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"100"<sip:100@192.168.1.107>' failed for '192.168.1.137' - Username/auth name mismatch
Jun 27 18:09:47 host asterisk[31774]: ERROR[27910]: chan_zap.c:10314 setup_zap: Unable to register channel '1-2'
Jun 27 18:09:47 host asterisk[31774]: WARNING[27910]: loader.c:414 __load_resource: chan_zap.so: load_module failed, returning -1
Jun 27 18:09:47 host asterisk[31774]: WARNING[27910]: loader.c:554 load_modules: Loading module chan_zap.so failed!
The file callref.log consists of a list of lines which look like:
C-001ec22d
C-001ec23d
C-001ec24d
C-001ec31d
C-001ec80d
The desired output of the above while loop looks like: C-001ec80d
My main concern is to make the while loop run faster, for example by loading all the values of callref.log into an array and searching for all of them simultaneously in a single pass of sample.log, if possible.

Since you could not produce adequate sample logs to test against even when requested, I whipped up some test material myself:
$ cat callref.log
a
b
$ cat sample.log
a 1
b 2
c 1
Using awk:
$ awk 'NR==FNR { # hash callrefs
a[$1]
next
}
{ # check callrefs from sample records and output when match
for(l in a)
if($0 ~ l && $0 ~ 1) # 1 is the static string you look for along a callref
print l
}' callref.log sample.log
a 1
HTH
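For the real log format, the same idea can be pushed a bit further (a hedged sketch, assuming each 'cause 31' line carries exactly one C-... identifier): instead of testing every callref against every line, extract the identifier from each matching line and do a single hash lookup, which keeps the pass over sample.log linear no matter how many callrefs there are:
awk 'NR==FNR { a[$1]; next }                  # hash callrefs from callref.log
/got hangup request, cause 31/ {
    if (match($0, /C-[0-9a-z]+/)) {           # pull out the C-... identifier
        id = substr($0, RSTART, RLENGTH)
        if (id in a) print id                 # keep only identifiers we were asked about
    }
}' callref.log sample.log > cause_temp.log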

I spent a day building a test framework and testing variations of different commands, and I think you already have the fastest one.
Which leads me to think that if you want better performance you should look into a log-digesting framework, like OSSEC (where your log samples came from) or perhaps Splunk. Those may be too clumsy for your needs. Alternatively, you should consider designing and building something better suited to the parsing in Java/C/Perl/awk.
Running your existing script more frequently will also help.
Good luck! If you like I can box up the work I did and post it here, but I think it's overkill.
As requested:
CalFuncs.sh: a library I source in most of my scripts
#!/bin/bash
LOGDIR="/tmp"
LOG=$LOGDIR/CalFunc.log
[ ! -d "$LOGDIR" ] && mkdir -p $(dirname $LOG)
SSH_OPTIONS="-o StrictHostKeyChecking=no -q -o ConnectTimeout=15"
SSH="ssh $SSH_OPTIONS -T"
SCP="scp $SSH_OPTIONS"
SI=$(basename $0)
Log() {
echo "`date` [$SI] $@" >> $LOG
}
Run() {
Log "Running '$@' in '`pwd`'"
"$@" 2>&1 | tee -a $LOG
}
RunHide() {
Log "Running '$@' in '`pwd`'"
"$@" >> $LOG 2>&1
}
PrintAndLog() {
Log "$@"
echo "$@"
}
ErrorAndLog() {
Log "[ERROR] $@ "
echo "$@" >&2
}
showMilliseconds(){
date +%s
}
runMethodForDuration(){
local startT=$(showMilliseconds)
$1
local endT=$(showMilliseconds)
local totalT=$((endT-startT))
PrintAndLog "that took $totalT seconds to run $1"
echo $totalT
}
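For context, a script consuming this library would look something like the following (a minimal sketch; the script body and task name are mine):
#!/bin/bash
source "./CalFuncs.sh"
myTask(){
    RunHide sleep 2               # the command runs, its output goes only to the log
    PrintAndLog "myTask finished" # logged and echoed
}
runMethodForDuration myTask       # prints "that took N seconds to run myTask" and echoes N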
genCallRefLog.sh - generates fictitious callref.log size depending on argument
#!/bin/bash
#Script to make 80000 sequential lines of callref.log this should suffice for a POC
if [ -z "$1" ] ; then
echo "genCallRefLog.sh requires an integer of the number of lines to pump out of callref.log"
exit 1
fi
file="callref.log"
[ -f "$file" ] && rm -f "$file" # del file if exists
i=0 #put start num in here
j="$1" #put end num in here
echo "building $j lines of callref.log"
for (( a=i ; a < j; a++ ))
do
printf 'C-%08x\n' "$a" >> $file
done
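If generating the test data itself ever becomes the slow part, the same file can be produced in a single awk pass instead of one printf-and-append per line (a hedged aside, not needed for correctness):
# alternative to the loop above: one process, one redirection
awk -v n="$j" 'BEGIN { for (a = 0; a < n; a++) printf "C-%08x\n", a }' > "$file"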
genSampleLog.sh - generates fictitious sample.log size depending on argument
#!/bin/bash
#Script to make sequential lines of sample.log; this should suffice for a POC
if [ -z "$1" ] ; then
echo "genSampleLog.sh requires an integer of the number of lines to pump out of sample.log"
exit 1
fi
file="sample.log"
[ -f "$file" ] && rm -f "$file" # del file if exists
i=0 #put start num in here
j="$1" #put end num in here
echo "building $j lines of sample.log"
for (( a=i ; a < j; a++ ))
do
printf 'Dec 16 18:02:04 asterisk1 asterisk[31774]: NOTICE[31787]: C-%08x got hangup request, cause 31\n' "$a" >> $file
done
And finally, the actual test script I used. Often I would comment out the generator scripts, as they only need to run when changing the log size. I also typically would run only one testing function at a time and record the results.
test.sh
#!/bin/bash
source "./CalFuncs.sh"
targetLogFile="cause_temp.log"
Log "Starting"
checkTargetFileSize(){
expectedS="$1"
hasS=$(cat $targetLogFile | wc -l)
if [ "$expectedS" != "$hasS" ] ; then
ErrorAndLog "Got $hasS but expected $expectedS, when inspecting $targetLogFile"
exit 244
fi
}
standard(){
iter=0
while read ref
do
cat sample.log | grep "$ref" | grep 'got hangup request, cause 31' | grep -o 'C-[0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z]' >> $targetLogFile
done < callref.log
}
subStandardVarient(){
iter=0
while read ref
do
cat sample.log | grep 'got hangup request, cause 31' | grep -o "$ref" >> $targetLogFile
done < callref.log
}
newFunction(){
grep -f callref.log sample.log | grep 'got hangup request, cause 31' >> $targetLogFile
}
newFunction4(){
grep 'got hangup request, cause 31' sample.log | grep -of 'callref.log'>> $targetLogFile
}
newFunction5(){
#splitting grep
grep 'got hangup request, cause 31' sample.log > /tmp/somefile
grep -of 'callref.log' /tmp/somefile >> $targetLogFile
}
newFunction2(){
iter=0
while read ref
do
((iter++))
echo "$ref" | grep 'got hangup request, cause 31' | grep -of 'callref.log' >> $targetLogFile
done < sample.log
}
newFunction3(){
iter=0
pat=""
while read ref
do
if [[ "$pat." != "." ]] ; then
pat="$pat|"
fi
pat="$pat$ref"
done < callref.log
# Log "Have pattern $pat"
while read ref
do
((iter++))
echo "$ref" | grep 'got hangup request, cause 31' | grep -oP "$pat" >> $targetLogFile
done < sample.log
#grep: regular expression is too large
}
[ -f "$targetLogFile" ] && rm -f "$targetLogFile"
numLines="100000"
Log "testing algorithms with $numLines in each log file."
setupCallRef(){
./genCallRefLog.sh $numLines
}
setupSampleLog(){
./genSampleLog.sh $numLines
}
setupCallRef
setupSampleLog
runMethodForDuration standard > /dev/null
checkTargetFileSize "$numLines"
[ -f "$targetLogFile" ] && rm -f "$targetLogFile"
runMethodForDuration subStandardVarient > /dev/null
checkTargetFileSize "$numLines"
[ -f "$targetLogFile" ] && rm -f "$targetLogFile"
runMethodForDuration newFunction > /dev/null
checkTargetFileSize "$numLines"
# [ -f "$targetLogFile" ] && rm -f "$targetLogFile"
# runMethodForDuration newFunction2 > /dev/null
# checkTargetFileSize "$numLines"
# [ -f "$targetLogFile" ] && rm -f "$targetLogFile"
# runMethodForDuration newFunction3 > /dev/null
# checkTargetFileSize "$numLines"
# [ -f "$targetLogFile" ] && rm -f "$targetLogFile"
# runMethodForDuration newFunction4 > /dev/null
# checkTargetFileSize "$numLines"
[ -f "$targetLogFile" ] && rm -f "$targetLogFile"
runMethodForDuration newFunction5 > /dev/null
checkTargetFileSize "$numLines"
The above shows that the existing method was always faster than anything I came up with. I think someone took care to optimize it.
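One more variant that may be worth adding to the harness, though I have not benchmarked it here: every entry in callref.log is a literal string rather than a regular expression, so grep -F (fixed strings) can usually build a much cheaper matcher for tens of thousands of patterns than -f does for regexes:
newFunction6(){
    # like newFunction5, but treat the callrefs (and the static string) as fixed strings
    grep -F 'got hangup request, cause 31' sample.log > /tmp/somefile
    grep -oFf callref.log /tmp/somefile >> $targetLogFile
}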

Related

Creating file takes time in bash

I have a bash script in which I am doing string substitutions, taking input values from different source files to create one complete string record. I have to create 5L (500,000) such records in a file within 5 minutes, on the go (records need to be written to the file as soon as they are created); however, the script is very slow (20k records in 5 minutes). Below is the script I used.
#!/bin/bash
sampleRecod="__TIME__-0400 INFO 639582 truefile?apikey=__API_KEY__json||__STATUS__|34|0||0|0|__MAINSIZE__|1|"
count=0;
license_array=(`cat license.txt | xargs`)
status_array=(`cat status.json | xargs`)
error_array=(`cat 403.json | xargs`)
finalRes="";
echo $(date +"%Y-%m-%dT%H:%M:%S.%3N")
while true;do
time=$(date +'%Y-%m-%dT%T.%3N')
line=${license_array[`shuf -i 0-963 -n 1`]}
status=${status_array[`shuf -i 0-7 -n 1`]}
responseMainPart=$(shuf -i 100-999 -n 1)
if [ $status -eq 403 ] || [ $status -eq 0 ]
then
responseMainPart=${error_array[`shuf -i 0-3 -n 1`]}
fi
result=$(echo "$sampleRecod" | sed "s/__TIME__/$time/g")
result=$(echo "$result" | sed "s/__KEY__/$line/g")
result=$(echo "$result" | sed "s/__STATUS__/$status/g")
result=$(echo "$result" | sed "s/__MAIN_SIZE__/$responseMainPart/g")
finalRes+="${result} \n";
count=$((count+1))
if [ $count -eq 1000 ]
then
#echo "got count";
count=0;
echo -e $finalRes >> new_data_1.log;
finalRes="";
fi
done
echo -e $finalRes >> new_data_1.log;
echo $(date +"%Y-%m-%dT%H:%M:%S.%3N")
Can anyone suggest how I can optimize this? The files I am retrieving values from do not have many lines either.
I have tried replacing shuf with sed, but it did not help much.
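The per-record cost here is dominated by process creation: every $( ... ), sed, shuf and date call forks a new process, so each record costs roughly ten forks. A hedged sketch of the alternative, doing the whole job in one process (it assumes GNU awk for strftime(), reads one value per line from the three input files instead of word-splitting them with xargs, and leaves the millisecond part constant):
# hypothetical single-process generator; adjust field handling to the real inputs
gawk 'BEGIN {
    srand()
    while ((getline l < "license.txt") > 0) lic[nl++] = l
    while ((getline l < "status.json") > 0) st[ns++] = l
    while ((getline l < "403.json")    > 0) err[ne++] = l
    for (i = 0; i < 500000; i++) {
        t = strftime("%Y-%m-%dT%H:%M:%S") ".000"   # millisecond part faked as a constant
        s = st[int(rand() * ns)]
        main = int(rand() * 900) + 100             # like shuf -i 100-999
        if (s == 403 || s == 0)
            main = err[int(rand() * ne)]
        printf "%s-0400 INFO 639582 truefile?apikey=%sjson||%s|34|0||0|0|%s|1|\n",
               t, lic[int(rand() * nl)], s, main
    }
}' > new_data_1.log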

How to send TCPDUMP output to remote script via CURL

So I have a script here that is taking TCPDUMP output. We are trying to send two variables to a PHP script over the web ($SERVER). The file header is created and contains both variables: $FILETOSEND, which is the filename, and filedata. The actual data for the filedata variable comes from a file called 1 (the data is formatted as you can tell). I am having issues with the section marked # send common 10 secs dump.
I am trying to CURL the file 1, and I am doing so by using curl --data "$(cat 1)" $SERVER
The script isn't sending the file 1 at all; mostly it just sends the filename and no file data. Is there a problem with the way I am sending the file? Is there a better way to format it?
while true; do
sleep $DATASENDFREQ;
killall -9 tcpdump &> /dev/null
if [ -e $DUMP ]; then
mv $DUMP $DUMP_READY
fi
create_dump
DATE=`date +"%Y-%m-%d_%H-%M-%S"`
FILETOSEND=$MAC-$DATE-$VERSION
# we write fileheader to the file. 2 vars : filename, filedata.
FILEHEADER="filename=$FILETOSEND&filedata="
echo $FILEHEADER > 2
# change all colons to underscores for avoiding Windows filenames issues
sed -i 's/:/_/g' 2
# delete all newlines \n in the file
tr -d '\n' < 2 > 1
# parsing $DUMP_READY to awk.txt (no header in awk.txt)
awk '{ if (NF > 18 && $10 == "signal") {print "{\"mac\": \""$16"\",\"sig\": \""$9"\",\"ver\": \""$8"\",\"ts\": \""$1"\",\"ssid\": \""$19"\"}" }}' $DUMP_READY > awk.txt
sed -i 's/SA://g' awk.txt
sed -i 's/&/%26/g' awk.txt
cat awk.txt >> 1
sync
# send $OFFLINE
if [ -e $OFFLINE ]; then
curl -d $OFFLINE $SERVER
if [ $? -eq "0" ]; then
echo "status:dump sent;msg:offline dump sent"
rm $OFFLINE
else
echo "status:dump not sent;msg:offline dump not sent"
fi
fi
# send common 10 secs dump
curl --data "$(cat 1)" $SERVER
if [ $? -eq "0" ]; then
echo "status:dump sent"
else
cat 1 >> $OFFLINE
echo "status:dump not sent"
fi
if [ -e $DUMP_READY ]; then
rm -f $DUMP_READY 1 2 upload_file*
fi
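One thing to check, as a hedged suggestion rather than a tested fix: --data sends its argument verbatim, so any &, = or + remaining in the dump data is interpreted as form syntax on the server side and the filedata field can arrive empty. curl can read the payload from the file itself and, if needed, URL-encode it:
# read the body straight from the file instead of interpolating $(cat 1)
curl --data @1 "$SERVER"
# or send the two fields explicitly and let curl URL-encode the dump; this assumes
# the dump data lives in its own file (awk.txt here) rather than being concatenated
# behind the hand-built "filename=...&filedata=" header
curl --data "filename=$FILETOSEND" --data-urlencode "filedata@awk.txt" "$SERVER"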

how to speed up checking if file exists in bash

I'm new to Bash and wrote a script to check my photo files, but I find it very slow and it gets a few empty returns when checking 17,000+ photos. Is there any way to use all 4 CPUs to run this script and so speed it up?
Please help
#!/bin/bash
readarray -t array < ~/Scripts/ourphotos.txt
totalfiles="${#array[@]}"
echo $totalfiles
i=0
ii=0
check1=""
while :
do
check=${array[$i]}
if [[ ! -r $( echo $check ) ]] ; then
if [ $check = $check1 ]; then
echo "empty "$check
else
unset array[$i]
ii=$((ii + 1 ))
fi
fi
if [ $totalfiles = $i ]; then
break
fi
i=$(( i + 1 ))
done
if [ $ii -gt "1" ]; then
notify-send -u critical $ii" files have been deleted or are unreadable"
fi
It's a filesystem operation, so multiple cores will hardly help.
Simplification might, though:
while read file; do
i=$((i+1)); [ -e "$file" ] || ii=$((ii+1));
done < "$HOME/Scripts/ourphotos.txt"
#...
Two points:
you don't need to keep the whole file in memory (no arrays needed)
$( echo $check ) forks a process. You generally want to avoid forking and execing in loops (see the sketch below).
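Put together, a fork-free version of the whole check might look like this (a minimal sketch that keeps the original notify-send reporting):
#!/bin/bash
# count missing/unreadable photos without spawning an extra process per file
i=0; ii=0
while IFS= read -r file; do
    i=$((i+1))
    [ -e "$file" ] || ii=$((ii+1))
done < "$HOME/Scripts/ourphotos.txt"
if [ "$ii" -gt 0 ]; then
    notify-send -u critical "$ii files have been deleted or are unreadable"
fi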
This is an old question, but a common problem lacking an evidence-based solution.
awk '{print "[ -e "$1" ] && echo "$2}' | parallel # 400 files/s
awk '{print "[ -e "$1" ] && echo "$2}' | bash # 6000 files/s
while read file; do [ -e $file ] && echo $file; done # 12000 files/s
xargs find # 200000 files/s
parallel --xargs find # 250000 files/s
xargs -P2 find # 400000 files/s
xargs -P96 find # 800000 files/s
I tried this on a few different systems and the results were not consistent, but xargs -P (parallel execution) was consistently the fastest. I was surprised that xargs -P was faster than GNU parallel (not reported above, but sometimes much faster), and I was surprised that parallel execution helped so much — I thought that file I/O would be the limiting factor and parallel execution wouldn't matter much.
Also noteworthy is that xargs find is about 20x faster than the accepted solution, and much more concise. For example, here is a rewrite of OP's script:
#!/bin/bash
total=$(wc -l ~/Scripts/ourphotos.txt | awk '{print $1}')
# tr '\n' '\0' | xargs -0 handles spaces and other funny characters in filenames
found=$(cat ~/Scripts/ourphotos.txt | tr '\n' '\0' | xargs -0 -P4 find | wc -l)
if [ $total -ne $found ]; then
ii=$(expr $total - $found)
notify-send -u critical $ii" files have been deleted or are unreadable"
fi

bash script and grepping with command line

New to bash scripting, so just wondering if I am doing this code right at all. I'm trying to search /etc/passwd and then grep and print users.
usage ()
{
echo "usage: ./file.sk user"
}
# test if we have two arguments on the command line
if [ $# != 1 ]
then
usage
exit
fi
if [[ $# < 0 ]];then
usage
exit
fi
# Search for user
fullname=`grep $1 /etc/passwd | cut -f 5 -d :`
firstname=`grep $1 /etc/passwd | cut -f 5 -d : | cut -f 1 -d " "`
#check if there. if name is found: print msg and line entry
Not sure how to do this or if I'm doing it right...
Am I doing this right?
grep $1 /etc/passwd | while IFS=: read -r username passwd uid gid info home shell
do
echo $username: $info
done
This might work for you:
fullname=$(awk -F: '/'$1'/{print $5}' /etc/passwd)
firstname=${fullname/ *}
You're on the right track.
But I think the 2nd if [[ $# < 0 ]] ... fi block doesn't get you much. Your first test already handles the situation correctly: 'This script requires 1 argument or it quits'.
Also, I don't see what you need firstname for, so a basic test is:
case "${fullname:--1}" in
-[1] ) printf "No userID found for input=$1\n" ; exit 1 ;;
* )
# assume it is OK
# do what every you want after this case block
;;
esac
You can, of course, duplicate this using "${firstname}" if you really need the check.
Or, as an equivalent if ... fi:
if [[ "${fullname}" == "" ]] ; then
printf "No userID found for input=$1\n" ; exit 1
fi
Note that, to be more efficient, you can parse ${fullname} to get firstname without all the calls to grep etc., i.e.
firstname=${fullname%% *}
Let me know if you need me to explain the :--1} and %% *} variable modifiers.
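For reference, a quick illustration of those two expansions (the value is made up):
fullname="John Q Public"
echo "${fullname%% *}"   # John -- %% * strips the longest suffix starting at the first space
fullname=""
echo "${fullname:--1}"   # -1   -- :- substitutes -1 when the variable is unset or empty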
I hope this helps.
Instead of this:
fullname=`grep $1 /etc/passwd | cut -f 5 -d :`
firstname=`grep $1 /etc/passwd | cut -f 5 -d : | cut -f 1 -d " "`
Try this:
fullname=$(cut -f5 -d: /etc/passwd | grep "$1")
if [[ $? -ne 0 ]]; then
# not found, do something
fi
firstname=${fullname%% *} # remove the space and everything after
Note that I changed my answer to cut before grep so that it doesn't get false positives if some other field matches the full name you are searching for.
You can do this simply by reading your input into an array and then printing out your desired fields, something like this:
grep $1 /etc/passwd | while IFS=: read -a arry; do
echo ${arry[0]}:${arry[4]};
done
Test:
jaypal:~/Temp] echo "root:*:0:0:System Administrator:/var/root:/bin/sh" |
while IFS=: read -a arry; do
echo ${arry[0]}:${arry[4]};
done
root:System Administrator

bash gnu parallel help

It's about
http://en.wikipedia.org/wiki/Parallel_(software)
and its very rich manpage: http://www.gnu.org/software/parallel/man.html
(for x in `cat list` ; do
do_something $x
done) | process_output
is replaced by this
cat list | parallel do_something | process_output
I am trying to implement that on this:
while [ "$n" -gt 0 ]
do
percentage=$(echo "scale=2;(100-(($n / $end) * 100))" | bc -l)
#get url from line specified by n from file done1
nextUrls=`sed -n "${n}p" < done1`
echo -ne "${percentage}% $n / $end urls saved going to line 1. current: $nextUrls\r"
# function that gets links from the url
getlinks $nextUrls
#save n
echo $n > currentLine
let "n--"
let "end=`cat done1 |wc -l`"
done
While reading the documentation for GNU parallel
I found out that functions are not supported, so getlinks won't be usable in parallel.
The best I have found so far is
seq 30 | parallel -n 4 --colsep ' ' echo {1} {2} {3} {4}
makes output
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
17 18 19 20
21 22 23 24
25 26 27 28
29 30
The while loop mentioned above should go like this, if I am right:
end=`cat done1 |wc -l`
seq $end -1 1 | parallel -j+4 -k
#(all exept getlinks function goes here, but idk how? )|
# everytime it finishes do
getlinks $nextUrls
Thanks for the help in advance.
It seems what you want is a progress meter. Try:
cat done1 | parallel --eta wget
If that is not what you want, look at sem (sem is an alias for parallel --semaphore and is normally installed with GNU Parallel):
for i in `ls *.log` ; do
echo $i
sem -j+0 gzip $i ";" echo done
done
sem --wait
In your case it will be something like:
while [ "$n" -gt 0 ]
do
percentage=$(echo "scale=2;(100-(($n / $end) * 100))" | bc -l)
#get url from line specified by n from file done1
nextUrls=`sed -n "${n}p" < done1`
echo -ne "${percentage}% $n / $end urls saved going to line 1. current: $nextUrls\r"
# function that gets links from the url
THE_URL=`getlinks $nextUrls`
sem -j10 wget $THE_URL
#save n
echo $n > currentLine
let "n--"
let "end=`cat done1 |wc -l`"
done
sem --wait
echo All done
Why does getlinks need to be a function? Take the function and transform it into a shell script (it should be essentially identical, except you need to pass environment variables in, and of course you cannot affect the outside environment without lots of work).
Of course, you cannot save $n into currentLine when you are trying to execute in parallel. All the parallel jobs will be overwriting the file at the same time.
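For example, a standalone version of getlinks (hypothetical file name getlinks.sh, with the same body as the shell function shown below) is something sem or parallel can exec directly:
#!/bin/bash
# getlinks.sh -- hypothetical standalone equivalent of the getlinks() function,
# reading the URL from $1, so that "sem -j10 ./getlinks.sh $nextUrls" can exec it
if [ -n "$1" ]
then
    lynx -image_links -dump "$1" > src
    grep -i ".jpg" < src > links1
    grep -i "http" < links1 > links
    sed -e 's/.*\(http\)/http/g' < links >> done1
    sort -f done1 > done2
    uniq done2 > done1
    rm -rf links1 links src done2
fi
Note that, as pointed out above, the shared src, links and done1 files would still be overwritten by concurrent jobs, so they would have to be made unique per invocation (for example with mktemp) before running this in parallel is actually safe.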
I was thinking of making something more like this; if not parallel or sem, then something else, because parallel does not support functions (see http://www.gnu.org/software/parallel/man.html#aliases_and_functions_do_not_work):
getlinks(){
if [ -n "$1" ]
then
lynx -image_links -dump "$1" > src
grep -i ".jpg" < src > links1
grep -i "http" < links1 >links
sed -e 's/.*\(http\)/http/g' < links >> done1
sort -f done1 > done2
uniq done2 > done1
rm -rf links1 links src done2
fi
}
func(){
percentage=$(echo "scale=2;(100-(($1 / $end) * 100))" | bc -l)
#get url from line specified by n from file done1
nextUrls=`sed -n "${1}p" < done1`
echo -ne "${percentage}% $n / $end urls saved going to line 1. current: $nextUrls\r"
# function that gets links from the url
getlinks $nextUrls
#save n
echo $1 > currentLine
let "$1--"
let "end=`cat done1 |wc -l`"
}
while [ "$n" -gt 0 ]
do
sem -j10 func $n
done
sem --wait
echo All done
My script has become really complex, and I do not want to make a feature unavailable by relying on something I am not sure can be done.
This way I can get links while using the full internet bandwidth, which should take less time.
I tried sem:
#!/bin/bash
func (){
echo 1
echo 2
}
for i in `seq 10`
do
sem -j10 func
done
sem --wait
echo All done
you get errors:
Can't exec "func": No such file or directory at /usr/share/perl/5.10/IPC/Open3.p
m line 168.
open3: exec of func failed at /usr/local/bin/sem line 3168
It is not quite clear what the end goal of your script is. If you are trying to write a parallel web crawler, you might be able to use the below as a template.
#!/bin/bash
# E.g. http://gatt.org.yeslab.org/
URL=$1
# Stay inside the start dir
BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
URLLIST=$(mktemp urllist.XXXX)
URLLIST2=$(mktemp urllist.XXXX)
SEEN=$(mktemp seen.XXXX)
# Spider to get the URLs
echo $URL >$URLLIST
cp $URLLIST $SEEN
while [ -s $URLLIST ] ; do
cat $URLLIST |
parallel lynx -listonly -image_links -dump {} \; wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and do { $seen{$1}++ or print }' |
grep -F $BASEURL |
grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
mv $URLLIST2 $URLLIST
done
rm -f $URLLIST $URLLIST2 $SEEN
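Assuming the template is saved as, say, spider.sh (the name is mine), it is invoked with a start URL, for example:
chmod +x spider.sh
./spider.sh http://gatt.org.yeslab.org/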
