How to remove more than one field from a file using a Unix script - shell

I have a file with 21 fields per line, two of which are just a '-' (hyphen or minus) symbol. I can count the positions of those fields: they are $2 and $3. How would I remove such fields using a Unix shell script? Sample data is given below:
192.168.1.223 - - [15/Jul/2015:16:54:07 +0530] "GET / HTTP/1.1" 403 4954 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2453.0 Safari/537.36"
192.168.1.223 - - [15/Jul/2015:16:54:08 +0530] "GET /icons/apache_pb.gif HTTP/1.1" 200 2326 "http://192.168.1.232/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2453.0 Safari/537.36"
I want to remove the '-' at positions $2 and $3.
One more point: I can work out these positions because the format is known to me. What if I do not know the positions and there are many more fields? I want to automate this so that the code searches each line and removes such fields itself.
In short, I want to write a script that checks each field of the given file and removes the fields that are just junk characters like '-' (hyphen or minus).

The following works on my (Linux) machine to remove columns 2 and 3:
cut -d ' ' --complement -f 2,3
I don't know how standard/portable the --complement option is.
On the other hand, if you want to remove fields consisting of - no matter where they appear, try:
perl -pe 's/ -(?= )//g'
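As a rough sketch of both approaches without GNU-only options: --complement is a GNU coreutils extension, so the first command below spells out the fields to keep instead, and the second uses awk to drop every field that is exactly "-". The sample line and the variable names are made up for illustration.

```shell
# Hypothetical sample line in the same format as the log in the question
line='192.168.1.223 - - [15/Jul/2015:16:54:07 +0530] "GET / HTTP/1.1" 403 4954 "-" "Mozilla/5.0"'

# Positional removal with plain POSIX cut: keep field 1 and fields 4 onward
by_position=$(printf '%s\n' "$line" | cut -d ' ' -f 1,4-)

# Content-based removal: rebuild the line from every field that is not a bare "-"
by_content=$(printf '%s\n' "$line" | awk '{
    out = ""
    for (i = 1; i <= NF; i++)
        if ($i != "-")
            out = (out == "" ? $i : out " " $i)
    print out
}')

printf '%s\n' "$by_position" "$by_content"
```

Like the perl one-liner, the awk version leaves the quoted "-" referrer alone, because that field includes the quote characters. Note that awk rejoins the fields with single spaces, so runs of whitespace in the input are collapsed.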

Related

Downloading a working - local version of a website without js/css version names

Is there a way to wget a local version of a website without the version names on its js/css files? What I used to get the site is below:
wget --mirror --page-requisites --convert-links --adjust-extension --compression=auto --reject-regex "/search|/rss" --no-if-modified-since --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36" http://www.example.com
But it crawled the files with their version names, so my js file looks like this:
frontend.min.js#ver=2.5.11
Instead of
frontend.min.js
Also, source code has the same thing:
../jquery/frontend.min.js?ver=2.5.11
I would like to avoid that and have the files saved without version names/info.
You can try removing --page-requisites if you don't need things such as pictures or interactive elements. Removing this will cause wget to not download any CSS or JS files.

mod_rewrite -> rewrite rule replacing application context part in original url

Please note that I asked this question on "server fault" first; nothing turned up there, so I am posting it on SO.
I have a requirement where the user types a URL like http://example.com/Welcome in the browser, but I need to send the request to http://myip.com/someapp/next.html, so I wrote this virtual host in the httpd.conf file:
<VirtualHost *:80>
ProxyRequests off
ProxyPreserveHost On
RewriteEngine On
RewriteRule "^/(.*)" "http://myip.com/someapp/next.html" [P]
ProxyPassReverse "/" "http://myip.com/someapp/next.html"
</VirtualHost>
This works only partly. When the user enters the URL http://example.com/Welcome, it is replaced with http://example.com. How can I keep the complete URL (http://example.com/Welcome)? I would appreciate any help.
EDIT:
Here is the access_log surrounding the call
184.180.123.46 - - [06/Dec/2017:21:10:41 +0000] "GET /myip.com/images/logo.png HTTP/1.1" 200 6697 "http://example.com/Welcome" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
184.180.123.46 - - [06/Dec/2017:21:10:51 +0000] "POST /myip/somepage HTTP/1.1" 200 7 "http://example.com/Welcome" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
184.180.123.46 - - [06/Dec/2017:21:10:52 +0000] "GET / HTTP/1.1" 200 3391 "http://example.com/Welcome" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
184.180.123.46 - - [06/Dec/2017:21:11:01 +0000] "GET / HTTP/1.1" 200 3391 "http://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"

Pattern matching and with multiple parameters in shell script

We face a complicated issue with an Apache web server running on Linux: intermittently, Apache returns 5XX errors for some URLs, though not continuously. It starts with a few requests and grows over time. The issue resolves once we restart Apache.
We are trying to fix the underlying issue, but until then we need a workaround: a script that monitors the Apache access log and restarts Apache whenever the issue occurs.
We thought of one shell script that tails the log and greps all 5xx errors into a separate file, and another shell script, triggered by cron, that checks whether the errors in that file are repeated a given number of times within a given period.
My problem is that the URLs are not always the same, so I have to go through the file containing all the 5XX errors and check whether URLs are repeated, and when.
Can anyone suggest some logic for filtering the errors this way? I have tried to be clear, but I am not sure this is the right way of explaining the issue.
The logs are slightly modified with placeholder values, but the format is the same.
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /html/js/barebone.jsp?browserId=other&themeId=expressportal_WAR_expressportaltheme&colorSchemeId=01&minifierType=js&minifierBundleId=javascript.barebone.files&languageId=en_US&b=6200&t=1462268846000 HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /combo/?browserId=other&minifierType=&languageId=en_US&b=6200&t=1462268846000&/html/js/aui/event-touch/event-touch-min.js&/html/js/aui/event-move/event-move-min.js HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /html/js/liferay/available_languages.jsp?browserId=other&themeId=expressportal_WAR_expressportaltheme&colorSchemeId=01&minifierType=js&languageId=en_US&b=6200&t=1462268846000 HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /combo/?browserId=other&minifierType=&languageId=en_US&b=6200&t=1462268846000&/html/js/aui/widget-stack/assets/skins/sam/widget-stack.css HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
Are you 100% sure that a restart fixes the 5xx errors? If so, this line in the crontab should do:
tail -n 100 /var/log/apache2/access.log | awk '{if ($9 >= 500) {nb += 1}} END {if (nb > 10) {exit 1}}' || service apache2 restart
It means that if there are more than 10 errors in the last 100 lines of the access log, Apache is restarted. $9 is the status code in the standard combined log format; in the modified format shown above the status is field 11, so adjust the field number and the thresholds for your specific problem.
The first thing I can think of, though, is to upgrade your Apache if it's not up to date.
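A minimal sketch of the counting step as a reusable function, assuming the status code sits at a fixed whitespace-separated position. The function name count_5xx and the shortened sample lines are hypothetical; the sample follows the modified log format from the question, where the status is field 11.

```shell
# count_5xx FIELD: read access-log lines on stdin and print how many of them
# have a 5xx status code in whitespace-separated field number FIELD.
count_5xx() {
    awk -v f="$1" '$f ~ /^5[0-9][0-9]$/ { n++ } END { print n + 0 }'
}

# Hypothetical, shortened lines in the question's log format
sample='x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /a HTTP/1.1" 502 319
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:06 +0800](0) "HTTPS" "GET /b HTTP/1.1" 200 512
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:07 +0800](0) "HTTPS" "GET /a HTTP/1.1" 503 319'

errors=$(printf '%s\n' "$sample" | count_5xx 11)
echo "5xx responses: $errors"
```

In a cron job, the same function could be fed from tail -n 100 on the real access log, with the restart triggered only when the count exceeds a threshold of your choosing.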

Bash expr index command

I am trying to get the index position using Bash 'expr index'.
e.g.
$ echo `expr index "Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]" Mozilla`
I am trying to get the index position of the word "Mozilla", and then get the substring using index value.
The result I got back is 4. Is the period after Info causing the issue? How do I fix this?
I followed the Advanced Bash-Scripting Guide, www.tldp.org/LDP/abs/html/. See Table B-5, String Operations:
expr index "$string" $substring Numerical position in $string of first character in $substring* that matches [0 if no match, first character counts as position 1]
I tried with something simple, and it works.
I am running bash in cygwin.
$ ./bash --version
GNU bash, version 4.1.10(4)-release (i686-pc-cygwin)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Thanks.
In general, you shouldn't be using expr index unless you have a very good reason to.
For instance, let's say you want to get the browser name.
s="Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]"
# strip everything up to and including the first instance of 'Browser['
browser="${s#*Browser[}"
# strip everything after the first ']', again, inclusive
browser="${browser%%]*}"
# ...and show the result...
echo "$browser"
This would return:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0
If you really do want to know how many characters precede Mozilla, well, you can do that too:
s="Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]"
# strip the first instance of 'Mozilla' and everything after it
prefix=${s%%Mozilla*}
# count number of characters in the string
index=${#prefix}
# ...and show the result...
echo "$index"
This should return 61.
For the "why" and "how" of the above examples, see BashFAQ #73.
To split by | separators, by contrast, I'd personally choose to use read, as documented in BashFAQ #1:
s="Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]"
IFS='|' read -r _ _ browser _ <<<"$s"
echo "$browser"
...which would emit...
Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]
The expr index command searches through your first string looking for the first occurrence of any character from your second string. In this case, it recognizes that the 'o' in 'Mozilla' matches the 4th character of "Info.out...".
Try this as a test to see what happens. It will return 4, the position of 'd', which is the first character of the second string to occur in the first:
echo `expr index "abcdefghijklmnopqrstuvwxyz" xyzd`
This one should do what you want:
echo "Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]" | grep -o -b Mozilla
The echo puts your string into stdout, so it can be piped into grep.
The -b prints the byte offset of the string shown.
The -o ensures that only the matching portion gets printed.
GNU expr does not match a substring using index; rather, it looks for the first occurrence of any character from the second string in the first. Your example returns 4 because the 4th character of the string is "o", the first character in "Mozilla" that is found in "Info.out...".
There is no built-in function of this kind in either bash or expr, but you can indirectly get the index of a given substring by first removing the substring and everything after it from the original string, then computing the remaining length.
string="Info.out..."
substring=Mozilla
tmp=${string%%"$substring"*}
index=${#tmp}
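The prefix-stripping idea can be wrapped in a small helper that also reports when the substring is absent. This is only a sketch: the function name index_of and the -1 convention for "not found" are assumptions, not anything built into bash or expr. The offset is 0-based, i.e. the number of characters preceding the match.

```shell
# index_of STRING SUBSTRING: print the 0-based offset of the first occurrence
# of SUBSTRING in STRING, or -1 if STRING does not contain it.
index_of() {
    case $1 in
        *"$2"*)
            # Strip the first match and everything after it, then measure what is left.
            index_of_prefix=${1%%"$2"*}
            echo "${#index_of_prefix}"
            ;;
        *)
            echo -1
            ;;
    esac
}

s='Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]'
pos=$(index_of "$s" Mozilla)
missing=$(index_of "$s" Chrome)
echo "$pos $missing"
```

Quoting the substring inside the expansion keeps characters such as * or [ from being treated as pattern characters. In bash, the offset can then be used directly for slicing, for example ${s:$pos}.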

processing a pcap/dmp file for time-to-live, user-agent, and OS

I'm trying to generate a report of the number of clients/devices behind a given NAT gateway using the techniques discussed in this paper.
Basically I need to write a script which looks for both 'User-Agent' and 'Time to live' at the same time:
grep " User-Agent:" *.txt
grep " Time to live:" *.txt
Those are exactly how the lines are formatted in my output files and I'm happy having the text to the end of the line. They work separately but I haven't been successful in combining them.
My most recent attempts have been:
egrep -w ' User-Agent:'|' Time to live:' ../*.txt
grep ' User-Agent:' ../*.txt' && grep ' Time to live:' ../*.txt
(I've been manually exporting text format files from Wireshark, if anyone has a suggestion for doing that via script I would be most grateful, I have a HUGE number of files to do this for.)
I looked for a similar thread but didn't find one. If one already exists (as I expect), I apologize. If someone can point me to help or provide it, I would be most grateful.
EDIT: I thought I should mention, the two phrases I'm looking for are on lines separated by other data so a solution would need to search for both in an example like so:
User-Agent:
blahblahblah:
halbhalbhalb:
Time to live:
egrep ' User-Agent:| Time to live:' ../*.txt gives me:
desktop:~/Documents/scripts$ ./pcap_ttl_OS_useragent
../scripttextfile1p.txt: Time to live: 128
../scripttextfile1p.txt: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0\r\n
../scripttextfile2p.txt: Time to live: 55
../scripttextfile3p.txt: Time to live: 128
../scripttextfile3p.txt: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0\r\n
egrep ' User-Agent:| Time to live:' ../*.txt
should work.
I don't think the -w is getting you any "functionality".
Also, you want to quote the whole "extended" regular expression, including the | alternation character, as one string.
Finally, it's not clear whether the leading white space for each field is a tab character or a group of spaces. That affects the text string to put into the search patterns. To confirm the white-space type, I like to use
grep 'User-Agent' ../*.txt | head -1 | cat -vet
which will show either
..... User-Agent ....
OR
.....^IUser-Agent .....
The ^I being the representation for the tab character.
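If the goal is to see each User-Agent together with the TTL of the same packet, one possible sketch is to let awk remember the most recent "Time to live" value and print it when a User-Agent line turns up. This assumes the TTL line precedes the User-Agent line within a packet, as in the output shown in the question; the temporary file and its contents are invented for illustration.

```shell
# Hypothetical sample mimicking a Wireshark text export of two packets
tmp=$(mktemp)
printf '%s\n' \
  '    Time to live: 128' \
  '    Source: 10.0.0.5' \
  '    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1)\r\n' \
  '    Time to live: 55' \
  '    Source: 10.0.0.9' > "$tmp"

# Remember the latest TTL; on each User-Agent line print "TTL<TAB>agent",
# dropping the literal \r\n text that the export appends.
pairs=$(awk '
    / Time to live:/ { ttl = $NF }
    / User-Agent:/   {
        ua = $0
        sub(/^.*User-Agent: /, "", ua)
        sub(/\\r\\n$/, "", ua)
        print ttl "\t" ua
    }
' "$tmp")

printf '%s\n' "$pairs"
rm -f "$tmp"
```

Packets without a User-Agent line, like the second one in the sample, simply produce no output row.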
IHTH
