Bash expr index command - bash

I am trying to get the index position using Bash's 'expr index'.
e.g.
$ echo `expr index "Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]" Mozilla`
I am trying to get the index position of the word "Mozilla", and then get the substring using that index value.
The result I got back is 4. Is the period after "Info" causing the issue? How do I fix it?
I followed the Advanced Bash-Scripting Guide at www.tldp.org/LDP/abs/html/, Table B-5, "String Operations":
expr index "$string" $substring: numerical position in $string of the first character in $substring that matches [0 if no match; first character counts as position 1]
I tried with something simple, and it works.
I am running bash in cygwin.
$ ./bash --version
GNU bash, version 4.1.10(4)-release (i686-pc-cygwin)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Thanks.

In general, you shouldn't be using expr index unless you have a very good reason to.
For instance, let's say you want to get the browser name.
s="Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]"
# strip everything up to and including the first instance of 'Browser['
browser="${s#*Browser[}"
# strip everything after the first ']', again, inclusive
browser="${browser%%]*}"
# ...and show the result...
echo "$browser"
This would return:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0
If you really do want to know how many characters precede Mozilla, well, you can do that too:
s="Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]"
# strip everything after the first instance of 'Mozilla'
prefix=${s%%Mozilla*}
# count number of characters in the string
index=${#prefix}
# ...and show the result...
echo "$index"
This should return 61.
For the "why" and "how" of the above examples, see BashFAQ #73.
To split by | separators, by contrast, I'd personally choose to use read, as documented in BashFAQ #1:
s="Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]"
IFS='|' read -r _ _ browser _ <<<"$s"
echo "$browser"
...which would emit...
Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]

The expr index command searches through your first string looking for the first occurrence of any single character from your second string. In this case, it finds that the 'o' in 'Mozilla' matches the 4th character in "Info.out...".
Try this as a test to see what happens. It returns 4, the position of the first matching character, 'd':
echo `expr index "abcdefghijklmnopqrstuvwxyz" xyzd`
This one should do what you want:
echo "Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]" | grep -o -b Mozilla
The echo puts your string into stdout, so it can be piped into grep.
The -b prints the byte offset of the string shown.
The -o ensures that only the matching portion gets printed.
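Running that command on the original string illustrates the result; note that grep -b reports a zero-based byte offset, whereas expr index counts from 1:

```shell
# grep -o -b prints each match prefixed with its zero-based byte offset
echo "Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]" | grep -o -b Mozilla
# prints: 61:Mozilla
```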

GNU expr does not match a substring using index; rather, it looks for the first occurrence of any character from the second string in the first. Your example returns 4 because the 4th character of the string is "o", the first character in "Mozilla" that is found in "Info.out...".
There is no built-in function of this kind in either bash or expr, but you can indirectly get the index of a given substring by first removing the substring and everything after it from the original string, then computing the remaining length.
string="Info.out..."
substring=Mozilla
tmp=${string%%$substring*}
index=${#tmp}
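Putting it together with the full string from the question, a self-contained version of this technique looks like this:

```shell
# Compute the (zero-based) index of a substring by trimming it and
# everything after it, then measuring what remains
string="Info.out.2014-02-08:INFO|SID:sXfzRjbmKbwX7jyaW1sog7n|Browser[Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0]"
substring=Mozilla
tmp=${string%%"$substring"*}   # strip from the first 'Mozilla' onward
index=${#tmp}                  # length of the prefix = number of preceding chars
echo "$index"                  # prints 61
```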

Related

Scrapy provides no output with xpath selector

This is the code I am trying to run in scrapy shell to get the headline of the article from dailymail.co.uk.
headline = response.xpath("//div[@id='js-article-text']/h2/text()").extract()
$ scrapy shell "https://www.dailymail.co.uk/tvshowbiz/article-8257569/Shia-LaBeouf-revealed-heavily-tattoo-torso-goes-shirtless-run-hot-pink-shorts.html"
Set a user-agent on your request and it should work:
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0" "https://www.dailymail.co.uk/tvshowbiz/article-8257569/Shia-LaBeouf-revealed-heavily-tattoo-torso-goes-shirtless-run-hot-pink-shorts.html"
response.xpath("//div[@id='js-article-text']/h2/text()").extract()
Output :
Shia LaBeouf reveals his heavily tattoo torso as he goes shirtless for a run in hot pink shorts

How to remove more than 1 fields from one file using unix script

I have one file which has two '-' (hyphen or minus) symbols as fields. It has 21 fields in total, and I can count the positions of the hyphen fields: they are $2 and $3. How would I remove such fields using a unix shell script? Sample data is given below:
192.168.1.223 - - [15/Jul/2015:16:54:07 +0530] "GET / HTTP/1.1" 403 4954 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2453.0 Safari/537.36"
192.168.1.223 - - [15/Jul/2015:16:54:08 +0530] "GET /icons/apache_pb.gif HTTP/1.1" 200 2326 "http://192.168.1.232/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2453.0 Safari/537.36"
I want to remove the '-' at positions $2 and $3.
One more point: in this case I know the positions, but what if I don't know them, or there are more fields? I want to automate this so that the script finds such fields in each line and removes them.
In short, I want to write a script which checks each field of the given file and removes those fields that are junk characters like '-' (hyphen or minus).
The following works on my (Linux) machine to remove columns 2 and 3:
cut -d ' ' --complement -f 2,3
I don't know how standard/portable the --complement option is.
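Applied to the first sample line, the cut invocation behaves like this (--complement is a GNU coreutils extension, so this may not work with BSD cut):

```shell
# Remove the 2nd and 3rd space-delimited fields, keeping everything else
echo '192.168.1.223 - - [15/Jul/2015:16:54:07 +0530] "GET / HTTP/1.1" 403 4954' |
  cut -d ' ' --complement -f 2,3
# prints: 192.168.1.223 [15/Jul/2015:16:54:07 +0530] "GET / HTTP/1.1" 403 4954
```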
On the other hand, if you want to remove fields consisting of - no matter where they appear, try:
perl -pe 's/ -(?= )//g'

Strings manipulation and variables in bash script

I'm trying to download something with wget using a for loop in a bash script.
When I don't use variables everything works fine; when I assign the value to a variable I get a 500 server error. This is strange to me, because it's a straight copy-paste.
What I'm trying to do is take the number from the loop variable i and paste it into the post body.
Here is my code:
#!/bin/bash
for i in {1..5}
do
STR="some_static_stuff_before"$i"some_static_suff_after"
echo $STR
wget -O ready/page$i.aspx --header="Host: www.something.com" --header="Pragma: no-cache" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" --header="Accept-Language: en-en" --header="Accept-Encoding: gzip, deflate" --header="Content-Type: application/x-www-form-urlencoded" --header="Origin: http://something.com" --header="Connection: keep-alive" --header="User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.73.11 (KHTML, like Gecko) Version/7.0.1 Safari/537.73.11" --header="Referer: http://www.something.com/something.aspx" --header="Cookie: ASP.NET_SessionId=u5cmt0figi4bvs40a30gnwsa; __utma=20059042.38323768.1389369038.1389710153.1389780868.6; __utmb=20059042.2.10.1389780868; __utmc=20059042; __utmz=20059042.1389627823.2.2.utmcsr=something.com|utmccn=(referral)|utmcmd=referral|utmcct=/something.aspx" --post-data='"$STR"' http://something.com/something.aspx
done
When I paste the data directly into --post-data there is no problem downloading the content.
I've tried --post-data="\"$STR\"" and --post-data='"$STR"' and it's still not working.
You single-quoted the variable reference (in addition to double-quoting it), which prevents substitution of the variable value.
Instead of
--post-data='"$STR"'
use
--post-data="$STR"
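A quick way to see the difference is to echo both forms (the variable's value here is just a stand-in for the real post body):

```shell
STR='i=3&foo=bar'   # hypothetical post body, for illustration only
echo '"$STR"'       # single quotes suppress expansion: prints "$STR" literally
echo "$STR"         # double quotes allow expansion: prints i=3&foo=bar
```

With the single-quoted form, wget receives the six literal characters "$STR" (plus the surrounding double quotes) as its post data, which is why the server rejects the request.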

processing a pcap/dmp file for time-to-live, user-agent, and OS

I'm trying to generate a report of the number of clients/devices behind a given NAT gateway using the techniques discussed in this paper.
Basically I need to write a script which looks for both 'User-Agent' and 'Time to live' at the same time:
grep " User-Agent:" *.txt
grep " Time to live:" *.txt
Those are exactly how the lines are formatted in my output files and I'm happy having the text to the end of the line. They work separately but I haven't been successful in combining them.
My most recent attempts have been:
egrep -w ' User-Agent:'|' Time to live:' ../*.txt
grep ' User-Agent:' ../*.txt' && grep ' Time to live:' ../*.txt
(I've been manually exporting text format files from Wireshark, if anyone has a suggestion for doing that via script I would be most grateful, I have a HUGE number of files to do this for.)
I looked for a similar thread but didn't find one; if one already exists (as I expect), I apologize, and I'd be grateful for either a link to it or a direct answer.
EDIT: I thought I should mention, the two phrases I'm looking for are on lines separated by other data so a solution would need to search for both in an example like so:
User-Agent:
blahblahblah:
halbhalbhalb:
Time to live:
egrep ' User-Agent:| Time to live:' ../*.txt gives me:
desktop:~/Documents/scripts$ ./pcap_ttl_OS_useragent
../scripttextfile1p.txt: Time to live: 128
../scripttextfile1p.txt: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0\r\n
../scripttextfile2p.txt: Time to live: 55
../scripttextfile3p.txt: Time to live: 128
../scripttextfile3p.txt: User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0\r\n
egrep ' User-Agent:| Time to live:' ../*.txt
should work.
I don't think the -w flag is getting you any useful functionality here.
Also, you want to quote the whole extended regular expression, including the | alternation character, as one string.
Finally, it's not clear whether the leading white space in each field is a tab character or a group of spaces. That affects the exact text to put into the search patterns. To confirm the white-space type, I like to use:
grep 'User-Agent' ../*.txt | head -1 | cat -vet
which will show either
..... User-Agent ....
OR
.....^IUser-Agent .....
The ^I being the representation for the tab character.
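For instance, piping a tab-indented line through cat -vet makes both the tab and the end of line visible:

```shell
# cat -vet: -t shows tabs as ^I, -e marks each line end with $
printf '\tUser-Agent: Mozilla/5.0\n' | cat -vet
# prints: ^IUser-Agent: Mozilla/5.0$
```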
IHTH

How to specify command parameters in one variable?

In my test script I use the "curl" command many times. To optimize the code, I want to keep curl's options in a global variable.
I read curl's documentation, and it says that a parameter containing spaces must be framed in single quotes.
But it is not working.
$ curl_options="-i -L -k -S --connect-timeout 30 --user-agent 'Opera/9.80 (Windows NT 6.1; WOW64) Presto/2.12.388 Version/12.14'"
$ curl $curl_options "http://google.com"
Output Result:
curl: (6) Couldn't resolve host ''Opera'
curl: (6) Couldn't resolve host '(Windows'
curl: (6) Couldn't resolve host 'NT'
curl: (6) Couldn't resolve host '6.1;'
curl: (6) Couldn't resolve host 'WOW64)'
curl: (6) Couldn't resolve host 'Presto'
curl: (6) Couldn't resolve host 'Version'
In bash, you should use an array. This way, you don't need to worry whether a space in the string is part of an option, or separating two options:
curl_options=( ... )
curl_options+=( "--user-agent" "Opera/9.80 (Windows NT 6.1; WOW64) Presto/2.12.388 Version/12.14")
curl "${curl_options[@]}" "http://google.com"
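To see the word-splitting behavior without making a network request, the array can be expanded through printf. This is just a sketch with the options from the question filled in:

```shell
# Build the option list as an array; each element stays one argument
curl_options=( -i -L -k -S --connect-timeout 30 )
curl_options+=( --user-agent "Opera/9.80 (Windows NT 6.1; WOW64) Presto/2.12.388 Version/12.14" )
# "${curl_options[@]}" expands to exactly one word per element, spaces and all
printf '<%s>\n' "${curl_options[@]}"
```

The user-agent string, spaces included, comes out as a single argument, which is exactly what curl needs and what the flat-string version fails to deliver.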
If you can't use arrays (e.g., they aren't available in the shell you are using), you'll have to fall back to using eval:
$ curl_options="-i -L -k -S --connect-timeout 30 --user-agent 'Opera/9.80 (Windows NT 6.1; WOW64) Presto/2.12.388 Version/12.14'"
$ eval "curl $curl_options http://google.com"
This isn't ideal, since you need to be very careful about how you set the value of curl_options, because eval has no knowledge about what the value represents. The shell merely interpolates the value into the string passed to eval, and eval executes it. Typos can have unintended consequences.
