How to change/give user agent definition to phantomjs from cmd - amazon-ec2

How can I pass a user agent to PhantomJS? I am currently running the following command on an AWS EC2 instance:
phantomjs --web-security=no --ssl-protocol=any --ignore-ssl-errors=true driver.js http://example.com

You can set a user agent in PhantomJS only from within the script (driver.js in your example). See the documentation: http://phantomjs.org/api/webpage/property/settings.html
To pass a user agent on the command line, supply it as an argument, then read that argument in the script and assign it to the user agent setting. For example:
var webPage = require('webpage');
var system = require('system');
var page = webPage.create();
var userAgent = system.args[1];
page.settings.userAgent = userAgent;
console.log('user agent: ' + page.settings.userAgent);
phantom.exit();
Running it as follows:
$ phantomjs ua.js "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36"
you will get output:
user agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36
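If the script is launched from another program rather than typed by hand, the user agent still has to arrive as a single argument. A minimal sketch in Python, assuming the script is the ua.js shown above and that phantomjs is on the PATH; the helper name build_command is hypothetical:

```python
import shlex

def build_command(script, user_agent):
    """Return the argv list that passes user_agent to the script as one argument."""
    # Each list element is one argument, so spaces in the user agent need no quoting.
    return ["phantomjs", script, user_agent]

ua = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36")
cmd = build_command("ua.js", ua)

# shlex.join shows the equivalent shell command line with quoting applied.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids the shell entirely, so the user agent cannot be split on its spaces.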

Related

Scrapy provides no output with XPath selector

This is the code I am trying to run in scrapy shell to get the headline of the article from dailymail.co.uk.
headline = response.xpath("//div[@id='js-article-text']/h2/text()").extract()
$ scrapy shell "https://www.dailymail.co.uk/tvshowbiz/article-8257569/Shia-LaBeouf-revealed-heavily-tattoo-torso-goes-shirtless-run-hot-pink-shorts.html"
Set a user agent on your request and it should work:
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0" "https://www.dailymail.co.uk/tvshowbiz/article-8257569/Shia-LaBeouf-revealed-heavily-tattoo-torso-goes-shirtless-run-hot-pink-shorts.html"
response.xpath("//div[@id='js-article-text']/h2/text()").extract()
Output:
Shia LaBeouf reveals his heavily tattoo torso as he goes shirtless for a run in hot pink shorts
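The same idea outside Scrapy, as a rough sketch: attach a browser-like User-Agent header to the request. This uses only the standard library and builds the request without sending it; make_request is a hypothetical helper name, and whether a given site accepts this user agent is an assumption.

```python
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) "
              "Gecko/20100101 Firefox/52.0")

def make_request(url, user_agent=USER_AGENT):
    """Build (but do not send) a request that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = make_request("https://www.dailymail.co.uk/")
# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

To actually fetch the page, the request object can be passed to urllib.request.urlopen.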

Downloading a working - local version of a website without js/css version names

Is there a way to wget a local copy of a website without the version strings in its JS/CSS file names? This is the command I used:
wget --mirror --page-requisites --convert-links --adjust-extension --compression=auto --reject-regex "/search|/rss" --no-if-modified-since --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36" http://www.example.com
But it saved the files with their version strings, so my JS file looks like this:
frontend.min.js@ver=2.5.11
Instead of
frontend.min.js
Also, source code has the same thing:
../jquery/frontend.min.js?ver=2.5.11
I would like to avoid that and have the files saved without version names/info.
You can try removing --page-requisites if you don't need things such as pictures or interactive elements. Removing this will cause wget to not download any CSS or JS files.
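If the CSS and JS files are still needed, another option is to clean up after the download. The sketch below is not part of the answer above: it strips a trailing ver=... suffix from names and references, assuming the suffix always has that form. The separator character varies in the names shown above, so several are accepted. strip_version and rename_versioned_files are hypothetical helper names.

```python
import re
from pathlib import Path

# A version suffix: a separator character, then "ver=" and the version value.
VERSION_SUFFIX = re.compile(r"[?@#]ver=[^\"'\s]*")

def strip_version(text):
    """Remove every ver=... suffix from a file name or a block of HTML."""
    return VERSION_SUFFIX.sub("", text)

def rename_versioned_files(root):
    """Rename files under root whose names carry a version suffix."""
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(root).rglob("*")):
        clean = strip_version(path.name)
        if path.is_file() and clean != path.name:
            target = path.with_name(clean)
            path.rename(target)
            renamed.append(target)
    return renamed

print(strip_version("../jquery/frontend.min.js?ver=2.5.11"))
```

The references inside the saved HTML would need the same treatment, for example by running strip_version over each file's contents, so that links and file names stay in sync.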

mod_rewrite -> rewrite rule replacing application context part in original url

Please note that I asked this question on Server Fault and nothing turned up there, so I am posting it on SO.
I have a requirement where the user types a URL like http://example.com/Welcome in the browser, but I need to send it to http://myip.com/someapp/next.html, so I wrote this virtual host in httpd.conf:
<VirtualHost *:80>
ProxyRequests off
ProxyPreserveHost On
RewriteEngine On
RewriteRule "^/(.*)" "http://myip.com/someapp/next.html" [P]
ProxyPassReverse "/" "http://myip.com/someapp/next.html"
</VirtualHost>
This works only partly. When the user enters the URL http://example.com/Welcome, it is replaced with http://example.com. How can I keep the complete URL (http://example.com/Welcome)? I would appreciate any help.
EDIT:
Here is the access_log surrounding the call
184.180.123.46 - - [06/Dec/2017:21:10:41 +0000] "GET /myip.com/images/logo.png HTTP/1.1" 200 6697 "http://example.com/Welcome" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
184.180.123.46 - - [06/Dec/2017:21:10:51 +0000] "POST /myip/somepage HTTP/1.1" 200 7 "http://example.com/Welcome" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
184.180.123.46 - - [06/Dec/2017:21:10:52 +0000] "GET / HTTP/1.1" 200 3391 "http://example.com/Welcome" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
184.180.123.46 - - [06/Dec/2017:21:11:01 +0000] "GET / HTTP/1.1" 200 3391 "http://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"

Extract string from the URL

I want to extract a string from the URL. I have the following URL :
https://optim.actiontec.com/aei-api/users/R1RCQTY1MjA1MDYyMDc%3D
I want to extract the string after users/ and store it in a variable that I can use. I tried using the regular expression extractor but it did not work.
My second issue is extracting values from request headers. I don't know much about it, but can we extract values from request headers?
This is my request header -
GET /aei-api/main/R1RCQTY1MjA1MDYyMDc%3D HTTP/1.1
Host: optim.actiontec.com
Connection: keep-alive
Accept: application/json, text/plain, */*
X-Token: 03a580b082140ee7
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
X-User-Id: R1RCQTY1MjA1MDYyMDc=
Referer: https://optim.actiontec.com/
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
I am looking to extract X-User-Id. I want to use that header value and pass it to other headers. Is it possible?
Thanks
The Regex configuration would be like this:
Field to check: Request Header
Reference Name: var
Regular Expression: X-User-Id: (\w+.)
Template: $1$
Match No: 1
For Regex test see here: https://regex101.com/r/MMhn3i/1/
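To see what that expression captures, here is a small Python sketch that applies it to a shortened copy of the request headers above, and also pulls the value out of the URL from the question. The function names are hypothetical, and URL-decoding the last path segment is an assumption about what the question wants.

```python
import re
from urllib.parse import unquote, urlparse

HEADERS = """GET /aei-api/main/R1RCQTY1MjA1MDYyMDc%3D HTTP/1.1
Host: optim.actiontec.com
X-Token: 03a580b082140ee7
X-User-Id: R1RCQTY1MjA1MDYyMDc=
Referer: https://optim.actiontec.com/"""

URL = "https://optim.actiontec.com/aei-api/users/R1RCQTY1MjA1MDYyMDc%3D"

def user_id_from_headers(raw_headers):
    """Apply the expression from the answer; \\w+ takes the ID, '.' takes the '='."""
    match = re.search(r"X-User-Id: (\w+.)", raw_headers)
    return match.group(1) if match else None

def user_id_from_url(url):
    """Take the last path segment (after users/) and URL-decode it."""
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    return unquote(last_segment)

print(user_id_from_headers(HEADERS))
print(user_id_from_url(URL))
```

Both calls yield the same value here, because %3D in the URL is the encoded form of the trailing = in the header.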

Pattern matching and with multiple parameters in shell script

We face a complicated issue with an Apache web server running on Linux: intermittently, Apache returns 5XX errors for some URLs, and not continuously. It starts with a few requests and grows over time. The issue resolves once we restart Apache.
We are trying to fix the root cause, but until then we need a workaround: a script that monitors the Apache access log and restarts Apache whenever the issue occurs.
We thought of one shell script that tails the log and greps all 5XX errors into a separate file, and another shell script, triggered by cron, that checks that file to see whether the error is repeated a given number of times within a given time window.
My problem is that the URLs are not always the same, so I have to grep the file containing all the 5XX errors and check whether URLs are repeated, and when.
Can anyone suggest some logic for filtering the errors like this? I tried to be clear, but I am not sure this is the right way to explain the issue.
The log values are slightly modified, but the format is the same.
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /html/js/barebone.jsp?browserId=other&themeId=expressportal_WAR_expressportaltheme&colorSchemeId=01&minifierType=js&minifierBundleId=javascript.barebone.files&languageId=en_US&b=6200&t=1462268846000 HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /combo/?browserId=other&minifierType=&languageId=en_US&b=6200&t=1462268846000&/html/js/aui/event-touch/event-touch-min.js&/html/js/aui/event-move/event-move-min.js HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /html/js/liferay/available_languages.jsp?browserId=other&themeId=expressportal_WAR_expressportaltheme&colorSchemeId=01&minifierType=js&languageId=en_US&b=6200&t=1462268846000 HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /combo/?browserId=other&minifierType=&languageId=en_US&b=6200&t=1462268846000&/html/js/aui/widget-stack/assets/skins/sam/widget-stack.css HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
Are you 100% sure a restart fixes the 5XX errors? If so, this line in the crontab should do:
tail -n 100 /var/log/apache2/access.log | awk '{if ($9 >= 500) {nb += 1}} END {if (nb > 10) {exit 1}}' || service apache2 restart
It means: if there are more than 10 errors in the last 100 lines of the access log, restart. Here $9 is the status field in the standard combined log format; in the sample lines above the status is the 11th whitespace-separated field, so use $11 there. You may change the values for your specific problem.
The first thing I can think of is: upgrade your Apache if it's not up to date.
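The same threshold check can be sketched in Python. Instead of counting fields, this version looks for the three-digit status right after the quoted request, so it should work for both the standard combined format and the sample lines above. The function names and the default window and threshold are illustrative assumptions, and the restart itself is left to the caller.

```python
import re

# The status code follows the closing quote of the request, e.g. ... HTTP/1.1" 502 319
STATUS = re.compile(r'HTTP/\d(?:\.\d)?" (\d{3}) ')

def count_5xx(lines):
    """Count log lines whose status code is in the 5xx range."""
    count = 0
    for line in lines:
        match = STATUS.search(line)
        if match and match.group(1).startswith("5"):
            count += 1
    return count

def should_restart(lines, window=100, threshold=10):
    """True if more than threshold of the last window lines are 5xx errors."""
    return count_5xx(lines[-window:]) > threshold

sample = ('x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" '
          '"GET /combo/ HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "-"')
print(count_5xx([sample]))
```

A cron job could read the access log into a list of lines, call should_restart, and restart Apache only when it returns True.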
