Scrapy is provides no output with xpath selector - xpath

This is the code I am trying to run in scrapy shell to get the headline of the article from dailymail.co.uk.
headline = response.xpath("//div[#id='js-article-text']/h2/text()").extract()
$ scrapy shell "https://www.dailymail.co.uk/tvshowbiz/article-8257569/Shia-LaBeouf-revealed-heavily-tattoo-torso-goes-shirtless-run-hot-pink-shorts.html"

Set up an user-agent with your request and it should work :
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0" "https://www.dailymail.co.uk/tvshowbiz/article-8257569/Shia-LaBeouf-revealed-heavily-tattoo-torso-goes-shirtless-run-hot-pink-shorts.html"
response.xpath("//div[#id='js-article-text']/h2/text()").extract()
Output :
Shia LaBeouf reveals his heavily tattoo torso as he goes shirtless for a run in hot pink shorts

Related

Downloading a working - local version of a website without js/css version names

Is there a way to wget local version of a website without its version names of js/css? What I used to get the site is below:
wget --mirror --page-requisites --convert-links --adjust-extension --compression=auto --reject-regex "/search|/rss" --no-if-modified-since --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36" http://www.example.com
But it crawled the files with it's version names so my js file looks like this:
frontend.min.js#ver=2.5.11
Instead of
frontend.min.js
Also, source code has the same thing:
../jquery/frontend.min.js?ver=2.5.11
I would like to evade that and have it save without version names/info.
You can try removing --page-requisites if you don't need things such as pictures or interactive elements. Removing this will cause wget to not download any CSS or JS files.

Why do I get scrapy response empty?

I started
scrapy shell -s USER_AGENT='Mozilla/5.0' https://www.gumtree.com/p/property-to-rent/brand-new-modern-studio-flat-%C2%A31056pcm-all-bills-included-in-willesden-green-area/1303463798
Next step
In [5]: response
Out[5]: <405 https://www.gumtree.com/p/property-to-rent/brand-new-modern-studio-flat-%C2%A31056pcm-all-bills-included-in-willesden-green-area/1303463798>
After inspected page element,and copied XPath
In [6]: response.xpath('//*[#id="ad-title"]').extract()
Out[6]: []
Copy outerHTML
<h1 itemprop="name" id="ad-title">Brand New Modern Studio Flat £1056pcm | All Bills Included | In Willesden Green area</h1>
Image view response
Why?
Try to set the user agent to something more realistic, such as: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0.
Some websites do some basic validation on the user agent and redirect you to some special page if they detect something weird.
scrapy shell -s USER_AGENT='Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0' https://www.gumtree.com/p/property-to-rent/brand-new-modern-studio-flat-%C2%A31056pcm-all-bills-included-in-willesden-green-area/1303463798
>>> response.xpath('//*[#id="ad-title"]').extract()
['<h1 itemprop="name" id="ad-title">Brand New Modern Studio Flat £1056pcm | All Bills Included | In Willesden Green area</h1>']
>>>

How to change/give user agent definition to phantomjs from cmd

How can I give user agent definition to phantomjs, I am currently running using following command on aws server ec2 instance
phantomjs --web-security=no --ssl-protocol=any --ignore-ssl-errors=true driver.js http://example.com
You can set a user agent in PhantomJS only in script (driver.js in your example). The documentation about it: http://phantomjs.org/api/webpage/property/settings.html
If you want to pass user agent to PhantomJS in a command line, you can use a parameter. In the script you can take the parameter and set it as a user agent. You can try an example below:
var webPage = require('webpage');
var system = require('system');
var page = webPage.create();
var userAgent = system.args[1];
page.settings.userAgent = userAgent;
console.log('user agent: ' + page.settings.userAgent);
phantom.exit();
Running it as follows:
$ phantomjs ua.js "Mozilla/5.0 (Windows NT 6.1; WOW64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120
Safari/537.36"
you will get output:
user agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36

Pattern matching and with multiple parameters in shell script

We face a complicated issue in Apache web server running n Linux where intermittently Apache gives 5XX error for some of the URLs and and that too not continuously. Its like starts with few requests and grows in timely manner. The issue resolves once we restart the Apache.
We are trying to fix the issue but we need a work around till the time where we need to put a script to monitor the access log of Apache server and whenever the issue occurs we have to restart the Apache.
We thought a shell script like tailing the log and grep all 5xx errors to a separate file and another shell script which will be triggered by cron will check the file if the error is repeated for number of times within a mentioned time.
My problem is the uRLs are not always same and so I have to grep the file which has the all 5XX errors and need to see if URLs are repeated and time also.
Can anyone suggest me some logic how i can filter the errors like. I tried to be clear but not sure if this is correct way of explaining the issue.
The logs are bit modified with values but format is same.
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /html/js/barebone.jsp?browserId=other&themeId=expressportal_WAR_expressportaltheme&colorSchemeId=01&minifierType=js&minifierBundleId=javascript.barebone.files&languageId=en_US&b=6200&t=1462268846000 HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /combo/?browserId=other&minifierType=&languageId=en_US&b=6200&t=1462268846000&/html/js/aui/event-touch/event-touch-min.js&/html/js/aui/event-move/event-move-min.js HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /html/js/liferay/available_languages.jsp?browserId=other&themeId=expressportal_WAR_expressportaltheme&colorSchemeId=01&minifierType=js&languageId=en_US&b=6200&t=1462268846000 HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
x.x.x.x, y.y.y.y - - [11/May/2016:08:29:05 +0800](0) "HTTPS" "GET /combo/?browserId=other&minifierType=&languageId=en_US&b=6200&t=1462268846000&/html/js/aui/widget-stack/assets/skins/sam/widget-stack.css HTTP/1.1" 502 319 "https://myportal.test.com/web/guest/home" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
Are you 100% sure a restart fix the 500 errors ? If so, this line in the crontab should do:
tail -n 100 /var/log/apache2/error.logs | awk '{if ($9 >= 500) {nb += 1}} END {if (nb > 10) {exit 1}}' /var/log/apache2/access.log || service apache2 restart
It means that if there's more than 10 errors in the last 100 lines: restart. You may change the values for your specific problem.
First think I can think is: upgrade your Apache if it's not up to date.

curl to bing web site in bash from list

I want to get host of a web site like filehippo.com:
What I tried is:
#!/bin/bash
AGENT='Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1'
page=1
for line in $(cat /tmp/IpList.txt)
do
REQUEST="http://www.bing.com/search?q=ip%3a108.168.208.206&qs=n&pq=ip%3a108.168.208.206&sc=0-0&sp=-1&sk=&first=1&FORM=PERE"
curl $REQUEST --user-agent "$AGENT"
let page=page+10
done
What I want:
I want to search in pages and get result, In this case I have 1 page but some of my servers have more than 1 pages..
Thank you

Resources