scrapy being possibly blocked by site - xpath

I've been trying to scrape text from this site: http://www.ewtn.com/daily-readings/?date=2017-11-26
When I run the following in the Scrapy shell:
response.xpath('//text()').extract()
I can't get at the following HTML:
<span id="cur-date">Sunday, November 26, 2017</span>
which should give me Sunday, November 26, 2017, and
<div class="reading-type">First Reading</div>
which should give me First Reading.
I do get almost everything else on the page, though, so it seems like Scrapy is being blocked.

Thanks go to Markus for pointing me in the right direction! Scrapy does not run JavaScript, and this site seems to build those elements with JavaScript in the browser, so they are missing from the HTML that Scrapy downloads. I used scrapy-webdriver to render the page in PhantomJS, a headless browser that runs the JavaScript, and then parsed the rendered result with Scrapy.
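A quick way to tell "blocked" apart from "filled in by JavaScript" is to look at the text of the element in the raw HTML versus the rendered HTML. Here is a minimal sketch using only Python's standard library. RAW_HTML and RENDERED_HTML are made-up samples (I am assuming the server sends the span empty, which may not match the real page), and text_by_id is a hypothetical helper, not part of Scrapy.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first element that has the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # > 0 while we are inside the target element
        self.done = False    # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, target_id):
    """Return the stripped text of the element with target_id, or ''."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical samples: what the server might send vs. what the browser shows.
RAW_HTML = ('<html><body><span id="cur-date"></span>'
            '<p>Other text</p></body></html>')
RENDERED_HTML = ('<html><body><span id="cur-date">Sunday, November 26, 2017'
                 '</span><p>Other text</p></body></html>')

raw_text = text_by_id(RAW_HTML, "cur-date")
rendered_text = text_by_id(RENDERED_HTML, "cur-date")
```

If the element exists but is empty in the raw response (as in the made-up sample), the site is not blocking the scraper; the text is simply added later by a script, and a JavaScript-capable renderer or the underlying data request is needed.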

Related

Browser Rendered Page Appears Different When Scraped

Opening this page, https://www.barchart.com/futures/quotes/ESH20/options, in Nokogiri doesn't give the same elements as the rendered page in the browser.
How can I access the same source code as seen in the browser DevTools from the scraper library?
I need this element in particular: <div class="bc-datatable"...
Is a headless browser required to get the right page code first?
That data is coming from a JSON endpoint:
So luckily no headless browser required this time.
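Once the JSON request has been spotted in the browser's network tab, the scraper can read that response directly instead of parsing HTML. A rough sketch in Python, standard library only. The endpoint URL is not given above, so fetch_body takes it as a parameter, and SAMPLE_BODY, its "data" key, and the field names are invented purely for illustration. A real endpoint may also need extra headers or cookies.

```python
import json
import urllib.request


def fetch_body(url):
    """Download the response body as text (the URL comes from DevTools)."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def rows_from_json(body, key="data"):
    """Parse a JSON object and return the list stored under `key`.

    Assumes the payload is an object with a list of rows under `key`;
    the key name is a guess and must be adapted to the real response.
    """
    payload = json.loads(body)
    return payload.get(key, [])


# Invented payload standing in for the real response body.
SAMPLE_BODY = (
    '{"data": [{"strike": "3000.00", "last": "12.50"},'
    ' {"strike": "3010.00", "last": "10.25"}]}'
)

rows = rows_from_json(SAMPLE_BODY)
strikes = [row["strike"] for row in rows]
```

With a real URL, rows_from_json(fetch_body(url)) would replace the sample string.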

page.evaluate not working but page.execute_script showing results

I have been using PhantomJS to render a canvas element on my page using Three.js.
Earlier I was building the page myself, but that did not give me the option of adding a background image through a system URL.
After that I started using a localhost URL, which did not work when I used page.evaluate();
But to my surprise, when I use a Ruby/Watir browser with Selenium to do the same operation using the execute_script method, it works.
I want to know what it is doing differently, so that I can implement it in a PhantomJS script instead of depending on Watir/Selenium.
Thanks in advance.

Ajax + pushState bug in Chrome

I've encountered a strange bug in Chrome 19. I implemented a full-AJAX website (every non-external link is opened via AJAX request) with pushState support. I transmit the HTML snippets in AJAX via JSON format.
When I leave my site via an external link and then go back, Chrome renders cached data for that URL. The problem is that it caches the JSON content and shows that instead of the full web page.
This is reproducible by these steps (UPDATE: I have since removed the AJAX functionality from my website, so this bug no longer appears):
Open http://beta.mirtes.cz/
Click on the second date link (16. 6. 2012 next to "It all began with a strange e-mail"). This page (you are now at http://beta.mirtes.cz/it-all-began-with-a-strange-e-mail) is loaded via AJAX.
Click on "It all began with a strange e-mail". You are redirected to an external website.
Click "Back" in Chrome after the page is completely loaded.
I tried sending all AJAX responses with Cache-Control: no-cache, but with no effect.
Firefox 12 works OK.
I came up with a workaround: I perform the AJAX request with an additional dummy GET parameter, ?ajax=1. This way the browser can tell the usual HTML content and the JSON apart, because they no longer share a URL. It doesn't have any impact on the user; the parameter is visible only in Firebug.
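The idea behind the workaround is just that the JSON variant of a page gets its own URL, so the cached JSON can never be served for the plain address. On the real site this happens in the browser's JavaScript; the sketch below only illustrates the URL rewriting in Python, and with_ajax_marker is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_ajax_marker(url, name="ajax", value="1"):
    """Return `url` with the dummy query parameter added exactly once."""
    parts = urlsplit(url)
    # Keep existing parameters, dropping any earlier copy of the marker.
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != name]
    query.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


page_url = "http://beta.mirtes.cz/it-all-began-with-a-strange-e-mail"
ajax_url = with_ajax_marker(page_url)
```

The address shown to the user (set through pushState) stays page_url; only the background request uses ajax_url.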

Loading Ajax Response Data with Adsense Codes inside

I'm 10000000% sure that this question has been asked before; however, the majority of the responses that I came across were from back in 2005, 2006 and so on. Not to mention, almost all of the questions themselves were too general. Therefore, I'm asking this so that anyone else who needs to find this out won't need to dig through about 50 webpages to get an idea.
My question is simply that I have a webpage that has Google Ads embedded into the HTML of the website. The website was first developed as a static HTML site where each link loaded a new page. Never mind the backend technology of the website; the website itself produces purely dynamic content. The website is close to completion, and now a fully-AJAX listener has been added to all the links. When any of the links is clicked, JavaScript takes over, parses the link and sets it using popstate or the hashbang. The page itself is then requested from the server via AJAX and the content is updated using document.getElementById('container').innerHTML=ajax.responseText; This way, there is almost a 100% reliable method of accessing content that was replaced by AJAX.
This all works fine, but the responseText itself may, WILL contain Google Ads, and I was just wondering how to display them as if it were a static page. Clearly this doesn't work as it stands. Here are the options that I've come across:
Use an IFrame:
An IFrame seems to be an effective way to load the content; just stick the AdSense code in a simple adsense.html iframe file and let the browser and
Loading the ads directly into the page isn't possible:
it's against their TOS
the ad code relies on document.write(), which is skipped when the markup arrives through an AJAX request
Your best option is:
Create a simple iframe
<iframe src="advert.html"></iframe>
and in advert.html, add your advert code
It's then loaded fine without problems.
Good luck

Strange problem with Google Maps and Ajax in Google Chrome and Safari

I am developing web-application using Google Maps API and ASP.NET Ajax. Here is my JavaScript-code for PageLoad:
map.openInfoWindowHtml(map.getCenter(),'Hello, <b>world</b>!');
The first run is successful. But after executing some ASP.NET Ajax function, there is a strange effect: in Internet Explorer, Mozilla Firefox and Opera everything is fine, but in Google Chrome and Safari the text inside HTML tags is invisible. In other words, in Google Chrome we get the text: “Hello, !”
I want the application to work normally in Google Chrome and Safari too. How can I do it?
Update:
The string "Hello, <b>world</b>, <strong>world</strong>, <span style='font-weight: bold;'>World</span>, <a href='http://ya.ru'>Link</a>." is transformed into "Hello, , , , . " (I examined the DOM). The words really are disappearing.
I observed this strange effect with any Ajax function that makes a request to the server.
Update2:
Many thanks to Koobz for the many leading questions. They helped me understand the problem in more detail.
First of all, a full description of the actions:
Load the page. The GMap has several markers with a dblclick event in JavaScript. The dblclick event executes marker.openInfoWindowHtml(/My text/). /My text/ is located in the JavaScript of my page.
I double-click on the marker. I see an infoWindow with normal formatting.
Execute __doPostBack (standard ASP.NET PostBack).
On the server side, the JavaScript is updated with the same
The server returns some information with /My text/ to my page.
I double-click on the marker. I see an infoWindow with wrong formatting.
An interesting fact, which puzzles me:
I try setting the text to “Hello, <b>world</b>, <b>test</b>”
Before the Ajax function, in all browsers I get: “Hello, world, <b>test</b>”
After the Ajax function, in Google Chrome and Safari: "Hello, ,test"
After the Ajax function, in Mozilla, Opera and IE: “Hello, world, <b>test</b>”
What features of Chrome and Safari could cause such behavior? For now I can write the necessary infoWindow text separately for each browser, but I would like to find a proper way to solve my problem.
Hit ctrl+shift+j to open up your chrome developer tools.
Set a breakpoint right before the function call that breaks everything.
Attempt to reproduce the bug.
After the break point hits, step through your code until the text disappears.
Set a breakpoint after the text first disappeared.
Repeat this process. Refine your breakpoints until you've narrowed down where the bug is occurring.
It could be any number of things. I have no idea what kind of Ajax things you're doing. Are you dynamically updating content on your page after doing the request? It's possible that this update code is modifying dom elements that it shouldn't be. Tracing through using the methodology above will help nail it if this is the case.
If you're using jQuery, maybe one of your selectors is out of whack and it's eating up content. Chrome is very good and I'm hesitant to believe it's a javascript bug or anything like that.
Validate your HTML. If you're traversing the dom, invalid markup might result in chrome "seeing" a different picture than other browsers. Just look for broken tags and ignore all the other things a validator complains about.
Wild guess: but the way it's stripping out HTML might point to some kind of XSS protection. Is the Ajax script that's returning the new HTML code on another domain than the one the visitor is viewing?
Some info here:
http://groups.google.com/group/chromium-dev/browse_thread/thread/d2931d7b670a1722/d56bdfccfcef677f
Do you see the problem with any html tags in the info window? As an experiment, try:
<span style="font-weight: bold;">World</span>.
I am wondering if there is an unclosed bold tag somewhere in the DOM?
I am messing around with this problem, but I haven't been able to reproduce it. Having a look at what the Ajax function does would be helpful.
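To check the unclosed-tag theory without reading the markup by eye, something like the following rough sketch could be run over the HTML the server sends back after the postback. It uses Python's standard html.parser; find_unclosed is a hypothetical helper, and it is deliberately simple: tags whose closing tag HTML allows you to omit (such as p or li) will be reported too.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class UnclosedTagFinder(HTMLParser):
    """Track open tags on a stack and record the ones never closed."""

    def __init__(self):
        super().__init__()
        self.stack = []      # tags opened and not yet closed
        self.unclosed = []   # tags skipped over by a later closing tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.stack:
            return  # void element or stray closing tag: nothing to match
        # Pop back to the matching opener; anything above it was left open.
        while self.stack:
            top = self.stack.pop()
            if top == tag:
                break
            self.unclosed.append(top)


def find_unclosed(html):
    """Return the names of tags that were opened but never closed."""
    finder = UnclosedTagFinder()
    finder.feed(html)
    finder.close()
    return finder.unclosed + finder.stack


ok = find_unclosed("Hello, <b>world</b>!")
broken = find_unclosed("<div>Hello, <b>world</div>")
```

An empty list means every opened tag was closed; a "b" in the result would support the idea that a stray bold tag is swallowing the later text.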
Try this:
map.openInfoWindowHtml(map.getCenter(),'Hello, <strong>world</strong>!');
The strong tag is more standards-compliant, so it's worth a shot.
As others have said, you need to post your code.
