I'm writing a crawler to get the content from a website which uses AJAX.
There is a "show more" button at the bottom of the page, and my origin approach is to use Selenium.PhantomJS to pretend a web browser but it works in some website and some don't.
I'm wondering if there is some way i can directly get the underly JSON file of the AJAX action. Please give me some details, thanks.
By the way, I'm using Python.
I understand this is less of a Python problem than a scraping problem in general (and I understand you meant "scraping" instead of "crawling", as a scraper reads/parses/processes one page whereas a crawler processes multiple pages and their relation to each other).
You can get the JSON file directly, given you know its URL. If you don't (for example because the URL changes from time to time), you might need to search through the JavaScript files on the page manually to find out how the URL is generated.
Once you know the JSON file's URL, it's quite simple. As you already seem to know how to get the HTML of the "main" page, you can use your existing code to get the JSON file.
I'm not familiar with PhantomJS, but I reckon it's easier to get the JSON file directly instead of simulating an AJAX request (if that's even possible with Phantom).
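If you watch the browser's developer tools (the Network tab, or Firebug) while clicking "show more", you can usually see the request the button triggers and then reproduce it yourself. A minimal sketch in Python, assuming you have already identified the endpoint that way; the URL and parameter names below are made-up placeholders, not the site's real API:

    # Minimal sketch: fetch the AJAX endpoint directly once you know its URL.
    # The URL and parameters here are hypothetical placeholders.
    import requests

    url = "https://example.com/api/items"           # hypothetical endpoint
    params = {"page": 2, "page_size": 20}           # hypothetical paging parameters

    response = requests.get(url, params=params, headers={
        "X-Requested-With": "XMLHttpRequest",       # some sites expect this header
    })
    response.raise_for_status()

    data = response.json()                          # parse the JSON body
    for item in data.get("items", []):              # structure depends on the site
        print(item)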
I've installed the Ajaxy Live Search plugin v3.0.7 in my WordPress site, but when searching for something like:
<script>alert('123');</script>
I can see the alert in my browser.
Using Firebug I can see that when typing in the search box I'm calling
/wp-admin/admin-ajax.php
sending these parameters:
action = ajaxy_sf
search = false
sf_value = <script>alert('123');</script>
How can I avoid this problem on my search box?
Thanks in advance!
XSS is an output encoding problem. It's not about what parameters are sent in the request; the vulnerability manifests itself when a response is produced (either to the same request or a different one).
In any response, output should be encoded according to the context. Depending on how your plugin works, it may either create partial HTML on the server or use DOM manipulation in the browser while sending AJAX requests.
If it's creating partial HTML and inserting that into the page as the search results, the view that generates the partial HTML should be fixed by adding output encoding (i.e. htmlspecialchars() in PHP, but there are other options too).
If it's just an AJAX request and the page DOM is manipulated in JavaScript, the application should make sure to only insert variables as text nodes and not as whole DOM subtrees with potential script nodes.
Either way, I think it should be done by the plugin. If it's not written correctly and vulnerable to XSS, pretty much the only way to fix it is to fix the plugin itself.
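If it helps to see the encoding idea in code: the plugin itself is PHP, where htmlspecialchars() plays this role, but here is the same step sketched in Python purely as an illustration:

    # Sketch of HTML output encoding, in Python only to illustrate the idea;
    # in the PHP plugin the equivalent call would be htmlspecialchars().
    import html

    sf_value = "<script>alert('123');</script>"      # attacker-controlled input

    # Encode for an HTML context before inserting it into the response, so the
    # browser shows it as text instead of executing it.
    encoded = html.escape(sf_value, quote=True)
    print("<li>" + encoded + "</li>")
    # prints: <li>&lt;script&gt;alert(&#x27;123&#x27;);&lt;/script&gt;</li>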
I would like to extract data from this web page using R:
I guess the web page is loaded from the server using AJAX or something similar. Moreover, I would also like to save the data available on the following pages, the ones I can see when I press the NEXT button at the bottom of the data table.
Thanks a lot for any tips.
domenico
It appears to be generated using a JavaScript blob. You could look at rvest, which is an R library for just this problem (web scraping). If that doesn't work, RCurl's getURL function definitely grabs the script contents (although it's ugly as sin and you'll be wanting to grep it; mind you, all automatically generated HTML is ugly).
Using a TWebBrowser in Delphi, I'm showing a website that uses JSON to update its content. To get the source code of the newly updated content and load it into a Memo, I believe I have to get the URLs of the GET requests. Unfortunately, these are always different and generated by obfuscated JavaScript. Is there any way to list the URLs the GET requests go to, similar to what Firebug does in its console view?
Thanks a bunch!!!
I want to save the HTML generated by JavaScript on a website.
When I run the JavaScript, it returns the finished HTML, with a button that links to the Chrome print dialog to save it as a PDF. I want to save this generated HTML as a PDF, but I can't manage to do it.
I've spent days trying almost everything: PDFKit with Nokogiri parsing, searching for a Chrome printer API, etc., but nothing worked. Does anyone know how I can do that?
Using PhantomJS and rasterize.js, you can convert it.
Then just run the command:
phantomjs rasterize.js $URL_OR_PATH $PDF_OUT_FILENAME Letter
Based on the JavaScript you're running, figure out the URL it calls, along with whatever variables it adds to the GET/POST request, then use OpenURI or an HTTP client of some sort to request that file. Pass that to Nokogiri, and parse out the URL for the file.
The alternative is to use one of the Watir gems to drive a browser and access the file that way. Then you can retrieve the HTML, or have the browser retrieve the file, and get it off the disk when it's done.
I didn't understand the second solution you proposed, can you explain it a bit more?
Sometimes developers use Ajax to retrieve HTML and insert it into a page, or directly manipulate the page's HTML using JavaScript.
You can ask a Watir-driven browser to give you the current HTML and then parse it using Nokogiri or another XML parser, to retrieve things that are part of the HTML DOM at that moment. From there you can save that to disk and have the Watir-driven browser read it and render it. Then it's a matter of figuring out how to get the browser to print to PDF, or grab a snapshot of the screen to turn it into a PDF.
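This thread is Ruby, but if a concrete sketch of that flow helps, here is the same idea in Python, with Selenium standing in for Watir and Chrome's DevTools print command producing the PDF; the URL is a placeholder:

    # Sketch only: Selenium (Python) standing in for a Watir-driven browser,
    # using Chrome's DevTools protocol to print the rendered page to PDF.
    # Assumes Selenium 4 with Chrome/chromedriver installed; the URL is made up.
    import base64
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get("https://example.com/report")          # hypothetical page

    # The HTML as it exists right now, after the JavaScript has run.
    with open("report.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)

    # Ask Chrome to "print" the rendered page to PDF (returns base64 data).
    result = driver.execute_cdp_cmd("Page.printToPDF", {"printBackground": True})
    with open("report.pdf", "wb") as f:
        f.write(base64.b64decode(result["data"]))

    driver.quit()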
I was trying to parse some HTML content from a site. Nokogiri works perfectly for content that is present in the initial page load.
Now the issue is how to fetch the content that is loaded using AJAX. For example, there is a "see more" link and more items are fetched using AJAX, or consider the case of AJAX-based tabs.
How can I fetch that content?
You won't be able to use Nokogiri to parse anything that requires a JavaScript runtime to produce its content. Nokogiri is an HTML/XML parser, not a web browser.
PhantomJS, on the other hand, is a web browser, albeit a special kind of browser ;) Take a look at that and have a play.
It isn't completely clear what you want to do, but if you are trying to get access to additional HTML that is loaded by AJAX, then you will need to study the code, figure out what URL is being used for the AJAX request and whether any session IDs or cookies have been set, then create a new URL that reproduces what the AJAX call is using. Request that, and you should get the new content back.
That can be difficult to do though. As @Nuby said, Mechanize could be a good help, as it is designed to manage cookies and sessions for you in the background. Mechanize uses Nokogiri internally, so if you request a page with Mechanize, you can use Nokogiri searches against it to drill down and extract any particular JavaScript strings. They'll be present as text, so you can then use regex or substring matches to get at the particular parameters you need, then construct the new URL and ask Mechanize to get it.
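A rough Python analogue of that flow, in case it helps picture it (requests.Session standing in for Mechanize's cookie handling; every URL, regex, and parameter below is a made-up placeholder):

    # Rough sketch: requests.Session stands in for Mechanize's cookie/session
    # handling. The URL, regex and parameters are hypothetical placeholders.
    import re
    import requests
    from urllib.parse import urljoin

    session = requests.Session()                      # keeps cookies across requests

    page = session.get("https://example.com/items")   # the "main" page
    page.raise_for_status()

    # Dig the AJAX endpoint out of the inline JavaScript; you have to read the
    # site's own scripts to know what pattern to look for.
    match = re.search(r"ajaxUrl\s*=\s*'([^']+)'", page.text)
    if match:
        ajax_url = urljoin(page.url, match.group(1))  # handle relative URLs too
        more = session.get(ajax_url, params={"offset": 20})   # hypothetical parameter
        print(more.json())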