I'm trying to download some data from a webpage that is dynamically generated, so using wget doesn't work. The page is http://gaceta.diputados.gob.mx/SIL/Legislaturas/Listados.html. I want to download the list shown for each of the options that can be selected in the "Legislatura" field; once downloaded, I can process the data in Ruby.
I just want to know the best way to download this and, if possible, how to select each of the options and download its list.
You can use the Web Inspector in Safari or Chrome, or the Firebug extension in Firefox, to look at how the data is loaded. The page is doing an AJAX POST request to a Perl script on that website, and the data is returned as XML.
I would use cURL to grab the data.
You could use Watir (http://watir.com/) or Webrat to simulate what you would do to view the data, then use Nokogiri to parse the HTML.
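For example, here is a rough Ruby sketch of replaying that POST yourself. The CGI path, the form field name, and the option values below are placeholders, so copy the real ones from the request shown in the inspector's Network panel:

    require "net/http"
    require "nokogiri"

    # Sketch only: the script path, the parameter name, and the option values
    # are assumptions; take the real ones from the POST request the browser makes.
    uri = URI("http://gaceta.diputados.gob.mx/SIL/cgi-bin/listados.pl")

    %w[LXI LX LIX].each do |legislatura|          # one request per "Legislatura" option
      response = Net::HTTP.post_form(uri, "Legislatura" => legislatura)
      doc = Nokogiri::XML(response.body)          # per the above, the data comes back as XML
      File.write("listado_#{legislatura}.xml", doc.to_xml)
    end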
I'm updating some old CasperJS code that downloads a CSV report. The web interface recently changed. The old version had a link tag I could grab and then use casper.download() to retrieve the file.
However, the new version appears to be an Angular app and the download button triggers a handleDownload() function that does something under the hood, which results in a popup dialog in my browser.
Is there some way to intercept this dialog or otherwise extract the URL of the actual file?
A few options:
See what URL is requested (F12 > Network in Chrome) when you trigger the download. You could then try to reproduce that request yourself.
Look at what handleDownload does - the logic should be available to you. You may be able to pull data there.
Hard to help without seeing the code.
I am trying to scrape the video links on the web page, https://www.tokopedia.com/chocoapple/ready-stock-bnib-iphone-128gb-7-plus-jet-black-garansi-apple-1-tahun-10?src=topads
There are links which are generated through the "webyclip" service, which loads its data after the page has loaded. I want the updated HTML source of the page after all the JavaScript and AJAX have run (similar to what you see when you use "Inspect element" in a browser). How can I get this done with the chromedp package (https://github.com/knq/chromedp), which drives headless Chrome from Go? Please help; I am a newbie at web scraping.
EDIT: This is not the same as the other question mentioned in the comments, as this one is specific to the chromedp package. The one in the comments asks how/what to use to scrape dynamic content in general.
After many attempts, I finally found a way and solved my query.
You can check my GitHub repository for the solution.
Thank you.
Using a TWebBrowser in Delphi, I'm showing a website that uses JSON to update its content. To get the source code of the newly updated content and load it into a Memo, I believe I have to get the URLs of the GET requests. Unfortunately, these are always different and generated by obfuscated JavaScript. Is there any way to list the URLs the GET requests go to, similar to what Firebug does in its console view?
Thanks a bunch!!!
I want to save the HTML generated by JavaScript on a website.
When I run the JavaScript, it returns the finished HTML, with a button that links to the Chrome print dialog to save as PDF. I want to save this generated HTML as a PDF, but I can't do it.
I've spent days trying almost everything: PDFKit with Nokogiri parsing, searching for a Chrome printer API, etc., but nothing worked. Does anyone know how I can do that?
Using PhantomJS and rasterize.js you can convert it.
Then just run the command:

    phantomjs rasterize.js $URL_OR_PATH $PDF_OUT_FILENAME Letter
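If you want to drive that from your Ruby code, a minimal sketch is to shell out to PhantomJS. This assumes phantomjs is on your PATH and rasterize.js (from the PhantomJS examples) is in the working directory; the URL and output name are placeholders:

    # Shell out to PhantomJS to render the page as a PDF.
    # The URL and output file name below are placeholders.
    url      = "http://example.com/page-generated-by-javascript"
    pdf_path = "output.pdf"

    ok = system("phantomjs", "rasterize.js", url, pdf_path, "Letter")
    abort "PhantomJS failed to render #{url}" unless ok
    puts "Saved #{pdf_path}"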
Based on the JavaScript you're running, figure out the URL it calls, along with whatever variables it adds to the GET/POST request, then use OpenURI or an HTTP client of some sort to request that file. Pass that to Nokogiri, and parse out the URL for the file.
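A rough sketch of that first approach, assuming you have already worked out the URL (and any parameters) the JavaScript requests; the URL and the CSS selector here are placeholders:

    require "open-uri"
    require "nokogiri"

    # Request the content the JavaScript would have fetched, then parse it.
    # Both the URL and the selector are assumptions for illustration only.
    url  = "http://example.com/generate_report?id=123"
    html = URI.open(url).read
    doc  = Nokogiri::HTML(html)

    # Pull the link you are after out of the returned markup.
    link = doc.at_css("a.print-link")
    puts link["href"] if link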
The alternative is to use one of the Watir gems to drive a browser and access the file that way. Then you can retrieve the HTML, or have the browser retrieve the file, and get it off the disk when it's done.
I didn't understand the second solution you proposed; can you explain more?
Sometimes developers use Ajax to retrieve HTML and insert it into a page, or directly manipulate the page's HTML using JavaScript.
You can ask a Watir-driven browser to give you the current HTML and then parse it using Nokogiri or another XML parser, to retrieve things that are part of the HTML DOM at that moment. From there you can save that to disk and have the Watir-driven browser read it and render it. Then it's a matter of figuring out how to get the browser to print to PDF, or grab a snapshot of the screen to turn it into a PDF.
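A minimal sketch of that Watir flow, where the browser choice, URL, and output file name are placeholders (turning the saved page into a PDF is a separate step, e.g. the rasterize.js command above):

    require "watir"
    require "nokogiri"

    # Let a real browser execute the JavaScript, then grab the rendered DOM.
    browser = Watir::Browser.new :chrome
    browser.goto "http://example.com/page-built-by-javascript"

    html = browser.html                   # the page's HTML as it exists right now
    doc  = Nokogiri::HTML(html)           # parse it if you only want part of the page

    File.write("rendered.html", doc.to_html)
    browser.close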
I was trying to parse some HTML content from a site. Nokogiri works perfectly for the content delivered in the initial page load.
The issue now is how to fetch content that is loaded using AJAX. For example, there is a "see more" link that fetches more items via AJAX, or consider the case of AJAX-based tabs.
How can I fetch that content?
You won't be able to use Nokogiri to parse anything that requires a JavaScript runtime to produce the content. Nokogiri is an HTML/XML parser, not a web browser.
PhantomJS, on the other hand, is a web browser, albeit a special kind of browser ;) Take a look at it and have a play.
It isn't completely clear what you want to do, but if you are trying to get access to additional HTML that is loaded by AJAX, then you will need to study the code, figure out what URL is being used for the AJAX request, whether any session IDs or cookies have been set, then create a new URL that reproduces what AJAX is using. Request that, and you should get the new content back.
That can be difficult to do, though. As #Nuby said, Mechanize could be a big help, as it is designed to manage cookies and sessions for you in the background. Mechanize uses Nokogiri internally, so if you request a page via Mechanize you can run Nokogiri searches against it to drill down and extract any particular JavaScript strings. They'll be present as text, so you can use regexes or substring matches to get at the particular parameters you need, then construct the new URL and ask Mechanize to get it.
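For instance, a sketch of that Mechanize flow, where the page URL, the inline-script regex, and the AJAX endpoint are all placeholders you would work out from the site's own JavaScript:

    require "mechanize"

    # Rough sketch: every URL and the regex below are assumptions for illustration.
    agent = Mechanize.new
    page  = agent.get("http://example.com/page-with-see-more")

    # The inline scripts are just text, so dig the parameter you need out with a regex.
    scripts = page.search("script").map(&:text).join("\n")
    token   = scripts[/ajaxToken\s*=\s*"([^"]+)"/, 1]

    # Reproduce the AJAX request; Mechanize carries the cookies and session along.
    more = agent.get("http://example.com/ajax/see_more?token=#{token}")
    puts more.body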