I'm trying to extract data from this website, but I get nothing back from any text I try to select.
I'm using Xidel to scrape the data.
xidel -e '//span[@class="main-price"]/text()' 'https://www.tokopedia.com/emas/harga-hari-ini'
**** Retrieving (GET): https://www.tokopedia.com/emas/harga-hari-ini ****
**** Processing: https://www.tokopedia.com/emas/harga-hari-ini/ ****
It should at least return Rp or some numbers, but I'm not sure why it returns nothing. Another website I tried worked just fine.
The target website is one of those sites where the content is loaded dynamically using JavaScript. A simple way to confirm this is to open the page, then disable JavaScript in your browser and reload. In the case of this particular page, you'll see it's entirely blank.
There are a couple of ways to handle it, but unless I'm sorely mistaken, xidel isn't one of them. Start by taking a look at this.
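The same check can be done without a browser by looking at the raw HTML the server sends before any script runs. Here is a rough Python sketch of that idea. The two sample pages and the price text in them are invented for illustration (they are not the real Tokopedia markup), and `class_in_static_html` is a hypothetical helper name:

```python
from html.parser import HTMLParser


class ClassScanner(HTMLParser):
    """Records whether any tag in the fed HTML carries a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.found = True


def class_in_static_html(html, class_name):
    """Return True if the class appears in the HTML as served (no JavaScript run)."""
    scanner = ClassScanner(class_name)
    scanner.feed(html)
    return scanner.found


# Invented sample: a page whose price is present in the served HTML.
static_page = '<html><body><span class="main-price">Rp 123</span></body></html>'

# Invented sample: a page that only ships an empty container plus a script.
dynamic_page = (
    '<html><body><div id="app"></div>'
    '<script src="/bundle.js"></script></body></html>'
)

print(class_in_static_html(static_page, "main-price"))
print(class_in_static_html(dynamic_page, "main-price"))
```

If the class is missing from the served HTML but visible in the browser's element inspector, the element is being created by JavaScript, and an XPath run against the raw response will match nothing.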
Related
I'm writing a crawler to get the content from a website which uses AJAX.
There is a "show more" button at the bottom of the page, and my original approach was to use Selenium with PhantomJS to simulate a web browser, but it works on some websites and not on others.
I'm wondering if there is some way I can directly get the underlying JSON file of the AJAX action. Please give me some details, thanks.
By the way, I'm using Python.
I understand this is less of a Python problem than a scraping problem in general (and I understand you meant "scraping" instead of "crawling", as a scraper reads/parses/processes one page whereas a crawler processes multiple pages and their relation to each other).
You can get the JSON file directly, provided you know its URL. If you don't (for example because the URL changes from time to time), you might need to search through the JavaScript files on the page manually to find out how the URL is generated.
Once you know the JSON file's URL, it's quite simple. As you already seem to know how to get the HTML of the "main" page, you can use your existing code to get the JSON file.
I'm not familiar with PhantomJS, but I reckon it's easier to get the JSON file immediately instead of simulating an AJAX request (if that's even possible with Phantom).
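As a minimal sketch of "use your existing code to get the JSON file", assuming the endpoint returns UTF-8 JSON: the URL and the `items` / `has_more` keys below are made up for illustration, and the fake opener stands in for the network so the example runs offline. In real use you would call `fetch_json(url)` with the default opener.

```python
import io
import json
import urllib.request


def fetch_json(url, opener=urllib.request.urlopen):
    """Download `url` and decode the response body as JSON.

    `opener` defaults to urllib's real urlopen; it is a parameter only so
    the function can be exercised without network access.
    """
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fake_opener(url):
    # Stand-in for the network: returns an invented JSON payload.
    return io.BytesIO(b'{"items": [1, 2, 3], "has_more": true}')


# Hypothetical endpoint; find the real one in the page's JavaScript.
data = fetch_json("https://example.com/api/items?page=2", opener=fake_opener)
print(data["items"], data["has_more"])
```

If the "show more" button simply requests the next page of the same endpoint, a loop that increments the page parameter until the response says there is nothing more would replace the simulated clicks.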
I need to do some screen scraping on a web page where the content I need is generated by AJAX. On the initial page there is a table with 4 tabs. When you click on any of the tabs the content of the table changes. I need the content from the 3rd tab only.
I have used the Google Chrome 'Inspect Element' tool to see what the requests and POST data were, and I can get the information I need when I put the information (session ID and a lot of other cookie data, as well as POST data) from the Inspect Element result into a PHP cURL request. But this only works for the 30 minutes that the session lasts. Does anyone know of a way I can get to this information?
I won't reproduce the code here, but I will point you to the answer.
It's in this book:
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593273975/ref=dp_ob_image_bk
A must-buy for someone doing what you're doing.
In the end I used htmlunit to get the content I needed. I also found the HTMLUnit Scripter very useful to help generate the Java code required.
Read Chain
$(".ReadChainDL").colorbox();
When I click the Read Chain link, the AJAX request runs and Colorbox opens, but the Colorbox is blank! I can confirm via Firebug that the AJAX request is running and pulling the correct content. It's just not putting the returned content into the Colorbox.
I've tried it in Firefox and Chrome
It's got to be the URL; your code works fine. Proof: http://jsfiddle.net/HP8tN/
I think there are two main possibilities:
There's no filename, so maybe Colorbox doesn't know what content type to use. Or maybe the URL is wrong, or the target has the wrong content type. I think this is the most likely option. Try $(".ReadChainDL").colorbox({photo: true}); if it's a photo. Otherwise, check out the Content Type section in the documentation.
Colorbox is supposed to figure out whether you've passed it a URL or a jQuery-style selector. The 10/12/2012 could be messing up whatever logic it uses to recognize a URL. This seems unlikely, since you've confirmed that something is coming back, but it's worth a try. Try 10%2F12%2F2012 instead.
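For reference, 10%2F12%2F2012 is just the percent-encoded form of the date, with each "/" replaced by %2F. A quick Python sketch of producing it on the server side (the `readchain.php?date=` URL is a hypothetical placeholder, since the real link isn't shown in the question):

```python
from urllib.parse import quote, unquote

raw_date = "10/12/2012"

# quote() leaves "/" unescaped by default, so pass safe="" to encode it too.
encoded_date = quote(raw_date, safe="")

# Hypothetical link target, for illustration only.
href = "readchain.php?date=" + encoded_date

print(encoded_date)
print(href)
print(unquote(encoded_date) == raw_date)
```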
Can you post the content that is being returned by the ajax call? jQuery may not be able to append it to your document if there are issues with it being invalid or malformed.
Try validating your content.
I couldn't really word the title very well, but here's my problem: I've got a webpage that reads from a database each time the user clicks a button, and the content then replaces part of the page.
Because it is an AJAX load, everything is done in the background, and so the URL stays the same. This wasn't a problem at all until I realised that I will want to have a different Facebook comments box for each set of content that is loaded, so that if someone comments, it is posted to their Facebook profile, and people who click on the link are then taken to that particular content.
So... what I need is some way of referencing each set of content, and I've found a site that does exactly that (I'm sure there are a lot of them).
Here's the link.
Each set of content has a different 'hash code' (I don't know the actual name for it) which is appended to the URL. In this case the code is "#1922934". This allows people to post links to that specific set of content on Facebook etc., and also allows a different Facebook comment box for each set of content.
Does anyone know how such a set-up can be achieved or how these 'hash codes' work?
Here's the Wikipedia article on it:
http://en.wikipedia.org/wiki/Fragment_identifier
The main idea is that URI fragments are used because they don't cause a page reload. They also can be used to refer to anchors on a web page.
What I would do is, on page load, use JavaScript to read the URI fragment (location.hash), then make a request to your server to load the comments etc. The URI fragment is never sent to the server; it is only available on the client (the browser).
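To make the split concrete, here is a small Python sketch of how such a link breaks down. In the browser the reading would be done with location.hash in JavaScript; Python is used here only to illustrate which part of the URL goes where. The `example.com` addresses and the `content?id=` endpoint are made up:

```python
from urllib.parse import urldefrag, urlsplit

# Hypothetical shared link; only the "#1922934" part comes from the question.
link = "http://example.com/gallery#1922934"

# The fragment is everything after "#".
fragment = urlsplit(link).fragment

# This is the URL the browser actually requests: the fragment is stripped.
requested_url, _ = urldefrag(link)

# Hypothetical follow-up AJAX request built by client-side code from the fragment.
ajax_url = "http://example.com/content?id=" + fragment

print(fragment)
print(requested_url)
print(ajax_url)
```

Because the server only ever sees the URL without the fragment, the content ID has to be passed back explicitly in a second request, which is what the client-side script is for.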
Sounds like you want something like SammyJS.
I would like to set Firebug to automatically display results of a POST or GET ajax request.
Is there such an option to automatically open the response tab? When I am debugging multiple requests it is a bit tedious to click on each one. I've looked through the general options and also done some Google searching.
http://img109.imageshack.us/img109/3279/firebug.jpg
The simple solution for me was to just start using FireLogger instead. Simpler and cleaner.
Cya later FirePHP!!
http://img809.imageshack.us/img809/7227/firelogger4phpmainshot.png