In this answer to another question, a user was able to locate the URL for a part of this page which is loaded via javascript. So far, I have been unable to duplicate this simple feat: In particular, I have looked at the page source, but have been unable to locate the code to which the responder refers:
<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:fals e,type:"once"});
</script>
I have searched the page source for JS which is written locally on the page's HTML file and also within JS files referenced in the source file. What have I overlooked?
TL;DR the Q&A referred to is over 10 years old, and the website has changed since then.
There are a couple of things you may be missing. Firstly, the Stack Overflow question you are referring to is over 10 years old. The page it refers to strictly doesn't exist anymore. If you use Chrome or Firefox developer tools, you will notice that requesting the page link gives you an http 301, indicating the page has moved permanently, and directs you to a different page which presumably looks similar to the page from 10 years ago. The current version of the page is built using angular js, which was not in widespread use in 2009 (in fact, I think it was only created in 2009). The reason that you can't find the ajax request anywhere is that this is no longer how the page is constructed.
You could probably still replicate the feat of scraping the page, but now it would be much harder. You would have to do it by requesting and parsing the JSON that is now used to populate the page's fields, e.g. from https://api.vanguard.com/rs/ire/01/ind/fund/VWIUX/expense.jsonp?callback=angular.callbacks._w&planId=null . However, this will only work if you have the correct cookies in the header, etc.
So the good news is, you aren't failing to follow the method given in the answer. The bad news is that the answer is obsolete.
Related
I'm building a news scraper for a project, and I found my way through most of the sites, but one is giving me the headache, because whenever I try to bulk-extract the articles contents, most of html of the links won't load.
I even tried in python, same obsolete results.
My question is:
how can I set a "wait until content is loaded"? I am reading that some Ajax thing may be needed to load first.
I think what you are looking for are the Selenium Nodes. They are particularly targeted for extracting data from Ajax-based websites, where content is loaded via JavaScript code.
You can find some example workflows e.g. here:
https://nodepit.com/server/seleniumnodes.com
Where I'm at: I've read Google's documentation regarding it's AJAX crawling, and I've searched around a bit in this website and others, but I'm quite confused, as it seems that all problems address the same issue: AJAX crawing with hashbangs?
I've developed an app which, among other purposes, let's the user search for locations worldwide, using an AJAX searcher quite similar to Google's, but my app uses exclusively the question mark in AJAX, instead of hashbang. Due to compatibility issues, changing it to the hashbang is not an option.
Not only am I largely confused by the fact that I could not find anyone else using the question mark instead of the hashbang, I'm also wondering if there is any documentation regarding my issue: how to let google bot crawl all my AJAX content when I'm using the question mark instead of a hashbang in my AJAX app.
The AJAX crawling schema was created explicitly for applications and websites using hashbang (#!) in the URL structure, because the fragment part of the URLs only exist on the client side; the URL rewriting in the specs, i.e. from #! to ?_escaped_fragment_= is meant to solve that.
Since most of the web is already making use of Javascript in a way or other, we (Google) needed a better solution, so we started executing Javascript in the pages we crawled and effectively render every page, just like a normal browser would. To quote our blogpost, Understanding web pages better:
In order to solve this problem, we decided to try to understand pages by executing JavaScript. It’s hard to do that at the scale of the current web, but we decided that it’s worth it. We have been gradually improving how we do this for some time. In the past few months, our indexing system has been rendering a substantial number of web pages more like an average user’s browser with JavaScript turned on.
You can also see what we "see" using Fetch as Google in Search Console (former Webmaster Tools); read more about the feature in our post titled Rendering pages with Fetch as Google
Before you do anything else, please try to fetch a few pages from your site with Fetch as Google. You might not have to do anything at all, it might actually work out of the box. And the good news is that it's not only Google that's rendering pages!
I have developed a website using angularjs and web api.
The problem is that the ajax rendered content is not crawable by google. And no one can find the website using google search.
After reading many articles regarding this issue, including:
This one with all links of explanation going out,
Google ajax crawling protocol, and also stack over flow question, I couldn't find the proper solution. Those that mention asp.net solutions, are talking about mvc, and I need only the simple REST by web api, other articles are not talking about asp.net.
Is there any simple explanation?
I'm the one who asked this same question long ago, so I will answer from my experience:
Firstly, if all your content are accessible via unique URIs (including the hashbang if you use it), modern search engines should index it just fine. In fact Google can index javascript generated content now. You can try that via the Google Webmaster tool and see how your site is indexed.
Secondly, there are libraries that help you to serve parsed content to search engines if you need to, but in my case I didn't bother much with it since Google is indexing js nicely.
I've seen others ask this question, and maybe I'm missing something or this is outdated, but I don't see why AngularJS needs to be an issue with SEO.
Say you have a landing page and it has a bunch of links. Assuming you're using html5 mode in AngularJS (and I'm not sure that's 100% necessary) and something like ng-route then the links on the landing page can work both as "angular" (JavaScript) links and "old school" (full page load) links.
If you're a human user you can click a link and it will do angular magic and adjust the content without loading the full page. Ok, all fine.
But if you instead copy the link and paste it in a new tab or new browser, it will still work - assuming you've set up routes correctly.
I'm not an SEO expert by any stretch of the imagination, but as I understand it, having links that load pages and having those pages have real and useful content is the core of SEO, and done this way, AngularJS should work fine. The key thing to check is if you copy and paste the link (not just click it) that it works.
How does Google Suggest work? How does it manage to update the web page on the client so quickly, based on information in a distant Google database? Why does the web page not look ‘jumpy’ if it is being frequently updated?
It uses AJAX.
When you are writing your query, it searches for the 10 most requested words matching yours. Then it writes minified JSON on an invisible DIV element. Fast, but still resource intensive.
Try to install Firebug on Firefox or use the Developer Console on Chrome, open the console and start writing "Youtube" or whatever you want. You will see the minified JSON responses.
Good luck :D
In addition to the front-end handling others have talked about, which jQuery is a great example of, you might also be interested in how they approach the idea on the backend. Dr. Peter Norvig has written about how to create a spelling corrector, where similar approaches could be used to find close matches.
The whole page is not being updated. Only parts of it are using AJAX - Asynchronous Javascript and XML. Ajax requests can be made in Javascript, and the page updated when the response comes back.
A far more interesting question is how does Google actually search 10bn+ documents in a teeny tiny fraction of a second :)
We have web applications elgifto.com, roadbrake.com in which we used AJAX at many places, especially to update major portions of a page. All the important functionality of elgifto.com was implemented using AJAX. Now we realize a few issues due to AJAX implementation.
All the content implemented using
AJAX is not available to the SEO
bots and it is hurting the page rank
of our site.
Users will not be able to bookmark
some of the pages as they are always
available through AJAX.
When we want to direct the user from
one page through an anchor link to
another page having AJAX, we find it
difficult.
So now we are thinking of removing AJAX for these pages and use it only for small functionality such as something similar to marking a question as favorite in SO. So before going ahead and removing, we want to know expert's opinion on this. Thanks.
The problem is not "AJAX" per se, but your implementation of it. Just as a for instance, you can fix the 'bookmark' problem like google maps does it: provide a generated link for each state of your webapp.
SEO can befixed by supplying various of these state-links to the crawlers, either organically trough links in your site, or by supplying a list (sitemap).
If you implement 2, you can fix 1 and 3 with those links.
In the end you must figure out if the effort is worth it, and if you are not overusing AJAX ofcourse, but the statements you've made are not set in stone at all.
I'm costantly developing ajax based websites, with no problems for SEO at all. You just have to use it in the best possible way.
For example, I have a website with normal links pointing to normal webpages (PHP pages), this for normal navigation if a user doesn't have JS enabled. But if a user has JS enabled, a script will change the links behavior, only fetching the content of the page needed.
This way you still have phisycal separated webpages with all their content, which will be indexed as normal.