I am currently playing around with different scraping techniques and have found that things can get pretty complicated quickly when a lot of JavaScript is involved.
I had some success with HTMLUnit, which seems to interpret JavaScript rather well, but I am looking for a more lightweight solution.
So the problem I am facing now is: I want to retrieve the results of a specific page, which is generated by an AJAX call triggered by a click on a certain button.
The call itself is rather simple: just an HTTP POST to a certain URL with a few parameters submitted in the POST body. The problem I have now is that the server complains when I submit the HTTP POST to the AJAX URL without actually opening the containing page first.
What I basically do for testing is:
curl -v -d "AJAXREQUEST=..." https://myhost/ajaxurl
And what I get is:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Ajax-Response" content="true" />
<meta name="Ajax-Expired" content="View state could't be restored - reload page ?" />
</head>
</html>
The server is running JSF 1.2. What do I have to do to get the results from the AJAX call? I am not really a JSF expert...
If I had to guess, JSF doesn't have a session associated with the request being sent with curl, and therefore the objects associated with the page don't exist. For curl, look at http://curl.haxx.se/docs/httpscripting.html, section 10, cookies. You would have to pull the page, get the cookies, and then do the HTTP POST with those cookies (which starts being a lot of work with curl).
However, I would instead suggest looking at Selenium, which has an IDE that generates Java to interact with the JavaScript.
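If you do want to stay with a plain HTTP client, here is a rough sketch of that pull-the-page-then-post flow in Java with Apache HttpClient 4.3+ instead of curl. The page URL and the exact POST parameters are placeholders; a JSF/RichFaces post usually also needs the other form fields the browser sends, e.g. the javax.faces.ViewState value copied out of the rendered page.

import java.util.Arrays;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class JsfAjaxClient {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // 1. Open the page that contains the button so the server creates a session;
            //    the session cookie (typically JSESSIONID) stays in the client's cookie store.
            try (CloseableHttpResponse page = client.execute(
                    new HttpGet("https://myhost/pageWithButton.jsf"))) { // placeholder URL
                EntityUtils.consume(page.getEntity());
            }

            // 2. Replay the AJAX post inside that same session.
            HttpPost post = new HttpPost("https://myhost/ajaxurl");
            post.setEntity(new UrlEncodedFormEntity(Arrays.asList(
                    new BasicNameValuePair("AJAXREQUEST", "...")
                    // plus whatever other fields the browser sends, e.g. javax.faces.ViewState
            )));
            try (CloseableHttpResponse response = client.execute(post)) {
                System.out.println(EntityUtils.toString(response.getEntity()));
            }
        }
    }
}

The same client instance carries the cookies between the two requests, which is exactly what the curl version would have to do by hand with a cookie jar.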
Related
I am working with a legacy site that grabs some XML content via AJAX, constructs a block of HTML code with it, and then appends it to a blank div. The XML makes heavy use of Arabic text.
It seems to work fine in all browsers except Chrome. In Chrome, page loading will die at the point of appending the string to the div. When I remove the Arabic text from the XML, the page loads just fine.
The HTML being generated has the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
and the XML has this encoding tag:
<?xml version="1.0" encoding="UTF-8"?>
Here is a sample of the XML that is being passed:
<segment>
<content>السَّلامُ عَلَيْكُم.</content>
<linked>true</linked>
<glossWord>السَّلامُ عَلَيْكُم</glossWord>
<glossTrans>Hello. (Literally "Peace be upon you").</glossTrans>
<glossExpl>This is a very commonly used greeting. It works for any time of the day. It can also be used to mean 'goodbye'.</glossExpl>
</segment>
Interesting tidbit: when I went to create this question in Chrome, pasting the above into the form ALSO broke Chrome, and the browser froze solid. I had to reopen it and submit from Firefox. If this is a bug in Chrome, it would be nice to find a way to work around it, as I don't really like the idea of telling people "Don't use X browser" to access a site.
I had a similar issue, and it turned out to be Google Translate in Chrome (on OS X 10.6.8) having issues when I used multiple languages/characters. I got around this by adding the class "notranslate" to the HTML elements that I didn't want Google Translate to bomb out on.
To quickly see if this works for you, add the class "notranslate" to your body and see if the page stops hanging. Hope this works for you!
I find it kind of odd that I haven't been able to find any information on someone with a similar issue. Anyway, I've integrated Spring Security with GWT, and it appears to work correctly... for the most part. I'm having a caching issue with the main HTML page in IE and Chrome.
I've separated out the Spring Security login into a login.jsp that redirects to my Application.html page (the GWT page), and when I first start the app and access the page, it appears to work fine in all browsers: I get directed to the login page because I'm not authenticated.
The issue is that in Chrome or IE, if I close the browser after a successful login and browse directly back to that Application.html URL, it still renders as if I'm authenticated. I look in my console, and the log statements for Spring Security verify that I am not authenticated. The moment I hit F5 to refresh the page, I get directed back to the login.jsp URL.
I'm led to believe this is a caching issue, because when I close the browser and reopen the HTML page, even though it renders as if I'm logged in, the console log statements say I'm not, and if I run in debug mode, onModuleLoad() in Application.java never gets hit.
Finally, this appears to work properly in Firefox... If anyone has seen this issue or has any advice on where I need to look for a fix, I would greatly appreciate the assistance.
I've encountered a similar problem with a web app that I've been working on. I attempted to prevent the browser from caching the page by adding these tags to the page:
<meta http-equiv="Cache-Control" content="no-cache">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="0">
Unfortunately, this wasn't enough to prevent caching for all browsers. I finally ended up converting the page to a JSP page and adding these statements to the top:
<%
response.setHeader("Cache-Control", "no-cache");
response.setHeader("Pragma", "no-cache");
response.setDateHeader("Expires", 0);
%>
I haven't been able to reproduce the problem in Firefox, Chrome or Safari since I made the change. I haven't tested the page with Internet Explorer yet.
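If converting a page to JSP isn't convenient (for a GWT host page like Application.html, for example), the same three headers can be set once in a servlet filter instead. This is just a minimal sketch, assuming a standard servlet container and that you map the filter to the relevant URLs in web.xml:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Applies the same no-cache headers as the JSP scriptlet above to every mapped response.
public class NoCacheFilter implements Filter {
    public void init(FilterConfig config) { }
    public void destroy() { }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        response.setHeader("Cache-Control", "no-cache");
        response.setHeader("Pragma", "no-cache");
        response.setDateHeader("Expires", 0);
        chain.doFilter(req, res);
    }
}

Mapped to Application.html (or a broader URL pattern), this sets the headers without touching the page itself.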
I wanted to know how I can use the Facebook Like button on my AJAX web application so that it captures changes in the Open Graph tags, for both og:title and og:url. I have already created a Facebook app and got an App ID.
What I want to know is what code I need to put on my website in order for Facebook to capture the changes that I've made to the meta tags which contain the title and URL information (i.e. og:title, og:url).
I followed the instructions on Facebook without success. Furthermore, I want to know how I can test the Like button locally to see that it grabs the data from the Open Graph tags properly.
It's also worth mentioning that I have jQuery code that automatically alters the Open Graph meta tags to include the relevant information for the current AJAX-changed page.
Thanks.
You will need a separate URL for each page that you want to allow people to like. I would recommend actually pointing the Like button at the physical pages you're trying to return via the og:url tag. To refresh the data that Facebook stores about a given URL, pass that URL into the linter at http://developers.facebook.com/tools/lint.
I created a rotator file for Facebook sharing on my dynamic AJAX website.
rotator.asp code sample:
<html>
<% lang=request("lang")
id=request("id")
..some sql to get data...
ogTitle=....
ogImage=....
originalUrl=....
%>
<head>
<meta property="og:title" content="<%=ogTitle%>" />
<meta property="og:image" content="<%=ogImage%>" />
.....
......
<meta http-equiv="refresh" content="0; url=<%=origialUrl%>" />
//dont use redirect.. facebook dont allow 302...
</head>
<body></body>
</html>
For example, the page xxx.com/#!/en/153 will share xxx.com/rotator.asp?lang=en&id=153.
Kotaku has launched a new design without hashbangs. Their site still clearly uses AJAX requests, but somehow it is still found through Google, and the content shows up in the page source. How do they do it? Their text seems to be contained inside a script type=text/javascript, but I don't understand what effect that has, or why they would do that.
(Of course, the first page request may just trigger a static, server-side constructed response. But check other articles: it does load JSON through an AJAX request, with no page refresh.)
Have a look at this site for example:
http://kotaku.com/5800326/read-some-of-new-tomb-raider-game-right-now
No hashes, a very well-formed URL, and it appears in Google. I have read the Google AJAX crawling guide, and as far as I understand it, Google only requests an HTML snapshot if you use #! inside your URL.
For your convenience, I have made a screenshot that shows how the text looks inside the Chrome debugger: (what does "ganjaAjaxContent" mean?)
If you search for this article, it is the first match in Google:
Google search for Kotaku article
Being able to do ajax without having to worry about Google search would be excellent.
Kotaku and the other Gawker sites are doing a number of things for SEO:
Submitting XML sitemaps for all of their content
http://kotaku.com/sitemap_today.xml
http://kotaku.com/sitemap.xml
Correct use of title and description tags for Google and Facebook
<title>Read Some of New Tomb Raider Game Right Now</title>
<meta name="fragment" content="!">
<meta name="title" content="Read Some of New Tomb Raider Game Right Now" />
<meta name="description" content="Upcoming Tomb Raider reboot doesn't have a release date yet, but website Siliconera apparently has the game's script and published what's reportedly an excerpt from it. Check it out. [Siliconera]" />
<meta property="og:title" content="Read Some of New Tomb Raider Game Right Now" />
<meta property="og:description" content="Upcoming Tomb Raider reboot doesn't have a release date yet, but website Siliconera apparently has the game's script and published what's reportedly an excerpt from it." />
Displaying HTML post content when Javascript is turned off (inspect the <div class="post-body quick-post"></div> element)
So you're right: Google's first visit loads the semantic, accessible, server-side constructed page. While Google can crawl hashbang pages, it doesn't need to, because all of the pages are indexed via the sitemap.xml.
Hope this answers all of your questions.
p.s. having said all this, hashbangs are still bad for the web
http://www.tbray.org/ongoing/When/201x/2011/02/09/Hash-Blecch
http://isolani.co.uk/blog/javascript/BreakingTheWebWithHashBangs
http://blog.benward.me/post/3231388630
Maybe this has been asked somewhere, but I have been trying to find my question and am not able to find any answer.
Here's my question:
I am developing a web application, and because of some major JavaScript issues in IE8, I need the user to run "Google Chrome Frame" (to enhance the speed of the web page). I was impressed that my page was working 100% fine, until the moment it was supposed to refresh and it wasn't refreshing (an AJAX getJSON request using jQuery).
The problem is that it does not request the new data from the server; it looks like it pulls the answer to that request from the cache and returns the same thing every time instead of new data.
I don't really know how to explain it, but it just does not update. Also, when I hit F5 on the page, it does not update; it keeps the old page (even if I hit CTRL+F5 or any other normal force-refresh shortcut). To get the changes, I actually need to close the browser (IE8) and reopen it so it picks up the new content.
Does anyone know how I could disable the cache when Google Chrome Frame is active?
The meta tags I use are:
<meta http-equiv="expires" content="0">
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv="cache-control" content="no-cache, must-revalidate">
<META HTTP-EQUIV="X-UA-COMPATIBLE" CONTENT="CHROME=1">
If you need any more details, don't hesitate to ask.
An old CGI trick is to encode the date as a parameter on the request so the URL changes with each request. That generally stops any caching of that URL.
So you'd have url?01102010134532 if you encoded the date and time down to milliseconds.
If I understand your requirement properly, you'd have to do this in jQuery/JS and would need to modify the parameter on the URL after each AJAX request is made, so the next one is always different from the previous one. jQuery can also do this for you: passing cache: false to the request (or setting it globally with $.ajaxSetup) makes it append a throwaway timestamp parameter to each GET.