DocumentNode.SelectSingleNode returns null - html-agility-pack

I want to get the time from this URL, "https://www.toutiao.com/a6619068128406028804/", with HtmlAgilityPack. My code is as follows:
string url = "https://www.toutiao.com/a6619068128406028804/";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
HtmlNode node_time = doc.DocumentNode.SelectSingleNode("/html/body/div[1]/div[2]/div[2]/div[1]/div[1]/span[2]");
string time = node_time.InnerText.Trim();
node_time is always null. How can I get the content of the time tag?

The problem is not the XPath selector; it is that those elements are rendered client-side. If you look at the initial GET response (in Chrome's developer tools, Fiddler, or a similar tool), you will see that those elements are not there. However, the response does contain an "articleInfo" JSON object inside the "BASE_DATA" JSON string. Normally you would parse out that string and deserialize it, which gives you a structured object to read data from. I usually use Visual Studio's "Paste JSON As Classes" feature for this, but this payload looks too complicated for that, and it is mostly outside the scope of your question.
Also note that the object does get loaded into JavaScript, but you cannot access it with HAP. With a headless browser you could access that object directly by executing JavaScript in the page.
So you can either parse out the JSON string manually or switch to something like a headless browser, where the JavaScript is actually executed.
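The "parse out the JSON string manually" option can be sketched as follows. This is only an illustration in Python, not HAP code: the sample markup, the `time` field, and the helper name `extract_embedded_json` are assumptions, and the real page may embed a JavaScript object literal that is not strict JSON, in which case this approach would need adapting.

```python
import json
import re

# Hypothetical page source: the real markup and field names may differ.
SAMPLE_HTML = """
<html><body>
<script>var BASE_DATA = {"articleInfo": {"title": "Example", "time": "2018-11-02 10:30:00"}};</script>
</body></html>
"""


def extract_embedded_json(html, var_name):
    """Return the value assigned to `var_name` in a script, parsed as JSON."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ";</script>").
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


data = extract_embedded_json(SAMPLE_HTML, "BASE_DATA")
article_time = data["articleInfo"]["time"]
```

The same two steps (locate the assignment, then deserialize the value) apply in C#: find the string in `doc.DocumentNode.OuterHtml` and hand it to a JSON deserializer.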

Related

How to parse a string as a DOM element in Golang using Colly

I'm new to Go and I am using it with Colly to scrape a website, but I am having problems with the noscript tag: its content is not parsed, just returned as a string. I want to turn that string into a colly HTMLElement so that I can query it like a normal tag.
How can I do that?
The website I want to scrape is the Chrome Web Store
I have not found a good way to create an HTMLElement directly. However, you can parse the string into a goquery Document and run the same queries on it (this part is plain goquery, not gocolly):
// Parse the raw string into a goquery document (error ignored here for brevity).
var doc, _ = goquery.NewDocumentFromReader(strings.NewReader("<p><a>Your element</a></p>"))
// Find searches the descendants of the document root.
doc.Find("your selector here")

How to get a HTTPRequest JSON response without using any kind of template?

I am new to Django, but I am an experienced programmer in other frameworks.
What I intend to do:
Press a form button, triggering JavaScript that fires an Ajax request, which is processed by a Django view (it creates a file) that returns plain JSON data (the name of the file). That name is then appended as a link to a DOM element named 'downloads'.
What I have achieved so far instead:
Press the button, triggering JavaScript that fires an Ajax request, which is processed by a Django view (it creates a file) that returns the whole page, which is appended as a duplicate to the DOM element named 'downloads' (instead of simple JSON data).
Here is the extracted code from the corresponding Django view:
context = {
'filename': filename
}
data['filename'] = render_to_string(current_app+'/json_download_link.html', context)
return HttpResponse(json.dumps(data), content_type="application/json")
I tried several variants (like https://stackoverflow.com/a/2428119/850547), with and without a RequestContext object, and different rendering strategies. I am out of ideas now.
It seems to me that there is no way to make Ajax requests without using a template in the response (but I hope I am wrong).
But even then: why is Django returning the main template (with the full DOM), which I have NOT passed to the context?
I just want JSON data, nothing more!
I hope my problem is understandable. If you need more information, let me know and I will add it.
EDIT:
for the upcoming questions - json_download_link.html looks like this:
Download
But I don't even want to use that!
The corresponding jQuery:
$.post(url, form_data)
.done(function(result){
$('#downloads').append(' Download CSV')
})
I don't understand your question. Of course you can make an Ajax request without using a template. If you don't want to use a template, don't use a template. If you just want to return JSON, then do that.
Without having any details of what's going wrong, I would imagine that your Ajax request is not hitting the view you think it is, but is going to the original full-page view. Try adding some logging in the view to see what's going on.
There is no need to return the full template. You can return parts of template and render/append them at the frontend.
A template can be as small as you want. For example this is a template:
name.html
<p>My name is {{name}}</p>
You can return only this template with json.dumps() and append it on the front end.
What is your json_download_link.html?
Assuming example.csv is a string:
data = {}
data['filename'] = u'example.csv'
return HttpResponse(json.dumps(data), content_type="application/json")
Is this what you are looking for?
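To make the difference from the template-based version concrete, here is a framework-independent sketch of the payload only. The helper name `build_download_payload` is hypothetical; the point is that the response body carries just the file name and no rendered HTML.

```python
import json


def build_download_payload(filename):
    """Serialise only the data the client needs, with no rendered template."""
    return json.dumps({"filename": filename})


body = build_download_payload("example.csv")
# In the Django view this string would be returned as
# HttpResponse(body, content_type="application/json"), as in the answer above.
```

On the client, the `done` callback would then read `result.filename` and build the link itself, rather than appending markup sent by the server.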

Need the `location.href` value in JSTL (or JSP)

All I need is the value of <script>location.href</script> in JSTL (or JSP),
exactly as the web browser displays it. But it's not that easy.
I'm using Tiles2 and request.getRequestURL() shows tiles base jsp location like /WEB-INF/tiles/base.jsp...
And I found <%=request.getAttribute("javax.servlet.forward.request_uri")%> shows what I want.
https://stackoverflow.com/a/11387378/411615
But that is still not enough: it does not show the protocol and domain.
So I searched again.
http://${pageContext.request.localName}<%=request.getAttribute("javax.servlet.forward.request_uri")%>
It looks dirty and does not contain the parameters. I would have to add them like this...
Map<String, String[]> parameters = request.getParameterMap();
for(String parameter : parameters.keySet()) {
...
}
But that is awkward. In JavaScript it is so simple, but in JSP it is very tough.
Is there any neat way?
My environment is Spring 3.1 and Tiles2.
The Request object is all you get on the server side. There's no way to know what was in the original href link because the browser sends the fully assembled URL to the server. Your options are 1) use some javascript trickery on the client side to send the original href as a parameter or header value, or 2) poke around in the Request object to get what you need. 2) is better.
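Option 2 amounts to reassembling the URL from the pieces the request exposes. The logic is sketched below in Python purely as an illustration; `rebuild_url` and its parameters are hypothetical names. In a servlet, the inputs would come from calls such as `request.getScheme()`, `request.getServerName()`, `request.getServerPort()`, the forwarded request URI attribute, and `request.getQueryString()`.

```python
from urllib.parse import urlunsplit


def rebuild_url(scheme, host, port, path, query=""):
    """Assemble scheme://host[:port]/path[?query] from request parts."""
    default_port = {"http": 80, "https": 443}.get(scheme)
    # Browsers hide the default port, so leave it out to match location.href.
    netloc = host if port in (None, default_port) else f"{host}:{port}"
    return urlunsplit((scheme, netloc, path, query or "", ""))


full_url = rebuild_url("http", "example.com", 8080, "/board/list", "page=2")
```

Using the raw query string avoids rebuilding the parameters by looping over the parameter map. The fragment (the part after `#`) is never sent to the server, so it cannot be recovered this way.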

Web Programming with AJAX, Problem with caching (I think)

Web programmer here - using AJAX (HTML, CSS, JavaScript, AJAX, PHP, MySQL), but for some reason Internet Explorer is acting up (surprise surprise).
AJAX is updating query results on the HTML page, via a PHP script that queries a MySQL Database.
Everything is working fine, except when I use Internet Explorer 8.0.
There are several php scripts, which allow for the data to be ordered according to certain criteria, and for testing purposes I have attached the mktime field (current time, in the format HH:MM:SS) to the beginning of the results for each query.
When I use IE, these times appear to remain constant, whereas with ALL other browsers these times are correct and display the current time.
I think the issue has something to do with caching or something along those lines anyway.
Any thoughts or suggestions welcome...
Here is an article on the caching issue.
If your request is a GET change it to a POST, this will prevent the results being cached.
GET requests are cached in IE; switch it to a POST request and it won't be cached anymore.
Instead of switching to POST, which can be ugly if you're not really using it to update or create content, you should append a random number to the query string, as in http://domain.com/ajax/some-request?r=123456. If this number is unique for every request you won't have caching problems.
What I have done is keep the GET and add a new dummy query parameter to the query string, as follows:
./BaseServlet?sname=3d_motor&calcdir=20110514&dummyParam=datetime
I set dummyParam to the value of a Date object in JavaScript, so that every time the URL is generated the browser treats it as a new URL and fetches fresh results.
var d = new Date();
url = url + '&dummyParam=' + d.valueOf();
So instead of generating random numbers, this is an easy way!
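The cache-busting idea from the last two answers is independent of the client language. Below is a minimal Python sketch of it, assuming the parameter name `dummyParam` from the answer above; the helper `add_cache_buster` is a hypothetical name. It also replaces any earlier value of the parameter, so the URL does not grow when it is reused.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url, param="dummyParam", value=None):
    """Return `url` with `param` set to a fresh value (a timestamp by default)."""
    parts = urlsplit(url)
    # Drop any previous cache-busting value so repeated calls do not stack up.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    stamp = value if value is not None else time.time_ns()
    query.append((param, str(stamp)))
    return urlunsplit(parts._replace(query=urlencode(query)))


fresh = add_cache_buster("http://domain.com/ajax/some-request?sname=3d_motor")
```

A timestamp is only unique if two requests are not built within the same clock tick; a counter or random suffix could be added if that matters.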

How can I prevent IE Caching from causing duplicate Ajax requests?

We are using the Dynamic Script Tag with JsonP mechanism to achieve cross-domain Ajax calls. The front end widget is very simple. It just calls a search web service, passing search criteria supplied by the user and receiving and dynamically rendering the results.
Note - For those that aren’t familiar with the Dynamic Script Tag with JsonP method of performing Ajax-like requests to a service that return Json formatted data, I can explain how to utilise it if you think it could be relevant to the problem.
The service is WCF hosted on IIS. It is Restful so the first thing we do when the user clicks search is to generate a Url containing the criteria. It looks like this...
https://.../service.svc?criteria=john+smith
We then use a dynamically created Html Script Tag with the source attribute set to the above Url to make the request to our service. The result is returned and we process it to show the results.
This all works fine, but we noticed that when using IE the service receives the request from the client twice. I used Fiddler to monitor the traffic leaving the browser, and sure enough I see two requests with the following URLs...
Request 1: https://.../service.svc?criteria=john+smith
Request 2: https://.../service.svc?criteria=john+smith&_=123456789
The second request has been appended with some kind of Id. This Id is different for every request.
My immediate thought is it was something to do with caching. Adding a random number to the end of the url is one of the classic approaches to disabling browser caching. To prove this I adjusted the cache settings in IE.
I set “Check for newer versions of stored pages” to “Never” – This resulted in only one request being made every time. The one with the random number on the end.
I set this setting value back to the default of “Automatic” and the requests immediately began to be sent twice again.
Interestingly I don’t receive both requests on the client. I found this reference where someone is suggesting this could be a bug with IE. The fact that this doesn’t happen for me on Firefox supports this theory.
Can anyone confirm if this is a bug with IE? It could be by design.
Does anyone know of a way I can stop it happening?
Some of the more vague searches that my users will run take up enough processing resource to make doubling up anything a very bad idea. I really want to avoid this if at all possible :-)
I just wrote an article on how to avoid caching of ajax requests :-)
It basically involves adding the no cache headers to any ajax request that comes in
public abstract class MyWebApplication : HttpApplication
{
    protected MyWebApplication()
    {
        this.BeginRequest += new EventHandler(MyWebApplication_BeginRequest);
    }
    void MyWebApplication_BeginRequest(object sender, EventArgs e)
    {
        string requestedWith = this.Request.Headers["x-requested-with"];
        if (!string.IsNullOrEmpty(requestedWith) && requestedWith.Equals("XMLHttpRequest", StringComparison.InvariantCultureIgnoreCase))
        {
            this.Response.Expires = 0;
            this.Response.ExpiresAbsolute = DateTime.Now.AddDays(-1);
            this.Response.AddHeader("pragma", "no-cache");
            this.Response.AddHeader("cache-control", "private");
            this.Response.CacheControl = "no-cache";
        }
    }
}
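The decision logic in that handler can be sketched independently of ASP.NET. The following Python version is an illustration only: `cache_headers_for` is a hypothetical helper, and the header values are a commonly used no-cache set, not the exact ones from the C# code above.

```python
# A commonly used set of response headers that tell clients not to cache.
NO_CACHE_HEADERS = {
    "Cache-Control": "no-cache, no-store, must-revalidate",
    "Pragma": "no-cache",
    "Expires": "0",
}


def cache_headers_for(request_headers):
    """Return no-cache headers for XMLHttpRequest calls, else an empty dict."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in request_headers.items()}
    requested_with = lowered.get("x-requested-with", "")
    if requested_with.lower() == "xmlhttprequest":
        return dict(NO_CACHE_HEADERS)
    return {}


headers = cache_headers_for({"X-Requested-With": "XMLHttpRequest"})
```

Note that a request issued through a script tag would not normally carry the X-Requested-With header, so a check like this would likely not cover the JSONP calls described in the question.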
I eventually established the reason for the duplicate requests. As I said, the mechanism I chose for making Ajax calls was Dynamic Script Tags. I built the request URL, created a new script element, and assigned the URL to its src property...
var script = document.createElement("script");
script.src = "https://....";
Then I executed the script by appending it to the document head. Crucially, I was using the jQuery append function...
$("head").append(script);
Inside the append function JQuery was anticipating that I was trying to make an Ajax call. If the type of element being appended is a Script, then it executes a special routine that makes an Ajax request using the XmlHttpRequest object. But the script was still being appended to the document head, and being executed there by the browser too. Hence the double request.
The first came direct from the script – the one I intended to happen.
The second came from inside the JQuery append function. This was the request suffixed with the randomly generated query string argument in the form “&_=123456789”.
I simplified things by preventing the JQuery library side effect. I used the native append function...
document.getElementsByTagName("head")[0].appendChild(script);
One request now happens in the way I intended. I had no idea that the JQuery append function could have such a significant side effect built in.
See www.enhanceie.com/redir/?id=httpperf for further discussion.

Resources