How to get data from JavaScript using HtmlUnit? - htmlunit

How to get data from JavaScript using HtmlUnit?
The value I want is shown on the page as "Title: total shoots" [screenshot of the HTML omitted].
public static void getElements() {
    try (final WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("some URL");
        final HtmlDivision div = page.getHtmlElementById("in-game-stats");
        System.out.println(div.getTextContent());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
What else do I need to do?

First of all you have to find the script element. Because your script tag has no id attribute, something like page.getHtmlElementById is not the right way. HtmlUnit offers many different ways to find elements; as a starting point, have a look at the documentation (http://htmlunit.sourceforge.net/gettingStarted.html).
The next step is to get the JavaScript from the HtmlScript element. If the script code is embedded inside the script tag, you can simply use asXml().
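For illustration, here is how locating an id-less script element and reading its embedded source might look, using the JDK's built-in XPath API on a static snippet (HtmlUnit offers the analogous page.getByXPath; the markup and the stats variable below are invented for this example):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class FindScript {
    // Locate the (id-less) script element by tag name and return its text.
    static String extractScriptText(String markup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        // Select by element name instead of by id; with HtmlUnit the
        // analogous call would be page.getByXPath("//script").
        return XPathFactory.newInstance().newXPath()
                .evaluate("//script/text()", doc);
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<div id='in-game-stats'>Title: total shoots 7</div>"
                + "<script>var stats = { shoots: 7 };</script>"
                + "</body></html>";
        System.out.println(extractScriptText(html));
    }
}
```

Note this only works when the markup is well-formed enough for an XML parser; real pages are exactly what HtmlUnit's own tree and getByXPath are for.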

There is no direct method to process JavaScript objects, but you can grab the script tag's content and parse it yourself.
Use (HtmlElement).asText() to extract data from the div tag.
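Once you have the div's text via asText() or getTextContent(), pulling the actual number out is plain string work. A small sketch, assuming the text looks like the "Title: total shoots" line from the question (the exact label and format are assumptions):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StatExtractor {
    // Pull the first integer that follows the assumed label out of free text.
    static int extractShoots(String divText) {
        Matcher m = Pattern.compile("total shoots\\D*(\\d+)").matcher(divText);
        if (!m.find()) {
            throw new IllegalArgumentException("label not found in: " + divText);
        }
        return Integer.parseInt(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(extractShoots("Title: total shoots 12"));
    }
}
```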

Related

PDF returned as byte array through ajax. Either getting ERR_INVALID_URL or blank pdf appears depending on parameters

Code from the Spring controller that sends the byte array (PDF) back to the ajax call:
response.setHeader("Pragma", "no-cache");
response.setHeader("Cache-control", "private");
response.setDateHeader("Expires", 0);
response.setContentType("application/pdf");
if (pdf != null) {
    response.setContentLength(pdf.length);
    try {
        ServletOutputStream out = response.getOutputStream();
        out.write(pdf);
        out.flush();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
This PDF is added dynamically to a dialog that can be opened later; "data" contains the byte array:
var obj = $('<object width=100% height=100% style="float:left;"'
        + ' type="application/pdf"'
        + ' data="data:application/pdf,'
        + escape(data)
        + '"></object>');
$('#adjDialog' + rapRow).append(obj);
The result is that the PDF appears to be rendered by a Chrome extension, but the data attribute is not properly set. I see an empty PDF in the viewer, and when I inspect it with developer tools I find this markup (I didn't write this; it was generated by Chrome):
<embed id="plugin" type="application/x-google-chrome-pdf" src="data:"
    stream-url="blob:chrome-extension://mhjfbmdgcfjbbpaeojofohoefgiehjai/E4E5DEA1-DDF5-4300-9C43-8634ED843E50"
    headers="" background-color="0xFF525659" top-toolbar-height="56">
The data is empty, but I would expect it to be the URL I provided.
If I add ";base64" to the data attribute, nothing loads and I get
GET data:application/pdf;base64,%25PDF-1.4%0A%25%uFFFD%uFFFD%uFFFD%uFFFD%0A1%20…431343532353232343430323633%3E%5D%3E%3E%0Astartxref%0A84512%0A%25%25EOF%0A net::ERR_INVALID_URL
I've been taking hints from several different code examples on Stack Overflow, but nothing seems quite like mine. I'm not using any PHP, no database, and I'm not saving the file on the server.
I fixed this by having the HTML "data" attribute point to the HTTP request that built the PDF response, instead of making a separate ajax call and trying to patch in the result.
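To make that fix concrete: the object tag's data attribute should point at a URL that serves the PDF bytes, so the browser fetches the document directly. A minimal sketch of such an endpoint using only the JDK's built-in HTTP server (the /report.pdf path and buildPdf() are invented stand-ins; in the original setup the existing Spring controller already plays this role):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class PdfEndpoint {
    // Stand-in for whatever code builds the PDF byte array.
    static byte[] buildPdf() {
        return "%PDF-1.4 ...".getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // The page's <object data="/report.pdf"> points here directly, so the
        // browser streams the bytes itself; no ajax call, no data: URI.
        server.createContext("/report.pdf", exchange -> {
            byte[] pdf = buildPdf();
            exchange.getResponseHeaders().set("Content-Type", "application/pdf");
            exchange.sendResponseHeaders(200, pdf.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(pdf);
            }
        });
        server.start();
    }
}
```

This avoids the URL-length and encoding pitfalls of stuffing a whole PDF into a data: URI.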

How to successfully parse images from within content:encoded tags using SAX?

I am trying to parse and display images from a feed that carries the image URL inside content:encoded tags, as an embedded element such as <img src="http://someImage.jpg">.
Note: http://someImage.jpg is not a real image link; this is just an example. This is what I have done so far.
public void startElement(String uri, String localName, String qName, Attributes atts) {
    chars = new StringBuilder();
    if (qName.equalsIgnoreCase("content:encoded")) {
        if (!atts.getValue("src").toString().equalsIgnoreCase("null")) {
            feedStr.setImgLink(atts.getValue("src").toString());
            Log.d(TAG, "inside if " + feedStr.getImgLink());
        } else {
            feedStr.setImgLink("");
            Log.d(TAG, feedStr.getImgLink());
        }
    }
}
I believe this part of my program needs to be tweaked. First, when qName equals "content:encoded" the parsing stops: the application just runs endlessly and displays nothing. Second, if I change that initial if to anything qName cannot equal, like "purplebunny", everything works perfectly, except there are no images. What am I missing? Am I using atts.getValue properly? I have used Log to see what comes up in ImgLink, and it is always null.
You can store the content:encoded data in a String and then extract the image with the Jsoup library.
Example, assuming the raw content:encoded data is stored in a variable named description:
Document doc = Jsoup.parse(description);
Element image = doc.select("img").first();
String url = image.absUrl("src");
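For completeness, the reason the original startElement never works: src is not an attribute of the content:encoded element itself; the img markup arrives as character data, so atts.getValue("src") is always null. You have to accumulate the text in characters() and parse it afterwards, e.g. with Jsoup as above. If adding a library is not an option, here is a rough stdlib sketch of grabbing the first src (regex-based HTML parsing is fragile and only a stopgap):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcExtractor {
    private static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]*\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
                    Pattern.CASE_INSENSITIVE);

    // Returns the first img src found in the encoded HTML, or "" if none.
    static String firstImgSrc(String encodedHtml) {
        Matcher m = IMG_SRC.matcher(encodedHtml);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String html = "<p>text</p><img src=\"http://someImage.jpg\" alt=\"x\">";
        System.out.println(firstImgSrc(html));
    }
}
```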

Windows Phone C#: check for a valid URL and replace it for each item in a list

I am getting a list of objects in Windows Phone and showing them in a ListBox with data binding.
Some image URLs are not valid, so after every object is added to the list, I run the following code to check the URL and replace it if it is not valid:
private void CheckLinkUrl(Person p)
{
    Uri filePath = new Uri(p.img_url);
    string correct = p.img_url;
    HttpWebRequest fileRequest = HttpWebRequest.CreateHttp(filePath);
    fileRequest.Method = "HEAD";
    fileRequest.BeginGetResponse(result =>
    {
        HttpWebRequest resultInfo = (HttpWebRequest)result.AsyncState;
        HttpWebResponse response;
        try
        {
            response = (HttpWebResponse)resultInfo.EndGetResponse(result);
        }
        catch (Exception e)
        {
            p.img_url = "http://somethingelse.com/image.jpg";
        }
    }, fileRequest);
}
The problem is that it is very slow; it sometimes takes 2+ minutes to load every image (although the UI remains responsive, and everything else in the listbox is displayed immediately, apart from the images).
Am I doing something wrong? Can I get it to run faster?
EDIT:
I tried using the ImageFailed event and replacing the link there; no improvement in the speed of loading the pictures.
What I have done to avoid this problem in my application is to load the items with a default image. The image source is bound to a property of type ImageSource on my result item; by default it returns the default image. After processing or download completion, the image source value changes to the new image, triggering the NotifyPropertyChanged event, so the change is automatically reflected on the UI. I hope it helps you.
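The same placeholder-then-swap pattern, sketched in plain Java with java.beans.PropertyChangeSupport standing in for C#'s INotifyPropertyChanged (the class and property names are invented for illustration):

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

public class PersonViewModel {
    static final String DEFAULT_IMAGE = "placeholder.png";

    private final PropertyChangeSupport pcs = new PropertyChangeSupport(this);
    // The UI binds to this; it shows the placeholder until validation finishes.
    private String imageUrl = DEFAULT_IMAGE;

    public String getImageUrl() { return imageUrl; }

    // Called from the background URL check once the real image is validated.
    public void setImageUrl(String url) {
        String old = this.imageUrl;
        this.imageUrl = url;
        pcs.firePropertyChange("imageUrl", old, url); // UI re-reads the property
    }

    public void addPropertyChangeListener(PropertyChangeListener l) {
        pcs.addPropertyChangeListener(l);
    }
}
```

The point is that the list renders immediately with placeholders, and each image swaps in whenever its check completes, instead of the whole list waiting on slow HEAD requests.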

Failing to extract content by xpath using HtmlUnit

I'm trying to extract the title from this Maltese news page
http://www.maltarightnow.com/Default.asp?module=news&at=Inawgurat+%26%23289%3Bnien+%26%23289%3Bdid+f%27Marsalforn&t=a&aid=99839603&cid=19
using the following XPath:
html/body/table/tbody/tr[2]/td/table/tbody/tr[4]/td/table/tbody/tr[1]/td[1]/table/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr[1]/td/h1
(It ain't pretty, but this XPath was generated by Chrome, and it makes sense given the lack of element IDs.)
I'm extracting the title programmatically using HtmlUnit in Java. Here's the code. I've extracted the news content and article date using the same code (obviously with a different XPath).
public static void main(String[] args) {
    WebClient webClient = new WebClient();
    HtmlPage page = null;
    try {
        page = webClient.getPage("http://www.maltarightnow.com/?module=news&at=Inawgurat+%26%23289%3Bnien+%26%23289%3Bdid+f%27Marsalforn&t=a&aid=99839603&cid=19");
    } catch (FailingHttpStatusCodeException | IOException e) {
        e.printStackTrace();
    }
    String text = ((DomElement) page.getFirstByXPath("html/body/table/tbody/tr[2]/td/table/tbody/tr[4]/td/table/tbody/tr[1]/td[1]/table/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr[1]/td/h1")).asText();
    System.out.println(text);
}
However, it's giving a null pointer for the mentioned XPath in
((DomElement) page.getFirstByXPath("html/body/table/tbody/tr[2]/td/table/tbody/tr[4]/td/table/tbody/tr[1]/td[1]/table/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr[1]/td/h1")).asText();
The DomElement is not being found, and I'm sure it's there; Chrome created the XPath, after all.
What could be the cause of this?
Thanks in advance.
It is not that easy. You should:
Look at the markup HtmlUnit is actually building, using page.asXml().
Correct the XPath you're traversing to match whatever HtmlUnit outputs in the previous step.
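A frequent source of such mismatches is tbody: DevTools shows the browser's DOM, where tbody is inserted into every table, while the tree your code queries may have been built from source that never contained one. A small JDK-only illustration of how a path written for one tree misses the other (the markup is invented):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathMismatch {
    // Evaluate an XPath against parsed markup; returns null if no match.
    static Node find(String markup, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // The source omits tbody, as many hand-written pages do.
        String html = "<html><body><table><tr><td><h1>Headline</h1></td></tr></table></body></html>";
        // A Chrome-style path assumes a tbody this tree never had: no match.
        System.out.println(find(html, "/html/body/table/tbody/tr/td/h1"));
        // Matching the tree that was actually built succeeds.
        System.out.println(find(html, "/html/body/table/tr/td/h1").getTextContent());
    }
}
```

Comparing page.asXml() against the DevTools path, element by element, usually exposes exactly where the two trees diverge.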

Using webclient with htmlAgilityPack on wp7 to get html generated from javascript

I want to get the time schedule on
http://www.21cineplex.com/playnow/sherlock-holmes-a-game-of-shadows,2709.htm
First, I tried using WebClient with HtmlAgilityPack to get to the table with id "table-theater", but apparently the HTML is generated from JavaScript, so the table's innerHTML is empty.
public void LoadMovieShowTime(string MovieLink)
{
    WebClient MovieShowTimeclient = new WebClient();
    // Subscribe before starting the asynchronous download
    MovieShowTimeclient.DownloadStringCompleted += new DownloadStringCompletedEventHandler(MovieShowTimeclient_DownloadStringCompleted);
    MovieShowTimeclient.DownloadStringAsync(new Uri(MovieLink));
}

void MovieShowTimeclient_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(e.Result);
    var node = doc.DocumentNode.Descendants("div").First()
        .Elements("div").Skip(1).First()
        .Elements("div").Skip(1).First()
        .Element("div")
        .Elements("table").FirstOrDefault(table => table.Attributes["class"].Value == "table-theater");
}
Is it possible to get the data using WebClient on Windows Phone? Or is there any possible way to get it using another method?
Second, I tried to get the time schedule from the mobile site, which is
http://m.21cineplex.com/gui.list_schedule?sid=&movie_id=11SHGO&find_by=1&order=1
but the response asks me to enable cookies. I'm new to this; I found that there is a way to extend WebClient's abilities by overriding the WebRequest's cookies, but I can't find any reference on how to use it.
Thanks for any reply and help :)
Just because the table is generated by JavaScript does not mean the WebBrowser control will not render it. Ensure that IsScriptEnabled is set to true; this ensures the JavaScript that renders the table is executed. You can then 'scrape' the results.
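For the cookie issue in the second part of the question: the general fix is to attach a cookie container to the underlying request so that session cookies are sent back on subsequent calls. As a language-neutral sketch, here is the same idea in plain Java using the JDK's CookieManager (on Windows Phone the analogue is subclassing WebClient to set a shared CookieContainer on its HttpWebRequest):

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;

public class CookieAwareFetch {
    // Install a process-wide cookie store; HttpURLConnection then resends
    // any Set-Cookie values it receives on subsequent requests.
    static CookieManager installCookieStore() {
        CookieManager cookies = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(cookies);
        return cookies;
    }

    public static void main(String[] args) throws Exception {
        CookieManager cookies = installCookieStore();
        HttpURLConnection conn = (HttpURLConnection) new URL(
                "http://m.21cineplex.com/gui.list_schedule?sid=&movie_id=11SHGO&find_by=1&order=1")
                .openConnection();
        System.out.println(conn.getResponseCode());
        // Cookies set by the server now live in cookies.getCookieStore()
        // and are attached automatically to the next request to that host.
    }
}
```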
