Failing to extract content by XPath using HtmlUnit

I'm trying to extract the title from this Maltese news page
http://www.maltarightnow.com/Default.asp?module=news&at=Inawgurat+%26%23289%3Bnien+%26%23289%3Bdid+f%27Marsalforn&t=a&aid=99839603&cid=19
using the following XPath
html/body/table/tbody/tr[2]/td/table/tbody/tr[4]/td/table/tbody/tr[1]/td[1]/table/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr[1]/td/h1
(Ain't pretty, but this XPath was generated by Chrome and makes sense given the lack of element IDs.)
I'm extracting the title programmatically using HtmlUnit in Java. Here's the code. I've extracted news content and article date using the same code (obviously with a different XPath).
public static void main(String[] args) {
    WebClient webClient = new WebClient();
    HtmlPage page = null;
    try {
        page = webClient.getPage("http://www.maltarightnow.com/?module=news&at=Inawgurat+%26%23289%3Bnien+%26%23289%3Bdid+f%27Marsalforn&t=a&aid=99839603&cid=19");
    } catch (FailingHttpStatusCodeException | IOException e) {
        // exception swallowed: if getPage fails, page stays null
    }
    String text = ((DomElement) page.getFirstByXPath("html/body/table/tbody/tr[2]/td/table/tbody/tr[4]/td/table/tbody/tr[1]/td[1]/table/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr[1]/td/h1")).asText();
    System.out.println(text);
}
However, it's throwing a NullPointerException for the mentioned XPath in
((DomElement)page.getFirstByXPath("html/body/table/tbody/tr[2]/td/table/tbody/tr[4]/td/table/tbody/tr[1]/td[1]/table/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr[1]/td/h1")).asText();
The DomElement is not being found, but I'm sure it's there; Chrome created the XPath, after all.
What could be the cause of this?
Thanks in advance

It is not that easy. You should:
Look at the HTML HtmlUnit is actually working with, by printing page.asXml()
Correct the XPath you're traversing so it matches what HtmlUnit produced in the previous step
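For instance, a minimal sketch of that debugging flow (the //h1 shortcut is an assumption that the headline is the page's only h1; confirm it against the asXml() dump before relying on it):
import java.io.IOException;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class XPathDebug {
    public static void main(String[] args) {
        WebClient webClient = new WebClient();
        try {
            HtmlPage page = webClient.getPage("http://www.maltarightnow.com/?module=news&at=Inawgurat+%26%23289%3Bnien+%26%23289%3Bdid+f%27Marsalforn&t=a&aid=99839603&cid=19");
            // 1. Dump the DOM exactly as HtmlUnit built it
            System.out.println(page.asXml());
            // 2. Prefer a short XPath over the brittle Chrome-generated chain;
            //    a browser's DOM and HtmlUnit's DOM often differ (e.g. implicit tbody elements)
            DomElement h1 = page.getFirstByXPath("//h1");
            System.out.println(h1 == null ? "no h1 found" : h1.asText());
        } catch (FailingHttpStatusCodeException | IOException e) {
            e.printStackTrace();
        }
    }
}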

Related

How to get data from JavaScript using HtmlUnit?

How to get data from JavaScript using HtmlUnit?
[Screenshots from the original post: the "total shoots" stat as rendered, and the page's HTML]
public static void getElements() {
    try (final WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("some URL");
        final HtmlDivision div = page.getHtmlElementById("in-game-stats");
        System.out.println(div.getTextContent());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
What else?
First of all, you have to find the script element. Because your script tag has no id attribute, something like page.getHtmlElementById is not the right way. HtmlUnit offers many different ways to find elements; as a starting point, have a look at the documentation (http://htmlunit.sourceforge.net/gettingStarted.html).
The next step is to get the JavaScript from the HtmlScript element. If the script code is embedded inside the script tag, you can simply use asXml().
There is no direct method to process JavaScript objects, but you can extract the tag's content and parse it yourself.
Use (HtmlElement).asText() to extract the data from a div tag.
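A minimal sketch of those steps ("some URL" is the placeholder from the snippet above, and //script simply selects every script tag since there is no id to go by):
import java.io.IOException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlScript;

public class ScriptDump {
    public static void main(String[] args) {
        try (final WebClient webClient = new WebClient()) {
            final HtmlPage page = webClient.getPage("some URL"); // placeholder, as above
            // The script tags have no id, so select them by tag name via XPath
            for (Object o : page.getByXPath("//script")) {
                HtmlScript script = (HtmlScript) o;
                // asXml() includes any JavaScript source embedded in the tag
                System.out.println(script.asXml());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}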

How to successfully parse images from within content:encoded tags using SAX?

I am trying to parse and display images from a feed that has the image URL inside content:encoded tags. An example is this:
*Note>> http://someImage.jpg is not a real image link, this is just an example. This is what I have done so far.
public void startElement(String uri, String localName, String qName, Attributes atts) {
    chars = new StringBuilder();
    if (qName.equalsIgnoreCase("content:encoded")) {
        if (!atts.getValue("src").toString().equalsIgnoreCase("null")) {
            feedStr.setImgLink(atts.getValue("src").toString());
            Log.d(TAG, "inside if " + feedStr.getImgLink());
        } else {
            feedStr.setImgLink("");
            Log.d(TAG, feedStr.getImgLink());
        }
    }
}
I believe this part of my program needs to be tweaked. First, when qName is equal to "content:encoded", the parsing stops. The application just runs endlessly and displays nothing. Second, if I change that initial if to anything that qName cannot equal, like "purplebunny", everything works perfectly, except there are no images. What am I missing? Am I using atts.getValue properly? I have used Log to see what comes up in ImgLink, and it is always null.
content:encoded carries its HTML as character data rather than as attributes, which is why atts.getValue("src") always comes back null. You can store that data in a String, then extract the image from it with the Jsoup library.
Example:
Suppose the raw content:encoded data is stored in a Description variable.
Document doc = Jsoup.parse(Description);
Element image = doc.select("img").first();
String url = image.absUrl("src");
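To make the division of labour concrete, here is a sketch of how the SAX handler and Jsoup could fit together (FeedHandler is a hypothetical name, and the feedStr bookkeeping from the question is omitted for brevity):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class FeedHandler extends DefaultHandler {
    private StringBuilder chars = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        chars = new StringBuilder(); // reset the buffer for each new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // content:encoded delivers its HTML here as character data,
        // not as attributes on the element
        chars.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("content:encoded")) {
            Document doc = Jsoup.parse(chars.toString());
            Element image = doc.select("img").first();
            if (image != null) {
                // absUrl resolves against a base URI if one is known;
                // attr("src") would return the raw attribute value instead
                System.out.println(image.absUrl("src"));
            }
        }
    }
}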

Clicking Open Word Document tries to reconnect to Controller action while downloading

I have a requirement to download a file from the server, but open it inline if possible. I'm currently doing:
Response.AddHeader("Content-Disposition", string.Format("inline; filename={0}", documentFileName));
result = new FileStreamResult(new FileStream(documentFilePath, FileMode.Open), "application/msword");
I've put application/msword in there right now, because that's what I'm having a problem with. When I click Open on the Word document, it's as if the document makes multiple calls back to the action, but there is no session and no database, so it crashes. When the user runs this, they see a long hang, the "Downloading" dialog finally appears in Word, and they have to cancel it. The document is there and is valid, but this is not desirable.
PDFs, PNGs, etc. download fine. Can anybody explain this behavior and give me some hints as to how to fix it?
Update:
The action basically looks like:
[HttpPost]
public FileResult View(int id, int source)
{
    var document = GetDocumentFromDatabase(id, source);
    documentFilePath = Path.Combine(documentsDirectory, document.Name);
    documentName = document.Name;
    Response.AddHeader("Content-Disposition", string.Format("inline; filename={0}", documentFileName));
    result = new FileStreamResult(new FileStream(documentFilePath, FileMode.Open), "application/msword");
    return result;
}
I've trimmed it down, as I can't share the specifics, but the full idea is there.
Answer:
I have a lookup of available content types in which I define whether each type is served inline or as an attachment; when I detect a Word document, I set it to attachment. No more error. PDFs still open in the browser because I set them to inline.
I use:
public ActionResult GetAttachment(int id)
{
    var attachment = _repository.GetAttachByID(id);
    if (attachment != null)
    {
        Response.AppendHeader("Content-Disposition", string.Format("inline; filename={0}", attachment.FileName));
        return File(attachment.File, attachment.MimeType, attachment.FileName);
    }
    else
    {
        return null;
    }
}
Regards

Using WebClient with HtmlAgilityPack on WP7 to get HTML generated from JavaScript

I want to get the time schedule on
http://www.21cineplex.com/playnow/sherlock-holmes-a-game-of-shadows,2709.htm
First,
I have tried using WebClient with HtmlAgilityPack and got to the table with id = "table-theater", but apparently the HTML is generated by JavaScript, so the table's innerHTML is empty.
public void LoadMovieShowTime(string MovieLink)
{
    WebClient MovieShowTimeclient = new WebClient();
    // Subscribe before starting the download so the completion event cannot be missed
    MovieShowTimeclient.DownloadStringCompleted += new DownloadStringCompletedEventHandler(MovieShowTimeclient_DownloadStringCompleted);
    MovieShowTimeclient.DownloadStringAsync(new Uri(MovieLink));
}

void MovieShowTimeclient_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(e.Result);
    var node = doc.DocumentNode.Descendants("div").First()
        .Elements("div").Skip(1).First()
        .Elements("div").Skip(1).First()
        .Element("div")
        .Elements("table").FirstOrDefault(table => table.Attributes["class"].Value == "table-theater");
}
Is it possible to get the data using WebClient on Windows Phone? Or is there any possible way to get it using another method?
Second,
I have tried to get the time schedule from the mobile site, which is
http://m.21cineplex.com/gui.list_schedule?sid=&movie_id=11SHGO&find_by=1&order=1
but the response asks me to enable cookies. I'm new to this; I've found that there is a way to extend WebClient's abilities by overriding WebRequest to handle cookies, but I can't find any reference on how to use it.
Thanks for any reply and help :)
Just because the table is generated in JavaScript does not mean the WebBrowser control will not render it. Ensure that IsScriptEnabled is set to true, this will ensure that the JavaScript that renders the table is executed. You can then 'scrape' the results.

Use HtmlUnit to search google

The following code is an attempt to search Google and return the results as text or HTML.
The code was almost entirely copied directly from code snippets online, and I see no reason for it not to return results from the search. How do you return Google search results, using HtmlUnit to submit the search query, without a browser?
import com.gargoylesoftware.htmlunit.WebClient;
import java.io.*;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import java.net.*;

public class GoogleSearch {
    public static void main(String[] args) throws IOException, MalformedURLException {
        final WebClient webClient = new WebClient();
        HtmlPage page1 = webClient.getPage("http://www.google.com");
        HtmlInput input1 = page1.getElementByName("q");
        input1.setValueAttribute("yarn");
        HtmlSubmitInput submit1 = page1.getElementByName("btnK");
        page1 = submit1.click();
        System.out.println(page1.asXml());
        webClient.closeAllWindows();
    }
}
There must be some browser detection that changes the generated HTML, because when inspecting the HTML with page1.getWebResponse().getContentAsString(), the submit button is named btnG, not btnK (btnK being what I observe in Firefox). Make this change, and the result will be the expected one.
I've just checked this. There are actually two button names across two Google pages:
btnK: on the Google home page (where there's one long textbox in the middle of the screen). Here the button's id is 'gbqfa'.
btnG: on the Google results page (where the main textbox is at the top of the screen). Here the button's id is 'gbqfb'.
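Based on that, a sketch of the corrected search from the question (the btnG/btnK names are whatever Google served at the time of this thread, so they may well have changed since; verify against the dumped response first):
import java.io.IOException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;

public class GoogleSearchFixed {
    public static void main(String[] args) throws IOException {
        final WebClient webClient = new WebClient();
        HtmlPage page1 = webClient.getPage("http://www.google.com");
        // Inspect the HTML HtmlUnit actually received; here the button was named btnG
        System.out.println(page1.getWebResponse().getContentAsString());
        HtmlInput input1 = page1.getElementByName("q");
        input1.setValueAttribute("yarn");
        HtmlSubmitInput submit1 = page1.getElementByName("btnG");
        page1 = submit1.click();
        System.out.println(page1.asXml());
        webClient.closeAllWindows();
    }
}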
