Get blank value by using XPath - xpath

I'm using the following code, but it just returns a blank string when I try to get the links of the images. Please help me fix it.
link page : http://doisong.vnexpress.net/tin-tuc/suc-khoe/cuu-song-benh-nhan-ngung-tim-tac-mach-vanh-3035416.html
This code works for me when I get the content (in the p tags):
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='fck_detail width_common']/p/@*"))
{
content = content + node.InnerText;
}
This is the code I use when trying to get the image links (img tags):
foreach (HtmlAgilityPack.HtmlNode node2 in doc.DocumentNode.SelectNodes("//div[@class='fck_detail width_common']/table/tbody/tr/td/img/@src"))
{
string l1 = node2.InnerText;
}
It only returns a blank string. Please refer to the HTML structure at the link below:
www.flickr.com/photos/37903269@N05/14821937739/

The img-tag has no inner text, so you won't get any result with this XPath.
It's just a guess, but try the following:
foreach (HtmlAgilityPack.HtmlNode node2 in
doc.DocumentNode.SelectNodes("//div[@class='fck_detail width_common']/table/tbody/tr/td/img[@src]"))
{
// Read the attribute from the loop variable; an img element has no inner text.
string l1 = node2.Attributes["src"].Value;
}
For a detailed explanation have a look at a similar question here:
Parse Image src with agility pack

Related

how to get value of a tag that has no class or id in html agility pack?

I am trying to get the text value of this a tag:
<a href="item?id=22513425">67 comments</a>
So I'm trying to get '67' from this. However, there are no defining classes or IDs.
I've managed to get this far:
IEnumerable<HtmlNode> commentsNode = htmlDoc.DocumentNode.Descendants(0).Where(n => n.HasClass("subtext"));
var storyComments = commentsNode.Select(n =>
n.SelectSingleNode("//a[3]")).ToList();
This only gives me "comments", annoyingly enough.
I can't use the href id as there are many of these items, so I can't hardcode the href.
How can I extract the number as well?
Just use the @href attribute and a dedicated string function:
substring-before(//a[@href="item?id=22513425"],"comments")
returns 67.
EDIT: Since you can't hardcode all the content of @href, maybe you can use starts-with. XPath 1.0 solution.
Shortest form (+ text has to contain "comments"):
substring-before(//a[starts-with(@href,"item?") and text()[contains(.,"comments")]],"c")
More restrictive (+ text has to finish with "comments"; the predicate tests the current node with "." rather than re-selecting //a):
substring-before(//a[starts-with(@href,"item?")][substring(., string-length(.) - string-length('comments') + 1) = 'comments'],"c")
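Once the anchor's text has been pulled out with whichever selector works, the number can also be taken with ordinary string code instead of an XPath string function. A minimal sketch in JavaScript, assuming the text always starts with the count; `leadingCount` is a hypothetical helper name, not part of any library mentioned here:

```javascript
// Hypothetical helper: returns the leading integer of a link text such as
// "67 comments", or null when the text does not start with digits.
function leadingCount(text) {
  // \s also matches a non-breaking space, which some pages put between
  // the number and the word.
  const match = /^\s*(\d+)/.exec(text);
  return match ? parseInt(match[1], 10) : null;
}

console.log(leadingCount("67 comments")); // 67
console.log(leadingCount("discuss"));     // null
```

Returning null for text without a leading number keeps links such as "discuss" from being mistaken for a count of zero.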
I am using the ScrapySharp NuGet package, which adds the CssSelect extension used in my sample below. (It's possible HtmlAgilityPack offers the same functionality built in; I am just used to ScrapySharp from years ago.)
var doc = new HtmlDocument();
doc.Load(@"C:\desktop\anchor.html"); //I created an html file with your <a> element as the body
var anchor = doc.DocumentNode.CssSelect("a").FirstOrDefault();
if (anchor == null) return;
var digits = anchor.InnerText.ToCharArray().Where(c => Char.IsDigit(c));
Console.WriteLine($"anchor text: {anchor.InnerText} - digits only: {new string(digits.ToArray())}");
Output:

Scrape Instagram Web Hashtag Posts

I'm trying to scrape the number of posts to a given hashtag (#castles) and populate a Google Sheet cell using ImportXML.
I tried copying the XPath from Chrome and pasting it into the ImportXML parameter in the cell like this:
=ImportXML("https://www.instagram.com/explore/tags/castels/", "//*[@id="react-root"]/section/main/header/div[2]/div/div[2]/span/span")
I saw there is a problem with the quotation marks so I also tried:
=ImportXML("https://www.instagram.com/explore/tags/castels/", "//*[@id='react-root']/section/main/header/div[2]/div/div[2]/span/span")
Nevertheless, both return an error.
What am I doing wrong?
P.S. I am aware of the XPath to the meta tag description "//meta[@name='description']/@content"; however, I would like to scrape the exact number of posts and not an abbreviated number.
Try this -
function hashCount() {
var url = 'https://www.instagram.com/explore/tags/cats/';
var response = UrlFetchApp.fetch(url, {muteHttpExceptions: true}).getContentText();
var regex = /(edge_hashtag_to_media":{"count":)(\d+)(,"page_info":)/gm;
var match = regex.exec(response);
// exec returns null when the page does not contain the expected JSON fragment
var count = match ? match[2] : null;
Logger.log(count);
}
I've added muteHttpExceptions: true which was not added in my comment above. Hope this helps.
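The pattern can be checked without calling Instagram at all by running it against a sample string. A small sketch in plain JavaScript: `extractHashtagCount` is a hypothetical helper name, and the sample input is made up to match the fragment the regex expects, so it says nothing about what the live page currently returns:

```javascript
// Hypothetical helper: pulls the post count out of the page source, or
// returns null when the expected JSON fragment is absent.
function extractHashtagCount(pageSource) {
  const match = /"edge_hashtag_to_media":\{"count":(\d+)/.exec(pageSource);
  return match ? Number(match[1]) : null;
}

// Made-up sample input shaped like the fragment the regex looks for.
const sample = '{"edge_hashtag_to_media":{"count":1234567,"page_info":{}}}';
console.log(extractHashtagCount(sample));          // 1234567
console.log(extractHashtagCount("<html></html>")); // null
```

A null result is a hint that the page layout changed or that a login page was served instead of the tag page.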

Iterate over Umbraco getAllTagsInGroup result

I'm trying to get a list of tags from a particular tag group in Umbraco (v4.0.2.1) using the following code:
var tags = umbraco.editorControls.tags.library.getAllTagsInGroup("document downloads");
What I want to do is just output a list of those tags. However, if I output the variable 'tags' it just outputs a list of all tags in a string. I want to split each tag onto a new line.
When I check the datatype of the 'tags' variable:
string tagType = tags.GetType().ToString();
...it outputs MS.Internal.Xml.XPath.XPathSelectionIterator.
So question is, how do I get the individual tags out of the 'tags' variable? How do I work with a variable of this data type? I can find examples of how to do it by loading an actual XML file, but I don't have an actual XML file - just the 'tags' variable to work with.
Thanks very much for any help!
EDIT1: I guess what I'm asking is, how do I loop through the nodes returned by an XPathSelectionIterator data type?
EDIT2: I've found this code, which almost does what I need:
XPathDocument document = new XPathDocument("file.xml");
XPathNavigator navigator = document.CreateNavigator();
XPathNodeIterator nodes = navigator.Select("/tags/tag");
nodes.MoveNext();
XPathNavigator nodesNavigator = nodes.Current;
XPathNodeIterator nodesText = nodesNavigator.SelectDescendants(XPathNodeType.Text, false);
while (nodesText.MoveNext())
debugString += nodesText.Current.Value.ToString();
...but it expects the URL of an actual XML file to load into the first line. My XML file is essentially the 'tags' variable, not an actual XML file. So when I replace:
XPathDocument document = new XPathDocument("file.xml");
...with:
XPathDocument document = new XPathDocument(tags);
...it just errors.
Since it is an Iterator, I would suggest you iterate it. ;-)
var tags = umbraco.editorControls.tags.library.getAllTagsInGroup("document downloads");
foreach (XPathNavigator tag in tags) {
// handle current tag
}
I think this does the trick a little better.
The problem is that getAllTagsInGroup returns the container for all tags, you need to get its children.
foreach( var tag in umbraco.editorControls.tags.library.getAllTagsInGroup("category").Current.Select("/tags/tag") )
{
/// Your Code
}

Pulling Images from rss/atom feeds using magpie rss

I'm using PHP and Magpie and would like a general way of detecting images in a feed item. I know some websites place images within the enclosure tag, others like this images[rss], and some simply add them to the description. Does anyone have a general function for detecting whether an RSS item has an image and extracting the image URL after it has been parsed by Magpie?
I think regular expressions would be needed to extract it from the description, but I'm a noob at those. Please help if you can.
I spent ages searching for a way of displaying images in RSS via Magpie myself, and in the end I had to examine the code to figure out how to get it to work.
Like you say, the reason Magpie doesn't pick up images in the element is because they are specified using the 'enclosure' tag, which is an empty tag where the information is in the attributes, e.g.
<enclosure url="http://www.mysite.com/myphoto.jpg" length="14478" type="image/jpeg" />
As a hack to get it to work quickly for me I added the following lines of code into rss_parse.inc:
function feed_start_element($p, $element, &$attrs) {
...
if ( $el == 'channel' )
{
$this->inchannel = true;
}
...
// START EDIT - add this elseif condition to the if ($el=xxx) statement.
// Checks if element is enclosure tag, and if so store the attribute values
elseif ($el == 'enclosure' ) {
if ( isset($attrs['url']) ) {
$this->current_item['enclosure_url'] = $attrs['url'];
$this->current_item['enclosure_type'] = $attrs['type'];
$this->current_item['enclosure_length'] = $attrs['length'];
}
}
// END EDIT
...
}
The url to the image is in $myRSSitem['enclosure_url'] and the size is in $myRSSitem['enclosure_length'].
Note that enclosure tags can refer to many types of media, so first check if the type is actually an image by checking $myRSSitem['enclosure_type'].
Maybe someone else has a better suggestion, and I'm sure this could be done more elegantly to pick up attributes from other empty tags, but I needed a very quick fix (deadline pressures). I hope this might help someone else in difficulty!
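For the general question of finding an image wherever a feed puts it, the checks can be chained: the enclosure first, then an img tag inside the description. A rough sketch of that logic, written in JavaScript rather than PHP, assuming an item object whose `enclosure_url`, `enclosure_type`, and `description` fields mirror the array keys used above. `findItemImage` is a made-up name, and the regex is a simple heuristic, not a full HTML parser:

```javascript
// Hypothetical helper: returns an image URL for a parsed feed item, or null.
function findItemImage(item) {
  // 1. Use the enclosure, but only when its MIME type says it is an image.
  if (item.enclosure_url && /^image\//i.test(item.enclosure_type || "")) {
    return item.enclosure_url;
  }
  // 2. Fall back to an <img src="..."> embedded in the description HTML.
  const match = /<img[^>]+src\s*=\s*["']([^"']+)["']/i.exec(item.description || "");
  return match ? match[1] : null;
}

console.log(findItemImage({
  enclosure_url: "http://www.mysite.com/myphoto.jpg",
  enclosure_type: "image/jpeg"
}));
```

Checking the enclosure type before the description avoids returning an audio or video enclosure as if it were a picture.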

Image tag not closing with HTMLAgilityPack

Using the HtmlAgilityPack to write out a new image node, it seems to remove the closing tag of an image, e.g. it should be <img /> but when you check the outer HTML, it has <img>.
string strIMG = "<img src='" + imgPath + "' height='" + pubImg.Height + "px' width='" + pubImg.Width + "px' />";
HtmlNode newNode = HtmlNode.Create(strIMG);
This breaks xhtml.
Telling it to output XML as Micky suggests works, but if you have other reasons not to want XML, try this:
doc.OptionWriteEmptyNodes = true;
Edit 1: Here is how to fix an HTML Agility Pack document to correctly display image (img) tags:
if (HtmlNode.ElementsFlags.ContainsKey("img"))
{ HtmlNode.ElementsFlags["img"] = HtmlElementFlag.Closed;}
else
{ HtmlNode.ElementsFlags.Add("img", HtmlElementFlag.Closed);}
Replace "img" with any other tag to fix them as well (input, select, and option come up frequently). Repeat as needed. Keep in mind that this will produce <img></img> rather than <img />, because of the HAP bug preventing the "closed" and "empty" flags from being set simultaneously.
Source: Mike Bridge
Original answer:
Having just labored over solutions to this issue, and not finding any sufficient answers (doctype set properly, using Output as XML, Check Syntax, AutoCloseOnEnd, and Write Empty Node options), I was able to solve this with a dirty hack.
This will certainly not solve the issue outright for everyone, but for anyone returning their generated html/xml as a string (EG via a web service), the simple solution is to use fake tags that the agility pack doesn't know to break.
Once you have finished doing everything you need to do on your document, call the following method once for each tag giving you a headache (notable examples being option, input, and img). Immediately after, render your final string, do a simple replace for each tag prefixed with some string (in this case "fix_"), and return your string.
This is only marginally better in my opinion than the regex solution proposed in another question I cannot locate at the moment (something along the lines of )
private void fixHAPUnclosedTags(ref HtmlDocument doc, string tagName, bool hasInnerText = false)
{
HtmlNode tagReplacement = null;
foreach(var tag in doc.DocumentNode.SelectNodes("//"+tagName))
{
tagReplacement = HtmlTextNode.CreateNode("<fix_"+tagName+"></fix_"+tagName+">");
foreach(var attr in tag.Attributes)
{
tagReplacement.SetAttributeValue(attr.Name, attr.Value);
}
if(hasInnerText)//for option tags and other non-empty nodes, the next (text) node will be its inner HTML
{
tagReplacement.InnerHtml = tag.InnerHtml + tag.NextSibling.InnerHtml;
tag.NextSibling.Remove();
}
tag.ParentNode.ReplaceChild(tagReplacement, tag);
}
}
As a note, if I were a betting man I would guess that MikeBridge's answer above inadvertently identifies the source of this bug in the pack - something is causing the closed and empty flags to be mutually exclusive
Additionally, after a bit more digging, I don't appear to be the only one who has taken this approach:
HtmlAgilityPack Drops Option End Tags
Furthermore, in cases where you ONLY need non-empty elements, there is a very simple fix listed in that same question, as well as the HAP codeplex discussion here: This essentially sets the empty flag option listed in Mike Bridge's answer above permanently everywhere.
There is an option to turn on XML output that makes this issue go away.
var htmlDoc = new HtmlDocument();
htmlDoc.OptionOutputAsXml = true;
htmlDoc.LoadHtml(rawHtml);
This seems to be a bug with HtmlAgilityPack. There are many ways to reproduce this, for example:
Debug.WriteLine(HtmlNode.CreateNode("<img id=\"bla\"></img>").OuterHtml);
Outputs malformed HTML. Using the suggested fixes in the other answers does nothing.
HtmlDocument doc = new HtmlDocument();
doc.OptionOutputAsXml = true;
HtmlNode node = doc.CreateElement("x");
node.InnerHtml = "<img id=\"bla\"></img>";
doc.DocumentNode.AppendChild(node);
Debug.WriteLine(doc.DocumentNode.OuterHtml);
Produces malformed XML / XHTML like <x><img id="bla"></x>
I have created an issue in CodePlex for this.

Resources