Extracting via HtmlAgilityPack


I'm using the HtmlAgilityPack and trying to extract an image name from html. Here's the html string I have:
sHtml = "<HTML><HEAD></HEAD><BODY>Here are some images.</br>1) <IMG style='MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px' align=right src='images/sample001.jpg'>2) <IMG style='MARGIN-BOTTOM: 25px; MARGIN-LEFT: 25px' align=right src='images/sample002.png'></br> And some docs as well.</br>1) href='javascript:parent.POPUP({url:'testDoc001.htm',type:'shared',width:600,height:645})'></br>2) href='javascript:parent.POPUP({url:'testDoc002.html',type:'shared',width:700,height:712})'></br></BODY></HTML>"
In WPF C# I pass this string into the following routine:
private static List<string> ExtractHtmlInfo(string sHtml)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(new StringReader(sHtml));
    HtmlNode root = doc.DocumentNode;
    List<string> anchorTags = new List<string>();
    //foreach (HtmlNode link in root.SelectNodes("//a"))
    foreach (HtmlNode link in root.SelectNodes("//img"))
    {
        string att = link.OuterHtml;
        anchorTags.Add(att);
    }
    return anchorTags;
}
When I step through the code I see that the line:
string att = link.OuterHtml;
provides the entire <img ...> node, which is more than I want.
I would like anchorTags to have just the folder and name of the file, as in:
[0] = images/sample001.jpg
[1] = images/sample002.png
So, I need something other than .OuterHtml but cannot find it.
Can anyone help?

You are looking for the values of the src attributes of the image elements:
foreach (HtmlNode img in root.SelectNodes("//img"))
{
    string att = img.Attributes["src"].Value;
    anchorTags.Add(att);
}
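A slightly more defensive variant, as a minimal sketch: it assumes the HtmlAgilityPack NuGet package is referenced, and the class and method names (ImageSrcSketch, ExtractImageSources) are made up for illustration. It guards against two things the loop above does not: SelectNodes returning null when nothing matches, and img elements without a src attribute.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ImageSrcSketch
{
    // Hypothetical helper: returns the src value of every <img> that has one.
    public static List<string> ExtractImageSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sources = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        // The [@src] predicate skips images that have no src attribute at all.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
        {
            return sources;
        }

        foreach (HtmlNode img in nodes)
        {
            // GetAttributeValue falls back to the default instead of throwing.
            sources.Add(img.GetAttributeValue("src", string.Empty));
        }
        return sources;
    }
}
```

Called with the HTML string from the question, this should yield the two relative paths and nothing else.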

Related

htmlagilitypack remove row attributes

How do we remove the inline height attribute from html?
<tr style="height:2px;">
</tr>
<tr style="height:2px;">
</tr>
I want only the height declarations to be removed from all tr tags.
Thanks a lot in advance.

You can do one of two things:
If your trs have no styles other than height, you can simply remove the style attribute altogether (the line I commented out).
Otherwise, you can write something like the snippet below to filter out only the style keys you want to remove.
string html = @"<tr style='height:2px;'>
</tr>
<tr style='height:2px;'>
</tr>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
var trs = doc.DocumentNode.SelectNodes("tr");
foreach (var tr in trs)
{
    Console.WriteLine(tr.OuterHtml);
    //tr.Attributes.Remove("style");
    // GetAttributeValue needs a default value for when the attribute is missing.
    var filteredStyles = GetStyles(tr.GetAttributeValue("style", string.Empty), "height");
    // CSS declarations are separated by ';', not ':'.
    tr.SetAttributeValue("style", string.Join(";", filteredStyles));
    Console.WriteLine(tr.OuterHtml);
}
Helper function:
// Requires: using System.Linq;
private static List<string> GetStyles(string style, params string[] keysToRemove)
{
    List<string> styles = new List<string>();
    var stylesKeyPairs = style.Split(new char[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
    if (keysToRemove != null)
    {
        foreach (var styleKeyPair in stylesKeyPairs)
        {
            var styleKeys = styleKeyPair.Split(new char[] { ':' }, StringSplitOptions.RemoveEmptyEntries);
            // Trim so that " height" (as in "color:red; height:2px") still matches "height".
            var key = (styleKeys.FirstOrDefault() ?? string.Empty).Trim();
            if (!keysToRemove.Contains(key))
                styles.Add(styleKeyPair);
        }
    }
    else
        styles.AddRange(stylesKeyPairs);
    return styles;
}
Output (for both solutions, in this case):
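As a side note, the filtering step itself does not depend on HtmlAgilityPack. Below is a minimal sketch of the same idea that works on the style string alone; StyleFilterSketch and RemoveStyleKeys are hypothetical names, and the approach assumes simple declarations (a value containing ';', such as a data URL, would be split incorrectly).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StyleFilterSketch
{
    // Hypothetical helper: drops the named properties from an inline style string.
    public static string RemoveStyleKeys(string style, params string[] keysToRemove)
    {
        // CSS property names are case-insensitive, so compare them that way.
        var remove = new HashSet<string>(
            keysToRemove ?? new string[0], StringComparer.OrdinalIgnoreCase);

        var kept = (style ?? string.Empty)
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(declaration => declaration.Trim())
            .Where(declaration => declaration.Length > 0)
            // The property name is everything before the first ':'.
            .Where(declaration => !remove.Contains(declaration.Split(':')[0].Trim()));

        return string.Join("; ", kept);
    }
}
```

When the result is an empty string, removing the style attribute entirely is probably cleaner than writing style="" back to the node.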

HtmlAgility Pack get single node get null value

I am trying to get a single node with an XPath, but I am getting a null value for the node and don't know why.
WebClient wc = new WebClient();
string nodeValue;
string htmlCode = wc.DownloadString("http://www.freeproxylists.net/fr/?c=&pt=&pr=&a%5B%5D=0&a%5B%5D=1&a%5B%5D=2&u=50");
HtmlAgilityPack.HtmlDocument html = new HtmlAgilityPack.HtmlDocument();
html.LoadHtml(htmlCode);
HtmlNode node = html.DocumentNode.SelectSingleNode("//table[@class='DataGrid']/tbody/tr[@class='Odd']/td/a");
nodeValue = (node.InnerHtml);

I see at least two mistakes in your XPath compared to the HTML you're trying to get information from.
There is no <a> that has <tr class="Odd"> as an ancestor.
Even if your XPath had worked, you would only have gotten a single node, since you call SelectSingleNode instead of SelectNodes.
It looks like the site uses some kind of lazy protection against what you're trying to do: each a tag is written in hexadecimal and wrapped in an IPDecode call, so extracting the link is not really a problem. It pays to look at the HTML you actually receive before posting, though. What your current code downloads is not the <body> of the link you gave us, meaning you have to get the HTML page from the absolute URL or just use Selenium.
Since I am such a swell guy, here is an entire solution using XPath, Html Agility Pack and Selenium. It gets the HTML of the site, then reads only the <tr> elements that have class="Odd". After that it finds all the "encrypted" <a> elements, decodes each one into a string and adds it to a list. Finally, there is a small example of how to get an attribute value from one anchor.
private void HtmlParser(string url)
{
    HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    // Save the rendered page to x.html, then load it from disk.
    GetHTML(url);
    htmlDoc.Load("x.html", Encoding.ASCII, true);
    HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//table[@class='DataGrid']/descendant::*/tr[@class='Odd']/td/script");
    List<string> urls = new List<string>();
    foreach (HtmlNode x in nodes)
    {
        urls.Add(ConvertStringToUrl(x.InnerText));
    }
    Console.WriteLine(ReadingTheAnchor(urls[0]));
}
private string ConvertStringToUrl(string octUrl)
{
    // Strip the IPDecode(" prefix, the trailing ") and the % signs,
    // leaving a plain string of two-digit hex codes.
    octUrl = octUrl.Replace("IPDecode(\"", "");
    octUrl = octUrl.Remove(octUrl.Length - 2);
    octUrl = octUrl.Replace("%", "");
    string ascii = string.Empty;
    for (int i = 0; i < octUrl.Length; i += 2)
    {
        string hs = octUrl.Substring(i, 2);
        uint decval = System.Convert.ToUInt32(hs, 16);
        char character = System.Convert.ToChar(decval);
        ascii += character;
    }
    // Now you have the <a> containing the link. Each one can be read as a separate HTML document containing just an <a>.
    Console.WriteLine(ascii);
    return ascii;
}
private string ReadingTheAnchor(string anchor)
{
    // Returns the URL of the anchor.
    HtmlDocument anchorHtml = new HtmlAgilityPack.HtmlDocument();
    anchorHtml.LoadHtml(anchor);
    HtmlNode h = anchorHtml.DocumentNode.SelectSingleNode("a");
    return h.GetAttributeValue("href", "");
}
//using OpenQA.Selenium; using OpenQA.Selenium.Firefox;
private void GetHTML(string url)
{
    using (var driver = new FirefoxDriver())
    {
        driver.Navigate().GoToUrl(url);
        Console.Clear();
        System.IO.File.WriteAllText("x.html", driver.PageSource);
    }
}
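The manual hex loop in ConvertStringToUrl can probably be replaced by the framework's own percent-decoding. The sketch below assumes the script text really has the shape IPDecode("%3c%61...") that the Replace calls above imply; IpDecodeSketch and Decode are hypothetical names, and only Uri.UnescapeDataString is a real framework call.

```csharp
using System;

public static class IpDecodeSketch
{
    // Hypothetical helper: extracts the quoted argument of IPDecode("...")
    // and percent-decodes it. Returns an empty string if the pattern is absent.
    public static string Decode(string scriptText)
    {
        const string prefix = "IPDecode(\"";

        int start = scriptText.IndexOf(prefix, StringComparison.Ordinal);
        if (start < 0)
        {
            return string.Empty;
        }
        start += prefix.Length;

        // The argument ends at the next double quote.
        int end = scriptText.IndexOf('"', start);
        if (end < 0)
        {
            return string.Empty;
        }

        // Turns sequences such as %3c%61 back into "<a".
        return Uri.UnescapeDataString(scriptText.Substring(start, end - start));
    }
}
```

Unlike the Substring-based loop, this does not throw when the payload has an odd number of hex digits or when the script contains something other than an IPDecode call.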

Image cropper doesn't work within nested foreach

I have a problem in Umbraco 7. I'm using a nested Multiple Node Tree Picker, but the GetCropUrl doesn't work. The crop function is ok, I've already used it.
@{
    if (CurrentPage.HasValue("artists"))
    {
        var artistList = CurrentPage.artists.ToString().Split(new string[] { "," }, StringSplitOptions.RemoveEmptyEntries);
        var artistCollection = Umbraco.Content(artistList);
        foreach (var artist in artistCollection)
        {
            if (artist.HasValue("coverImages"))
            {
                var coverImagesList = artist.coverImages.Split(new string[] { "," }, StringSplitOptions.RemoveEmptyEntries);
                var coverImagesCollection = Umbraco.Media(coverImagesList);
                foreach (var coverImage in coverImagesCollection.RandomOrder().Take(1).Where("Visible"))
                {
                    <img src="@coverImage.GetCropUrl(305, 195)"/>
                }
            }
        }
    }
}
Update:
I changed the code and started using the Id.
When I use this:
foreach (var coverImage in coverImagesCollection)
{
    <p>@coverImage.Id</p>
    <img src="@Umbraco.TypedMedia(1105).Url"/>
}
I get the image id back from @coverImage.Id, and the image displays.
When I use this:
foreach (var coverImage in coverImagesCollection)
{
    <img src="@Umbraco.TypedMedia(coverImage.Id).Url"/>
}
The image still displays.
Cropping with a fixed id:
foreach (var coverImage in coverImagesCollection)
{
    <img src="@Umbraco.TypedMedia(1105).GetCropUrl(305, 195)"/>
}
also works. But with this:
foreach (var coverImage in coverImagesCollection)
{
    <img src="@Umbraco.TypedMedia(coverImage.Id).GetCropUrl(305, 195)"/>
}
I get an error:
'Umbraco.Web.Models.PublishedContentBase' does not contain a definition for 'GetCropUrl'
How is that possible?

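I got an answer on the Umbraco forum. coverImage.Id is dynamic, so the whole call is bound at run time, and extension methods such as GetCropUrl are not found by the runtime binder. Casting the id to int makes the call statically typed again, and it worked perfectly:
foreach (var coverImage in coverImagesCollection)
{
    <img src="@Umbraco.TypedMedia((int)coverImage.Id).GetCropUrl(305, 195)"/>
}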
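The behaviour can be reproduced without Umbraco. In the sketch below, Media, MediaExtensions and DynamicDemo are stand-ins invented for illustration (they are not Umbraco types); the point is only that a dynamic argument makes the result of the call dynamic, and that the C# runtime binder does not consider extension methods.

```csharp
using System;
using Microsoft.CSharp.RuntimeBinder;

// Stand-in for a typed media item; not an Umbraco type.
public class Media
{
    public string Url = "/media/1105/cover.jpg";
}

public static class MediaExtensions
{
    // Stand-in for an extension method such as GetCropUrl.
    public static string GetCropUrl(this Media media, int width, int height)
    {
        return media.Url + "?width=" + width + "&height=" + height;
    }
}

public static class DynamicDemo
{
    public static Media TypedMedia(int id)
    {
        return new Media();
    }

    // Returns true if the call with a dynamic id fails at run time.
    public static bool DynamicCallFails(dynamic id)
    {
        try
        {
            // id is dynamic, so TypedMedia(id) is bound at run time and its
            // result is dynamic as well; the binder then looks for an instance
            // member named GetCropUrl on Media and finds none.
            var url = TypedMedia(id).GetCropUrl(305, 195);
            return false;
        }
        catch (RuntimeBinderException)
        {
            return true;
        }
    }

    // The cast gives the argument a static type, so the compiler binds
    // TypedMedia at compile time and can resolve the extension method.
    public static string CastCallWorks(dynamic id)
    {
        return TypedMedia((int)id).GetCropUrl(305, 195);
    }
}
```

The literal 1105 in the question worked for the same reason the cast does: its type is known at compile time.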

Referencing external css file with XMLWorker

I have switched from HtmlWorker to XMLWorker in order to take advantage of the use of external css files. Besides, in this way, I avoid the use of inline styling.
This is currently my code:
public ActionResult ViewPdf(object model)
{
    var doc = new Document(PageSize.A3, 10f, 1f, 10f, 30f);
    var memStream = new MemoryStream();
    var writer = PdfWriter.GetInstance(doc, memStream);
    writer.CloseStream = false;
    doc.Open();
    var xmltext = RenderActionResultToString(View(model));
    var htmlContext = new HtmlPipelineContext(null);
    htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
    var cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(false);
    cssResolver.AddCssFile(Server.MapPath("~/Content/StyleSheet1.css"), true);
    var pipeline = new CssResolverPipeline(cssResolver,
        new HtmlPipeline(htmlContext, new PdfWriterPipeline(doc, writer)));
    var worker = new XMLWorker(pipeline, true);
    var parser = new XMLParser(worker);
    using (var sr = new StringReader(xmltext))
    {
        parser.Parse(sr);
    }
    doc.Close();
    var buf = new byte[memStream.Position];
    memStream.Position = 0;
    memStream.Read(buf, 0, buf.Length);
    return new BinaryContentResult(buf, "application/pdf");
}
RenderActionResultToString(View(model)) gives me the string I want to parse into a PDF. This all works fine. The problem is with the styles. If I use inline styling such as:
<table style="width: 400px">
everything works like a charm. But if I use a class such as:
<table class="foo">
and define it in a CSS file (StyleSheet1.css) like this:
.foo {
    width: 400px;
}
it doesn't work.
I also tried referencing the stylesheet in the string I am parsing, like this in my view:
<head>
    <link href="@Server.MapPath("~/Content/StyleSheet1.css")" rel="stylesheet" type="text/css"/>
</head>
I inserted the link to the CSS file inside the head tags, just as the documentation states.
Thanks in advance.

How to display video from xml file?

Hi, I am using the XML file given below. How can I get the videos from it?
<Category name="Videos">
<article articleid="68">
<videourl>
<iframe src="http://player.vimeo.com/video/52375409?fullscreen=0" width="500" height="298" frameborder="0"></iframe>
</videourl>
</article>
</Category>
My code is:
XDocument loadedData = XDocument.Load("CountriesXML.xml");
var data = from query in loadedData.Descendants("Country")
           select new CountryData
           {
               url = (string)query.Element("videourl").Elements("iframe").Single().Attribute("src").Value,
           };
countryList = data.ToList();
but I get a NullReferenceException.

var xdoc = XDocument.Load("CountriesXML.xml");
var videos = from f in xdoc.Descendants("iframe")
             select new {
                 Src = (string)f.Attribute("src"),
                 Width = (int)f.Attribute("width"),
                 Height = (int)f.Attribute("height")
             };
Or with your updated code:
var xdoc = XDocument.Load("CountriesXML.xml");
var data = from c in xdoc.Descendants("Category") // you have Category element
           select new CountryData {
               url = (string)c.Element("article") // there is also article element
                              .Element("videourl")
                              .Elements("iframe")
                              .Single().Attribute("src")
           };
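For a quick check without the file on disk, the first query can be run against the XML from the question as a string. This is a minimal sketch; VideoXmlSketch, SampleXml and GetVideoUrls are hypothetical names. It also shows why casting the XAttribute to string is safer than reading .Value: the cast yields null for a missing attribute instead of throwing.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class VideoXmlSketch
{
    // The XML from the question, embedded so the example is self-contained.
    public const string SampleXml = @"<Category name=""Videos"">
  <article articleid=""68"">
    <videourl>
      <iframe src=""http://player.vimeo.com/video/52375409?fullscreen=0"" width=""500"" height=""298"" frameborder=""0""></iframe>
    </videourl>
  </article>
</Category>";

    // Hypothetical helper: returns the src of every iframe in the document.
    public static string[] GetVideoUrls(string xml)
    {
        return XDocument.Parse(xml)
            .Descendants("iframe")
            // (string)attribute is null when the attribute is missing,
            // whereas attribute.Value would throw a NullReferenceException.
            .Select(iframe => (string)iframe.Attribute("src"))
            .Where(src => src != null)
            .ToArray();
    }
}
```

Descendants("iframe") finds the element at any depth, so the query keeps working if the Category or article wrappers change.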
