C# Html Agility Pack: pulling strings from HTML source code with XPath

I am just learning Html Agility Pack and would like to extract a couple of pieces of data from a website. I am working on a class project to compare Lowe's and Home Depot prices for a few items, and I have very little XPath and Html Agility Pack experience.
I want to store the item name and price in strings. The HTML source code contains 25 products; one segment of it is posted below.
For that segment I want to end up with string data_price = "14.97"; and string item = "Leaktite 5-Gal. Blue Bucket (Pack of 3)";
Below is a portion of the source code I am working with:
<div class="pod-inner">
<div class="productlist plp-pod__compare">
<div class="checkbox-btn js-podclick-analytics" data-podaction="compare">
<input type="checkbox" data-img="https://images.homedepot-static.com/productImages/8c1c50a0-e17c-4624-9e9e-35653052c1ce/svn/leaktite-paint-buckets-lids-209334-64_400_compressed.jpg" data-uom=" /package" data-price="$14.97" data-title="Leaktite 5-Gal. Blue Bucket (Pack of 3)" value="203924937" id="compare203924937" name="product" autocomplete="off" class="checkbox-btn__input">
So far I have:
HtmlDocument doc = new HtmlDocument();
string home_bucket_url="https://www.homedepot.com/s/5%2520gallon%2520bucket?NCNI-5";
WebClient client = new WebClient();
string home_bucket_raw = client.DownloadString(home_bucket_url);
var findclasses = doc.DocumentNode.Descendants("input type").Where(d => d.Attributes.Contains("checkbox"));
foreach (var x in findclasses)
{
Console.WriteLine(x.ToString());
}

Once you have successfully selected the node you need (please debug to make sure), you can simply read the attribute values from it. Something like this:
string data_price = x.Attributes["data-price"].Value;
string item = x.Attributes["data-title"].Value;
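
For completeness, here is a minimal sketch of the whole flow, assuming the markup looks like the segment in the question. The names PodParser and ExtractProducts are made up for illustration, and the XPath simply picks every input that carries both data attributes. Note that the HTML has to be loaded into the HtmlDocument before it can be queried.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PodParser
{
    // Hypothetical helper: returns (title, price) pairs for every <input>
    // element that carries both a data-price and a data-title attribute.
    public static List<(string Title, string Price)> ExtractProducts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // load the downloaded source before querying it

        var results = new List<(string Title, string Price)>();

        // By default SelectNodes returns null when nothing matches.
        var inputs = doc.DocumentNode.SelectNodes("//input[@data-price and @data-title]");
        if (inputs == null)
        {
            return results;
        }

        foreach (var input in inputs)
        {
            string title = input.GetAttributeValue("data-title", string.Empty);
            // Strip the leading dollar sign so "$14.97" becomes "14.97".
            string price = input.GetAttributeValue("data-price", string.Empty).TrimStart('$');
            results.Add((title, price));
        }

        return results;
    }
}
```

You would call it as PodParser.ExtractProducts(home_bucket_raw) and loop over the returned pairs. This only works if the downloaded source really contains these attributes; if the page builds its product list with JavaScript, the raw download may not include them.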

How to use replace in wkhtmltopdf or send value to header

First question:
I want to replace a value in the header. I use --header-html header.html for the PDF header. For example, I want to pass 3 values to the PDF:
date
Letter_Number
letter_title
Second question:
Can I use a view for the header? I want to use a view in ASP.NET. For example:
CustomSwitches = "--header-html header.cshtml "
About the first question:
One option is to keep using an HTML page as the header, as you do now: generate the new HTML in C# and overwrite the existing HTML file's content with it just before generating the PDF with Rotativa. The other option I can see, perhaps a bit more efficient because it avoids generating all the HTML in C#, is to use JavaScript inside the HTML to obtain these values (I am not sure this is fully achievable, since I don't know where the values you mention come from).
Supposing the date value is the current date, you could use something like this in your HTML:
<!DOCTYPE html>
<html>
<head>
<script>
function subst() {
var currentDate = new Date();
var dd = String(currentDate.getDate()).padStart(2, '0');
var mm = String(currentDate.getMonth() + 1).padStart(2, '0');
var yyyy = currentDate.getFullYear();
currentDate = dd + '-' + mm + '-' + yyyy;
document.getElementById("dateSpan").innerHTML = currentDate;
}
</script>
</head>
<body onload="subst()">
<div>
Date: <span id="dateSpan"></span>
</div>
</body>
</html>
Then point to the HTML file in the custom switches. Assuming it is located in a folder called PDF inside the Views folder, you could do:
customSwitches = " --header-html " + Server.MapPath("~/Views/PDF/header.html");
I make use of similar code for generating a footer with page number and it works like a charm.
About second question:
I use an MVC action to generate the partial view that I use as the PDF header.
Your code for the custom switches should look like this (using GenerateHeader as the action name, PDF as the controller, and yourModel as the model passed to the view, which is where you store your values):
customSwitches = "--header-html " + Url.Action("GenerateHeader", "PDF", yourModel, Request.Url.Scheme);
For your PDF controller, assuming PdfHeader.cshtml is the view you want to use as the PDF header, the code for the action would look like this:
public PartialViewResult GenerateHeader(YourModelType yourModel)
{
    return PartialView("PDF/PdfHeader", yourModel);
}
For the PartialView reference, remember to include this in your controller:
using System.Web.Mvc;
Hope this helps; if it doesn't, please let me know.

2sxc | SQL Datasource - LINQ Filter Query

I have a SQL Datasource set up to get all documents with certain extensions from the standard DNN 'Files' table. I want to add an extra level of filtering by file category, but I am not sure how best to go about it. See my current SQL Datasource code below:
@using ToSic.Eav.DataSources
@functions
{
// Default data initialization - should be the place to write data-retrieval code
// In the future when routing & pipelines are fully implemented, this will usually be an external configuration-only system
// For now, it's done in a normal event, which is automatically called by the razor host in 2SexyContent
public override void CustomizeData()
{
var source = CreateSource<SqlDataSource>();
// source.TitleField = "EntityTitle"; // not necessary, default
// source.EntityIdField = "EntityId"; // not necessary, default
// source.ConnectionString = "..."; // not necessary, we're using the ConnectionStringName on the next line
source.ConnectionStringName = Content.ConnectionName;
// Special note: I'm not selecting * from the DB, because I'm activating JSON and want to be sure that no secret data goes out
source.SelectCommand = "Select Top 10 FileId as EntityId, FileName as EntityTitle, PublishedVersion, UniqueId, FileName, Size as Size, Extension as Extension, CreatedOnDate as Date, Folder as Folder FROM Files WHERE PortalId = @PortalId AND Extension = 'docx' OR Extension = 'xlsx' OR Extension = 'pdf'";
source.Configuration.Add("@PortalId", Dnn.Portal.PortalId.ToString());
Data.In.Add("FileList", source.Out["Default"]);
// enable publishing
Data.Publish.Enabled = true;
Data.Publish.Streams = "Default,FileList";
}
}
I want to sync the 2sxc Categories entity with DNN's Tab/Page Taxonomy Tags/Categories, so that a user can select a DNN Tag during Page setup. If the tag is synced with the 2sxc Categories entity, this will let me assign a specific doc/excel/pdf file (already connected via 2sxc iCache to a 2sxc Category) to an app based on the SQL Datasource. That datasource connects by joining the taxonomy_terms table with the content items table, which in turn joins the content item tags table, which connects to the DNN tabs table.
How can I correct my LINQ/Razor code below so that it only displays files with the exact 'Services' Category assigned to them? I will use this filter to sync with the Taxonomy Tag 'Services' (exact match), linking the DNN Taxonomy term 'Services' to the 2sxc Category (which has an uploaded Adam file already connected via 2sxc iCache).
@foreach (var file in AsDynamic(Data.In["FileList"]).Where(i =>
    (i.Category as List<dynamic>).Any(c => c.EntityId == FileList.EntityId)))
{
    <li>@file.FileName</li>
}
I have looked in detail at the wiki notes on https://github.com/2sic/2sxc/wiki/DotNet-Query-Linq and I am stuck on getting the correct syntax for the category filter when using a foreach with the SQL Datasource template.
Cheers...
I believe we have solved this by mail already.
One minor recommendation: if you use DnnSqlDataSource instead of the SqlDataSource you already have the correct connection string for your current DNN. See also http://2sxc.org/en/docs/Feature/feature/4670 as well as https://github.com/2sic/2sxc/wiki/DotNet-DataSources-All
Yes, the filter I needed was as you provided below:
@using ToSic.SexyContent
@{
    // all QandA items
    var all = AsDynamic(App.Data["QandA"].List);
    // the filter value, can be set in template
    // but usually you would either get it from url with Request["urlparam"]
    // or from a setting in the view, using Content.Category
    var currentCat = "Business";
    // filter, find any which have the current category
    var filtered = all
        .Where(p => (p.Categories as List<DynamicEntity>).Any(c => AsDynamic(c).Name == currentCat));
}
<div class="clearfix">
    @foreach (var q in filtered)
    {
        <div class="sc-element" data-tags="@String.Join(",", ((List<DynamicEntity>)q.Categories).Select(a => AsDynamic(a).EntityId))">
            @Edit.Toolbar(Content)
            <div class="col-md-12">
                <div class="">
                    <a href="@q.Link" class="">
                        <span class="">@q.Link</span>
                    </a>
                    <p>@q.Title</p>
                </div>
            </div>
        </div>
    }
</div>
Thanks again!

HtmlAgilityPack Div Class Contains String

I'm trying to scrape only article text from web pages. I have discovered that the article is always surrounded with div tags. Unfortunately the class of these div tags is slightly different for each web page. I looked into using XPath but I don't think it will work due to the different class names. Is there a way I can get all the div tags and then get the class?
Examples
<div class="entry_single">
<p>I recently traveled without my notebook for the first time in ages.</p>
</div>
<div class="entry-content-pagination">
<p>Ward 9 Ald. Steven Dove</p>
</div>
That'd be easier using LINQ.
foreach (HtmlNode div in doc.DocumentNode.Descendants("div"))
{
    string className = div.GetAttributeValue("class", string.Empty);
    // do something with class name
}
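
Building on the loop above, here is a minimal sketch that keeps only the divs whose class contains a given fragment. ArticleFinder and DivsWithClassContaining are hypothetical names, and the fragment "entry" is an assumption based on the two example class names in the question.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ArticleFinder
{
    // Hypothetical helper: returns every <div> whose class attribute
    // contains the given fragment (plain, case-sensitive substring match).
    public static List<HtmlNode> DivsWithClassContaining(HtmlDocument doc, string fragment)
    {
        return doc.DocumentNode
            .Descendants("div")
            // A missing class attribute falls back to the empty string.
            .Where(div => div.GetAttributeValue("class", string.Empty).Contains(fragment))
            .ToList();
    }
}
```

Calling ArticleFinder.DivsWithClassContaining(doc, "entry") would match both entry_single and entry-content-pagination. The XPath expression //div[contains(@class, 'entry')] passed to SelectNodes should select the same nodes, so XPath can cope with varying class names as long as they share a common fragment.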

Getting raw text using @Html.ActionLink in Razor / MVC3?

Given the following Html.ActionLink:
@Html.ActionLink(Model.dsResults.Tables[0].Rows[i]["title"].ToString(), "ItemLinkClick",
    new { itemListID = @Model.dsResults.Tables[0].Rows[i]["ItemListID"], itemPosNum = i+1 }, ...
Data from the model contains HTML in the title field. However, I am unable to display it as rendered HTML; the markup is encoded instead, i.e. underlined text shows up with the literal <u>....</u> around it.
I've tried Html.Raw in the text part of the ActionLink, but no go.
Any suggestions?
If you still want to use a helper to create an action link with raw HTML for the link text, then I don't believe you can use Html.ActionLink. However, the answer to this Stack Overflow question describes creating a helper which does this.
I would write the link HTML manually, though, and use the Url.Action helper, which creates the same URL that Html.ActionLink would have created:
<a href="@Url.Action("ItemLinkClick", new { itemListID = @Model.dsResults.Tables[0].Rows[i]["ItemListID"], itemPosNum = i+1 })">
    @Html.Raw(Model.dsResults.Tables[0].Rows[i]["title"].ToString())
</a>
MvcHtmlString.Create should do the trick.
Using the ActionLink below, you do not need to pass HTML in the model. Let the CSS class or inline style determine how the link is decorated.
@Html.ActionLink(Model.dsResults.Tables[0].Rows[i]["title"].ToString(), "ItemLinkClick", "Controller", new { @class = "underline", style="text-decoration: underline" }, null)
These are the cases where you should take the other path:
@{
    string title = Model.dsResults.Tables[0].Rows[i]["title"].ToString(),
           aHref = String.Format("/ItemLinkClick/itemListID={0}&itemPosNum={1}...",
                       Model.dsResults.Tables[0].Rows[i]["ItemListID"],
                       i+1);
}
<a href="@aHref">@Html.Raw(title)</a>
Remember that Razor helpers help you, but you can still do things the plain HTML way.
You could also use this:
<a class='btn btn-link'
href='/Mycontroler/MyAction/" + item.ID + "'
data-ajax='true'
data-ajax-method='Get'
data-ajax-mode='InsertionMode.Replace'
data-ajax-update='#Mymodal'>My Comments</a>

Html Agility Pack: Setting an HtmlNode's Attribute Value isn't reflected in the HtmlDocument

In Html Agility Pack, when I set an attribute of an HtmlNode, should I see this in the HtmlDocument from which the node was selected?
Let's say that htmlDocument is an HtmlDocument. The simplified code looks like this:
HtmlNode documentNode = htmlDocument.DocumentNode;
HtmlNodeCollection nodeCollection = documentNode.SelectNodes(someXPath);
foreach (var node in nodeCollection)
    if (SomeCondition(node))
        node.SetAttributeValue("class", "something");
Now, I see the class attribute of node change, but I don't see this change reflected in the htmlDocument's HTML.
Actually it was a case of ProgrammerTooStupidException :(
I used a MyHtmlPage class, with an Html property and a DocumentNode property.
_html = theHtml;
_htmlDocument = new HtmlDocument();
HtmlDocument.LoadHtml(theHtml);
_documentNode = HtmlDocument.DocumentNode;
Now, of course, manipulating the DocumentNode had no effect on the _html value.
Posting this reply to clear the name of HAP.
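
To illustrate the point, here is a minimal sketch (AttributeDemo and AddClass are hypothetical names) showing that the change lives in the document's node tree. The string originally passed to LoadHtml is never modified, so the updated markup has to be read back from the document.

```csharp
using HtmlAgilityPack;

public static class AttributeDemo
{
    // Hypothetical helper: sets the class attribute on every node matched by
    // the XPath and returns the document serialized back to HTML.
    public static string AddClass(string html, string xpath, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // By default SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                node.SetAttributeValue("class", className);
            }
        }

        // The html argument is untouched; the edit is only visible
        // when the node tree is serialized again.
        return doc.DocumentNode.OuterHtml;
    }
}
```

In a wrapper class like MyHtmlPage, this suggests computing the Html property from the document's DocumentNode.OuterHtml on demand instead of caching the input string.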
