How to get a select in HTML Agility Pack - html-agility-pack

I'm trying to get the select element named "postalDistrictList" from a web page, but none of my code works.
I also tried html.DocumentNode.SelectNodes("//select"), but this returns null.
Does anyone have any idea how I can do this? Here's my code so far:
static void Main(string[] args)
{
    HtmlNode.ElementsFlags.Remove("option");
    HtmlWeb htmlweb = new HtmlWeb();
    HtmlDocument html = htmlweb.Load("https://www.ura.gov.sg/realEstateWeb/realEstate/pageflow/transaction/submitSearch.do");
    // HtmlNode bodyNode = html.DocumentNode.SelectNodes("//select");
    HtmlNode bodyNode = html.DocumentNode.SelectSingleNode("/html/body");
    HtmlNode selectNode = html.GetElementbyId("postalDistrictList");
    HtmlNodeCollection selectNodes = html.DocumentNode.SelectNodes("//select[@name='postalDistrictList']");
    // HtmlNode selectNode = html.DocumentNode.SelectSingleNode("//select[@name='postalDistrictList']");
    HtmlNode node = selectNode;
    // foreach (HtmlNode node in selectNodes)
    {
        Console.Out.Write(node.Attributes["options"].Value);
        Console.Out.WriteLine();
    }
}

Try the XPath //./select[@name='postalDistrictList'], i.e.
HtmlNodeCollection selectNodes = html.DocumentNode.SelectNodes("//./select[@name='postalDistrictList']");
This should give you a collection of the select elements you are looking for.
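
For reference, here is a minimal sketch of selecting a select element by its name attribute and reading its options. It runs against an inline HTML string rather than the live page; the sample markup, the SelectDemo class and the GetOptionValues helper are all made up for illustration. Note that SelectNodes and SelectSingleNode return null rather than an empty result when nothing matches, so a null on the real page may simply mean the element is not present in the HTML that was downloaded.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SelectDemo
{
    // Hypothetical helper: returns the value attribute of every option
    // inside the select element with the given name.
    public static List<string> GetOptionValues(string html, string selectName)
    {
        // Same call as in the question: parse <option> as a normal element.
        HtmlNode.ElementsFlags.Remove("option");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var values = new List<string>();

        // Attributes are addressed with @ in XPath.
        HtmlNode select = doc.DocumentNode.SelectSingleNode("//select[@name='" + selectName + "']");
        if (select == null)
            return values; // no matching select in this document

        // The leading "." keeps the search inside the select node.
        HtmlNodeCollection options = select.SelectNodes(".//option");
        if (options == null)
            return values; // select has no options

        foreach (HtmlNode option in options)
            values.Add(option.GetAttributeValue("value", string.Empty));

        return values;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up markup standing in for the real page.
        string html = "<div><select name=\"postalDistrictList\">" +
                      "<option value=\"01\">District 01</option>" +
                      "<option value=\"02\">District 02</option>" +
                      "</select></div>";

        foreach (string value in SelectDemo.GetOptionValues(html, "postalDistrictList"))
            Console.WriteLine(value);
    }
}
```

The options are child elements of the select, not an attribute of it, which is why node.Attributes["options"] in the question has nothing to return.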

Related

Exception : An existing connection was forcibly closed by the remote host

I'm working on a web scraping project with .NET Framework and Html Agility Pack.
At first, I wrote a method that parses the category list from https://www.gearbest.com, and it works fine.
But now I need to parse the products from each category list item.
For example, there is the appliances category at https://www.gearbest.com/appliances-c_12245/, but when I run the method it returns an error:
'The underlying connection was closed: An unexpected error occurred on a receive'
Here is my code:
public void Get_All_Categories()
{
    var html = @"https://www.gearbest.com/";
    HtmlWeb web = new HtmlWeb();
    var htmlDoc = web.Load(html);
    var nodes = htmlDoc.DocumentNode.SelectNodes("/html/body/div[1]/div/ul[2]/li[1]/ul/li//a/span/../@href");
    foreach (HtmlNode n in nodes)
    {
        Category c = new Category();
        c.Name = n.InnerText;
        c.CategoryLink = n.GetAttributeValue("href", string.Empty);
        categories.Add(c);
    }
}
This is working pretty much fine.
public void Get_Product()
{
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12 | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls;
    var html = @"https://www.gearbest.com/appliances-c_12245/";
    HtmlWeb web = new HtmlWeb();
    var htmlDoc = web.Load(html);
    var x = htmlDoc.DocumentNode.SelectSingleNode("//*[@id=\"siteWrap\"]/div[1]/div[1]/div/div[3]/ul/li[1]/div/p[1]/a");
    Console.WriteLine(x.InnerText);
    Console.WriteLine("done");
}
But this method doesn't work and returns that error.
How can I fix this, please?
P.S.: I already saw some solutions about HTTPS handling, but they didn't work for me, maybe because I don't understand them.
I would appreciate any help. Thank you in advance.

How to get "Repro Steps" of a list of work items?

My team has been using VSTS for 8 months. Now our customer is asking for the "Repro Steps" of the work items in VSTS.
Is there any way to get the content of "Repro Steps" without the HTML formatting?
No, because the Repro Steps value is rich text that can contain images and so on, so the value would be incomplete if it were returned without the HTML formatting.
However, you can remove the HTML tags programmatically.
Simple code:
public static string StripHTML(string input)
{
    // Remove anything that looks like an HTML tag
    return Regex.Replace(input, "<.*?>", String.Empty);
}

var u = new Uri("[collection URL]");
VssCredentials c = new VssCredentials(new Microsoft.VisualStudio.Services.Common.WindowsCredential(new NetworkCredential("[user name]", "[password]")));
var connection = new VssConnection(u, c);
var workitemClient = connection.GetClient<WorkItemTrackingHttpClient>();
var workitem = workitemClient.GetWorkItemAsync(96).Result;
object repoValue = workitem.Fields["Microsoft.VSTS.TCM.ReproSteps"];
string repoValueWithOutformat = StripHTML(repoValue.ToString());
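
Stripping tags alone leaves HTML entities such as &amp; in the text and can glue words together where a tag used to separate them. As a rough sketch (not part of the answer above; the HtmlText class and ToPlainText name are made up for illustration), the tags can be replaced with spaces, the entities decoded with WebUtility.HtmlDecode, and the whitespace collapsed. This is still a regex-based approximation and not a full HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: rough HTML-to-text conversion.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Replace each tag with a space so adjacent words stay separated.
        string noTags = Regex.Replace(html, "<[^>]*>", " ");

        // Turn entities such as &amp; back into characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse runs of whitespace and trim the ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample standing in for a Repro Steps field value.
        string reproSteps = "<div>Open the page &amp; click <b>Save</b></div>";
        Console.WriteLine(HtmlText.ToPlainText(reproSteps));
    }
}
```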

Extracting SPWeb.Groups.Xml in XElement

I need to load SPWeb.Groups.Xml into an XElement to create an XDocument.
SPSite site = new SPSite(url);
foreach (SPWeb web in site.AllWebs)
{
    SPUserCollection spusers = site.RootWeb.SiteUsers;
    XElement xeGroup = new XElement("Groups");
    xeGroup = new XElement(web.Groups.Xml);
}
Currently I am getting the error "The '<' character, hexadecimal value 0x3C, cannot be included in a name."
Please suggest a workaround or the correct way to retrieve the information.
Thanks for your help.
My solution, not very elegant:
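SPSite site = new SPSite(url);
// xd was not declared in the snippet; LoadXml and XmlNodeReader imply an XmlDocument
XmlDocument xd = new XmlDocument();
foreach (SPWeb web in site.AllWebs)
{
    XElement xeGroup = new XElement("Groups");
    xd.LoadXml(web.Groups.Xml);
    xeGroup = XElement.Load(new XmlNodeReader(xd));
}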
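
The error in the question comes from the XElement(XName) constructor, which treats its string argument as the element name, so a string starting with '<' is rejected. A shorter route than going through XmlDocument is XElement.Parse, which parses XML text directly. The sketch below shows this on a made-up XML string; the exact shape of web.Groups.Xml is an assumption here, and the GroupsXml class and ToDocument name are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class GroupsXml
{
    // Hypothetical helper: parses an XML string and wraps it in an XDocument.
    public static XDocument ToDocument(string groupsXml)
    {
        // Parse reads XML markup; new XElement(string) would treat it as a name.
        XElement groups = XElement.Parse(groupsXml);
        return new XDocument(groups);
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample standing in for web.Groups.Xml.
        string xml = "<Groups><Group ID=\"1\" Name=\"Owners\" /></Groups>";
        XDocument doc = GroupsXml.ToDocument(xml);
        Console.WriteLine(doc.Root.Name);
    }
}
```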

How to find out what is wrong in a test case when importing it to TFS 2010

I'm importing test cases from an XML file to TFS 2010 and get an exception, but there is no information about what exactly is incorrect.
"Work item 0 is invalid and cannot be saved. Exception: 'TF237124: Work Item is not ready to save'."
How can I determine what is wrong in the data imported from the XML?
using System.Text.RegularExpressions;
using System.Xml;
using Microsoft.TeamFoundation.Server;
using Microsoft.TeamFoundation.WorkItemTracking.Client;
using System;
using System.Linq;

internal class Program
{
    // Input file
    private static TestLink testLink = new TestLink("E:\\dev\\TestLinkToTfs\\testsuites.xml");
    // Target TFS server
    private static Tfs tfs = new Tfs("http://host:8080/tfs/Test", "Test");

    private static void Main(string[] args)
    {
        // Import the test cases one by one (only the first one is taken here)
        foreach (var testLinkTestCase in testLink.GetTestCases().Take(1).ToList())
        {
            var steps = testLinkTestCase.Descendants("step");
            var testCase = tfs.Project.TestCases.Create(tfs.Project.WitProject.WorkItemTypes["Test Case"]);
            testCase.Title = testLinkTestCase.Attribute("name").Value;
            var summary = testLinkTestCase.Descendants("summary").ToList();
            var issueId = TestLink.GetLinkedIssueId(summary);
            var regEx = new Regex(@"[^a-zA-Z0-9 -]");
            var grandParentName = regEx.Replace(testLinkTestCase.Parent.Parent.Attribute("name").Value, string.Empty);
            var parentName = regEx.Replace(testLinkTestCase.Parent.Attribute("name").Value, string.Empty);
            var area = string.Format(@"Test\Test Cases\{0}\{1}", grandParentName, parentName);
            testCase.CustomFields["Assigned To"].Value = string.Empty;
            testCase.Area = area;
            Tfs.AddSteps(steps, testCase);
            testCase.Save();
        }
        Console.ReadKey();
    }
}
A work item ID of 0 means the work item has been created in memory but not yet saved, and the error says that some of its field values are not valid. You should call the method
workItem.Validate();
before you save the work item and then debug your code. It returns the list of fields that have invalid data, which tells you exactly which fields to fix.
I could be more helpful if you posted the code and the XML that you use for this.

Removing childnodes using HAP

When I try to remove a child node selected with my XPath, I get a weird error:
System.ArgumentOutOfRangeException was unhandled
Message=Node "" was not found in the collection
I know there was an issue with removing child nodes in HAP, but I don't know whether it has been fixed in the new release. My question is: is my code wrong, or is it HAP? Either way, is there a way to work around this and remove those child nodes?
Here is my code:
List<MediNetScheme> medinetScheme = new List<MediNetScheme>();
HtmlDocument htdoc = new HtmlDocument();
htdoc.LoadHtml(results);
foreach (HtmlNode table in htdoc.DocumentNode.SelectNodes("//table[@class='list-medium']/tbody[1]/tr[@class]"))
{
    string itemValue = string.Empty;
    HtmlNode ansvarig = table.SelectSingleNode("//table[@class='list-medium']/tbody[1]/tr[@class]/td[4]");
    table.RemoveChild(ansvarig, true);
    itemValue = table.InnerText;
    medinetScheme.Add(new MediNetScheme(){Datum=itemValue.Remove(15),Sections=itemValue.Remove(0,15)});
}
MediNetScheme.ItemsSource = medinetScheme;
Edit:
My HTML document has a table with several rows that match this XPath: "//table[@class='list-medium']/tbody[1]/tr[@class]". Each row in this table has 5 columns, td[1] to td[5]. In my first foreach loop I'm using SelectNodes to get the HTML of each row in the table. What I want is only the inner text of the first 3 td elements in each row, which means I need to get rid of td[4] and td[5] in each row. When I used your edited code, I was able to get rid of td[4] and td[5] in the first row, but not in the rows that follow it.
A better way to remove a node from its parent in HtmlAgilityPack is this:
nodeToRemove.ParentNode.RemoveChild(nodeToRemove);
In your code you can use it like this:
List<MediNetScheme> medinetScheme = new List<MediNetScheme>();
HtmlDocument htdoc = new HtmlDocument();
htdoc.LoadHtml(results);
foreach (HtmlNode table in htdoc.DocumentNode.SelectNodes("//table[@class='list-medium']/tbody[1]/tr[@class]"))
{
    string itemValue = string.Empty;
    HtmlNode ansvarig = table.SelectSingleNode("//table[@class='list-medium']/tbody[1]/tr[@class]/td[4]");
    ansvarig.ParentNode.RemoveChild(ansvarig);
    itemValue = table.InnerText;
    medinetScheme.Add(new MediNetScheme(){Datum=itemValue.Remove(15),Sections=itemValue.Remove(0,15)});
}
MediNetScheme.ItemsSource = medinetScheme;
I hope this will be useful for you :)
EDITED:
So you want to get the InnerText of the first three td elements in each row.
I checked your code and I think the XPath inside the foreach is wrong.
I would replace that XPath with a classic counted loop like this:
foreach (HtmlNode tr in htdoc.DocumentNode.SelectNodes("//table[@class='list-medium']/tbody[1]/tr[@class]"))
{
    string itemValue = string.Empty;
    int position = 1;
    foreach (var td in tr.Descendants("td"))
    {
        // Collect the text of the first three cells only
        itemValue += td.InnerText;
        position++;
        if (position > 3)
            break;
    }
    medinetScheme.Add(new MediNetScheme(){Datum=itemValue.Remove(15),Sections=itemValue.Remove(0,15)});
}
After a few hours of testing different code and ways to achieve what I wanted, I figured it out.
But I have to thank vfportero for his answer, and I'm flagging it as the answer too.
The answer to the edited version of my question is simply this code ;)
List<MediNetScheme> medinetScheme = new List<MediNetScheme>();
HtmlDocument htdoc = new HtmlDocument();
htdoc.LoadHtml(results);
foreach (HtmlNode table in htdoc.DocumentNode.SelectNodes("//table[@class='list-medium']/tbody[1]/tr[@class]"))
{
    table.ChildNodes.RemoveAt(3);
    string itemValue = table.InnerText;
    medinetScheme.Add(new MediNetScheme(){Datum=itemValue.Remove(15),Sections=itemValue.Remove(0,15)});
}
MediNetScheme.ItemsSource = medinetScheme;
You can see that I dropped the RemoveChild method because it was not doing what I wanted (please read the edit to my question), and instead I used .ChildNodes.RemoveAt(index), where index is the position of the child you want to remove.
Hope this helps other people facing the same problem.
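
A note on why only the first row was affected: an XPath that starts with // is evaluated from the document root even when it is called on a row node, so the inner SelectSingleNode in the code above keeps returning td[4] of the first row. A path that starts with ./ stays inside the current node. As a minimal sketch under that reading (the sample table, the RowText class and the FirstCells name are made up for illustration), the first three cells of each row can be read with a relative XPath without removing anything:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RowText
{
    // Hypothetical helper: for each row, joins the text of the first `count` cells.
    public static List<string> FirstCells(string html, int count)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table[@class='list-medium']//tr[@class]");
        if (rows == null)
            return result; // SelectNodes returns null when nothing matches

        foreach (HtmlNode row in rows)
        {
            // "./td" is relative to the current row; "//td" would search the whole document.
            HtmlNodeCollection cells = row.SelectNodes("./td[position() <= " + count + "]");
            if (cells == null)
                continue;

            result.Add(string.Join(" ", cells.Select(c => c.InnerText.Trim())));
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up table standing in for the real page.
        string html = "<table class='list-medium'>" +
                      "<tr class='odd'><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td></tr>" +
                      "<tr class='even'><td>f</td><td>g</td><td>h</td><td>i</td><td>j</td></tr>" +
                      "</table>";

        foreach (string line in RowText.FirstCells(html, 3))
            Console.WriteLine(line);
    }
}
```

Reading only the cells that are needed avoids modifying the document at all, which sidesteps the RemoveChild error entirely.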
