HtmlAgility Pack get single node get null value


I am trying to get a single node with an XPath, but I am getting a null value for the node and I don't know why.
WebClient wc = new WebClient();
string nodeValue;
string htmlCode = wc.DownloadString("http://www.freeproxylists.net/fr/?c=&pt=&pr=&a%5B%5D=0&a%5B%5D=1&a%5B%5D=2&u=50");
HtmlAgilityPack.HtmlDocument html = new HtmlAgilityPack.HtmlDocument();
html.LoadHtml(htmlCode);
HtmlNode node = html.DocumentNode.SelectSingleNode("//table[@class='DataGrid']/tbody/tr[@class='Odd']/td/a");
nodeValue = (node.InnerHtml);

I see at least 2 mistakes in your XPath compared to the HTML you're trying to get information from.
There is no <a> that has <tr class="Odd"> as an ancestor.
Even if your XPath had worked, you would only have gotten one <td>, since you chose SelectSingleNode instead of SelectNodes.
It looks like they are doing some kind of lazy protection against what you're trying to do, since the a-tag is just represented in hexadecimal enclosed in IPDecode. So really it is no problem to extract the link. But the least you could have done was to look at the HTML before posting. You clearly have not tried at all, since the HTML you're getting from your current code is not the <body> of the link you gave us, meaning you have to get the HTML page from the absolute URL or just use Selenium.
But since I am such a swell guy I will make your entire solution for you using XPath, Html Agility Pack and Selenium. The following solution gets the HTML of the site, then reads only the <tr> elements that have class="Odd". After that it finds all the "encrypted" <a> elements, decodes them into strings and writes them into a list. After that there is a small example of how to get an attribute value from one anchor.
private void HtmlParser(string url)
{
HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags=true;
GetHTML(url);
htmlDoc.Load("x.html", Encoding.ASCII, true);
HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//table[@class='DataGrid']/descendant::*/tr[@class='Odd']/td/script");
List<string> urls = new List<string>();
foreach(HtmlNode x in nodes)
{
urls.Add(ConvertStringToUrl(x.InnerText));
}
Console.WriteLine(ReadingTheAnchor(urls[0]));
}
private string ConvertStringToUrl(string octUrl)
{
octUrl = octUrl.Replace("IPDecode(\"", "");
octUrl = octUrl.Remove(octUrl.Length -2);
octUrl = octUrl.Replace("%", "");
string ascii = string.Empty;
for (int i = 0; i < octUrl.Length; i += 2)
{
String hs = string.Empty;
hs = octUrl.Substring(i,2);
uint decval = System.Convert.ToUInt32(hs, 16);
char character = System.Convert.ToChar(decval);
ascii += character;
}
//Now you have the <a> containing the link; each one can be read as a separate HTML fragment containing just an <a>
Console.WriteLine(ascii);
return ascii;
}
private string ReadingTheAnchor(string anchor)
{
//returns url of anchor
HtmlDocument anchorHtml = new HtmlAgilityPack.HtmlDocument();
anchorHtml.LoadHtml(anchor);
HtmlNode h = anchorHtml.DocumentNode.SelectSingleNode("a");
return h.GetAttributeValue("href", "");
}
//using OpenQA.Selenium; using OpenQA.Selenium.Firefox;
private void GetHTML(string url)
{
using (var driver = new FirefoxDriver())
{
driver.Navigate().GoToUrl(url);
Console.Clear();
System.IO.File.WriteAllText("x.html", driver.PageSource);
}
}
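The hex pairs inside IPDecode(...) are ordinary percent-encoding, so the manual loop in ConvertStringToUrl can probably be replaced by a single framework call. The following is a minimal sketch, assuming the script text has the form IPDecode("%3c%61...") as described above; the class name IpDecodeSketch and the method name Decode are made up for illustration and are not part of the original answer.

```csharp
using System;

static class IpDecodeSketch
{
    // Extracts the quoted argument of IPDecode("...") and percent-decodes it.
    // Returns an empty string when the expected pattern is not found.
    public static string Decode(string scriptText)
    {
        const string prefix = "IPDecode(\"";
        int start = scriptText.IndexOf(prefix, StringComparison.Ordinal);
        if (start < 0) return string.Empty;
        start += prefix.Length;

        int end = scriptText.IndexOf('"', start);
        if (end < 0) return string.Empty;

        // Uri.UnescapeDataString turns %XX sequences back into characters.
        return Uri.UnescapeDataString(scriptText.Substring(start, end - start));
    }
}
```

The decoded string can then be passed to ReadingTheAnchor in the same way as the output of ConvertStringToUrl.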

Related

itext7 barcodes in footer

I'm working on a rather complex solution that takes an HTML-like input and converts it to a PDF. One of the many items that I'm trying to solve is adding barcodes (all types: 3 of 9, PDF417, and QR code) to the footer of documents.
A couple details that give me pause on how to implement:
Bar code will contain current page number
Bar code will contain total page count
Bar code will be inside other itext elements (like a table cell or paragraph) and (in the final solution) needs to be parsed out ahead of time
Knowing those details, I'm struggling a bit on how to combine barcodes with something like the page x of y strategy of using a template to replace page count after rendering all the content.
I assume that each bar code will need its own template because of the page count, and that I need to keep track of the templates until all the content is rendered and then update each individual template with the appropriate bar code. But because the footer is parsed out ahead of time, I need a template that represents a bar code so that the footer will have the correct height and content can be adjusted appropriately.
I believe that each of these pieces need to be handled in the event handler for end of page, is that a correct assessment?
Update: edited to include a code sample. I pulled quite a bit of the other stuff I was trying to accomplish out of this example. As for "parsed ahead of time": instead of looping from 1 to 20 and creating random elements, some other process creates all the elements that need to be present in the document and passes that list of elements to the renderer. That includes the footer content as well. In this case I'm creating the footer table in the constructor of the HeaderHandler, as that is close to the same concept. The reason I bring this up is that I won't be able to create the table in the HandleEvent of the handler, as in most examples I have seen of tables in footers. Hope that makes sense.
void Main()
{
PdfDocument pdf = new PdfDocument(new PdfWriter(Dest));
PageSize pageSize = PageSize.A4;
Document doc = new Document(pdf, pageSize, true);
HeaderHandler hh = new HeaderHandler(doc);
...
some other object generation
...
// create random paragraphs to fill up multiple pages in the final solution this would have already happened.
for (var i = 0; i < 20; i++)
AddItemToList(elementList, i, objects);
// add random elements back to the document
foreach (var e in elementList)
{
... add each item just added to elementList to the document ...
}
renderer.Flush();
hh.UpdateTotal(pdf);
// I think I need to update all the barcodes and print them out here so that page count part of the barcode can be written
doc.Close();
}
class HeaderHandler : IEventHandler
{
Table Footer;
Document Doc;
public Margin First;
public Margin Middle;
public Margin Last;
public Dictionary<int, Margin> PageMargins { get; set; }
public float HeaderHeight { get; }
public float FooterHeight { get; }
PdfFormXObject PgCount;
Text PageNumber;
Dictionary<string, PdfFormXObject> BarcodeImages;
public HeaderHandler(Document doc)
{
Doc = doc;
Footer = new Table(new float[] { 4, 2, 4}).SetAutoLayout();
PageMargins = new Dictionary<int, Margin>();
BarcodeImages = new Dictionary<string, PdfFormXObject>();
var pageSize = Doc.GetPdfDocument().GetDefaultPageSize();
var width = pageSize.GetRight() - pageSize.GetLeft() - Doc.GetLeftMargin() - Doc.GetRightMargin();
// page total
PgCount = new PdfFormXObject(new Rectangle(0,0, 13, 13));
Footer.AddCell(new Cell().Add(new Paragraph("info 1")));
PageNumber = new Text("{page}");
var cell = new Cell().Add(new Paragraph().Add(PageNumber).Add(" of ").Add(new Image(PgCount)).Add(" pages").SetTextAlignment(TextAlignment.CENTER));
Footer.AddCell(cell);
Footer.AddCell(new Cell().Add(new Paragraph("info 2")));
Footer.AddCell("footer 1");
Footer.AddCell("footer 2");
// I think I need to add a template here as a placeholder for the barcode, so that when the renderer subtree is laid out it reserves space for the barcode
Footer.AddCell(new Cell().Add(new Paragraph("{barcode} {qr code - {page} | {pagect} | doc name}")));
TableRenderer fRenderer = (TableRenderer)Footer.CreateRendererSubTree();
using (var s = new MemoryStream())
{
fRenderer.SetParent(new Document(new PdfDocument(new PdfWriter(s))).GetRenderer());
FooterHeight = fRenderer.Layout(new LayoutContext(new LayoutArea(0, PageSize.A4))).GetOccupiedArea().GetBBox().GetHeight();
}
}
public void UpdateTotal(PdfDocument pdf) {
Canvas canvas = new Canvas(PgCount, pdf);
canvas.ShowTextAligned(pdf.GetNumberOfPages().ToString(), 0, -3, TextAlignment.LEFT);
}
//draw footer and header tables
public void HandleEvent(Event e)
{
PdfDocumentEvent docEvent = e as PdfDocumentEvent;
if (docEvent == null)
return;
PdfDocument pdf = docEvent.GetDocument();
PdfPage page = docEvent.GetPage();
PdfCanvas pdfCanvas = new PdfCanvas(page.GetLastContentStream(), page.GetResources(), pdf);
int pageNum = pdf.GetPageNumber(page);
var pageSize = Doc.GetPdfDocument().GetDefaultPageSize();
Margin activeMargin = new Margin();
if (PageMargins.ContainsKey(pageNum))
activeMargin = PageMargins[pageNum];
var width = pageSize.GetRight() - pageSize.GetLeft() - activeMargin.Left - activeMargin.Right;
Header.SetWidth(width);
Footer.SetWidth(width);
var pageReferences = new List<TextRenderer>();
// update page number text so it can be written to in the footer
PageNumber.SetText(pageNum.ToString());
// draw the footer
rect = new Rectangle(pdf.GetDefaultPageSize().GetX() + activeMargin.Left, activeMargin.Bottom - GetFooterHeight(), 100, GetFooterHeight());
canvas = new Canvas(pdfCanvas, pdf, rect);
// I think it's here that I need to be able to add a barcode placeholder to something that can be called
canvas.Add(Footer);
}
public float GetFooterHeight()
{
return FooterHeight;
}
}

Issue with HttpClient.GetStringAsync method in windows phone 8.1

private async void refresh_Tapped(object sender, TappedRoutedEventArgs e)
{
httpclient.CancelPendingRequests();
string url = "http://gensav.altervista.org/";
var source = await httpclient.GetStringAsync(url); //PROBLEM
source = WebUtility.HtmlDecode(source);
HtmlDocument result = new HtmlDocument();
result.LoadHtml(source);
List<HtmlNode> toftitle = result.DocumentNode.Descendants().Where
(x => (x.Attributes["style"] != null
&& x.Attributes["style"].Value.Contains("font-size:14px;line-height:20px;margin-bottom:10px;"))).ToList();
var li = toftitle[0].InnerHtml.Replace("<br>", "\n");
li = li.Replace("<span style=\"text-transform: uppercase\">", "");
li = li.Replace("</span>", "");
postTextBlock.Text = li;
}
What this code does is basically retrieve a string from a website (the HTML source, which is parsed right after). This code runs whenever I click a button: the first time I click it, it works correctly, but the second time I think that the method (GetStringAsync) returns an uncompleted task and execution then continues using the old value of source. Indeed, my TextBlock does not update.
Any solution?

You are probably getting a cached response.
Maybe this will work for you:
httpclient.CancelPendingRequests();
// disable caching
httpclient.DefaultRequestHeaders.Add("Cache-Control", "no-cache");
string url = "http://gensav.altervista.org/";
var source = await httpclient.GetStringAsync(url);
...
You can also add a meaningless value to your URL, like this:
string url = "http://gensav.altervista.org/" + "?nocache=" + Guid.NewGuid();
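If the URL may already carry a query string, the separator needs to be chosen accordingly. Here is a minimal sketch of such a helper; the class name CacheBuster, the method name Append and the parameter name nocache are illustrative assumptions, and the sketch does not handle URLs that contain a fragment (#...).

```csharp
using System;

static class CacheBuster
{
    // Appends a unique, meaningless query parameter so that each request
    // uses a different URL and cannot be answered from the cache.
    public static string Append(string url)
    {
        string separator = url.Contains("?") ? "&" : "?";
        return url + separator + "nocache=" + Guid.NewGuid().ToString("N");
    }
}
```

Usage would look like `var source = await httpclient.GetStringAsync(CacheBuster.Append(url));`.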

To prevent Http responses from getting cached, I do this (in WP8.1):
HttpBaseProtocolFilter filter = new HttpBaseProtocolFilter();
filter.CacheControl.ReadBehavior =
Windows.Web.Http.Filters.HttpCacheReadBehavior.MostRecent;
filter.CacheControl.WriteBehavior =
Windows.Web.Http.Filters.HttpCacheWriteBehavior.NoCache;
_httpClient = new HttpClient(filter);
Initialize your HttpClient in this manner to prevent caching behaviour.

C#/ Html agility pack, is there a more eloquent way to screen scrape?

I'm working on an app in C# that gathers web data from a few different pages daily and saves it in SQL Server. I'm using html agility pack... at the moment I have an xpath for each field/ column in the database. There are 62 columns in the table, and with checking for proper values and formatting, the code below is VERY verbose and repetitive (specifically, xpath expressions and associated blocks). I was wondering if there was a nicer, more concise way, perhaps using LINQ? (which I haven't used much yet but would like to) Here's just the first couple fields set below, this repeats .... 62 cols. I'm not looking for a rewrite, just any suggestions I can get.
List<IDataPoint> list = new List<IDataPoint>();
HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmlDoc = hwObject.Load(AddressString);
if (htmlDoc.DocumentNode != null && !htmlDoc.DocumentNode.InnerHtml.Contains("There is no key statistics data available"))
{
var symbolNode = htmlDoc.DocumentNode.SelectSingleNode("/html/body/div[3]/div[4]/div/div/div/div/div/div/h2");
if (symbolNode != null)
{
KeyStatsDP keyStatsDp = new KeyStatsDP();
String symb = "";
symb = symbolNode.InnerHtml;
symb = symb.Substring(symb.LastIndexOf("(") + 1);
symb = symb.Substring(0, symb.Length - 1);
keyStatsDp.Symbol = symb;
String mktCapXPath = "//*[@id=\"yfs_j10_" + symb.ToLower() + "\"]";
var mktCapNode = htmlDoc.DocumentNode.SelectSingleNode(mktCapXPath);
if (mktCapNode != null)
{
String mktCap = mktCapNode.InnerHtml;
keyStatsDp.MarketCapIntraDay = ConvertMoneyInStrToInt(mktCap);
}
var entValNode = htmlDoc.DocumentNode.SelectSingleNode("//html/body/div[3]/div[4]/table[2]/tr[2]/td/table[2]/tr/td/table/tr[2]/td[2]");
if (entValNode != null)
{
if (!entValNode.InnerHtml.Contains("N"))
{
String entVal = entValNode.InnerHtml;
keyStatsDp.EntValue = ConvertMoneyInStrToInt(entVal);
}
}

How to convert img url to BASE64 string in HTML on one method chain by using LINQ or Rx

I found I could generate XDocument object from html by using SgmlReader.SL.
https://bitbucket.org/neuecc/sgmlreader.sl/
The code is like this.
public XDocument Html(TextReader reader)
{
XDocument xml;
using (var sgmlReader = new SgmlReader { DocType = "HTML", CaseFolding = CaseFolding.ToLower, InputStream = reader })
{
xml = XDocument.Load(sgmlReader);
}
return xml;
}
Also we can get src attributes of img tags from the XDocument object.
var ns = xml.Root.Name.Namespace;
var imgQuery = xml.Root.Descendants(ns + "img")
.Select(e => new
{
Link = e.Attribute("src").Value
});
And, we can download and convert stream data of image to BASE64 string.
public static string base64String;
WebClient wc = new WebClient();
wc.OpenReadCompleted += new OpenReadCompletedEventHandler(wc_OpenReadCompleted); // attach the handler before starting the download
wc.OpenReadAsync(new Uri(url)); //image url from src attribute
void wc_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
using (MemoryStream ms = new MemoryStream())
{
while (true)
{
byte[] buf = new byte[32768];
int read = e.Result.Read(buf, 0, buf.Length);
if (read > 0)
{
ms.Write(buf, 0, read);
}
else { break; }
}
byte[] imageBytes = ms.ToArray();
base64String = Convert.ToBase64String(imageBytes);
}
}
So, what I'd like to do is the steps below, in one method chain, as with LINQ or Reactive Extensions.
Get the src attributes of img tags from the XDocument object.
Get the image data from the URLs.
Generate BASE64 strings from the image data.
Replace the src attributes with the BASE64 strings.
The simplest source and output are here.
Before
<html>
<head>
</head>
<body>
<img src='http://image.com/image.jpg' />
<img src='http://image.com/image2.png' />
</body>
</html>
After
<html>
<head>
</head>
<body>
<img src='data:image/jpg;base64,iVBORw...' />
<img src='data:image/png;base64,iSDoske...' />
</body>
</html>
Does anyone know the solution for this?
I'd like to ask experts.

Both LINQ and Rx are designed to promote transformations that result in new objects, not ones that modify existing objects, but this is still doable. You have already done the first step, breaking the task into parts. The next step is to make composable functions that implement those steps.
1) You mostly have this one already, but we should probably keep the elements around to update later.
public IEnumerable<XElement> GetImages(XDocument document)
{
var ns = document.Root.Name.Namespace;
return document.Root.Descendants(ns + "img");
}
2) This seems to be where you have hit a wall from the composability point of view. To start, let's make a FromEventAsyncPattern observable generator. There are already ones for the Begin/End async pattern and standard events, so this will come out somewhere in between.
public IObservable<TEventArgs> FromEventAsyncPattern<TDelegate, TEventArgs>
(Action method, Action<TDelegate> addHandler, Action<TDelegate> removeHandler
) where TEventArgs : EventArgs
{
return Observable.Create<TEventArgs>(
obs =>
{
//subscribe to the handler before starting the method
var ret = Observable.FromEventPattern<TDelegate, TEventArgs>(addHandler, removeHandler)
.Select(ep => ep.EventArgs)
.Take(1) //do this so the observable completes
.Subscribe(obs);
method(); //start the async operation
return ret;
}
);
}
Now we can use this method to turn the downloads into observables. Based on your usage, I think you could also use DownloadDataAsync on the WebClient instead.
public IObservable<byte[]> DownloadAsync(Uri address)
{
return Observable.Using(
() => new System.Net.WebClient(),
wc =>
{
return FromEventAsyncPattern<System.Net.DownloadDataCompletedEventHandler,
System.Net.DownloadDataCompletedEventArgs>
(() => wc.DownloadDataAsync(address),
h => wc.DownloadDataCompleted += h,
h => wc.DownloadDataCompleted -= h
)
.Select(e => e.Result);
//for robustness, you should probably check the error and cancelled
//properties instead of assuming it finished like I am here.
});
}
EDIT: As per your comment, you appear to be using Silverlight, where WebClient is not IDisposable and does not have the method I was using. To deal with that, try something like:
public IObservable<byte[]> DownloadAsync(Uri address)
{
var wc = new System.Net.WebClient();
var eap = FromEventAsyncPattern<OpenReadCompletedEventHandler,
OpenReadCompletedEventArgs>(
() => wc.OpenReadAsync(address),
h => wc.OpenReadCompleted += h,
h => wc.OpenReadCompleted -= h);
return from e in eap
from b in e.Result.ReadAsync()
select b;
}
You will need to find an implementation of ReadAsync to read the stream. You should be able to find one pretty easily, and the post was long enough already so I left it out.
3 & 4) Now we are ready to put it all together and update the elements. Since step 3 is so simple, I'll just merge it in with step 4.
public IObservable<Unit> ReplaceImageLinks(XDocument document)
{
return (from element in GetImages(document)
let address = new Uri(element.Attribute("src").Value)
select (from data in DownloadAsync(address)
select Convert.ToBase64String(data)
).Do(base64 => element.Attribute("src").Value = base64)
).Merge()
.IgnoreElements()
.Select(s => Unit.Default);
//select doesn't really do anything as IgnoreElements eats all
//the values, but it is needed to change the type of the observable.
//Task may be more appropriate here.
}
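Note that the "After" markup in the question uses data URIs (data:image/...;base64,...), not bare BASE64 strings, so the value written back to src probably needs that prefix. Below is a minimal sketch of building it, assuming the image type can be guessed from the file extension in the URL; the names DataUriSketch and ToDataUri are hypothetical. The sketch maps jpg to image/jpeg, which is the registered MIME type, rather than the image/jpg shown in the question.

```csharp
using System;
using System.IO;

static class DataUriSketch
{
    // Builds a data URI from downloaded image bytes, guessing the MIME type
    // from the extension of the source URL (an assumption, not a robust check).
    public static string ToDataUri(byte[] data, string sourceUrl)
    {
        string path = new Uri(sourceUrl).AbsolutePath;
        string ext = Path.GetExtension(path).TrimStart('.').ToLowerInvariant();
        string mime = ext == "jpg" ? "image/jpeg" : "image/" + ext;
        return "data:" + mime + ";base64," + Convert.ToBase64String(data);
    }
}
```

In ReplaceImageLinks, the inner query could then select ToDataUri(data, address.ToString()) instead of Convert.ToBase64String(data).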

How to extract bullet information from word document?

I want to extract information of bullets present in word document.
I want something like this :
Suppose the text below, is in word document :
Steps to Start car :
Open door
Sit inside
Close the door
Insert key
etc.
Then I want my text file like below :
Steps to Start car :
<BULET> Open door </BULET>
<BULET> Sit inside </BULET>
<BULET> Close the door </BULET>
<BULET> Insert key </BULET>
<BULET> etc.</BULET>
I am using C# language to do this.
I can extract paragraphs from the Word document and write them directly to a text file with some formatting information, such as whether the text is bold or italic, but I don't know how to extract this bullet information.
Can anyone please tell me how to do this?
Thanks in advance

You can do it by reading each sentence. doc.Sentences is a collection of Range objects, and you can get the same kind of Range object from a Paragraph:
foreach (Paragraph para in oDoc.Paragraphs)
{
string paraNumber = para.Range.ListFormat.ListLevelNumber.ToString();
string bulletStr = para.Range.ListFormat.ListString;
MessageBox.Show(paraNumber + "\t" + bulletStr + "\t" + para.Range.Text);
}
paraNumber holds the paragraph's list level, and bulletStr holds the bullet as a string.
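To get from those two values to the tagged output the question asks for, each paragraph can be wrapped depending on whether it has a list string. This is a minimal sketch, assuming that ListString is empty for paragraphs that are not list items and that the paragraph text ends with a paragraph mark that should be trimmed; BulletTagger and Tag are hypothetical names, and the <BULET> tag is copied from the question as written.

```csharp
using System;

static class BulletTagger
{
    // Wraps list paragraphs in <BULET> tags and returns other paragraphs unchanged
    // (apart from trimming surrounding whitespace such as the trailing "\r").
    public static string Tag(string listString, string paragraphText)
    {
        string text = paragraphText.Trim();
        return string.IsNullOrEmpty(listString)
            ? text
            : "<BULET> " + text + " </BULET>";
    }
}
```

Inside the loop above this would be called as `BulletTagger.Tag(bulletStr, para.Range.Text)`, with each result written as one line of the text file.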

I am using the OpenXmlPowerTools library by Eric White. It's free and available as a NuGet package; you can install it from the Visual Studio package manager.
He provides a ready-to-use code snippet. This tool has saved me many hours. Below is how I customized the snippet for my requirement.
In fact, you can use these methods as is in your project.
private static WordprocessingDocument _wordDocument;
private StringBuilder textItemSB = new StringBuilder();
private List<string> textItemList = new List<string>();
/// <summary>
/// Opens the Word document using the Office SDK and reads all contents from the body of the document
/// </summary>
/// <param name="filepath">path of file to be processed</param>
/// <returns>List of paragraphs with their text contents</returns>
private void GetDocumentBodyContents()
{
string modifiedString = string.Empty;
List<string> allList = new List<string>();
List<string> allListText = new List<string>();
try
{
_wordDocument = WordprocessingDocument.Open(wordFileStream, false);
//RevisionAccepter.AcceptRevisions(_wordDocument);
XElement root = _wordDocument.MainDocumentPart.GetXDocument().Root;
XElement body = root.LogicalChildrenContent().First();
OutputBlockLevelContent(_wordDocument, body);
}
catch (Exception ex)
{
logger.Error("ERROR in GetDocumentBodyContents:" + ex.Message.ToString());
}
}
// This is recursive method. At each iteration it tries to fetch listitem and Text item. Once you have these items in hand
// You can manipulate and create your own collection.
private void OutputBlockLevelContent(WordprocessingDocument wordDoc, XElement blockLevelContentContainer)
{
try
{
string listItem = string.Empty, itemText = string.Empty, numberText = string.Empty;
foreach (XElement blockLevelContentElement in
blockLevelContentContainer.LogicalChildrenContent())
{
if (blockLevelContentElement.Name == W.p)
{
listItem = ListItemRetriever.RetrieveListItem(wordDoc, blockLevelContentElement);
itemText = blockLevelContentElement
.LogicalChildrenContent(W.r)
.LogicalChildrenContent(W.t)
.Select(t => (string)t)
.StringConcatenate();
if (itemText.Trim().Length > 0)
{
if (null == listItem)
{
// Add html break tag
textItemSB.Append( itemText + "<br/>");
}
else
{
//if listItem == "" bullet character, replace it with equivalent html encoded character
textItemSB.Append(" " + (listItem == "" ? "•" : listItem) + " " + itemText + "<br/>");
}
}
else if (null != listItem)
{
//If bullet character is found, replace it with equivalent html encoded character
textItemSB.Append(listItem == "" ? " •" : listItem);
}
else
textItemSB.Append("<blank>");
continue;
}
// If element is not a paragraph, it must be a table.
foreach (var row in blockLevelContentElement.LogicalChildrenContent())
{
foreach (var cell in row.LogicalChildrenContent())
{
// Cells are a block-level content container, so can call this method recursively.
OutputBlockLevelContent(wordDoc, cell);
}
}
}
if (textItemSB.Length > 0)
{
textItemList.Add(textItemSB.ToString());
textItemSB.Clear();
}
}
catch (Exception ex)
{
.....
}
}

I got the answer.....
At first I was converting the doc paragraph by paragraph. But if we instead process the doc file sentence by sentence, it is possible to determine whether a sentence contains a bullet or any kind of shape, or whether it is part of a table. Once we have this information, we can convert the sentence appropriately. If someone needs the source code, I can share it.
