Why is HtmlAgilityPack's LoadFromBrowser not loading dynamic text? - html-agility-pack

When scraping the following website, I cannot get the table I want to scrape. I wait for the dynamic text to load, but the loaded document never contains the expected table.
https://masseyratings.com/nba/games
Here is my Agility Pack code:
var url = "https://masseyratings.com/nba/games";
HtmlWeb web = new HtmlWeb();
var doc = web.LoadFromBrowser(url, o =>
{
var webBrowser = (WebBrowser)o;
// WAIT until the dynamic text is set
return !string.IsNullOrEmpty(webBrowser.Document.GetElementById("mytable0").InnerText);
});
int docLen = doc.Text.Length;
currentSiteData = doc.Text.ToString();
I am not getting any error; I am just not seeing the table of data. Strangely, the HTML tags are also getting capitalized.
How can I get the correct data into the currentSiteData variable for further processing?

I was able to fix the problem by using the PuppeteerSharp and AngleSharp NuGet packages.
Here is the code that works:
using PuppeteerSharp;
using AngleSharp;
var browserFetcher = new BrowserFetcher();
await browserFetcher.DownloadAsync(BrowserFetcher.DefaultChromiumRevision);
var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
Headless = true
});
var page = await browser.NewPageAsync();
await page.GoToAsync("https://masseyratings.com/nba/games");
var content = await page.GetContentAsync();
var context = BrowsingContext.New(AngleSharp.Configuration.Default);
var document = await context.OpenAsync(req => req.Content(content));
var currentSiteData = document.Source.Text.ToString();
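The code above reads the page content right after navigation. If the table is filled in by a script some time later, an explicit wait may still be needed (PuppeteerSharp offers page.WaitForSelectorAsync for that purpose). The wait-until-ready idea from the question can be sketched in plain JavaScript as a polling helper. This is only an illustration: waitUntil, its option names, and fakePage are hypothetical and are not part of HtmlAgilityPack or PuppeteerSharp.

```javascript
// Poll `predicate` every `intervalMs` until it returns a truthy value
// or `timeoutMs` elapses. The helper and its option names are hypothetical.
function waitUntil(predicate, { timeoutMs = 5000, intervalMs = 50 } = {}) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    const tick = () => {
      let result;
      try {
        result = predicate();
      } catch (err) {
        reject(err);
        return;
      }
      if (result) {
        resolve(result);
      } else if (Date.now() >= deadline) {
        reject(new Error('Timed out waiting for condition'));
      } else {
        setTimeout(tick, intervalMs);
      }
    };
    tick();
  });
}

// Stand-in for a page whose table text is set by a script after load.
const fakePage = { tableText: '' };
setTimeout(() => { fakePage.tableText = 'Team A 101, Team B 99'; }, 120);

// Resolves with the table text once it is no longer empty.
const tableReady = waitUntil(() => fakePage.tableText, { timeoutMs: 2000 });
tableReady.then(text => console.log('table loaded:', text));
```

The timeout matters: without it, a page that never renders the table would leave the caller waiting forever.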

Related

Chromeless - get all images src from a webpage

I'm trying to get the src values for all img tags in an HTML page using Chromeless. My current implementation is something like this:
async function run() {
const chromeless = new Chromeless();
let url = 'http://someurl/somepath.html';
var allImgUrls = await chromeless
.goto(url)
.evaluate(() => document.getElementsByTagName('img'));
var htmlContent = await chromeless
.goto(url)
.evaluate(() => document.documentElement.outerHTML );
console.log(allImgUrls);
await chromeless.end()
}
The issue is that allImgUrls does not contain any of the img elements' values.
After some research, I found that this approach works:
var imgSrcs = await chromeless
.goto(url)
.evaluate(() => {
// document.querySelectorAll returns a NodeList (array-like), not an array,
// so we borrow map from Array.prototype, which is what [].map.call() does
const srcs = [].map.call(document.querySelectorAll('img'), img => img.src);
return JSON.stringify(srcs);
});
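The first attempt presumably fails because the result of evaluate has to be passed from the browser back to the Node process, and DOM elements do not survive that trip, while plain strings do. The mapping step can be tried outside a browser with a minimal sketch like the one below. It assumes an array-like stand-in for the NodeList; fakeNodeList and its URLs are made up for illustration.

```javascript
// Array-like stand-in for the NodeList that document.querySelectorAll('img')
// would return in a browser. The URLs are hypothetical.
const fakeNodeList = {
  0: { tagName: 'IMG', src: 'http://example.com/a.png' },
  1: { tagName: 'IMG', src: 'http://example.com/b.jpg' },
  length: 2,
};

// Borrow Array.prototype.map to walk the array-like object,
// keeping only the src strings, which serialize cleanly.
const srcs = Array.prototype.map.call(fakeNodeList, img => img.src);
const serialized = JSON.stringify(srcs);
console.log(serialized);

// The caller can turn the string back into an array.
const roundTripped = JSON.parse(serialized);
```

In a modern browser, Array.from(document.querySelectorAll('img'), img => img.src) does the same mapping.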

Issue with HttpClient.GetStringAsync method in windows phone 8.1

private async void refresh_Tapped(object sender, TappedRoutedEventArgs e)
{
httpclient.CancelPendingRequests();
string url = "http://gensav.altervista.org/";
var source = await httpclient.GetStringAsync(url); //PROBLEM
source = WebUtility.HtmlDecode(source);
HtmlDocument result = new HtmlDocument();
result.LoadHtml(source);
List<HtmlNode> toftitle = result.DocumentNode.Descendants().Where
(x => (x.Attributes["style"] != null
&& x.Attributes["style"].Value.Contains("font-size:14px;line-height:20px;margin-bottom:10px;"))).ToList();
var li = toftitle[0].InnerHtml.Replace("<br>", "\n");
li = li.Replace("<span style=\"text-transform: uppercase\">", "");
li = li.Replace("</span>", "");
postTextBlock.Text = li;
}
This code retrieves a string from a website (the HTML source, which is parsed right after). It runs whenever I click a button. The first click works correctly, but on the second click I think GetStringAsync returns an uncompleted task and execution continues with the old value of source. As a result, my TextBlock does not update.
Is there a solution?
You are probably getting a cached response.
This may work for you:
httpclient.CancelPendingRequests();
// disable caching
httpclient.DefaultRequestHeaders.Add("Cache-Control", "no-cache");
string url = "http://gensav.altervista.org/";
var source = await httpclient.GetStringAsync(url);
...
You can also append a meaningless value to your URL, like this:
string url = "http://gensav.altervista.org/" + "?nocache=" + Guid.NewGuid();
To prevent HTTP responses from being cached, I do this (in WP8.1):
HttpBaseProtocolFilter filter = new HttpBaseProtocolFilter();
filter.CacheControl.ReadBehavior =
Windows.Web.Http.Filters.HttpCacheReadBehavior.MostRecent;
filter.CacheControl.WriteBehavior =
Windows.Web.Http.Filters.HttpCacheWriteBehavior.NoCache;
_httpClient = new HttpClient(filter);
Initialize your HttpClient in this manner to prevent caching behaviour.
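The query-string trick works because a cache treats every distinct URL as a separate resource. A minimal JavaScript sketch of the same idea follows, assuming a Node.js runtime. The helper name withCacheBuster and the default parameter name are hypothetical. Using the URL API rather than string concatenation keeps any query parameters that are already present.

```javascript
const { randomUUID } = require('crypto');

// Return a copy of rawUrl with a unique throwaway query parameter,
// so that each request uses a URL the cache has never seen.
function withCacheBuster(rawUrl, paramName = 'nocache') {
  const url = new URL(rawUrl);
  url.searchParams.set(paramName, randomUUID());
  return url.toString();
}

const first = withCacheBuster('http://gensav.altervista.org/');
const second = withCacheBuster('http://gensav.altervista.org/');
console.log(first);
console.log(second);
```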

Any idea why the Parse Cloud Code parallel promises below do not run?

I am trying to match URLs based on the given id, without knowing whether that id is an object ID, a title, or a short URL code.
var promises = [];
var id = request.params.id;
//short url
queryShortUrl = new Parse.Query("books");
queryShortUrl.limit = 1;
queryShortUrl.equalTo('shortURLcode',id);
promises.push(queryShortUrl.find());
//title
queryTitle = new Parse.Query("books");
queryTitle.limit = 1;
queryTitle.equalTo('bookTitle',id);
promises.push(queryNickname.find());
//objectId
queryObjectId = new Parse.Query("books");
promises.push(queryObjectId.get(request.params.genuid));
// run these
Parse.Promise.when(promises).then(function() {
console.log('res');
response.success(arguments);
})
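Two details in the snippet above stand out. The title query is stored in queryTitle, but the code pushes queryNickname.find(), a variable that is never defined. The last lookup also reads request.params.genuid, while the others use request.params.id. Separately, a get-style lookup normally rejects when no object matches, and the then call has no error handler, so a single failed lookup could hide the other results. Below is a minimal sketch of running the three lookups in parallel while tolerating individual failures. It uses standard JavaScript promises rather than Parse.Promise, and the in-memory books array, findBy, getById, and resolveBook are hypothetical stand-ins for the real queries.

```javascript
// Hypothetical in-memory stand-in for the "books" class.
const books = [
  { objectId: 'abc123', bookTitle: 'Dune', shortURLcode: 'dn1' },
  { objectId: 'xyz789', bookTitle: 'Emma', shortURLcode: 'em2' },
];

// find-style lookup: resolves with an array holding at most one match.
function findBy(field, value) {
  return Promise.resolve(books.filter(b => b[field] === value).slice(0, 1));
}

// get-style lookup: rejects when nothing matches.
function getById(id) {
  const book = books.find(b => b.objectId === id);
  return book ? Promise.resolve(book) : Promise.reject(new Error('Object not found'));
}

// Run all three lookups in parallel and return the first match, or null.
async function resolveBook(id) {
  const results = await Promise.allSettled([
    findBy('shortURLcode', id),
    findBy('bookTitle', id),
    getById(id),
  ]);
  for (const r of results) {
    if (r.status !== 'fulfilled') continue; // skip failed lookups
    const match = Array.isArray(r.value) ? r.value[0] : r.value;
    if (match) return match;
  }
  return null;
}

const lookups = Promise.all([
  resolveBook('em2'),    // matches by short URL code
  resolveBook('Dune'),   // matches by title
  resolveBook('abc123'), // matches by object id
  resolveBook('nope'),   // matches nothing
]);
lookups.then(found => console.log(found.map(b => b && b.objectId)));
```

Promise.allSettled waits for every lookup and reports each outcome separately, so a rejected lookup does not prevent the successful ones from being used.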

Radgrid insertItem from client side

How can I get the RadGrid insert item from the client side?
I have used the following code, but it's not working.
var mode = rgBoxLimits.get_isItemInserted();
var insertItems;
var dpToDate;
if (mode) {
insertItems= rgBoxLimits.get_insertItem();
dpToDate = insertItems[0].findElement("dtToDate"); //Not working
}
For edit items, I have the following code, and it works fine.
var editedItems = rgBoxLimits.get_editItems();
var dpToDate = editedItems[0].findElement("dtToDate");
The problem is that you are calling the get_isItemInserted() and get_insertItem() methods on a RadGrid object, while they are methods of a GridTableView object. See the RadGrid documentation for more info.
Try this:
function getItems(sender, args) {
var myRadGrid = document.getElementById("MainContent_RadGrid1");
var grid = window.$find("MainContent_RadGrid1");
var mode = grid.get_masterTableView().get_isItemInserted();
mode = true;
var insertItems;
var dpToDate;
if (mode) {
insertItems = grid.get_masterTableView().get_insertItem();
dpToDate = insertItems[0].findElement("dtToDate"); //Not working
}
}

Allow images in AtomPub ASPNET Web Api Server

I'm trying to create an AtomPub service with ASP.NET Web API. Everything works, but when I try to post an image from Windows Live Writer I get the error "The blog doesn't allow the image load". I have been reading the IETF document.
My services controller code:
public class ServicesController : ApiController
{
public HttpResponseMessage Get()
{
var serviceDocument = new ServiceDocument();
var workSpace = new Workspace
{
Title = new TextSyndicationContent("Nicoloco Site"),
BaseUri = new Uri(Request.RequestUri.GetLeftPart(UriPartial.Authority))
};
var posts = new ResourceCollectionInfo("Nicoloco Blog",
new Uri(Url.Link("DefaultApi", new { controller = "blogapi" })));
posts.Accepts.Add("application/atom+xml;type=entry");
var images = new ResourceCollectionInfo("Images Blog",
new Uri(Url.Link("DefaultApi", new { controller = "images" })));
images.Accepts.Add("image/png");
images.Accepts.Add("image/jpeg");
images.Accepts.Add("image/jpg");
images.Accepts.Add("image/gif");
var categoriesUri = new Uri(Url.Link("DefaultApi", new { controller = "tags", format = "atomcat" }));
var categories = new ReferencedCategoriesDocument(categoriesUri);
posts.Categories.Add(categories);
workSpace.Collections.Add(posts);
workSpace.Collections.Add(images);
serviceDocument.Workspaces.Add(workSpace);
var response = new HttpResponseMessage(HttpStatusCode.OK);
var formatter = new AtomPub10ServiceDocumentFormatter(serviceDocument);
var stream = new MemoryStream();
using (var writer = XmlWriter.Create(stream))
{
formatter.WriteTo(writer);
}
stream.Position = 0;
var content = new StreamContent(stream);
response.Content = content;
response.Content.Headers.ContentType = new MediaTypeHeaderValue("application/atomsvc+xml");
return response;
}
}
The HTTP GET request generates the following XML:
<?xml version="1.0" encoding="utf-8"?>
<app:service
xmlns:a10="http://www.w3.org/2005/Atom"
xmlns:app="http://www.w3.org/2007/app">
<app:workspace xml:base="http://localhost:53644/">
<a10:title type="text">Nicoloco Site</a10:title>
<app:collection href="http://localhost:53644/api/blogapi">
<a10:title type="text">Nicoloco Blog</a10:title>
<app:accept>application/atom+xml;type=entry</app:accept>
<app:categories href="http://localhost:53644/api/tags?format=atomcat" />
</app:collection>
<app:collection href="http://localhost:53644/api/images">
<a10:title type="text">Images Blog</a10:title>
<app:accept>image/png</app:accept>
<app:accept>image/jpeg</app:accept>
<app:accept>image/jpg</app:accept>
<app:accept>image/gif</app:accept>
</app:collection>
</app:workspace>
</app:service>
But I can't publish images using this service.
Best regards.
I found my error in the "categories" line: the WLW log file shows a malformed XML error on that line. I removed it and everything works fine for me. This blog post explains how WLW works with image files.
If anybody has any comments, I'd be grateful.
