When I scrape the following website, I don't get the table I want to scrape. I wait for the dynamic text to load, but the correct table never shows up in the results.
https://masseyratings.com/nba/games
Here is my Agility Pack code:
var url = "https://masseyratings.com/nba/games";
HtmlWeb web = new HtmlWeb();
var doc = web.LoadFromBrowser(url, o =>
{
var webBrowser = (WebBrowser)o;
// WAIT until the dynamic text is set
return !string.IsNullOrEmpty(webBrowser.Document.GetElementById("mytable0").InnerText);
});
int docLen = doc.Text.Length;
currentSiteData = doc.Text.ToString();
I am not getting any error; I am just not seeing the table of data. Strangely, the HTML tags also come back capitalized.
How can I get the correct data into the currentSiteData variable for further processing?
I was able to fix the problem by using the "PuppeteerSharp" and "AngleSharp" NuGet packages.
Here is the code that works:
using PuppeteerSharp;
using AngleSharp;
var browserFetcher = new BrowserFetcher();
await browserFetcher.DownloadAsync(BrowserFetcher.DefaultChromiumRevision);
var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
Headless = true
});
var page = await browser.NewPageAsync();
await page.GoToAsync("https://masseyratings.com/nba/games");
var content = await page.GetContentAsync();
var context = BrowsingContext.New(AngleSharp.Configuration.Default);
var document = await context.OpenAsync(req => req.Content(content));
var currentSiteData = document.Source.Text.ToString();
I'm trying to get the src values for all img tags in an HTML page using Chromeless. My current implementation is something like this:
async function run() {
const chromeless = new Chromeless();
let url = 'http://someurl/somepath.html';
var allImgUrls = await chromeless
.goto(url)
.evaluate(() => document.getElementsByTagName('img'));
var htmlContent = await chromeless
.goto(url)
.evaluate(() => document.documentElement.outerHTML );
console.log(allImgUrls);
await chromeless.end()
}
The issue is that allImgUrls does not contain any of the img objects.
After some research, I found that this approach works:
var imgSrcs = await chromeless
.goto(url)
.evaluate(() => {
// document.querySelectorAll returns a NodeList (array-like), not a real array,
// so we borrow map from Array.prototype via [].map.call()
const srcs = [].map.call(document.querySelectorAll('img'), img => img.src);
return JSON.stringify(srcs);
});
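A likely reason the first attempt comes back empty is that the value returned from evaluate has to be serialized on its way out of the browser, and DOM elements do not survive that step, while plain strings do. The sketch below illustrates the idea outside a browser. The nodeListLike object and the collectSrcs helper are made up for illustration; they only stand in for what document.querySelectorAll('img') would return.

```javascript
// A NodeList-like value: indexed entries plus a length, but no Array methods.
// (Hypothetical stand-in for document.querySelectorAll('img').)
const nodeListLike = {
  0: { tagName: 'IMG', src: 'http://someurl/a.png' },
  1: { tagName: 'IMG', src: 'http://someurl/b.png' },
  length: 2,
};

// Array.from accepts any array-like and applies the mapper to each entry,
// which has the same effect as [].map.call(arrayLike, mapper).
function collectSrcs(arrayLike) {
  return Array.from(arrayLike, (img) => img.src);
}

const srcs = collectSrcs(nodeListLike);

// Plain strings round-trip through JSON, which is why returning them works.
const wire = JSON.stringify(srcs);
console.log(JSON.parse(wire));
```

Inside evaluate the same mapping would run against the real NodeList, so only the src strings cross the boundary.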
private async void refresh_Tapped(object sender, TappedRoutedEventArgs e)
{
httpclient.CancelPendingRequests();
string url = "http://gensav.altervista.org/";
var source = await httpclient.GetStringAsync(url); //PROBLEM
source = WebUtility.HtmlDecode(source);
HtmlDocument result = new HtmlDocument();
result.LoadHtml(source);
List<HtmlNode> toftitle = result.DocumentNode.Descendants().Where
(x => (x.Attributes["style"] != null
&& x.Attributes["style"].Value.Contains("font-size:14px;line-height:20px;margin-bottom:10px;"))).ToList();
var li = toftitle[0].InnerHtml.Replace("<br>", "\n");
li = li.Replace("<span style=\"text-transform: uppercase\">", "");
li = li.Replace("</span>", "");
postTextBlock.Text = li;
}
This code retrieves a string from a website (the HTML source, which is parsed right afterwards). It runs whenever I click a button. The first click works correctly, but on the second click I think GetStringAsync returns an uncompleted task and execution continues with the old value of source. As a result, my TextBlock does not update.
Any solution?
You are probably getting a cached response.
Maybe this will work for you:
httpclient.CancelPendingRequests();
// disable caching
httpclient.DefaultRequestHeaders.Add("Cache-Control", "no-cache");
string url = "http://gensav.altervista.org/";
var source = await httpclient.GetStringAsync(url);
...
You can also append a throwaway value to your URL so that every request looks unique:
string url = "http://gensav.altervista.org/" + "?nocache=" + Guid.NewGuid();
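The cache-busting trick is not specific to C#. As a rough, language-neutral illustration, here is the same idea in JavaScript. The withCacheBuster name, the default parameter name, and the way the unique value is built are all assumptions made for this sketch; any value that changes on every call would do.

```javascript
// Counter guarantees uniqueness even for two calls in the same millisecond.
let cacheBusterCounter = 0;

// Returns the URL with a unique throwaway query parameter appended.
// Existing query parameters are preserved.
function withCacheBuster(rawUrl, paramName = 'nocache') {
  const url = new URL(rawUrl);
  cacheBusterCounter += 1;
  const unique = `${Date.now().toString(36)}-${cacheBusterCounter}`;
  url.searchParams.set(paramName, unique);
  return url.toString();
}

console.log(withCacheBuster('http://gensav.altervista.org/'));
```

Using the URL class instead of string concatenation avoids producing a second "?" when the address already has a query string.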
To prevent HTTP responses from being cached, I do this (in WP8.1):
HttpBaseProtocolFilter filter = new HttpBaseProtocolFilter();
filter.CacheControl.ReadBehavior =
Windows.Web.Http.Filters.HttpCacheReadBehavior.MostRecent;
filter.CacheControl.WriteBehavior =
Windows.Web.Http.Filters.HttpCacheWriteBehavior.NoCache;
_httpClient = new HttpClient(filter);
Initialize your HttpClient this way to prevent caching.
I am trying to match URLs based on a given id, but I don't know whether that id is an object id, a title, or a short URL code.
var promises = [];
var id = request.params.id;
//short url
var queryShortUrl = new Parse.Query("books");
queryShortUrl.limit(1); // limit is a method, not a property
queryShortUrl.equalTo('shortURLcode',id);
promises.push(queryShortUrl.find());
//title
var queryTitle = new Parse.Query("books");
queryTitle.limit(1); // limit is a method, not a property
queryTitle.equalTo('bookTitle',id);
promises.push(queryTitle.find()); // was queryNickname, which is never defined
//objectId
queryObjectId = new Parse.Query("books");
promises.push(queryObjectId.get(request.params.genuid));
// run these
Parse.Promise.when(promises).then(function() {
console.log('res');
response.success(arguments);
})
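One way to think about the "which kind of id is this" problem is to run all the lookups in parallel and take the first one that returns a match. The sketch below uses plain promises, not the Parse SDK. The in-memory books array and the findByShortUrl, findByTitle, getByObjectId and resolveBook functions are hypothetical stand-ins for the three queries above. It assumes a lookup by object id may reject when nothing matches, which is why Promise.allSettled is used in place of a combinator that fails on the first rejection.

```javascript
// Hypothetical in-memory data standing in for the "books" class.
const books = [
  { objectId: 'abc123', bookTitle: 'Dune', shortURLcode: 'dn' },
];

// Stand-ins for the three queries; each returns a promise.
const findByShortUrl = async (id) => books.filter((b) => b.shortURLcode === id);
const findByTitle = async (id) => books.filter((b) => b.bookTitle === id);
const getByObjectId = async (id) => {
  const book = books.find((b) => b.objectId === id);
  if (!book) throw new Error('Object not found'); // assumed rejection on a miss
  return book;
};

// Runs every lookup in parallel and returns the first match, or null.
async function resolveBook(id) {
  const settled = await Promise.allSettled([
    findByShortUrl(id),
    findByTitle(id),
    getByObjectId(id),
  ]);
  for (const outcome of settled) {
    if (outcome.status !== 'fulfilled') continue; // ignore failed lookups
    const matches = [].concat(outcome.value); // accept a single object or an array
    if (matches.length > 0) return matches[0];
  }
  return null;
}

resolveBook('Dune').then((book) => console.log(book));
```

The order of the array passed to Promise.allSettled decides which lookup wins when more than one matches.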
How can I get the RadGrid insert item on the client side?
I have used the following code, but it's not working.
var mode = rgBoxLimits.get_isItemInserted();
var insertItems;
var dpToDate;
if (mode) {
insertItems= rgBoxLimits.get_insertItem();
dpToDate = insertItems[0].findElement("dtToDate"); //Not working
}
For edit items, I have the following code, and it works fine.
var editedItems = rgBoxLimits.get_editItems();
var dpToDate = editedItems[0].findElement("dtToDate");
The problem is that you are calling get_isItemInserted() and get_insertItem() on a RadGrid object, but they are methods of the GridTableView object. See the RadGrid documentation for more info.
Try this:
function getItems(sender, args) {
var myRadGrid = document.getElementById("MainContent_RadGrid1");
var grid = window.$find("MainContent_RadGrid1");
var mode = grid.get_masterTableView().get_isItemInserted();
mode = true;
var insertItems;
var dpToDate;
if (mode) {
insertItems = grid.get_masterTableView().get_insertItem();
dpToDate = insertItems[0].findElement("dtToDate"); //Not working
}
}
I'm trying to create an AtomPub service with ASP.NET Web API. Everything works, but when I try to post an image from Windows Live Writer I get the error "The blog doesn't allow the image load". I have been reading the IETF doc.
My services controller code:
public class ServicesController : ApiController
{
public HttpResponseMessage Get()
{
var serviceDocument = new ServiceDocument();
var workSpace = new Workspace
{
Title = new TextSyndicationContent("Nicoloco Site"),
BaseUri = new Uri(Request.RequestUri.GetLeftPart(UriPartial.Authority))
};
var posts = new ResourceCollectionInfo("Nicoloco Blog",
new Uri(Url.Link("DefaultApi", new { controller = "blogapi" })));
posts.Accepts.Add("application/atom+xml;type=entry");
var images = new ResourceCollectionInfo("Images Blog",
new Uri(Url.Link("DefaultApi", new { controller = "images" })));
images.Accepts.Add("image/png");
images.Accepts.Add("image/jpeg");
images.Accepts.Add("image/jpg");
images.Accepts.Add("image/gif");
var categoriesUri = new Uri(Url.Link("DefaultApi", new { controller = "tags", format = "atomcat" }));
var categories = new ReferencedCategoriesDocument(categoriesUri);
posts.Categories.Add(categories);
workSpace.Collections.Add(posts);
workSpace.Collections.Add(images);
serviceDocument.Workspaces.Add(workSpace);
var response = new HttpResponseMessage(HttpStatusCode.OK);
var formatter = new AtomPub10ServiceDocumentFormatter(serviceDocument);
var stream = new MemoryStream();
using (var writer = XmlWriter.Create(stream))
{
formatter.WriteTo(writer);
}
stream.Position = 0;
var content = new StreamContent(stream);
response.Content = content;
response.Content.Headers.ContentType = new MediaTypeHeaderValue("application/atomsvc+xml");
return response;
}
}
The HTTP GET request generates the following XML:
<?xml version="1.0" encoding="utf-8"?>
<app:service
xmlns:a10="http://www.w3.org/2005/Atom"
xmlns:app="http://www.w3.org/2007/app">
<app:workspace xml:base="http://localhost:53644/">
<a10:title type="text">Nicoloco Site</a10:title>
<app:collection href="http://localhost:53644/api/blogapi">
<a10:title type="text">Nicoloco Blog</a10:title>
<app:accept>application/atom+xml;type=entry</app:accept>
<app:categories href="http://localhost:53644/api/tags?format=atomcat" />
</app:collection>
<app:collection href="http://localhost:53644/api/images">
<a10:title type="text">Images Blog</a10:title>
<app:accept>image/png</app:accept>
<app:accept>image/jpeg</app:accept>
<app:accept>image/jpg</app:accept>
<app:accept>image/gif</app:accept>
</app:collection>
</app:workspace>
</app:service>
But I can't publish images using this service.
Best regards.
I found my error in the categories line: the WLW log file shows a malformed XML error on that line. I removed it and everything works fine for me. This blog post explains how WLW works with image files.
If anybody has any comments, I'll be grateful.