Parsing Financial information from HTML - visual-studio-2010

This is my first attempt at working with HTML in Visual Studio and C#. I am using the HTML Agility Pack library to do the parsing.
From this page I am attempting to pull out the numbers from the "Net Income" row for each quarter.
Here is my current progress (I am uncertain how to proceed further):
string url = "http://www.google.com/finance?q=NASDAQ:TXN&fstype=ii";
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var body = document.DocumentNode.Descendants()
    .Where(n => n.Name == "body")
    .FirstOrDefault();
if (body != null)
{
}

Well, first of all there's no need to get the body first, you can directly query the document for what you want. As for finding the value you're looking for, this is how you could do it:
HtmlNode tdNode = document.DocumentNode.DescendantNodes()
    .FirstOrDefault(n => n.Name == "td"
        && n.InnerText.Trim() == "Net Income");
if (tdNode != null)
{
    HtmlNode trNode = tdNode.ParentNode;
    foreach (HtmlNode node in trNode.DescendantNodes().Where(n => n.NodeType == HtmlNodeType.Element))
    {
        Console.WriteLine(node.InnerText.Trim());
    }
    // Output of the whole loop:
    // Net Income
    // 265.00
    // 298.00
    // 601.00
    // 672.00
    // 666.00
}
Also note the Trim calls: the InnerText of some elements contains newlines.

Related

ImportXML "post_name" from an article, having trouble finding proper XPath

I've been having trouble finding an accurate XPath for Google Sheets' IMPORTXML.
Article in question:
https://www.digitaltrends.com/news/this-is-what-a-birthday-party-on-the-iss-looks-like/
Result I'm looking for:
'post_name': 'this-is-what-a-birthday-party-on-the-iss-looks-like'
Using the "Copy full XPath" feature in Chrome's Inspect tool, I'm getting:
/html/head/script[43]/text()
which does not work with Google Sheets' IMPORTXML function. Can someone guide me through how to pull this section of the site?
EDIT: I'm trying to retrieve anything within these parameters, such as post name, post title, and post ID.
The page is built in JavaScript on the client side, not on the server side, so the IMPORTXML function cannot retrieve this information. You need to read what is included in the script instead:
function extract(){
  var url='https://www.digitaltrends.com/news/this-is-what-a-birthday-party-on-the-iss-looks-like/'
  var source = UrlFetchApp.fetch(url).getContentText()
  var data = source.split('<script>')
  //Logger.log(data[3])
  var info = "'post_name" + data[3].split('post_name')[1].split(',')[0]
  Logger.log(info)
}
Now, if you want to retrieve all the information contained in the JSON:
function extract(){
  var url='https://www.digitaltrends.com/news/this-is-what-a-birthday-party-on-the-iss-looks-like/'
  var source = UrlFetchApp.fetch(url).getContentText()
  var data = source.split('<script>')
  //Logger.log(data[3])
  var info = data[3].replace(/(\t)/gm,"").replace(/(\n)/gm,"").replace(/(')/gm,"\"").replace(/(: )/gm,":")
  info = info.split('({')[1].split(',});}')[0]
  //Logger.log(info)
  var myData = JSON.parse('{' + info + '}')
  // JSON.parse already returns an object, so no eval is needed here
  getPairs(myData,'myData')
}
function getPairs(obj,id) {
  const regex = new RegExp('[^0-9]+');
  const fullPath = true
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet()
  for (let p in obj) {
    var newid = (regex.test(p)) ? id + '.' + p : id + '[' + p + ']';
    if (obj[p]!=null){
      if (typeof obj[p] != 'object' && typeof obj[p] != 'function'){
        sheet.appendRow([fullPath?newid:p, obj[p]]);
      }
      if (typeof obj[p] == 'object') {
        getPairs( obj[p], newid );
      }
    }
  }
}

Why can't I compare two fields in a search predicate in Sitecore 7.5?

I am trying to build a search predicate in code that compares two fields in Sitecore and I am getting a strange error message. Basically I have two date fields on each content item - FirstPublishDate (the date that the content item was first published) and LastPublishDate (the last date that the content item was published). I would like to find all content items where the LastPublishDate falls within a certain date range AND where the LastPublishDate does not equal the FirstPublishDate. Using Linq here is my method for generating the predicate...
protected Expression<Func<T, Boolean>> getDateFacetPredicate<T>() where T : MySearchResultItem
{
    var predicate = PredicateBuilder.True<T>();
    foreach (var facet in myFacetCategories)
    {
        var dateTo = System.DateTime.Now;
        var dateFrom = dateTo.AddDays(facet.Value*-1);
        predicate = predicate.And(i => i.LastPublishDate.Between(dateFrom, dateTo, Inclusion.Both)).And(j => j.LastPublishDate != j.FirstPublishDate);
    }
    return predicate;
}
Then I use this predicate in my general site search code to perform the search as follows: the above predicate gets passed in to this method as the "additionalWhere" parameter.
public static SearchResults<T> GeneralSearch<T>(string searchText, ISearchIndex index, int currentPage = 0, int pageSize = 20, string language = "", IEnumerable<string> additionalFields = null,
    Expression<Func<T, Boolean>> additionalWhere = null, Expression<Func<T, Boolean>> additionalFilter = null, IEnumerable<string> facets = null,
    Expression<Func<T, Boolean>> facetFilter = null, string sortField = null, SortDirection sortDirection = SortDirection.Ascending) where T : SearchResultItem {
    using (var context = index.CreateSearchContext()) {
        var query = context.GetQueryable<T>();
        if (!string.IsNullOrWhiteSpace(searchText)) {
            var keywordPred = PredicateBuilder.True<T>();
            // take into account escaping of special characters and working around Sitecore limitation with Contains and Equals methods
            var isSpecialMatch = Regex.IsMatch(searchText, "[" + specialSOLRChars + "]");
            if (isSpecialMatch) {
                var wildcardText = string.Format("\"*{0}*\"", Regex.Replace(searchText, "([" + specialSOLRChars + "])", @"\$1"));
                wildcardText = wildcardText.Replace(" ", "*");
                keywordPred = keywordPred.Or(i => i.Content.MatchWildcard(wildcardText)).Or(i => i.Name.MatchWildcard(wildcardText));
            }
            else {
                keywordPred = keywordPred.Or(i => i.Content.Contains(searchText)).Or(i => i.Name.Contains(searchText));
            }
            if (additionalFields != null && additionalFields.Any()) {
                keywordPred = additionalFields.Aggregate(keywordPred, (current, field) => current.Or(i => i[field].Equals(searchText)));
            }
            //query = query.Where(i => (i.Content.Contains(searchText) || i.Name.Contains(searchText))); // more explicit call to check the content or item name for our term
            query = query.Where(keywordPred);
        }
        if (language == string.Empty) {
            language = Sitecore.Context.Language.ToString();
        }
        if (language != null) {
            query = query.Filter(i => i.Language.Equals(language));
        }
        query = query.Page(currentPage, pageSize);
        if (additionalWhere != null) {
            query = query.Where(additionalWhere);
        }
        if (additionalFilter != null) {
            query = query.Filter(additionalFilter);
        }
        query = query.ApplySecurityFilter();
        FacetResults resultFacets = null;
        if (facets != null && facets.Any()) {
            resultFacets = facets.Aggregate(query, (current, fname) => current.FacetOn(i => i[fname])).GetFacets();
        }
        // calling this before applying facetFilter should allow us to get a total facet set
        // instead of just those related to the current result set
        // var resultFacets = query.GetFacets();
        // apply after getting facets for more complete facet list
        if (facetFilter != null) {
            query = query.Where(facetFilter);
        }
        if (sortField != null)
        {
            if (sortDirection == SortDirection.Ascending)
            {
                query = query.OrderBy(x => x[sortField]);
            }
            else
            {
                query = query.OrderByDescending(x => x[sortField]);
            }
        }
        var results = query.GetResults(); // this enumerates the actual results
        return new SearchResults<T>(results.Hits, results.TotalSearchResults, resultFacets);
    }
}
When I try this I get the following error message:
Server Error in '/' Application.
No constant node in query node of type: 'Sitecore.ContentSearch.Linq.Nodes.EqualNode'. Left: 'Sitecore.ContentSearch.Linq.Nodes.FieldNode'. Right: 'Sitecore.ContentSearch.Linq.Nodes.FieldNode'.
Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.
Exception Details: System.NotSupportedException: No constant node in query node of type: 'Sitecore.ContentSearch.Linq.Nodes.EqualNode'. Left: 'Sitecore.ContentSearch.Linq.Nodes.FieldNode'. Right: 'Sitecore.ContentSearch.Linq.Nodes.FieldNode'.
Source Error:
Line 548: FacetResults resultFacets = null;
Line 549: if (facets != null && facets.Any()) {
Line 550: resultFacets = facets.Aggregate(query, (current, fname) => current.FacetOn(i => i[fname])).GetFacets();
Line 551: }
Line 552: // calling this before applying facetFilter should allow us to get a total facet set
From what I can understand about the error message it seems to not like that I am trying to compare two different fields to each other instead of comparing a field to a constant. The other odd thing is that the error seems to be pointing to a line of code that has to do with aggregating facets. I did a Google search and came up with absolutely nothing relating to this error. Any ideas?
Thanks,
Corey
I think what you are trying to do is not possible, and if you look at this, that might indeed be the case: as the error message suggests, the query expects a constant on one side of the comparison, so one field cannot be compared to another field. A solution that is given there is to put your logic in the index: create a ComputedField that checks your dates and stores a value in the index that you can search on (it can be a simple boolean).
You will need to split your logic, though: the date-range query can still be done in the predicate (as it is relative to the current date), but the comparison of the first and last publish dates should be done at index time instead of at query time.

Can't use == in LINQ Extension Method

I've got the following struct that is the key for my dictionary:
public struct CodeAttribute
{
    public int ProcessorId;
    public Enums.TransactionType transactionType;
    public string ErrorMessage;
}
I've got the following dictionary (one value for now as it's just an example):
var errors = new Dictionary<CodeAttribute, int>
{
    {CreateCodeAttributeList(2, Enums.TransactionType.Order, "Invalid ProcessorId sent in the Payment Request"), 100 }
};
And I'm trying to pull out the item in the dictionary that matches on the struct that has a match for both its ProcessorId and TransactionType properties:
private static string GetRelatedMessage(int errorCode, Dictionary<CodeAttribute, int> errorsList)
{
    CodeAttribute codeAttribute = errorsList.Where(e => e.Key.ProcessorId == _processorId)
        .Where(e => e.Key.transactionType == _transactionType) == errorCode;
    return codeAttribute.ErrorMessage;
}
As a side note, I also want to match on the error code as part of the filtering, not just ProcessorId and transactionType. The item in the dictionary must match all 3 values in order to get the right one in our case.
UPDATE
I tried this as well, and yes, I get the error that it can't convert IEnumerable to CodeAttribute:
CodeAttribute codeAttributes = errorsList.Where(e => e.Key.ProcessorId == _processorId)
    .Where(e => e.Key.transactionType == _transactionType)
    .Where(e => e.Value.Equals(errorCode));
UPDATE
With the help of Sam, I think this may work:
CodeAttribute codeAttribute = errorsList.FirstOrDefault(e => e.Key.ProcessorId ==
    _processorId && e.Key.transactionType == _transactionType
    && e.Value == errorCode).Key;
If I understand correctly, then you want:
var codeAttribute = errorsList.FirstOrDefault(e =>
    e.Key.ProcessorId == _processorId
    && e.Key.transactionType == _transactionType
    && e.Value == errorCode);
// KeyValuePair is a struct, so codeAttribute is never null. If no item matches,
// it is default(KeyValuePair<CodeAttribute, int>) and Key.ErrorMessage is null.
return codeAttribute.Key.ErrorMessage;
Note that codeAttribute will be a KeyValuePair, so you need codeAttribute.Key.ErrorMessage as your return value.
You don't need Where here: it returns an IEnumerable, so it won't work if you want a single item.
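Because a Dictionary enumerates KeyValuePair structs, FirstOrDefault on it can never return null. The following is a minimal sketch of one way to tell "no match" apart from a real entry, by projecting to a nullable pair first. The types are simplified stand-ins: the string-typed TransactionType field and the TryFindMessage helper are assumptions made for illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for the question's key struct; the real code uses
// Enums.TransactionType, replaced here by a string to stay self-contained.
public struct CodeAttribute
{
    public int ProcessorId;
    public string TransactionType;
    public string ErrorMessage;
}

public static class ErrorLookup
{
    // Hypothetical helper: returns true and sets 'message' when an entry matches
    // all three values; returns false and sets 'message' to null otherwise.
    public static bool TryFindMessage(
        Dictionary<CodeAttribute, int> errors,
        int processorId, string transactionType, int errorCode,
        out string message)
    {
        // Casting each pair to a nullable struct lets FirstOrDefault return null
        // when nothing matches, which a plain KeyValuePair cannot represent.
        KeyValuePair<CodeAttribute, int>? match = errors
            .Where(e => e.Key.ProcessorId == processorId
                     && e.Key.TransactionType == transactionType
                     && e.Value == errorCode)
            .Select(e => (KeyValuePair<CodeAttribute, int>?)e)
            .FirstOrDefault();

        message = match.HasValue ? match.Value.Key.ErrorMessage : null;
        return match.HasValue;
    }
}
```

A caller would check the boolean result before using the message, instead of comparing the pair itself to null.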
You probably need to go with something like this:
CodeAttribute codeAttribute = errorsList.FirstOrDefault(e => e.Key.ProcessorId == _processorId && e.Key.transactionType == _transactionType).Key;
While the other answers are correct, I would probably write it like this:
var errorMessage = errorsList
    .Where(e => e.Key.ProcessorId == _processorId
        && e.Key.transactionType == _transactionType
        && e.Value == errorCode)
    .Select(e => e.Key.ErrorMessage)
    .FirstOrDefault();
That is, filter first, select the data I want from that result set, and then take the first result (should one exist) of the transformed data.
Since IEnumerable queries are lazy, this will still stop at the first object that passes the filter.
Since the source is a Dictionary, it may also be prudent to set up a relevant Equals/GetHashCode and structure the code so that it is used.
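The Equals/GetHashCode suggestion could look roughly like the sketch below. It assumes the dictionary is restructured so that the three lookup values form the key and the message becomes the value; ErrorKey, ErrorMessages.Find, and the string-typed TransactionType are hypothetical names introduced here, not taken from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type holding only the values used for lookup; the message
// becomes the dictionary value instead of being part of the key.
public struct ErrorKey : IEquatable<ErrorKey>
{
    public readonly int ProcessorId;
    public readonly string TransactionType; // stand-in for Enums.TransactionType
    public readonly int ErrorCode;

    public ErrorKey(int processorId, string transactionType, int errorCode)
    {
        ProcessorId = processorId;
        TransactionType = transactionType;
        ErrorCode = errorCode;
    }

    public bool Equals(ErrorKey other)
    {
        return ProcessorId == other.ProcessorId
            && TransactionType == other.TransactionType
            && ErrorCode == other.ErrorCode;
    }

    public override bool Equals(object obj)
    {
        return obj is ErrorKey && Equals((ErrorKey)obj);
    }

    public override int GetHashCode()
    {
        unchecked // integer overflow is fine when combining hash codes
        {
            int hash = 17;
            hash = hash * 31 + ProcessorId;
            hash = hash * 31 + (TransactionType == null ? 0 : TransactionType.GetHashCode());
            hash = hash * 31 + ErrorCode;
            return hash;
        }
    }
}

public static class ErrorMessages
{
    // Hash lookup by key instead of scanning every entry; returns null when absent.
    public static string Find(Dictionary<ErrorKey, string> messages,
        int processorId, string transactionType, int errorCode)
    {
        string message;
        var key = new ErrorKey(processorId, transactionType, errorCode);
        return messages.TryGetValue(key, out message) ? message : null;
    }
}
```

With this layout the dictionary does the matching itself, so no LINQ scan over the entries is needed.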

OData webservice querying from VS Service Reference

I want to send a request like this:
/odata.svc/Pages(ItemId=27,PublicationId=1)
Here's the code I'm using:
CdService.ContentDeliveryService cdService = new ContentDeliveryService(new Uri("http://xxx.xx:81/odata.svc"));
var pages = cdService.Pages;
pages = pages.AddQueryOption("ItemId", "270");
pages = pages.AddQueryOption("PublicationId", "2");
var result = pages.Execute();
My problem is that this code sends a request like this:
/odata.svc/Pages()?ItemId=270&PublicationId=2
This request returns all the pages there are, not just the one I need.
I could use LINQ:
result.Single(page => page.ItemId == 27 && page.PublicationId == 1);
But then all the pages would still be sent over the wire.
I've done a quick test with LINQ and it seems to be doing the correct query:
ContentDeliveryService.ContentDeliveryService service =
    new ContentDeliveryService.ContentDeliveryService(new Uri("http://localhost:99/odata.svc"));
var page = from x in service.Pages
           where x.ItemId == 2122
              && x.PublicationId == 16
           select x;
foreach (var page1 in page)
{
    Console.WriteLine(page1.Title);
}
Console.Read();
You can try this:
EntityDescriptor entityDescriptor = service.Entities.Where(c =>
    c.Entity is CDService.Page
    && ((CDService.Page)c.Entity).ItemId == pageId.ItemId
    && ((CDService.Page)c.Entity).PublicationId == pageId.PublicationId)
    .FirstOrDefault();
if (entityDescriptor != null)
{
    return (CDService.Page)entityDescriptor.Entity;
}
I have found a solution, although not very nice:
ContentDeliveryService cdService1
    = new ContentDeliveryService(new Uri("http://xxx.xx:81/odata.svc"));
var page = cdService1.Execute<Page>(
    new Uri("http://xxx.xx:81/odata.svc/Pages(ItemId=27,PublicationId=1)"));
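If the hard-coded address in that last workaround is the main objection, the key-lookup URI can be assembled from its parts. This is only a sketch under stated assumptions: ODataKeyUri.Build is a hypothetical helper, it handles integer key values only, and it does no escaping of the entity set or key names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class ODataKeyUri
{
    // Hypothetical helper: builds "<service root>/<EntitySet>(Key1=1,Key2=2)".
    // Only integer key values are supported in this sketch.
    public static Uri Build(Uri serviceRoot, string entitySet,
        IEnumerable<KeyValuePair<string, int>> keys)
    {
        // Format each key as Name=Value, culture-independently, and join with commas.
        string keyPredicate = string.Join(",",
            keys.Select(k => k.Key + "=" + k.Value.ToString(CultureInfo.InvariantCulture)));

        // Avoid a doubled slash if the service root already ends with one.
        string root = serviceRoot.AbsoluteUri.TrimEnd('/');
        return new Uri(root + "/" + entitySet + "(" + keyPredicate + ")");
    }
}
```

The resulting Uri could then be passed to Execute<Page> in place of the literal string used above.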

C#/ Html agility pack, is there a more eloquent way to screen scrape?

I'm working on an app in C# that gathers web data from a few different pages daily and saves it in SQL Server. I'm using html agility pack... at the moment I have an xpath for each field/ column in the database. There are 62 columns in the table, and with checking for proper values and formatting, the code below is VERY verbose and repetitive (specifically, xpath expressions and associated blocks). I was wondering if there was a nicer, more concise way, perhaps using LINQ? (which I haven't used much yet but would like to) Here's just the first couple fields set below, this repeats .... 62 cols. I'm not looking for a rewrite, just any suggestions I can get.
List<IDataPoint> list = new List<IDataPoint>();
HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmlDoc = hwObject.Load(AddressString);
if (htmlDoc.DocumentNode != null && !htmlDoc.DocumentNode.InnerHtml.Contains("There is no key statistics data available"))
{
var symbolNode = htmlDoc.DocumentNode.SelectSingleNode("/html/body/div[3]/div[4] /div/div/div/div/div/div/h2");
if (symbolNode != null)
{
KeyStatsDP keyStatsDp = new KeyStatsDP();
String symb = "";
symb = symbolNode.InnerHtml;
symb = symb.Substring(symb.LastIndexOf("(") + 1);
symb = symb.Substring(0, symb.Length - 1);
keyStatsDp.Symbol = symb;
String mktCapXPath = "//*[@id=\"yfs_j10_" + symb.ToLower() + "\"]";
var mktCapNode = htmlDoc.DocumentNode.SelectSingleNode(mktCapXPath);
if (mktCapNode != null)
{
String mktCap = mktCapNode.InnerHtml;
keyStatsDp.MarketCapIntraDay = ConvertMoneyInStrToInt(mktCap);
}
var entValNode = htmlDoc.DocumentNode.SelectSingleNode("//html/body/div[3]/div[4]/table[2]/tr[2]/td/table[2]/tr/td/table/tr[2]/td[2]");
if (entValNode != null)
{
if (!entValNode.InnerHtml.Contains("N"))
{
String entVal = entValNode.InnerHtml;
keyStatsDp.EntValue = ConvertMoneyInStrToInt(entVal);
}
}
