My background:
I am a newbie when it comes to HTML scrubbing. It has been about four years since my only work coding with C# for HTML. My other C# coding, equally long ago, was building forms to manipulate data in SQL Server databases.
What I have done to try to get started with HTML Agility Pack (HAP):
I have spent several days trying to make sense of instructions found from various online sources about how to get started with HTML Agility Pack. Some of what I have found so far is listed below:
www.4guysfromrolla.com/articles/011211-1.aspx
olussier.net/2010/03/30/easily-parse-html-documents-in-csharp/
stackoverflow.com/questions/846994/how-to-use-html-agility-pack
shatalov.su/en/articles/web/parser_1.php
still more referred to below...
My Results so far:
I have found the material to be quite confusing with each source seeming to tell me something different. All my attempts have come to dead ends.
So that you can efficiently sort out my confusion and reply to my specific situation, I will describe my project, my environment and my questions in the three sections below.
My Project
I am tasked with creating a process to scrub data from HTML files. I know the files well. The files will reside in a folder on the local machine's file system. The HTML file(s) will be created elsewhere by a process we do not own and will be placed in that local folder. (FYI - Though it is not a part of my question, I expect to create a project or app that will be run on a schedule to perform the scrubbing task and then input the collected data into a database table.)
My Environment
As stated above the html file(s) to be processed will reside on the local machine.
I have newly installed Visual Studio 2010 Professional on this machine to code for this project.
The HTML Agility Pack is now accessible to this machine on a file share.
Under REGEDIT: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP the following are listed, indicating the versions of the .NET Framework installed on this machine:
CDF
V2.0.50727
V3.0
V3.5
V4
V4.0
My Questions
1.) I am told by some sites to download HTML Agility Pack and to use the file "HtmlAgilityPack.dll," however the zip file contains nine folders, each with a different copy of this file. Which one do I want?
Here are the names of the folders;
Net20
Net40
Net40-client
Net45
sl3-wp
sl4
sl4-windowsphone71
sl5
winrt45
2.) An answer to a forum question “How to I use the HTML Agility Pack” at stackoverflow.com/questions/846994/how-to-use-html-agility-pack instructs the questioner to “Download and build the HTML Agility Pack Solution”, and directs the questioner to the site htmlagilitypack.codeplex.com which then has a link to nuget.org/packages/HtmlAgilityPack which says to ‘install’ the HTMLAgilityPack by running the command “PM> Install-Package HtmlAgilityPack” in the “Package Manager Console”
What does all this mean? Other sites say to put the dll in the bin folder. What is that telling me to do?
Please explain in more detail to get me started.
3.) Assuming I am using C# what kind of project should I create?
4.) Please direct me to any other resources that you believe are applicable to my project.
Looks like you can create a .NET 4.0 project, given the .NET framework versions you have installed on your machine. What type of project depends on how you'd like your application to run. I'd personally opt for creating a C# Class Library project that contains the load html and scrub code and then host that in whatever mechanism you want to use to actually open the files.
To open a file from FileSystem, either use File.OpenRead or File.ReadAllText from System.IO.File. You can pass the stream or the file contents to the HtmlDocument.Load/LoadHtml methods.
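// Requires: using System.IO; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
// Option 1: read the whole file into a string
string contents = File.ReadAllText("PathToFileName");
doc.LoadHtml(contents);
// Option 2: load from a stream (named "stream" so it does not clash with "contents" above)
using (var stream = File.OpenRead("PathToFileName"))
{
    doc.Load(stream);
}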
Possibilities for hosting are plentiful: a Console Application (can be invoked from the command line or through the Task Scheduler), a Windows Service (is loaded by Windows, runs in the background even when nobody is logged on to the machine, and can potentially use the FileSystemWatcher to automatically pick up the files), or a Windows Forms/WPF application, which lets the user select the files to process and then shows the results somehow.
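To make the "class library plus host" split a bit more concrete, here is a minimal sketch of the folder loop a console host might run. The names ScrubHost and ProcessFolder are made up for illustration and are not part of the HTML Agility Pack; the actual scrub code is passed in as a callback so the loop does not need to know anything about it.

```csharp
using System;
using System.IO;

// Hypothetical host helper; the names are illustrative only.
public static class ScrubHost
{
    // Calls processFile for every *.html file directly inside folder
    // and returns how many files were processed without an exception.
    public static int ProcessFolder(string folder, Action<string> processFile)
    {
        int processed = 0;
        foreach (string path in Directory.GetFiles(folder, "*.html"))
        {
            try
            {
                processFile(path);
                processed++;
            }
            catch (Exception ex)
            {
                // One bad file should not stop a scheduled run.
                Console.Error.WriteLine("Failed to process {0}: {1}", path, ex.Message);
            }
        }
        return processed;
    }
}
```

A console application's Main could then call something like ScrubHost.ProcessFolder(args[0], path => scrubber.Scrub(path)), where scrubber stands for whatever class in your library holds the load-and-scrub code, and the Task Scheduler can run that executable on whatever schedule you need.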
As for how to use it, this is one of the primary issues with the Html Agility Pack. New ways of using it have been added over time, so the library now offers several APIs for the same job. You could take the old-fashioned XPath query route (which was the original API) or you can use the Linq-to-HTML/XML route (which is the newer way). Neither is better than the other; they both have their distinct advantages. The XPath solution allows you to store the queries in a text file easily, so it's great for a configurable system, while the Linq-to-HTML version is a little easier on the eyes from a developer perspective.
As for how to download it, there are a number of options here as well.
You can indeed download the sources from the CodePlex website. Regardless of how you proceed, you might want to do that anyway; it allows you to look under the hood and figure out why something works the way it does, even if you don't compile the library yourself.
You can download the binaries from CodePlex and store them with your project. Before the creation of services such as NuGet, this was the only easy way for developers to distribute their libraries.
I'd personally choose to go the NuGet route. When you're using Visual Studio 2012, NuGet is already integrated with Visual Studio. When you're using Visual Studio 2010, you'll have to install the NuGet extension to get the same functionality. Once installed, you can open the NuGet Package Manager Console from within Visual Studio. With a Visual Studio Solution open and your freshly created Class Library selected in the Solution Explorer, you then enter the Install-Package HtmlAgilityPack command to let Visual Studio download and install the proper version of the HTML Agility Pack for your project. No worries about which library to select; Visual Studio will do that for you.
How to use it now that you've installed the library completely depends on what type of HTML scrubbing you're after and whether you choose the XPath or the Linq-to-HTML route. But it generally comes down to loading the HTML Document:
HtmlDocument doc = new HtmlDocument();
doc.Load(/* path to file or stream */); // or: doc.LoadHtml(/* string */);
And after loading the file and catching any parsing errors that might occur, proceed to query the HTML using XPath like the contents are actually XML (the XML/XPath documentation from MSDN actually applies here):
var nodes = doc.DocumentNode.SelectNodes("//table/tr/td");
Or the same query using Linq-to-HTML:
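// SelectMany flattens the result to a single sequence of td nodes, matching the XPath query
var nodes = doc.DocumentNode.Descendants("table")
    .SelectMany(table => table.Elements("tr"))
    .SelectMany(tr => tr.Elements("td"));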
Or use the Linq-to-Html with Linq query syntax:
var tds = from tables in doc.DocumentNode.Descendants("table")
from tr in tables.Elements("tr")
from td in tr.Elements("td")
select td;
You can make the queries as wild as you want. The syntax is either similar to the standard XPathNavigator syntax in the .NET Framework (using SelectNodes/SelectSingleNode/ChildNodes etc.) or the Linq-to-XML syntax (using .Descendants/.Ancestors/.Element(s) and standard Linq).
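Putting the load, error check and XPath query together, a minimal sketch might look like the following. The class and method names (TableCellReader, ReadCells) are hypothetical, and it assumes the HtmlAgilityPack reference has already been added to the project. One thing worth knowing is that SelectNodes returns null rather than an empty collection when nothing matches, so the result needs a null check.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper; only the HtmlAgilityPack types and members are real.
public static class TableCellReader
{
    // Returns the trimmed text of every cell matched by //table/tr/td.
    public static List<string> ReadCells(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ParseErrors lists the problems the parser recovered from.
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            Console.Error.WriteLine("Line {0}: {1}", error.Line, error.Reason);
        }

        var cells = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//table/tr/td");
        if (nodes == null)
        {
            // SelectNodes returns null when nothing matches.
            return cells;
        }

        foreach (HtmlNode td in nodes)
        {
            // DeEntitize turns entities such as &amp; back into plain characters.
            cells.Add(HtmlEntity.DeEntitize(td.InnerText).Trim());
        }
        return cells;
    }
}
```

Note that the query //table/tr/td only matches rows that sit directly under the table element; if your files wrap rows in tbody, the XPath would need to account for that.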
See also:
Linq to XML documentation
XPathNavigator/IXPathNavigable documentation
Related
We recently switched to Team Foundation Server 2010 for our source code management. Everything works just fine, except for some legacy code written in FoxPro 7 and 9, where the source code files are some sort of tables. For forms, there are two kinds of files, one ending in .scx and another in .sct; both can be explored using the Fox studio, but there is no way to open them in a text editor.
Does anyone have any experience getting the Fox code to work/merge on TFS?
I'm not aware of all of the ins and outs for source control and FoxPro, but if some of the source is binary, you can configure file extensions to disallow merges.
Right-click on the collection (root node) in the TeamExplorer window. Go to Team Project Collection Settings | File Types.
You should be able to add the extensions (like .sct), and specify that merging and multiple checkout is not allowed for those files.
The downside will be that only 1 person at a time can check those files out, but since the forms are FoxPro tables, I would imagine that's the same problem that you would have with any source control tool.
For merging you can set up a merge tool that is capable of merging those files. This must be done on every developer station (Tools->Source Control->VS Team Foundation Server->Configure User Tools).
It may be that VS uses a server-side merge tool to do auto-merges, I don't know if or where you can change that.
I've worked with VFP since it was FoxBase back in late 80's. Visual Foxpro used .dbf files (renamed extensions) for purposes of building forms (.scx/.sct) and visual class libraries (.vcx/.vct) and reports (.frx/.frt).
I've written some code to run through a given project and dump out a text version of all the code as if it was all text-based. All the controls are dumped in alpha order, including embedded procedures, and all property settings are listed in the same place too.
It's not PERFECT, but I've used it over the years to compare source code versions when dealing with other developers who liked to change things without notifying me (or others), leaving us to find the changes later by other horrendous means.
If this is something you might be interested in, I can strip-down the code (some) and send it to you via an email, but would need an email address. The code is written in VFP as a .prg file, so nothing compiled that you would need to worry about any viruses or anything.
At least this way, you COULD get a text version associated with the binary pairs of files used within VFP.
I am building a web project in Visual Studio that uses dojo, but I am unsure of how to link in the 3rd party dojo files so they get copied to the output directory.
In the past for things like jQuery, I placed the jquery.js file in a separate folder, went to "Add Existing Item," added jquery.js as a link, and set Visual Studio to copy it to the output directory (if newer). This worked great.
For dojo, there are hundreds (if not thousands) of related external files. This is not practical to add to Visual Studio (though I did find a way to do it in bulk).
This makes me think that I am approaching this incorrectly. How can or should I include something like dojo in a C# project without having to reference each file? Should I use a post-build step to robocopy the files into the output directory?
My goal is to be able to build multiple projects which all use dojo, but I don't want to have multiple copies of dojo checked in, or have to reference each file in the project.
Use the "Add as link" feature of Visual Studio.
http://msdn.microsoft.com/en-us/library/vstudio/9f4t9t92(v=vs.100).aspx
. . .
I am also a Dojo user. You will want to learn to do Dojo builds, to reduce Dojo to just a few files, and host them on your server. In many cases, in lieu of that, using one of the CDNs such as Google's to access the Dojo files is also effective, with some tiny loss of efficiency for the first load (after that, caching takes care of things).
Depending on your particular circumstances, it may be better to put the files out on a server, and just reference them in your HTML templates. This is, for example, how we do all our internal Dojo applications in my organization--three developers use one set of Dojo files for all applications.
So we have this tool. It's a web page: we drop a large piece of text into TextBox A (say SQL), run the tool, and it generates the guts of a code file in TextBox B (say a custom view class model).
It's written in C#.
I know there are several ways to create visual studio extensions.
What I'd like to know is: what's the best/easiest/fastest way to take a C# dll that has a method that takes text in and returns text out, and turn it into a Visual Studio extension that takes text in, creates a file, adds it to the project and puts the text into it?
We're using VS2008 and VS2010, and I'm okay if the best solution only works on 2010.
The closest I've found by googling so far is this:
http://visualstudiomagazine.com/articles/2009/03/01/generate-code-from-custom-file-formats.aspx
but that is for custom file formats only; I want to generate *.cs and *.rdlc and similar files.
I'd also like to be able to add methods to an existing class.
Tutorial walkthroughs or better approaches would be greatly appreciated.
VS package Builder is the answer. Lots easier.
I need to add a C# solution with examples that would be distributed as part of a software library installer. This solution would have various examples on how to use the product's API.
I want to be able to display a simple "quick start" file explaining how to run the examples when the solution is opened in Visual Studio.
Is there a way to tell Visual Studio to open a specific text file when the solution/project opens?
It sounds like a solution or project template would be the best option. This would let you create an entry in the user's File - New dialog (similar to "New Class Library" etc.). In VS 2008, these are easier to create: File -> Export Template. The template is just a zip of the project(s) with an XML manifest file you can modify. Part of the manifest schema allows you to specify files to open as HTML or text. The templates can be installed relatively easily as part of an installer package.
Here's more on the general concept:
http://msdn.microsoft.com/en-us/library/6db0hwky.aspx
And schema reference about how to open files in various modes on startup:
http://msdn.microsoft.com/en-us/library/ys81cc94.aspx
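As a rough illustration of the manifest described above, a .vstemplate file for an examples project might look something like the sketch below. The template name, project file and item names (ApiExamples.csproj, QuickStart.txt, Program.cs) are placeholders I made up, and the Version value shown is an assumption you should check against what File -> Export Template generates for your Visual Studio version. The relevant part is the OpenInEditor attribute on the ProjectItem for the quick start file.

```xml
<VSTemplate Version="2.0.0" Type="Project"
            xmlns="http://schemas.microsoft.com/developer/vstemplate/2005">
  <TemplateData>
    <!-- Placeholder metadata shown in the New Project dialog -->
    <Name>Product API Examples</Name>
    <Description>Sample project showing how to call the product API</Description>
    <ProjectType>CSharp</ProjectType>
    <DefaultName>ApiExamples</DefaultName>
  </TemplateData>
  <TemplateContent>
    <Project File="ApiExamples.csproj" ReplaceParameters="true">
      <!-- OpenInEditor asks Visual Studio to open this file when the project is created -->
      <ProjectItem OpenInEditor="true">QuickStart.txt</ProjectItem>
      <ProjectItem ReplaceParameters="true">Program.cs</ProjectItem>
    </Project>
  </TemplateContent>
</VSTemplate>
```

The simplest way to get a valid starting point is probably to export a template from a working project and then edit the generated manifest, rather than writing one from scratch.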
If you need to provide more guidance/wizards, consider Guidance Automation Toolkit.
What Will said.
The UI state of the solution (e.g. which files are open for editing) is stored in one of the solution files of which there's supposed to be a separate copy for each user, and which therefore isn't usually checked-in to the shared version control: i.e. not the *.sln file but instead I think the *.suo file (but beware, this is a binary file which won't 'merge').
I don't think it is possible to have a solution file open specific content or even script actions, actually.
Perhaps you could create an MSI setup for your library (if you haven't already) and not deliver a solution with example code, but a project template that is installed by the MSI in the right place to be instantly available as a template in VisualStudio? Then someone can easily do "New Project", select the demonstration template and get a project preset with your example code.
Just make a .bat file (using the VS env) with that calls devenv /useenv yoursolution.sln - this way you can make things a bit fancy if you want to ;)
I have recently started using CAML.NET IntelliSense for SharePoint with Visual Studio 2008; which works great; however whenever I create a new project using STSDev 2008 (and thus generate feature.xml and WebParts.xml) the default schemas include the CAML.NET IntelliSense and the built-in (relatively incomplete) schemas:
caml.xsd
wss.xsd
coredefinitions.xsd
camlview.xsd
All found in web server extensions\12\TEMPLATE\XML. The existence of both of these schemas for the file causes a large number of warnings, notifying me that a specific schema entry is already declared in one of the above files. Disabling them for each file individually works great; however, in a SharePoint solution with 40 or 50 XML files this quickly becomes laborious.
Is it possible to disable these built-in schemas? Selecting "Do not use selected schemas" applies only to the current XML file, not to future ones.
Well, if you really don't want them - you could remove the schemas from the xsd path (%VsInstallDir%\xml\Schemas) - and perhaps disable download (Options->TextEditor->Xml->Miscellaneous). My machine isn't in a suitable state to try it, but it should work in theory...