Agnostic screen scraper using HtmlAgilityPack

Let's say I want a screen scraper that doesn't care whether you pass it a URL that goes to an HTML page, an XML document, or a text file.
Examples:
http://tonto.eia.doe.gov/oog/info/wohdp/dslpriwk.txt
http://google.com
This will work if the page is HTML or a text file:
public class ScreenScrapingService : IScreenScrapingService
{
public XDocument Scrape(string url)
{
var scraper = new HtmlWeb();
var stringWriter = new StringWriter();
var xml = new XmlTextWriter(stringWriter);
scraper.LoadHtmlAsXml(url, xml);
var text = stringWriter.ToString();
return XDocument.Parse(text);
}
}
However, if it is an XML file such as:
http://www.eia.gov/petroleum/gasdiesel/includes/gas_diesel_rss.xml
[Test]
public void Scrape_ShouldScrapeSomething()
{
//arrange
var sut = new ScreenScrapingService();
//act
var result = sut.Scrape("http://www.eia.gov/petroleum/gasdiesel/includes/gas_diesel_rss.xml");
//assert
}
Then I get the error:
An exception of type 'System.Xml.XmlException' occurred in System.Xml.dll but was not handled in user code
Is it possible to write this so that it doesn't care what the URL ultimately is?

To see the exact exception in Visual Studio, press Ctrl+Alt+E and enable Common Language Runtime Exceptions. It seems that LoadHtmlAsXml expects HTML, so your best bet is probably to download the content with WebClient.DownloadString(url) and load it into an HtmlDocument with the OptionOutputAsXml property set to true, as in the following. If parsing that output fails, catch the exception and parse the downloaded string directly:
public XDocument Scrape(string url)
{
    string htmlorxml;
    using (var wc = new WebClient())
    {
        htmlorxml = wc.DownloadString(url);
    }
    var doc = new HtmlDocument() { OptionOutputAsXml = true };
    // Load the downloaded text; without this call the document saved below is empty
    doc.LoadHtml(htmlorxml);
    var stringWriter = new StringWriter();
    doc.Save(stringWriter);
    try
    {
        return XDocument.Parse(stringWriter.ToString());
    }
    catch
    {
        // Fall back to the raw download, which parses when the string is XML already
        try
        {
            return XDocument.Parse(htmlorxml);
        }
        catch
        {
            // Neither form could be parsed
            return null;
        }
    }
}
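As a variation on the fallback above, you could test for well-formed XML first and only hand the text to HtmlAgilityPack when that fails. This is just a minimal sketch of that check using System.Xml.Linq; the XmlSniffer class and TryParseXml method are names made up for this example and are not part of HtmlAgilityPack:

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class XmlSniffer
{
    // Returns true and the parsed document when the text is well-formed XML.
    // Returns false (and a null document) for HTML tag soup, plain text, or empty input.
    public static bool TryParseXml(string text, out XDocument document)
    {
        document = null;
        if (string.IsNullOrWhiteSpace(text))
        {
            return false;
        }
        try
        {
            document = XDocument.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            // Not well-formed XML; the caller can fall back to the HTML path.
            return false;
        }
    }
}
```

Inside Scrape you would call XmlSniffer.TryParseXml(htmlorxml, out var parsed) right after the download and return parsed when it succeeds. Keep in mind that well-formed XHTML pages would also take this path.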

Related

Error "Must set UnitOfWorkManager before use it"

I'm developing a service with the ASP.NET Boilerplate framework and getting the error in the title. The cause of the error is not clear, as I am inheriting from ApplicationService, as the documentation suggests. The code:
namespace MyAbilities.Api.Blob
{
public class BlobService : ApplicationService, IBlobService
{
public readonly IRepository<UserMedia, int> _blobRepository;
public BlobService(IRepository<UserMedia, int> blobRepository)
{
_blobRepository = blobRepository;
}
public async Task<List<BlobDto>> UploadBlobs(HttpContent httpContent)
{
var blobUploadProvider = new BlobStorageUploadProvider();
var list = await httpContent.ReadAsMultipartAsync(blobUploadProvider)
.ContinueWith(task =>
{
if (task.IsFaulted || task.IsCanceled)
{
if (task.Exception != null) throw task.Exception;
}
var provider = task.Result;
return provider.Uploads.ToList();
});
// store blob info in the database
foreach (var blobDto in list)
{
SaveBlobData(blobDto);
}
return list;
}
public void SaveBlobData(BlobDto blobData)
{
UserMedia um = blobData.MapTo<UserMedia>();
_blobRepository.InsertOrUpdateAndGetId(um);
CurrentUnitOfWork.SaveChanges();
}
public async Task<BlobDto> DownloadBlob(int blobId)
{
// TODO: Implement this helper method. It should retrieve blob info
// from the database, based on the blobId. The record should contain the
// blobName, which should be returned as the result of this helper method.
var blobName = GetBlobName(blobId);
if (!String.IsNullOrEmpty(blobName))
{
var container = BlobHelper.GetBlobContainer();
var blob = container.GetBlockBlobReference(blobName);
// Download the blob into a memory stream. Notice that we're not putting the memory
// stream in a using statement. This is because we need the stream to be open for the
// API controller in order for the file to actually be downloadable. The closing and
// disposing of the stream is handled by the Web API framework.
var ms = new MemoryStream();
await blob.DownloadToStreamAsync(ms);
// Strip off any folder structure so the file name is just the file name
var lastPos = blob.Name.LastIndexOf('/');
var fileName = blob.Name.Substring(lastPos + 1, blob.Name.Length - lastPos - 1);
// Build and return the download model with the blob stream and its relevant info
var download = new BlobDto
{
FileName = fileName,
FileUrl = Convert.ToString(blob.Uri),
FileSizeInBytes = blob.Properties.Length,
ContentType = blob.Properties.ContentType
};
return download;
}
// Otherwise
return null;
}
//Retrieve blob info from the database
private string GetBlobName(int blobId)
{
throw new NotImplementedException();
}
}
}
The error appears even before the app flow reaches the 'SaveBlobData' method. Am I missing something?
Hate to answer my own question, but here it is. After a while, I found out that if the UnitOfWorkManager is not available for some reason, I can get one myself by taking an IUnitOfWorkManager in the constructor and keeping it in a field (_unitOfWorkManager below). Then you can simply use the following construct in your Save method:
using (var unitOfWork = _unitOfWorkManager.Begin())
{
//Save logic...
unitOfWork.Complete();
}

How to download an image (call an API that returns the image as byte[]) from a server and show it in a ContentPage in Xamarin.Forms

Call a REST service
The REST service returns a byte[] representation of an image/audio/video
Convert the byte[] into an image and show it in a content page in Xamarin
First of all, you can create a function that simply makes an API request and obtains the content in the form of byte array. A simple example of HTTP request:
public static byte[] GetImageByteArray(string url)
{
try
{
using (var client = new HttpClient())
{
var uri = new Uri(url);
var response = client.GetAsync(uri).Result;
if (response.IsSuccessStatusCode)
{
var content = response.Content.ReadAsByteArrayAsync();
return content.Result;
}
}
return null;
}
catch
{
return null;
}
}
Next, you can wrap the returned byte array in a stream, use that as the image source, and add the image to your page content:
var mainStack = new StackLayout();
var imageByteArray = GetImageByteArray("https://static.pexels.com/photos/34950/pexels-photo.jpg");
Image image;
if (imageByteArray != null)
{
image = new Image()
{
Source = ImageSource.FromStream(() => new MemoryStream(imageByteArray))
};
mainStack.Children.Add(image);
}
Content = mainStack;

How do we set content-type to "text/plain" in asp.net web api

We are using ASP.NET Web API in our application, and we are trying to return a response with the content type text/plain, but we have not succeeded. The same thing works fine in ASP.NET MVC. Could you please provide an equivalent solution for Web API?
Below is the function as implemented in ASP.NET MVC:
public JsonResult FileUpload(HttpPostedFileBase file)
{
string extension = System.IO.Path.GetExtension(file.FileName);
string bufferData = string.Empty;
if (file != null)
{
using (MemoryStream ms = new MemoryStream())
{
file.InputStream.CopyTo(ms);
byte[] array = ms.GetBuffer();
var appendInfo = "data:image/" + extension + ";base64,";
bufferData = appendInfo + Convert.ToBase64String(array);
}
}
var result = new
{
Data = bufferData
};
return Json(result,"text/plain");
}
Could you please suggest the same implementation in Web API?
Thanks,
Bhagat
Web Api does the JSON work for you, so you can simplify your code handling on the endpoint. By default, you need to make changes in your WebApiConfig.cs for everything to work nicely. I've modified your method below:
public HttpResponseMessage FileUpload(HttpPostedFileBase file)
{
    var result = new HttpResponseMessage(HttpStatusCode.NotFound);
    try
    {
        if (file != null)
        {
            // GetExtension returns ".png"; drop the dot so the prefix reads "data:image/png"
            var extension = System.IO.Path.GetExtension(file.FileName).TrimStart('.');
            using (MemoryStream ms = new MemoryStream())
            {
                file.InputStream.CopyTo(ms);
                // ToArray returns only the bytes written; GetBuffer can include unused capacity
                var array = ms.ToArray();
                var appendInfo = "data:image/" + extension + ";base64,";
                var bufferData = appendInfo + Convert.ToBase64String(array);
                result.StatusCode = HttpStatusCode.OK;
                // Content must be an HttpContent; StringContent also sets the Content-Type header
                result.Content = new StringContent(bufferData, Encoding.UTF8, "text/plain");
            }
        }
    }
    catch (IOException)
    {
        // Handle IO Exception
    }
    return result;
}
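Since the question is specifically about text/plain, here is a small sketch of just that part, assuming you return an HttpResponseMessage yourself so that content negotiation is bypassed. PlainTextResponse and Create are hypothetical names for this example; StringContent and HttpResponseMessage come from System.Net.Http:

```csharp
using System.Net;
using System.Net.Http;
using System.Text;

// Hypothetical helper for illustration only.
public static class PlainTextResponse
{
    // Builds a 200 response whose body is sent with Content-Type text/plain; charset=utf-8.
    public static HttpResponseMessage Create(string body)
    {
        return new HttpResponseMessage(HttpStatusCode.OK)
        {
            // StringContent sets the Content-Type header from the media type argument.
            Content = new StringContent(body ?? string.Empty, Encoding.UTF8, "text/plain")
        };
    }
}
```

An action could then end with return PlainTextResponse.Create(bufferData); instead of relying on the configured formatters.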
The changes you need to make in your WebApiConfig.cs could look like this:
public static void Register(HttpConfiguration config)
{
config.Routes.MapHttpRoute(
name: "DefaultApi",
routeTemplate: "api/{controller}/{action}",
defaults: null,
constraints: new { action = @"\D+" }
);
// This makes the response default into JSON instead of XML
config.Formatters.Remove(config.Formatters.XmlFormatter);
}
As a note, the very fastest fix you can make to your code would be to do this, but I don't recommend returning strings.
public string FileUpload(HttpPostedFileBase file)
{
    if (file != null)
    {
        // GetExtension returns ".png"; drop the dot so the prefix reads "data:image/png"
        var extension = System.IO.Path.GetExtension(file.FileName).TrimStart('.');
        using (MemoryStream ms = new MemoryStream())
        {
            file.InputStream.CopyTo(ms);
            // ToArray returns only the bytes written; GetBuffer can include unused capacity
            var array = ms.ToArray();
            var appendInfo = "data:image/" + extension + ";base64,";
            return appendInfo + Convert.ToBase64String(array);
        }
    }
    // If you get here and have not returned,
    // something went wrong and you should return an Empty
    return String.Empty;
}
Good luck - there's lots of ways of handling files and file returns, so I want to assume you don't have some special return value on your handling.
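One more thing worth checking, whichever return type you choose, is the data URI itself. The question's MVC code uses ms.GetBuffer(), which returns the stream's whole internal buffer (possibly with unused trailing bytes), and Path.GetExtension, which keeps the leading dot, so the prefix comes out as "data:image/.png". Below is a rough sketch of a helper that avoids both. DataUri and FromImage are made-up names, and mapping "jpg" to "jpeg" is my own assumption about what you want rather than something from the question:

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only.
public static class DataUri
{
    // Builds "data:image/<subtype>;base64,<payload>" from raw bytes and a file name.
    public static string FromImage(byte[] bytes, string fileName)
    {
        if (bytes == null) throw new ArgumentNullException(nameof(bytes));

        // Path.GetExtension returns ".png" (with the dot), or null for a null name.
        var subtype = (Path.GetExtension(fileName) ?? string.Empty)
            .TrimStart('.')
            .ToLowerInvariant();

        // Assumption: the registered MIME subtype for .jpg files is "jpeg".
        if (subtype == "jpg") subtype = "jpeg";

        return "data:image/" + subtype + ";base64," + Convert.ToBase64String(bytes);
    }
}
```

With a MemoryStream, pass ms.ToArray() rather than ms.GetBuffer() so that only the bytes actually written are encoded.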

ASP.Net WebAPI Return CSS

I need to write a Web API method that returns its result as plain-text CSS rather than the default XML or JSON. Is there a specific provider that I need to use?
I tried using the ContentResult class (http://msdn.microsoft.com/en-us/library/system.web.mvc.contentresult(v=vs.108).aspx) but had no luck.
Thanks
You should bypass the content negotiation which means that you should return a new instance of HttpResponseMessage directly and set the content and the content type yourself:
return new HttpResponseMessage(HttpStatusCode.OK)
{
Content = new StringContent(".hiddenView { display: none; }", Encoding.UTF8, "text/css")
};
Using the answers here as inspiration, you should be able to do something as simple as this:
public HttpResponseMessage Get()
{
string css = @"h1.basic {font-size: 1.3em;padding: 5px;color: #abcdef;background: #123456;border-bottom: 3px solid #123456;margin: 0 0 4px 0;text-align: center;}";
var response = new HttpResponseMessage(HttpStatusCode.OK);
response.Content = new StringContent(css, Encoding.UTF8, "text/css");
return response;
}
Can you return a HttpResponseMessage, get the file and just return the stream? Something like this seems to work....
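public HttpResponseMessage Get(int id)
{
    var path = HttpContext.Current.Server.MapPath("~/content/site.css"); // location of the CSS file
    // Open read-only so the call does not need write access to the file
    var stream = new FileStream(path, FileMode.Open, FileAccess.Read);
    var response = new HttpResponseMessage
    {
        StatusCode = HttpStatusCode.OK,
        Content = new StreamContent(stream)
    };
    // StreamContent sets no Content-Type by default, so declare the body as CSS
    response.Content.Headers.ContentType = new MediaTypeHeaderValue("text/css");
    return response;
}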
Although I would add some error checking in there if the file doesn't exist etc....
And just to pile on for fun, here's a version that would work under self-host too assuming you store the .css as an embedded file that sits in the same folder as the controller. Storing it in a file in your solution is nice because you get all the VS intellisense. And I added a bit of caching because chances are this resource isn't going to change much.
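public HttpResponseMessage Get(int id)
{
    var stream = GetType().Assembly.GetManifestResourceStream(GetType(), "site.css");
    var response = new HttpResponseMessage
    {
        StatusCode = HttpStatusCode.OK,
        Content = new StreamContent(stream)
    };
    // StreamContent has no media-type constructor, so set the content type on the content headers
    response.Content.Headers.ContentType = new MediaTypeHeaderValue("text/css");
    // Cache-Control is a response header; let clients cache the stylesheet for one hour
    response.Headers.CacheControl = new CacheControlHeaderValue { MaxAge = new TimeSpan(1, 0, 0) };
    return response;
}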
For anyone using ASP.NET Core Web API, you can simply do it like this:
[HttpGet("custom.css")]
public IActionResult GetCustomCss()
{
var customCss = ".my-class { color: #fff }";
return Content(customCss, "text/css");
}

Convert PartialView Html to String for ITextSharp HtmlParser

I've got a partial view, and I'm trying to use iTextSharp to convert its HTML to PDF. How can I convert the HTML to a string so I can use iTextSharp's HtmlParser?
I've tried something like this with no luck. Any ideas?
var contents = System.IO.File.ReadAllText(Url.Action("myPartial", "myController", new { id = 1 }, "http"));
I have created a special ViewResult class that you can return as the result of an Action.
You can see the code on bitbucket (look at the PdfFromHtmlResult class).
So what it basically does is:
Render the view through the Razor engine (or any other registered engine) to HTML.
Give the HTML to iTextSharp.
Return the PDF as the ViewResult (with the correct MIME type, etc.).
My ViewResult class looks like:
public class PdfFromHtmlResult : ViewResult {
public override void ExecuteResult(ControllerContext context) {
if (context == null) {
throw new ArgumentNullException("context");
}
if (string.IsNullOrEmpty(this.ViewName)) {
this.ViewName = context.RouteData.GetRequiredString("action");
}
if (this.View == null) {
this.View = this.FindView(context).View;
}
// First get the html from the Html view
using (var writer = new StringWriter()) {
var vwContext = new ViewContext(context, this.View, this.ViewData, this.TempData, writer);
this.View.Render(vwContext, writer);
// Convert to pdf
var response = context.HttpContext.Response;
using (var pdfStream = new MemoryStream()) {
var pdfDoc = new Document();
var pdfWriter = PdfWriter.GetInstance(pdfDoc, pdfStream);
pdfDoc.Open();
using (var htmlRdr = new StringReader(writer.ToString())) {
var parsed = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(htmlRdr, null);
foreach (var parsedElement in parsed) {
pdfDoc.Add(parsedElement);
}
}
pdfDoc.Close();
response.ContentType = "application/pdf";
// Content-Disposition needs a disposition type before the file name
response.AddHeader("Content-Disposition", "attachment; filename=" + this.ViewName + ".pdf");
byte[] pdfBytes = pdfStream.ToArray();
response.OutputStream.Write(pdfBytes, 0, pdfBytes.Length);
}
}
}
}
With the correct extension methods (see BitBucket), etc, the code in my controller is something like:
public ActionResult MyPdf(int id) {
var myModel = findDataWithID(id);
// this assumes there is a MyPdf.cshtml/MyPdf.aspx as the view
return this.PdfFromHtml(myModel);
}
Note: Your method does not work because the HTML is requested by the server itself, so you lose all cookies (and with them the session information) that are stored on the client.

Resources