Cleaning an HTML string to prevent XSS using HtmlUnit

I vaguely recall reading somewhere that HtmlUnit can be used to filter HTML strings to prevent XSS, but I am unable to find any clue as to how. Does HtmlUnit provide this function, or do I need to use JSoup or something else for it?
Thanks,
Sanjay
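
If jsoup turns out to be the way to go, its whitelist-based cleaner is essentially a one-liner. A minimal sketch (assuming jsoup 1.14+, where Whitelist was renamed Safelist):

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class XssClean {
    public static void main(String[] args) {
        String dirty = "<p>Hello <script>alert('xss')</script><b>world</b></p>";
        // Keep only a small whitelist of formatting tags; everything else,
        // including script tags and event-handler attributes, is stripped.
        String clean = Jsoup.clean(dirty, Safelist.basic());
        System.out.println(clean); // <p>Hello <b>world</b></p>
    }
}
```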

Related

Getting Dynamically Generated HTML With Nokogiri/Open URI

I'm trying to scrape a site by looking at its HTML in Chrome and grabbing the data using Nokogiri. The problem is that some of the tags are dynamically generated, and they don't appear with an open(url) request when using open-uri. Is there a way to "force" a site to dynamically generate its content for a tool like open-uri to read?
If reading it via open-uri doesn't produce the content you need, then chances are good that the client is generating content with JavaScript.
This may be good news - by inspecting the AJAX requests that the page makes, you might find a JSON feed of the content you're looking for, which you can then request and parse directly. This would get you your data without having to dig through the HTML - handy!
If that doesn't work for some reason, though, you're going to need to open the page with some kind of browser, let it execute its client-side JavaScript, and then dump the resulting DOM to HTML. Something like PhantomJS is an excellent choice for this kind of work.
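
Since this page's topic is HtmlUnit: on the JVM it can fill the same role as PhantomJS, executing the page's JavaScript headlessly before you dump the DOM. A minimal sketch (HtmlUnit 2.x API; the URL is a placeholder):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class DumpRenderedDom {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage("http://example.com"); // placeholder URL
            webClient.waitForBackgroundJavaScript(5000); // give AJAX time to finish
            System.out.println(page.asXml()); // the DOM after scripts have run
        }
    }
}
```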

Facebook can't scrape my page and linter tool says document returns no data

Can anybody tell me why Facebook doesn't scrape my page, and why the debug/linter tools can't scrape it either? I've searched and searched and can't find a way to fix it.
As far as I can tell all the og:tags and scripts are implemented correctly.
The page is at http://www.coincident.dk
The debug url is this: http://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Fwww.coincident.dk
It looks to me like Facebook is scraping your page successfully, just not fully. Weird.
I would try moving the OG meta tags lower in the head, after your content-type meta tag at the very least.
These problems might be due to character encoding issues. If the Facebook scraper is relying on the content-type tag to learn that the encoding is UTF-8, then it might not be reading OG tags that appear before it correctly.
I hope that helps!
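
Concretely, the suggested head ordering looks something like this (a sketch; the property values are illustrative):

```html
<head>
  <!-- declare the encoding first, so the scraper decodes everything below correctly -->
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>Coincident</title>
  <!-- og: tags only after the content-type declaration -->
  <meta property="og:title" content="Coincident">
  <meta property="og:url" content="http://www.coincident.dk">
</head>
```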

How Does Firebug Get Contents From an IFrame?

I am well aware of cross-origin restrictions when it comes to browsers, but what I don't get is how Firebug can get and display the HTML from an iframe with this restriction in place. Is there something plugins have access to that lets them get around it?
Plugins have access to quite a bit. They're not considered cross-origin; they're considered part of your browser.

Watin & Google Searches

I'm trying to use WatiN to parse Google search results.
However, WatiN is unable to find elements in the Google search results page. When I view the source, I can see why: the page is generated by JavaScript, so the search results are not sent over the wire as HTML.
However, when I open Firebug (in Firefox), I am able to see the HTML that gets generated by the JavaScript.
Does anyone know how I can get Watin to do the same so I'm able to parse the results?
Thanks :)
Could you use the Google Search API instead?
http://code.google.com/apis/ajaxsearch/documentation/#fonje
It may be a matter of timing. If JavaScript is generating the data, you may be checking for your data before it is written. Try running it in debug, stepping through and waiting until you know the items exist in the source before using WatiN to test for them.
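WatiN's own wait calls aside, the same fix expressed with HtmlUnit (this page's topic) looks like the sketch below: let the background JavaScript settle before probing the DOM. The URL is illustrative.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class WaitThenParse {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage("http://www.google.com/search?q=watin"); // illustrative
            // Don't probe the DOM immediately: block until the background
            // JavaScript (the result-rendering scripts) has settled.
            webClient.waitForBackgroundJavaScript(10000);
            System.out.println(page.asXml().length()); // now safe to parse the grown DOM
        }
    }
}
```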
I would also suggest posting code so we can see exactly what you're trying to do.

Is there a way to visually see if HtmlUnit is performing the correct commands?

Is there a way to visually see if HtmlUnit is performing the correct commands? I have a hard requirement to use HtmlUnit. I just don't know if it's filling out all the forms correctly.
HtmlUnit is designed to be a GUI-less browser; for your requirements, consider tools such as WebDriver, Watir, or Selenium. If you are into Ruby, take a look at Celerity, which wraps HtmlUnit in a Watir-ish API; Celerity is itself wrapped by Culerity, which integrates Celerity with Cucumber and may be of even more interest to you.
Yes, you can see the HTTP traffic by using a proxy like WebScarab or Fiddler.
Make sure of the following:
Set the proxy details on HtmlUnit via the constructor; I think it is on WebClient.
Make sure you either trust all certificates or add the proxy's certificate to the truststore.
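A sketch of that setup (HtmlUnit 2.x; the proxy host and port are placeholders, and the proxy-taking constructor is indeed on WebClient):

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;

public class ProxiedClient {
    public static void main(String[] args) throws Exception {
        // Route all HtmlUnit traffic through a local intercepting proxy
        // (e.g. Fiddler or WebScarab listening on 127.0.0.1:8888; placeholders).
        try (WebClient webClient = new WebClient(BrowserVersion.getDefault(), "127.0.0.1", 8888)) {
            // Intercepting proxies re-sign HTTPS traffic, so either add the
            // proxy's CA to your truststore or, for debugging only:
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getPage("http://example.com"); // placeholder URL
        }
    }
}
```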
What do you mean by "correct commands"? HtmlUnit itself won't give you a running description of what it's doing, if that's what you mean. As suthasankar says, HtmlUnit is a headless browser (intentionally so) and will never give you the cool Watir experience of watching pages fly by.
Any time I've wanted to know what's happening during a test's execution, I have added logging statements at various points in the test code and then watched them in the console. You could send messages to any other monitoring system instead.
It wouldn't take much to then write wrappers around the "commands" you're interested in, like "getPage" and button clicks and form entries and the like.
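For instance, a hypothetical wrapper (the class and log format are mine, not an HtmlUnit API) that narrates each navigation:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Hypothetical helper, not part of HtmlUnit: logs each "command" as it runs.
public class NarratedClient {
    private final WebClient webClient = new WebClient();

    public HtmlPage getPage(String url) throws Exception {
        System.out.println("[htmlunit] GET " + url);
        HtmlPage page = webClient.getPage(url);
        System.out.println("[htmlunit] -> title: " + page.getTitleText());
        return page;
    }
}
```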
It's not possible to view what HtmlUnit is doing unless you code logging and some sort of display yourself. I have done this in the past; it's helpful to a certain degree, but it's not really possible to get visual feedback on what HtmlUnit is doing. Even with logging, you can't know every single detail of what HtmlUnit is doing or where it goes wrong, so it's an extremely time-consuming task. I even resorted to outputting the current page being viewed, but this is pretty limited, as an HTML page cannot tell you the actual "commands" HtmlUnit is executing on it.
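If you do try the dump-the-current-page route, HtmlUnit can at least snapshot the DOM as it currently sees it; HtmlPage.save(File) writes the page, with its resources, to disk. A sketch (the URL and filename are placeholders):

```java
import java.io.File;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class StepSnapshot {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage("http://example.com"); // placeholder
            // ...click buttons / fill forms here, then snapshot the result:
            page.save(new File("after-step-1.html")); // the DOM as HtmlUnit sees it now
        }
    }
}
```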
Another approach would be to use Selenium, which executes your "commands" in a visual manner, so you can see where things go wrong instantly by watching it.
