Currently I am scraping news article sites and, in the process of extracting their main content, I ran into the issue that a lot of them have embedded tweets in them like these:
I use XPath expressions with the XPath Helper Chrome add-on to test whether I can get the content, then add the expression to my Scrapy (Python) spider. However, elements that are inside a #shadow-root seem to be outside the scope of the DOM. I am looking for a way to get content inside these types of elements, preferably with XPath.
Most web scrapers, including Scrapy, don't support the Shadow DOM, so you will not be able to access elements in shadow trees at all.
And even if a web scraper did support the Shadow DOM, XPath is not supported inside shadow trees at all; only CSS selectors are supported to some extent, as documented in the CSS Scoping spec.
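For illustration, in a browser context the only way across the shadow boundary is the shadow DOM API itself, and only for open shadow roots. A minimal sketch, with a hypothetical host element name and class:
const host = document.querySelector('twitter-widget'); // hypothetical host element name
if (host && host.shadowRoot) {
  // CSS selectors work inside the shadow tree; XPath does not.
  const text = host.shadowRoot.querySelector('.tweet-text'); // hypothetical class name
  console.log(text && text.textContent);
}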
One way to scrape pages containing shadow DOM with tools that don't work with the shadow DOM API is to recursively iterate over the shadow DOM elements and replace them with their HTML code:
// Returns the HTML of a given shadow root.
const getShadowDomHtml = (shadowRoot) => {
  let shadowHTML = '';
  for (let el of shadowRoot.childNodes) {
    shadowHTML += el.nodeValue || el.outerHTML;
  }
  return shadowHTML;
};

// Recursively replaces shadow DOMs with their HTML.
const replaceShadowDomsWithHtml = (rootElement) => {
  for (let el of rootElement.querySelectorAll('*')) {
    if (el.shadowRoot) {
      replaceShadowDomsWithHtml(el.shadowRoot);
      el.innerHTML += getShadowDomHtml(el.shadowRoot);
    }
  }
};

replaceShadowDomsWithHtml(document.body);
If you are scraping with a full browser (Chrome with Puppeteer, PhantomJS, etc.), just inject this script into the page. It is important to execute it only after the whole page has rendered, because flattening the shadow DOM this way can break the JS code of the shadow DOM components.
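For example, a minimal Puppeteer sketch (the URL is a placeholder; the flattening functions from above are inlined because page.evaluate() serializes its callback):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/article', { waitUntil: 'networkidle0' });

  // Flatten all shadow trees in the page context after rendering has finished.
  await page.evaluate(() => {
    const getShadowDomHtml = (shadowRoot) => {
      let shadowHTML = '';
      for (let el of shadowRoot.childNodes) {
        shadowHTML += el.nodeValue || el.outerHTML;
      }
      return shadowHTML;
    };
    const replaceShadowDomsWithHtml = (rootElement) => {
      for (let el of rootElement.querySelectorAll('*')) {
        if (el.shadowRoot) {
          replaceShadowDomsWithHtml(el.shadowRoot);
          el.innerHTML += getShadowDomHtml(el.shadowRoot);
        }
      }
    };
    replaceShadowDomsWithHtml(document.body);
  });

  // The flattened HTML can now be fed to regular XPath/CSS tooling.
  const html = await page.content();
  console.log(html.length);
  await browser.close();
})();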
Check out the full article I wrote on this topic: https://kb.apify.com/tips-and-tricks/how-to-scrape-pages-with-shadow-dom
I've searched through the Corvid docs and Stack, not finding anything.
Is there a way to appendChild() in Wix Corvid (Code)?
EDIT: Wix does not allow direct DOM access. I assumed that people answering this would know I was looking for an alternative to appendChild and knew that this method could not be used as is in Wix.
So to clarify: is there a way to add a child to a parent element using Wix's APIs?
It depends on what you are trying to achieve. The only thing off the top of my head is adding more items to a repeater, which you can do by first getting the repeater's initial data, adding another item to the array, and reassigning the repeater's data property:
const initialData = $w('#repeater').data
const newItem = {
_id: 'newItem1', // Must have an _id property
content: 'some content'
}
const newData = [...initialData, newItem]
$w('#repeater').data = newData
https://www.wix.com/corvid/reference/$w.Repeater.html#data
In Corvid, you cannot use any function which accesses the DOM.
Coming from one of the developers of Corvid:
Accessing document elements such as div, span, button, etc is off-limits. The way to access elements on the page is only through $w. One small exception is the $w.HtmlComponent (which is based on an iFrame). This element was designed to contain vanilla HTML and it works just fine. You just can't try to trick it by using parent, window, top, etc.
Javascript files can be added to your site's Public folder, but the same limitations apply - no access to the DOM.
Read more here: https://www.wix.com/corvid/forum/main/comment/5afd2dd4f89ea1001300319e
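If what you actually need is to inject arbitrary HTML, one possible workaround is sketched below, assuming an HtmlComponent on the page with the ID #html1 (the markup string is just an example): push the markup into the HtmlComponent with postMessage and let the embedded page call appendChild on its own document.
// Page code (Corvid): send the markup to the HtmlComponent.
$w.onReady(function () {
  $w('#html1').postMessage('<form id="myForm"></form>');
});
And inside the HtmlComponent's own HTML, where regular DOM access is allowed because it runs in an iframe:
// Script inside the HtmlComponent.
window.onmessage = function (event) {
  if (event.data) {
    var wrapper = document.createElement('div');
    wrapper.innerHTML = event.data;
    document.body.appendChild(wrapper);
  }
};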
I am facing an odd problem. I am trying to parse the following HTML:
The problem is that when I do
response.xpath('//div//section//div[@id="hiring-candidate-app"]')[0].extract()
I only get
'<div id="hiring-candidate-app"></div>'
instead of all the content under hiring-candidate-app.
I would like to get, for instance, inside-content, but it looks like I am not even getting that in the response. This webpage requires you to be logged in, which I am.
Thanks in advance!
It looks like your XPath is grabbing the right element, but your issue might have to do with the '[0]' part of the call. I would remove that to get the full content of the div.
It looks like the elements in question sit inside an <iframe> and therefore live in a different document context. You need to switch to the context of the iframe, e.g. by using JavaScript to interact with the iframe and the document inside of it:
// Note: Assigning document.domain is forbidden for sandboxed iframes, i.e. on stacksnippets
//document.domain = "https://stacksnippets.net";
var ifrm = document.getElementById("myFrame");
// reference to the iframe's window
//var win = ifrm.contentWindow;
// reference to the document in the iframe
var doc = ifrm.contentDocument ? ifrm.contentDocument : ifrm.contentWindow.document;
// reference an element via a CSS selector in the iframe
//var msg = doc.querySelector('body > div > div.message');
// reference an element via XPath in the iframe
var xpathResult = doc.evaluate("/html/body/div/div[1]", doc, null, XPathResult.ANY_TYPE, null);
<iframe id="myFrame" src="https://stacksnippets.net" style="height:380px;width:100%"></iframe>
However, as you can see when you run the snippet, cross-document interaction is only possible if the documents have the same origin. There are other, more involved methods, such as postMessage, that provide a means of interacting across domains.
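For completeness, a minimal postMessage sketch, assuming you control the code running inside the iframe (the #myFrame id and the origin match the snippet above):
// Parent page: send a message to the iframe once it has loaded.
var frame = document.getElementById("myFrame");
frame.addEventListener("load", function () {
  frame.contentWindow.postMessage({ type: "ping" }, "https://stacksnippets.net");
});

// Parent page: receive replies coming back from the iframe.
window.addEventListener("message", function (event) {
  if (event.origin !== "https://stacksnippets.net") return; // ignore unexpected senders
  console.log("iframe replied:", event.data);
});

// Inside the iframe, something like this would answer the ping:
// window.addEventListener("message", function (event) {
//   event.source.postMessage("pong", event.origin);
// });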
I'm writing some simple Jasmine tests and I'm getting an exception, since the code I'm testing looks for a form that doesn't exist because there is no DOM when testing only a JS file: $("form")[0] in the tested JS file leads to:
TypeError: $(...)[0] is undefined
I read a bit about jasmine-jquery and realized I can use an HTML fixture with an external HTML file. That flow seems quite messy, since all I need to do is add an empty valid form so that the test (which focuses on something else) will run; appending something like <form></form> would be enough, I think.
At first I thought that the sandbox() function would be the solution, but it seems that it creates only divs, and I need a form.
Is there any simple way to add some elements using only code in the Jasmine spec file?
The simplest solution is to add the form to the DOM yourself in the before block and then remove it in the after block:
describe('code that needs a form in the DOM', function () {
  var form;

  beforeEach(function () {
    form = $('<form>');
    $(document.body).append(form);
  });

  it('your test', function () {
  });

  afterEach(function () {
    form.remove();
    form = null;
  });
});
Also, writing your own sandbox helper isn't that hard:
function sandbox(html) {
  var el;

  beforeEach(function () {
    el = $(html);
    $(document.body).append(el);
  });

  afterEach(function () {
    el.remove();
    el = null;
  });
}
Another approach is to use jasmine-fixture.
The concept
Here's one way to think about it:
In jQuery, you give $() a CSS selector and it finds elements on the DOM.
In jasmine-fixture, you give affix() a CSS selector and it adds those elements to the DOM.
This is very useful for tests, because it means that after setting up the state of the DOM with affix, your subject code under test will have the elements it needs to do its work.
Finally, jasmine-fixture will help you avoid test pollution by tidying up and removing everything you affix to the DOM after each spec runs.
See also: SO: dom manipulation in Jasmine test
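For instance, a minimal sketch, assuming jasmine-fixture is loaded so the global affix() helper is available (the form id is just a placeholder), that creates the empty form the question asks for:
describe('code that expects a form on the page', function () {
  var $form;

  beforeEach(function () {
    // affix() parses the CSS-style string, builds the matching element,
    // appends it to the document and removes it again after each spec.
    $form = affix('form#searchForm');
  });

  it('finds the form it needs', function () {
    expect($('form').length).toBeGreaterThan(0);
  });
});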
You should use sandbox() to create a div, then create a form element and append it to the sandbox; this is the safer way to let Jasmine take control of these fixtures in the DOM.
How can I get all the images on a webpage, after decoding them if possible, through XPCOM?
The image might be specified in the HTML as a background URL in some CSS property, inside an img tag, or in any other form a web developer might have used.
I tried looking into imgIContainer, imgIDecodeObserver and many other interfaces. There is a way to give Mozilla an image URI so that it loads the image, decodes it and returns an imgIContainer, but I couldn't find any way to get all the images in the current webpage.
This has to be done in either Java or Javascript.
Any suggestions?
@Wladimir - Thanks for your help.
I want all the images, including CSS constructs (background images). So now I am listening to events from nsIWebProgressListener:
onStateChange: function(webProgress, request, stateFlags, status) {
  if ((~stateFlags & (nsIWebProgressListener.STATE_IS_REQUEST | nsIWebProgressListener.STATE_STOP)) == 0) {
    var imgReq = request.QueryInterface(CI.imgIRequest);
    if (imgReq)
      var img = imgReq.image;
  }
}
The problem is that request.QueryInterface(CI.imgIRequest) throws an exception for all non-image requests. Those exceptions can be ignored by putting the code inside a try-catch block, but I'd prefer to do things cleanly.
Is there any condition that can be checked to know whether a request is for an image or not?
There is existing code that you can look at. The Page Info dialog has a Media tab that successfully shows most images on the page. The important function is grabAll() in pageInfo.js, it is called for each element (via a TreeWalker). As you can see, there is no generic way to get the image, this function rather uses window.getComputedStyle() to extract the values of a bunch of the CSS properties for this element: background-image, border-image, list-style-image, cursor. It will also look for <img>, <svg:image>, <link> (favicon), <input>, <button>, <object> and <embed> tags. It doesn't manage to recognize everything however, e.g. these CSS constructs will not be recognized:
.foo:before
{
  content: url(image.png);
}

.foo:hover
{
  background-image: url(image.png);
}
Still, this is probably as far as you can get - unless you want to look at the requests made by the web page as it loads.
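For reference, a rough sketch (this is not the actual grabAll() code, and the property list is abridged) of collecting image URLs the same way from a content document: walk the elements and read the image-bearing computed CSS properties in addition to the obvious tags.
function collectImageUrls(doc) {
  var urls = new Set();
  var cssProps = ["background-image", "border-image-source", "list-style-image", "cursor"];
  var walker = doc.createTreeWalker(doc.body, NodeFilter.SHOW_ELEMENT, null);
  for (var el = walker.currentNode; el; el = walker.nextNode()) {
    // Obvious element cases: <img src>, <embed src>, <input type="image" src>, <object data>.
    if (el.localName === "img" || el.localName === "embed" ||
        (el.localName === "input" && el.type === "image")) {
      if (el.src) urls.add(el.src);
    } else if (el.localName === "object" && el.data) {
      urls.add(el.data);
    }
    // CSS-specified images show up in the computed style as url(...) values.
    var style = doc.defaultView.getComputedStyle(el);
    for (var i = 0; i < cssProps.length; i++) {
      var match = /url\(["']?([^"')]+)["']?\)/.exec(style.getPropertyValue(cssProps[i]));
      if (match) urls.add(match[1]);
    }
  }
  return Array.from(urls);
}

// Example usage: collectImageUrls(document) in the page context, or pass the content document from chrome code.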
Edit: If you look at the requests as they are performed (via a web progress listener), you can do the following:
if (request instanceof CI.imgIRequest)
var img = request.URI.spec;
Note that request.image won't help you much; almost all methods of imgIContainer are only accessible from native code.