I'm trying to get the value of an HTML attribute using Bash. It's a personal project, so I can use cURL, wget, anything, but it needs to be Bash.
Suppose I want to scrape Google to get the value of the jscontroller attribute that appears in the first div inside the body.
Basically, how can I do it?
I am trying to get TinyMCE 4's image_list to work with a URL returning JSON data as specified in the example here.
I have set up a GET endpoint http://demo.com/media on my server which returns a JSON response consisting of a list of objects with their title and value attributes set, for example:
[{"title":"demo.jpg","value":"http://demo.com/demo.jpg"}]
I have also specified the option image_list: "http://demo.com/media" when initializing the plugin.
However, when I click the image icon in the toolbar, nothing pops up. All I can see in the network tab is an OPTIONS request with status 200, but then nothing. The GET request I was expecting never happens.
What is the correct way of using image_list in TinyMCE 4? Also, does anyone have a working demo example? I couldn't find anything.
It is somewhat hard to say what the issue is without seeing the exact data your URL is returning. I have created a TinyMCE Fiddle to show (in general) how this is supposed to work:
http://fiddle.tinymce.com/pwgaab
There is a JavaScript variable at the top (pretendFetchedData) that simulates what you would grab from the server (an array of JavaScript objects) and that is referenced via image_list.
If you enter your URL (http://demo.com/media) into a browser window, what is returned? Are you sure it's an array of JavaScript objects?
I have the identical problem. No matter what I do with the details of the format (e.g. putting quotes around title and value), nothing happens.
I guess the only way (for me, anyway) is to insert the list into the script with PHP before sending the web page.
So I would go to an Instagram account, say https://www.instagram.com/foodie/, to copy the XPath that gives me the number of posts, number of followers, and number of following.
I would then run this command in a scrapy shell:
response.xpath('//*[@id="react-root"]/section/main/article/header/section/ul')
to grab the elements on that list but scrapy keeps returning an empty list. Any thoughts on what I'm doing wrong here? Thanks in advance!
This site is a Single Page Application (SPA), so the DOM is rendered by JavaScript, which has not run yet at the time your downloader fetches the page.
When you use view(response), the JavaScript your downloader collected can be executed by your browser, so you can see the page with the DOM rendered (but without being able to interact with the site's API). If you look at the downloaded content via response.text, you will see this!
In this case, you can use Selenium + PhantomJS to produce a rendered page for your spider.
Another trick: you can use a regular expression to select the JSON part of the script, parse it into a JSON object, and read the attribute values you need (number of posts, following, ...) from it.
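A minimal sketch of that regex trick in Python, using only the standard library. The variable name (window._sharedData) and the JSON shape here are assumptions for illustration; inspect the actual page source for the real script contents:

```python
import json
import re

# Example HTML as the downloader might return it; the script variable
# name and data layout are hypothetical.
html = '''
<html><body>
<script>window._sharedData = {"user": {"posts": 42, "followers": 1000}};</script>
</body></html>
'''

# Grab the object literal between the assignment and the closing "};".
match = re.search(r'window\._sharedData\s*=\s*(\{.*?\});', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(data["user"]["posts"])      # number of posts
    print(data["user"]["followers"])  # number of followers
```

The non-greedy match stops at the first "};", so this works as long as the embedded JSON itself doesn't contain that sequence; for messier pages a proper JSON-prefix parser is safer.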
I have followed How can I find an element by CSS class with XPath?, which gives the selector to use for selecting elements by class name. The problem is that when I use it, it returns an empty result "[]", and I know for a fact there is a div with class "zoomWindow" in the URL fed to the scrapy shell.
My attempt:
scrapy shell "http://www.niceicdirect.com/epages/NICShop.sf/secAlIVFGjzzf2/?ObjectPath=/Shops/NICShop/Products/5696"
response.xpath("//*[contains(@class, 'zoomWindow')]")
I have looked at many resources that propose varied selectors. In my case the element only has one class, so I tried the versions that use "concat", but they didn't work either and I discarded them.
I have installed Ubuntu and Scrapy in a virtual machine just to make sure it was not a bug in my installation on Windows, but my attempt on Ubuntu had the same results.
I don't know what else to try, can you see any typo in the selector?
If you check response.body in the shell, you will see that it doesn't contain an element with class="zoomWindow":
In [3]: "zoomWindow" in response.body
Out[3]: False
But if you open the page in the browser and inspect the HTML source, you will see that the element is there. This means that loading the page involves JavaScript logic or additional AJAX requests. Scrapy is not a browser and doesn't have a JavaScript engine built in. In other words, it only downloads the initial HTML code of the page, without additionally downloading JS and CSS files and "executing" them.
What you can try, for starters, is to use scrapyjs download handler and middleware.
The image you want to extract is also available in the img tag with id="PreviewImage":
In [4]: response.xpath("//img[#id='PreviewImage']/#src").extract()
Out[4]: [u'/WebRoot/NICEIC/Shops/NICShop/547F/0D9A/F434/5E4C/0759/0A0A/124C/58F7/5708.png']
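Note that the extracted src is relative to the host, so it needs to be joined with the page's URL before it can be fetched. A small sketch with Python's standard library, using the shell URL from the question as the base:

```python
from urllib.parse import urljoin

base = "http://www.niceicdirect.com/epages/NICShop.sf/secAlIVFGjzzf2/?ObjectPath=/Shops/NICShop/Products/5696"
src = "/WebRoot/NICEIC/Shops/NICShop/547F/0D9A/F434/5E4C/0759/0A0A/124C/58F7/5708.png"

# The leading "/" makes the path absolute on the host, so urljoin
# keeps only the scheme and domain from the base URL.
full_url = urljoin(base, src)
print(full_url)
# http://www.niceicdirect.com/WebRoot/NICEIC/Shops/NICShop/547F/0D9A/F434/5E4C/0759/0A0A/124C/58F7/5708.png
```

Inside Scrapy itself, recent versions offer response.urljoin(src) for the same purpose.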
I'm trying to use Simple HTML DOM to find objects via XPath.
It's working pretty well but I can't seem to get the current element:
$object->find('.');
$object->find('..');
$object->find('//');
all return an empty array
$object->innertext
returns a normal table with HTML, so the object IS valid.
Simple HTML DOM doesn't recognize '.' for getting the current element; in fact, it uses regular expressions to find elements rather than a real XPath engine.
To solve this problem I used DOMXPath instead of Simple HTML DOM, which has far more options and functionality.
I am using CodeIgniter and everything works just fine. I can assign PHP variables to Smarty and display them.
But now I am calling a web service, and this web service returns a complete HTML (and JavaScript) page.
I want to display this in a smarty template.
So I have done the following:
I have assigned the output of the webservice to a PHP variable and assigned this to a smarty variable (HTMLstring), like I always do. That part works.
In my smarty template I don't need anything but to display the contents of the variable. So my template contains just one line:
{HTMLstring}
But this displays the literal HTML, tags and all. I want to display the rendered output.
(If I copy-paste the output in a separate html file, and open that, it just looks fine)
I 'figured out' the answer.
It appears it makes a difference whether I call the template from code or just type the complete URL into my browser for testing purposes. The latter didn't work; the former does. I still don't know why. Sorry...
Question closed.