I'm using Scrapy to crawl a webpage. I get the XPath selectors from an XPath Chrome extension, which works fine: I'm getting everything I want on the product page, such as the description and price.
If I click on a small image of an item, the big image of that item pops up, and I want to crawl this big image. But the XPath I'm using for the big image isn't fetching anything. Also, when I viewed the source code, I saw that the page uses a JavaScript function to load these pop-up images. Is there a way to fetch these images?
start_urls = ['http://www.flipkart.com/nokia-lumia-620/p/itmdgkwywkmaa2w4?pid=MOBDGH6AKH9ERJAF']
description = hxs.select('/html/body/div[@class=" fkart fksk-body line "]/div[@id="fk-mainbody-id"]/div[@class="fk-content fksk-content enable-compare line"]/div[@class="fk-mproduct fk-mproduct-mobile "]/div[@class="mprod-section unit"]/div[@id="topsection"]/div[@class="mprod-summary lastUnit"]/div[@class="mprod-summary-title fksk-mprod-summary-title"]/h1/text()').extract()
price = hxs.select('/html/body/div/div/div/div/div/div/div/div/div/div/div/div/span/text()').extract()
image_urls = hxs.select('/html/body/div[@class="fk-ui-dialog fk-popup"]/div[@class="window alpha30 window-absolute"]/div[@class="content"]/div[@class="dialog-body"]/div[@id="pp-large-images-popup"]/div[@class="main-container"]/div[@class="pp-carousel-bd"]/div[@class="visible-image-large fk-text-center"]/img[@id="visible-image-large"]').extract()
Result:
{'description': [u'Nokia Lumia 620'],
'image_urls': [],
'price': u'14999'}
To get the list of image urls for the small thumbnails you can use this XPath:
//div[@class="thumbs thumbs-small"]/img/@src
You can derive the URLs of the big images from the URLs of the thumbnail images. Just replace 40x40 with 275x275 in each thumbnail URL and you will get the URL of the corresponding big image.
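The substitution described above can be sketched in Python as follows. This is only an illustration: the helper name `big_image_url` and the example thumbnail URLs are hypothetical, and it assumes the size token appears in the URL exactly as `40x40`.

```python
def big_image_url(thumbnail_url: str, small: str = "40x40", large: str = "275x275") -> str:
    """Return the large-image URL derived from a thumbnail URL.

    Assumes the size token (e.g. "40x40") appears verbatim in the URL.
    """
    return thumbnail_url.replace(small, large)


# Hypothetical thumbnail URLs, standing in for what the thumbnail XPath returns.
thumbnail_urls = [
    "http://img.example.com/image/mobile/a/f/nokia-lumia-620-40x40-one.jpeg",
    "http://img.example.com/image/mobile/a/f/nokia-lumia-620-40x40-two.jpeg",
]

# Build the list of big-image URLs for the item.
image_urls = [big_image_url(url) for url in thumbnail_urls]
print(image_urls)
```

In a spider, the resulting list could be assigned to the item's image_urls field in place of the empty result shown in the question.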
I'm tasked with making some of the images on a website appear in Google Sheets via the IMPORTXML function. I have a basic knowledge of XPath, but here I am trying to show the current image itself (because it changes often) instead of just pulling in the image URL.
On this link: https://www.kissusa.com/nails/best-sellers
I'm using this, which returns "N/A": =importxml(A2,"//*[@id='layer-product-list']/div[2]/ol/li1","src")
The goal is to get the image in the first row of products.
Any suggestions on how this can happen are greatly appreciated.
I am trying to reproduce an RSS reader like Feedly with Google Sheets, displayed with Glide as an app on my mobile phone.
Everything's fine with IMPORTFEED() function with titles, description, URL.
But it seems this function doesn't allow pictures to be displayed even if they are in the feed (which is not all the time).
So I am looking for a way to extract the main image from a blog post... the one displayed when you hover on a link in a Google Sheet cell.
I would like to get the link of that image displayed in the link preview and put that link in another cell.
Here is an example:
I tried IMAGE()
and also IMPORTXML when there is an image in the RSS feeds (but not all of them do... so I stopped)
Is it possible in Google Sheets to get the main image from the one displayed in the link preview?
For instance, one of the blogs whose articles I want to extract the main picture from is Creajv (URL: https://creajv.com/ ; Feed: https://creajv.com/feed/)
So the IMPORTFEED() function I did in Google Sheets was :
=IMPORTFEED("https://creajv.com/feed/";"items";FALSE;3)
Which stands for :
=IMPORTFEED(...) the function to import feeds from a URL
"items" the way to pull every data there is in the feed (you can use other parameters and can see all the possibilities on the GoogleFormulas documentation)
FALSE because I don't want the headers to be included
and the number 3 because I want only the last 3 results displayed.
And it displays perfectly : author, description, URL, date
But I did a little digging in Google and found that basically IMPORTFEED() cannot get images from feeds, even if it is added by the author of the blog (he has to add a feature to do it).
So I am now trying to see if there is another way which is not IMPORTFEED() to get every time the main image of a blog post.
And I saw that Google Sheets is able to pull it instantly when I paste the URL of a blog article into a cell, for instance for Creajv:
Print screen of the image I get when I click in the cell which contains the post URL
So my thoughts would be that I can pull the author, date, description etc. with IMPORTFEED (which works perfectly every time) and use a formula on the cell with the URL to get in another cell the URL of the picture pulled from the one in the link preview.
Two other possibilities might also be with Google App Script :
creating with the App Script a custom function
or creating a script pulling the image in a cell every time a new row is added via the IMPORTFEED() function.
Functions only, as Apps Script doesn't run on mobile Apps
How about this solution? I checked the website and inspected the image from the thumbnail.
Luckily, the structure is simple:
<div class="article-image">
<img src="https://creajv.com/wp-content/uploads/2020/11/HighresScreenshot00000.png" alt="Concours de Level Design avec Unreal Engine, du 11/11 au 05/12/2020">
</div>
You can get the url with IMPORTXML, and apply IMAGE to it:
=IMAGE(ImportXml("https://creajv.com/2020/11/08/concours-de-level-design-avec-unreal-engine/", "//div[@class='article-image']//img/@src"))
Since you are already retrieving the post URL with your previous formula, replace the source URL with a reference to the corresponding cell:
=IMAGE(ImportXml(C1, "//div[@class='article-image']//img/@src"))
For example:
I'm scraping image urls and text from tables and I want to know how to collect nothing (or at least no url) for a cell whose image is missing. I don't want to remove any rows since I want to reproduce the table as is. Here's an example of a table with missing images and the html to the right.
I'm using the following in google sheets:
=importxml(D1,"//div[@class='colsx immagine']/img/@src")
=importxml(D1,"//div[@class='coldx domanda']")
which works fine to get the image URLs and text for each row if the images are all there (as on this page). But if any images are missing, then I collect a URL for the (wrong) image further down in the table. I want to skip URL collection if there's no img URL.
I'm just starting to learn XPath, and I suspect I need to use the | or not to fix this, but I need some help because nothing I've tried works.
Thanks
You can solve this by adding another predicate. So assuming that you want to list the URLs of the //div[@class='lista'] with an image, you can use
=importxml(D1,"//div[@class='lista' and div/img]/div[@class='colsx immagine']")
=importxml(D1,"//div[@class='lista' and div/img]/div[@class='coldx domanda']")
=importxml(D1,"//div[@class='lista' and div/img]/div[@class='coldx risposta active gius']")
and so on.
This should skip all class 'lista' div's without an img tag and select its URLs.
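For comparison, here is a minimal Python sketch of a related but different technique: instead of filtering rows with a predicate, it evaluates the image path relative to each 'lista' row, so a row without an img yields an empty string and the remaining URLs stay aligned with their text. The sample markup is a hypothetical, simplified stand-in for the real table (and must be well-formed XML for the standard-library parser used here); the function name `extract_rows` is also an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the second row has no <img>.
SAMPLE = """
<div>
  <div class="lista">
    <div class="colsx immagine"><img src="img/101.gif"/></div>
    <div class="coldx domanda">Question one</div>
  </div>
  <div class="lista">
    <div class="colsx immagine"></div>
    <div class="coldx domanda">Question two</div>
  </div>
  <div class="lista">
    <div class="colsx immagine"><img src="img/103.gif"/></div>
    <div class="coldx domanda">Question three</div>
  </div>
</div>
"""


def extract_rows(markup: str) -> list[tuple[str, str]]:
    """Return one (image_url, text) pair per row; image_url is '' when missing."""
    root = ET.fromstring(markup)
    rows = []
    for row in root.findall(".//div[@class='lista']"):
        # Paths are evaluated relative to this row, so rows cannot shift.
        img = row.find("div[@class='colsx immagine']/img")
        text = row.findtext("div[@class='coldx domanda']", default="")
        url = img.get("src", "") if img is not None else ""
        rows.append((url, text.strip()))
    return rows


for url, text in extract_rows(SAMPLE):
    print(repr(url), text)
```

Because every row produces exactly one pair, the table can be reproduced as is, with a blank where the image is missing.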
I built a website in plain html/css with my own design. Now I need to put this website in TYPO3 CMS 9.5.4. Unfortunately it's my first time working with TYPO3 and I don't really know what I'm doing.
What I got so far:
Most of the website is already working fine. I included fluid_styled_content and my setup basically looks like this:
page = PAGE
page.1 = FLUIDTEMPLATE
page.1 {
  file = fileadmin/sitedesign/Resources/Private/Templates/Page.html
  variables {
    content < styles.content.get
  }
}
The Page.html file is basically my whole html template and I put
{content->f:format.raw()}
where I want my content.
All content I create in the backend is displayed as I want, except for images.
My question:
I can display images by creating a "Text & Images" content element and adding the images in the "Images" tab. In the "Media Adjustments" section I can now set the width and height of each element and below I can choose the number of columns.
However these do not change anything in the source code of my website i.e. in the content variable, so all images are displayed in full size.
What can I do to make the width/height appear in the source code (ideally as width/height attribute of that element)?
Hey Erik, and welcome to TYPO3. Usually TYPO3 will take care of the correct image sizes when you use the default content elements (e.g. Text & Images). But TYPO3 requires ImageMagick or GraphicsMagick to be installed on the server in order to modify the pictures.
You can check out if your system matches all requirements using the "Environment" module in the TYPO3 backend (modules on the left side of the backend). Then you will see a function called "Image processing" which will test the required image functions of your server.
I'm new in this topic, so my apologies for this question :)
I need to create a page that shows all images uploaded in a post, one at a time. When the user clicks the next button, it must load the next picture, replacing the previous one.
< [ image1 ] >
Title
Description
other content
--> User clicks next
< [ image2 ] >
Title
Description
other content
However, to speed up image display, the two following images need to be downloaded in hidden panels, so that they are already cached when the user clicks next, at which point the next bundle of images is loaded.
If I use Ajax for this task, will the browser use the already-downloaded images, or will Ajax download them once again?
Is there a more efficient way to do this?
Thank you very much!
You can download the next image's data using Ajax, convert it to a base64 string, and embed it in the HTML: when the next button is clicked, use JavaScript to change the src attribute of your image to "data:image/png;base64,(base64stringhere)", replacing (base64stringhere) with the base64 string of the downloaded image.
ref: http://www.techerator.com/2011/12/how-to-embed-images-directly-into-your-html/
ref: How can you encode a string to Base64 in JavaScript?
Note: most browsers cache images, so it will waste quite a bit of bandwidth if the users are viewing images they already have downloaded.
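To illustrate the data URI format mentioned in the answer, here is a minimal sketch in Python (the answer itself does this step in JavaScript in the browser). The function name `to_data_uri` is hypothetical, and the 8-byte PNG signature is used only as a stand-in for real image data.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI usable as an <img> src value."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The PNG file signature stands in for a real downloaded image here.
png_signature = b"\x89PNG\r\n\x1a\n"
print(to_data_uri(png_signature))
```

Note that base64 encoding makes the payload roughly a third larger than the original bytes, which adds to the bandwidth concern raised above.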