Scrapy: Check if response is an image

I need to check whether a response is an image.
For work, I need to generate URLs of photos that may or may not exist, and record only the URLs that actually contain an image.
When a generated URL doesn't point to a photo, the website responds with an HTML page whose body is:
<body>No File Found</body>
and response.status is still 200.
The response headers contain no useful information for telling the two cases apart (image vs. No File Found).
For instance:
HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, must-revalidate
Pragma: no-cache
Transfer-Encoding: chunked
Expires: 0
Server: Microsoft-IIS/8.5
X-Powered-By: ASP.NET
X-Frame-Options: AllowAll
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: *
Date: Tue, 13 Aug 2019 01:44:40 GMT
The way that I found to check if the response is an image for this case was:
try:
    no_file_found = response.xpath("/html/body[contains(., 'No File Found')]")
except:
    photo_url = response.url
    photo = PhotoItem()
    photo['id'] = id
    photo['url'] = photo_url
    yield photo
This works because, when the response is an image, the line
no_file_found = response.xpath("/html/body[contains(., 'No File Found')]")
throws this exception:
raise NotSupported("Response content isn't text")
I know this isn't an elegant solution, but it works in this context.
Question
Is there a more elegant way to solve this problem that doesn't rely on try/except?
Note that I don't need to download the image; I just need to record the valid URL.
Any suggestions are welcome.
Thanks in advance!

The simplest way would probably be to just check the type of the response:
from scrapy.http.response.text import TextResponse
if not isinstance(response, TextResponse):
    # it's probably an image; do image stuff
    pass
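If the type check alone feels too loose (any non-text response would pass it), the first bytes of response.body could also be compared against well-known image signatures. The following is only a minimal sketch, assuming that JPEG, PNG and GIF are the formats of interest; looks_like_image is a hypothetical helper name and not part of Scrapy:

```python
# Hypothetical helper (not part of Scrapy): detect common image formats
# from the leading "magic" bytes of a response body.
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF (1987 variant)
    b"GIF89a",             # GIF (1989 variant)
)


def looks_like_image(body: bytes) -> bool:
    """Return True if body starts with one of the known image signatures."""
    # bytes.startswith accepts a tuple of candidate prefixes.
    return body.startswith(IMAGE_SIGNATURES)
```

In a spider callback this would be called as looks_like_image(response.body). The "No File Found" page starts with HTML markup rather than an image signature, so the helper returns False for it.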

Related

How can I extract a certain header from http response [Set-Cookie]

I've searched Google for this but haven't been able to fix the problem yet. I'm sending an HTTP POST request to a certain website, and in return I get these headers:
map[Cache-Control:[no-cache] Content-Type:[application/json; charset=utf-8] Date:[Mon, 19 Oct 2020 15:38:41 GMT] Expires:[-1] P3p:[CP="CAO DSP COR CURa ADMa DEVa OUR IND PHY ONL UNI COM NAV INT DEM PRE"] Pragma:[no-cache] Site-Machine-Id:[CHI1-WEB5027] Set-Cookie:[.ROBL=A6251611CFF4513169C4CED10EFDC2DE9444E207CA376B27B906306AC18BFB0ACA6F17381240973900B19186F3E46C6BC9C52B99BC040579110E87145209A47040B241FE2C702C18AEF12A1AC746812B22596BFDB33C24DF1D5CEC72705DAD266343EC259528D8B7617BBE17408A957DF7A1C2CC7AC9DD9CC05FF8F4831BCC1669FB5221A74E6DB5C8EE0ED7F8F4AFA3767CCC39D919A62C6800EFFFF812DED5325F68D36B410D86A0CAB1FB0B8A90ADD529BE75A2DAFD3EB59D86BBC831C3144E577357B8EB0C514D0433F0B8E69DA151E6BA2C63968B46184167CAE05FE6B4749DC0449C71BB80A1306C6699E9EBD79E4C6A348CC33418D3E0DC3E6F5; expires=Wed, 12-Oct-2050 15:38:42 GMT; path=/; HttpOnly .RBID=eyJhbGciOiJIUzI1NiJ9.eyJqdGkiOiIyY2Q5N2IyOS01MjgxLTRjMWQtYjgxMS03OTQzNWZkNzU0ZjkiLCJzdWIiOjcxNjQ3MzgxOH0.yg9EiXLF4VY2O7Eu5mTdbax60tMrodiPbADWwRwZMeo; expires=Thu, 17-Oct-2030 15:38:42 GMT; path=/; secure; HttpOnly Data=UserID=-733325636; expires=Fri, 06-Mar-2048 16:38:42 GMT; path=/ REventTrackerV2=CreateDate=10/19/2020 10:38:42 AM&rbxid=&browserid=65376450118; expires=Fri, 06-Mar-2048 16:38:42 GMT; path=/] Vary:[Accept-Encoding] X-Frame-Options:[SAMEORIGIN]]
I want to extract .ROBL from Set-Cookie; just doing res.Header.Get(".ROBL") doesn't seem to do the job.
I tried split := strings.Split(string(header), ";"), but that panics on failure, so it's not reliable.
Is there a reliable way to extract .ROBL from Set-Cookie in the header?
Cookies are sent with the Set-Cookie HTTP header, so you can't simply get them as Header.Get("cookie-name"). You would have to parse the Set-Cookie header values. But the standard lib does this for you:
Cookies sent by a server may be parsed using Response.Cookies(). It returns you a slice of cookies (http.Cookie), just iterate over them until you find the one you're looking for.
cookies := resp.Cookies()
for _, c := range cookies {
    if c.Name == ".ROBL" {
        fmt.Println(c)
        fmt.Println(c.Value)
    }
}
Also note that if you want cookie management, you should consider using a CookieJar. For details, see What is the difference between cookie and cookiejar?

Apiary Blueprint attributes under headers and body are not recognized

Final Edit: This works with no semantic errors:
+ Request
+ Headers
Accept: application/json
Content-Type: application/json
X-Auth-Client: Your Client Id
X-Auth-Token: Your Token
+ Body
+ Attributes (ProductPost)
+ Response 200
+ Headers
Content-Encoding: Entity header is used to compress the media-type.
Content-Type: application/json
Date: The date the response was sent.
Transfer-Encoding: Header specifies the form of encoding used to safely transfer the entity to the user.
Vary: HTTP response header determines how to match future request headers to decide whether a cached response can be used rather than requesting a fresh one from the origin server. We use Accept Encoding
X-Rate-Limit-Requests-Left: Header details how many remaining requests your client can make in the current window before being rate-limited.
X-Rate-Limit-Requests-Quota: Header shows how many API requests are allowed in the current window for your client
X-Rate-Limit-Time-Reset-Ms: Header shows how many milliseconds are remaining in the window.
X-Rate-Limit-Time-Window-Ms: Header shows the size of your current rate-limiting window
+ Body
+ Attributes (ProductResponse)
Edit: The header section is rendering, but now the Body section is just showing the text " + Attributes (ProductPost)"
+ Request
+ Headers
Accept: application/json
Content-Type: application/json
X-Auth-Client: Your Client Id
X-Auth-Token: Your Token
+ Response 200
+ Headers
Content-Encoding: Entity header is used to compress the media-type.
Content-Type: application/json
Date: The date the response was sent.
Transfer-Encoding: Header specifies the form of encoding used to safely transfer the entity to the user.
Vary: HTTP response header determines how to match future request headers to decide whether a cached response can be used rather than requesting a fresh one from the origin server. We use Accept Encoding
X-Rate-Limit-Requests-Left: Header details how many remaining requests your client can make in the current window before being rate-limited.
X-Rate-Limit-Requests-Quota: Header shows how many API requests are allowed in the current window for your client
X-Rate-Limit-Time-Reset-Ms: Header shows how many milliseconds are remaining in the window.
X-Rate-Limit-Time-Window-Ms: Header shows the size of your current rate-limiting window
+ Body
+ Attributes (ProductCollectionResponse)
I am trying to define the Request Body and after reading this:
https://help.apiary.io/api_101/apib-authentication/ &
https://github.com/apiaryio/api-blueprint/blob/master/API%20Blueprint%20Specification.md#def-headers-section
It seemed like I could split them into sections. But the Attributes section is not being recognized. This is a /GET request.
Any ideas why?
+ Request (application/json)
+ Headers
+ Attributes (RequestHeaders)
+ Body
+ Attributes (ProductPost)
A Headers section can't contain Attributes; you need to define the headers explicitly. Just replace:
+ Attributes (RequestHeaders)
with the definition of RequestHeaders.
Also try to align Body and Attributes at the same column:
+ Body
+ Attributes (ProductPost)

Drupal 7 & Varnish 4 - I always get X-Drupal-Cache: MISS but X-Cache: HIT

I have run into the same issue as the person in "X-Drupal-Cache for Drupal 7 website always hits MISS", and cannot find a way out.
I am running Drupal 7 - Pressflow
and
Varnish 4.0
When I curl I get this result:
HTTP/1.1 200 OK
Date: Fri, 08 Jul 2016 17:45:08 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Set-Cookie: __cfduid=db5fd757e7485622ac16af86f292603f51467999908; expires=Sat, 08-Jul-17 17:45:08 GMT; path=/; domain=.adland.tv; HttpOnly
X-Content-Type-Options: nosniff
**X-Drupal-Cache: MISS**
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: public, max-age=86400
X-Content-Type-Options: nosniff
Content-Language: en
X-Generator: Drupal 7 (http://drupal.org)
Last-Modified: Fri, 08 Jul 2016 17:41:27 GMT
Vary: Accept-Encoding
X-Varnish: 196743 3
Age: 213
Via: 1.1 varnish-v4
**X-Cache: HIT**
X-Cache-Hits: 22
Server: cloudflare-nginx
CF-RAY: 2bf55922d49b23d8-IAD
isvarnishworking.com tells me: "You deserve a gold star, here you go: gold star badge"....
However, the "Varnish Indicator Chrome Extension" suggested in the linked drupal.org thread tells me Varnish missed on every single page of my website, regardless of whether I am logged in or not.
If I turn Drupal cache for anonymous users at admin/config/development/performance off, Varnish will not work at all. If I set different minimum cache lifetimes there, it makes no difference.
In my settings.php I have this:
$conf['varnish_version'] = 4;
$conf['reverse_proxy'] = True;
$conf['reverse_proxy_addresses'] = array('127.0.0.1');
$conf['page_cache_invoke_hooks'] = FALSE;
$conf['page_cache_maximum_age'] = 86400;
$conf['cache_backends'][] = 'sites/all/modules/varnish/varnish.cache.inc';
$conf['cache_class_cache_page'] = 'VarnishCache';
$conf['reverse_proxy_header'] = 'HTTP_X_FORWARDED_FOR';
$conf['omit_vary_cookie'] = True;
$conf['drupal_http_request_fails'] = FALSE;
and this
$conf['cache_backends'][] = 'sites/all/modules/filecache/filecache.inc';
$conf['cache_backends'][] = 'sites/all/modules/authcache/authcache.cache.inc';
$conf['cache_backends'][] = 'sites/all/modules/authcache/modules/authcache_builtin/authcache_builtin.cache.inc';
$conf['cache_class_cache_page'] = 'DrupalFileCache';
while the following has been commented out of the Varnish config in settings.php, because otherwise Varnish fails:
//$conf['cache'] = 1;
//$conf['cache_lifetime'] = 01080;
I have turned off all modules that could interfere, such as captcha modules, and I will note that the statistics won't count node hits correctly now, so something is being cached...
The VCL I use is grabbed straight from this github master with minimum changes
How can I troubleshoot this X-Drupal-Cache: MISS issue?
Your backend is clearly sending cookies:
Set-Cookie: __cfduid=db5fd757e7485622ac16af86f292603f51467999908; expires=Sat, 08-Jul-17 17:45:08 GMT; path=/; domain=.adland.tv; HttpOnly
In its default configuration, Varnish will not cache an object coming from the backend with a Set-Cookie header present. Also, if the client sends a Cookie header, Varnish will bypass the cache and go directly to the backend.
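These two rules can be modelled as a small predicate. This is only a simplified sketch of the default behaviour described in this answer, not Varnish's actual VCL, which also takes other factors such as TTLs and Cache-Control into account; would_be_cached and its dict-based header arguments are hypothetical names used purely for illustration:

```python
def would_be_cached(request_headers: dict, response_headers: dict) -> bool:
    """Toy model of the two default cookie rules described in the answer."""
    # HTTP header names are case-insensitive, so normalise before checking.
    req = {name.lower() for name in request_headers}
    resp = {name.lower() for name in response_headers}
    if "cookie" in req:
        return False  # client sent a Cookie header: request goes to the backend
    if "set-cookie" in resp:
        return False  # backend response sets a cookie: object is not cached
    return True
```

Feeding it the headers from a curl run gives a quick way to see which of the two rules, if either, would apply to a given page.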

Receiving an image using Python 3 Sockets

I'm having a bit of a hard time receiving an image using sockets. I think the problem is related to the fact that sockets send both a header and the actual image, and that the two need different decoding.
This is the code:
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send(
    'GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n'.encode('utf-8'))
count = 0
fhand = open("stuff.jpg", "wb")
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    fhand.write(data)
mysock.close()
fhand.close()
Yes, there is a header. The end of it is after the first \r\n\r\n sequence. Once you see that sequence send the rest to a file. Here's a crude fix:
import socket
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as mysock:
    mysock.connect(('www.py4inf.com', 80))
    mysock.send(b'GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n')
    header = b''
    while True:
        data = mysock.recv(512)
        if not data:
            raise RuntimeError('no header?')
        header += data
        # end-of-header in buffer yet?
        eoh = header.find(b'\r\n\r\n')
        if eoh != -1:
            break
    # split the header off and keep data read after it.
    header, data = header[:eoh+4], header[eoh+4:]
    print(header.decode())
    with open("stuff.jpg", "wb") as fhand:
        fhand.write(data)
        while True:
            data = mysock.recv(512)
            if not data:
                break
            fhand.write(data)
Here's the header. Note that the content length is in the header, so if you were to send an HTTP request with a keepalive, you would have to read exactly that many bytes after the header. Since Connection: close is specified, you only have to read until no more data is received.
HTTP/1.1 200 OK
Date: Sun, 22 May 2016 23:22:20 GMT
Server: Apache
Last-Modified: Fri, 04 Dec 2015 19:05:04 GMT
ETag: "b294001f-111a9-526172f5b7cc9"
Accept-Ranges: bytes
Content-Length: 70057
Connection: close
Content-Type: image/jpeg
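For the keep-alive case, the Content-Length value has to be pulled out of the header block before the body can be read. The following is a minimal sketch, assuming the header bytes have already been read up to and including the blank line; parse_content_length and recv_exactly are hypothetical helper names, and chunked transfer encoding is not handled:

```python
from typing import Optional


def parse_content_length(header: bytes) -> Optional[int]:
    """Return the Content-Length from a raw HTTP header block, or None."""
    # Skip the status line; the remaining lines are "Name: value" pairs.
    for line in header.split(b"\r\n")[1:]:
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"content-length":
            return int(value.strip())
    return None


def recv_exactly(sock, remaining: int) -> bytes:
    """Read exactly `remaining` bytes from a connected socket."""
    chunks = []
    while remaining > 0:
        data = sock.recv(min(512, remaining))
        if not data:
            raise RuntimeError("connection closed before body was complete")
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)
```

With the header shown above, parse_content_length returns 70057. The count passed to recv_exactly would be that value minus the number of body bytes that were already read together with the header.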

How to know the endtime of each request for each of user in Jmeter

I'm using JMeter and would like to identify the end time of each request for each user.
Please take a look at my test plan:
Thread group: 2 users
loop:1
2 HTTP request (request_1, request_2)
When I start the web performance test, the View Results Tree shows 4 results (2 for request_1, 2 for request_2).
request_2: 1 passed and 1 failed. In the request table of the result tree, I see:
Thread Name: jp#gc - Stepping Thread Group 1-1
Sample Start: 2014-04-18 09:28:06 ICT
Load time: 1100554
Latency: 550450
Size in bytes: 408190
Headers size in bytes: 4774
Body size in bytes: 403416
Sample Count: 1
Error Count: 0
Response code: 200
Response message: OK
Response headers:
HTTP/1.1 200 OK
Date: Fri, 18 Apr 2014 02:28:15 GMT
Server: Apache
X-Powered-By: PHP/5.3.3
Set-Cookie: ls23166422738597439695-runtime-publicportal=h4knpfldt76e3kvmunrn5i4u16; path=/limesurvey/
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
P3P: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
Last-Modified: Fri, 18 Apr 2014 02:36:09 GMT
Cache-Control: post-check=0, pre-check=0
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8
HTTPSampleResult fields:
ContentType: text/html; charset=utf-8
DataEncoding: utf-8
The questions are:
How can I identify the time at which request_2 failed, and how can I display the end time of each request for each user?
How can I display information in JMeter's log panel (with debug logging enabled in the GUI), such as "This is an error ... due to ..."?
Also, in the log panel (with debug logging enabled in the GUI), the log entries sometimes stop at Thread 1-n (n=1,2...), and after about 30 s the log continues. During that pause, does the web server have an error, and is JMeter still sending requests or waiting for the web server's response?
Thanks.
This can be done with a Beanshell PostProcessor, which you can add as a child of any "interesting" request (the code relies on prev, the result of the sampler that has just run). Example code would look like:
import java.util.Date;
long end_time_ms = prev.getEndTime(); // obtain sampler end time (in milliseconds from 1st Jan 1970)
Date end_time_date = new Date(end_time_ms); //convert it to human-readable date if you prefer
String response_message = prev.getResponseMessage(); // get initial response message
StringBuilder response = new StringBuilder(); // initialize StringBuilder to construct new response
response.append(response_message); // add initial response message
response.append(System.getProperty("line.separator")); // add new line
response.append("Thread finished at: ").append(end_time_date); // add thread finish date
prev.setResponseMessage(response.toString()); // set new response message
log.info("Thread finished at: " + end_time_date); // to print it to the log
Never use the GUI for anything apart from developing or debugging tests. If you want to add something to the log, use log.info("something"); as above, or the JMeter __log() function.
