How to check for an image update on a website?

How can I check, from another website, whether an image file on a site has changed, and then store both the new version and the old one?
I'm using this to log the images on the server.

This is just a quick sketch of the simplest approach. If you want more detail on some part, just ask in the comments.
Sketch of solution
Download the image, compute a hash for it, store the image in the file system, and store the image ID + hash + file system path (and possibly other info such as the time of request) in a database.
When checking for an update, get the last available info for the same ID from the database; if the hashes are the same, the image was not updated. If you use a cryptographic hash like MD5 or SHA1 and the hash changed, it is almost certain that the image changed too.
Set up a cron job to run the script periodically.
To download the image, you could use $img = file_get_contents($url);. MD5 can be computed via $hash = md5($img);, SHA1 via $hash = sha1($img);. For storing use file_put_contents($path, $img);.
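A minimal sketch of that flow, assuming a SQLite database with an image_log(image_id, hash, path, fetched_at) table and a writable images/ directory (both are illustration-only assumptions, not part of the question):
<?php
// Hypothetical settings - adapt the URL, ID, paths and DSN to your own setup.
$url      = 'http://example.com/logo.png';
$imageId  = 'example-logo';
$storeDir = __DIR__ . '/images';
$pdo      = new PDO('sqlite:' . __DIR__ . '/images.sqlite');

// Download and hash the image.
$img  = file_get_contents($url);
$hash = md5($img);

// Fetch the hash recorded on the previous run, if any.
$stmt = $pdo->prepare('SELECT hash FROM image_log WHERE image_id = ? ORDER BY fetched_at DESC LIMIT 1');
$stmt->execute(array($imageId));
$lastHash = $stmt->fetchColumn();

if ($lastHash !== $hash) {
    // New or changed image: store the file and log the new hash.
    $path = $storeDir . '/' . $imageId . '-' . $hash . '.img';
    file_put_contents($path, $img);
    $stmt = $pdo->prepare('INSERT INTO image_log (image_id, hash, path, fetched_at) VALUES (?, ?, ?, ?)');
    $stmt->execute(array($imageId, $hash, $path, date('c')));
}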
Optimization
There are several ways to optimize the job.
To cut down on memory consumption, download the file directly to the file system using copy($url, $path); and only after that compute the hash using $hash = md5_file($path); or $hash = sha1_file($path);. This is better for larger images. The downside is that you have to read the data from the file system again to compute the hash, so I think it would be slower.
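As a rough sketch of that variant (reusing $url from the example above, with $path being wherever you want the file stored):
<?php
// Stream the download straight to disk instead of holding the whole image in memory...
copy($url, $path);
// ...and hash the stored file afterwards.
$hash = md5_file($path);   // or: $hash = sha1_file($path);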
Side note: Never optimize anything before you know that it really makes the code better. Always measure before, after and compare.
Another optimization can be made to save on data transfers from the remote server, if the server sends reliable caching headers. ETag is the best one, because it should be based on the contents of the file: if it does not change, the file should be the same. If you just want to check the headers, use $headers = get_headers($url, 1);. To fetch really just the headers, issue the HTTP request with the HEAD method instead of GET. Check the get_headers() manual for more info. To check the headers while also getting the response body, use file_get_contents() along with the $http_response_header special variable.
Issuing requests that indicate you cached the image on your last visit (via If-Modified-Since et al.) can serve the same purpose.
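A hedged sketch of both variants; $lastKnownEtag stands for whatever value you stored on the previous run and is an assumption here:
<?php
// Variant 1: HEAD request - fetch only the headers and compare the ETag.
stream_context_set_default(array('http' => array('method' => 'HEAD')));
$headers = get_headers($url, 1);   // 1 = associative array (header key case depends on the server)
$etag    = isset($headers['ETag']) ? $headers['ETag'] : null;
if ($etag !== null && $etag === $lastKnownEtag) {
    return; // server says nothing changed - skip the download
}

// Variant 2: conditional GET - the server answers 304 Not Modified if the ETag still matches.
$context = stream_context_create(array('http' => array(
    'method' => 'GET',
    'header' => 'If-None-Match: ' . $lastKnownEtag . "\r\n",
)));
$img = file_get_contents($url, false, $context);
// $http_response_header now contains the status line and headers of this request.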
Etiquette and legal aspects
I told you how. Now I’ll tell you when (not).
Do not abuse the remote server. Remember that its owner has expenses to keep it up and running, and certainly does not want it tied up by your scripts for more than a negligible amount of time, transferring more than a small amount of data. Adapt your polling period to the type of target server and the size of the image. Adapting it to the estimated frequency of change is not a bad idea either.
Be sure to have the consent of the image's rights holder when storing a copy. Licensing can be a messy thing. Be careful, otherwise you can get into trouble.
If you plan to somehow crawl for the images, the robots.txt standard might be of interest to you. This file can tell you that your crawler is not welcome, and you should respect that.
Related questions
Some are more closely related than others. People mostly want to watch HTML pages, which has its own specifics; that is also why I did not flag this question as a duplicate of one of these.
https://stackoverflow.com/q/11336431/2157640
https://stackoverflow.com/q/11182825/2157640
https://stackoverflow.com/q/13398512/2157640
https://stackoverflow.com/q/15207145/2157640
https://stackoverflow.com/q/1494488/2157640

Related

Store the state inside golang binary

I am developing an on-premise solution for a client, with no control over the machine and no internet connection on it.
The solution is to be monetized based on the number of allowed requests (REST API calls) for a purchased license. So currently we store the request count in an encrypted file on the file system itself. But this solution is not perfect, as the file can be copied somewhere and then put back when the request quota is used up. Also, if the file is deleted, manual intervention from support is needed.
I'm looking for a solution to store the state/data in the binary and update it at runtime (think of a usage count that is updated in the binary itself).
Looking for a better approach.
Also, the binary should start from the previously stored state.
Is there a way to do it?
P.S. I know writing to the binary won't solve the issue, but I think it will increase the difficulty by increasing the number of permutations and combinations of places where the state can be stored. And since it is not common knowledge that you can change the executable, that would be the last place to look for the state if someone is trying to mess with the system (security by obscurity).
Is there a way to do it?
No.
(At least no official, portable way. Of course you can modify a binary and change e.g. the data or BSS segment, but this is hard, OS-dependent and does not solve your problem, as it has the same weakness as an external file: you can just keep the original executable and start over with that one. Some things simply cannot be solved technically.)
If your REST API is within your control and is the part that you are monetizing, surely this is the point at which you would be filtering for the license, perhaps with some kind of certificate authentication or API key. Then you can keep the count on the API side, which you control, and it won't matter whether it is in a flat file or a DB etc., because you control it.
Here is a solution to what you are trying to do (not to writing to the executable itself) that will defeat casual copying of files.
A possible approach is to regularly write the request count and the current system time to a file. This file does not even have to be encrypted - you just need to generate a hash of the data (e.g. using SHA-2), sign it with a private key, and append the signature to the file.
Then when you (re)start the service, read and verify the file using your public key, and check that it has not been too long since the time that was written to it. Note that some initial file will have to be written on installation, and your service will need to be running continually, only allowing for brief restarts. You would also probably verify that the time is not in the future, as that would indicate an attempt to circumvent the system.
Of course this approach has problems, such as the client fiddling with the system time, or even debugging your code to find the private key, and probably others. Hopefully these are hard enough to act as a deterrent. Also, if the service or system is shut down for an extended period of time, some sort of manual intervention would be required.
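The question is about a Go binary, but the sign-then-verify idea is language-agnostic. A rough sketch of the mechanics, shown in PHP only to match the other examples on this page; the key paths, file format and the one-hour threshold are all assumptions:
<?php
// Writing the state (done regularly while the service runs).
// $requestCount is assumed to be tracked elsewhere in the service.
$state      = json_encode(array('count' => $requestCount, 'time' => time()));
$privateKey = openssl_pkey_get_private('file:///etc/myapp/private.pem');   // assumed path
openssl_sign($state, $signature, $privateKey, OPENSSL_ALGO_SHA256);
file_put_contents('/var/lib/myapp/state.dat', $state . "\n" . base64_encode($signature));

// Verifying on (re)start.
$raw = file_get_contents('/var/lib/myapp/state.dat');
list($state, $sig) = array_pad(explode("\n", (string) $raw, 2), 2, '');
$publicKey = openssl_pkey_get_public('file:///etc/myapp/public.pem');       // assumed path
$valid = openssl_verify($state, base64_decode($sig), $publicKey, OPENSSL_ALGO_SHA256) === 1;
$data  = json_decode($state, true);
if (!$valid || $data['time'] > time() || time() - $data['time'] > 3600 /* assumed max gap */) {
    exit('State file missing, tampered with, or too old - refusing to start.');
}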

How do you RESTfully get a complicated subset of records?

I have a question about getting 'random' chunks of available content from a RESTful service, without duplicating what the client has already cached. How can I do this in a RESTful way?
I'm serving up a very large number of items (little articles with text and urls). Let's pretend it's:
/api/article/
My (software) clients want to get random chunks of what's available. There are too many to load them all onto the client. They do not have a natural order, so it's not a situation where they can just ask for the latest. Instead, there are around 6-10 attributes that the client may give to 'hint' at what type of articles they'd like to see (e.g. popular, recent, trending...).
Over time the clients get more and more content, but at the server I have no idea what they have already, and because they're sent randomly, I can't just pass in the 'most recent' one they have.
I could conceivably send up the GUIDs of what's stored locally. The clients only store 50-100 locally. That's small enough to stuff into a POST variable, but not into the GET query string.
What's a clean way to design this?
Key points:
Data has no logical order
Clients must cache the content locally
Each item has a GUID
Want to avoid pulling down duplicates
You'll never be able to make this work satisfactorily if the data is truly kept in a random order (bear in mind the Dilbert RNG Effect); you need to fix the order for a particular client so that they can page through it properly. That's easy to do though; just make that particular ordering be a resource itself; at that point, you've got a natural (if possibly synthetic) ordering and can use normal paging techniques.
The main thing to watch out for is that you'll be creating a resource in response to a GET when you do the initial query: you probably should use a resource name that is a hash of the query parameters (including the client's identity if that matters) so that if someone does the same query twice in a row, they'll get the same resource (so preserving proper idempotency). You can always delete the resource after some timeout rather than requiring manual disposal…
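A hedged sketch of what the initial query could do; the /api/article-feed route, the on-disk cache location and the fetch_matching_article_guids() helper are all hypothetical:
<?php
// GET /api/article-feed?popular=1&recent=0&client=abc
// Same query parameters -> same resource name, so repeating the query is idempotent.
$feedId   = sha1(http_build_query($_GET));
$feedPath = '/var/cache/feeds/' . $feedId . '.json';

if (!file_exists($feedPath)) {
    // Fix one shuffled ordering of the matching article GUIDs and persist it.
    $guids = fetch_matching_article_guids($_GET);   // hypothetical helper
    shuffle($guids);
    file_put_contents($feedPath, json_encode($guids));
}

// The client can now page through the frozen ordering without ever seeing duplicates:
//   GET /api/article-feed/{feedId}?page=3&per_page=50
header('Location: /api/article-feed/' . $feedId . '?page=1', true, 303);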

How to read a file from browsers cache?

I have made a GIF file with more (hidden) information in it than just the picture data.
Like so:
<?php
// set variables
$naam = "gebruikersinformatie";
$info['age'] = 27;
$info['number'] = '1234.56.789';
$info['name'] = 'Arie Noniem';
$info['unique_hash'] = base64_encode(implode("|", array($_SERVER['HTTP_USER_AGENT'],$_SERVER['HTTP_ACCEPT'],$_SERVER['REMOTE_ADDR'],$_SERVER['REMOTE_PORT'],$_SERVER['HTTP_ACCEPT_LANGUAGE'])));
// build the information
$info = base64_encode(http_build_query($info));
// build the image
header('Content-type: image/gif');
echo base64_decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7\n".$info);
?>
Now, I would like to check whether the file is already in the user's cache.
If not: make a new one (with the code above).
If it is in the cache: read that gif.gif file with PHP, as there is more information stored in it.
Question: how do I check whether the file is in the cache? "How to get the browser to cache images, with php?" doesn't work correctly.
And how do I read the cached file, so that PHP gets the real contents from the cached file?
Reason: to avoid the EU cookie law.
I'm afraid you won't be avoiding EU Cookie Law.
Although it has commonly become known as the cookie law, it is a privacy directive with principles applying to any technology where information is placed on, held on or read from the user's device. So the necessity for compliance includes, for instance, Flash files (Locally Stored Objects), tracking pixels (invisible one-pixel images typically used for tracking email opens) and other stuff too.
Just for reference, this answer is based on our experience in putting together ukcookieslaw.co.uk to deal specifically with the UK implementation of the EU Directive (noticing the Dutch in your coding :-).
Assuming that, at its least privacy-invasive, your solution is doing the same as a session cookie and providing a necessary function (like maintaining a log-in), one could argue your solution is actually less compliant, as a session cookie will (usually) be destroyed at the latest when the user quits the browser.
Your more obscured, difficult-to-inspect, deliberately hidden (I appreciate there's no malicious intent) payload can hang around for longer, and given that most people do not empty their cache each time they quit, it will. In fact, in a way you're relying on that.
Without the details one can't take a view, but it may be that the information is more available to third parties, i.e. is there a possibility of the image being cached by intermediaries in the network that you would have to protect against?
You would still have to describe your use of personal data, and rely on either implied consent (or explicit consent) for placing data on the user's device for your site's compliance. Problem is that any consent must be INFORMED consent, and it would appear on the face of it that informing the user is furthest from your mind.
I think you need a better reason for your engineering effort :-)
kind regards,
Philip

fastest etag algorithm

We want to make use of HTTP caching on our website - in particular content validation.
Because our CMS constructs pages from smaller fragments of content, the last-modified date of the actual page is not always an accurate indicator that the page has changed. Hence we also want to make use of ETags. Because page construction is based on lots of other page fragments, we think the only real way to provide an accurate ETag is by performing some sort of digest on the content stream itself. This seems a little overcooked, as caching is supposed to ease the load off the servers, but a content digest is obviously CPU-intensive.
I'm looking for the fastest algorithm to create a unique ETag that is relevant to the content stream (inode etc. is just a kludge and won't work). An MD5 hash is obviously going to give the best unique result, but is anybody else making use of other, faster algorithms in a similar situation?
Sorry, I forgot the important details... We're using Java Servlets, running in WebSphere 6.1 on Windows 2003.
I forgot to mention that there are also live database feeds (we're a bank and need to make sure interest rates are up to date) that can also change the content. So figuring out when the content has changed can be tricky.
I would generate a checksum for each fragment, but compute it when the fragment is changed, not when you render the page.
This way, you pay a one-time cost, which should be relatively small, unless we're talking hundreds of changes per second, and there is no additional cost per request.
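The question's stack is Java Servlets, but the idea is stack-agnostic. A minimal sketch (in PHP, to match the other examples here, with an assumed fragment table holding a checksum column):
<?php
// On fragment save: pay the hashing cost once, when the fragment changes.
function save_fragment(PDO $pdo, $id, $content) {
    $stmt = $pdo->prepare('UPDATE fragment SET content = ?, checksum = ? WHERE id = ?');
    $stmt->execute(array($content, md5($content), $id));
}

// On page render: derive the page ETag from the pre-computed fragment checksums -
// no hashing of the full response body per request.
function page_etag(PDO $pdo, array $fragmentIds) {
    $in   = implode(',', array_fill(0, count($fragmentIds), '?'));
    $stmt = $pdo->prepare("SELECT checksum FROM fragment WHERE id IN ($in) ORDER BY id");
    $stmt->execute($fragmentIds);
    return '"' . md5(implode('|', $stmt->fetchAll(PDO::FETCH_COLUMN))) . '"';
}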

Getting ETags right

I’ve been reading a book and I have a particular question about the ETag chapter. The author says that ETags might harm performance and that you must tune them finely or disable them completely.
I already know what ETags are and understand the risks, but is it that hard to get ETags right?
I’ve just made an application that sends an ETag whose value is the MD5 hash of the response body. This is a simple solution, easy to achieve in many languages.
Is using MD5 hash of the response body as ETag wrong? If so, why?
Why does the author (who obviously outsmarts me by many orders of magnitude) not propose such a simple solution?
This last question is hard to answer unless you are the author :), so I’m trying to find the weak points of using an MD5 hash as an ETag.
ETag is similar to the Last-Modified header. It's a mechanism to determine change by the client.
An ETag needs to be a unique value representing the state and specific format of a resource (a resource could have multiple formats that each need their own ETag). Not unique across the entire domain of resources, simply within the resource.
Now, technically, an ETag has "infinite" resolution compared to a Last-Modified header. Last-Modified only changes at a granularity of 1 second, whereas an ETag can be sub second.
You can implement both ETag and Last-Modified, or simply one or the other (or none, of course). If Last-Modified is not sufficient, then consider an ETag.
Mind, I would not set ETag for "every" resource. Basically, I wouldn't set it for anything that has no expectation of being cached (dynamic content notably). There's no point in that case, just wasted work.
Edit: I see your edit, and clarify.
MD5 is fine. The only downside is calculating MD5 all the time. Running MD5 on, say, a 200K PDF file, is expensive. Running MD5 on a resource that has no expectation of being cached is simply wasteful (i.e. dynamic content).
The trick is simply that whatever mechanism you use, it should be as cheap as Last-Modified typically is. Last-Modified is, again, typically, a property of the resource, and usually very cheap to access.
ETags should be similarly cheap. If you are using MD5, and you can cache/store the association between the resource and the MD5 hash, then that's a fine solution. However, recalculating the MD5 each time the ETag is necessary, is basically counter to the idea of using ETags to improve overall server performance.
We're using ETags for our dynamic content in instela.
Our strategy is to generate, at the end of output, the MD5 hash of the content to be sent; if the If-None-Match header exists, we compare it with the generated hash. If the two values are the same, we send a 304 code and interrupt the request without returning any content.
It's true that we consume a bit of CPU to hash the content, but in the end we save a lot of bandwidth.
We have a Facebook-newsfeed-style main page which has different content for every user. As the newsfeed content changes only 3-4 times per hour, main page refreshes are very efficient for the client side. In the mobile era I think it's better to spend a bit more CPU time than to spend bandwidth; bandwidth is still more expensive than CPU, and it's a better experience for the client.
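A minimal sketch of that strategy; render_page() is a hypothetical stand-in for whatever generates the dynamic content:
<?php
// Buffer the generated output so it can be hashed before anything is sent.
ob_start();
render_page();                 // hypothetical function that echoes the dynamic content
$body = ob_get_clean();

$etag = '"' . md5($body) . '"';
header('ETag: ' . $etag);

// If the client's cached copy is still current, answer 304 and skip the body.
if (isset($_SERVER['HTTP_IF_NONE_MATCH']) && trim($_SERVER['HTTP_IF_NONE_MATCH']) === $etag) {
    http_response_code(304);
    exit;
}

echo $body;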
Having not read the book, I can't speak on the author's precise concerns.
However, the generation of ETags should be such that an ETag is only generated once when a page has changed. Generating an MD5 hash of a web page costs processing power and time on the server; if you have many clients connecting, it could start to cause performance problems.
Thus, you need a good technique for generating ETags only when necessary and caching them on the server until the related page changes.
I think the perceived problem with ETags is probably that your browser has to issue and parse a (simple and small) request/response for every resource on your page to check whether the ETag value has changed server-side.
I personally find these extra small round trips to the server acceptable for often-changing images, CSS and JavaScript (the server does not need to resend the content if the browser's ETag is current), since the mechanism makes it quite easy to mark 'updated' content.

Resources