I was wondering if you guys can help me out. I'm making a website for an authorized Dish dealer. I've been trying to retrieve the flash animation on dish.com. I was able to get it through Firefox with 'Page Info', but all I get is a black rectangle with no animation. It's a .swf; any help is appreciated.
There are a couple of possible issues there.
There's a chance the movie loads additional assets from its own domain; those won't load when the container movie is embedded on another domain, due to cross-domain policy issues (and security sandbox restrictions).
Also, the developer may have coded the movie to check the URL it is running on and restrict some actions accordingly.
Another possibility is that it relies on data it gathers from the embedding page, or on a server-side script, which you might not be providing.
These are the most likely possibilities I can think of, but if I get a chance to think more I'm sure I can come up with many others.
My client needs data scraped from a website. I am planning to use php_curl. The problem is, the site is using Google reCAPTCHA. A few important data items are visible only when you click the "show this information" link; the reCAPTCHA then appears in a lightbox, vanishes once solved, and the information is displayed.
I have checked the source HTML; the protected item is only loaded when someone clicks, and there is no way for me to automate this click. I have even tried opening the site in an iframe and then using JS to click it, but that fails because the two domains are different. I have also tried the Selenium standalone version, but its downloads are corrupt.
Unless there is a design flaw with the website, the reCAPTCHA will prevent you from scraping the material without human intervention.
Technically, your best bet is to employ humans to solve CAPTCHAs all day and write some software that automatically scrapes the material each one protects as they solve it. A number of viable businesses have been created this way, where the data is valuable and there is a genuine public interest in opening the data-set. (For example, I have heard that flight companies use CAPTCHA devices to prevent price-comparison sites from driving down the cost to the consumer, and I'd argue that in such a case there is an overwhelming public interest in defeating such defences.)
Morally, however, you would need to tell us what you are doing in order for us to advise you. It is possible your client is merely planning to steal other people's material and then attempt to monetise it for him/herself, even though they had no hand in creating it. That may breach some copyright laws, but moreover, they (and you) need to decide if the scraping is fair.
I faced the same problem but resolved it by clearing the cookies on the HTTP request / user agent, then waiting for some time (Thread.Sleep) before starting to scrape again. I am doing this in C#, not in PHP, but applying the same logic may help you.
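For what it's worth, a rough PHP/cURL sketch of the same idea (clear the cookie jar, wait, then request again) might look like this; the URL, cookie-jar path, and user-agent string are placeholders:

```php
<?php
// Sketch of the "clear cookies, wait, retry" approach using PHP/cURL.
// The URL and cookie-jar path below are placeholders.
$cookieJar = '/tmp/scrape_cookies.txt';
$url       = 'https://example.com/protected-page';

// "Clear cookies" simply means deleting the jar before the next attempt.
if (file_exists($cookieJar)) {
    unlink($cookieJar);
}

// Wait a while before trying again (the PHP equivalent of Thread.Sleep).
sleep(30);

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => $cookieJar,  // where cURL writes cookies
    CURLOPT_COOKIEFILE     => $cookieJar,  // where cURL reads cookies from
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)',
]);
$html = curl_exec($ch);
curl_close($ch);
```

Note that this only mirrors the retry logic; it does nothing about the reCAPTCHA itself.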
I have a Magento website and I have been noticing an increase in warnings from Catchpoint that various images, CSS files, and JavaScript files are taking longer than usual to load. We use Edgecast for our CDN and have all images, CSS, and JS files hosted there. I have been in contact with them, and they determined that the delays happen when the cache for a resource has expired and the CDN must contact the origin for an updated file. The problem is that I can't figure out why it would take longer than a second to return a small image file. If I load the offending image off our server (not from the CDN) in my browser, it always returns quickly. I assume that if you request an image file directly using its full URL (say a product image, for example), that bypasses any Magento logic or database access and simply returns the image to you. This should happen quickly, and it normally does, but sometimes it doesn't.
We have a number of things in play that may have an effect. There are API calls to the server for various integrations, though they are directed at a secondary server and not the web frontend. We may also have a large number of stale images since Magento doesn't delete any images even if you replace them or delete the product.
I realize this is a fairly open-ended question, and I'm sorry if it breaks SO protocol, but I'm grasping at straws here. If anyone has any ideas on where to look, or what could cause small resource files like images to take upwards of 8 seconds to load, I'm all ears. As an eCommerce site, we're getting close to peak season, and I can feel the hot breath of management on my neck. Any help would be greatly appreciated.
Thanks!
Turns out we had stumbled upon some problems with the CDN that they were somewhat aware of and not quick to admit. They made some changes to our account to work around the issues and things are much better now.
Google and Twitter, just to name a couple, have code in their pages that detects slow page load times and presents the user with a nice message. I would like to implement something similar, and before I dig in and try to reverse-engineer their implementations, I was wondering if there are any existing components that may help achieve this goal. (My search-fu failed me.)
Thanks!
Haven't seen any off-the-shelf libraries that do this, but isn't it really just like an AJAX spinner, only one that appears after a short delay?
We have members-only paid content that is frequently copied and republished without our permission.
We are trying to 'watermark' our content by including each customer's user ID in a fake CSS class, for example <p class='userid_1234'> (except not so obvious, of course :), which would help us track the source of the copying; we then place that class somewhere in the article body.
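To illustrate, the marker could be derived from the user ID rather than embedding it literally. This is just a sketch: the salt is a placeholder and $userId is whatever your CMS provides.

```php
<?php
// Sketch: derive a less obvious per-user marker from the user ID, so the
// "watermark" class doesn't literally read userid_1234. The salt is a placeholder.
$marker = substr(md5('some-secret-salt' . $userId), 0, 8);
echo '<p class="c' . $marker . '">';
```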
The problem is, by including user-specific information into an article, it makes it so that the article content is ineligible for caching because it is now unique to each user.
This bumps the page load time from ~0.8 ms to ~2.5 sec for each article page view.
Does anyone know of any watermarking strategies that can still be used with caching?
Alternatively, what can be done to speed up database access? (Ha, ha, that's just a tiny topic, I'm sure...)
We're using the CMS Expression Engine, but I'd like to hear about any strategies. They don't have to be EE-specific.
If you're talking about images then you could use PHP to add a watermark to the images.
How can I add an image onto an image in PHP like a watermark
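If that route interests you, a minimal GD sketch of server-side watermarking looks something like this (the file names are placeholders):

```php
<?php
// Minimal GD sketch: stamp a semi-transparent watermark PNG onto a JPEG.
// File names are placeholders.
$photo     = imagecreatefromjpeg('product.jpg');
$watermark = imagecreatefrompng('watermark.png');

$wmWidth  = imagesx($watermark);
$wmHeight = imagesy($watermark);

// Place the watermark in the bottom-right corner with a 10px margin.
$destX = imagesx($photo) - $wmWidth - 10;
$destY = imagesy($photo) - $wmHeight - 10;

imagecopy($photo, $watermark, $destX, $destY, 0, 0, $wmWidth, $wmHeight);

header('Content-Type: image/jpeg');
imagejpeg($photo);

imagedestroy($photo);
imagedestroy($watermark);
```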
It's a tool to help track down the lazy copiers who just copy the source code as-is. This is not preventative, nor is it a deterrent. – Ian 12 hours ago
Going by your comment above, you are happy for users to copy your content, just not without the formatting etc. So what you could do is provide users with an embed-style snippet of source code for that particular content, just like YouTube does with videos. Into that embed code you could add your own links back to your site, utilize your own CSS, and so on.
That way you can still allow the members to use the content but it will always come out the way you intended it with links back to your site.
Thanks
You could always cache a version that uses a special string, like #!username!#, and then later fill it in with PHP based on which user is viewing it.
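A rough sketch of that idea, assuming you have some way to fetch the cached HTML and the current user (both helpers below are hypothetical):

```php
<?php
// Sketch: cache one generic copy of the article containing the placeholder
// "#!username!#", then substitute the current user's ID at request time.
// get_cached_article() and current_user_id() are hypothetical helpers.
$html = get_cached_article($articleId);
$html = str_replace('#!username!#', current_user_id(), $html);
echo $html;
```

This keeps the expensive rendering and database work cacheable and leaves only a cheap string replacement per request.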
Another approach is to switch from caching on the server to letting the browser cache the page locally for a little while. That way it is only cached per user, and it reduces the calls to your database. Because an article is fairly static, you could let the local machine cache it and pull in comments via JavaScript.
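One way to hint that to the browser, as a sketch (the ten-minute lifetime is arbitrary):

```php
<?php
// Sketch: let each visitor's browser cache the article privately for 10 minutes
// instead of caching it on the server; comments would then be loaded via JavaScript.
header('Cache-Control: private, max-age=600');
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 600) . ' GMT');
```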
This last one is probably not one you are really looking for, but I'm gonna come out and say it anyway. You could not treat your users like thieves, and instead treat the thieves as thieves. Go to the person hosting the servers your content is on and send them an email telling them copyrighted premium content is being hosted on their servers without your permission. You can even automate that process.
How do you find out which sites are posting your content? Put a link in the body content back to your site, and do a Google Search/Blog Search for articles linking to that site. To automate it, use Google Blog Search, because it offers RSS feeds. Any result that links back to your site could go into a database with a link to the page; someone could then review it, and if it reproduces the entire article, do a Whois lookup and send them an email.
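As a sketch of that automation (the feed URL is a placeholder for whatever blog-search RSS feed you end up using, and save_for_review() is a hypothetical helper):

```php
<?php
// Sketch: poll a blog-search RSS feed for pages linking back to your site
// and queue anything found for manual review. The feed URL is a placeholder.
$feedUrl = 'https://blogsearch.example.com/feeds?q=link:yoursite.com&output=rss';
$feed    = simplexml_load_file($feedUrl);

foreach ($feed->channel->item as $item) {
    $pageUrl = (string) $item->link;
    // save_for_review() is a hypothetical helper that stores the URL in a
    // database table so someone can check whether the full article was copied.
    save_for_review($pageUrl);
}
```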
What makes you think adding CSS to something is going to stop people from copying it without that CSS? It's more likely that they are just copying the source of the content you are showing them and ignoring all the styling around it. For example, I use Tamper Data to look at all HTTP requests made by Firefox; if I can see it on the page, I can see it in the logs. Even with all the "protection" some sites try to put in place, it generally never works. I can grab what I want without using any screen capture/recording.
If you were serving FLVs, for example, I would easily be able to grab the source of those even if you overlaid them with some CSS. I think the best approach would be to contact the sites publishing your premium content and ask them to remove it. It's either that or watermark the actual content on the fly while sending it to the browser.
I'm working on a quite popular website, which looks good if the user has turned on the "Load images" option in their browser's settings.
When you open the website with images turned off, it becomes unusable: many components won't work, because the user can't see the "important" buttons (we don't use standard OS buttons).
So we can't understand or measure the negative business impact of this mistake (absent alt/title attributes).
We can't set a priority for this task, because we don't know how many such users come to our website.
Please give me some advice on how this problem can be solved.
Look in the logs for how many hits you get on a page without the subsequent requests from the browser for the other images.
Of course the browser might have images cached, so look for the first time you get a hit.
You can even use the IP address for this, since it's OK to throw out some good data (that is, hits that are actually new but that you disregard). The question is just: of the hits you know are first-time, how many don't request images?
If this is a public page (i.e. not a web application that you've logged in to), also disregard search engine bots to the greatest extent possible; they usually won't retrieve images.
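A rough sketch of that kind of log analysis, assuming an Apache-style access log where each line starts with the client IP (the log path, page path, and image path patterns are assumptions):

```php
<?php
// Sketch: estimate how many visitors request the page but never request images.
// The log path, page path, and image path patterns are assumptions.
$pageHits  = [];   // IPs that requested the HTML page
$imageHits = [];   // IPs that requested at least one image

foreach (file('/var/log/apache2/access.log') as $line) {
    // In the common/combined log format, the line starts with the client IP.
    $ip = strtok($line, ' ');

    if (strpos($line, 'GET /index.php') !== false) {
        $pageHits[$ip] = true;
    }
    if (preg_match('#GET /(images|media)/\S+\.(png|jpe?g|gif)#', $line)) {
        $imageHits[$ip] = true;
    }
}

$withoutImages = array_diff_key($pageHits, $imageHits);
printf("%d of %d page-requesting IPs never requested an image\n",
       count($withoutImages), count($pageHits));
```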