The Google documentation says there's a limit of 50k URLs in the sitemaps you send to them, and I want my sitemap to be submitted by an automated job periodically. Shouldn't I therefore just have the sitemap contain only the N most recent URLs added to my site? Yes, I know you can have multiple sitemaps, and I do have a separate one for the static HTML pages on the site. But I also need one for the database content that may not be reachable in one hop from the main pages, and I don't like the idea of a growing list of sitemaps (it may sound like 50k is more than enough, but I don't want to code with that assumption).
Sure - if you know your previous pages (from an older sitemap.xml, or simply crawled) are indexed, you should be fine including only new links.
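For what it's worth, here is a minimal sketch of what such a periodic job could look like in PHP. The PDO connection details, the `pages` table, and its `url`/`created_at` columns are assumptions; swap in your real schema.

```php
<?php
// Sketch: build a sitemap containing only the N most recently added URLs.
// The DSN, credentials, table and column names are placeholders.
$pdo   = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
$limit = 1000; // the "N most recent" URLs you want to submit

$rows = $pdo->query(
    'SELECT url, created_at FROM pages ORDER BY created_at DESC LIMIT ' . (int) $limit
);

$w = new XMLWriter();
$w->openMemory();
$w->startDocument('1.0', 'UTF-8');
$w->startElement('urlset');
$w->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

foreach ($rows as $row) {
    $w->startElement('url');
    $w->writeElement('loc', $row['url']);
    $w->writeElement('lastmod', date('c', strtotime($row['created_at'])));
    $w->endElement();
}

$w->endElement();
$w->endDocument();
file_put_contents('sitemap-recent.xml', $w->outputMemory());
```

Run that from cron and ping Google with the resulting file, and the older sitemaps simply age out of the picture.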
I'm building my first ever website using Laravel 5.2. Right now, I'm only serving static content with a few API requests for things like the current weather. I've never built a website before, but I'm running my own droplet at DigitalOcean, so there are no shared hosting limitations.
How would I implement a search that allows users to search my site's content from the main screen? Currently there's no database interaction at all; it's all just Blade/HTML. I want to avoid using Google Custom Search because there should be no ads, and I want to learn along the way.
Please advise.
It depends on the amount of static content you have on your site. Maybe you could try implementing it the following way:
Create a lookup of the view pages with their most relevant keywords.
When a user searches for a keyword that matches any of those relevant keywords, load the respective view page.
I would store the lookup in JSON format; it would be similar to content with multiple tags on it.
Creating the JSON file is going to be tedious and needs to be done manually, but the matching itself is simple (a sketch follows below).
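If it helps, here is a rough sketch of that idea in a Laravel 5.2 controller. The `pages.json` lookup file, the controller name, and the view names are all made up for the example.

```php
<?php
// app/Http/Controllers/SearchController.php -- hypothetical controller.
// Route (app/Http/routes.php): Route::get('/search', 'SearchController@search');
namespace App\Http\Controllers;

use Illuminate\Http\Request;

class SearchController extends Controller
{
    public function search(Request $request)
    {
        // pages.json maps view names to keyword lists, e.g.
        // { "weather": ["weather", "forecast"], "about": ["about us", "contact"] }
        $lookup = json_decode(file_get_contents(base_path('resources/pages.json')), true) ?: [];
        $query  = strtolower(trim($request->input('q', '')));

        $matches = [];
        foreach ($lookup as $view => $keywords) {
            foreach ($keywords as $keyword) {
                // Treat a partial match on any keyword as a hit for that view.
                if ($query !== '' && strpos(strtolower($keyword), $query) !== false) {
                    $matches[] = $view;
                    break;
                }
            }
        }

        if (count($matches) === 1) {
            return view($matches[0]);   // exactly one hit: show that page directly
        }

        // 'search-results' is a hypothetical Blade view listing the matches.
        return view('search-results', ['matches' => $matches, 'q' => $query]);
    }
}
```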
Is there a certain way to check which pages on a website use a specific image?
Say I have some image which I don't use on a page anymore, so I'd like to delete it from my server. But I'm not entirely sure whether it's being used on other pages. Is there a way to check if it's still being shown somewhere else?
You can hook your website up to Google Webmaster Tools and wait a little; after a while, 404 errors will appear there. This way you can track unused resources and dead ends.
This includes images.
There is a better way if you have direct access to the web server.
Visit every page on your website, or let Google crawl it.
You can then sort the image files by their last access time; the ones that have not been accessed recently are not being used (serving a file updates its access time, not its modification time).
You have to make sure the images are actually fetched when the pages are visited, so I would use a history-less, cache-less session.
How do I sort the files according to their timestamp on Unix?
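On the shell, `ls -ltu` lists files sorted by access time, most recently accessed first. If you'd rather script the check, here is a rough PHP sketch; the directory path is a placeholder, and it assumes your filesystem actually records access times (no `noatime` mount option).

```php
<?php
// Sketch: list image files sorted by last access time, least recent first.
// Files at the top have not been requested in a long time and are candidates
// for deletion. The directory path is a placeholder.
$dir   = '/var/www/example.com/public/images';
$files = glob($dir . '/*.{jpg,jpeg,png,gif}', GLOB_BRACE);

usort($files, function ($a, $b) {
    return fileatime($a) - fileatime($b);   // ascending: least recently accessed first
});

foreach ($files as $file) {
    echo date('Y-m-d H:i', fileatime($file)), '  ', basename($file), PHP_EOL;
}
```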
Having submitted my sitemap.xml through Webmaster Tools normally the first time, I notice that the plot of submitted URLs (beside the indexed ones, under the Optimization -> Sitemaps menu) keeps changing every day without me doing anything. I use Drupal 7 with the XML sitemap module (http://drupal.org/project/xmlsitemap) and there are no automated tasks enabled.
Does this mean that URLs are submitted "internally" by Google every day? Or is there something wrong that I need to resolve?
Many thanks for help.
Google will remember any sitemaps you submit and their crawler will automatically download those and associated resources more or less whenever it feels like doing so. This is usually reflected in your Webmaster Tools. In all likelihood it'll even do so without you entering your sitemap on their website if your site gets linked to. Same goes for pretty much any other bot and crawler out in the wild.
No need to worry, everything is doing what it's supposed to. It's a Good Thing(tm) when Google crawls your site frequently :).
I'm in the process of creating a sitemap for my website. I'm doing this because I have a large number of pages that users can normally only reach via a search form.
I've created an automated method for pulling the links out of the database and compiling them into a sitemap. However, for all the pages that are regularly accessible, and do not live in the database, I would have to manually go through and add these to the sitemap.
It strikes me that the regular pages are those that get found anyway by ordinary crawlers, so it seems like a hassle manually adding in those pages, and then making sure the sitemap keeps up to date on any changes to them.
Is it bad to just leave those out, if they're already being indexed, and have my sitemap contain only my dynamic pages?
Google will crawl any URLs (as allowed by robots.txt) it discovers, even if they are not in the sitemap. So long as your static pages are all reachable from the other pages in your sitemap, it is fine to exclude them. However, there are other features of sitemap XML that may incentivize you to include static URLs in your sitemap (such as modification dates and priorities).
If you're willing to write a script to automatically generate a sitemap for database entries, then take it one step further and make your script also generate entries for static pages. This could be as simple as searching through the webroot and looking for *.html files. Or if you are using a framework, iterate over your framework's static routes.
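As a rough illustration (not tied to any particular framework), the static half of that script could look something like this; `$webroot` and `$baseUrl` are assumptions.

```php
<?php
// Sketch: walk the webroot, find *.html files, and print <url> entries that
// can be appended to the sitemap the database script already generates.
$webroot = '/var/www/example.com/public';   // placeholder
$baseUrl = 'https://www.example.com';       // placeholder

$it = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($webroot, FilesystemIterator::SKIP_DOTS)
);

foreach ($it as $file) {
    if ($file->isFile() && strtolower($file->getExtension()) === 'html') {
        $path = str_replace($webroot, '', $file->getPathname());
        $loc  = $baseUrl . str_replace(DIRECTORY_SEPARATOR, '/', $path);
        printf(
            "<url><loc>%s</loc><lastmod>%s</lastmod></url>\n",
            htmlspecialchars($loc, ENT_XML1),
            date('c', $file->getMTime())
        );
    }
}
```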
Yes, I think it is not good to leave them out. I think it would also be advisable to look for a way that your search pages can be found by a crawler without a sitemap. For example, you could add some kind of advanced search page where a user can select the search term in a form; crawlers can also fill in those forms.
We have members-only paid content that is frequently copied and republished without our permission.
We are trying to ‘watermark’ our content by including each customer’s user id in a fake CSS class, for example <p class='userid_1234'> (except not so obvious, of course :), which would help us track the source of the copying; we then place that class somewhere in the article body.
The problem is, by including user-specific information into an article, it makes it so that the article content is ineligible for caching because it is now unique to each user.
This bumps the page load time from ~.8ms to ~2.5sec for each article page view.
Does anyone know of any watermarking strategies that can still be used with caching?
Alternatively, what can be done to speed up database access? (Ha, ha, that’s just a tiny topic, I’m sure..)
We're using the CMS Expression Engine, but I'd like to hear about any strategies. They don't have to be EE-specific.
If you're talking about images then you could use PHP to add a watermark to the images.
How can I add an image onto an image in PHP like a watermark
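In case it's useful, a minimal GD sketch along the lines of that linked question; the file names and the 40% merge opacity are placeholders.

```php
<?php
// Sketch: stamp a PNG watermark onto a JPEG photo with GD.
// File names and the 40% opacity are placeholders.
$photo = imagecreatefromjpeg('article-photo.jpg');
$stamp = imagecreatefrompng('watermark.png');

// Place the stamp in the bottom-right corner with a 10px margin.
$x = imagesx($photo) - imagesx($stamp) - 10;
$y = imagesy($photo) - imagesy($stamp) - 10;

imagecopymerge($photo, $stamp, $x, $y, 0, 0, imagesx($stamp), imagesy($stamp), 40);

imagejpeg($photo, 'article-photo-watermarked.jpg', 90);
imagedestroy($photo);
imagedestroy($stamp);
```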
It's a tool to help track down the lazy copiers who just copy the source code as-is. This is not preventative, nor is it a deterrent. – Ian
Going by your comment above, you are happy with users copying your content, just not without the formatting etc. So what you could do is provide users with an embed-style snippet of source code for that particular content, just like YouTube does with videos. Into that embed code you could add your own links back to your site, use your own CSS, and so on.
That way you can still allow the members to use the content, but it will always come out the way you intended, with links back to your site.
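A very rough sketch of generating such an embed snippet; the URLs and dimensions are placeholders.

```php
<?php
// Sketch: produce an embed snippet for an article, YouTube-style, that keeps
// your styling and a link back to the source. URLs and sizes are placeholders.
function embedCodeFor($articleId)
{
    $embedUrl   = 'https://www.example.com/embed/article/' . (int) $articleId;
    $articleUrl = 'https://www.example.com/article/' . (int) $articleId;

    return '<iframe src="' . htmlspecialchars($embedUrl) . '" width="600" height="400" frameborder="0"></iframe>'
         . "\n"
         . '<p>Source: <a href="' . htmlspecialchars($articleUrl) . '">example.com</a></p>';
}

echo embedCodeFor(1234);
```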
You could always cache a version that uses a special string, like #!username!#, and then later fill it in with PHP based on which user is viewing it.
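For example, a minimal sketch of filling in the placeholder at request time; the cache path and the way you look up the viewer's id are assumptions.

```php
<?php
// Sketch: one cached copy of the article holds the placeholder #!username!#;
// swap in the viewing member's id when serving it. The cache path and the
// $userId lookup are placeholders -- in ExpressionEngine you would read the
// id from the member session instead.
$cached = file_get_contents('/var/cache/articles/1234.html');
$userId = 1234;

echo str_replace('#!username!#', 'userid_' . $userId, $cached);
```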
Another way, I believe, is to switch from caching on the server to letting the browser cache it locally for a little while. That way it is only cached per user, and it reduces the calls to your database. Because an article is pretty static, you could just let the local machine cache it and pull in comments via JavaScript.
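Something like the following, sent before the article body, is the kind of thing I mean; the five-minute lifetime is arbitrary.

```php
<?php
// Sketch: let each visitor's browser cache the article privately for a short
// while instead of caching it on the server. 'private' keeps shared proxies
// from storing the user-specific markup.
header('Cache-Control: private, max-age=300');                         // 5 minutes
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 300) . ' GMT');

$articleHtml = '<article class="userid_1234">...article body...</article>'; // user-specific markup
echo $articleHtml;   // comments would be pulled in separately via JavaScript
```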
This last one is probably not what you are really looking for, but I'm going to come out and say it anyway. You could stop treating your users like thieves and instead treat the thieves as thieves. Go to the person hosting the servers your content is on and send them an email telling them that copyrighted premium content is being hosted on their servers without your permission. You can even automate that process.
How do you find out what sites are posting your content? Put a link to your site in the body content, and do a Google Search/Blog Search for articles linking to that site. To automate it, use Google Blog Search, because it offers RSS feeds. Any result that has a link back to your site could go into a database with a link to the page; someone could review it, and if it is the entire article, do a Whois lookup and send them an email.
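A rough sketch of that automation; the feed URL is a placeholder for whatever RSS feed your blog search gives you for the query, and the "database" here is just an echo.

```php
<?php
// Sketch: poll a blog-search RSS feed and flag entries that still contain a
// link back to your site. The feed URL is a placeholder; writing to a real
// review table is left as an exercise.
$feedUrl = 'https://blogsearch.example.com/feed?q=link:www.example.com';
$mySite  = 'www.example.com';

$feed = simplexml_load_file($feedUrl);
if ($feed === false) {
    exit("Could not load feed\n");
}

foreach ($feed->channel->item as $item) {
    $page = (string) $item->link;
    $body = (string) $item->description;

    if (strpos($body, $mySite) !== false) {
        echo "Possible copy: $page\n";   // queue for manual review / Whois lookup
    }
}
```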
What makes you think adding CSS to something is going to stop people from copying it without that CSS? It's more likely that they are just copying the source of the content you are showing them and ignoring all the styling around it. For example, I use Tamper Data to look at all HTTP requests made by Firefox; if I can see it on the page, I can see it in the logs. Even with all the "protection" some sites try to put in place, it generally never works. I can grab what I want without using any screen capture/recording.
If you were serving FLVs, for example, I would easily be able to grab the source of those even if you overlaid them with some CSS. I think the best approach would be to contact the sites publishing your premium content and ask them to remove it. It's either that, or watermark the actual content on the fly while sending it to the browser.