I have a CodeIgniter application that I'm maintaining and all internal links follow a structure like:
website.com/name-of-application/page
However, the page is only served if the middle section follows a specific capitalization structure:
(e.g. website.com/Name-Of-application/page)
This is strange to me, because all of the files themselves are lowercase. Is there an easy way to redirect/rewrite URLs to follow this strange casing, in either IIS or CodeIgniter itself? Thanks so much.
We have a DSpace repository of research publications that the GSA is indexing via a web crawl, i.e. start at the homepage and follow all the links.
I'm thinking that using a connector to submit URLs for indexing from the sitemap.xml file might be more efficient. The GSA would then only need to index and recrawl the URLs on the sitemap and could ignore the rest of the site.
The suggestion from the GSA documentation is that this is not really a target for a connector, as the content can all be discovered by a web crawl.
What do you think?
Thanks,
Georgina.
This might be outdated (so I'm not sure if it still works), but there's an example of a Python connector that will parse a sitemap.xml and send it as a Content Feed or Metadata Feed.
Here are two links to help you:
https://github.com/google/gsa-admin-toolkit/blob/master/connectormanager/sitemap_connector.py
https://github.com/google/gsa-admin-toolkit/wiki/ConnectorManagerDocumentation
If anything, this will give you an idea of the logic to implement if you write your own Connector 3.x or Adaptor 4.x.
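If it helps, here's a rough Python sketch of just the sitemap-parsing step those links describe: fetch sitemap.xml, pull out the <loc> entries, and print the URLs you would submit in a feed. The sitemap location here is an assumption, so adjust it to your repository.

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "http://localhost:8080/sitemap.xml"  # assumed location of the sitemap
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

# every <url><loc>...</loc></url> entry is a candidate record for the feed
for loc in tree.findall(".//sm:loc", NS):
    print(loc.text.strip())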
You can generate sitemaps by running "dspace generate-sitemaps" from the /bin directory. It will generate a sitemaps directory with links to all the items in DSpace.
An output example:
<html><head><title>URL List</title></head><body><ul><li>http://localhost:8080//handle/123456789/1</li>
<li>http://localhost:8080//handle/123456789/2</li>
<li>http://localhost:8080//handle/123456789/3</li>
<li>http://localhost:8080//handle/123456789/5</li>
</ul></body></html>
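If you wanted to pull the URLs out of that generated HTML yourself (e.g. to hand them to the GSA as a feed), a quick sketch using Python's standard library could look like this; the file name is an assumption, so check what your sitemaps directory actually contains:

from html.parser import HTMLParser

class URLListParser(HTMLParser):
    # collects the text of every <li>, which in this file is a URL
    def __init__(self):
        super().__init__()
        self.in_li = False
        self.urls = []
    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False
    def handle_data(self, data):
        if self.in_li and data.strip():
            self.urls.append(data.strip())

parser = URLListParser()
with open("sitemaps/sitemap0.html") as f:  # assumed file name
    parser.feed(f.read())
for url in parser.urls:
    print(url)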
You could easily create a GSA "Feed" that lists the URLs that you want to crawl. However, since your "Follow" patterns must include the host name of your web site, the crawler is going to follow all the pages that are linked from the pages in your feed.
If you truly only want to index the items in your "Site Map" then you should probably look at writing an Adaptor (4.x). You would then be responsible for writing the logic to parse your sitemap.xml file to extract the URLs you want crawled.
I have a website that uses URL rewriting so the URLs end in .html
eg: mysite.com/about-us.html
I'm going to add a search feature which could have a number of different criteria. So my question... I know the following works OK, as I tried it:
mysite.com/search.html?var1=xxx&var3=xxx
Is there any reason why I shouldn't do this, since .html pages generally wouldn't have variables? I will test, but would there be any browser issues (old browsers, perhaps)? Any SEO disadvantages?
Thanks :)
Of course ".html"-files can contain variables.
It is not dependent of the Browser but the Server Configuration.
The Server respectively the php-parser must adjusted to parse .html files.
But I don't think that ".html" ending are relevant for google see:
https://webmasters.stackexchange.com/questions/5333/url-rewrite-should-i-write-a-fake-file-suffix-html-or-something-more-realis)
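For what it's worth, one common way to do that on Apache with mod_php is something like the snippet below; treat it as a sketch to verify, since the handler line differs if PHP runs through FPM or another setup:

<FilesMatch "\.html$">
    SetHandler application/x-httpd-php
</FilesMatch>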
We have an application with about 15,000 pages. For SEO reasons we had to change the URLs. Google had already crawled all of these pages, and because of the change we now see a lot of duplicate titles/meta descriptions in Webmaster Tools. Our impressions on Google have dropped and we believe this is the reason; correct me if my assumption is incorrect.

Now we are not able to write a regular expression for the change of URLs using a 301 redirect, because of the nature of the change. The only way to do it would be to write 301 redirects for individual URLs, which is not feasible for 10,000 URLs.

Can we use a robots meta tag with NOINDEX instead? My question basically is: if I add a NOINDEX meta tag, will Google remove the already indexed URLs? If not, what are the other ways to remove the old indexed URLs from Google? Another thing I could do is make all the previous pages return 404 errors to avoid the duplicates, but would that be the right thing to do?
Now we are not able to write a regular expression for the change of URLs using a 301 redirect, because of the nature of the change. The only way to do it would be to write 301 redirects for individual URLs, which is not feasible for 10,000 URLs.
Of course you can! I'm rewriting more than 15000 URLs with mod_rewrite and RewriteMap!
This is just a matter of scripting (echoing out all the URLs) and mastering vim, but you can do it, and easily. If you need more information, just ask.
What you can do is a RewriteMap file like this:
/baskinrobbins/branch/branch1/ /baskinrobbins/branch/Florida/Jacksonville/branch1
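And then, in the main server or vhost config (RewriteMap isn't allowed in .htaccess), wire the map up roughly like this; the map path is hypothetical:

RewriteEngine On
RewriteMap redirectmap "txt:/etc/apache2/redirects.map"

# if the requested path has an entry in the map, send a 301 to the new URL
RewriteCond ${redirectmap:$1} !=""
RewriteRule ^(.*)$ ${redirectmap:$1} [R=301,L]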
I've made a huge answer here and you can very easily adapt it to your needs.
I could do that job in 1-2 hours max but I'm expensive ;).
Reindexing is slow
It would take weeks for Google to ignore the older URLs anyway.
Use Htaccess 301 redirects
You can add a file on your Apache server, called .htaccess, that is able to list all the old URLs and the new URLs and have the user instantly redirected to the new page. Can you generate such a text file? I'm sure you can loop through the sections in your app or whatever and generate a list of URLs.
Use the following syntax.
Redirect 301 /oldpage.html http://www.yoursite.com/newpage.html
Redirect 301 /oldpage2.html http://www.yoursite.com/folder/
This prevents the 404 File Not Found errors, and is better than a meta refresh or redirect tag because the old page is not even served to clients.
I did this for a website that had gone through a recent upgrade, and since Google kept pointing to the older files, we needed to redirect clients to the new content instead.
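As a rough sketch of that "generate such a text file" step (both file names here are assumptions), a few lines of Python can turn an exported old-URL/new-URL mapping into the redirect lines:

import csv

# url_mapping.csv holds one "old_path,new_url" pair per line
with open("url_mapping.csv", newline="") as src, open("redirects.txt", "w") as out:
    for old_path, new_url in csv.reader(src):
        out.write(f"Redirect 301 {old_path} {new_url}\n")

Then paste (or include) the generated lines into your .htaccess file.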
Where's .htaccess?
Go to your site's root folder, and create/download the .htaccess file to your local computer and edit it with a plain-text editor (e.g. Notepad). If you are using FTP client software and you don't see any .htaccess file on your server, make sure you are viewing invisible/system files.
When I look at Amazon.com and I see their URL for pages, it does not have .htm, .html or .php at the end of the URL.
It is like:
http://www.amazon.com/books-used-books-textbooks/b/ref=topnav_storetab_b?ie=UTF8&node=283155
Why and how? What kind of extension is that?
Your browser doesn't care about the extension of the file, only the content type that the server reports. (Well, unless you use IE because at Microsoft they think they know more about what you're serving up than you do). If your server reports that the content being served up is Content-Type: text/html, then your browser is supposed to treat it like it's HTML no matter what the file name is.
Typically, it's implemented using a URL rewriting scheme of some description. The basic notion is that the web should be moving to addressing resources with proper URIs, not classic old URLs which leak implementation detail, and which are vulnerable to future changes as a result.
A thorough discussion of the topic can be found in Tim Berners-Lee's article Cool URIs Don't Change, which argues in favour of reducing the irrelevant cruft in URIs as a means of helping to avoid the problems that occur when implementations do change, and when resources do move to a different URL. The article itself contains good general advice on planning out a URI scheme, and is well worth a read.
More specifically than most of these answers:
Browsers don't use the file extension to determine what kind of file is being served (unless you're using Internet Explorer). Instead, they use the Content-type HTTP header, which is sent down the wire before the content of the image, HTML page, download, or whatever. For example:
Content-type: text/html
denotes that the page you are viewing should be interpreted as HTML, and
Content-type: image/png
denotes that the page is a PNG image.
Web servers often use the file extension to decide what Content-type to assign when a file is served directly from disk, but web applications can generate pages with any Content-type they like in response to a request. No matter the filename's structure or extension, as long as the actual content of the page matches the declared Content-type, the data renders as intended.
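As a toy illustration of that last point (a sketch, nothing like Amazon's actual stack), here's a tiny Python server that serves HTML at an extensionless URL purely by declaring the Content-type:

from http.server import BaseHTTPRequestHandler, HTTPServer

class ExtensionlessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body><h1>Served without a .html extension</h1></body></html>"
        self.send_response(200)
        # this header, not anything in the URL, tells the browser it is HTML
        self.send_header("Content-type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # visit any path, e.g. http://localhost:8000/books-used-books-textbooks/b
    HTTPServer(("localhost", 8000), ExtensionlessHandler).serve_forever()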
For websites that use Apache, they are probably using mod_rewrite, which enables them to rewrite URLs (and make them more user- and SEO-friendly).
You can read more here http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html
and here http://www.sitepoint.com/article/apache-mod_rewrite-examples/
EDIT: There are rewriting modules for IIS as well.
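A hypothetical rule in that spirit, reusing the path and parameter from the question (the script name is made up):

RewriteEngine On
# hand the "clean" URL to the script that actually builds the page
RewriteRule ^books-used-books-textbooks/b/?$ /store.php?node=283155 [L,QSA]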
Traditionally the file extension represents the file that is being served.
For example
http://someserver/somepath/image.jpg
Later, that same approach was used to allow a script to process parameters:
http://someserver/somepath/script.php?param=1234&other=7890
In this case the file was a PHP script that processed the "request" and produced a dynamically created page.
Nowadays, applications are much more complex than that (for example Amazon, which you mentioned).
There is no longer a single script that handles the request (but a much more complex app with several files/methods/functions/objects, etc.), and the URL is more like the entry point for a web application (it may have a script behind it, but that's another thing). So web apps like Amazon, and yes Stack Overflow, don't show a file in the URL; anything coming in is processed by the app on the server side.
As a live example, take this question's own URL, /questions/322747/websites-urls-without-file-extension: here "questions" represents the webapp and 322747 the parameter.
I hope this little explanation helps you better understand all the other answers.
Well, how about having an index.html file in the directory and then typing the path into the browser? I see that my Firefox and IE7 both put the trailing slash in automatically; I don't have to type it. This is more suited to people like me who do not think every single URL on earth should invoke PHP, Perl, CGI and 10,000 other applications just in order to send a few kilobytes of data.
A lot of people are using a more "RESTful" type of architecture... or at least REST-looking URLs.
This site (Stack Overflow) doesn't show a file extension... it's using ASP.NET MVC.
Depending on the settings of your server you can use (or not use) any extension you want. You could even set the extension to be ".JamesRocks", but it won't be very helpful :)
Anyway, just in case you're new to web programming: all that gibberish at the end is the arguments to a GET request, not the page's extension.
A number of posts have mentioned this, and I'll weigh in. It absolutely is a URL rewriting system, and a number of platforms have ways to implement this.
I've worked for a few larger ecommerce sites, and it is now a very important part of the web presence, and offers a number of advantages.
I would recommend taking the technology you want to work with and researching samples of the URL rewriting mechanism for that platform. For .NET, for example, google 'asp.net url rewriting', or use an add-on framework like MVC, which provides this functionality out of the box.
In Django (a web application framework for Python), you design the URLs yourself, independent of any file name, or even any path on the server for that matter.
You just say something like "I want /news/<number>/ URLs to be handled by this function".
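For example, a minimal sketch of a Django URL configuration in that spirit, using the current django.urls.path syntax (the view and parameter names are made up):

# urls.py -- assumes a views module with a news_detail(request, article_id) view
from django.urls import path

from . import views

urlpatterns = [
    # "/news/<number>/" is handled by views.news_detail; no file name or extension involved
    path("news/<int:article_id>/", views.news_detail, name="news-detail"),
]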