I'm making a crawler that fetches all relative and absolute links. But if a relative URL is incorrect, and the website answers incorrect URLs with a 200 response code, the crawler keeps building new absolute URLs from it.
Let's say there is a relative link, "example/example.php", on a page I crawl at http://example.com/example.com. When I find that page, I append the link and create a new URL to crawl, i.e. http://example.com/example/example.php. The problem is that this page will again contain "example/example.php", which then appends to http://example.com/example/example/example.php, and so on.
Is there a better way of getting rid of this other than content comparison?
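One common approach (shown below as a rough Ruby sketch with hypothetical names, not a drop-in fix) is to resolve every relative link against the URL of the page it was found on and to keep a set of normalized URLs that have already been visited, so the same path is never queued twice:
require 'set'
require 'uri'

visited = Set.new                      # normalized URLs already crawled
queue   = ['http://example.com/']      # hypothetical seed URL

until queue.empty?
  page_url   = queue.shift
  normalized = URI(page_url).normalize.to_s
  next if visited.include?(normalized)
  visited << normalized

  links_on_page = []                   # stand-in for the hrefs parsed out of the fetched page
  links_on_page.each do |href|
    # resolve "example/example.php" against the page it appeared on, per RFC 3986
    queue << URI.join(page_url, href).to_s
  end
end
A maximum crawl depth or a limit on repeated path segments is still worth adding, because a site that answers every path with 200 can generate infinitely many "new" URLs that the visited set alone will never catch.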
In my Laravel 9 project, when I have more than 4 '/' in my route, the page doesn't load its assets properly, because the asset URLs include the first keyword of my defined route.
For example, if I define a GET route:
example.com/word1/word2/word3/word4/word5
In this case, all my other related links break, such as my image links where I've used route('/images/..'):
The application loads the link example.com/word1/images/... instead of example.com/images/...
I used '/' to solve this.
When you use a relative URL (e.g. images/example.jpg, with no leading slash), the browser will interpret it as being relative to the current page's URL. So if you're on a page with a URL like example.com/word1/word2/word3/word4/word5, then the relative URL images/example.jpg will be interpreted as example.com/word1/word2/word3/word4/images/example.jpg.
To avoid this issue, you can use absolute URLs instead of relative URLs. An absolute URL includes the full URL, including the protocol (e.g. https://) and the domain name. In your case, you can use the url() helper function to generate absolute URLs for your assets, like this:
<img src="{{ url('/images/example.jpg') }}" alt="Example">
This will generate an absolute URL that includes the domain name and the path to the asset, regardless of the current page's URL.
There are many HTTP request tools in Ruby: HTTParty, rest-client, etc. But most of them only get the page itself. Is there a tool that gets the HTML, JavaScript, CSS and images of a page, just like a browser does?
Anemone comes to mind, but it's not designed to do a single page. It's capable if you have the time to set it up though.
It's not hard to retrieve the content of a page using something like Nokogiri, which is an HTML parser. You can iterate over the tags that are of interest, grab their "src" or "href" attributes and request those files, storing their content on disk.
A simple, untested and written-on-the-fly example using Nokogiri and OpenURI would be:
require 'nokogiri'
require 'open-uri'

url = 'http://www.example.com'
html = URI.open(url).read                 # open-uri's URI.open fetches the page
File.write('www.example.com.html', html)  # save the raw HTML

page = Nokogiri::HTML(html)
page.search('img').each do |img|
  src = URI.join(url, img['src']).to_s    # resolve relative src values against the page URL
  File.open(File.basename(src), 'wb') { |fo| fo.write URI.open(src).read }
end
Getting CSS and JavaScript is a bit more difficult because you have to determine whether they are embedded in the page or are external resources that need to be retrieved from their sources.
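As a rough illustration (untested, using the same Nokogiri and OpenURI assumptions as the snippet above), external stylesheets and scripts can be told apart from inline ones by whether the tag carries an href or src attribute:
require 'nokogiri'
require 'open-uri'

url  = 'http://www.example.com'
page = Nokogiri::HTML(URI.open(url).read)

# External stylesheets: <link rel="stylesheet" href="...">
css_urls = page.search('link[rel="stylesheet"]').map { |link| URI.join(url, link['href']).to_s }

# External scripts carry a src attribute; inline <script> blocks do not
js_urls = page.search('script[src]').map { |script| URI.join(url, script['src']).to_s }

(css_urls + js_urls).each do |resource|
  File.open(File.basename(URI(resource).path), 'wb') { |fo| fo.write URI.open(resource).read }
end
Inline <style> and <script> blocks need no extra requests; their content is already in the saved HTML.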
Merely downloading the HTML and content is easy. Creating a version of the page that is stand-alone and reads the content from your local cache is much more difficult. You have to rewrite all the "src" and "href" attributes to point to the files on your disk.
If you want to be able to locally cache a site, it's even worse, because you have to re-jigger all the anchors and links in the pages to point to the local cache. In addition you have to write a full site spider which is smart enough to stay within a site, not follow redundant links, obey a site's ROBOTS file, and not consume all your, or their, bandwidth and get you banned, or sued.
As the task grows you also have to consider how you are going to organize all the files. Storing one page's resources in one folder is sloppy, but the easy way to do it. Storing resources for two pages in one folder becomes a problem because you can have filename collisions for different images or scripts or CSS. At that point you have to use multiple folders, or switch to using a database to track the locations of the resources, and rename them with unique identifiers, and rewrite those back to your saved HTML, or write an app that can resolve those requests and return the correct content.
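If you do end up storing everything in one place, one simple way to avoid filename collisions (a sketch of the "unique identifiers" idea, with hypothetical names) is to derive each local name from the resource's full URL, for example with a digest:
require 'digest'
require 'uri'

# Local filename that is unique per resource URL, keeping the original extension
def local_name_for(resource_url)
  ext = File.extname(URI(resource_url).path)
  Digest::SHA256.hexdigest(resource_url)[0, 16] + ext
end

# Two different "file.js" resources from different pages get different local names
local_name_for('http://www.example.com/section/a/js/file.js')
local_name_for('http://www.example.com/section/b/js/file.js')
The mapping from URL to local name still has to be written back into the saved HTML, as described above.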
I'm new to mod_rewrite. And SEO.
I wanted to create a RewriteRule which essentially converts the following request:
http://xyz.com/property/state/city/name/propertyid/
into
http://xyz.com/property/?id=propertyid
This is what I used:
RewriteRule ^property/([^/]+)/([^/]+)/([^/]+)/([1-9][0-9]*)/$ /property/?id=$4 [NC]
As you can see, I don't consider the 3 preceding parameters; the id alone is sufficient to display the right page.
Now what I'm wondering is - how would a search engine know the 'desired' link to the property?
In other words, if this page were to be indexed, what link would it have in the search results?
(or does this depend on which link I spread around?)
Thanks.
Search engine crawlers can only fetch resources whose URLs they know. So in order to have some resource crawled, the crawler needs to know its URL. This happens primarily through links on other web pages or through URL submission.
Now if you're linking to /property/state/city/name/propertyid/, crawlers will request that URL. Your server will then rewrite that URL internally to /property/?id=propertyid and return its contents to the crawler. That's it.
Unless you’re also linking to /property/?id=propertyid somewhere, crawlers won’t notice that /property/state/city/name/propertyid/ is actually mapped onto /property/?id=propertyid.
What search engines will do with the URL and the contents of the resource is a different story.
I'm programming a website with SEO-friendly links, i.e., putting the page title or other descriptive text in the link, separated by slashes. For example: http://www.domain.com/section/page-title-bla-bla-bla/.
I redirect the request to the main script with mod_rewrite, but links in script, img and link tags are not resolved correctly. For example: assuming you are visiting the above link, the tags request the file at the URL http://www.domain.com/section/page-title-bla-bla-bla/js/file.js, but the file is actually at http://www.domain.com/js/file.js
I do not want to use a variable or constant in all HTML file URLs.
I'm trying to redirect client requests either to one directory of the server or to another. Is it possible to distinguish the first request for a page from the ones that come after it? Is this possible with mod_rewrite for Apache, or with PHP?
I hope I explained well:)
Thanks in advance.
Using rewrite rules to fix the problem of relative paths is unwise and has numerous downsides.
Firstly, it makes things more difficult to maintain because there are hundreds of different links in your system.
Secondly and more seriously, you destroy cacheability. A resource requested from here:
http://www.domain.com/section/page-title-bla-bla-bla/js/file.js
will be regarded as a different resource from
http://www.domain.com/section/some-other-page-title/js/file.js
and loaded twice, multiplying the number of requests for what is really a single file.
What to do?
Fix the root cause of the problem instead: Use absolute paths
<script src="/js/file.js">
or a constant, or if all else fails the <base> tag.
This is an issue of resolving relative URIs. Judging by your description, it seems that you reference the other resources using relative URI paths: in /section/page-title-bla-bla-bla/, a URI reference like js/file.js or ./js/file.js is resolved to /section/page-title-bla-bla-bla/js/file.js.
To always reference /js/file.js independently of the actual base URI path, use the absolute path /js/file.js. Another solution would be to set the base URI explicitly to / using the BASE element (but note that this will affect all relative URIs).
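To see both resolutions side by side, here is a quick check with Ruby's standard URI library (purely illustrative, not part of the site):
require 'uri'

base = 'http://www.domain.com/section/page-title-bla-bla-bla/'

# path-relative reference: resolved against the current "directory"
URI.join(base, 'js/file.js').to_s   # => "http://www.domain.com/section/page-title-bla-bla-bla/js/file.js"

# absolute path reference: resolved against the site root
URI.join(base, '/js/file.js').to_s  # => "http://www.domain.com/js/file.js"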
I have read a lot about URL rewriting but I still don't get it.
I understand that a URL like
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=19
can be replaced with a friendlier one like
http://www.example.com/Blog/2006/12/19/
and the server code can remain unchanged because there is some filter which transforms the new URL and maps it to the old one, but does it also replace the URLs in the HTML of the response?
If the server code remains unchanged then it is possible that in my returned HTML code I have links like:
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=20
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=21
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=22
This defeats the purpose of having the nice URLs if in my page I still have the old ones.
Does URL rewriting (with a filter or something) also replace this content in the HTML?
Put another way... do the rewrite rules apply for the incoming request as well as the HTML content of the response?
Thank you!
The URL rewriter simply takes the incoming URL and if it matches a certain pattern it converts it to a URL that the server understands (assuming your rewrite rules are correct).
It does mean that a specific resource can be accessed multiple ways, but this does not "defeat the point", as the point is to have nice looking URLs, which you still do.
Rewrite rules do not rewrite the outgoing content, only the incoming URL.
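So if you want the friendly URLs to show up inside your pages, your application has to emit them itself; for example, a hypothetical helper (sketched in Ruby just to illustrate the idea) could build them for the blog links:
# Hypothetical helper: the application builds the friendly form of the link itself
def blog_post_path(year, month, day)
  format('/Blog/%04d/%02d/%02d/', year, month, day)
end

blog_post_path(2006, 12, 20)  # => "/Blog/2006/12/20/"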