More than one top level domain? - domain-name

In a normal URL you have a protocol, optional subdomains, a domain name, a top-level domain and subdirectories.
For example: http://www.google.com/path. Here www is the subdomain, google is the domain name, com is the TLD and path is the subdirectory. Parsing this is a simple programming task.
But the problem comes when there appears to be more than one TLD. For example: www.google.co.in/path. Here co.in is the TLD. But I see that a website named www.co.in also exists.
My doubts are:
How many top-level domains can a URL have? How do I find the top-level domain in a URL if there can be multiple TLDs?
In the above example google.co.in is not a subdomain of co.in, so how come www.co.in resolves to a different website than google.co.in?

If I had to write an algorithm that decides that "www.co.in" belongs to the India top-level domain (TLD) and "www.google.co.in" belongs to an India second-level domain (SLD), I would go here and grab the list:
https://wiki.mozilla.org/TLD_List
Then I would process my URL like this (a rough sketch of these steps follows the list):
Compare the last part of the URL to all TLDs in the list and find a matching one. [www.google.co.in -> in, www.co.in -> in]
If no TLD was found, the URL is invalid.
If a TLD was found and the URL has three parts or fewer, return the TLD as the result and exit.
If a TLD was found and the URL has more than three parts, do a second search in the list of SLDs, comparing the end of the URL against the pattern ".SLD.TLD".
If no entry was found, return the TLD as the result and exit.
If an entry was found, return SLD.TLD as the result and exit.
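A minimal sketch of those steps in Python (my own illustration, not part of the original answer); the two suffix sets below are tiny placeholders that would in practice be filled from the Mozilla list:

# Placeholder suffix sets; in practice, build them from the Mozilla TLD/SLD list.
TLDS = {"com", "in", "uk", "cc"}
SLDS = {"co.in", "co.uk", "ac.in", "co.cc"}

def registry_suffix(hostname):
    """Return the TLD or SLD.TLD the hostname falls under, or None if invalid."""
    parts = hostname.lower().rstrip(".").split(".")
    tld = parts[-1]
    if tld not in TLDS:
        return None                      # no TLD found: the URL is invalid
    if len(parts) <= 3:
        return tld                       # e.g. www.co.in -> "in"
    sld = ".".join(parts[-2:])           # e.g. www.google.co.in -> "co.in"
    return sld if sld in SLDS else tld

print(registry_suffix("www.google.co.in"))   # co.in
print(registry_suffix("www.co.in"))          # in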

A very slow yet comprehensive regex you could use (sourced from Wikipedia and Mozilla); a small usage sketch follows the pattern:
[a-z0-9-]{1,63}(.ab.ca|.bc.ca|.mb.ca|.nb.ca|.nf.ca|.nl.ca|.ns.ca|.nt.ca|.nu.ca|.on.ca|.pe.ca|.qc.ca|.sk.ca|.yk.ca|.co.cc|.com.cd|.net.cd|.org.cd|.co.ck|.ac.cn|.com.cn|.edu.cn|.gov.cn|.net.cn|.org.cn|.ah.cn|.bj.cn|.cq.cn|.fj.cn|.gd.cn|.gs.cn|.gz.cn|.gx.cn|.ha.cn|.hb.cn|.he.cn|.hi.cn|.hl.cn|.hn.cn|.jl.cn|.js.cn|.jx.cn|.ln.cn|.nm.cn|.nx.cn|.qh.cn|.sc.cn|.sd.cn|.sh.cn|.sn.cn|.sx.cn|.tj.cn|.xj.cn|.xz.cn|.yn.cn|.zj.cn|.us.com|.com.cu|.edu.cu|.org.cu|.net.cu|.gov.cu|.inf.cu|.gov.cx|.com.dz|.org.dz|.net.dz|.gov.dz|.edu.dz|.asso.dz|.pol.dz|.art.dz|.com.ec|.info.ec|.net.ec|.fin.ec|.med.ec|.pro.ec|.org.ec|.edu.ec|.gov.ec|.mil.ec|.com.ee|.org.ee|.fie.ee|.pri.ee|.com.es|.nom.es|.org.es|.gob.es|.edu.es|.aland.fi|.tm.fr|.asso.fr|.nom.fr|.prd.fr|.presse.fr|.com.fr|.gouv.fr|.com.ge|.edu.ge|.gov.ge|.org.ge|.mil.ge|.net.ge|.pvt.ge|.co.gg|.net.gg|.org.gg|.com.gi|.ltd.gi|.gov.gi|.mod.gi|.edu.gi|.org.gi|.com.gp|.net.gp|.edu.gp|.asso.gp|.org.gp|.com.gr|.edu.gr|.net.gr|.org.gr|.gov.gr|.com.hk|.edu.hk|.gov.hk|.idv.hk|.net.hk|.org.hk|.com.hn|.edu.hn|.org.hn|.net.hn|.mil.hn|.gob.hn|.iz.hr|.from.hr|.name.hr|.com.hr|.com.ht|.net.ht|.firm.ht|.shop.ht|.info.ht|.pro.ht|.adult.ht|.org.ht|.art.ht|.pol.ht|.rel.ht|.asso.ht|.perso.ht|.coop.ht|.med.ht|.edu.ht|.gouv.ht|.gov.ie|.co.in|.firm.in|.net.in|.org.in|.gen.in|.ind.in|.nic.in|.ac.in|.edu.in|.res.in|.gov.in|.mil.in|.ac.ir|.co.ir|.gov.ir|.net.ir|.org.ir|.sch.ir|.co.je|.net.je|.org.je|.com.jo|.org.jo|.net.jo|.edu.jo|.gov.jo|.mil.jo|.co.kr|.or.kr|.edu.ky|.gov.ky|.com.ky|.org.ky|.net.ky|.gov.lk|.sch.lk|.net.lk|.int.lk|.com.lk|.org.lk|.edu.lk|.ngo.lk|.soc.lk|.web.lk|.ltd.lk|.assn.lk|.grp.lk|.hotel.lk|.gov.lt|.mil.lt|.gov.lu|.mil.lu|.org.lu|.net.lu|.com.lv|.edu.lv|.gov.lv|.org.lv|.mil.lv|.id.lv|.net.lv|.asn.lv|.conf.lv|.com.ly|.net.ly|.gov.ly|.plc.ly|.edu.ly|.sch.ly|.med.ly|.org.ly|.id.ly|.co.ma|.net.ma|.gov.ma|.org.ma|.tm.mc|.asso.mc|.org.mg|.nom.mg|.gov.mg|.prd.mg|.tm.mg|.com.mg|.edu.mg|.mil.mg|.com.mk|.org.mk|.com.mo|.net.mo|.org.mo|.edu.mo|.gov.mo|.org.mt|.com.mt|.gov.mt|.edu.mt|.net.mt|.com.mu|.co.mu|.gov.nr|.edu.nr|.biz.nr|.info.nr|.com.nr|.net.nr|.com.pf|.org.pf|.edu.pf|.com.ph|.gov.ph|.com.pk|.net.pk|.edu.pk|.org.pk|.fam.pk|.biz.pk|.web.pk|.gov.pk|.gob.pk|.gok.pk|.gon.pk|.gop.pk|.gos.pk|.com.pl|.biz.pl|.net.pl|.art.pl|.edu.pl|.org.pl|.ngo.pl|.gov.pl|.info.pl|.mil.pl|.waw.pl|.warszawa.pl|.wroc.pl|.wroclaw.pl|.krakow.pl|.poznan.pl|.lodz.pl|.gda.pl|.gdansk.pl|.slupsk.pl|.szczecin.pl|.lublin.pl|.bialystok.pl|.olsztyn.pl.torun.pl|.biz.pr|.com.pr|.edu.pr|.gov.pr|.info.pr|.isla.pr|.name.pr|.net.pr|.org.pr|.pro.pr|.edu.ps|.gov.ps|.sec.ps|.plo.ps|.com.ps|.org.ps|.net.ps|.com.pt|.edu.pt|.gov.pt|.int.pt|.net.pt|.nome.pt|.org.pt|.publ.pt|.com.ro|.org.ro|.tm.ro|.nt.ro|.nom.ro|.info.ro|.rec.ro|.arts.ro|.firm.ro|.store.ro|.www.ro|.com.ru|.net.ru|.org.ru|.pp.ru|.msk.ru|.int.ru|.ac.ru|.gov.rw|.net.rw|.edu.rw|.ac.rw|.com.rw|.co.rw|.int.rw|.mil.rw|.gouv.rw|.com.sc|.gov.sc|.net.sc|.org.sc|.edu.sc|.com.sd|.net.sd|.org.sd|.edu.sd|.med.sd|.tv.sd|.gov.sd|.info.sd|.org.se|.pp.se|.tm.se|.brand.se|.parti.se|.press.se|.komforb.se|.kommunalforbund.se|.komvux.se|.lanarb.se|.lanbib.se|.naturbruksgymn.se|.sshn.se|.fhv.se|.fhsk.se|.fh.se|.mil.se|.ab.se|.c.se|.d.se|.e.se|.f.se|.g.se|.h.se|.i.se|.k.se|.m.se|.n.se|.o.se|.s.se|.t.se|.u.se|.w.se|.x.se|.y.se|.z.se|.ac.se|.bd.se|.com.sg|.net.sg|.org.sg|.gov.sg|.edu.sg|.per.sg|.idn.sg|.ac.tj|.biz.tj|.com.tj|.co.tj|.edu.tj|.int.tj|.name.tj|.net.tj|.org.tj|.web.tj|.gov.tj|.go.tj|.mil.tj|.gov.to|.gov.tp|.co.tt|.com.tt|.org.tt|.net.tt|.biz.tt|.info.tt|.pro.tt|.name.
tt|.edu.tt|.gov.tt|.gov.tv|.edu.tw|.gov.tw|.mil.tw|.com.tw|.net.tw|.org.tw|.idv.tw|.game.tw|.ebiz.tw|.club.tw|.com.ua|.gov.ua|.net.ua|.edu.ua|.org.ua|.cherkassy.ua|.ck.ua|.chernigov.ua|.cn.ua|.chernovtsy.ua|.cv.ua|.crimea.ua|.dnepropetrovsk.ua|.dp.ua|.donetsk.ua|.dn.ua|.ivano-frankivsk.ua|.if.ua|.kharkov.ua|.kh.ua|.kherson.ua|.ks.ua|.khmelnitskiy.ua|.km.ua|.kiev.ua|.kv.ua|.kirovograd.ua|.kr.ua|.lugansk.ua|.lg.ua|.lutsk.ua|.lviv.ua|.nikolaev.ua|.mk.ua|.odessa.ua|.od.ua|.poltava.ua|.pl.ua|.rovno.ua|.rv.ua|.sebastopol.ua|.sumy.ua|.ternopil.ua|.te.ua|.uzhgorod.ua|.vinnica.ua|.vn.ua|.zaporizhzhe.ua|.zp.ua|.zhitomir.ua|.zt.ua|.co.ug|.ac.ug|.sc.ug|.go.ug|.ne.ug|.or.ug|.ak.us|.al.us|.ar.us|.az.us|.ca.us|.co.us|.ct.us|.dc.us|.de.us|.dni.us|.fed.us|.fl.us|.ga.us|.hi.us|.ia.us|.id.us|.il.us|.in.us|.isa.us|.kids.us|.ks.us|.ky.us|.la.us|.ma.us|.md.us|.me.us|.mi.us|.mn.us|.mo.us|.ms.us|.mt.us|.nc.us|.nd.us|.ne.us|.nh.us|.nj.us|.nm.us|.nsn.us|.nv.us|.ny.us|.oh.us|.ok.us|.or.us|.pa.us|.ri.us|.sc.us|.sd.us|.tn.us|.tx.us|.ut.us|.vt.us|.va.us|.wa.us|.wi.us|.wv.us|.wy.us|.com.vi|.org.vi|.edu.vi|.gov.vi|.com.vn|.net.vn|.org.vn|.edu.vn|.gov.vn|.int.vn|.ac.vn|.biz.vn|.info.vn|.name.vn|.pro.vn|.health.vn|.com|.org|.net|.int|.edu|.gov|.mil|.arpa|.ac|.ad|.ae|.af|.ag|.ai|.al|.am|.an|.ao|.aq|.ar|.as|.at|.au|.aw|.ax|.az|.ba|.bb|.bd|.be|.bf|.bg|.bh|.bi|.bj|.bm|.bn|.bo|.br|.bs|.bt|.bw|.by|.bz|.ca|.cc|.cd|.cf|.cg|.ch|.ci|.ck|.cl|.cm|.cn|.co|.cr|.cu|.cv|.cw|.cx|.cy|.cz|.de|.dj|.dk|.dm|.do|.dz|.ec|.ee|.eg|.es|.et|.eu|.fi|.fj|.fk|.fm|.fo|.fr|.ga|.gd|.ge|.gf|.gg|.gh|.gi|.gl|.gm|.gn|.gp|.gq|.gr|.gs|.gt|.gu|.gw|.gy|.hk|.hm|.hn|.hr|.ht|.hu|.id|.ie|.il|.im|.in|.io|.iq|.ir|.is|.it|.je|.jm|.jo|.jp|.ke|.kg|.kh|.ki|.km|.kn|.kp|.kr|.kw|.ky|.kz|.la|.lb|.lc|.li|.lk|.lr|.ls|.lt|.lu|.lv|.ly|.ma|.mc|.md|.me|.mg|.mh|.mk|.ml|.mm|.mn|.mo|.mp|.mq|.mr|.ms|.mt|.mu|.mv|.mw|.mx|.my|.mz|.na|.nc|.ne|.nf|.ng|.ni|.nl|.no|.np|.nr|.nu|.nz|.om|.pa|.pe|.pf|.pg|.ph|.pk|.pl|.pm|.pn|.pr|.ps|.pt|.pw|.py|.qa|.re|.ro|.rs|.ru|.rw|.sa|.sb|.sc|.sd|.se|.sg|.sh|.si|.sk|.sl|.sm|.sn|.so|.sr|.ss|.st|.su|.sv|.sx|.sy|.sz|.tc|.td|.tf|.tg|.th|.tj|.tk|.tl|.tm|.tn|.to|.tr|.tt|.tv|.tw|.tz|.ua|.ug|.us|.uy|.uz|.va|.vc|.ve|.vg|.vi|.vn|.vu|.wf|.ws|.ye|.yt|.za|.zm|.zw|.dz|.am|.bh|.bd|.by|.bg|.cn|.cn|.eg|.eu|.ge|.gr|.hk|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.ir|.iq|.jo|.kz|.mo|.mo|.my|.mr|.mn|.ma|.mk|.om|.pk|.ps|.qa|.ru|.sa|.rs|.sg|.sg|.kr|.lk|.lk|.sd|.sy|.tw|.tw|.th|.tn|.ua|.ae|.ye|.academy|.accountant|.adult|.aero|.africa|.agency|.apartments|.app|.archi|.associates|.audio|.auto|.bar|.bargains|.bible|.bike|.biz|.black|.blackfriday|.blog|.blue|.builders|.cam|.cam|.camera|.camp|.cancerresearch|.car|.cards|.cars|.center|.cheap|.christmas|.church|.click|.clothing|.cloud|.club|.codes|.coffee|.college|.coop|.country|.dance|.date|.dating|.design|.dev|.diet|.directory|.download|.eco|.education|.email|.events|.exchange|.exposed|.faith|.farm|.flowers|.game|.gdn|.gift|.glass|.global|.gop|.green|.guitars|.guru|.help|.hiphop|.hiv|.holdings|.hosting|.house|.info|.ink|.international|.jobs|.kim|.land|.lgbt|.life|.lighting|.link|.live|.loan|.lol|.love|.map|.market|.med|.meet|.menu|.mobi|.moe|.mom|.movie|.museum|.music|.name|.new|.NGO_and_.ONG|.org_(top-level_domain)|.one|.one|.onl|.ooo|.organic|.pharmacy|.photo|.photos|.pics|.pink|.pizza|.plumbing|.porn|.post|.pro|.properties|.property|.realtor|.rich|.rocks|.sale|.science|.sex|.sexy|.shop|.singles|.social|.solar|.stream|.sucks|.support|.tattoo|.tel|.today|.top|.travel|.ventures|.video|.voting|.wedding|.wiki|.win|.work|.wtf|.x
xx|.XYZ|.kaufen|.desi|.shiksha|.moda|.futbol|.juegos|.uno|.africa|.asia|.krd|.taipei|.tokyo|.alsace|.amsterdam|.bcn|.barcelona|.berlin|.brussels|.bzh|.cat|.cymru|.eus|.frl|.gal|.gent|.irish|.istanbul|.istanbul|.london|.paris|.saarland|.scot|.swiss|.wales|.wien|.miami|.nyc|.quebec|.vegas|.kiwi|.melbourne|.sydney|.lat|.rio|.ru|.aaa|.abb|.aeg|.afl|.aig|.airtel|.bbc|.bentley|.example|.invalid|.local|.localhost|.onion|.testa)$
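If you go the regex route, here is a rough sketch of how it could be applied in Python. The pattern below is a heavily truncated stand-in for the full expression above, purely for illustration; note that the dots in the original pattern are unescaped, so strictly speaking they match any character and should be escaped as shown here:

import re

# Truncated stand-in for the full suffix alternation above.
SUFFIX_RE = re.compile(r"[a-z0-9-]{1,63}(\.co\.in|\.co\.uk|\.com|\.in|\.uk)$")

for host in ("www.google.co.in", "www.co.in", "example.com"):
    match = SUFFIX_RE.search(host)
    print(host, "->", match.group(1) if match else "no match")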

Related

Codeigniter search?

This project is something like a social networking site built on CodeIgniter.
My default controller is MyController.php (which loads the login page), and say my domain is aravind.com.
In my application each user has a unique id.
My requirement is that a particular user's page should open when the unique_id of the user is given immediately after the domain name,
i.e. aravind.com/123 should open the page of the user whose unique id is 123.
I know this can be achieved by placing a controller and a function between the domain and the unique_id (like aravind.com/Search_class/search_func/123),
but by doing so the URL of a particular user becomes lengthy, which is not acceptable.
So I need a way to solve this issue.
Note:
I have other classes and folders in my controllers package, so the required solution should be able to
differentiate my classes and folders in the controllers package from the unique_id.
To help differentiate, assume the class and folder names are continuous strings,
whereas a unique_id starts with an underscore (_).
The same thing is done on Facebook: if we type facebook.com/aravind.pillai.983 it opens my account,
whereas a URL like facebook.com/settings opens the settings page for the user who has an active session.
Set $route['404_override'] to one of your custom classes (say Search.php).
Then whenever a user enters an invalid URL, the request will be routed to Search.php.
There, $this->uri->segment(1) gives you the value of the first URI segment,
and you can write logic to identify whether the user entered a valid unique_id or not.
Now you have the URI value in one class (Search.php).
You should use routing, where you can set any routes you want. In the APPPATH.'config/routes.php' file add this:
$route['(:any)'] = 'Search_class/search_func/$1';
You should read the documentation about routing and check other examples.
Also note that the application will match any first URI segment against this rule, so you need to put it at the end of the routes.php file. Before that rule you need to set your other rules, like:
$route['about'] = 'pages_c/show/about';
$route['contact'] = 'pages_c/show/contact';
// then after all declared routes, you can set wild card placeholder
$route['(:any)'] = 'Search_class/search_func/$1';

How to crawl/index the links on a single page: Google Search Appliance

I'm new to the GSA and don't have full admin access to the system, so I have to forward requests to ICT Services to have changes made to our crawls and collections.
I hope someone can help with this question:
I have a single web page with a list of links to about 180 documents (most of which are stored in the same subdirectory /docs/, which contains some 2400 documents). The rest are scattered across the site in a number of other subdirectories, e.g. /finance/, /hr/, etc.
At the moment I either get the single web page indexed and none of the 180 links, or I get the one page plus ALL 2400 documents in the /docs/ subdirectory.
I want to be able to crawl/index just this page and the 180 linked documents, and create a separate collection for them.
Is there a simple way to do this?
Regards
Henry
Another possible solution is to use a robots.txt file to disallow crawling of the other pages you don't want. This would be a lot of work if you have to enumerate all of them though.
Your best bet is to see if there is some common URL pattern you can use to specify only the 180 pages you do want. For example, are the pages you do want all PDFs, and the other files you do not want are all some other type? If you can find something that is common for all the pages you want that isn't true for the other pages, you can use that to formulate a pattern (maybe using regex) to do what you want.
Instead of configuring a URL pattern under start URLs and follow patterns,
configure the complete URLs. Take the 180 document URLs plus the single web page URL and put all 181 URLs under start URLs and follow patterns. By configuring complete URLs we avoid the GSA crawling the other URLs in the application, since we are not keeping any common URL pattern under follow URLs.
Create a new collection and place all 180 document URLs plus the single web page
URL (or a generic pattern matching the 181 URLs) in that collection under "Include Content Matching the Following Patterns".
I assume that you do not want to index the other 2400 documents on the GSA.
Hope it helps.
Regards,
Mohan.
You would be better off using a metadata-and-URL feed for this.
It lets you control whether the GSA follows the links in your 180 pages if you feed them in, or whether it indexes your list page if you just feed that. You do this by specifying noindex or nofollow.
You'll still need to have your follow and crawl patterns and collections set up correctly but it's the easiest way to control what gets indexed.
You don't necessarily need to write code for this either, you can use curl and hand craft the xml.
The documentation is pretty good and easy to follow. Feeds Protocol Developers Guide

Check if two urls are for the same website

I'm looking for a way to compare two urls. I can do:
URI('http://www.test.com/blabla').host
to get the host name, but this is not reliable. For example:
URI('http://www.test.com/blabla').host == URI('http://test.com/blabla').host
returns false, but they could be the same site. Using the IP address is not reliable either, because if I do:
IPSocket.getaddress(URI('http://hello.herokuapp.com').host) ==
IPSocket.getaddress(URI('http://test.herokuapp.com').host)
It returns true, but they are not the same site. Is there a more reliable way?
The site under http://foo.com can be the same as the one under http://www.foo.com, but it can also be a totally different site, due to the web server configuration. It depends on the DNS configuration too: which IP the www name points to and which IP the name without www points to.
If you want to compare two sites, you need to fetch their content and compare key parts (using Nokogiri, for example) for similarity; a rough sketch of such a comparison follows.
Note that nowadays, due to sidebars and news sections, two consecutive requests to the same URL can give slightly different HTML responses.
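The question is in Ruby, but as a language-neutral sketch of that idea, the following Python fragment fetches both pages and compares a couple of key parts; the choice of the <title> text and the canonical link as "key parts" is my assumption:

import urllib.request
from html.parser import HTMLParser

class KeyParts(HTMLParser):
    """Collects the <title> text and the canonical <link> of a page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.canonical = None
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

def key_parts(url):
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    parser = KeyParts()
    parser.feed(html)
    return (parser.title.strip(), parser.canonical)

def probably_same_site(url_a, url_b):
    # Heuristic only: identical titles/canonicals suggest the same site.
    return key_parts(url_a) == key_parts(url_b)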

How to differentiate, on the server side, between the browser's first request (the HTML file) and the following ones (images, CSS, scripts...)?

I'm programming a website with SEO-friendly links, i.e. the page title or other descriptive text is put in the link, separated by slashes. For example: http://www.domain.com/section/page-title-bla-bla-bla/.
I redirect the request to the main script with mod_rewrite, but URLs in script, img and link tags are not resolved correctly. For example, assuming you are visiting the above link, the tag requests the file at the URL http://www.domain.com/section/page-title-bla-bla-bla/js/file.js, but the file is actually at http://www.domain.com/js/file.js.
I do not want to use a variable or constant in all the URLs in the HTML files.
I'm trying to redirect client requests to one directory of the server or to another. Is it possible to distinguish the first request for a page from the ones that come after it? Is it possible to do this with mod_rewrite for Apache, or with PHP?
I hope I explained it well :)
Thanks in advance.
Using rewrite rules to fix the problem of relative paths is unwise and has numerous downsides.
Firstly, it makes things more difficult to maintain because there are hundreds of different links in your system.
Secondly and more seriously, you destroy cacheability. A resource requested from here:
http://www.domain.com/section/page-title-bla-bla-bla/js/file.js
will be regarded as a different resource from
http://www.domain.com/section/some-other-page-title/js/file.js
and loaded a second time; with many such pages the number of redundant requests grows quickly.
What to do?
Fix the root cause of the problem instead: Use absolute paths
<script src="/js/file.js">
or a constant, or if all else fails the <base> tag.
This is an issue of resolving relative URIs. Judging by your description, it seems that you reference the other resources using relative URI paths: In /section/page-title-bla-bla-bla a URI reference like js/file.js or ./js/file.js would be resolved to /section/page-title-bla-bla-bla/js/file.js.
To always reference /js/file.js independently of the actual base URI path, use the absolute path /js/file.js. Another solution would be to set the base URI explicitly to / using the BASE element (but note that this will affect all relative URIs).
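To see the resolution rules in action, here is a quick check with Python's urllib.parse.urljoin, using the URLs from the question:

from urllib.parse import urljoin

base = "http://www.domain.com/section/page-title-bla-bla-bla/"

# Relative reference: resolved against the directory of the current page.
print(urljoin(base, "js/file.js"))
# http://www.domain.com/section/page-title-bla-bla-bla/js/file.js

# Absolute path: resolved against the host only, independent of the page path.
print(urljoin(base, "/js/file.js"))
# http://www.domain.com/js/file.js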

Detecting URL rewrites (SEO urls)

How could a client detect if a server is using search engine optimization techniques such as mod_rewrite to implement "SEO friendly URLs"?
For example:
Normal url:
http://somedomain.com/index.php?type=pic&id=1
SEO friendly URL:
http://somedomain.com/pic/1
Since mod_rewrite runs server side, there is no way a client can detect it for sure.
The only thing you can do client side is to look for some clues:
Is the generated HTML dynamic, changing between calls? Then /pic/1 would need to be handled by some script and is most likely not the real URL.
As said before: are there <link rel="canonical"> tags? Then the website is telling the search engine which of several URLs with the same content it should use.
Modify parts of the URL and see if you get a 404. In /pic/1 I would modify the "1".
If there is no mod_rewrite it will return a 404. If there is, the error is handled by the server-side scripting language, which can return a 404 but in most cases returns a 200 page printing an error; a rough sketch of this probe follows.
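A rough sketch of that probe in Python; the mutated path is only an example, and a HEAD request is assumed to be acceptable:

import urllib.request
import urllib.error

def status_of(url):
    """Return the HTTP status code for a URL without raising on 4xx/5xx."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        return urllib.request.urlopen(request).status
    except urllib.error.HTTPError as error:
        return error.code

original = "http://somedomain.com/pic/1"
mutated = "http://somedomain.com/pic/1-does-not-exist"   # illustrative change only

mutated_status = status_of(mutated)
if mutated_status == 404:
    print("Bogus path returns 404: plain files, or a rewrite that 404s properly.")
elif mutated_status == 200:
    print("Bogus path returns 200: likely handled by a rewrite rule plus a script.")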
You can use a <link rel="canonical" href="..." /> tag.
The SEO aspect is usually in the words in the URL, so you can probably ignore any parts that are numeric. Usually SEO is applied over a group of like content, such that it has a common base URL, for example:
Base www.domain.ext/article, with fully URL examples being:
www.domain.ext/article/2011/06/15/man-bites-dog
www.domain.ext/article/2010/12/01/beauty-not-just-skin-deep
So the SEO aspect of the URL is the suffix. The algorithm to apply is: typify each "folder" after the common base, assigning it a "datatype" (numeric, text, alphanumeric), and then score as follows (a rough sketch of this scoring appears at the end of this answer):
HTTP Response Code is 200: should be obvious, but you can get a 404 like www.domain.ext/errors/file-not-found that would pass the other checks listed.
Non Numeric, with Separators, Spell Checked: separators are usually dashes, underscores or spaces. Take each word and perform a spell check, and score if the words are valid, including proper names.
Spell Checked URL Text on Page: if the text passes a spell check, analyze the page content to see if it appears there.
Spell Checked URL Text on Page Inside a Tag: if prior is true, mark again if text in its entirety is inside an HTML tag.
Tag is Important: if prior is true and tag is <title> or <h#> tag.
Usually with this approach you'll have a max of 5 points, unless multiple folders in the URL meet the criteria, with higher values being better. Now you can probably improve this by using a Bayesian probability approach that uses the above to featurize (i.e. detects the occurrence of some phenomenon) URLs, plus come up with some other clever featurizations. But, then you've got to train the algorithm, which may not be worth it.
Now based on your example, you also want to capture situations where the URL has been designed such that a crawler will index because query parameters are now part of the URL instead. In that case you can still typify suffixes' folders to arrive at patterns of data types - in your example's case that a common prefix is always trailed by an integer - and score those URLs as being SEO friendly as well.
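A rough sketch of that scoring in Python; the toy word list, the point weights and the tag check are all my assumptions, not part of the answer above:

import re
import urllib.request

# Toy word list standing in for a real spell checker.
DICTIONARY = {"man", "bites", "dog", "beauty", "not", "just", "skin", "deep"}

def seo_score(url, base):
    """Score how 'SEO friendly' a URL looks relative to a common base URL."""
    score = 0
    try:
        page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        score += 1                                    # HTTP response code is 200
    except Exception:
        return 0

    words = []
    for folder in url[len(base):].strip("/").split("/"):
        if not folder.isdigit():                      # ignore numeric parts
            words += re.split(r"[-_ ]+", folder.lower())

    if words and all(word in DICTIONARY for word in words):
        score += 1                                    # non-numeric, separated, spell-checked
        phrase = " ".join(words)
        if phrase in page.lower():
            score += 1                                # URL text appears on the page
            if re.search(r"<(title|h[1-6])[^>]*>[^<]*" + re.escape(phrase),
                         page, re.IGNORECASE):
                score += 2                            # text sits inside an important tag
    return score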
I presume you would be using one of the curl variants.
You could try sending the same request but with different "User-Agent" values,
i.e. send the request once using the user agent "Mozilla/5.0" and a second time using the user agent "Googlebot". If the server is doing something special for web crawlers, then there should be a different response.
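The same idea as a rough Python sketch, without curl (the user-agent strings are just examples):

import urllib.request

def fetch(url, user_agent):
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    return urllib.request.urlopen(request).read()

url = "http://somedomain.com/pic/1"
as_browser = fetch(url, "Mozilla/5.0")
as_crawler = fetch(url, "Googlebot/2.1 (+http://www.google.com/bot.html)")

# If the server special-cases crawlers, the two bodies should differ
# beyond the usual dynamic noise (ads, timestamps, session ids).
print("identical" if as_browser == as_crawler else "different")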
With today's frameworks and the URL routing they provide, I don't even need to use mod_rewrite to create friendly URLs such as http://somedomain.com/pic/1, so I doubt you can detect anything. I would create such URLs for all visitors, crawlers or not. Maybe you can spoof some bot headers to pretend you're a known crawler and see if there's any change; I don't know how legal that is, to be honest.
For dynamic URL patterns it's better to use the <link rel="canonical" href="..." /> tag on the duplicate URLs, pointing at the preferred one.
