I need to change the user-agent string for each crawled domain. I use the standard Nutch crawl utility code, which crawls one domain at a time; it is started in multithreaded mode to crawl many domains. I need to pass a per-domain user-agent string of the form [botname]+domainID, but I'm unsure how to implement this.
Since the user agent is defined in the config file (nutch-site.xml), there is no way to change it for a specific domain.
I suggest creating a separate Nutch instance for each domain you want to crawl. In each instance you set the URL filter, seed URL and user agent to match that domain.
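For example, each instance could carry its own nutch-site.xml with the http.agent.name property set to that instance's [botname]+domainID (the bot name and ID below are placeholders):

    <!-- nutch-site.xml for the instance crawling domain #42 (hypothetical values) -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>mybot-42</value>
      </property>
    </configuration>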
This should allow you to execute each crawl with custom settings.
cheers mana
Related
I would like to create a simple search engine like Google or Bing. Where do Google and Bing take their data from? Where can I take data from?
I am going to build it with the Spring framework.
I am not sure about the framework, but you need a few things:
A list of all active domains (check https://whoisdatacenter.com/).
You need a static IP, and you should tell websites your user-agent name, e.g. --user-agent mybot (there is a curl sketch after this list).
You must be very comfortable with curl / sed / awk / grep.
For example, when my bot runs nmap / nslookup or fetches website data to index in my database, I simply tell every website who I am.
Don't use a user-agent like Google/Yahoo; you will be blocked by many servers.
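For instance, a minimal curl sketch that fetches a page while identifying the crawler (the bot name, contact URL and target URL are made up):

    # fetch a page while telling the site who the crawler is
    # (bot name, contact URL and target are placeholders)
    curl --silent --show-error \
         --user-agent "mybot/1.0 (+https://example.com/bot-info)" \
         --output page.html "https://example.com/"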
PS: I am also new to this, but somehow I manage to get my work done with the steps above.
There are 184-210 million active domains, so you need a powerful server. I built mine in bash.
I'm trying to use HAProxy as a dynamic proxy for backend hosts based on partial /path regex match. The use case is routing from an HTTPS frontend to a large number of nodes that come and go frequently, without maintaining an explicit mapping of /path to server hostnames.
Specifically in this case the nodes are members of an Amazon EMR cluster, and I'd like to reverse-proxy/rewrite HTTP requests like:
<haproxy>/emr/ip-99-88-77-66:4040 -> 99.88.77.66:4040
<haproxy>/emr/ip-55-44-33-22/ganglia -> 55.44.33.22/ganglia
<haproxy>/emr/ip-11-11-11-11:8088/cluster/nodes -> 11.11.11.11:8088/cluster/nodes
...etc
dynamically.
That is, parse the path beginning at /emr and proxy the request to the IP captured by the regex:
emr\/ip-(\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3})(.*)
Is this possible with HAProxy? I know it's probably not the right tool for the job, but if possible (even non-performant) I'd like to use the tooling we already have in place.
tl;dr basically nginx proxy_pass, but with HAProxy and plucking a backend IP from the url.
Thanks!
Yes, it's possible using URL filters in HAProxy; see the link below for more details.
https://fossies.org/linux/haproxy/doc/internals/filters.txt
Yes, this can be done. I would recommend using ACLs, together with round-robin balancing and health checks, so that HAProxy verifies an instance is up before routing to it. That way, the system will only route to service instances that are actually up and running.
In addition, this lets you constantly cycle instances in and out, for example if your AWS instance costs change relative to any other providers you may have, and allows you to load balance with maximum cost savings in mind.
Yes, this is possible. Check the official manual:
Using ACLs and fetching samples
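For what it's worth, here is a rough, untested sketch of how it could look with ACLs and sample fetches on a recent HAProxy (2.2+ for ltrim and replace-path); the frontend/backend names, variable names and cert path are placeholders, so sanity-check the converters against the version you actually run:

    frontend https_in
        bind *:443 ssl crt /etc/haproxy/site.pem      # placeholder cert path
        acl is_emr path_beg /emr/ip-
        use_backend be_emr if is_emr

    backend be_emr
        # second path segment, e.g. "ip-99-88-77-66:4040"
        http-request set-var(txn.node) path,field(3,/)
        # "ip-99-88-77-66" -> "99.88.77.66", used as the destination address
        http-request set-dst var(txn.node),field(1,:),ltrim(ip-),regsub(-,.,g)
        # use the port from the path when present, otherwise default to 80
        http-request set-dst-port var(txn.node),field(2,:) if { var(txn.node) -m sub : }
        http-request set-dst-port int(80) unless { var(txn.node) -m sub : }
        # strip the /emr/ip-... prefix before forwarding
        http-request replace-path ^/emr/[^/]+/(.*) /\1
        http-request replace-path ^/emr/[^/]+$ /
        # 0.0.0.0:0 tells HAProxy to connect to whatever set-dst/set-dst-port chose
        server dynamic 0.0.0.0:0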
I'm currently trying to integrate an SSO with Active Directory. The SSO Service has told me that my server is responding with LDAP "referrals".
Is there a way to disable these referrals? There is only one server/domain, and the server is the domain controller, so I don't know why I would even be getting these in the first place. Any help is appreciated. Thanks!
Turns out it was that the "base DN" in the search wasn't specific enough. Apparently you'll get a referral if you don't point to the exact OU or CN in which the user resides. Since I only really have one active OU, I just hard-pointed it there and everything seems to be working now.
Instead of port 389, use the Microsoft-specific port 3268.
From MSDN:
Avoid unnecessary SearchResultReference referral chasing
With referral chasing enabled, your code could go from domain to domain in the Active Directory tree trying to satisfy the request if the query cannot be satisfied by the initial domain. This method can be extremely time-consuming. When performing a query for objects and the domain for the objects is unknown, use the global catalog as a base for the search instead of using referral chasing.
then:
Connecting to the Global Catalog
There are several ways to connect to a global catalog. If you are using LDAP, then use port 3268 in the ldap_open or ldap_init calls.
You may think everything is satisfied by the initial (only!) domain, but... this is a bureaucracy, and a list of one thing is still a list.
When you create a Security Group, you can make it Global or Domain Local. If the user belongs to a Global Group, as in my case, AD automatically assumes there may be more information to be found in the Global Catalog, so a query against port 389 will generate three referrals. There are probably other reasons referrals are triggered.
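For example (the DC host, bind DN and account name below are placeholders), the only difference is the port in the LDAP URL you bind and search against:

    # search the Global Catalog on port 3268 instead of the default LDAP port 389,
    # so the query is answered without SearchResultReference referrals
    ldapsearch -H ldap://dc01.example.com:3268 \
        -D "CN=svc-sso,OU=Service Accounts,DC=example,DC=com" -W \
        -b "DC=example,DC=com" "(sAMAccountName=jdoe)"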
I had to solve this issue because I had many OUs directly below the top level, all of which I wanted to query in one authentication pass.
In particular, ProFTPD's mod_ldap.c was tripped up by these referrals: it followed them in separate LDAP transactions without binding with the same credentials as the initial query. Although the referrals added nothing, the LDAP library must have returned an opaque error.
I am trying to develop an application with Tomcat running on several computers on the same LAN, representing several nodes, each running an application with a single shared session (e.g. a shared document editor such as Google Docs). As I understand it so far, I need a single shared session: several users update the document simultaneously, and each other's updates are reflected in their web interfaces almost immediately. Can I achieve this with Tomcat's clustering feature (http://tomcat.apache.org/tomcat-7.0-doc/cluster-howto.html#Configuration_Example), or is that just a failure-recovery system?
Tomcat's clustering feature is meant for failover - if one node fails, the user can carry on working while being transparently sent to another node, without needing to log in again.
What you are trying to achieve is a totally different scenario, and I think using the session for this is just wrong. If you go back to the Google Docs example, how would you grant (or revoke) document access to another user? What do you do when the session times out - create the document again? Also, how would you define which users can access which documents?
You would need to persist this data somewhere (a DB?) anyway, so implement or reuse an existing ACL system where you can share information about users and document permissions.
Most solutions I've read here for supporting subdomain-per-user at the DNS level are to point everything to one IP using *.domain.com.
It is an easy and simple solution, but what if I want to point first 1000 registered users to serverA, and next 1000 registered users to serverB? This is the preferred solution for us to keep our cost down in software and hardware for clustering.
(diagram quoted from the MS IIS site: http://learn.iis.net/file.axd?i=1101)
The most logical solution seems to be one A record per subdomain in the zone data files. BIND doesn't seem to have any size limit on zone files; it is only restricted by available memory.
However, my team is worried about the latency of getting a new subdomain up and ready, since creating a new subdomain consists of inserting a new A record and restarting the DNS server.
Is the performance impact of restarting the DNS server something we should worry about?
Thank you in advance.
UPDATE:
It seems most of you suggest using a reverse proxy setup instead:
(diagram of ARR, IIS7's reverse proxy solution: http://learn.iis.net/file.axd?i=1102)
However, here are the CONS I can see:
single point of failure
cannot strategically set up servers in different locations based on IP geolocation.
Use the wildcard DNS entry, then use load balancing to distribute the load between servers, regardless of which client it is.
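For reference, the wildcard entry itself is a single record in the BIND zone (the IP below is a placeholder), so nothing per-user ever has to be added to DNS:

    ; zone file fragment for domain.com - every user subdomain resolves
    ; to the load-balanced front end at 203.0.113.10 (placeholder address)
    *.domain.com.    IN  A   203.0.113.10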
While you're at it, skip the URL rewriting step and have your application determine which account it is based on the URL as entered (you can just as easily determine what X is in X.domain.com as in domain.com?user=X).
EDIT:
Based on your additional info, you may want to develop a "broker" that stores which clients should access which servers. Make that public-facing, then draw on the resources associated with the client as stored in the broker. Your front end can be load balanced, and you can then pull from the file/DB servers based on who the client is.
The front-end proxy with a wild-card DNS entry really is the way to go with this. It's how big sites like LiveJournal work.
Note that this is not just a TCP-layer load balancer - there are plenty of solutions that will examine the host part of the URL to figure out which back-end server to forward the query to. You can easily do it with Apache running on a low-spec server with suitable configuration.
The proxy ensures that each user's session always goes to the right back-end server, and most session-handling methods will just keep on working.
Also the proxy needn't be a single point of failure. It's perfectly possible and pretty easy to run two or more front-end proxies in a redundant configuration (to avoid failure) or even to have them share the load (to avoid stress).
I'd also second John Sheehan's suggestion that the application just look at the left-hand part of the URL to determine which user's content to display.
If using Apache for the back-end, see this post too for info about how to configure it.
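As a rough illustration only (the host names, domain and map file path are made up), such a host-based routing setup on an Apache front end could use a RewriteMap to look up which back-end serves a given user subdomain:

    <VirtualHost *:80>
        ServerName domain.com
        ServerAlias *.domain.com

        RewriteEngine On
        # map of "username -> backend host", maintained by your provisioning code
        RewriteMap backends "txt:/etc/apache2/user-backends.txt"
        # capture X from X.domain.com and proxy to the mapped server
        # (server-a.internal is the hypothetical fallback)
        RewriteCond %{HTTP_HOST} ^([^.]+)\.domain\.com$ [NC]
        RewriteRule ^/(.*)$ http://${backends:%1|server-a.internal}/$1 [P,L]
        ProxyPreserveHost On
    </VirtualHost>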
If you use tinydns, you don't need to restart the nameserver when you modify its database, and it should not be a bottleneck because it is generally very fast. I don't know whether it performs well with 10,000+ entries, though (it would surprise me if it didn't).
http://cr.yp.to/djbdns.html
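For example, with the conventional tinydns service directory (the path, subdomain and IP below are placeholders), adding a subdomain is just appending one line to the data file and rebuilding data.cdb, which tinydns picks up without a restart:

    # append an A record for a new user's subdomain (300 s TTL) and rebuild the CDB
    cd /etc/tinydns/root                              # conventional path, may differ on your setup
    echo '+user1001.domain.com:10.0.0.2:300' >> data
    make                                              # runs tinydns-data to regenerate data.cdb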