Spider for technology profiling - identify Content Management System - full-text-search

I'm looking for a web spider that can crawl links (start from a specific url and follow links to other domains) and identify sites that have a directory called "abc" that has a page title that includes "123".
This may sound shady, so let me explain: it's a tool to identify websites that use a certain CMS so I can build up a prospect list for CMS support services.
An alternative approach is a spider that can identify occurrences of certain strings in the HTML that are characteristic of this CMS.
Such services are provided by builtwith.com and wappalyzer.com, but these commercial solutions are very expensive and I'd like to explore open-source solutions first.

Consider using a search engine.
Many search engines allow queries such as intitle:123 inurl:abc.
But beware: they tend to block requests that look like they are probing for security issues, as the Santy and MyDoom worms did when they relied on Google to find targets (Santy, for instance, searched for vulnerable phpBB installations).
Crawling all of the web yourself will take a lot of time, you know...
If you don't need the latest data, and have some bucks to spare, you could also process the Common Crawl dataset on AWS.
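If you do end up checking candidate sites yourself, the per-site test is straightforward. Here is a rough Ruby sketch (using the nokogiri gem; the host list is a placeholder, and the "abc" path and "123" title fragment are the fingerprints from the question):

    # Rough sketch: flag hosts whose /abc/ page has a title containing "123".
    require 'net/http'
    require 'uri'
    require 'nokogiri'   # gem install nokogiri

    CANDIDATES = %w[example.com example.org]   # placeholder prospect hosts

    def uses_cms?(host)
      uri = URI("http://#{host}/abc/")                  # the "abc" directory
      res = Net::HTTP.get_response(uri)
      return false unless res.is_a?(Net::HTTPSuccess)
      title = Nokogiri::HTML(res.body).at('title')
      !title.nil? && title.text.include?('123')         # the title fingerprint
    rescue StandardError
      false
    end

    CANDIDATES.each { |h| puts "#{h}: #{uses_cms?(h) ? 'match' : 'no match'}" }

It doesn't follow links itself - you'd feed it the candidate list produced by the search-engine queries above.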

Automation layer above a site

I'm looking into creating a website that sits on top of another site. I want this site to be a sort of driver/automator of the original site. The original site is slow and you need to input the same data repetitively (and lots of it - which is infuriating).
What would be the best way of doing this?
I have started using watir-webdriver in Ruby, and it seems to work well! Would I be able to host this? I know it launches a browser (Firefox in my case), and my worry is not being able to host the application.
I don't want to put all my eggs in this one basket and find out later there's a stumbling block to getting it done!
The short answer
I think there are better tools for web scraping than web testing tools (watir and others), and your end result might require a lot more work than you imagine.
The long answer
This sounds like a case of the façade pattern in which your application would act as the new frontend and the old/existing site as the backend for the improved experience of the service.
Some things to think about before jumping into programming:
If the old site requires users to register, would your users be willing to register again on your site so that you could log them in to the old site programmatically?
How frequently does the same data need to be entered, and how would you prevent that repetition?
The existing site may have expectations about request headers, which might cause you extra headaches and require quite some work to circumvent.
Are you allowed to use the existing site's user interface material or do you need to start from scratch?
How often is the existing site changed and how would it affect your application?
In summary, there are lots of factors and issues to take into account depending on how the existing site is implemented and who your intended users are. Suggesting the best way to do it would require a lot more knowledge of both the existing site and how you'd want to improve it.
I haven't used watir-webdriver myself, but if it is like Selenium and starts a new browser instance every time you run it, then hosting it would most likely not work as you'd expect. There are better tools for what you are thinking of doing, i.e. web scraping, and you may want to take a look at the following, for example:
https://www.ruby-toolbox.com/categories/Web_Content_Scrapers
https://www.ruby-toolbox.com/categories/http_clients
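To give a feel for the headless, hostable approach those libraries enable, here is a minimal sketch using the mechanize gem (one of the scrapers listed on the Ruby Toolbox pages above); the URL and form field names are made up for illustration:

    # Sketch: drive the old site's data-entry form without launching a browser.
    require 'mechanize'   # gem install mechanize

    agent = Mechanize.new
    page  = agent.get('http://old-site.example.com/entry')   # hypothetical URL

    form = page.forms.first            # pick the data-entry form
    form['item_name'] = 'Widget'       # placeholder field names
    form['quantity']  = '42'
    result = form.submit

    puts result.title                  # confirm the submission went through

Because nothing opens a browser window, this kind of script can run on an ordinary web host or background worker, which is exactly the hosting concern raised in the question.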

Sitecore with DMS vs caching server - how do you handle it?

We're planning to introduce DMS to our customer's Sitecore installation. It's a rather popular site in our country, and we have to use a caching proxy server (Nginx in this case) to make it high-traffic-proof.
However, as far as we know, it's not possible to use all of the DMS features with a caching proxy enabled - for example, content personalization: if a page gets cached, it won't be personalized.
Is there a way to make use of all the DMS features with the proxy cache turned on? If not, how do you handle this problem for high-traffic sites - by buying more Content Delivery servers to carry the load, or by extending the current servers with better hardware (RAM, CPU, bandwidth)?
You might try moving away from proxy caching for some pages, or even all of them. A few suggestions:
- There's no reason not to use a CDN for static assets and media library assets, so stick with that.
- Leverage Sitecore's built-in HTML cache for sublayouts/renderings - there are quite a few caching options.
- Use Sitecore's Debug feature to track down the slowest components on your site.
- Consider using indexes instead of "fast" or Sitecore queries.
- Don't do a descendants query ("//*"). I often see this when calculating the selected state for navigation - hint: go the other way and calculate the ancestors of the current page.
@jammykam wrote an excellent answer on this over here.
John West wrote a great blog post on this also, though a bit older.
Good luck!
I've been wondering about this myself.
I have been thinking of implementing an ajax web service that:
- talks to the DMS and returns JSON
- allows you to render the personalised components client side
- allows you to trigger analytics events
I have been googling around and I haven't found anyone that has done it and published the information yet. The only place I have found something similar is actually in the mobile sdk, but I haven't had a chance to delve into it yet.
I have also not been able to use proxy server caching and DMS together successfully. For extremely high loads, I have recommended to clients to follow the standard optimization and scaling guidelines, especially architecting for proper Sitecore sublayout and layout caching for as much of the site as possible. With that caching done, follow it up by distributing across multiple Content Delivery nodes with load balancing to help support high volume with personalization at the same time.
I've heard that other CMS's with personalization use a javascript approach to load the personalized content on the client-side, but I would be worried about losing track of the analytics data that is gathered when personalized content is loaded and interacted with.

Hosted full-text search options - IndexTank vs Solr vs Lucene

I am building an app using Ruby on Rails on Heroku and am confused about which full-text search option I should proceed with. A few things I care about:
Real-Time search: I am building a dynamic user-generated website.
Understands Rails Models: I would like to restrict search results based on who the user is (so, I don't really want "just" a site-wide search)
Additionally, something that is easy to configure on Heroku with Rails would be a bonus.
Heroku currently provides three options for full-text search: FlyingSphinx, Searchify/IndexTank, and WebSolr. Can anyone outline the pros and cons of each?
Based on my research, it seems that a lot of people have been happy with IndexTank. In particular, this blog post by Gautam Rege briefly outlines his experience with the three options and how he prefers IndexTank.
However, after LinkedIn's acquisition of IndexTank, some key components of IndexTank were open-sourced and the IndexTank service was discontinued. It seems that Searchify is one of the first (if not, currently, the only) replacements for IndexTank. Does anyone have any experience using this? How does Searchify compare to IndexTank and the other two options - WebSolr and FlyingSphinx?
I'll address your question with regards to Searchify/IndexTank:
Searchify has true real-time indexing. The millisecond you add a document, it becomes searchable. No need to commit or reindex.
There is a Ruby client library for Searchify, here are the docs & download links: http://www.searchify.com/documentation/ruby-client
There is also a nice 3rd party client by kidpollo called Tanker that some Ruby folks prefer: https://github.com/kidpollo/tanker
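To illustrate the real-time aspect, here is roughly what indexing and searching look like with the indextank gem (method names are from memory of the gem's README - double-check them against the Searchify docs linked above):

    require 'indextank'   # gem install indextank

    # SEARCHIFY_API_URL is provided by the Heroku add-on configuration.
    client = IndexTank::Client.new(ENV['SEARCHIFY_API_URL'])
    index  = client.indexes('idx')

    # Add a document; it should be searchable immediately, with no commit step.
    index.document('post_1').add(:text => 'Hello real-time search')

    results = index.search('real-time')
    puts results['results']   # array of matching documents (per the gem docs)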

Design Question for Notification System

The original post was posted at https://stackoverflow.com/questions/6007097/design-question-for-notification-system
Here is more clarification of the problem: the purpose of the notification system is to notify users (via email for now) when the content of a site has changed or been updated, or when a new posting is made. You can think of it as a system where people define a rule or keyword for a third-party site, and the notification system crawls that site and creates inverted search indexes; the user is then notified when a new link or document shows up for their keyword or rule (more explanation below regarding the use case).
To clarify the use case: suppose I am a Craigslist user looking for a used vehicle. I define a rule: "Honda Accord", year 1996, and a price range from $2000 to $3000.
For the above use case, what is the best approach, and how can I leverage open-source technology such as Apache Lucene, Apache Solr, Apache Nutch, and Apache Hadoop to solve it?
You can think of it as building a search engine with a rule and keyword notification system. I just need some pointers and help on how to integrate these open-source packages to solve the use case.
Any help and pointers will be appreciated. The three important components we need are:
1) Web Crawler
2) Index Creator
3) Rule or keyword Matcher
I was referring to this wiki page, which integrates Nutch and Solr for the above purpose: http://wiki.apache.org/nutch/RunningNutchAndSolr
Your question is a big one but I'll take a stab at it as I've designed and implemented systems like this before.
Ignoring user account management, your system will need to provide the means to:
retrieve new prospect data (web spider)
identify and extract pertinent results from prospect data (filtering)
collect, maintain and organize results (storage)
select results based on various metadata (querying)
format results for delivery to users (templating)
deliver formatted results to users (delivery)
If the scope of your project is small (say, fewer than 100 sites requiring spidering per day), you could probably get by with one of the many open-source web spiders, including wget, Nutch, WebSphinx, etc. You might need to provide instrumentation (custom software) for scheduling, monitoring and control. If your project scope is larger than this, you may need to "roll your own" spidering solution (custom software). Typically this would be designed as a distributed, parallel architecture.
For simple filtering, regular expressions would suffice but for more complex tasks requiring knowledge of HTML layout (extract the textual component of the fifth list element (<LI/>) of the fourth table on the page) you'd need to use an XHTML parser. However you proceed, you'll need to provide custom software to conduct filtering based on your users' needs.
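As a concrete example of the parser route, the "fifth list element of the fourth table" extraction above might look like this with Nokogiri (a Ruby HTML parser); the keyword check at the end is just a placeholder rule:

    require 'nokogiri'   # gem install nokogiri

    html = File.read('retrieved_page.html')   # a page fetched by the spider
    doc  = Nokogiri::HTML(html)

    # Textual content of the fifth <li> inside the fourth <table> on the page.
    table = doc.css('table')[3]
    item  = table && table.css('li')[4]
    text  = item ? item.text.strip : ''

    # Simple keyword rule applied to the extracted text.
    puts text if text =~ /honda\s+accord/i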
While any database technology can be used to store results extracted from retrieved documents, using an engine optimized for text like Apache SOLR will allow you to easily expand your search criteria as your needs dictate. Since SOLR supports the attachment of and search for metadata associated with each document, it would be a good choice. You'll also need to provide custom software here to automate this step.
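For the storage and querying steps, a minimal sketch with the rsolr gem might look like the following; the core URL and the field names (title, price, year) are assumptions that would come from your own Solr schema:

    require 'rsolr'   # gem install rsolr

    solr = RSolr.connect(url: 'http://localhost:8983/solr/listings')   # assumed core

    # Index an extracted listing together with its metadata.
    solr.add(id: 'cl-12345', title: 'Honda Accord 1996', price: 2500, year: 1996)
    solr.commit

    # Query step: candidate results for the "Honda Accord, 1996, $2000-$3000" rule.
    response = solr.get('select', params: {
      q:  'title:"Honda Accord"',
      fq: ['year:1996', 'price:[2000 TO 3000]']
    })
    puts response['response']['docs']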
Once you've selected a list of candidate results from SOLR, any scripting language could be used to template them into one or more emails and would also inject them into your mail transport agent (MTA). This also requires custom software to automate this process (and if required, to inject user-specific data into each message).
You should probably also look at Google's Custom Search API before diving into crawling the web yourself. This way, Google can help you with returning keyword-based search results, which you can then filter in your application based on your additional algorithms/rules, etc., to make the whole thing work.
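Querying the Custom Search JSON API is just an HTTP GET; a bare-bones Ruby sketch (you would need your own API key and custom search engine ID, read here from environment variables):

    require 'net/http'
    require 'json'
    require 'uri'

    key   = ENV['GOOGLE_API_KEY']   # your API key
    cx    = ENV['GOOGLE_CSE_ID']    # your custom search engine ID
    query = 'honda accord 1996'

    uri = URI('https://www.googleapis.com/customsearch/v1')
    uri.query = URI.encode_www_form(key: key, cx: cx, q: query)

    data = JSON.parse(Net::HTTP.get(uri))
    (data['items'] || []).each { |item| puts "#{item['title']} - #{item['link']}" }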

Best traffic / performance / usage monitoring module?

Are there any open source (or I guess commercial) packages that you can plug into your site for monitoring purposes? I'd like something that we can hook up to our ASP.NET site and use to provide reporting on things like:
performance over time
current load
page traffic
SQL performance
CPU time monitoring
Ideally in c# :)
With some sexy graphs.
Edit: I'd also be happy with a package that I can feed statistics and views of data to, and that would analyse trends, spot abnormal behaviour (e.g. "no one has logged in for the last hour - is this OK?", "high traffic levels detected", "low number of API calls detected") and generally be very useful indeed. Does such a thing exist?
At my last office we had a big screen which showed us loads and loads of performance counters over a couple of time ranges, and we could spot weird stuff happening, but the data was not stored and there was no way to report on it. It's a package for doing this that I'm after.
It should be noted that Google Analytics is not an accurate representation of web site usage. This is because the web beacon (web bug) used on the page does not always load, for these reasons:
Google Analytics servers are called by millions of pages every second and cannot always process the requests in a timely fashion.
Users often browse away from a page before the full page has loaded, so there is not enough time to load Google's web beacon and record a hit.
Google Analytics requires JavaScript, which can be disabled.
Quite a few people (though not a substantial proportion) block google-analytics.com in their browsers, myself included.
The physical log files are the best 'real' representation of site usage as they record every request. Alternatively there are far better 'professional' packages, of which Omniture is my favourite, which have much better response times, alternative methods for recording actions and more functionality.
If you're after things like server data, would RRDTool be something you're after?
It's not really a web-server stats program, though, and I have no idea how it would scale.
Edit:
I've also just found Splunk Swarm, if you're interested in something that looks "cool".
Google Analytics is free (up to 50,000 hits per month I think) and is easy to setup with just a little javascript snippet to insert into your header or footer and has great detailed reports, with some very nice graphs.
Google Analytics is quick to set up and provides more sexy graphs than you can shake a stick at.
http://www.google.com/analytics/
Not invented here, but it's on my to-do list to set up.
http://awstats.sourceforge.net/
@Ian
Looks like they've raised the limit. Not very surprising - it is Google, after all ;)
This free version is limited to 5 million pageviews a month - however, users with an active Google AdWords account are given unlimited pageview tracking.
http://www.google.com/support/googleanalytics/bin/answer.py?hl=en&answer=55543
http://www.serverdensity.com/
One option is to use external monitoring tools, which monitor web performance from outside the firewall by simulating end-user activity.
Catchpoint Systems has an interesting approach that requires very little coding and gives you performance stats from outside the datacenter as well as from inside ASP.NET (processing time, etc.).
http://www.catchpoint.com/products.html
