Hosted full-text search options - IndexTank vs Solr vs Lucene

I am building an app using Ruby on Rails on Heroku and am confused about which full-text search option I should proceed with. A few things I care about:
Real-Time search: I am building a dynamic user-generated website.
Understands Rails Models: I would like to restrict search results based on who the user is (so, I don't really want "just" a site-wide search)
Additionally, something that is easy to configure on Heroku with Rails would be a bonus.
Heroku currently provides three options for full-text search: FlyingSphinx, Searchify (IndexTank) and WebSolr. Can anyone outline the pros and cons of each?
Based on my research, it seems that a lot of people have been happy with IndexTank. In particular, this blog post by Gautam Rege briefly outlines his experience with the three options and how he prefers IndexTank.
However, after LinkedIn's acquisition of IndexTank, some key components of IndexTank were open-sourced and the IndexTank service was discontinued. It seems that Searchify is one of the first (if not, currently, the only) replacements for IndexTank. Does anyone have any experience using it? How does Searchify compare to IndexTank and to the other two options, WebSolr and FlyingSphinx?

I'll address your question with regard to Searchify/IndexTank:
Searchify has true real-time indexing. The millisecond you add a document, it becomes searchable. No need to commit or reindex.
There is a Ruby client library for Searchify, here are the docs & download links: http://www.searchify.com/documentation/ruby-client
There is also a nice 3rd party client by kidpollo called Tanker that some Ruby folks prefer: https://github.com/kidpollo/tanker
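To give a feel for the Ruby client, here's a minimal sketch of indexing and searching with the indextank gem (the one the Searchify docs point to). The API URL environment variable, the index name and the field names are just placeholders; on Heroku the Searchify add-on provides the real URL:

    require 'indextank'

    # SEARCHIFY_API_URL is a placeholder; the Heroku add-on exposes the real URL.
    client = IndexTank::Client.new(ENV['SEARCHIFY_API_URL'])
    index  = client.indexes('idx')
    index.add   # creates the index on the first run

    # The document is searchable the moment this call returns -- no commit step.
    index.document('post:42').add(:text    => 'hello real-time search',
                                  :user_id => '7')

    results = index.search('hello')

For restricting results per user, one common approach is to store a user_id field on each document (as above) and include it in the query string, though check the Searchify query-syntax docs for the exact form.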

Related

Spider for technology profiling - identify Content Management System

I'm looking for a web spider that can crawl links (start from a specific url and follow links to other domains) and identify sites that have a directory called "abc" that has a page title that includes "123".
This may sound shady, so let me explain: it's a tool to identify websites that use a certain CMS, so I can build up a prospect list for CMS support services.
The alternative approach is a spider that can identify occurrences of certain strings in the HTML that are characteristic of this CMS.
Such services are provided by builtwith.com and wappalyzer.com, though these commercial solutions are massively pricey and I'd like to explore open-source options first.
Consider using a search engine.
Many search engines allow queries such as intitle:123 inurl:abc.
But beware: they tend to block requests that look like they are probing for security issues, like the Santy and MyDoom worms, which relied on Google to find vulnerable phpBB installations.
Crawling all of the interwebs yourself will take a lot of time, you know...
If you don't need the latest data, and have some bucks to spare, you could also process the Common Crawl corpus on AWS.
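Whichever way you collect candidate URLs (search-engine results or Common Crawl data), the per-site check from your question is small. A rough Ruby sketch, assuming the CMS exposes an /abc/ directory and a page title containing "123" as you describe:

    require 'open-uri'
    require 'nokogiri'

    # True if the site serves a page under /abc/ whose <title> contains "123".
    def matches_cms?(base_url)
      html  = URI.open("#{base_url}/abc/", :read_timeout => 10).read
      title = Nokogiri::HTML(html).at('title')
      title && title.text.include?('123')
    rescue StandardError
      false   # unreachable hosts, 404s, etc. simply don't match
    end

    candidates = ['http://example.com', 'http://example.org']   # from your crawl/search
    prospects  = candidates.select { |url| matches_cms?(url) }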

What is a good choice for full-text indexing when developing an OSX application?

Hi,
I'm implementing an IMAP client as a Mac OSX application using MacRuby.
For the sake of offline availability, I wanted to allow full-text indexing and attribute-based indexing of all messages. Attributes include common e-mail fields like from:, to:, etc.
This would allow for advanced results sprinkled with faceting, analytic calculations and such.
Now I'm unsure about the choices and good practices when it comes to integrating such a search feature. I have a strong web development background, so my intuitive action would be to set up a Solr server and start feeding it with data. This might just work in theory, as I could write an agent that manages the Solr instance for my application in the background. But to me, this approach seems like an infrastructure hassle.
On the other side, I've read about people using the FTS3 functionality from SQLite. This approach is easily accessible from Core Data. I haven't used SQLite's FTS3, but I don't think it is as powerful as Solr can be.
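For reference, here is roughly what the FTS3/FTS4 route looks like from Ruby via the sqlite3 gem (just a sketch; the table and column names are made up, and FTS availability depends on how your SQLite was compiled):

    require 'sqlite3'

    db = SQLite3::Database.new('mail_index.db')

    # Virtual table with full-text indexes on the listed columns.
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS messages
                USING fts4(subject, body, from_addr, to_addr)")

    db.execute("INSERT INTO messages (subject, body, from_addr, to_addr)
                VALUES (?, ?, ?, ?)",
               ['Meeting notes', 'Full text of the message ...',
                'alice@example.com', 'bob@example.com'])

    # MATCH supports per-column terms, e.g. restrict one term to from_addr.
    rows = db.execute("SELECT subject FROM messages
                       WHERE messages MATCH 'from_addr:alice body:notes'")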
What is your weapon of choice for a use case like mine?
I'm mainly interested in solutions that are actually in use by Objective-C/Cocoa/MacRuby developers.
If you're going to develop the app with Ruby, give Picky a try. It is very simple to use.
There is an Objective-C Lucene port
http://svn.gna.org/viewcvs/etoile/trunk/Etoile/Frameworks/LuceneKit/
I have not used it, but in your situation I'd at least check it out. In my experience, SQL-based full-text search can't compete with Lucene, but I haven't tried SQLite for this.
EDIT: just noticed the Ruby tag -- Ferret started out as a port of Lucene:
https://github.com/dbalmain/ferret
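Basic Ferret usage looks roughly like this (a sketch from memory; the field names are just illustrative):

    require 'ferret'

    # On-disk index; Ferret creates the directory if it doesn't exist yet.
    index = Ferret::Index::Index.new(:path => '/tmp/mail_index')

    index << { :subject => 'Lunch on Friday?',
               :body    => 'Does noon work for everyone?',
               :from    => 'alice@example.com' }

    # Lucene-style query syntax, including field prefixes.
    index.search_each('body:noon AND from:alice*') do |doc_id, score|
      puts "#{index[doc_id][:subject]} (#{score})"
    end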

C++ Web-framework with cookie and SQL support

Good Evening,
I'm building a website which will look something like this:
So probably a widget-centred web-framework would be best...
Which C++ web-framework supports cookies (for user-login [session] storage+config storage) and SQL (MySQL or SQLite)?
My information about Wt was outdated, it looks like they now have full-support for cookies (http://redmine.webtoolkit.eu/boards/1/topics/2111)
CppCMS, however, has a vibrant community, and their product seems to scale better.
I will do the diplomatic thing, create a project using both frameworks.
It will be a cut-down version with only:
User registration
User login/logout (including a redirect if a logged-out user tries to access a page that requires authentication)
Search
Some basic argument passing of results across screen (see initial wireframe for ref)
Should be an interesting project... I wonder if any have done this in the past?!
Cookie support and SQL backend support are basic things, so I guess all web frameworks support them.
I am a very happy user of cppcms and I can assure you it can do all the things you ask for.
Cppcms' SQL backend uses cppdb, created by the same developer, which supports MySQL, SQLite, Postgres, and others, in addition to supporting connection pooling and other nice features.
Config storage is easy, using a JSON format. Cppcms also has a nice caching framework, as well as a nice API to create forms, asynchronous requests, long polling, etc. The templating engine is both simple and very powerful, allowing for a separation of the interface and the application logic.
What's more, and probably the thing I like most about cppcms, the support is very good. Subscribe to the official cppcms mailing list: Artyom, the cppcms creator, always replies to requests very patiently. Personally, I wish the people asking for support contributed more to the cppcms wiki, so remember that when you join us.
[Edit: Actually, I noticed you asked the very same question on the cppcms mailing list in April this year, posting the very same screenshot. Others and I kindly took the time to answer you, but you never replied or thanked us for our time and advice. You may continue asking the same question all over the place, but please try to be more appreciative of the people who are giving a bit of their time to answer you. Good social manners never hurt.]
I recognise these Wt (http://webtoolkit.eu/wt) widgets you can use for your app:
charts: WCartesianChart
dropdown boxes: WComboBox
models and filter proxy models: WSortFilterProxyModel, WAbstractItemModel
the lists (views): WTableView
layout managers with draggable splitters: WHBoxLayout
tabs: WTabWidget
panel on the right: WPanel
suggestion popup on the left: WSuggestionPopup
WLineEdit
database access: Wt::Dbo (an ORM), or anything else you prefer
cookies are well supported in Wt, see cookie related methods in WEnvironment and WApplication
Simply combine them for your application...
BinaryTiers provides a complete web development environment, including all the tools that make common web development tasks easier out of the box. Some of the fundamental tools and features built into BinaryTiers include:
Forms validation system architecture
Abstract Publishing Architecture with built-in categorization and content translation
User account registration and maintenance
Menu management and friendly URLs
RSS-feed aggregation and syndication
System administration and web interface for the GNU C++ compilers
Coherent programming interface for No-SQL data stores as well as relational databases with Redis and MySQL
Template System and easy page layout customization
Modular design that provides high extensibility
High Speed HTTP Communication (Get, Post, Cookies, Files)
Built-in Encoding and Encryption functions
Multiplatform: Linux, BSD, OSX and Windows
Have a look at ffead-cpp; it probably does what you need and provides a lot more...

Design Question for Notification System

The original post was posted at https://stackoverflow.com/questions/6007097/design-question-for-notification-system
Here is more clarification of the problem: the purpose of the notification system is to notify users (via email for now) when the content of a site has changed or been updated, or when a new posting is made. It could be treated as a notification system where people define a rule or keyword for a 3rd-party site; the notification system then goes out, crawls the 3rd-party site and creates inverted search indexes, and a new link or document shows up for the user-defined keyword or rule (more explanation below regarding the use case).
To clarify the use case: suppose I am a Craigslist user looking for a used vehicle. I define a rule: "Honda Accord", year 1996, and a price range from $2000 to $3000.
For the above use case to work, what is the best approach, and how can I leverage open-source technologies such as Apache Lucene, Apache Solr, Apache Nutch and Apache Hadoop to solve it?
You can think of it as building a search engine with a rule and keyword notification system. I just need some pointers and help on how to integrate these open-source packages to solve the use case.
Any help and pointers will be appreciated. The three important components we need are:
1) Web Crawler
2) Index Creator
3) Rule or keyword Matcher
Any help will be greatly appreciated. I was referring to this wiki, which integrates Nutch and Solr for the above purpose: http://wiki.apache.org/nutch/RunningNutchAndSolr
Your question is a big one but I'll take a stab at it as I've designed and implemented systems like this before.
Ignoring user account management, your system will need to provide the means to:
retrieve new prospect data (web spider)
identify and extract pertinent results from prospect data (filtering)
collect, maintain and organize results (storage)
select results based on various metadata (querying)
format results for delivery to users (templating)
deliver formatted results to users (delivery)
If the scope of your project is small (say less than 100 sites requiring spidering per day), you could probably get along with one of the many open-source web spiders including wget, Nutch, WebSphinx, etc. You might need to provide instrumentation (custom software) for scheduling, monitoring and control. If your project scope is larger than this, you may need to "roll your own" spidering solution (custom software). Typically this would be designed as a distributed, parallel architecture.
For simple filtering, regular expressions would suffice but for more complex tasks requiring knowledge of HTML layout (extract the textual component of the fifth list element (<LI/>) of the fourth table on the page) you'd need to use an XHTML parser. However you proceed, you'll need to provide custom software to conduct filtering based on your users' needs.
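For example, the "fifth list element of the fourth table" case above looks something like this with a Ruby HTML parser such as Nokogiri (a sketch; 'page.html' stands in for whatever your spider fetched):

    require 'nokogiri'

    doc = Nokogiri::HTML(File.read('page.html'))

    # Textual content of the fifth <li> of the fourth <table> on the page
    # (indices are zero-based here).
    fourth_table = doc.css('table')[3]
    fifth_item   = fourth_table && fourth_table.css('li')[4]
    text         = fifth_item ? fifth_item.text.strip : nil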
While any database technology can be used to store results extracted from retrieved documents, using an engine optimized for text like Apache SOLR will allow you to easily expand your search criteria as your needs dictate. Since SOLR supports attaching metadata to each document and searching on it, it would be a good choice. You'll also need to provide custom software here to automate this step.
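As a sketch of that step in Ruby with the rsolr gem (the core URL and the *_t/*_f dynamic-field names are assumptions based on Solr's example schema):

    require 'rsolr'

    solr = RSolr.connect(:url => 'http://localhost:8983/solr/listings')

    # Index one extracted result.
    solr.add(:id      => 'craigslist-12345',
             :title_t => '1996 Honda Accord LX',
             :price_f => 2500.0,
             :url_s   => 'http://example.org/cto/12345.html')
    solr.commit

    # The user's rule from the question: "Honda Accord", 1996, $2000-$3000.
    response = solr.get('select', :params => {
      :q  => 'title_t:"honda accord" AND title_t:1996',
      :fq => 'price_f:[2000 TO 3000]'
    })
    hits = response['response']['docs']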
Once you've selected a list of candidate results from SOLR, any scripting language could be used to template them into one or more emails and to inject them into your mail transfer agent (MTA). This also requires custom software to automate the process (and, if required, to inject user-specific data into each message).
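A minimal sketch of that last step with ERB and the mail gem (the addresses, template and local SMTP/MTA settings are placeholders):

    require 'erb'
    require 'mail'

    # e.g. the docs returned by the SOLR query in the previous step
    hits = [{ 'title_t' => '1996 Honda Accord LX',
              'url_s'   => 'http://example.org/cto/12345.html' }]

    template = ERB.new(<<-TPL)
    New matches for your rule "<%= rule %>":
    <% hits.each do |doc| %>- <%= doc['title_t'] %> (<%= doc['url_s'] %>)
    <% end %>
    TPL

    body_text = template.result_with_hash(:rule => 'Honda Accord 1996, $2000-$3000',
                                          :hits => hits)

    Mail.defaults { delivery_method :smtp, :address => 'localhost', :port => 25 }
    Mail.deliver do
      from    'alerts@example.com'
      to      'user@example.com'
      subject 'New listings matching your rule'
      body    body_text
    end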
You should probably also look at Google's Custom Search API before diving into crawling the web yourself. This way, Google can return keyword-based search results, which you can then filter in your application with your additional algorithms/rules, etc., and make the whole thing work.
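If you go that route, the Custom Search JSON API is plain HTTP, so a quick Ruby sketch is enough to try it (the API key and search-engine ID are placeholders you create in Google's console):

    require 'net/http'
    require 'json'
    require 'uri'

    api_key = 'YOUR_API_KEY'            # placeholder
    cx      = 'YOUR_SEARCH_ENGINE_ID'   # placeholder

    uri = URI('https://www.googleapis.com/customsearch/v1')
    uri.query = URI.encode_www_form(:key => api_key, :cx => cx,
                                    :q => '"honda accord" 1996 site:craigslist.org')

    results = JSON.parse(Net::HTTP.get(uri))
    (results['items'] || []).each do |item|
      puts "#{item['title']} - #{item['link']}"
    end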

Google Visualization API

I want a real and honest opinion: what do you think of the Google Visualization API?
Is it reliable to use? When I was reading the documentation I noticed that there are a lot of issues and defects to overcome. Also, can I use it to retrieve data from a MySQL database?
Thank you.
I am currently evaluating it. Compared to other JavaScript data visualization frameworks, I think it has a lot going for it:
dynamic loading is built-in
diverse, many things to choose from.
looks really great!
framework mostly takes care of picking whatever implementation fits the current browser
service based, you don't need to download anything in advance
unified data source: just create one data table, and have multiple visualizations draw from that data.
As a disadvantage, I'd like to mention security. I mean, because it's all service based, it is not so transparent what happens when you pass data into these API calls. And as far as I know, the API is free, but not open source, so I can't really check what is going on behind the covers.
I think the Google Visualization API really shines if you want to very quickly whip up a visualization gadget for use in a blog or so, and you are not interested in deploying all kinds of plugins and libraries (for example, with jQuery-based frameworks, you may need to manage multiple JavaScript libraries that work together to deliver the goods). If, on the other hand, you are creating an application that you want to sell, you might want to keep more control over which components you are using, and I would probably consider using something like Flot.
But like I said, I am only evaluating at the moment; I am not using this in production.
Works really great for me. Can be customized fairly easily. Haven't seen any scaling issues. No data is exposed so security should not be an issue. - Arunabh Das
One point I want to add here is that the Google Visualization API cannot be downloaded; it's not available for offline usage. So an application that uses it must always be connected to the internet, otherwise I think it won't be able to render charts. Due to this limitation, this API cannot be used in some applications for which an internet connection is not available.
I am currently working on a web-based application that will have the Google Visualization API added to it, and from a developer's perspective the Google Visualization API is very limited in what you can do with each individual chart. If I had a choice I would probably look at dojox charting, just because of the extra flexibility that framework gives you.
If you are doing any kind of large web application that will use charting extensively, then I would not recommend the Google Visualization API; it does not have enough flexibility for a large web application.
I am using the Google Visualization API and I want to stress that they still won't let you download it, which means that if their servers are down, your app will be down if you depend on it. I have been using it for about 4 months, and it has crashed on me once, so I'd say it's pretty reliable, and the documentation is really nice.
