Indexing and searching text files in Ruby and Sinatra

I'm building a wiki with Ruby and Sinatra and need to search the wikis, which are stored as text files with markup for a few HTML renderers (redcarpet, markdown, creole, slim, haml...).
Ruby has plenty of options such as the ferret, solr or lucene gems, but they handle structured data in a database, not free-text files.
For now I open the text files and search them with a regexp, but as the wikis grow that will soon be too slow. Are there any gems that index all the text files in a folder and give me an index I can then use to search the files? It needs to be a Ruby-only solution or something that can easily be used from Ruby.
I'm not using one of the common wikis since none has the features I need.
I do use Windows Indexing Service in a few old ASP apps, but I'm far from satisfied with that solution.
My OSes are Windows Vista, 7 and Windows Server 8.
EDIT: a solution that needs no database installation and no server kept running is preferable, e.g. something with SQLite or file-based storage.
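For illustration only, here is a minimal sketch of the kind of index being asked about: a pure-Ruby inverted index that maps each word to the documents containing it and can be persisted to a single file with Marshal, so no database or server is involved. The class name `WikiIndex`, the tokenizer and the storage format are assumptions made for this example and do not come from any gem mentioned in this thread.

```ruby
require 'set'

# Hypothetical helper: a tiny inverted index mapping word => set of document names.
class WikiIndex
  def initialize(postings = {})
    @postings = postings
  end

  # Lowercase the text and keep only runs of letters and digits, so markup
  # characters such as '#', '*' or '=' are dropped.
  def self.tokenize(text)
    text.downcase.scan(/[[:alnum:]]+/)
  end

  # Index one document under the given name (for a wiki, the file path).
  def add(name, text)
    self.class.tokenize(text).each do |word|
      (@postings[word] ||= Set.new) << name
    end
  end

  # Index every file matching a glob pattern, e.g. "wiki/**/*.md".
  def add_files(pattern)
    Dir.glob(pattern).each { |path| add(path, File.read(path)) }
  end

  # Return the sorted names of documents that contain every word of the query.
  def search(query)
    words = self.class.tokenize(query)
    return [] if words.empty?
    words.map { |word| @postings.fetch(word, Set.new) }.reduce(:&).to_a.sort
  end

  # Persist the index to one file and load it back later.
  def save(path)
    File.binwrite(path, Marshal.dump(@postings))
  end

  def self.load(path)
    new(Marshal.load(File.binread(path)))
  end
end

index = WikiIndex.new
index.add('home.md', '# Welcome to the *wiki*')
index.add('setup.creole', '== Wiki setup ==')
p index.search('wiki')        # both documents contain "wiki"
p index.search('wiki setup')  # only setup.creole contains both words
```

A sketch like this only answers "which files contain these words"; ranking, stemming and incremental updates when a file changes would still have to be added, which is where a real search library earns its keep.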

Personally I would choose ElasticSearch: http://www.elasticsearch.org/
It's very easy to get running, and there are some gems which make it really easy to communicate with it from Ruby (for instance tire).
I'm not aware of any performant text-file-based full-text search engines, so I really think you would be best off looking for a simple server, which ElasticSearch provides, imho.

Take a look at ThinkingSphinx, a Ruby bridge between Sphinx and ActiveRecord. Using this gem will allow you to index your models in an easy way, and to perform simple searches and full-text searches on all the models of your application.
Home page and documentation of the project:
http://freelancing-god.github.com/ts/en/
Nice little intro of how to use it:
http://www.synbioz.com/blog/2012/05/18/full_text_search_with_sphinx
I believe that the integration with Sinatra would be seamless if you work with ActiveRecord.

I have been using Solr with both SQL and Mongo databases in Rails 2.3 - 3.2 and it has been doing great for me so far. Take a look at this railscast. Solr is a Java-based full-text search server developed by Apache that can index Microsoft documents, text files and rich text documents, and can even do OCR on images.

Related

MS Access full text and file search

I am trying to integrate the Windows desktop file search feature into MS Access to search files based on content.
For example:
I want to search from MS Access for all the files in specific folder(s) that contain "Noble" in their content (preferably it also searches PDF content).
Can anyone suggest good place to start?
I've been down this road. Windows Search or Google Search are quite problematic, particularly if you want to search data on a server, because you have to maintain indexes on each client workstation. There's a server version for Windows Search but the API is very complicated.
Office versions from 97 to 2003 provided a FileSearch object that was quite versatile, but that was removed in Office 2007.
Because of that, I coded up a FileSearch class module for use in Access to replace the core functionality provided by the old FileSearch object. You can find the code on my website. It still needs a lot of work, but I've had it in production use since June 2009. It does have some issues on Vista/Win7 if you try to search folders that aren't available to non-admin users, and some other problems, too. I've wanted to get back to it and change the progress bar to use WithEvents, but as I've already got a working implementation for the two applications where I'm using it, it wasn't really worth my time.
Try it and see if you have any problems. For searching files for strings in those files, it works pretty well (much faster than the built-in WinXP search functionality!), but it's not going to be as fast as Vista/Win7's search, since it's not index-based.
At work I use Google Desktop because we're still on Windows XP and I don't know if that is the reason, but I'm not impressed with Windows Search.
I don't even think you can go into Access itself and do a search to look everywhere (data, objects, code, etc.).

Anything better than ruby alchemy for extracting keywords?

I've currently written an algorithm in Ruby based on the arc90 readability code to extract an article from a web page.
Now that I have the article, I want to extract keywords and specific information from it (names, author, etc)
I heard Alchemy was a great ruby gem for doing this though it consumes a lot of resources. Are there any better gems I can use for this?
A fast, lightweight and easy-to-use gem for extracting keywords from longer content:
https://rubygems.org/gems/highscore
I use it in production; it works like a charm.
The question is a bit older, but I'll leave this here for others who come to this question from Google.
There is an OpenCalais gem which provides similar capability. In addition to entity extraction it can also detect events and relations between entities. It's not lightweight, though I couldn't tell if it's better or worse than Alchemy as I haven't used the Alchemy gem. Hope this helps.

Which full-text search package should I use for SQLite3?

SQLite3 appears to come with three different full-text search engines, called FTS1, FTS2, and FTS3. The documentation available on the website mentions that FTS1 is stable, FTS2 is in development, and that you should use FTS2. Examples I find online use FTS3, which is in CVS, and not documented versus FTS2. None of the full-text search engines come with the amalgamated source, as near as I can tell.
So, my question: which of these three engines, if any, should I use for full-text indexing in SQLite? Or should I simply use a third-party tool like Sphinx, or a custom solution in Lucene, instead?
As of 3.6.21, FTS3 is well documented, and gained a more officially visible status.
FTS3 is part of the standard sqlite DLL build on Windows, not sure about the amalgamated source.
We've been using it on production for about a year with no particular issues.
I've looked into full-text solutions recently too. It seems like SQLite has no de facto choice right now. No matter what you choose, it's inevitable that you'll have to re-architect it as the various FTS2, FTS3, etc. solutions mature. So bite the bullet and assume you'll need to do more development in the future to keep pace with changing full-text technology.
Sphinx Search has no direct support for SQLite yet. It supports only MySQL and PostgreSQL right now (ca. August 2009). So you'd have to hack your own SQLite connector or else migrate SQLite data to MySQL or PostgreSQL and then index the data with Sphinx Search. I think someone is working on a Sphinx Search patch to support Firebird, so maybe it's not so hard if you're willing to roll up your sleeves.
Also be aware that Sphinx Search has some limitations about incrementally adding data to the index. You should spend an hour or so reading the doc before you decide to use it.
I don't know of any direct way to index SQLite data in Lucene either. You'd probably have to write your own code to process batches of SQLite data, adding rows to the Lucene index one at a time. This seems to be the usage of Lucene no matter what the database.
update: Solr is a great companion technology for Lucene. Solr gives that search engine many features, including the ability to bulk-load query result data from any JDBC data source.

Ruby off the rails

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
Sometimes it feels that my company is the only company in the world using Ruby but not Ruby on Rails, to the point that Rails has almost become synonymous with Ruby.
I'm sure this isn't really true, but it'd be fun to hear some stories about non-Rails Ruby usage out there.
One of the huge benefits of Ruby is the ability to create DSLs very easily. Ruby allows you to create "business rules" in a natural language way that is usually easy enough for a business analyst to use. Many Ruby apps outside of web development exist for this purpose.
I highly recommend Googling "ruby dsl" for some excellent reading, but I would like to leave you with one post in particular. Russ Olsen wrote a two part blog post on DSLs. I saw him give a presentation on DSLs and it was very good. I highly recommend reading these posts.
I also found this excellent presentation on Ruby DSLs by Obie Fernandez. Highly recommended reading!
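To make the "business rules in near-natural language" idea concrete, here is a minimal sketch of an internal DSL built with `instance_eval`. The names `RuleSet`, `Rule`, `rule`, `applies_when` and `then_do`, and the order fields, are all invented for this example; they are not taken from the posts or the presentation mentioned above.

```ruby
# Hypothetical rule DSL: every name here is invented for the sketch.
class Rule
  attr_reader :name

  def initialize(name, &block)
    @name = name
    instance_eval(&block) # run the block with this rule as self
  end

  def applies_when(&condition)
    @condition = condition
  end

  def then_do(&action)
    @action = action
  end

  # Run the action only if the condition holds for this order.
  def apply(order)
    @action.call(order) if @condition.call(order)
  end
end

class RuleSet
  def initialize(&block)
    @rules = []
    instance_eval(&block) # lets the block call `rule` without a receiver
  end

  def rule(name, &block)
    @rules << Rule.new(name, &block)
  end

  def apply_all(order)
    @rules.each { |r| r.apply(order) }
    order
  end
end

# This is the part a business analyst would read or edit.
DISCOUNTS = RuleSet.new do
  rule 'bulk discount' do
    applies_when { |order| order[:quantity] >= 100 }
    then_do      { |order| order[:discount] = 0.10 }
  end

  rule 'free shipping' do
    applies_when { |order| order[:total] > 50 }
    then_do      { |order| order[:shipping] = 0 }
  end
end

p DISCOUNTS.apply_all({ quantity: 150, total: 20, shipping: 5 })
```

The trick is that `instance_eval` changes `self` inside the block, so bare words like `rule` and `applies_when` become method calls on the DSL objects.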
I use Ruby extensively in my work, and none of it is Rails (or even web) based.
My domain is usually client-side Windows applications (wxRuby GUI) and scripts, automating Excel, Internet Explorer, SQL Server queries and report generation (win32ole COM automation). I also use the sqlite, pdf-writer, and gruff libraries for various data munging and graph generation tasks.
Rails' success has been great for Ruby, but I agree that Rails has received so much attention that Ruby's value beyond the web is often overlooked.
We are mainly a C++ shop, but we've found several areas where Ruby has proven quite useful. Here are a few:
Code Generation - built several DSLs to generate C++/Java/C# code from single input files
Build Support:
  scripts to generate Makefiles for Unix from Visual Studio project files
  scripts for building projects and formatting the output for Cruise Control
  scripts for running our unit tests and formatting the output for Cruise Control
  scripts for manipulating Visual Studio projects and solutions from the command line
Integration Tests - we can crank out tests much quicker and cleaner using Ruby than C++
QA's entire testing suite is written in Ruby
Ruby is basically my go-to tool wherever it makes sense. And it makes sense in a lot of places.
Google Sketchup uses Ruby as an embedded scripting language. You can use it to perform all sorts of 3d modeling and import/export tasks. The scripting works with the free version and there's even decent documentation.
Ruby with a homebrew extension written in C++ does all the heavy pixel pushing for my photography processing. I was using Python+numpy, but when doing artsy stuff, Ruby is just more fun. Also, the relative lack (or lesser maturity) of good image processing libraries makes me feel less like I'm reinventing wheels. I am clueless about Rails, other than that I've heard of it, have a fuzzy idea what it is, and actually have a book on it (unopened).
We use Watir (Ruby library) to test our .net web application.
Check out Shoes, a simple API for building GUIs in Ruby aimed at novice programmers.
Or you could use Ruby to make music à la Giles Bowkett's Archaeopteryx. This presentation by Giles about Archaeopteryx is one of the best presentations ever. I highly recommend it.
RubyCocoa and MacRuby. Possible to make full Cocoa-based GUI apps without Rails. And then you get to use Interface Builder, too.
I worked on a museum project last year that used a lot of Ruby. (http://ourspace.tepapa.com/home)
The part that I spent most of my time on was an interactive floor map. The Map on the floor has sensors so when people walk on it lights are triggered and displays in the wall show images or videos and audio tracks are played.
All the control code for this part of the exhibit is Ruby. I wrote C interfaces with Ruby wrappers to communicate with the floor sensors and the lighting controllers. The system queries a MySQL database for the media files to be displayed and then tells computers in the walls to play the media via UDP.
It's the most reliable part of the entire exhibit.
Ruby was also used for the other major part of the exhibit, the Wall, though I didn't have much to do with that. Most of the graphics were prototyped in Ruby using interfaces to OpenGL, a bit of Cocoa and a physics library before being ported to pure Obj-C.
Puppet and Chef: DevOps
I didn't see a mention of Puppet or Chef in the 30 answers that preceded my arrival. Ruby appears to dominate current work in cloud automation and is the base, extension, and templating language of these two big players. They are used primarily to distribute system and application configuration information for server arrays and for general IT workstation management.
The DevOps field is quite Ruby-aware. Today, Perl has a competitor. While a really simple script may often still be written directly for sh(1), a complex task now might be done in Ruby rather than Perl.
The only site I've done with Ruby at work is using Rails, but I'd like to try Merb.
Other than that I do a lot of little utility programs in Ruby - for instance an app that reads RSS feeds and imports new posts into a database.
It's fun, so I also write some dumb stuff just because it's so quick. Yesterday I wrote an app to play the Monty Hall problem 100,000 times to help a friend convince her professor that switching is the correct strategy.
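A simulation like the Monty Hall one described above fits in a few lines of Ruby. The following is a minimal sketch, not the original program: the method names, the fixed seed and the three-door setup are assumptions made for the example.

```ruby
# Play one round; returns true if the player ends up with the car.
# `switch` says whether the player switches after the host opens a door.
def monty_hall_win?(switch, rng)
  doors = [0, 1, 2]
  car   = rng.rand(3)
  pick  = rng.rand(3)
  # The host opens a door that is neither the player's pick nor the car.
  opened = (doors - [pick, car]).sample(random: rng)
  # Switching means taking the one door that is neither picked nor opened.
  pick = (doors - [pick, opened]).first if switch
  pick == car
end

# Fraction of rounds won with the given strategy (seeded for repeatability).
def win_rate(switch, rounds: 100_000, rng: Random.new(42))
  wins = rounds.times.count { monty_hall_win?(switch, rng) }
  wins.to_f / rounds
end

puts format('stay:   %.3f', win_rate(false))
puts format('switch: %.3f', win_rate(true))
```

Over many rounds the "stay" rate should settle near 1/3 and the "switch" rate near 2/3, which is the point of the exercise.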
I almost take it as an insult that Ruby is seen as a Rails thing. It's like back when CGI was the latest trend and everyone figured that if you knew Perl, it could only be because you programmed CGI apps. Ruby is just a scripting language for me. It's not as mature as Python, so I somewhat regret having to jump through some of its hoops and recent changes, but I still like it and use it. I work in a Java shop, where Groovy is the ideal choice for a scripting language, but I still use Ruby at home and for throwaway scripts that don't need to be shared at work.
I was considering getting into RoR because of all the buzz and how quick/simple it is supposed to be, but after looking over Rails I didn't see anything at all that was amazing, or even the least bit innovative or especially fast about its development, compared to any other framework. The only benefit I saw was that I could code in Ruby, which would be nice, but initial setup, server maintenance and scaling are more difficult, which offsets the pleasure of coding in Ruby.
I created a presentation -- coincidentally named Off The Rails -- to discuss Rack-based web applications:
https://github.com/alexch/Off-The-Rails
The git repo includes slides in Markdown format and sample code (in the form of running applications and middleware). Here's the abstract:
Ruby on Rails is the most popular web application framework for Ruby. But it's not the only one! If you think Rails is too big, or too opinionated, or too anything, you might be happy to learn about the new generation of so-called microframeworks built on Rack. And since Rails 3 is itself a Rack app, you don't have to give up Rails to get the benefit of Sinatra routes or Grape APIs.
And here are some references:
This talk lives at https://github.com/alexch/off-the-rails
Yehuda's #10 Favorite Thing About Ruby
Rack
rack-test
rack-client
Sinatra
Grape
Vegas
Siesta
Rerun
Hope you find it useful!
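For readers who have not seen Rack itself: the interface that the microframeworks in the abstract build on is small enough to sketch without any gem. An application is any object that responds to `call(env)` and returns a status, a headers hash and a body that responds to `each`. The app and middleware below, including the names `HelloApp` and `Timing` and the `x-runtime` header, are made up for this illustration and are not taken from the talk's sample code.

```ruby
# A Rack application: any object with call(env) returning [status, headers, body].
HelloApp = lambda do |env|
  body = "Hello from #{env['PATH_INFO']}\n"
  [200, { 'content-type' => 'text/plain' }, [body]]
end

# A middleware wraps another app and may alter the request or the response.
# Timing is a hypothetical example that records how long the inner app took.
class Timing
  def initialize(app)
    @app = app
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    [status, headers.merge('x-runtime' => format('%.6f', elapsed)), body]
  end
end

APP = Timing.new(HelloApp)

# Calling the stack by hand with a minimal env hash; a real server passes many more keys.
status, headers, body = APP.call({ 'REQUEST_METHOD' => 'GET', 'PATH_INFO' => '/wiki' })
puts status
puts headers['content-type']
body.each { |chunk| puts chunk }
```

With the rack gem installed, an object like `APP` would normally be handed to the server from a `config.ru` file via `run APP`; here it is called directly so the sketch runs with plain Ruby.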
I'm mostly a Web developer, and I learned Ruby to use Rails, but I like the language so much that I started developing a desktop Swing application in Ruby, using JRuby and Monkeybars. I'm competent in Java, but don't much like using it, and the Swing API is horrible, so putting Ruby on top has been a big win.
We mainly use rails, but we have plenty of other non-rails ruby things - for example a standalone authentication daemon thing for centralized authentication of users, and an 'image processing server' which runs arbitrary numbers of ruby processes to process images in parallel.
Oh, and don't forget good old Rake :-)
Ruby is also used for desktop applications, especially JRuby for developing Swing desktop applications.
I've used Ruby at work for
A data extractor, generating csv files from binary output.
A .ini file generator, turning a simple syntax into a repetitive .ini format.
A simple TCP/IP server, acting as stand-in for the customer's system during testing.
We use Ruby to implement our test automation software. This includes test framework and driver code for Selenium RC, WATIR and AutoIT.
Ruby is powerful enough to create comprehensive applications that can interface with Test tools like Selenium or WATIR, while at the same time reading from data files, interacting with a remote Windows UI and performing near transparent network communication. All while running on Windows or Linux.
The uncluttered syntax makes it ideal for new and inexperienced programmers to read, while its totally OO nature makes it easy for these same programmers to apply good (recently learned) OO techniques from the start.
The flexible nature of Ruby's syntax also makes the use and creation of DSLs much easier. This allows less-technical people to get involved, and to read and possibly create their own tests.
I have used Ruby for code generation of C# and T-SQL stored procedures in a project with unstable requirements. The data model was encoded in a YAML file and .erb templates were used for the classes and stored procedures. It also allowed for a much more DRY solution than would have been possible with straight C#, as repetitive code could be factored out into a single method in the code generator.
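As a rough illustration of that YAML-plus-ERB approach, here is a minimal sketch that renders a C# class from a small data model. The model layout (`class`, `fields`, `name`, `type`), the template and the method name `generate_class` are assumptions for the example; the real project's files are not shown in this thread.

```ruby
require 'yaml'
require 'erb'

# Hypothetical data model; in the project described it lived in a YAML file.
MODEL_YAML = <<~YAML
  class: Customer
  fields:
    - { name: Id,   type: int }
    - { name: Name, type: string }
YAML

# ERB template for a C# class. With trim_mode '-', a tag written as
# <%- ... -%> leaves no blank line behind in the output.
CSHARP_TEMPLATE = <<~ERB
  public class <%= model['class'] %>
  {
  <%- model['fields'].each do |field| -%>
      public <%= field['type'] %> <%= field['name'] %> { get; set; }
  <%- end -%>
  }
ERB

# Parse the model and render the template against it.
def generate_class(yaml_text)
  model = YAML.safe_load(yaml_text)
  ERB.new(CSHARP_TEMPLATE, trim_mode: '-').result(binding)
end

puts generate_class(MODEL_YAML)
```

Adding a field to the YAML adds a property to the generated class, and the same model could feed a second template for a stored procedure, which is where the DRY benefit comes from.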
Where I work, we use Ruby to do a number of different one-off type batch jobs. One example of that is a job that interacts with Amazon's S3 service. At the time, the Ruby S3 library was probably the easiest one out there for us to get up and running in a short amount of time.
I wrote an order processing expert system (see DSL answer as well), converted 100k lines of customer specific perl into about 10k lines of ruby handling dozens of customers. No web components at all, no Rails.
I am a WebDriver user. WebDriver uses Ruby to automate its build process, thanks to Rake. See http://code.google.com/p/webdriver/ for details.
Heh, great question.
I used Ruby to convert Excel spreadsheet airport facility data to sqlite3 for the android phone platform while making an app for pilots.
I use Ruby with Sinatra which is much simpler than Rails. I did use Rails but just found that it has turned into a bit of a monster, although Rails is still amazing compared to web frameworks available for Java.
The main features of Ruby that I love, however, are "eval" and "method_missing", which Rails actually uses, for example in ActiveRecord, so that you can use the amazing "find_by_<field_name>" queries.
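To show the idea behind such dynamic finders, here is a minimal sketch of `method_missing` on a plain in-memory collection. It is not how ActiveRecord is implemented; the `Table` class, the `finder_field` helper and the sample rows are invented for the example.

```ruby
# Hypothetical in-memory "table"; it only borrows the find_by_<field> naming idea.
class Table
  def initialize(rows)
    @rows = rows
  end

  # Intercept calls such as find_by_title("Home") and turn them into a lookup.
  # Anything that does not look like a finder goes to the normal error path.
  def method_missing(method_name, *args)
    field = finder_field(method_name)
    return super unless field
    @rows.find { |row| row[field] == args.first }
  end

  # Keep respond_to? consistent with what method_missing accepts.
  def respond_to_missing?(method_name, include_private = false)
    !finder_field(method_name).nil? || super
  end

  private

  # Returns the field as a symbol for names like find_by_title, else nil.
  def finder_field(method_name)
    match = /\Afind_by_(\w+)\z/.match(method_name.to_s)
    match && match[1].to_sym
  end
end

PAGES = Table.new([
  { title: 'Home',  author: 'peter' },
  { title: 'Setup', author: 'anna' }
])

p PAGES.find_by_title('Setup')
p PAGES.find_by_author('peter')
```

No `find_by_title` method is ever defined; the call falls through to `method_missing`, which parses the method name at run time.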
I used Ruby for a lot of back-end code simply because I was the only person who was tasked to do it and needed a nice clean language that allowed me to be very productive and write easy to maintain code. I find Ruby allows me to do that easier than Perl and Python. Other people's mileage might vary on that but it works well for me.
Besides that, I like how Sequel and Nokogiri work. I also used ActiveRecord for a while separately from Rails.
We use some Ruby for file manipulation but have not been able to incorporate rails yet.
I've used Ruby a lot professionally for quick scripts for things like shuffling files around. I'm the same way in that I was using Ruby first before touching Rails at all.
In Boulder there was an excellent group of Ruby users who met monthly. This point was made - that Ruby does have an existence beside its use in Rails. Plain Ruby users do exist, are begging for attention, have neat things to show, and can find each other at user group meetings.
They also had better pizza than the Python group, who also met on the same day of the month. Can only pick one...
While we do have several Rails apps at work, we also use Ruby for some fairly intensive non-web stuff.
We've got an SMS delivery daemon, which pulls messages from a queue and then delivers them, and a credit card processing daemon which other apps can call out to, which makes sure there's a central audit trail.

How to implement in-process full text search engine

In one of our commercial applications (Win32, written in Delphi) we'd like to implement full-text search. The application stores user data in a binary format that is not directly recognizable as text.
Ideally, I'd like to find either an in-process solution (a DLL would be OK) or a local server that I could access via TCP (preferably). The API should allow me to submit textual information to the server (along with the metadata representing the binary blob it came from) and, of course, it should allow me to do a full-text search with at least minimal support for logical operators and substring searching. Unicode support is required.
I found an extensive list of search engines on Stack Overflow (What are some Search Servers out there?), but I don't really understand which of those engines could satisfy my needs. I thought of asking The Collective for an opinion before I spend a day or two testing each of them.
Any suggestions?
There are a number of options on the market, either fully fledged commercial products or open source variants. Your choice of a search provider is very dependent on the customers you are targeting.
Microsoft has a free Express version of their Search Server. As far as I know the Express edition is limited to running the Application Tier on one server.
There is also the Apache Lucene project which is open source. It has a nice API that's easy to use and a large community of users. The original project is based on Java, but there are also other implementations such as NLucene for .NET that I have used personally.
I'd recommend having a look at SQLite -- full-text search is included in the latest version.
I suppose the answer depends on your db. For example SQL Server has full text search and also English Language Queries if ever needed.
Take a look at using PostgreSQL and tsearch.
Try using PostgreSQL with tsearch.
Sphinx is probably the most efficient and scalable option while SQLite - FTS3 is the most straightforward option.
While not in-process, Solr is very fast (based on Lucene) and easily accessible from any platform (HTTP)

Resources