Fast Text Search Over Logs - algorithm

Here's the problem I'm having: I've got a set of logs that can grow fairly quickly. They're split into individual files every day, and the files can easily grow to a gig in size. To help keep the size down, entries older than 30 days or so are cleared out.
The problem is when I want to search these files for a certain string. Right now, a Boyer-Moore search is far too slow. I know that applications like dtSearch can provide a really fast search using indexing, but I'm not sure how to implement that without taking up twice the space the logs already occupy.
Are there any resources I can check out that could help? I'm really looking for a standard algorithm that explains how to build an index and use it to search.
Edit:
Grep won't work, as this search needs to be integrated into a cross-platform application. There's no way I'll be able to swing including an external program in it.
The way it works is that there's a web front end with a log browser. This talks to a custom C++ web server backend. That server needs to search the logs in a reasonable amount of time. Currently, searching through several gigs of logs takes ages.
Edit 2:
Some of these suggestions are great, but I have to reiterate that I can't integrate another application; it's part of the contract. But to answer some questions: the data in the logs is either messages received in a healthcare-specific format or messages relating to those. I'm looking to rely on an index because, while it may take up to a minute to rebuild the index, searching currently takes a very long time (I've seen it take up to 2.5 minutes). Also, a lot of the data IS discarded before it's even recorded: unless some debug logging options are turned on, more than half of the log messages are ignored.
The search basically goes like this: a user on the web form is presented with a list of the most recent messages (streamed from disk as they scroll, yay for AJAX). Usually they'll want to search for messages with some information in them, maybe a patient ID or some string they've sent, so they enter that string into the search. The search gets sent asynchronously, and the custom web server linearly searches through the logs 1 MB at a time for results. This process can take a very long time when the logs get big, and it's what I'm trying to optimize.

grep usually works pretty well for me with big logs (sometimes 12 GB+). You can find a version for Windows here as well.

Check out the algorithms that Lucene uses to do its thing. They aren't likely to be very simple, though. I had to study some of these algorithms once upon a time, and some of them are very sophisticated.
If you can identify the "words" in the text you want to index, just build a large hash table of the words which maps a hash of each word to its occurrences in each file. If users repeat the same search frequently, cache the search results. When a search is done, you can then check each candidate location to confirm that the search term actually falls there, rather than just a word with a matching hash.
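As a rough illustration of that structure (sketched in JavaScript for brevity, since your backend is C++; buildIndex and searchLogs are made-up names, and a real implementation would persist the index to disk rather than hold all the file text in memory):
// Sketch of a word -> occurrences index over a set of log files.
// files is assumed to look like [{ name: "2015-06-01.log", text: "..." }].
function buildIndex(files) {
    var index = {};                                // word -> [{ file, offset }]
    files.forEach(function (file) {
        var re = /\w+/g, match;
        while ((match = re.exec(file.text)) !== null) {
            var word = match[0].toLowerCase();
            (index[word] = index[word] || []).push({ file: file.name, offset: match.index });
        }
    });
    return index;
}
function searchLogs(index, files, term) {
    var hits = index[term.toLowerCase()] || [];    // candidate locations from the index
    return hits.filter(function (hit) {            // confirm the term really occurs there
        var text = files.filter(function (f) { return f.name === hit.file; })[0].text;
        return text.substr(hit.offset, term.length).toLowerCase() === term.toLowerCase();
    });
}
The same shape (a hash map from word to a postings list of file/offset pairs, rebuilt or appended to as files roll over) carries straight across to C++.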
Also, who really cares if the index is larger than the files themselves? If your system is really this big, with so much activity, is a few dozen gigs for an index the end of the world?

You'll most likely want to integrate some type of indexing search engine into your application. There are dozens out there, Lucene seems to be very popular. Check these two questions for some more suggestions:
Best text search engine for integrating with custom web app?
How do I implement Search Functionality in a website?

More details on the kind of search you're performing could definitely help. Why, in particular, do you want to rely on an index, since you'll have to rebuild it every day when the logs roll over? What kind of information is in these logs? Can some of it be discarded before it is ever even recorded?
How long are these searches taking now?

You may want to check out the source for BSD grep. You may not be able to rely on grep being there for you, but nothing says you can't recreate similar functionality, right?

Splunk is great for searching through lots of logs. May be overkill for your purpose. You pay according to the amount of data (size of the logs) you want to process. I'm pretty sure they have an API so you don't have to use their front-end if you don't want to.

Related

XPages performance - 2 apps on same server, 1 runs and 1 doesn't

We have been having a bit of a nightmare this last week with a business-critical XPages application. All of a sudden it has started crawling really badly, to the point where I have to reboot the server daily, and even then some pages can take 30 seconds to open.
The server has 12 GB RAM and 2 CPUs; I am waiting for another 2 to be added to see if this helps.
The database has around 100,000 documents in it, with no more than 50,000 displayed in any one view.
The same database, set up as a training application with far fewer documents on the same server, always responds, even when the main copy is crawling.
There are a number of view panels in this application - I have read these are really slow. Should I get rid of them and replace with a Repeat control?
There are also Readers fields on the documents containing roles, and Authors fields, as it's a workflow application.
I removed quite a few unnecessary views from the back end over the weekend to help speed it up but that has done very little.
Any ideas where I can check to see what's causing this massive performance hit? It's only really become unworkable in the last week but as far as I know nothing in the design has changed, apart from me deleting some old views.
Try to get more info about the state of your server and application.
Hardware troubleshooting is summarized here: http://www-10.lotus.com/ldd/dominowiki.nsf/dx/Domino_Server_performance_troubleshooting_best_practices
Since, in your experience, only one of the two applications has slowed down, it is more likely a code problem. The best thing is to profile your code: http://www.openntf.org/main.nsf/blog.xsp?permaLink=NHEF-84X8MU
To go deeper, you can start to look for semaphore locks: http://www-01.ibm.com/support/docview.wss?uid=swg21094630, look at javadumps: http://lazynotesguy.net/blog/2013/10/04/peeking-inside-jvms-heap-part-2-usage/, look at NSDs: http://www-10.lotus.com/ldd/dominowiki.nsf/dx/Using_NSD_A_Practical_Guide/$file/HND202%20-%20LAB.pdf, and look at the garbage collector (see "Best setting for HTTPJVMMaxHeapSize in Domino 8.5.3 64 Bit").
This presentation gives a good overview of Domino troubleshooting (among many others on the web).
OK, so we resolved the performance issues by doing a number of things. I'll list the changes we made in order of the improvement gained, starting with the simple tweaks that weren't really noticeable.
Defragged the Domino drive - it was showing as 32% fragmented and I thought I was on to a winner, but it was really no better after the defrag, even though IBM docs say even 1% fragmentation can cause performance issues.
Reviewed all the main code in the application and took out a number of needless lookups where they could be replaced with applicationScope variables. For instance, on the search page one of the drop-downs gets its choices by doing an @Unique lookup on all documents in the database. Changed it to a keyword and put that in the applicationScope.
Removed multiple checks on database.queryAccessRoles and put the user's roles in the sessionScope.
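To illustrate the kind of caching in the last two points, here is a rough SSJS sketch (the view name, scope variable names, and role are made up; treat it as an outline rather than the code from our app):
// Cache the drop-down choices once per application instead of doing a
// lookup across every document on each page load.
if (applicationScope.searchChoices == null) {
    applicationScope.searchChoices = @Unique(@DbColumn(@DbName(), "LookupView", 1));
}
// Cache the current user's roles once per session instead of calling
// database.queryAccessRoles() on every check.
if (sessionScope.userRoles == null) {
    sessionScope.userRoles = database.queryAccessRoles(session.getEffectiveUserName());
}
var isApprover = sessionScope.userRoles.contains("[Approver]");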
The DB had 103,000 documents - 70,000 of them were tiny little docs with about 5 fields on them. They don't need to be in the FTIndex, so we moved them into a separate database and pointed the data source to that DB when those docs were needed. The FTIndex went from 500 MB to 200 MB = faster indexing and searches, but the overall performance of the app was still rubbish.
The big one - I finally got around to checking the application properties, Advanced tab. I set the following options:
Optimize document table map (ran a copy-style compact)
Don't overwrite free space
Don't support specialized response hierarchy
Use LZ1 compression (ran a copy-style compact with the option to change existing attachments, -ZU)
Don't allow headline monitoring
Limit entries in $UpdatedBy and $Revisions to 10 (as per the Domino documentation)
And also don't allow the use of stored forms.
Now I don't know which one of these options was the biggest gain, and not all of them will be applicable to your own apps, but after doing this the application flies! It's running like there are no documents in there at all: views load super fast, documents open like they should - quickly - and everyone is happy.
Until the HTTP threads get locked out - that's another question of mine that I am about to post, so please take a look if you have any idea of what's going on :-)
Thanks to all who have suggested things to try.

How to deactivate safe mode in the mongo shell?

The short question is in the title: I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
Long question, for those who want to know the context:
I am working on a huge set of data like
{
    _id: ObjectId("azertyuiopqsdfghjkl"),
    stringdate: "2008-03-08 06:36:00"
}
and some other fields. There are about 250M documents like that (the whole database with its indexes weighs 36 GB). I want to convert the date into a real ISODate field. I searched a bit for how I could make an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but did not find how to make this work, so I resigned myself to writing a script that takes the documents one after the other and makes an update to set a new field whose value is new Date(stringdate). The query uses the _id, so the default index is used.
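Roughly, the per-document script looks like this (a simplified sketch, not the exact script; the $exists filter is just there so it can be restarted):
db.data.find({ date: { $exists: false } }, { stringdate: 1 }).forEach(function (doc) {
    db.data.update(
        { _id: doc._id },                                  // hits the default _id index
        { $set: { date: new Date(doc.stringdate) } }
    );
});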
The problem is that it takes a very long time. I have already figured out that if only I had inserted empty date fields when I created the database, I would now get better performance, since documents have to be relocated when a new field is added. I also set an index on a relevant field so I can process the database chunk by chunk. Finally, I ran several concurrent mongo clients on both the server and my workstation to make sure that the limiting factor is database lock availability and not any other factor like CPU or network costs.
I monitored the whole thing with mongotop, mongostat and the web monitoring interfaces, which confirmed that the write lock is taken 70% of the time. I am a bit disappointed that MongoDB does not have more precise granularity for its write lock; why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have sharded the collection into a dozen shards, even while staying on the same server, because there would have been an individual lock on each shard.
But since I can't do anything right now about the current database structure, I looked into how to improve performance so that at least 90% of my time is spent writing in mongo (up from 70% currently). I figured out that since I run my script in the default mongo shell, every time I make an update there is also a getLastError() call afterwards, which I don't want, because there is a 99.99% chance of success, and even in case of failure I can still run an aggregation query after the end of the big process to retrieve the few exceptions.
I don't think I would gain that much performance by deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behavior, but not the procedure for changing it. Any suggestions?
I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) ( http://docs.mongodb.org/manual/reference/method/db.getLastError/ ) to do what you want, but it won't help.
This is because, for one:
make a script that takes the documents one after the other and makes an update to set a new field whose value is new Date(stringdate)
When the shell is used in a non-interactive mode, like within a loop, it doesn't actually call getLastError(). As such, lowering your write concern to 0 will do nothing.
I have already figured out that if only I had inserted empty date fields when I created the database, I would now get better performance, since documents have to be relocated when a new field is added.
I did tell people, when they asked about this stuff, to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug, but I do. That's an unfortunate side effect of being right when you were told you were wrong.
mongostat and the web monitoring interfaces confirmed that the write lock is taken 70% of the time
That's because of all the movement in your documents; that's kind of hard to fix.
I am a bit disappointed mongodb does not have a more precise granularity on its write lock
The write lock doesn't actually denote the concurrency of MongoDB; this is another common misconception that stems from transactional SQL technologies.
For one, write locks in MongoDB are mutexes.
Not only that, but there are numerous rules which dictate that operations will yield to queued operations under certain circumstances: one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have got yourself stuck between a rock and a hard place, and there is no easy way out. This does happen.

Most performant live search technique for mobile safari

I am building a mobile web application that targets webkit. I have a requirement to perform a live search (on keypress) against a database of ~5000 users.
I've tried a number of different techniques:
On page load, making an AJAX call which loads an in-memory representation of all 5000 users and querying them on the client. I tried sending JSON, which proved to be too large, and also a custom delimited string, which was then parsed using split(). This was better, but ultimately searches against this array of users were slow.
I tried using a conventional AJAX call, which would return users based on a query, also using the custom delimited string technique. This was better, but I was forced to tune it so that searches were only performed with a minimum of 3 characters. This is not optimal, as I would like to be able to start filtering after 1 character. I could also throttle the calls so that not every keystroke within a certain threshold triggered a request. This could help with performance, but I'd rather not have to fiddle with that sort of thing.
Facebook mobile does this very well if you try their friend search. Searches happen instantaneously, and are triggered after 1 character.
My question is, does anyone have any suggestions for faster live searches for a mobile app? Should I be looking at localStorage? Is this reliable, feasible?
Is there any reason you can't use a binary search? If you keep the names sorted, all the names matching a given prefix will sit in one contiguous block, so you can binary-search for the start of that block. If you want both first- and last-name search, you could create a second copy of the data sorted by last name and look in both sets.
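A rough sketch of that idea in plain JavaScript (it assumes the names arrive pre-sorted and lowercased; prefixSearch is a made-up name):
// Binary-search for the first name >= the query, then walk forward while
// the prefix still matches; all matches sit in one contiguous block.
function prefixSearch(names, query) {
    var q = query.toLowerCase();
    var lo = 0, hi = names.length;
    while (lo < hi) {
        var mid = (lo + hi) >> 1;
        if (names[mid] < q) lo = mid + 1; else hi = mid;
    }
    var results = [];
    for (var i = lo; i < names.length && names[i].lastIndexOf(q, 0) === 0; i++) {
        results.push(names[i]);
    }
    return results;
}
// e.g. prefixSearch(["alice", "bob", "bonnie", "carol"], "bo") -> ["bob", "bonnie"]
Even at ~5000 entries this is cheap enough to run on every keystroke.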
Some helpful but more complicated data structures that address this type of problem include:
http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
http://en.wikipedia.org/wiki/Trie

How can I quickly search my code using Windows?

I've got the same problem as in this question, except on Windows. Our product has a 100+ MB code base, and searching for stuff in there takes an awful amount of time (several minutes). It's nice when you can narrow your search to a specific subfolder, but that isn't always possible.
I was wondering if there is some tool that would make it faster, probably by indexing. Accuracy is paramount: if a substring exists somewhere, it must be found, even if the file is not indexed or the index is out of date. It would also be ideal if .svn folders were ignored when searching.
Failing that, I was wondering if I could make something like that myself. Is there maybe a ready made indexing engine available for such tasks? I was wondering about Windows Indexing Service (or whatever it is called these days), but so far my experience with it (the Windows standard file search facility) has been rather dismal, with it often missing files that were right in front of its nose.
Yes, I have seen Windows Indexing Service miss files too, but I haven't checked the KBs or user forums for explanations. I'm glad to see it confirmed that it's not just me ;-)!
There look to be a lot of file-indexing programs available; I would be surprised if you can't find one that meets your needs (although, see later).
Here are some things to consider:
If your team is using an IDE, isn't there an index feature/plug-in? (Do none of the SVN tools provide indexing capabilities?) Also, add some tags to your question so this will be seen by other Windows developers using the same dev environment that you are using.
The SO link you provided mentions several options: slocate, rlocate, and I found mlocate. The Wikipedia page for slocate says
Locate32 for Windows - Windows analog of GNU locate with GUI, released under GNU license
which seems to meet your main requirement. Looking at the screenshots of the multi-tab interface (one tab labeled Advanced) gives me hope that you can exclude .svn (at least from the results, possibly from what is indexed).
Your requirement that "if a substring exists somewhere, it must be found, even if the file is not indexed or the index is out of date" seems contradictory. For the substring requirement, I can see many indexing programs ignoring C-language syntax elements ( {([])}, etc.), and, for example, 'then' being either removed because it is considered a noise word, or stemmed down to 'the' and then removed because it is a noise word.
To get to 'must be found', and really be sure, you would have to develop a test suite to see what the indexing program does with anything that is a corner case. (For a 100 MB code base that's not out of the question, especially since you are considering rolling your own.)
Finally, 'even if the file is not indexed ...'. Well, you either use an index or you don't (obviously). Unfortunately for your requirement, while rlocate looks for changes all the time, slocate (on Unix) doesn't seem to. Probably if you read/check the docs or user forums for Locate32 you'll get the answers you need.
rlocate would give you what you need, but an rlocate page says 'rlocate will work only on Linux with version 2.6.' mlocate doesn't seem to have a Windows port either.
Finally, here is an interesting link I found about mlocate: mlocate vs rlocate. This is the Google cache, because the page on redhat.com said 'not available'.

Thinking sphinx returning bad results (many documents missing)

I am developing a Ruby on Rails application which uses Thinking Sphinx. Unfortunately, from time to time (a few times per month) the search tends to return bad results (many documents missing). Reindexing helps, but this is not a solution for production.
I am experiencing bad results even when I type simple queries into the Rails console (like ThinkingSphinx.search 'skalee'). The Sphinx search tool returns proper results, so indexing apparently works properly.
When I type ThinkingSphinx.search('skalee').results[:words], I see the proper numbers of hits (for example, the term found in 30 documents), but ThinkingSphinx.search('skalee').results[:matches] contains, let's say, 2 documents. The numbers in results[:words] are equal to those I get with search.
I am using delayed delta, but this problem appears even when I am not running ts:dd.
Just happened to stumble across this:
http://freelancing-god.github.com/ts/en/common_issues.html#deltas
Perhaps your user permissions are off?
Thinking Sphinx (or Delayed Delta, I don't remember which) adds a special internal attribute (sphinx_deleted or something like that) to all models. It is used to filter out destroyed records. Unfortunately, it misbehaves from time to time. After modifying the gem (getting rid of this attribute), everything works fine. Of course, I need to wait for the full reindexing (which I perform every night) to remove destroyed records from the indexes, but this is a minor disadvantage in my case. Alternatively, I could use Sphinx's kill list feature to filter out removed entries.
