How can I quickly search my code using Windows?

I've got the same problem as in this question, except on Windows. Our product has a 100+ MB code base, and searching for stuff in there takes an awfully long time (several minutes). It's nice when you can narrow your search to a specific subfolder, but that isn't always possible.
I was wondering if there is some tool that would make this faster, probably by indexing. Accuracy is paramount: if a substring exists somewhere, it must be found, even if the file is not indexed or the index is out of date. It would also be ideal if .svn folders were ignored when searching.
Failing that, I was wondering if I could build something like that myself. Is there maybe a ready-made indexing engine available for such tasks? I have considered the Windows Indexing Service (or whatever it is called these days), but so far my experience with it (the standard Windows file search facility) has been rather dismal, with it often missing files that were right in front of its nose.

Yes, I have seen the Windows Indexing Service miss files too, but I haven't checked the KBs or user forums for explanations. I'm glad to see it confirmed that it's not just me ;-)!
There look to be a lot of file-indexing programs available; I would be surprised if you couldn't find one that meets your needs (although, see below).
Here are some things to consider:
If your team is using an IDE, isn't there an indexing feature or plug-in? (None of the SVN clients provide indexing capabilities?) Also, add some tags to your question so it will be seen by other Windows developers using the same dev environment as you.
The SO link you provided mentions several options: slocate, rlocate, and I found mlocate. The Wikipedia page for slocate says
Locate32 for Windows - Windows analog of GNU locate with GUI, released under GNU license
which seems to meet your main requirement. Looking at the screenshots, the multi-tab interface (one tab labeled Advanced) gives me hope that you can exclude .svn (at least from the results, possibly from what is indexed).
Your requirement that "if a substring exists somewhere, it must be found, even if the file is not indexed or the index is out of date" seems contradictory. On the substring side, I can see many indexing programs ignoring C-language syntax elements ( {([])}, etc. ), and, for example, 'then' being either removed because it is considered a noise word, or stemmed down to 'the' and THEN removed because that is a noise word.
To get to 'must be found', and really be sure, you would have to develop a test suite to see what the indexing program does for anything that is a corner case. (For a 100 MB code base, that is not out of the question, especially since you are considering rolling your own.)
Finally, 'even if the file is not indexed ...'. Well, you either use an index or you don't (obviously). Unfortunately for your requirement, while rlocate looks for changes all the time, slocate (on Unix) doesn't seem to. If you read through the docs or user forums for Locate32, you'll probably get the answers you need.
Rlocate would give you what you need, but according to an rlocate page, 'rlocate will work only on Linux with version 2.6.' mlocate doesn't seem to have a Windows port either.
Finally, here is an interesting link I found about mlocate: mlocate vs rlocate. This is the Google cache, because the redhat.com page said 'not available'.
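To make the "index for speed, but never miss a substring" idea concrete, here is a rough Python sketch of one way a tool could satisfy it (this is my own illustration, not how Locate32 or mlocate actually work, and the file handling is simplified): index every 3-character substring (trigram) of each file, intersect the trigram sets of a query to get candidate files, verify by actually reading the candidates, and unconditionally re-scan anything modified after the index was built so a stale index cannot hide a match. The .svn folders are skipped during the walk.

    import os
    import time

    def trigrams(text):
        """All 3-character substrings of the text, lowercased."""
        t = text.lower()
        return {t[i:i + 3] for i in range(len(t) - 2)}

    def source_files(root):
        """Yield file paths under root, skipping .svn folders."""
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames[:] = [d for d in dirnames if d != ".svn"]
            for name in filenames:
                yield os.path.join(dirpath, name)

    def build_index(root):
        """Map every trigram to the set of files that contain it."""
        index, built_at = {}, time.time()
        for path in source_files(root):
            try:
                text = open(path, "r", errors="ignore").read()
            except OSError:
                continue
            for gram in trigrams(text):
                index.setdefault(gram, set()).add(path)
        return index, built_at

    def search(root, index, built_at, needle):
        if len(needle) < 3:
            candidates = set(source_files(root))    # too short for trigrams: scan everything
        else:
            candidates = set.intersection(*[index.get(g, set()) for g in trigrams(needle)])
        # Correctness guard: anything added or modified after the index was built
        # gets scanned regardless, so a stale index cannot hide a match.
        candidates |= {p for p in source_files(root) if os.path.getmtime(p) >= built_at}
        hits = []
        for path in candidates:
            try:
                if needle.lower() in open(path, "r", errors="ignore").read().lower():
                    hits.append(path)
            except OSError:
                continue
        return hits

The verification pass is what makes the result exact; the index only narrows down which files are worth reading.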

Related

LinkedIn bot: search profiles for keywords and export to Word

I was thinking about a way to partly automate my job and have to look through fewer LinkedIn profiles.
So here is my question. Would it be possible to write a program that would search LinkedIn with your keywords, have the program automatically click through the profiles, and when it opens a profile, search for each individual keyword and keep count, then export the number of times the keywords are mentioned to a Word document, then go back and do the same for each profile? I have no idea what language could do this, and I only have a high-school class's worth of JavaScript, so I would be teaching myself how to do this. I could run this program at night, come back in the morning, and be able to look through the best profiles and waste less time looking through ones where people do not have the experience they say they do.
Basically it would go:
Execute search
click first profile
find total number of keywords
export to word
click back or return to results button
next profile
repeat for, say, 300 profiles.
I don't know how feasible this would be to write, or if it's even really possible. Thanks for the helpful replies!
I got some help on Reddit, and the replier said it would probably be easiest in Ruby/RubyGems?
Your best option would probably be to use a process called "scraping": you extract the HTML from the page and sort through it for useful information.
Programming languages are like religions; different people say different languages are the best. For parsing HTML, most people (not all) would agree that a high-level language like Ruby or Python would be best. However, you did specify Ruby, so start by installing it.
After installing Ruby (see here), run gem install nokogiri
You can look for general guides on nokogiri here. Start by looking at the page source and seeing where the interesting information is (e.g. links to the profiles on the search-results page). 300 profiles should be no problem. However, when you are testing, make sure you only try 3 or 4 profiles at a time. A program requesting 300 pages being run many times may get noticed, but a one-time run should be fine (no guarantees).
Also, I would not recommend exporting to Word. You can scan the raw text for keywords, and it will be much faster.
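To make the scrape-and-count flow concrete, here is a minimal sketch (written in Python purely for illustration; the Ruby/nokogiri version has exactly the same shape, and the URLs and keyword list below are placeholders, not real LinkedIn endpoints): fetch each profile page, strip the markup, count the keyword occurrences, and append the counts to a plain-text report instead of a Word document.

    import re
    import urllib.request

    keywords = ["python", "sql", "aws"]                # your search keywords
    profile_urls = ["https://example.com/profile1",    # stand-ins for the profile links
                    "https://example.com/profile2"]    # collected from the results page

    def visible_text(html):
        """Crude tag stripper; a real scraper would use nokogiri or BeautifulSoup."""
        return re.sub(r"<[^>]+>", " ", html).lower()

    with open("keyword_counts.txt", "w") as report:
        for url in profile_urls:
            try:
                html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
            except OSError as exc:                     # skip pages that fail to load
                report.write("%s\tERROR: %s\n" % (url, exc))
                continue
            text = visible_text(html)
            counts = {kw: text.count(kw.lower()) for kw in keywords}
            report.write("%s\t%s\n" % (url, counts))

Collecting the profile_urls list is where nokogiri (or any HTML parser) earns its keep: you parse the search-results page and pull out the profile links before looping over them like this.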
As a final note, this will take a long time. From the sound of it, you haven't programmed much before (although your previous experience with JavaScript will help). A lot of your time will most likely be spent reading through tutorials and searching for your problem on Google. Feel free to come back here when you have specific problems, and good luck!

Exact code segment size for a Windows process

The Linux file /proc/{pid}/status, as we know, gives a fine-grained memory footprint for a particular process. One of the parameters it reports is 'VmExe', the size of the text segment of the process. I'm particularly interested in this field but am stuck in a Windows environment with no proc file system to help me. Cygwin mimics most of procfs, but the {pid}/* files seem to be one of the parts that Cygwin ignores. I tried playing around with the VMMap tool from Windows Sysinternals, but the closest field I found was 'Private Data Size' on a private working set. I'm not really sure if this is what I'm looking for.
I would take a look at vmmap.exe from Sysinternals and see if it displays the information you are looking for, for a given process. If the information you are seeking is displayed there, you could take a look at the API calls the application uses, or ask on the Sysinternals forums on MSDN. I know this isn't exactly what you were looking for in an answer, but it hopefully points you in the right direction.
If you are talking about the .text section in the PE itself, you can get that information from the dbghelp library, and in a number of other ways (there are a few libraries floating around for binary analysis).
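As one quick way to look at the .text section without going through dbghelp, here is a short sketch using the third-party Python module pefile (pip install pefile); the executable path is just a placeholder, and note that the raw on-disk size and the virtual (in-memory) size of the section are not the same thing.

    # Read the PE section table and report the sizes of the .text (code) section.
    import pefile

    pe = pefile.PE(r"C:\Windows\System32\notepad.exe")   # placeholder path
    for section in pe.sections:
        name = section.Name.rstrip(b"\x00").decode("ascii", "ignore")
        if name == ".text":
            print("raw size on disk:", section.SizeOfRawData)
            print("virtual size    :", section.Misc_VirtualSize)

The virtual size is probably the closer analog of what VmExe reports on Linux, since VmExe describes the in-memory mapping of the executable's code.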

Automatically take a list of terms, run them through the Windows search function (content search), and export lists of results (AutoIt?)

My next big challenge is to write a script (I assume it would be in AutoIt, an area where I have little experience) to automate the Windows search function.
The end goal is to take a list of search terms from a .txt file (one string per line), and search the contents of every document on the computer for said search terms (one at a time).
I can make this happen by hand - turn on the search by content function, index all files on all attached drives, search the terms one by one, and highlight all > shift-click > Copy as path > paste in notepad, and save as [searchterm].txt.
However, I need to automate that whole process. I understand that I might need to write a separate script for each version of Windows it would be used with (XP, Vista, 7, 8).
Is this an easy enough task to accomplish, or would it take a lot of programming hours? Can anyone point me in the right direction? All help is appreciated.
Well, assuming your text file of queries is large enough and you don't want to iterate over the entire file system for each one, you are describing a classic information retrieval problem.
Index the data from your file system (a preprocessing step that is done only once).
For each query, search the index and get the relevant documents.
The field of Information Retrieval is a huge area of research, and I really don't encourage you to try implementing it from scratch.
I do encourage using existing libraries that have already been developed and tested to do it for you. For example, in Java a popular choice is Lucene, which is very widely used for search everywhere.
If you are not familiar with Java, I am also aware of Python (PyLucene) and .NET (Lucene.NET) bindings of this library.
To learn more about information retrieval, I recommend Manning's Introduction to Information Retrieval.
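If Python is on the table, here is a rough sketch of that two-step flow using the Whoosh library, a pure-Python, Lucene-style indexer (this is just one possible choice; the paths, the schema, and the searchterms.txt/[searchterm].txt file names are illustrative, mirroring the manual process described in the question): index every document once, then run each term from the query file against the index and write the matching paths out per term.

    # Index once, then answer many queries (pip install whoosh).
    import os
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.index import create_in
    from whoosh.qparser import QueryParser

    docs_root, index_dir = r"C:\docs", "indexdir"      # placeholder locations

    # 1. Build the index (done once, or whenever the documents change).
    os.makedirs(index_dir, exist_ok=True)
    ix = create_in(index_dir, Schema(path=ID(stored=True, unique=True), content=TEXT))
    writer = ix.writer()
    for dirpath, _, filenames in os.walk(docs_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                text = open(path, "r", errors="ignore").read()
            except OSError:
                continue
            writer.add_document(path=path, content=text)
    writer.commit()

    # 2. Run every term from the query file, writing one result file per term.
    with ix.searcher() as searcher:
        for term in open("searchterms.txt"):
            term = term.strip()
            if not term:
                continue
            hits = searcher.search(QueryParser("content", ix.schema).parse(term), limit=None)
            with open(term + ".txt", "w") as out:
                out.write("\n".join(hit["path"] for hit in hits))

Note that this gives word-level matching (much like the Windows content search), not raw substring matching, and a term that is not a valid file name would need sanitising before being used as [searchterm].txt.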

Determining whether mdworker (Spotlight) has completed first scan

How do I determine that mdworker (Spotlight) has completed its first scan? I'm basically looking for the point at which the little "." in the Spotlight search icon goes away and you'd be able to perform searches. (Obviously the OS has a way to determine this, since it displays the dot until it's ready...) I'm not seeing anything from mdutil, and I can't find anything in the Spotlight APIs.
I'm currently forcing my own scan synchronously using mdimport, but this introduces a long delay (from minutes to hours depending on how aggressive I'm being about where to search) and duplicates work that mdworker is already doing.
Any solution (programmatic, scripted, documented, or undocumented) is fair game here.
I opened a DTS incident for this with Apple. The answer is that there is no supported way to do this as of 10.7. The "little dot" that the Spotlight search icon uses is controlled through a private interface.
My goal has been to get an inventory of installed applications.
My current solution is to gather a list of all the apps in /Applications using fts, searching for things named ".app" and pruning as I go so I don't pick up sub-applications. (This would be easier to do with NSDirectoryEnumerator, but this particular piece of code is in C++ with Core Foundation. It would also be easier with CFURLEnumerator, but I need to support 10.4. So fts is fine.)
Scanning for this list is very fast. Once I know the minimum number of apps on the box, I compare that to what system_profiler outputs. If system_profiler tells me that there are fewer apps than I know are in /Applications, then I scan all the bundles myself. Otherwise, I use the output from system_profiler.
This isn't ideal, but it's a decent heuristic, is "mostly" right, and prevents drastic underreporting of applications.
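The scan described above is written in C++ against fts; purely as an illustration of the same walk-and-prune idea, here is what it looks like in Python (the system_profiler comparison is left out):

    # List top-level .app bundles under /Applications, pruning each bundle as soon
    # as it is found so sub-applications nested inside it are never descended into.
    import os

    def app_bundles(root="/Applications"):
        apps = []
        for dirpath, dirnames, _ in os.walk(root):
            keep = []
            for d in dirnames:
                if d.endswith(".app"):
                    apps.append(os.path.join(dirpath, d))   # record the bundle, don't descend
                else:
                    keep.append(d)                          # ordinary folder: keep walking it
            dirnames[:] = keep                              # prune the bundles in place
        return apps

    print(len(app_bundles()), "application bundles found")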

Fast Text Search Over Logs

Here's the problem I'm having: I've got a set of logs that can grow fairly quickly. They're split into individual files every day, and the files can easily grow up to a gig in size. To help keep the size down, entries older than 30 days or so are cleared out.
The problem is when I want to search these files for a certain string. Right now, a Boyer-Moore search is unfeasibly slow. I know that applications like dtSearch can provide a really fast search using indexing, but I'm not really sure how to implement that without taking up twice the space a log already takes up.
Are there any resources I can check out that can help? I'm really looking for a standard algorithm that'll explain what I should do to build an index and use it to search.
Edit:
Grep won't work as this search needs to be integrated into a cross-platform application. There's no way I'll be able to swing including any external program into it.
The way it works is that there's a web front end that has a log browser. This talks to a custom C++ web server backend. This server needs to search the logs in a reasonable amount of time. Currently searching through several gigs of logs takes ages.
Edit 2:
Some of these suggestions are great, but I have to reiterate that I can't integrate another application, it's part of the contract. But to answer some questions, the data in the logs varies from either received messages in a health-care specific format or messages relating to these. I'm looking to rely on an index because while it may take up to a minute to rebuild the index, searching currently takes a very long time (I've seen it take up to 2.5 minutes). Also, a lot of the data IS discarded before even recording it. Unless some debug logging options are turned on, more than half of the log messages are ignored.
The search basically goes like this: a user on the web form is presented with a list of the most recent messages (streamed from disk as they scroll, yay for Ajax). Usually they'll want to search for messages with some information in them, maybe a patient ID or some string they've sent, so they can enter that string into the search. The search gets sent asynchronously, and the custom web server linearly searches through the logs 1 MB at a time for results. This process can take a very long time when the logs get big, and it's what I'm trying to optimize.
grep usually works pretty well for me with big logs (sometimes 12 GB+). You can find a version for Windows here as well.
Check out the algorithms that Lucene uses to do its thing. They aren't likely to be very simple, though. I had to study some of these algorithms once upon a time, and some of them are very sophisticated.
If you can identify the "words" in the text you want to index, just build a large hash table of the words which maps a hash of the word to its occurrences in each file. If users repeat the same search frequently, cache the search results. When a search is done, you can then check each location to confirm the search term falls there, rather than just a word with a matching hash.
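As a rough sketch of that idea (my own illustration, not a drop-in for the C++ backend): map a hash of each word to the byte offsets where it occurs in each log file, and when a query comes in, jump to those offsets and confirm the actual search string is there before reporting a hit.

    # word-hash -> {file: [byte offsets]} index over log files, with a verification
    # step so a hash collision or tokenising quirk can never produce a false hit.
    import re
    import zlib
    from collections import defaultdict

    WORD_RE = re.compile(rb"[A-Za-z0-9_]+")

    def bucket(word):
        """Hash a lowercased word into a small key space; collisions are fine
        because every candidate offset is verified against the real text."""
        return zlib.crc32(word.lower()) & 0xFFFFF

    def index_logs(paths):
        index = defaultdict(lambda: defaultdict(list))
        for path in paths:
            data = open(path, "rb").read()
            for m in WORD_RE.finditer(data):
                index[bucket(m.group())][path].append(m.start())
        return index

    def search(index, term):
        """Return (file, offset) pairs where the term really occurs."""
        needle = term.encode().lower()
        hits = []
        for path, offsets in index.get(bucket(needle), {}).items():
            with open(path, "rb") as f:
                for off in offsets:
                    f.seek(off)
                    if f.read(len(needle)).lower() == needle:   # confirm; don't trust the hash
                        hits.append((path, off))
        return hits

The index stores only hashes and offsets rather than the text itself; the verification read is what keeps the results exact.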
Also, who really cares if the index is larger than the files themselves? If your system is really this big, with so much activity, is a few dozen gigs for an index the end of the world?
You'll most likely want to integrate some type of indexing search engine into your application. There are dozens out there, Lucene seems to be very popular. Check these two questions for some more suggestions:
Best text search engine for integrating with custom web app?
How do I implement Search Functionality in a website?
More details on the kind of search you're performing could definitely help. Why, in particular, do you want to rely on an index, since you'll have to rebuild it every day when the logs roll over? What kind of information is in these logs? Can some of it be discarded before it is ever recorded?
How long are these searches taking now?
You may want to check out the source for BSD grep. You may not be able to rely on grep being there for you, but nothing says you can't recreate similar functionality, right?
Splunk is great for searching through lots of logs. May be overkill for your purpose. You pay according to the amount of data (size of the logs) you want to process. I'm pretty sure they have an API so you don't have to use their front-end if you don't want to.
