Intro: (Feel free to skip it).
I've dealt with the Elastic Stack in a casual way almost daily for the last couple of years, as the two companies I worked for use it to aggregate logs. The problem is that neither company had an "expert" in the Elastic Stack, so any time I need to search through logs I am on my own, with little help from Google.
Usually, I have a trace id field I can search for so I don't have to deal with queries; but when I need to search strings or other things inside other fields everything seems like an overwhelming struggle.
The problem is probably me, but after an insane amount of grind and googling I feel like I am missing something.
Main Issue:
We usually have some logging entries with the format: "methodName, inputOfTheMethod" or "methodName, outputOfTheMethod" or "methodName, errorMessage", or we need to find specific ids for people, customers or account numbers like "0042342672101" inside the logs or certain keywords like "Account Opening"
So, I have a couple of objectives on which I currently waste hours every time I have to do some serious debugging:
Search for the equivalent of "LIKE %methodName, %" in SQL for the "message" field in Kibana.
Search for all the entries that have the string "0042342672101" in any position of the "message" field.
Search for all the entries that have the literal string "Account Opening" in any position of the "message" field. It seems to be especially difficult to search for literal strings with spaces in them.
I did a lot of googling on these issues, but I usually end up either in the official documentation, which offers little help beyond the generic examples, or at similar questions on forums with no answers.
I would really love to solve these specific issues, or even get deeper insight into how to approach them, why this wouldn't work, and what the alternatives are.
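For reference, the three searches above could be sketched as Elasticsearch Query DSL bodies, written here as plain Python dicts. This is only a sketch under assumptions: the helper names are made up, "message" is assumed to be an analyzed text field, and "message.keyword" is assumed to be an unanalyzed keyword sub-field that may not exist (or may be truncated) in a given mapping. On an analyzed field a phrase query matches whole tokens in order, so it will not find an id embedded inside a longer token; that case needs a wildcard query, which can be slow.

```python
import json

def phrase_query(field, text):
    # match_phrase: all tokens of `text` must appear in this order in `field`.
    return {"query": {"match_phrase": {field: text}}}

def wildcard_query(field, fragment):
    # wildcard: substring-style match; meant for an unanalyzed (keyword) field.
    return {"query": {"wildcard": {field: {"value": f"*{fragment}*"}}}}

# "message.keyword" is a hypothetical sub-field; check the index mapping first.
queries = [
    phrase_query("message", "methodName,"),
    phrase_query("message", "Account Opening"),
    wildcard_query("message.keyword", "0042342672101"),
]

for body in queries:
    print(json.dumps(body))
```

In the Kibana search bar, the phrase case is usually written with double quotes, for example message:"Account Opening".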
Related
When searching for something in Google, if you misspell a word (may be by mistake or may be when you really mean this non-dictionary word), Google says:
"Showing results for ..... Search instead for .......".
I am trying to figure out how this would work.
This basically means being able to find the closest dictionary word to the non-dictionary word entered. How does it work? One way I can guess is:
Count the number of instances of each character, then scan the dictionary for a word with the same number of instances of each character (allowing a difference of ±1). But this will also return anagrams.
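A minimal sketch of that character-count guess, to make the anagram problem concrete. The tolerance is interpreted here as the total difference in character counts across the whole word, which is one possible reading of "±1"; the function names and the tiny dictionary are made up for illustration.

```python
from collections import Counter

def count_difference(a, b):
    # Sum of per-character count differences between the two words.
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[ch] - cb[ch]) for ch in set(ca) | set(cb))

def candidates(word, dictionary, tolerance=1):
    # Keep every dictionary word whose character counts are close enough.
    return [w for w in dictionary if count_difference(word, w) <= tolerance]

dictionary = ["listen", "silent", "listed", "banana"]
print(candidates("lsiten", dictionary))  # ['listen', 'silent']
```

Both "listen" and "silent" come back for the typo "lsiten", because character counts ignore letter order. An order-aware measure such as edit distance would tell them apart.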
Would some kind of probabilistic model, such as a Markov model, be of any use here? I don't understand Markov models well enough to throw the term around, so this is just a very wild guess.
Any insights?
You're forgetting that Google has a lot more information available to it than you do. They track when people type in a word, don't select a result, and then do another search shortly afterwards. They then use this information to suggest better searches for you.
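The idea of learning suggestions from follow-up searches can be sketched roughly as below. The log format (a session as a list of (query, clicked) pairs) and the function names are assumptions made for illustration, not a description of what Google actually stores or runs.

```python
from collections import Counter, defaultdict

def mine_reformulations(sessions):
    # For each query that got no click, count the next query that did get one.
    follow_ups = defaultdict(Counter)
    for session in sessions:
        for (query, clicked), (next_query, next_clicked) in zip(session, session[1:]):
            if not clicked and next_clicked and query != next_query:
                follow_ups[query][next_query] += 1
    return follow_ups

def did_you_mean(query, follow_ups, min_count=2):
    # Suggest the most common successful follow-up, if it was seen often enough.
    if query not in follow_ups:
        return None
    suggestion, count = follow_ups[query].most_common(1)[0]
    return suggestion if count >= min_count else None

# Hypothetical sessions: each entry is (query, did the user click a result?).
sessions = [
    [("flikr", False), ("flickr", True)],
    [("flikr", False), ("flickr", True)],
    [("flikr", False), ("flicker", True)],
]
follow_ups = mine_reformulations(sessions)
print(did_you_mean("flikr", follow_ups))   # flickr
print(did_you_mean("flickr", follow_ups))  # None
```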
See How does the Google "Did you mean?" Algorithm work? for a fuller explanation.
Note that this approach makes sense when you consider that Google aren't actually doing spell-checking. Instead, they are trying to work out what search term will give you the answer you are looking for. Obviously there is a lot of overlap between this and spell-checking, but it means they are not always trying to correct a search for, e.g., "Flickr".
When your search is close to earlier searches by other people that returned more results, Google suggests those searches.
So it is not spell checking; it shows what other people queried using related keywords.
I saw an internship opportunity at a bank in Dubai. They have a defined problem statement to be solved in 2 months. They told us just 2 lines -
"Basically the problem is about name matching logic.
There are two fields (variables) – both are employer names, and it’s a free text field. So we need to write a program to match these two variables."
Can anyone help me understand it? Is it just simple pattern matching?
Any help/comments would be appreciated.
I think this is what they are asking for:
They have two sources of related data, for example, one from an internal database, and the other from name card input.
Because the two fields are free text fields, there will be inconsistency. For example, Nitin Garg, or Garg, Nitin, or Mr. Nitin Garg, etc. Here is an extreme case of Gadaffi.
What you are supposed to do is to find a way to match all the names for a specific person together.
In short, match two pieces of data together by employer names, taking possible inconsistency into account.
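As a rough illustration of handling that kind of inconsistency, one could normalize both strings before comparing them. The sketch below is an assumption about one possible approach, not what the bank asked for: the title list is made up, and difflib from the Python standard library is used as a simple similarity score.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of titles to ignore; extend it for real data.
TITLES = {"mr", "mrs", "ms", "dr"}

def normalize(name):
    # Lowercase, drop punctuation and titles, and sort the remaining tokens
    # so that "Garg, Nitin" and "Mr. Nitin Garg" normalize to the same string.
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(t for t in tokens if t not in TITLES))

def similarity(a, b):
    # Score between 0.0 and 1.0; 1.0 means identical after normalization.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(normalize("Garg, Nitin"))                    # garg nitin
print(similarity("Garg, Nitin", "Mr. Nitin Garg"))  # 1.0
print(similarity("Nitin Garg", "Nitin Gargh") > 0.9)  # True
```

A threshold on the score then decides whether two free-text values count as the same name; where to put it depends on the data.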
Once upon a time there was a nice simple answer to the problem of matching up names despite mis-spellings and different transliterations - Soundex. But people have put a lot of work into this problem, so now you should probably use the results of that work, which is built into databases and add-ons - some free. See Fuzzy matching using T-SQL and http://anastasiosyal.com/archive/2009/01/11/18.aspx and http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
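For reference, the classic American Soundex code mentioned above fits in a few lines. This is a sketch of the textbook rules (keep the first letter, encode consonants as digits, collapse repeats, pad to four characters); database implementations may differ in edge cases.

```python
# Letter groups and their Soundex digits.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # vowels reset prev, so a repeated code after a vowel is kept
    return (first + "".join(encoded) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Tymczak"))                    # T522
```

Names that sound alike, such as "Robert" and "Rupert", get the same code, which is what makes it usable as a crude matching key.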
Assuming that I know nothing about anything and that I'm starting programming TODAY, what would you say I need to learn in order to start working with Natural Language Processing?
I've been struggling with some string parsing methods, but so far it is just annoying me and making me write ugly code. I'm looking for some fresh ideas on how to build something like the Remember The Milk API to parse a user's input, in order to provide a fast data entry form that is based not on fields but on simple one-line phrases.
EDIT: RTM is a to-do list system. To enter a task you don't need to fill in each field (task name, due date, location, etc.). You can simply type a phrase like "Dentist appointment monday at 2PM in WhateverPlace" and it will parse it and fill in all the fields for you.
I don't have any kind of technical constraints since it's going to be a personal project but I'm more familiar with .NET world. Actually, I'm not sure this is a matter of language but if it's necessary I'm more than willing to learn a new language to do it.
My project is related to personal finances so the phrases are more like "Spent 10USD on Coffee last night with my girlfriend" and it would fill location, amount of $$$, tags and other stuff.
Thanks a lot for any kind of directions that you might give me!
This does not appear to require full NLP. Simple pattern-based information extraction will probably suffice. The basic idea is to tokenize the text, then recognize/classify certain keywords, and finally recognize patterns/phrases.
In your example, tokenizing gives you "Dentist", "appointment", "monday", "at", "2PM", "in", "WhateverPlace". Your tool will recognize that "monday" is a day of the week, "2PM" is a time, etc. Finally, you can find patterns like [at] [TIME] and [in] [Place] and use those to fill in the fields.
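The tokenize, classify, and match-patterns steps above can be sketched as follows. It only handles the [at] [TIME] and [in] [Place] patterns from the example; the function name, field names, and the time format are assumptions for illustration.

```python
import re

DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}
TIME_RE = re.compile(r"^\d{1,2}(:\d{2})?(AM|PM)$", re.IGNORECASE)

def parse_task(text):
    tokens = text.split()  # step 1: tokenize on whitespace
    fields = {"task": [], "day": None, "time": None, "place": None}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok.lower() in DAYS:                        # keyword: day of the week
            fields["day"] = tok.lower()
        elif tok.lower() == "at" and nxt and TIME_RE.match(nxt):
            fields["time"] = nxt.upper()               # pattern: [at] [TIME]
            i += 1
        elif tok.lower() == "in" and nxt:
            fields["place"] = " ".join(tokens[i + 1:])  # pattern: [in] [Place]
            break
        else:
            fields["task"].append(tok)                 # everything else is the task name
        i += 1
    fields["task"] = " ".join(fields["task"])
    return fields

print(parse_task("Dentist appointment monday at 2PM in WhateverPlace"))
```

For the example phrase this fills task with "Dentist appointment", day with "monday", time with "2PM", and place with "WhateverPlace".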
A framework like GATE may help, but even that may be a larger hammer than you really need.
Have a look at NLTK; it's a good resource for beginner programmers interested in NLP.
http://www.nltk.org/
It is written in Python, which is one of the easier programming languages.
Now that I understand your problem, here is my solution:
You can develop a kind of restricted vocabulary in which all amounts must end with a $ sign and any time must be in the form 00:00 and/or end with AM/PM. For detecting items, you can use a list of objects from an ontology such as OpenCyc, which can provide you with a list of objects such as beer, coffee, bread, milk, etc. This will help you detect objects in the short phrase. Still, it would be a very fuzzy approach.
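A small sketch of such a restricted vocabulary, using regular expressions for the amount and time rules. The item set here is a tiny hard-coded stand-in for whatever list an ontology would provide; it is not OpenCyc data, and the function name is made up.

```python
import re

# Rule: an amount is a number immediately followed by a $ sign, e.g. "10$".
AMOUNT_RE = re.compile(r"(\d+(?:\.\d+)?)\$")
# Rule: a time is 00:00 (optionally followed by AM/PM) or a number with AM/PM.
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?:\s?[AP]M)?\b|\b\d{1,2}\s?[AP]M\b",
                     re.IGNORECASE)
# Hypothetical stand-in for an ontology-provided list of objects.
KNOWN_ITEMS = {"beer", "coffee", "bread", "milk"}

def extract(phrase):
    amount = AMOUNT_RE.search(phrase)
    time = TIME_RE.search(phrase)
    words = re.findall(r"[a-z]+", phrase.lower())
    return {
        "amount": float(amount.group(1)) if amount else None,
        "time": time.group(0) if time else None,
        "items": [w for w in words if w in KNOWN_ITEMS],
    }

print(extract("Spent 10$ on coffee at 08:30 PM"))
```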
I'm curious what programming term or methodology describes what Google does when it shows you the "did you mean" link for a word that is made up of multiple words.
For example, if I type in "redflower.jpg", it knows to break that up into "Red Flower".
Is there a common paradigm for doing that sort of operation? Would a Lucene search give you that?
thanks!
If Google does not see a lot of matching results for redflower.jpg, it might then try to split the string into multiple words until it finds a lot of matching results.
It might also recognize that .jpg is an image extension and then try to find images with a similar name.
If I had to write an algorithm like this, I would use a huge EXISTING database (either a dictionary or a search engine) and then try what I said at the beginning of my post.
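With a dictionary as the existing database, the splitting step could look like the sketch below. It strips the file extension and then uses dynamic programming to find a split into known words, preferring the fewest words. The tiny vocabulary is made up; a real one would be a large word list or query log, and this is not a claim about what Google does.

```python
import os

def segment(text, vocabulary):
    # best[i] holds the shortest list of known words that spells text[:i].
    text = text.lower()
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]  # None if no full split exists

# Hypothetical vocabulary for illustration.
vocabulary = {"red", "flower", "flow", "a"}
stem, extension = os.path.splitext("redflower.jpg")
print(segment(stem, vocabulary), extension)  # ['red', 'flower'] .jpg
```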
Perhaps they could look at what other people do when they have searched for redflowers.jpg? Maybe a number of people searched for "redflowers.jpg", didn't click on any links, and then searched for "Red Flower" and found some results worth clicking on.
Of course they would have to take into account that the queries are similar (contain matching strings), otherwise some strange results might appear.
For example, how does StackOverflow decide other questions are similar?
When I typed in the question above and then tabbed to this memo control I saw a list of existing questions which might be the same as the one I am asking.
What technique is used to find similar questions?
I got an email from team#stackoverflow.com on Mar 20 that mentions how it works:
the "ask a question" search is exclusively on title and will not match anything in the body. It is a mystery to me why people think it's better.
The last sentence refers to the search bar, which I've found is less useful when I'm trying to find a specific question I've already seen.
I think it's plain old word matching. However, I might add that this feature does not work as well as I would like it to. It's much better to do a Google search with the site:stackoverflow.com prefix than to rely on SO to provide relevant suggestions.
Poorly -- using MS SQL Full Text Search, I believe. You'll have better luck using Lucene, IMO. For more background on the topic see the Wikipedia article on Lucene or the general topic of information retrieval.
The matching program would store an index of all questions. When you ask a question, all keywords in your question are matched against the index. This is similar to Google Search. The open source search library Lucene can be (and with high probability has been) used for this. Since the results are not quite accurate, I presume they index just the titles of the questions, as an approximation.
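The index-and-match idea can be sketched as a tiny inverted index over question titles. This is only a toy illustration with made-up function names and a short stopword list; it is not how Stack Overflow or Lucene actually rank results.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "the", "is", "to", "in", "of", "how", "does", "what", "do"}

def tokenize(text):
    # Lowercase keywords with stopwords removed.
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def build_index(titles):
    # Inverted index: keyword -> set of title positions containing it.
    index = defaultdict(set)
    for position, title in enumerate(titles):
        for word in tokenize(title):
            index[word].add(position)
    return index

def similar(query, titles, index, limit=3):
    # Score each title by how many distinct query keywords it shares.
    hits = Counter()
    for word in set(tokenize(query)):
        for position in index.get(word, ()):
            hits[position] += 1
    ranked = sorted(hits, key=lambda p: (-hits[p], p))
    return [titles[p] for p in ranked[:limit]]

titles = [
    "How does the Google did you mean algorithm work",
    "What technique is used to find similar questions",
    "How to parse dates in Python",
]
index = build_index(titles)
print(similar("How does StackOverflow find similar questions?", titles, index))
```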
The other related keyword is collaborative filtering, the algorithm popularized by Amazon to recommend products based on the behavior of other similar customers. In the current case, an alternative algorithm based on collaborative filtering would be: keywords are extracted from the question, then the tags associated (in the history) with those keywords are found, and questions which have those tags are returned. Experiments would be needed to see whether it works well at all.
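A rough sketch of that keyword to tag to question idea, with a made-up history of tagged questions. Everything here (data layout, function names, the number of tags kept) is an assumption for illustration, and as noted above it is untested whether the approach gives good suggestions in practice.

```python
from collections import Counter, defaultdict

def build_keyword_tags(history):
    # Count how often each title keyword co-occurred with each tag.
    keyword_tags = defaultdict(Counter)
    for question in history:
        for word in question["title"].lower().split():
            for tag in question["tags"]:
                keyword_tags[word][tag] += 1
    return keyword_tags

def suggest(title, history, top_tags=2):
    keyword_tags = build_keyword_tags(history)
    votes = Counter()
    for word in title.lower().split():
        votes.update(keyword_tags.get(word, {}))
    # Keep the most voted tags (ties broken alphabetically).
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    tags = {tag for tag, _ in ranked[:top_tags]}
    return [q["title"] for q in history if tags & set(q["tags"])]

# Hypothetical history of previously asked, tagged questions.
history = [
    {"title": "fuzzy search in lucene", "tags": ["lucene", "search"]},
    {"title": "index tuning for lucene", "tags": ["lucene"]},
    {"title": "parse dates in python", "tags": ["python"]},
]
print(suggest("lucene fuzzy matching", history))
```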