architecture: building text suggestions based on existing text - text-classification

trying build a text suggestions similar to stackoverflow question suggestions. Do not know where to start from. Any suggestions what tools/servers/algorithms i should be researching. Does this come under text classification?
Any links towards this would be helpful

I think you need to string searching algorithms. They try to find a place where one or several strings (also called patterns) are found within a larger string or text.
The following links help you understand this topic:
Algorithm for autocomplete?
What is the best autocomplete/suggest algorithm,datastructure [C++/C]
https://web.stanford.edu/class/cs97si/10-string-algorithms.pdf

Related

How do I practice with multi_index of boost library

Do you have any document or project which I can practice with boost:multi_index?
You could answer questions on this site. Or look at them and think up variations or see if you can duplicate them without looking at the solutions.
A good place to start:
search multi_index_container
search Joaquin's answers
or look into other top uses of the tag: https://stackoverflow.com/tags/boost-multi-index/topusers
Also there are a number of introductory online documents, e.g. https://theboostcpplibraries.com/boost.multiindex

Feature Extraction from Images to use with LIBSVM

I'm really stuck right now. I want to apply LIBSVM for Image Classification. I captured lots of Training-Images (BITMAP-Format), from which I want to extract features.
The Training-Images contain people who are lying on the floor. The classifier should decide if there is a person lying on the floor or not in the given Image.
I read lots of papers, documentary, guides and tutorials, but in none of them is documented how to get a LIBSVM-Package. The only thing that is described is how to convert a LIBSVM-Package from a CSV-File like this one: CSV-File. On the LIBSVM-Website several Example-Data can be downloaded. The Example-Data is either prepared as CSV-Files or as ready-to-use Training- and Testdata.
If you look at the Values which are in the CSV-File, the first column are the labels (lying person or not) and the other Values are the extracted features, but I still can't reconstruct how those values are achieved.
I don't know if it's that simple that nobody has to mention it, but I just can't get trough it, so if anybody knows how to perform the feature extraction from Images, please help me.
Thank you in advance,
Regards
You need to do feature extraction first. There are many methods that are available. These include LBP,Gabor and many more.. These methods will help you get the features to input into libsvm..Hope this helps...

How does google know if I type in redflower.jpg I mean Red Flower?

I'm curious what the programming terms or methodology is used when Google shows you the "did you mean" link for a word that is made up of multiple words?
For example if I type in "redflower.jpg" It knows to break that up into Red Flower
Is there a common paradigm for doing that sort of operation? Would a Lucene search give you that?
thanks!
If google does not see a lot of matching results for reflowers.jpg, it might then try to cut the words in multiple words until it finds a lot of matching results.
It might also recognize the extension (.jpg), recognize the image extension and then try to find images with the similar name.
If I would have to make an algorithm like this, I would use an huge EXISTING database (either a dictionary or a search engine) and then try what I said in the beginning of my post.
Perhaps they could to look at what other people do when they have searched for redflowers.jpg? Maybe a number of people searched for "redflowers.jpg", didn't click on any links, and then searched for "Red Flower" and found some results worth clicking on.
Of course they would have to take into account that the queries are similar (contain matching strings), otherwise some strange results might appear.

Inline data representation

I would like to represent data that gives an overview but allows them to drill down in an inline fashion - so if you had a grouping of say 6 objects the user could expand the data and it would show the 6 objects immeadiately below it before any more high level data.
It would appear that MSHFlexgrid gives this ability but I can't find any information about actually using it, or what it's limitations are (can you have differing number of fields and/or can they have different spacing, what about column headers, indentation at for the start, etc).
I found this site, but the images are broken (in ie8 and ff3.5). Google searches show people just using the flat data representation but nothing using the hierarchical properties). Does anyone know any good tutorials or forums with a good discussion about pitfalls?
Due to lack of information about using it, I am thinking of coding my own version but if anyone has done work in this area I haven't found it - I would of thought it would be a natural wish for data representation. If someone has coded a version of this (any language) then I wouldn't mind reading about it - maybe my idea of how to do it wouldn't be the best way.
You might want to check out vbAccelerator. He has a Multi-Column Treeview control that sounds like what you may be looking for. He gives you the source and has some pretty decent samples.
The MSHFlexGrid reference pages and the "using the MSHFlexGrid" topic in the Visual Basic manual?
Sorry if you've already looked at these!

Searching a datastore for related topics by keyword

For example, how does StackOverflow decide other questions are similar?
When I typed in the question above and then tabbed to this memo control I saw a list of existing questions which might be the same as the one I am asking.
What technique is used to find similar questions?
I got an email from team#stackoverflow.com on Mar 20 that mentions how it works:
the "ask a question" search is
exclusively on title and will not
match anything in the body. It is a
mystery to me why people think it's
better.
The last sentence refers to the search bar, which I've found is less useful when I'm trying to find a specific question I've already seen.
I think it's plain old word matching. However, I might add that this feature does not work as well as I would like it to. It's much better to do google search with site:stackoverflow.com prefix than to rely on SO to provide the relevant suggestions.
Poorly -- using MS SQL Full Text Search, I believe. You'll have better luck using Lucene, IMO. For more background on the topic see the Wikipedia article on Lucene or the general topic of information retrieval.
The matching program would store an index of all questions. When you ask a question, all keywords in your question are matched against the index. This is similar to Google Search. Lucene open source search can be (and with high probability has been) used for this. Since the results are not quite accurate, I presume they index just the headlines of the questions, as an approximation.
The other related keyword is collaborative filtering, the algorithm popularized by Amazon to recommend products based on behavior of other similar customers. In the current case, an alternative algorithm based on collaborative filtering is: keywords are extracted from the question, then tags associated (in the history) with the keywords are found. Questions which have those tags are returned. Well, experiments are needed to see whether it works well at all.

Resources