Sequential browsing of an Elasticsearch index

I am building a system that uses Elasticsearch to store and retrieve library catalogue data. One thing I've been asked for is a browse interface.
Here's a definition of what this is:
The user does a search, for example "Author starts with" and they
supply "Smith"
The system puts them into the middle of a list of authors, at or near
the position of the first one that starts with "Smith", so they might
see:
Smart, Murray
Smart, Murray J.
Smeaton, Duncan
Smieliauskas, Wally
Smillie, John
Smith Milway, Katie <-- this being the first actual search result
Smith, A. M. C.
Smith, Andrew
Smith, Andrew M. C.
etc.
The one with the marker is the one actually searched for, but you can see the ones around it according to the sort order, including ones that don't actually match the query.
These will be paged, with about 20 results per page. If the user pages back, they head towards the start of the alphabet; if they page forwards, they go onward.
Each result shown will have a count beside it showing how many results (i.e. catalogue items) are associated with that author.
Clicking on a result takes you to everything by that author (this and everything beyond it is fairly easy and mostly implemented already.)
I'm wondering if anyone has any good ideas on how to approach this. At this stage, I don't care too much about handling searches that aren't "field starts with" searches, as exactly how that will be done is currently up in the air and I'll deal with it when the time comes.
Here's what I'm thinking, but there are serious issues with it:
All the fields that are going to be browsed are faceted
I get a list of all the facets for that field, search through it to find the starting point, and handle the paging manually in code.
This has the big problem that I might be fetching hundreds of thousands of terms and processing them, which won't be quick.
In retrospect, it's no different from loading all the values into their own index and fetching them all in sorted order.
I'm open to any options here, whether I can somehow jump into the middle of a large set of facets like the query "from" field, or if I should instead put everything into another index specifically for this purpose (though I don't know how I'd structure and query it), or something else.
From what I can see, my ideal solution would be that I can specify the facet field, tell ES that I want to start at the one that starts with "Smith", and it displays from around there, then I have the ability to say "go 20 back", but I'm not sure that this is possible.
You can see an example of the sort of thing I'm talking about in action here: http://hollisclassic.harvard.edu/ - put in Smith as "Author (last name first)", and it gives you a (terribly ugly looking) browse list.
Any thoughts?

On:
The one with the marker is the one actually searched for, but you can
see the ones around it according to the sort order, including ones
that don't actually match the query.
I had a similar requirement: "Show the user how many records we would have found if the search-conditions were more relaxed".
I solved this by doing two searches (one exact, one more relaxed), as the performance of ES is so good that doing one or two searches does not matter. The time gets eaten up in the displaying (in my case) and not in the search.
Still, you would need to merge these two results in your application to generate one list to display.
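A minimal sketch of how the two-search idea could be adapted to the browse list, under assumptions that are not in the question: there is a dedicated index with one document per author, and it has a non-analyzed (keyword) field, here called `author_sort` (a hypothetical name), whose term order is the browse order. One search fetches the entries at or after the typed text, the other fetches the entries just before it, and the application merges them. The functions only build request bodies and merge lists; sending the requests and computing the per-author counts are left out.

```python
def browse_queries(field, text, before=5, after=15):
    """Build two search bodies: entries just before `text`, and entries from `text` on."""
    backward = {
        "size": before,
        "query": {"range": {field: {"lt": text}}},
        "sort": [{field: "desc"}],  # closest entries first, so they arrive reversed
    }
    forward = {
        "size": after,
        "query": {"range": {field: {"gte": text}}},
        "sort": [{field: "asc"}],
    }
    return backward, forward


def merge_pages(backward_hits, forward_hits):
    """Put the backward hits into ascending order and append the forward hits."""
    return list(reversed(backward_hits)) + list(forward_hits)


backward_body, forward_body = browse_queries("author_sort", "Smith")

# Stand-ins for the values each search would return, in the order returned.
page = merge_pages(
    ["Smillie, John", "Smieliauskas, Wally"],
    ["Smith Milway, Katie", "Smith, A. M. C."],
)
```

Paging back or forward would then repeat the same pair of searches using the first or last entry currently on screen as the new `text`.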


Optimize Google Places API Query for Prominent Parks, Mountains, Conservation Areas

First post on Stackoverflow.
I am using the Google API to sort images taken while traveling into organized folders, append tags and rename files with relevant information. I have my code working well but am not always happy with the results. I want to be able to focus my query results on major tourist attractions such as National Parks, Ski Resorts, Beaches, etc. The problem I am finding is that the prominence "rankby" variable and the "radius" are not giving satisfactory results. Here is a typical query for Zion National Park.
https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=37.269486111111,-112.948141666667&rankby=prominence&radius=50000&type=natural_feature,tourist_attraction,point_of_interest&keyword=&key=MYAPIKEY
The most prominent result is Springdale, which is the town where you enter the park. Zion National Park is listed much further down in the results. My code takes the latitude and longitude extracted from EXIF and does a Google API nearby search request to find the Place ID for where the photo was taken. It then does another API request for Place Details using the place_id provided by the previous step to cut down on the information I need to parse.
https://maps.googleapis.com/maps/api/place/details/json?place_id=ChIJ8R5RCzaNyoARegi3rqVkstk&fields=name,address_component&key=MYAPIKEY
I can force the nearby search to return a National Park by searching against "National Park" in the keywords variable but that limits my project to only being able to provide National Park results since the keywords field can only accept one string.
I would like one part of my query to return the most prominent tourist attraction at the general level (e.g. Zion National Park, Yosemite National Park) so I can sort images into the general name folders, and another part of the query to provide the exact location (e.g. I am on this trail or at this lookout). The problem is that the Google API also sees these specific locations ("Trail", "Lookout") as tourist attractions, parks, establishments, etc., so it chooses those first.
What I need help with is trying to figure out if there is a better way to structure my query to return the high-level / name of the major park. From my understanding, the types field only searches on the first type even if there is more in the list and the keywords field can only accept one string as well making it impossible for one phase to capture all major destinations at a high level.
Perhaps it needs to be done with more queries but I am trying to limit the number of queries to stay inside the free quota. Maybe it will just take a long time to fully sort my files.
I have read through and implemented the Google API structure. I am hoping someone can provide a more detailed query structure or a method to pick out truly prominent locations, rather than Google's interpretation of prominence, which can be affected by user ratings and is not always accurate.

dynamically classify categories

I am new at the idea of programming algorithms. I can work with simplistic ideas, but my current project requires that I create something a bit more complicated.
I'm trying to create a categorization system based on keywords and subsets of 'general' categories that filter down into more detailed categories that requires as little work as possible from the user.
I.E.
Sports >> Baseball >> Pitching >> Nolan Ryan
So, if a user decides they want to talk about "Baseball" and they filter the search, I would like to also include "Sports".
User enters: "baseball"
User is then taken to Sports >> Baseball
Now I understand that this would be impossible without a living - breathing dynamic program that connects those two categories in some way. It would also require 'some' user input initially, and many more inputs throughout the lifetime of the software in order to maintain it and keep it up to date.
But alas, asking for such an algorithm would be frivolous without detailing very concrete specifics about what I'm trying to do, and I'm not trying to ask for a handout.
Instead, I am curious if people are aware of similar systems that have already been implemented and if there is documentation out there describing how it has been done. Or even some real life examples of your own projects.
In short, I have a 'plan' but it requires more user input than I really want. I feel getting more info on the subject would be the best course of action before jumping head first into developing this program.
Thanks
IMHO it isn't as hard as you think. What you want is called tagging, and you can do it automatically just by setting the correlation between tags (i.e. a tag can have its meaningful information plus its relation to other ones). Then, if the user selects a tag, you relate it to the others by looking it up in your ADT collection (which can be as simple as an array).
Tag:
    Sport
Related Tags:
    Football
    Soccer
    ...
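A minimal sketch of the tag structure above, assuming the relation is symmetric and tags are compared case-insensitively (both are assumptions; the names `add_relation` and `related_tags` are made up for illustration):

```python
from collections import defaultdict


def add_relation(graph, tag_a, tag_b):
    """Record that two tags are related, in both directions."""
    a, b = tag_a.lower(), tag_b.lower()
    graph[a].add(b)
    graph[b].add(a)


def related_tags(graph, tag):
    """Return the tags related to `tag`, sorted for stable display."""
    return sorted(graph.get(tag.lower(), set()))


graph = defaultdict(set)
add_relation(graph, "Sport", "Football")
add_relation(graph, "Sport", "Soccer")

sport_related = related_tags(graph, "Sport")
football_related = related_tags(graph, "football")
```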
I'm hoping this helps!
It sounds like what you want to do is create a tree/menu structure, and then be able to rapidly retrieve the "breadcrumb" for any given key in the tree.
Here's what I would think:
Create the tree with all the branches. It's okay if you want branches to share keys - as long as you can give the user a "choice" of "Multiple found, please choose which one... ?"
For every key in the tree, generate the breadcrumb. This is time-consuming, and if the tree is very large and updating regularly then it may be something better done offline, in the cloud, or via hadoop, etc.
Store the key and the breadcrumb in a key/value store such as redis, or in memory/cached as desired. You'll want every value to have an array if you want to share keys across categories/branches.
When the user selects a key - the key is looked up in the store, and if the resulting value contains only one match, then you simply construct the breadcrumb to take the user where you want them to go. If it has multiple, you give them a choice.
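The steps above can be sketched roughly as follows. This is only an illustration: the tree is a nested dict, a plain in-memory dict stands in for the key/value store, and the function names are hypothetical.

```python
from collections import defaultdict


def build_breadcrumb_index(tree):
    """Map every key in the tree to the list of breadcrumbs that lead to it."""
    index = defaultdict(list)

    def walk(node, path):
        for name, children in node.items():
            crumb = path + (name,)
            # A key may appear in several branches, so each value is a list.
            index[name.lower()].append(list(crumb))
            walk(children, crumb)

    walk(tree, ())
    return dict(index)


def lookup(index, key):
    """Return all breadcrumbs for a key; more than one means the user must choose."""
    return index.get(key.strip().lower(), [])


tree = {"Sports": {"Baseball": {"Pitching": {"Nolan Ryan": {}}}}}
index = build_breadcrumb_index(tree)
crumbs = lookup(index, "baseball")
```

With this tree, looking up "baseball" yields a single breadcrumb, so the user can be taken straight to Sports >> Baseball.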
I would even say, if you need something more organic, say a user can create "new topic" dynamically from anywhere else, then you might want to not use a tree at all after the initial import - instead just update your key/value store in real-time.

Bing/Google/Flickr API: how would you find an image to go along each of 150,000 Japanese sentences?

I'm doing part-of-speech & morphological analysis project for Japanese sentences. Each sentence will have its own webpage. To make this page more visual, I want to show one picture which is somehow related to the sentence. For example, For the sentence "私は学生です" ("I'm a student"), the relevant pictures would be pictures of school, Japanese textbook, students, etc. What I have: part-of-speech tagging for every word. My approach now: use 2-3 nouns from every sentence and retrieve the first image from search results using Bing Images API. Note: all the sentence processing up to this point was done in Java.
Have a couple of questions though:
1) what is better (richer corpus & powerful search), Google Images API, Bing Images API, Flickr API, etc. for searching nouns in Japanese?
2) how do you select the most important noun from the sentence to do the query in Image Search Engine without doing complicated topic modeling, etc.?
Thanks!
Japanese WordNet has links to OpenClipart pictures. That could be another relevant source. They describe it in their paper called "Enhancing the Japanese WordNet".
I thought you would start by choosing any noun before は、が and を and giving these priority - probably in that order.
But that assumes that your part-of-speech tagging is good enough to get は=subject identified properly (as I guess you know that は is not always the subject marker).
I looked at a bunch of sample sentences here with this technique in mind and found it as good as could be expected. Except where none of those are used, which is rarish.
And sentences like this one, where you'd have to consider maybe looking for で and a noun before it in the case where there is no を or は. Because if you notice here, the word 人 (people) really doesn't tell you anything about what's being said. Without parsing context properly, you don't even know if the noun is person or people.
毎年 交通事故で 多くの人が 死にます
(many people die in traffic accidents every year)
But basically, couldn't you implement a priority/fallback type system like this?
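A rough sketch of such a priority/fallback system, assuming the sentence has already been tokenized into (surface, part-of-speech) pairs. The labels "noun" and "particle" and the function name are placeholders for whatever the tagger actually produces, and the particle order is a parameter because the right order is a judgment call:

```python
def pick_query_noun(tokens, particle_priority=("は", "が", "を", "で")):
    """Return the noun directly before the highest-priority particle found.

    Falls back to the first noun in the sentence, or None if there is none.
    """
    for particle in particle_priority:
        for i in range(1, len(tokens)):
            surface, pos = tokens[i]
            prev_surface, prev_pos = tokens[i - 1]
            if surface == particle and pos == "particle" and prev_pos == "noun":
                return prev_surface
    for surface, pos in tokens:
        if pos == "noun":
            return surface
    return None


# 毎年 交通事故で 多くの人が 死にます
tokens = [
    ("毎年", "noun"), ("交通事故", "noun"), ("で", "particle"),
    ("多く", "noun"), ("の", "particle"), ("人", "noun"), ("が", "particle"),
    ("死に", "verb"), ("ます", "aux"),
]

default_choice = pick_query_noun(tokens)
de_first_choice = pick_query_noun(tokens, particle_priority=("は", "を", "で", "が"))
```

With the default order the sketch picks 人, which is the uninformative case described above. Ranking で ahead of が picks 交通事故 instead.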
BTW I hope your sentences all use kanji, or when you see はし (in one of the sentences linked to) you won't know whether to show a bridge or chopsticks - and showing the wrong one will probably not be good.

How does Google know that when I type in redflower.jpg I mean Red Flower?

I'm curious what programming terms or methodology are used when Google shows you the "did you mean" link for a word that is made up of multiple words.
For example if I type in "redflower.jpg" It knows to break that up into Red Flower
Is there a common paradigm for doing that sort of operation? Would a Lucene search give you that?
thanks!
If Google does not see a lot of matching results for redflower.jpg, it might then try to split the string into multiple words until it finds a lot of matching results.
It might also recognize the extension (.jpg) as an image extension and then try to find images with a similar name.
If I had to write an algorithm like this, I would use a huge EXISTING database (either a dictionary or a search engine) and then try what I said at the beginning of my post.
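As a rough illustration of the dictionary-based idea, here is a small word-splitting sketch. It is not a claim about what Google actually does. The vocabulary is a tiny hard-coded set standing in for a real dictionary or search index, and the function name is made up.

```python
import os
from functools import lru_cache


def split_into_words(query, vocabulary):
    """Split a run-together query into known words, ignoring a file extension.

    Returns a list of words, or None if no full split exists.
    """
    stem, _extension = os.path.splitext(query.lower())

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(stem):
            return ()
        # Try the longest candidate word first.
        for end in range(len(stem), start, -1):
            word = stem[start:end]
            if word in vocabulary:
                rest = solve(end)
                if rest is not None:
                    return (word,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None


vocabulary = frozenset({"red", "flower", "flow", "low"})
words = split_into_words("redflower.jpg", vocabulary)
```

A real system would also need to rank competing splits, for example by how many results each candidate returns.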
Perhaps they could look at what other people do after they have searched for redflower.jpg. Maybe a number of people searched for "redflower.jpg", didn't click on any links, and then searched for "Red Flower" and found some results worth clicking on.
Of course they would have to take into account that the queries are similar (contain matching strings), otherwise some strange results might appear.

How to Build a User Friendly Filter

Our application displays tons of valuable information to our users in a table. We have a filtering capability that is based on boolean/logic searches. Even after coaching, users still tend not to understand how to use filters because AND, OR, >, >=, etc. are foreign to them. This filter is easy for programmers since it is easily translated into code. Are there any examples of how this can be made more user-friendly and less prone to error?
In the past, when I needed to solve this problem, I presented the users with a list of items (in one or more columns), and gave them a single text box to type text into. I would then match the text against the text in the columns, and collapse the list (removing records that do not match) as they type.
This approach reminds users of Google. Everyone knows how to Google.
If you don't like the idea of presenting a large list of all items initially, you can show an empty results pane first, and display results after a search is typed in.
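A minimal sketch of the matching step behind such a text box, assuming rows are tuples of displayable values and a case-insensitive substring match is good enough (the function name is illustrative). The UI would call it again on each keystroke.

```python
def filter_rows(rows, text):
    """Keep the rows in which any column contains the typed text."""
    needle = text.strip().casefold()
    if not needle:
        return list(rows)  # nothing typed yet: show everything
    return [
        row for row in rows
        if any(needle in str(cell).casefold() for cell in row)
    ]


rows = [
    ("The Shining", "Stephen King", 9.99),
    ("Dune", "Frank Herbert", 12.50),
    ("It", "Stephen King", 11.00),
]
matching = filter_rows(rows, "king")
```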
Convert operators to plain English text and ask them to select from it.
For example:
Show me all Books whose author is [text field] and the price is [less than/greater than] [text field]
[less than/greater than] is a dropdown list
[text field] is an input box
The resulting text after the user has filled in all the fields should read as plain, simple English.
E.g. Show me all books whose author is Stephen King and the price is less than $10
I used this in an app of mine when I used to freelance and the users loved it.
Using some nifty UI programming you can give options to expand the filter to n levels.
In web applications, telerik had a good idea with their grid, you should be able to do that in desktop applications too.
you can provide some preset filters for the most common queries to that table - if that's possible with the application you are using
you can provide a "count instead of display" mechanism so the user sees how many rows he/she will potentially retrieve
you can provide them a Wiki page with some examples online
you can give them a QBE tool
hope that helps
good luck MikeD
In my experience you are simply not going to get end users to understand the difference between AND and OR conditions. Therefore I build my filters so that ANDing or ORing is built in. In general, my logic is as follows:
Criteria for different fields are ANDed together to restrict results.
Multiple values for the same field are effectively ORed together and then ANDed onto the criteria for other fields. I generally detect input into a single field of comma-separated lists (translated to IN ()), dash-separated ranges (translated to BETWEEN), wildcard values (translated to LIKE), and any combination (for example Customer ID: 1-10, 50, 52).
I find that most users intuitively understand this system.
Of course, from time to time a different interface with some degree of ORing is required and in those cases I generally have a section of the search user interface in a panel or group box labelled "Any of these is true".
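A sketch of the per-field parsing described above, under some assumptions of mine: `*` is the wildcard character, dash-separated ranges are only recognized between whole numbers, and `?` is the placeholder style of the database driver. Column names are interpolated into the SQL, so they must come from a fixed list in the application and never from user input.

```python
import re

RANGE = re.compile(r"(\d+)\s*-\s*(\d+)")


def field_condition(column, raw):
    """Translate one field's input into an OR group plus its parameters."""
    clauses, params, singles = [], [], []
    for part in (p.strip() for p in raw.split(",")):
        if not part:
            continue
        match = RANGE.fullmatch(part)
        if match:
            clauses.append(f"{column} BETWEEN ? AND ?")
            params.extend(int(group) for group in match.groups())
        elif "*" in part:
            clauses.append(f"{column} LIKE ?")
            params.append(part.replace("*", "%"))
        else:
            singles.append(part)
    if singles:
        placeholders = ", ".join("?" for _ in singles)
        clauses.append(f"{column} IN ({placeholders})")
        params.extend(singles)
    if not clauses:
        return "", []
    return "(" + " OR ".join(clauses) + ")", params


def build_where(criteria):
    """AND together the conditions for all fields that have input."""
    parts, params = [], []
    for column, raw in criteria.items():
        sql, values = field_condition(column, raw)
        if sql:
            parts.append(sql)
            params.extend(values)
    return " AND ".join(parts), params


where, values = build_where({"customer_id": "1-10, 50, 52", "name": "Sm*"})
```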
I have recently been working on this problem. My solution is to be more descriptive, to use words instead of symbols and to change the words where it allows for a more readable layout. To illustrate, imagine the filter expression:
Breed == "Spaniel" AND (Age == 2 OR Colour == "White")
Certain linear Query builders might write this:
(    And/Or  Field     Operator  Value
[ ]          [Breed]   [=]       [Spaniel]
[1]  [AND]   [Age]     [=]       [2]
[1]  [OR]    [Colour]  [=]       [White]
Or a hierarchical one may display this as:
AND
    [Breed] [Is Equal To] [Spaniel]
    OR
        [Age] [Is Equal To] [2]
        [Colour] [Is Equal To] [White]
Both of which might be readable to a developer but not so readable to the layperson.
My solution is more like:
Show ALL records where
    [Breed] [Is Equal To] [Spaniel]
    Show ANY records where
        [Age] [Is Equal To] [2]
        [Colour] [Is Equal To] [White]
So borrowing from the hierarchical approach but changing the AND and OR to an ALL or ANY. This means it can be read from top to bottom a little more easily.
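One way the ALL/ANY layout could map onto a data structure is sketched below. The dict shape (`"all"`, `"any"`, `"field"`, `"value"`) is an assumption made for this example, and only the "Is Equal To" operator is handled.

```python
def matches(node, record):
    """Evaluate a nested ALL/ANY filter against one record."""
    if "all" in node:
        return all(matches(child, record) for child in node["all"])
    if "any" in node:
        return any(matches(child, record) for child in node["any"])
    # Leaf condition: [field] [Is Equal To] [value]
    return record.get(node["field"]) == node["value"]


# Breed == "Spaniel" AND (Age == 2 OR Colour == "White")
dog_filter = {
    "all": [
        {"field": "Breed", "value": "Spaniel"},
        {"any": [
            {"field": "Age", "value": 2},
            {"field": "Colour", "value": "White"},
        ]},
    ]
}

young_spaniel = matches(dog_filter, {"Breed": "Spaniel", "Age": 2, "Colour": "Black"})
old_black_spaniel = matches(dog_filter, {"Breed": "Spaniel", "Age": 7, "Colour": "Black"})
```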
I think Django's built-in admin interface has a very intuitive UI for filters.
There's a simple screenshot in the docs but there's a lot more you can do, especially when filtering on dates.
You might want to take a closer look at Django's admin interface to see if you can apply some of their tricks to your case.
I would think something similar to MS Access Query generator. You may also want to have good context sensitive help system that will guide first time users.
Theresa Neil illustrated several approaches for building complex rule interfaces (AKA predicate clauses) in the iTunes Solves the Nested Clause Dillema post. Some good examples there. I really like the way Apple does it in iTunes (although, I don't use iTunes).
