Well formed query suggestions - data-structures

I am developing an autocomplete feature in which I intend to show query suggestions like this:
students who live in {City_name} [ City_name can take any value from a list of cities ]
example_type 1 :
students who live in New...
[ following query suggestions should pop-up ] :
students who live in New York
students who live in New Jersey
(This means looking up different entities: here cities and sports, e.g. "students who play basketball", etc.)
example_type 2:
students who live in New york and play ba...
[ following query suggestions should pop-up ] :
students who live in New York and play basketball
students who live in New York and play baseball
etc..
I have tried building a basic autocomplete on the entities index using Elasticsearch, which is gisted here.
(In my case, the child/entities index is dumped using a river plugin.) I have briefly looked at Nested Types and the Parent/Child relationship, but could not figure out whether either is the right fit for my requirement.
I am not sure how to index these (parent) phrases along with the child index so that autocomplete search works and the possible suggestion trees can be generated by querying a single index.
It would be great if I could get some help with this kind of problem.
Thanks in advance!

I'd index phrases such as:
live in New york
live in New Jersey
play basketball
play baseball
Then do some work client side to detect that the user has started a new section of the query, and send only the letters of that new section to ES for typeahead completion.
This will take some work on the front end, but I could see it working. The alternative is to index every possible variation of a query phrase for typeahead, but I highly doubt that's viable.
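The client-side step described above could look roughly like this. It is only a sketch, assuming every query starts with the fixed words "students who" and that sections are joined by " and "; the names PHRASES, split_query and suggest are made up for illustration, and the in-memory list stands in for the typeahead request to ES.

```python
# Hypothetical phrase list standing in for the phrases indexed in ES.
PHRASES = [
    "live in New York",
    "live in New Jersey",
    "play basketball",
    "play baseball",
]

PREFIX = "students who "  # assumed fixed start of every query
JOINER = " and "          # assumed separator between query sections


def split_query(body):
    """Split the text after PREFIX into (finished sections, section being typed)."""
    head, sep, tail = body.rpartition(JOINER)
    return head + sep, tail


def suggest(text):
    """Complete only the last section and re-attach what was already typed."""
    if not text.startswith(PREFIX):
        return []
    done, fragment = split_query(text[len(PREFIX):])
    fragment = fragment.rstrip(".")  # tolerate a trailing "..." in the input
    matches = [p for p in PHRASES if p.lower().startswith(fragment.lower())]
    return [PREFIX + done + m for m in matches]
```

Calling suggest("students who live in New York and play ba") only looks up "play ba" and then glues the finished part back on, so the phrase list stays small no matter how many sections a query has.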

Related

Xpath query, making a certain query more generic

I'm trying to extract information from Wikipedia tables.
More specifically, I'm trying to make a list of all teams and all players in the premier league.
So far I can traverse all the teams in the Premier League 2019-2020 table of teams; for every team I open its Wikipedia page and traverse its players to get their information.
I assumed there was a fixed template in which every Premier League team page has its table of players at position 3, but after traversing 6 teams I hit a team whose table is at position 2.
So I was using the following XPath query on every team wiki page:
"//table[3]/tbody//tr[position() > 1]//td[4]//span/a/@href"
but, for example, the players table of the following team is at position 2. How can I make this query more generic so that it is not tied to a fixed position? I have noticed that every relevant table is preceded by an element with the text "First-team squad".
The HTML of the table is too long to post, so here is the wiki link of one such team:
https://en.wikipedia.org/wiki/Crystal_Palace_F.C.
Hope to get help! thanks.
You have to use another "anchor" which works for each page. The table you need is always the first after the span element "Players".
So with this :
//span[@id='Players']/following::table[1]//span[@class="fn"]//text()
You'll get the names of all players of the current squad team.
With this :
//span[@id='Players']/following::table[1]//span[@class="fn"]//@href
You'll get the associated URLs. Be careful: some players don't have a Wikipedia page.
So you can end up with 26 player names but only 25 URLs, as here:
https://en.wikipedia.org/wiki/Chelsea_F.C.
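To illustrate the "anchor" idea with something runnable, here is a small Python sketch using only the standard library. Python's xml.etree.ElementTree does not support the following:: axis, so the helper first_table_after (a made-up name) walks the tree in document order instead; a library with full XPath 1.0 support could evaluate the expressions above directly. The PAGE markup is an invented, well-formed stand-in, not real Wikipedia HTML.

```python
import xml.etree.ElementTree as ET

# Tiny well-formed stand-in for a team page (an assumption for illustration;
# real Wikipedia HTML is not guaranteed to parse as XML).
PAGE = """
<html><body>
  <table><tr><td>infobox</td></tr></table>
  <h2><span id="Players">Players</span></h2>
  <table>
    <tr><td><span class="fn"><a href="/wiki/Player_One">Player One</a></span></td></tr>
    <tr><td><span class="fn">Player Two</span></td></tr>
  </table>
  <table><tr><td><span class="fn"><a href="/wiki/Other">Other</a></span></td></tr></table>
</body></html>
"""


def first_table_after(root, anchor_id):
    """Return the first <table> that comes after the span with the given id."""
    seen_anchor = False
    for el in root.iter():  # iter() visits elements in document order
        if el.tag == "span" and el.get("id") == anchor_id:
            seen_anchor = True
        elif seen_anchor and el.tag == "table":
            return el
    return None


root = ET.fromstring(PAGE)
squad = first_table_after(root, "Players")
# Player names: the text inside every span with class "fn" in that table.
names = ["".join(s.itertext()).strip() for s in squad.iterfind(".//span[@class='fn']")]
# Player URLs: only players that actually have a link.
hrefs = [a.get("href") for a in squad.iterfind(".//span[@class='fn']//a")]
```

In this sample names has two entries and hrefs only one, which mirrors the mismatch between player names and URLs mentioned above.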

DAX COUNT/COUNTA functions

I've looked at many threads regarding COUNT and COUNTA, but I can't seem to figure out how to use it correctly.
I am new to DAX and am learning my way around. I have attempted to look this up and have gotten a little ways to where I need to be but not exactly. I think I am confused about how to apply a filter.
Here's the situation:
Four separate queries generate the data in the report, but I only need two of them for the DAX function (Products and Display).
I have three columns I need to filter by, as follows:
Customer (Display or Products query; can do either)
Brand (Products query)
Location (Display query)
I want to count each unique combination of these columns only once.
Here's an example:
Customer: Big Box Buy;
Item: Lego Big Blocks;
Brand: Lego;
Location: Toys;
BREAK
Customer: Big Box Buy;
Item: Lego Star Wars;
Brand: Lego;
Location: Toys;
BREAK
Customer: Big Box Buy;
Item: Surface Pro;
Brand: Microsoft;
Location: Electronics;
BREAK
Customer: Little Shop on the Corner;
Item: Red Bicycle;
Brand: Trek;
Location: Racks;
In this example, no matter the fact that the items are different, we want to look at just the customer, the brand, and the location. We see in the first two records, the customer is "Big Box Buy" and the brand is "Lego" and the location is "Toys". This appears twice, but I want to count it distinct as "1". The next "Big Box Buy" store has the brand "Microsoft" and the location is "Electronics". It appears once and only once, and thus the distinct count is "1" anyway. This means that there are two separate entries for "Big Box Buy", both with a count of 1. And lastly there is "Little Shop on the Corner" which appears just once and is counted just once.
The "skeleton" of the code I have is basically just to see if I can get a count to work at all, which I can. It's the FILTER that I think is the problem (not used in the below example) judging by other threads I've read.
TotalDisplays = CALCULATE(COUNTA(products[Brand]))
Obviously I can't just count the amount of times a brand appears as that would give me duplicates. I need it unique based on if the following conditions are met:
Customer must be the same
Brand must be the same
Location must be the same
If so, we distinctly count it as one.
I know I ranted a bit and may seem to have gone in circles, but I was trying to figure out how to explain it. Please let me know if I need to edit this post or post clarification.
Many thanks in advance as I go through my journey with DAX!
I believe I have the answer. I used NATURALINNERJOIN in DAX to create a new, merged table, since I needed to reference all the values in the same query (I couldn't figure out how to do it otherwise). I also created a "unique identity" calculated column that combines data from multiple columns but is hidden behind the scenes (not actually displayed on the report), so I could then take a measure of the unique values that way.
TotalDisplays = COUNTROWS(DISTINCT('GD-DP-Merge'[DisplayCountCalcCol]))
My calculated column is as follows:
DisplayCountCalcCol = 'GD-DP-Merge'[CustID] & 'GD-DP-Merge'[Brand] & 'GD-DP-Merge'[Location] & 'GD-DP-Merge'[Order#]
So the measure TotalDisplays now reports back the distinct count of rows based on the unique value of the customer ID, the brand, and the location of the item. I also threw in an order number just in case.
Thanks!
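For readers who want to check the expected numbers outside of DAX, the same distinct count can be sketched in a few lines of Python. This is only an illustration using the four example rows from the question; total_displays and concat_key are made-up names, and the order number from the calculated column is left out because the example rows do not have one.

```python
# Rows copied from the example in the question: (customer, item, brand, location).
ROWS = [
    ("Big Box Buy", "Lego Big Blocks", "Lego", "Toys"),
    ("Big Box Buy", "Lego Star Wars", "Lego", "Toys"),
    ("Big Box Buy", "Surface Pro", "Microsoft", "Electronics"),
    ("Little Shop on the Corner", "Red Bicycle", "Trek", "Racks"),
]


def total_displays(rows):
    """Count distinct (customer, brand, location) combinations, ignoring the item."""
    keys = {(customer, brand, location) for customer, _item, brand, location in rows}
    return len(keys)


def concat_key(*parts, sep=""):
    """Mimic building one key string by concatenation, like & in the calculated column."""
    return sep.join(parts)
```

total_displays(ROWS) gives 3 for the example data. One thing to watch with a concatenated key: without a delimiter, different value combinations can produce the same string (concat_key("ab", "c") equals concat_key("a", "bc")), so adding a separator such as "|" between the parts may be safer.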
I am fairly new to DAX and was struggling with the COUNT and COUNTA formulas; your post helped me find answers. I would like to add the solution I got for my own query: I wanted a count of "Right Time Start Achieved". If anyone is looking for this kind of answer, use the measure below; the filter argument selects the table column and the string you want to count.
RTSA:=calculate(COUNTA([RTS]),VEO_Daily_Services[RTS]="RTSA")

Efficient way to query

My app has a class that stores the pictures that users upload. Each object in the class has a city property that holds the name of the city where the picture was taken, and a like property that tracks the number of likes.
I want to send a query that returns one picture per city, where each picture is the most-liked one in the city it belongs to. How can I do that?
One way I first thought of is to run multiple queries: fetch the most-liked picture of one city, save it in an array, and then do the same for the other cities.
However, each country has more than one city, thus it's not that efficient.
Parse doesn't support the ordinary operations used in databases. Besides, I tried to use a compound query. Unfortunately, I can't set limit or ordering on the subqueries. Any good solution for this?
It would be easy using group by. Unfortunately, Parse does not support "select distinct" or "group by" features.
As you've suggested, you need to fetch all the cities for each country and, for each city, get the top-rated photo.
BUT, since Parse has strict limits on the execution time of a request (3 sec for an event listener, 7 sec for a custom function), I suggest doing this in a background job that saves the top-rated photo for each city in a new table. That way you can easily query the database from the client. Background jobs can run for up to 15 minutes before Parse drops them, so you can run that kind of query without timeouts.
Hope it helps
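The work the background job has to do is essentially a "group by city, keep the maximum" pass. A minimal Python sketch of that logic follows; it is not Parse Cloud Code, and the field names id, city and likes as well as the sample data are assumptions made for illustration.

```python
def top_picture_per_city(pictures):
    """Return {city: picture}, keeping the most-liked picture of each city."""
    best = {}
    for pic in pictures:
        current = best.get(pic["city"])
        # Keep this picture if the city is new or it beats the current best.
        if current is None or pic["likes"] > current["likes"]:
            best[pic["city"]] = pic
    return best


# Hypothetical sample data, not the real Parse schema.
PICTURES = [
    {"id": "a", "city": "Paris", "likes": 10},
    {"id": "b", "city": "Paris", "likes": 42},
    {"id": "c", "city": "Rome", "likes": 7},
]
TOP = top_picture_per_city(PICTURES)
```

The result has one entry per city, which is what the job would write to the new table so that the client only needs a single simple query.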

Latest news using Sphinx

I'm using Sphinx to index news which I gather from about 100 sites daily.
Each news document has id, title, body and date fields.
For the homepage of my project I want to show today's latest news grouped by topic.
For example site A has a news with title:
"Internet of Things Will Burn Privacy for a While, Cerf Warns"
And site B has one with title:
"Cerf Warns : Internet of Things Will Burn Privacy for a While"
I want to show these as one news item, together with the sites that covered it. Like:
"Internet of Things Will Burn Privacy for a While, Cerf Warns"
Published by : a.com,b.org,...
Is it possible with Sphinx?
Sphinx won't do it on its own. It can't just 'magically' group similar items into clusters of likely duplicates.
(If the titles were identical, character for character, you could just group by title, but that's not the case in your example.)
Once you've got your documents into clusters, i.e. assigned each one a 'cluster-id', Sphinx can help you search or render results using the built-in group by. The two items in your example would share a cluster-id; a unique article not covered by multiple sources would have its own id.
So first you need to cluster your documents.
There are dedicated tools for this type of thing, for example: https://github.com/open-city/dedupe
But a very basic one could actually be built with Sphinx. It would probably work OK in your example, because the titles contain the same words, just in a different order.
Basically you just need a script that loops through all documents that DON'T have a cluster-id and runs a Sphinx search against the index for each one, looking for duplicates. If one is found, copy its cluster-id; otherwise allocate a fresh unique id.
This script can then be run after inserting news documents, to 'cluster' any new stories.
The exact Sphinx query can be varied. E.g. just including the words in a basic query would require all the same words, regardless of order. But you could also use a quorum search to require only most of the words to match.
You might also want to filter by date, to avoid clustering together stories from wildly different dates.
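The clustering loop can be sketched like this. Note the swap: plain Python word sets stand in for the Sphinx search, and the quorum parameter (fraction of a title's words that must also appear in an earlier title) is an assumed stand-in for a quorum query, so treat it as an outline of the idea rather than working Sphinx code. The function names are made up.

```python
import itertools
import re


def words(title):
    """Lower-cased set of word tokens, so word order and punctuation are ignored."""
    return set(re.findall(r"\w+", title.lower()))


def assign_clusters(titles, quorum=0.8):
    """Give each title a cluster id, reusing the id of the first similar earlier title."""
    next_id = itertools.count(1)
    clustered = []  # (word set, cluster id) for titles already processed
    result = []
    for title in titles:
        tokens = words(title)
        cluster_id = None
        for other_tokens, other_id in clustered:
            shared = len(tokens & other_tokens)
            if tokens and shared / len(tokens) >= quorum:
                cluster_id = other_id  # likely duplicate: copy its id
                break
        if cluster_id is None:
            cluster_id = next(next_id)  # nothing similar: allocate a fresh id
        clustered.append((tokens, cluster_id))
        result.append(cluster_id)
    return result
```

With the two headlines from the question, both reduce to the same set of words, so they end up with the same cluster id, while an unrelated headline gets its own.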

How to quickly search book titles?

I have a database of about 200k books. I wish to give my users a way to quickly search a book by the title. Now, some titles might have prefix like A, THE, etc. and also can have numbers in the title, so search for 12 should match books with "12", "twelve" and "dozen" in the title. This will work via AJAX, so I need to make sure database query is really fast.
I assume that most of the users will try to search using some words of the title, so I'm thinking to split all the titles into words and create a separate database table which would map words to titles. However, I fear this might not give the best results. For example, the book title could be some 2 or 3 commonly used words, and I might get a list of books with longer titles that contain all 2-3 words and the one I'm looking for lost like a needle in a haystack. Also, searching for a book with many words in the title might slow down the query because of a lot of OR clauses.
Basically, I'm looking for a way to:
find the results quickly
sort them by relevance.
I assume this is not the first time someone needs something like this, and I'd hate to reinvent the wheel.
P.S. I'm currently using MySQL, but I could switch to anything else if needed.
I think using SOUNDEX is the best way.
SELECT
id,
title
FROM products AS p
WHERE p.title SOUNDS LIKE 'Shaw'
-- This will match 'Saw' etc.
For the best database performance, calculate the SOUNDEX value of each title once and store it in a new column. You can calculate the value with SOUNDEX('Hello').
Example usage:
UPDATE `books` SET `soundex_title` = SOUNDEX(title);
You might want to have a look at Apache Lucene. It is a high-performance, Java-based information retrieval system.
You would create an IndexWriter and index all your titles, and you can add parameters (have a look at the class) linking to the actual book.
When searching, you would need an IndexReader and an IndexSearcher, and use the search() operation on the searcher.
Have a look at the sample in src/demo and at: http://lucene.apache.org/java/2_4_0/demo2.html
Using information retrieval techniques makes indexing take longer, but a search no longer has to go through most of the titles, so overall you can expect better search performance.
Also, choosing a good Analyzer lets you ignore words such as "the", "a", ...
One solution that would easily accommodate your volume of data and speed requirement is to use the Redis key-value store.
The way I see it, you can go ahead with your solution of mapping titles to keywords and storing them in the form:
keyword : set of book titles
Redis already has a built-in set data type that you can use.
Next, to get the titles of the books that contain the search keywords, you can use the SINTER command, which will perform the set intersection for you.
Everything is done in memory; therefore the response time is very fast.
Also, if you want to save your index, Redis has a number of different persistence mechanisms.
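The keyword-to-titles layout can be tried out without a Redis server. In the sketch below a plain Python dict of sets stands in for the Redis sets, and set intersection plays the role that SINTER would play; the sample titles and the function names add_title and search are assumptions made for illustration.

```python
from collections import defaultdict

# keyword -> set of book titles; a dict of sets stands in for Redis sets here.
index = defaultdict(set)


def add_title(title):
    """Register the title under each of its lower-cased words."""
    for word in title.lower().split():
        index[word].add(title)


def search(query):
    """Return the titles that contain every word of the query."""
    sets = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*sets) if sets else set()


# Hypothetical sample titles.
for t in ["A Tale of Two Cities", "The Two Towers", "A Brief History of Time"]:
    add_title(t)
```

Here search("two") returns both "Two" titles and search("two cities") narrows that down to one. With a real Redis instance, each word would be a key holding a set, and the intersection would be done server side.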
Apache Lucene with Solr is definitely a very good option for your problem.
You can have Solr/Lucene index your MySQL database directly. Here is a simple tutorial on how to link your MySQL database with Lucene/Solr: http://www.cabotsolutions.com/2009/05/using-solr-lucene-for-full-text-search-with-mysql-db/
Here are the advantages and pains of using Lucene-Solr instead of MySQL full text search: http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html
Keep it simple. Create an index on the title field and use wildcard pattern matching. You can not possibly make it any faster as your bottleneck is not the string matching but the number of strings you want to match against the title.
I also just came up with a different idea. You say that some words can be interpreted differently, like 12, twelve and dozen. Instead of building a query with the different interpretations, why not store the different interpretations of each title in a separate table with a one-to-many relationship from books? You can then GROUP BY book_id to get unique book titles.
Say the book "A dime in a dozen". In books table it will be:
book_id=356
book_title='A dime in a dozen'
In titles table will be stored:
titles_id=123
titles_book_id=356
titles_title='A dime in a dozen'
--
titles_id=124
titles_book_id=356
titles_title='A dime in a 12'
--
titles_id=125
titles_book_id=356
titles_title='A dime in a twelve'
The query for this:
SELECT b.book_id, b.book_title
FROM books b JOIN titles t on b.book_id=t.titles_book_id
WHERE t.titles_title LIKE '%twelve%'
GROUP BY b.book_id
Now, insertions become a much bigger task, but the variants can be created outside the database and inserted in one go.
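The books/titles layout can be tried end to end with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for MySQL, the column types are assumptions, and the search uses LIKE so that the % wildcards take effect. find_books is a made-up helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books  (book_id INTEGER PRIMARY KEY, book_title TEXT);
    CREATE TABLE titles (titles_id INTEGER PRIMARY KEY,
                         titles_book_id INTEGER REFERENCES books(book_id),
                         titles_title TEXT);
""")
conn.execute("INSERT INTO books VALUES (356, 'A dime in a dozen')")
# One row per stored interpretation of the title.
conn.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [
        (123, 356, "A dime in a dozen"),
        (124, 356, "A dime in a 12"),
        (125, 356, "A dime in a twelve"),
    ],
)


def find_books(term):
    """Return (book_id, book_title) rows whose stored variants contain the term."""
    return conn.execute(
        """
        SELECT b.book_id, b.book_title
        FROM books b JOIN titles t ON b.book_id = t.titles_book_id
        WHERE t.titles_title LIKE ?
        GROUP BY b.book_id
        """,
        (f"%{term}%",),
    ).fetchall()
```

Searching for "twelve", "12" or "dozen" each returns the single book 356, and a word such as "dime" that appears in all three variants still returns it only once because of the GROUP BY.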

Resources