Substring matching in Lucene with a list of items - Elasticsearch

Let's say I have a field in my database: 'city'.
I have a list of cities, and I would like to get the results where the 'city' field is one of the cities in my list.
Two problems:
First, my list is long (around 3000 cities). Is there a better way than:
city:"city1" OR city:"city2" OR ... OR city:"city3000"
Second, sometimes the city is listed as part of a bigger string, so I want results where 'city:"city1"'
but also 'city:"my city is city1"' and 'city:"my city is city1 and its nice"' or 'city:"city1 is my city"'.
I would imagine that using 'city:"*city1*"' would do, but Lucene does not support * at the beginning of the search term.
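One possible direction, sketched here in Python as a request-body builder: if 'city' is an analyzed text field (an assumption, the question does not say how the field is mapped), a match_phrase clause for "city1" also matches values such as "my city is city1", so no leading wildcard is needed. The helper name build_city_query is hypothetical. A cluster may also cap the number of clauses allowed in one bool query, so a list of 3000 cities might need to be split into batches.

```python
import json

def build_city_query(cities, field="city"):
    """Build an Elasticsearch bool query that matches any city in `cities`.

    Assumption: `field` is an analyzed text field, so a phrase match on
    "city1" also matches a stored value like "my city is city1".
    """
    # One match_phrase clause per city; multi-word names stay intact as phrases.
    should = [{"match_phrase": {field: city}} for city in cities]
    return {
        "query": {
            "bool": {
                "should": should,
                # At least one of the city clauses has to match.
                "minimum_should_match": 1,
            }
        }
    }

body = build_city_query(["city1", "new york"])
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the body of a search request. Building it in code avoids hand-writing a 3000-term OR string.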

Related

How can I get a total sum from one table if it is linked with a lot of tables using LINQ?

I have multiple tables with one-to-many relationships, e.g. Country -> Region -> Center -> Greater -> Section. The Section table has a column for census. I am trying to write a LINQ query for my view to get the total census grouped by Country. I also need to know, for each country, how many regions, centers, and greaters there are, as well as the total census. These can be separate queries; that is no problem.
In this case, the SelectMany method in LINQ is your friend. Each country has a collection of regions, right? You can use .SelectMany to combine all of the regions from several countries into a single collection of regions. Then you need to get a collection of centers from all of the regions, and so on and so on.
Consider this code, which flattens the hierarchy inside each country so that the sum stays tied to that country:
context.Countries
    .Select(country => new
    {
        country.Id,
        TotalCensus = country.Regions
            .SelectMany(region => region.Centers)
            .SelectMany(center => center.Greaters)
            .SelectMany(greater => greater.Sections)
            .Sum(section => section.Census)
    });
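The same flatten-then-aggregate idea can be sketched in Python on hypothetical in-memory data. The dictionaries below are made up to mirror the Country -> Region -> Center -> Greater -> Section hierarchy from the question. The sketch also collects the per-country counts of regions, centers, and greaters that the question asks for.

```python
# Hypothetical data mirroring Country -> Region -> Center -> Greater -> Section.
countries = [
    {"id": 1, "regions": [
        {"centers": [
            {"greaters": [
                {"sections": [{"census": 10}, {"census": 5}]},
                {"sections": [{"census": 7}]},
            ]},
        ]},
        {"centers": []},
    ]},
    {"id": 2, "regions": []},
]

def summarize(country):
    """Flatten one level at a time (the SelectMany idea), then count and sum."""
    regions = country["regions"]
    centers = [c for r in regions for c in r["centers"]]
    greaters = [g for c in centers for g in c["greaters"]]
    sections = [s for g in greaters for s in g["sections"]]
    return {
        "country_id": country["id"],
        "regions": len(regions),
        "centers": len(centers),
        "greaters": len(greaters),
        "total_census": sum(s["census"] for s in sections),
    }

summaries = [summarize(c) for c in countries]
for row in summaries:
    print(row)
```

Flattening inside each country, rather than across all countries first, is what keeps every total attached to the country it belongs to.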

How to optimize and improve a neo4j cypher query with multiple matches and an increasing number of where clauses

In my neo4j graph database I have some Restaurants linked to the Foods listed on their menus and also linked to the Cities where they are located.
I’m trying to search for restaurants in London that each offer different kinds of food from one another, so I started by looking for two different restaurants in this way:
MATCH (f1:Food)--(r1:Restaurant)-[:LOCATED_IN]-(c1:City{name:'London'})
WHERE f1.name in [name1, name2, name3, name4]
MATCH (f2:Food)--(r2:Restaurant)-[:LOCATED_IN]-(c2:City{name:'London'})
WHERE r1.id<>r2.id and f1.name<>f2.name
RETURN r1.name as Restaurant_A, collect(distinct f1.name) as Food_A, r2.name as Restaurant_B, collect(distinct f2.name) as Food_B
LIMIT 1
But if I want to find a third restaurant I need to add more WHERE clauses, and so on if I want a fourth restaurant, a fifth one…
This is the example for the third restaurant:
MATCH (f1:Food)--(r1:Restaurant)-[:LOCATED_IN]-(c1:City{name:'London'})
WHERE f1.name in [name1, name2, name3, name4]
MATCH (f2:Food)--(r2:Restaurant)-[:LOCATED_IN]-(c2:City{name:'London'})
WHERE r1.id<>r2.id and f1.name<>f2.name
MATCH (f3:Food)--(r3:Restaurant)-[:LOCATED_IN]-(c3:City{name:'London'})
WHERE r3.id<>r1.id and r3.id<>r2.id and f3.name<>f1.name and f3.name<>f2.name
RETURN r1.name as Restaurant_A, collect(distinct f1.name) as Food_A,r2.name as Restaurant_B, collect(distinct f2.name) as Food_B, r3.name as Restaurant_C, collect(distinct f3.name) as Food_C
LIMIT 1
I’d really like to know if there is an alternative way to do it. I’m new to neo4j and every suggestion is more than welcome.
Assuming that you provide a collection of :Food nodes in an array, you could do
MATCH (f:Food)--(r:Restaurant)-[:LOCATED_IN]-(c1:City{name:'London'})
WHERE f IN $foodNodes
RETURN r.name as restaurant, COLLECT(DISTINCT f.name) AS foods
to retrieve the restaurants and their foods.

Neo4j cypher query performance issue with movie recommendation query

I'm currently working on a movie recommendation query that should return the movies with the most "recommendation impact" using the following cypher query:
match (m:Movie)
with m, size((m)<-[:LIKED]-(:User)-[:LIKED]->(:Movie)) as score
order by score desc
limit 10
return m.title, score
After reading the graphdb (neo4j) e-book, my assumption was that this would be an easy query for neo4j, but the execution took 32737 ms, which is not what I was expecting. Does anyone have experience with these kinds of queries and have any suggestions to improve performance? Or should this query perform well, and do I need to do some neo4j / Java configuration tuning?
The profile of the query and the result: (images not included)
Maybe this is something you can pre-calculate.
Your score is related to the number of movies liked by each user. Why not calculate and store the number of movies liked by each user (assuming a user can only like a movie once, not multiple times)?
Note that this only makes sense if you only care about the number of movies liked by each user, and are okay with adding those up, even if they represent multiple likes of the same movie across many users.
MATCH (u:User)
SET u.likedCount = SIZE((u)-[:LIKED]->(:Movie))
You will need to update this every time the user likes (or unlikes) another movie.
When this is pre-populated for all users, your scoring query now becomes:
MATCH (m:Movie)
WITH m
MATCH (m)<-[:LIKED]-(u:User)
WITH m, SUM(u.likedCount) as score
ORDER BY score desc
LIMIT 10
RETURN m.title, score
EDIT
This of course includes the likes from each user of the movie in question. If you really need to account for this, you'll need to adjust your WITH clause to:
WITH m, SUM(u.likedCount) - count(u) as score
If you only want to count distinct movies liked by users in your scoring, then you can't pre-calculate and have to use something like stdob--'s answer.
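To check the arithmetic behind the pre-calculated score, here is a small Python sketch on made-up data (the users and movie titles are hypothetical). It compares the adjusted score, the sum of likedCount minus the number of fans, with a direct count of two-hop paths from a movie through a user to another liked movie. It assumes each user can like a movie only once.

```python
# Hypothetical likes: user -> set of liked movies (one like per user and movie).
likes = {
    "ann": {"Alien", "Brazil", "Casablanca"},
    "bob": {"Alien", "Brazil"},
    "cat": {"Alien"},
}

# Pre-calculated once per user, like the likedCount property in the answer.
liked_count = {user: len(movies) for user, movies in likes.items()}

def precomputed_score(movie):
    """SUM(u.likedCount) - count(u) over the users who liked `movie`."""
    fans = [u for u, movies in likes.items() if movie in movies]
    return sum(liked_count[u] for u in fans) - len(fans)

def direct_score(movie):
    """Count (movie)<-(user)->(other) paths where other is a different movie."""
    return sum(
        1
        for movies in likes.values() if movie in movies
        for other in movies if other != movie
    )

for title in ("Alien", "Brazil", "Casablanca"):
    print(title, precomputed_score(title), direct_score(title))
```

On this data the two functions agree for every movie, which is what the subtraction of count(u) is meant to achieve. Neither counts distinct movies, so a movie liked by several fans is counted once per fan.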
Try this query:
MATCH (M:Movie)<-[:LIKED]-(:User)-[:LIKED]->(R:Movie)
WITH M,
size( collect(distinct R) ) as score
RETURN M.title as title,
score
ORDER BY score DESC LIMIT 10
As an option:
MATCH (M:Movie)<-[:LIKED]-(:User)-[:LIKED]->(R:Movie)
RETURN M.title as title,
count(R) as score
ORDER BY score DESC LIMIT 10

Finding elements that appear in groups most

Having trouble figuring out how to go about this algorithm.
Input: any number of lists each holding elements grouped by a common attribute
For example,
matched_by_first_name = {"bob" => [person, person, ...], "nancy" => [person, ...], ...}
matched_by_zip_code = {"12345" => [person, person, ...], "56789" => [person, ...], ...}
Output: List of groups of people that appear most frequently in the same groups, with separate "weightings" per input list. So, I might weight two people grouped by the same first name more than I would weight two people grouped by the same zip code.
In other words:
matches = [[person, person], [person], [person, person, person]]
Basically, if there are two persons and for every single grouping they are in the same group, then they should definitely be in the same final matched group. If there's only one group they're not in, then they should probably still be matched (depending on the weighting of that group type).
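One way to approach this, offered only as a sketch under assumptions: add up a weight for every pair of people that shares a group, then merge pairs whose total reaches a threshold using a union-find structure. The function name match_people, the weights, and the threshold are all hypothetical choices, and people are represented as plain strings so that they can be hashed and sorted.

```python
from collections import defaultdict
from itertools import combinations

def match_people(groupings, threshold):
    """Cluster people whose accumulated pair weight reaches `threshold`.

    groupings: list of (weight, {key: [person, ...]}) pairs, one per input list.
    """
    pair_score = defaultdict(float)
    people = set()
    for weight, groups in groupings:
        for members in groups.values():
            people.update(members)
            # Every pair inside one group earns that grouping's weight.
            for a, b in combinations(sorted(set(members)), 2):
                pair_score[(a, b)] += weight

    # Union-find: each person starts as their own cluster.
    parent = {p: p for p in people}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for (a, b), score in pair_score.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for p in sorted(people):
        clusters[find(p)].append(p)
    return sorted(clusters.values())

by_first_name = {"bob": ["bob_a", "bob_b"], "nancy": ["nancy_a"]}
by_zip_code = {"12345": ["bob_a", "bob_b", "nancy_a"]}
matches = match_people([(2.0, by_first_name), (1.0, by_zip_code)], threshold=2.5)
print(matches)
```

Because merging is transitive, two people can end up in the same cluster through a chain of strong pairs even if their own pair score is below the threshold. Whether that is desirable depends on the data.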

Slow Neo4j query despite indices

Here I'm trying to find all Twitter users who are followed by and who follow any members of some group G:
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE (x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}})
AND (y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}})
RETURN t.id
But for the group G I sometimes have their screen names and sometimes have their ids, thus the OR clause above. Unfortunately this query is long running and doesn't appear to ever return.
I have indices and constraints on both id and screen_name:
Indexes
ON :User(screen_name) ONLINE (for uniqueness constraint)
ON :User(id) ONLINE (for uniqueness constraint)
Constraints
ON (user:User) ASSERT user.screen_name IS UNIQUE
ON (user:User) ASSERT user.id IS UNIQUE
If I get rid of the OR clause (for instance if I happen to have all screen_names or all ids for group G) then the query runs quite fast.
I'm using neo4j-community-2.1.3 on a Mac. My graph has 286039 nodes, all of which have the User label.
Any ideas to improve this? Otherwise I'll have to chop this up into 4 queries to get all possible combinations of members. This is even more problematic because I really want to keep track of how commonly a user appears in a G-->user-->G relationship, and I'll need to do a lot of extra bookkeeping if the counts are spread among 4 different queries.
Update
I created an issue related to this: https://github.com/neo4j/neo4j/issues/2834
I ended up using
MATCH (x:User) WHERE x.screen_name IN ["apple","banana","coconut"]
WITH collect(id(x)) as x_ids
MATCH (x:User) WHERE x.id in [12345,98765]
WITH x_ids+collect(id(x)) as x_ids
MATCH (y:User) WHERE y.screen_name IN ["apple","banana","coconut"]
WITH x_ids,collect(id(y)) as y_ids
MATCH (y:User) WHERE y.id in [12345,98765]
WITH x_ids,y_ids+collect(id(y)) as y_ids
MATCH (x:User)-[:FOLLOWS]->(t:User)-[:FOLLOWS]->(y:User)
WHERE id(x) in x_ids AND id(y) in y_ids
RETURN count(*) as c, t.screen_name,t.id
ORDER BY c DESC
LIMIT 1000
But this basically represents a hack to get around a place where neo4j isn't using the indices that it could be.
I guess the query does not make use of indexes due to the OR condition. You can verify this by prefixing the query with PROFILE and running it in neo4j-shell.
If the profile shows no index usage, you might split the query up into two parts. The first one fetches the combined list of user ids; instead of the OR we do a UNION on two queries (each using an index lookup):
MATCH (x:User) WHERE x.screen_name in {G_SCREEN_NAMES} RETURN id(x) as ids UNION
MATCH (x:User) WHERE x.id in {G_IDS} RETURN id(x) as ids
On the client side, use the list of node ids as parameter for the next query:
MATCH (x:User)-[:FOLLOWS]->(t)-[:FOLLOWS]->(y)
WHERE id(x) in {ids} AND id(y) in {ids}
RETURN t.id
I've intentionally removed the labels for t and y on the assumption that you can only follow a User and no other kind of node. This removes an unnecessary label check.
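The logic of this two-step approach can be simulated in plain Python on a tiny made-up graph. No database is involved, and the users, edges, and function names are hypothetical. Step one resolves the group by screen name and by id separately and takes the union. Step two counts how often each middle user t appears in an x -> t -> y path with both ends in the group.

```python
from collections import Counter

# Hypothetical data: id -> screen_name, and a set of (follower, followed) edges.
users = {1: "apple", 2: "banana", 3: "cherry", 4: "date"}
follows = {(1, 3), (3, 2), (1, 4), (4, 1), (2, 3)}

def resolve_group(screen_names, ids):
    """Two separate lookups combined with a set union, like the UNION query."""
    by_name = {uid for uid, name in users.items() if name in screen_names}
    by_id = {uid for uid in users if uid in ids}
    return by_name | by_id

def count_middle_users(group):
    """Count x -> t -> y paths per middle user t, with x and y both in `group`."""
    counts = Counter()
    for x, t in follows:
        if x not in group:
            continue
        for t2, y in follows:
            if t2 == t and y in group:
                counts[t] += 1
    return counts

group = resolve_group({"apple"}, {2})
counts = count_middle_users(group)
print(sorted(counts.items()))
```

Keeping the counts in one Counter keyed by t gives the per-user tally the question asks about without splitting the bookkeeping across several queries.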
JnBrymn,
How about this query?
MATCH (x:User)
WHERE x.screen_name IN {{G_SCREEN_NAMES}} OR x.id IN {{G_IDS}}
WITH x
MATCH (x)-[:FOLLOWS]->(t:User)
WITH t
MATCH (t)-[:FOLLOWS]->(y:User)
WHERE y.screen_name IN {{G_SCREEN_NAMES}} OR y.id IN {{G_IDS}}
RETURN t.id
Grace and peace,
Jim

Resources