I have a huge graph with about 11M relationships.
When I run the query MATCH (n) DETACH DELETE n, it seems to take forever to finish.
I did some research and found that I should delete the relationships in batches with a limit, and then the nodes, using this query:
MATCH (n)
OPTIONAL MATCH (n)-[r]-()
WITH r LIMIT 200000
DELETE r
RETURN count(r) as deletedCount
Yet, as I'm doing some performance comparisons, it does not seem right to me to simply sum the individual batch times to get the total time for deleting the whole graph,
especially since that total changes with the limit value (deleting 2,000 relationships at a time is not the same as deleting 20,000 at a time).
How can I solve that problem?
Any help would be appreciated
You can use apoc.periodic.commit to help you with batching. You must install the APOC plugin, which has lots of cool functions that enhance Cypher.
You can use the following cypher query.
CALL apoc.periodic.commit("
  MATCH (node)
  WITH node LIMIT {limit}
  DETACH DELETE node
  RETURN count(*)
", {limit: 10000})
This will run the inner query in batches until it returns 0, which in this case means that no nodes are left in the database. You can play around with different limit settings to see what works best.
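If your APOC version also includes apoc.periodic.iterate, a similar batched approach lets you delete the relationships first and then the nodes; the batch size below is just an example to tune:
// delete all relationships in batches
CALL apoc.periodic.iterate(
  "MATCH ()-[r]->() RETURN r",
  "DELETE r",
  {batchSize: 10000});
// then delete the now relationship-free nodes in batches
CALL apoc.periodic.iterate(
  "MATCH (n) RETURN n",
  "DELETE n",
  {batchSize: 10000});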
Hope this helps
We are trying to optimize a query, but its runtime explodes (~20 seconds) with only around 40K nodes in the database; it should be way faster.
First, a simplified description of our schema. We have the following nodes:
Usergroup
Feature
Asset
Section
We also have the following relationships:
A Feature has only one Section (IS_IN_SECTION)
A Feature has one or more Asset (CONTAINS_ASSET)
An asset may be restricted for a Usergroup (HAS_RESTRICTED_ASSET)
A Feature may be restricted for a Usergroup (HAS_RESTRICTED_FEATURE)
A Section, and therefore, all the Feature of that Section, may be restricted for a Usergroup (HAS_RESTRICTED_SECTION)
A Usergroup may have a parent Usergroup (HAS_PARENT_GROUP) and it should fulfill its restrictions and those of its parents
The goal is, given a Usergroup, to list the top 20 assets, ordered by date, that don't have any restrictions for that Usergroup.
The current query is similar to:
(1)
MATCH path=(:UserGroup {uid: $usergroup_uid})-[:HAS_PARENT_GROUP*0..]->(root:UserGroup)
WHERE NOT (root)-[:HAS_PARENT_GROUP]->(:UserGroup)
WITH nodes(path) AS usergroups
UNWIND usergroups AS ug
(2)
MATCH (node:Asset)
WHERE NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug)
AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug)
RETURN DISTINCT node
ORDER BY node.date DESC
SKIP 0
LIMIT 20
We have a few more types of restrictions but here we have the main idea.
Some observations we have made are:
If we execute query part (1) adding RETURN ug after the UNWIND, it is solved in 1ms.
If we change query part (1) to MATCH (ug:Usergroup {uid: $usergroup_uid}), ignoring the parent groups, the whole query is solved in around 800ms. If we then add back the original part (1), it takes 8 seconds even when the Usergroup has no parents.
Currently, our database is small compared to the expected number of nodes (~6 million), the number of restrictions will grow, and we need to optimize this kind of query.
For that, we have these questions:
Are the NOT restriction conditions (e.g. NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)) the right approach in this kind of situation, or are there other approaches to get the job done more efficiently?
Do we need any type of index?
Is the structure of the schema right, or are there any inefficiencies?
How can we rewrite part (1) of the query, and what do you think is causing the overhead in it?
The database version is Neo4j 3.5.X
Thanks in advance.
Let me answer your questions one by one:
The NOT restriction type of conditions can prove inefficient if you provide a set of paths within the restrictions, because it can lead to duplicate work. Consider the following two restrictions in your query:
NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug)
and NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug)
In both of these checks, Neo4j might look for the CONTAINS_ASSET relationship and the Feature nodes separately, once for the first path match and then again for the second. If that happens, this duplicate processing should be reduced. You should profile your query in Neo4j Browser to see how the query planner plans and executes it.
In terms of indexes, you can create two: the first on the Usergroup uid field, and the second on the date field of the Asset node; the latter might help if you have a lot of Asset nodes and the date key stores a string. Again, profile your query to see which indexes actually come into play during execution.
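In Neo4j 3.5 syntax, and assuming the UserGroup label spelling from your part (1) query plus the uid/date property names, creating them would look like this (adjust the names if they differ):
CREATE INDEX ON :UserGroup(uid);
CREATE INDEX ON :Asset(date);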
In terms of schema, I noticed that in the second part of your query all three checks,
NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug) AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug) AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug),
are basically asking whether an asset is restricted for a user group, either directly, via a feature, or via a section. One thing we can do here is create intermediate relationships between an Asset and a UserGroup node. For example, we can have an IS_RESTRICTED_DUE_TO_A_FEATURE relationship between an asset and a user group node if the asset is part of a feature to which the user group has restricted access. This way, your path match shrinks from NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug) to NOT (node)<-[:IS_RESTRICTED_DUE_TO_A_FEATURE]-(ug), which should be faster. Obviously, this change will impact your other CRUD operations, and you might want to store some properties in the new relationships as well.
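A query to materialize such a relationship (run once, and re-run whenever restrictions change) could look roughly like this; the relationship name and the copied restriction_type property are only a sketch, not tested against your data:
MATCH (ug:UserGroup)-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]->(:Feature)-[:CONTAINS_ASSET]->(a:Asset)
MERGE (ug)-[:IS_RESTRICTED_DUE_TO_A_FEATURE {restriction_type: "view"}]->(a);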
For this part, I am not sure what is causing the overhead, but I suggest you add the index on the Usergroup uid field, if it is not present, and then modify your first part to this:
MATCH (ug:UserGroup {uid: $usergroup_uid})-[:HAS_PARENT_GROUP*0..]->(root:UserGroup) WHERE NOT (root)-[:HAS_PARENT_GROUP]->(:UserGroup) RETURN ug
If it performs well, then try modifying your second part to this:
MATCH (node:Asset)
OPTIONAL MATCH (node)<-[:CONTAINS_ASSET]-(f:Feature)-[:IS_IN_SECTION]->(s:Section)
WITH ug, node, f, s
WHERE (s IS NULL OR NOT (s)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug))
AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
AND (f IS NULL OR NOT (f)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug))
RETURN DISTINCT node
ORDER BY node.date DESC
SKIP 0
LIMIT 20
Please try out the above suggestions. Also, the above queries are not tested, so please adjust them a bit if the output format is unexpected. Do profile them to figure out the slowest parts of the queries. Hopefully, it helps.
I am trying to build out a social graph between 100k users. Users can sync other social media platforms or upload their own contacts. Building each relationship takes about 200ms. Currently, I have everything uploaded on a queue so it can run in the background, but ideally, I can complete it within the HTTP request window. I've tried a few things and received a few warnings.
Added an index to the field pn
Getting a warning: "This query builds a cartesian product between disconnected patterns." I understand why I am getting this warning, but no relationship exists yet, and that's what I am building in this initial call.
MATCH (p1:Person {userId: "....."}), (p2:Person) WHERE p2.pn = "....." MERGE (p1)-[:REL]->(p2) RETURN p1, p2
Any advice on how to make it faster? Ideally, each relationship creation is around 1-2ms.
You may want to EXPLAIN the query and make sure that NodeIndexSeeks are being used, and not NodeByLabelScan. You also mentioned an index on :Person(pn), but you have a lookup on :Person(userId), so you might be missing an index there, unless that was a typo.
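If that index is indeed missing, creating it is a one-liner (3.x-style syntax shown here; adjust to your Neo4j version):
CREATE INDEX ON :Person(userId);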
Regarding the cartesian product warning: disregard it. The cartesian product is necessary in order to get the two nodes for creating the relationship, and it should be a 1 x 1 = 1 row operation, so it's only going to be costly if multiple nodes are being matched per side or if index lookups aren't being used.
If these are part of some batch load operation, then you may want to apply your query in batches. So if 100 contacts are being loaded by a user, you do NOT want to execute 100 queries, with each query adding a single contact. Instead, pass the list of contacts as a parameter, then UNWIND the list and apply the query once to process the entire batch.
Something like:
UNWIND $batch as row
MATCH (p1:Person {pn: row.p1}), (p2:Person {pn: row.p2})
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2
It's usually okay to batch 10k or so entries at a time, though you can adjust that depending on the complexity of the query.
Check out this blog entry for how to apply this approach.
https://dzone.com/articles/tips-for-fast-batch-updates-of-graph-structures-wi
You can use the index you created on Person by suggesting a planner hint.
Reference: https://neo4j.com/docs/cypher-manual/current/query-tuning/using/#query-using-index-hint
CREATE INDEX ON :Person(pn);
MATCH (p1:Person {userId: "....."})
WITH p1
MATCH (p2:Person) USING INDEX p2:Person(pn)
WHERE p2.pn = "....."
MERGE (p1)-[:REL]->(p2)
RETURN p1, p2
I'm currently working on movie recommendation using the MovieLens 20M dataset after reading https://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/. A Movie node connects to a Genre with the relationship hasGenre, and a Movie node connects to a User with the relationship hasRating. I'm trying to retrieve all movies that are most highly co-rated (ratings > 3.0) with a query movie (e.g. Toy Story) and that share all genres with it. Here's my Cypher query:
MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]-(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (inputMovie)<-[r:hasRating]-(User)-[o:hasRating]->(movie)-[:hasGenre]->(genre)
WITH inputGenres, r, o, movie, COLLECT(genre) AS genres
WHERE ALL(h in inputGenres where h in genres) and (r.rating>3 and o.rating>3)
RETURN movie.title,movie.movieId, count(*)
ORDER BY count(*) DESC
However, it seems that my system cannot handle it (16GB of RAM, Core i7 4th gen, and an SSD). When I run the query, RAM usage peaks at 97% and then Neo4j shuts down unexpectedly (probably due to the heap size or the amount of RAM).
Is my query correct? I'm a newbie in Neo4j, so I may well have written it incorrectly.
How can I optimize such a query?
How can I configure Neo4j so it can handle a large dataset with my system's specs for this kind of query?
Thanks in advance.
First, your Cypher can be simplified for more efficient planning by matching only what we need and handling the rest in the WHERE clause (so that filtering can possibly be done while matching):
MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (inputMovie)<-[r:hasRating]-(User)-[o:hasRating]->(movie)
WHERE (r.rating>3 and o.rating>3) AND ALL(genre in inputGenres WHERE (movie)-[:hasGenre]->(genre))
RETURN movie.title,movie.movieId, count(*)
ORDER BY count(*) DESC
Now, if you don't mind adding data to the graph to find the data you want, another thing you can do is split the query into tiny bits and "cache" the results. For example:
// Cypher 1
MATCH (inputMovie:Movie {movieId: 1})-[r:hasGenre]->(h:Genre)
WITH inputMovie, COLLECT (h) as inputGenres
MATCH (movie:Movie)
WHERE ALL(genre in inputGenres WHERE (movie)-[:hasGenre]->(genre))
// Merge so that multiple runs don't create extra copies
MERGE (inputMovie)-[:isLike]->(movie)
// Cypher 2
MATCH (movie:Movie)<-[r:hasRating]-(user)
WHERE r.rating>3
// Merge so that multiple runs don't create extra copies
MERGE (user)-[:reallyLikes]->(movie)
// Cypher 3
MATCH (inputMovie:Movie{movieId: 1})<-[:reallyLikes]-(user)-[:reallyLikes]->(movie:Movie)<-[:isLike]-(inputMovie)
RETURN movie.title,movie.movieId, count(*)
ORDER BY count(*) DESC
i'm building a traffic schedule application using Neo4J, NodeJS and GTFS-data; currently, i'm trying to get
things working for the traffic on a single day on the Berlin subway network. these are the grand totals
i've collected so far:
10 routes
211 stops
4096 trips
83322 stoptimes
to put it simply, GTFS (General Transit Feed Specification) has the concept of a stoptime which denotes the
event of a given train or bus stopping for passengers to board and alight. stoptimes happen on a trip,
which is a series of stoptimes, they happen on a specific date and time, and they happen on a given
stop for a given route (or 'line') in a transit network. so there's a lot of references here.
the problem i'm running into is the amount of data and the time it takes to build the database. in order
to speed up things, i've already (1) cut down the data to a single day, (2) deleted the database files
and have the server create a fresh one (very effective!), (3) searched a lot to get better queries. alas,
with the figures as given above, it still takes 30~50 minutes to get all the edges of the graph.
these are the indexes i'm building:
CREATE CONSTRAINT ON (n:trip) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stop) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:route) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stoptime) ASSERT n.id IS UNIQUE;
CREATE INDEX ON :trip(`route-id`);
CREATE INDEX ON :stop(`name`);
CREATE INDEX ON :stoptime(`trip-id`);
CREATE INDEX ON :stoptime(`stop-id`);
CREATE INDEX ON :route(`name`);
i'd guess the unique primary keys should be most important.
and here are the queries that take up like 80% of the running time (with 10% that are unrelated to Neo4J,
and 10% needed to feed the node data using plain HTTP post requests):
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
MATCH (stoptime:`stoptime`), (trip:`trip`)
WHERE stoptime.`trip-id` = trip.id
CREATE UNIQUE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]-(stoptime);
MATCH (stoptime:`stoptime`), (stop:`stop`)
WHERE stoptime.`stop-id` = stop.id
CREATE UNIQUE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]-(stoptime);
MATCH (a:stoptime), (b:stoptime)
WHERE a.`trip-id` = b.`trip-id`
AND ( a.idx + 1 = b.idx OR a.idx - 1 = b.idx )
CREATE UNIQUE (a)-[:linked]-(b);
MATCH (stop1:stop)-->(a:stoptime)-[:linked]->(b:stoptime)-->(stop2:stop)
CREATE UNIQUE (stop1)-[:distance {`~label`: 'distance', value: 0}]-(stop2);
the first query is still in the range of some minutes which i find longish given that there are only
thousands (not hundreds of thousands or millions) of trips in the database. the subsequent queries that
involve stoptimes take several ten minutes each on my desktop machine.
(i've also calculated whether the schedule really contains 83322 stoptimes each day, and yes, it's plausible:
in Berlin, subway trains run on 10 lines for 20 hours a day with 6 or 12 trips per hour, and there are 173
subway stations: 10 lines x 2 directions x 17.3 stops per line x 20 hours x 9 trips per hour gives 62280,
close enough. there are some faulty? / double / extra stop nodes in the data (211
stops instead of 173), but those are few.)
frankly, if i don't find a way to speed up things at least tenfold (rather more), it'll make little sense to use Neo4J
for this project. just in order to cover the single city of Berlin many, many more stoptimes have to be added,
as the subway is just a tiny fraction of the overall public transport here (e.g. bus and tramway have like
170 routes with 7,000 stops, so expect around 7,000,000 stoptimes each day).
Update: the above edge creation queries, which i perform one by one, have now been running for over an hour and have not yet finished, meaning that (if things scale in a linear fashion) the time needed to feed the Berlin public transport data for a single day would be something like a week. therefore, the code currently performs several orders of magnitude too slow to be viable.
Update: #MichaelHunger's solution did work; see my response below.
I just imported 12M nodes and 12M rels into Neo4j in 10 minutes using LOAD CSV.
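To give a rough idea (untested; the file name and column names below just assume a standard GTFS stop_times.txt export), a batched import of stoptimes plus their edges could look like this:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///stop_times.csv" AS row
MATCH (trip:trip {id: row.trip_id})
MATCH (stop:stop {id: row.stop_id})
CREATE (st:stoptime {`trip-id`: row.trip_id, `stop-id`: row.stop_id, idx: toInt(row.stop_sequence)})
CREATE (trip)-[:`trip/stoptime`]->(st)
CREATE (stop)-[:`stop/stoptime`]->(st);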
You should see your issues when you run profiling on your queries in the shell.
Prefix your query with PROFILE and look at the profile output to see whether it uses the index or rather just a label scan.
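For example, to profile just the matching part of your first query without performing the writes:
PROFILE
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
RETURN count(*);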
Do you use parameters for your insert queries, so that Neo4j can re-use the built query plans?
For queries like this:
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
It will very probably not use your index.
Can you perhaps point to your datasource? We can convert it into CSV if it isn't and then import even more quickly.
Perhaps we can create a graph gist for your model?
I would rather use:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
For your initial import you also don't need CREATE UNIQUE, as you match every trip only once.
And I'm not sure what your "~label" is good for?
Similar for your other queries.
As the data is public it would be cool to work together on this.
Something I'd love to hear more about is how you plan do express your query use-cases.
I had a really great discussion about timetables for public transport with training attendees last time in Leipzig. You can also email me on michael at neo4j.org
Also perhaps you want to check out these links:
Tramchester
http://www.thoughtworks.com/de/insights/blog/transforming-travel-and-transport-industry-one-graph-time
http://de.slideshare.net/neo4j/graph-connect-v5
https://www.youtube.com/watch?v=AhvECxOhEX0
London Tube Graph
http://blog.bruggen.com/2013/11/meet-this-tubular-graph.html
http://www.markhneedham.com/blog/2014/03/03/neo4j-2-1-0-m01-load-csv-with-rik-van-bruggens-tube-graph/
http://www.markhneedham.com/blog/2014/02/13/neo4j-value-in-relationships-but-value-in-nodes-too/
detailed solution
i'm happy to report that #MichaelHunger's solution works like a charm. i modified the edge-building queries
from the question into the shapes below, which keep to the suggested query outline:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]->(route)
MATCH (trip:`trip`)
MATCH (stoptime:`stoptime` {`trip-id`: trip.id})
CREATE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]->(stoptime)
MATCH (stop:`stop`)
MATCH (stoptime:`stoptime` {`stop-id`: stop.id})
CREATE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]->(stoptime)
MATCH (a:stoptime)
MATCH (b:stoptime {`trip-id`: a.`trip-id`, `idx`: a.idx + 1})
CREATE (a)-[:linked {`~label`: 'linked'}]->(b)
MATCH (stop1:stop)--(a:stoptime)-[:linked]-(b:stoptime)--(stop2:stop)
CREATE (stop1)-[:distance {`~label`: 'distance', value: 0}]->(stop2)
as can be seen, the trick here is to give each participating node a MATCH statement of its own and to
move the WHERE clause inside the second match condition; presumably, as mentioned above, Neo4J can only
then take advantage of its indexes.
with these queries in place, the process of reading in nodes and building edges takes roughly 13 minutes;
of these 13 minutes, fetching the data from an external source, building the node representations and issuing CREATE queries
takes about 10 minutes, and building almost a half million edges between them is done in about 3 minutes.
right now none of my queries (especially the node CREATE statements and updates for stop distances) use
parametrized queries, which is another potential source for performance gains.
as for the ~label field and also the question why i use dashes in names where underscores would be more
convenient, well, that's a long story about what i perceive as good and practical naming that sometimes clashes
with the syntax of some languages (of most languages, should i say). but that's boring detail. maybe more
interesting is the question: why is there a ~label attribute that repeats what the element label says (what
you write after the colon)? well, it's an attempt to comply with Neo4J conventions (we use labels here), take
advantage of the 'identifier, colon, label' syntax of cypher queries, AND to make it so the labels do
appear in the returned values.
mind you, labels are central to graph thinking the Neo4J way, but in query results, labels are
conspicuously absent. when you include a relationship that is marked with nothing but a label in your result set,
that edge will arrive as an empty object, telling you only that there is something but not what. so i decided
to duplicate the label on each single node and each single edge. not an optimal solution, but at least now
i get an informative graph display in the Neo4J browser.
as for how to express query use-cases, that's an active field of research for me right now. i guess it will
all start with a 'field of interest', like 'show all Berlin subway stops', or 'all buses departing within
the next 15 minutes from a bus stop near me'. the data already allows one to see which stops are directly connected
by a subway line, their geographical distance, what services are present and what routes they take. the idea
is to grab the data and present it in novel, usable and beautiful ways. 9292 is quite
close to what i imagine; what's missing are graphical representations of spatial and temporal relationships.
I took geonames.org and imported all their data of German cities with all districts.
If I enter "Hamburg", it lists "Hamburg Center, Hamburg Airport" and so on. The application is in a closed network with no access to the internet, so I can't access the geonames.org web services and have to import the data. :(
The city with all of its districts works as an autocomplete, so each keystroke results in an XHR request and so on.
Now my customer has asked whether it is possible to have all the data of the world in it: in total, about 5,000,000 rows with 45,000,000 alternative names etc.
Postgres needs about 3 seconds per query, which makes the autocomplete unusable.
Now I thought of CouchDB, which I have already worked with. My question:
I would like to post "Ham" and have CouchDB return all documents whose names start with "Ham". If I enter "Hamburg", I want it to return Hamburg and so forth.
Is CouchDB the right database for that? Which other DBs can you recommend that respond with low latency (maybe in-memory) over millions of datasets? The dataset doesn't change regularly; it's rather static!
If I understand your problem right, probably all you need is already built into CouchDB.
To get a range of documents with names beginning with, e.g., "Ham", you may use a request with a string range: startkey="Ham"&endkey="Ham\ufff0".
If you need a more comprehensive search, you may create a view containing the names of other places as keys. Then you can again query ranges using the technique above.
Here is a view function to make this:
function(doc) {
for (var name in doc.places) {
emit(name, doc._id);
}
}
Also see the CouchOne blog post about CouchDB typeahead and autocomplete search and this discussion on the mailing list about CouchDB autocomplete.
Optimized search with PostgreSQL
Your search is anchored at the start and no fuzzy search logic is required. This is not the typical use case for full text search.
If it gets more fuzzy or your search is not anchored at the start, look here for more:
Similar UTF-8 strings for autocomplete field
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
In PostgreSQL you can make use of advanced index features that should make the query very fast. In particular look at operator classes and indexes on expressions.
1) text_pattern_ops
Assuming your column is of type text, you would use a special index for text pattern operators like this:
CREATE INDEX name_text_pattern_ops_idx
ON tbl (name text_pattern_ops);
SELECT name
FROM tbl
WHERE name ~~ ('Hambu' || '%');
This is assuming that you operate with a database locale other than C - most likely de_DE.UTF-8 in your case. You could also set up a database with locale 'C'. I quote the manual here:
If you do use the C locale, you do not need the xxx_pattern_ops
operator classes, because an index with the default operator class is
usable for pattern-matching queries in the C locale.
2) Index on expression
I'd imagine you would also want to make that search case insensitive, so let's take another step and make it an index on an expression:
CREATE INDEX lower_name_text_pattern_ops_idx
ON tbl (lower(name) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(name) ~~ (lower('Hambu') || '%');
To make use of the index, the WHERE clause has to match the index expression.
3) Optimize index size and speed
Finally, you might also want to impose a limit on the number of leading characters to minimize the size of your index and speed things up even further:
CREATE INDEX lower_left_name_text_pattern_ops_idx
ON tbl (lower(left(name,10)) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%');
left() was introduced with Postgres 9.1. Use substring(name, 1,10) in older versions.
4) Cover all possible requests
What about strings with more than 10 characters?
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower(left('Hambu678910',10)) || '%')
AND   lower(name) ~~ (lower('Hambu678910') || '%');
This looks redundant, but you need to spell it out this way to actually use the index. Index search will narrow it down to a few entries, the additional clause filters the rest. Experiment to find the sweet spot. Depends on data distribution and typical use cases. 10 characters seem like a good starting point. For more than 10 characters, left() effectively turns into a very fast and simple hashing algorithm that's good enough for many (but not all) use cases.
5) Optimize disc representation with CLUSTER
So, the predominant access pattern will be to retrieve a bunch of adjacent rows according to our index lower_left_name_text_pattern_ops_idx. And you mostly read and hardly ever write. This is a textbook case for CLUSTER. The manual:
When a table is clustered, it is physically reordered based on the index information.
With a huge table like yours, this can dramatically improve response time because all rows to be fetched are in the same or adjacent blocks on disk.
First call:
CLUSTER tbl USING lower_left_name_text_pattern_ops_idx;
The information about which index to use will be saved, and successive calls will re-cluster the table:
CLUSTER tbl;
CLUSTER; -- cluster all tables in the db that have previously been clustered.
If you don't want to repeat it:
ALTER TABLE tbl SET WITHOUT CLUSTER;
However, CLUSTER takes an exclusive lock on the table. If that's a problem, look into pg_repack or pg_squeeze, which can do the same without exclusive lock on the table.
6) Prevent too many rows in the result
Demand a minimum of, say, 3 or 4 characters for the search string. I add this for completeness, you probably do it anyway.
And LIMIT the number of rows returned:
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%')
LIMIT 501;
If your query returns more than 500 rows, tell the user to narrow down his search.
7) Optimize filter method (operators)
If you absolutely must squeeze out every last microsecond, you can utilize operators of the text_pattern_ops family. Like this:
SELECT name
FROM tbl
WHERE lower(left(name, 10)) ~>=~ lower('Hambu')
AND lower(left(name, 10)) ~<=~ (lower('Hambu') || chr(2097151));
You gain very little with this last stunt. Normally, standard operators are the better choice.
If you do all that, search time will be reduced to a matter of milliseconds.
I think a better approach is to keep your data in your database (Postgres or CouchDB) and index it with a full-text search engine like Lucene, Solr or Elasticsearch.
Having said that, there's a project integrating CouchDB with Lucene.