I'm importing a large dataset (well over 10 million nodes) into Neo4j using the neo4j-import tool. After the import I run several queries over the data. One of those queries performs very badly. I've optimized it as much as I could (PROFILE, using relationship types, splitting it up for multi-core support, and so on).
It still takes too long, so my idea was to tell Neo4j to start at a specific type of node by using the USING INDEX clause. I could then check how my db hits change and possibly make the query workable. Right now, though, my database doesn't have any indexes.
I wanted to create indexes once I was done writing all the queries I need, but it seems I have to start using them already.
I'm wondering if I can create those indexes during the bulk import process. That seems to be a good solution to me. How would I do that?
I also wonder whether it's possible to write a statement that creates an index on a property that exists on every single one of my nodes (let's call it "type").
CREATE INDEX ON :(type);
doesn't work (the label is missing, but I want to omit it).
Indexes are on Labels + Properties. You need indexes right after your import and before you start trying to optimize queries. Anything your query will use to find a starting point should be indexed (user_id, object_id, etc) and probably any dates or properties used for range queries (modified_on, weight, etc).
CREATE INDEX ON :Label(property)
Cypher queries are single-threaded, so I'm not sure what you mean by multi-core support. What did you read about that? Got a link? You can multi-thread Neo4j, but at this point you have to do it manually. See https://maxdemarzi.com/2017/01/06/multi-threading-a-traversal/
Most of the time, the queries can be greatly optimized with an index or expressing it differently. But sometimes you need to redo your model to fit the query. Take a look at https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/ for some hints.
Related
I have data from multiple sources - a combination of Excel (table and non-table), CSV and, sometimes, even TSV.
I create queries for each data source and then bring them together one step at a time. Actually, it's two steps per source: merge, and then expand to bring in the fields I want from that data source.
This doesn't feel very efficient and I think that maybe I should be just joining everything together in the Data Model. The problem when I did that was that I couldn't then find a way to write a single query to access all the different fields spread across the different data sources.
If it were Access, I'd have no trouble creating a single query once I'd created all the relationships between my tables.
I feel as though I'm missing something: How can I build a single query out of the data model?
Hoping my question is clear. It feels like something that should be easy to do but I can't home in on it with a Google search.
It is never a good idea to do the heavy lifting downstream in Power Query itself. If you can, work with database views rather than full tables, use a modular approach (several smaller queries that you then connect in the data model), filter early, remove unneeded columns, etc.
The more work that has to be performed on data you don't really need, the slower the query will be. Please take a look at this article and this one, the latter having a comprehensive list of best practices (you can also just search for that term; there are plenty).
In terms of creating a query from the data model, conceptually that makes little sense, as you could conceivably create circular references galore.
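Picking up the "database views, not full tables" point above, here is a minimal sketch of filtering early and dropping unneeded columns at the source; all table, view and column names are hypothetical, and Power Query would then connect to the slim view instead of the full table:

CREATE VIEW v_sales_slim AS
SELECT order_id, customer_id, order_date, amount   -- only the columns you actually need
  FROM sales
 WHERE order_date >= DATE '2023-01-01';             -- filter early, at the source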
I use stored procedures on DB instance "A" to store data in a GTT (global temporary table). To get the original data I have to go over a DB link to DB instance "B". For that, I put together the whole query and send it to the remote DB instance.
This works fine. But sometimes it seems that Oracle is not using the best plan or the right indexes for the queries. Is there a way to force Oracle to use specific indexes? I tried to use hints, but honestly I didn't understand the difference between all these options.
Thanks for helping me!
There is a huge temptation to optimize a query one way when you want it to work another way. Adding hints is a temporary solution which can backfire on you when the amount or type of data in the table changes or when you upgrade to a newer version with a newer optimizer.
First, determine that there is a problem. Are all queries taking too long? Just some? Only the first one?
The easiest thing to do is to make sure the indexes on that table are up to date. Then look at optimizing the query by using the explain plan feature to see what indexes are being used.
It's also prudent to examine your data to see whether the query is selecting different things, or different numbers of records, if it is time-based.
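If it helps, a rough sketch of the explain plan approach; the table, column and DB-link names here are made up, so substitute your own:

EXPLAIN PLAN FOR
  SELECT o.order_id, o.status
    FROM orders@dblink_b o                   -- hypothetical remote table over the DB link
   WHERE o.created_on >= DATE '2024-01-01';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);     -- shows which indexes the optimizer chose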
As far as I understand, when we run a SQL query with COUNT, DISTINCT or LIKE '%query%' (wildcards on both sides), indexes cannot be used and the database has to do a full table scan.
Is there some way to boost the performance of these queries?
Can they really not use indexes, or can we fix this somehow?
Can we get an index-only scan if we need to return only one column? For example, select count(id) from MY_TABLE: in this case, can we get an index-only scan and avoid hitting the whole table if we have an index on 'id'?
My question is more general: could you give me some performance guidelines for when we have to use the mentioned operators?
UPDATE
In my case, I'm using PostgreSQL.
With PostgreSQL, you can create GIN pg_trgm indexes on text columns to make LIKE '%foo%' faster, though this requires the pg_trgm extension and PostgreSQL 9.1 or higher.
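A minimal sketch, assuming the pg_trgm extension is available (table and column names are made up):

CREATE EXTENSION pg_trgm;

CREATE INDEX idx_docs_body_trgm
    ON docs USING gin (body gin_trgm_ops);    -- trigram GIN index on the text column

-- a LIKE with wildcards on both sides can now use the index
SELECT * FROM docs WHERE body LIKE '%foo%';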
I doubt distinct by itself will ever use an index. I tried in fact and could not get it to use one. You can sort of force an index to be used by using a recursive CTE to pull individual records out (what can be called a "sparse scan"). We do something like this when pulling individual years out of the accounting record. This requires writing special queries though and so isn't really the general case.
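Something like this sketch (the accounting table and year column are stand-ins for whatever you have) walks an index on (year) to emulate SELECT DISTINCT year:

WITH RECURSIVE years AS (
    SELECT min(year) AS year FROM accounting           -- seed with the smallest value
  UNION ALL
    SELECT (SELECT min(a.year) FROM accounting a
             WHERE a.year > y.year)                    -- jump to the next larger value
      FROM years y
     WHERE y.year IS NOT NULL
)
SELECT year FROM years WHERE year IS NOT NULL;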
count(*) is never going to be able to use an index due to MVCC rules. You can get approximate results by looking in the appropriate system catalogs, however.
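For example (the table name is a placeholder, and the number is only as fresh as the last VACUUM/ANALYZE):

SELECT reltuples::bigint AS approx_rows
  FROM pg_class
 WHERE relname = 'my_table';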
Is there any way to force Oracle to use an index, other than hints?
No. And if the optimizer doesn't use the index, it usually has a good reason for it. Index usage, if the index is poor, can actually slow your queries down.
Oracle doesn't use an index when it thinks the index is
disabled
invalid (for example, after a huge data load and the statistics about the index haven't been updated)
won't help (for example, when there are only two different values in 5 million rows)
So the first thing to check is that the index is enabled; then run the appropriate GATHER (DBMS_STATS) call on your index/table/schema. If that doesn't help, Oracle thinks that reading your index will actually take more time than reading the actual row values. In that case, add more columns to the index to make it appear more "diverse".
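Something along these lines, with a made-up table name, refreshes the statistics on the table and its indexes:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,
    tabname => 'MY_TABLE',   -- hypothetical table
    cascade => TRUE          -- also gather statistics on its indexes
  );
END;
/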
You might take a look at Oracle stored outlines. You can take an existing query, create a stored outline, and tweak the query just like with hints. It is just very hard to use. Do some research before you decide to implement stored outlines.
You can add hints to the query that cause the optimizer to look more favorably on one index over another.
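For example, a sketch with made-up table and index names:

SELECT /*+ INDEX(e emp_last_name_ix) */ e.*   -- nudge the optimizer toward this index
  FROM employees e
 WHERE e.last_name = 'SMITH';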
In general, if you have collected good statistics on all the tables and indexes, Oracle usually comes up with very good execution plans.
If your query doesn't include the indexed field in its conditions, then the DB would be foolish to use the index. Thus, I second Donnie's answer.
Yes, technically, you can force Oracle to use an index (without hints), in one scenario: if the table is an index-organized table, then logically the only way to query the table is via its index because there is no table to query.
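A quick sketch with hypothetical names; because the rows live in the primary-key index itself, every access necessarily goes through that index:

CREATE TABLE country_codes (
  iso_code      VARCHAR2(2) PRIMARY KEY,
  country_name  VARCHAR2(60)
) ORGANIZATION INDEX;   -- index-organized: there is no separate table heap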
I am writing a stored procedure to perform a dynamic search that spans 10+ database tables. With millions of records in each table and a dynamic set of search parameters*, I am having some trouble optimizing the procedure.
Is there a "best practice" for building these kinds of queries? E.g. Use strings to build a dynamic query, use a huge list of IF THEN .. ELSE statements, etc? Can anyone provide a simple example or point me to some literature that will help? Here's some psuedocode for the stored procedure I am developing, which accepts a collection of parameters and a ref cursor.
v_query = "SELECT .....";
v_name = ... -- retrieve "name" parameter from collection
if v_name is not null then
v_query := v_query || ' AND table.Name = ' || v_name;
end if;
open search_cursor for v_query;
...
*By "dynamic set of search parameters," I mean that I pass in a collection of parameters. I figured this would be easier than making the caller pass in 20 parameters if they only want to search on one.
There are problems with using the static query approach; also be very careful about using the CURSOR_SHARING=FORCE option - it can really raise hell with your system if you haven't done a coverage test to ensure that all your other queries will work the way you want.
Problems with static queries:
The (x is null or x = col) predicates tend to kill any chance of using indexes. Since the query plan is computed at the time the query is parsed for the first time, the indexes you use will be based on the values from the first run of the query; later runs, which may not constrain on the same columns, will still use the same indexes.
Having one static statement with substitution variables will prevent the optimizer from making an intelligent choice about which index to use based on the data distribution. In a dynamic query (or in the first run of a query with bind variables), Oracle will see how selective your constraint is; a highly selective constraint will become a prime candidate for index use. For example, if your table had a row for every person in the U.S., STATE='Alaska' will be much more likely to use the index on STATE than STATE='California'.
Of course, in both these cases, if the dynamic columns in your WHERE clause are not indexed anyway, it doesn't matter, although I'd be surprised if that were the case in a database the size you're talking about.
Also, consider the real cost of all that hard parsing. Yes, hard parses serialize system resources, which makes them expensive, but only in the context of high volume queries. By their nature, ad-hoc queries do not get run very often. The cost you pay for all the hard parses you incur in an entire day will likely be hundreds of times less than the cost of a single query that uses the wrong indexes.
In the past, I've implemented these systems pretty much like you've done here - a base query portion, then iterating over a constraint list and adding WHERE clause predicates. I don't think it's hard for someone to maintain or understand, especially if you're talking about constraints that don't involve adding a lot of subqueries or extra tables to the FROM clause.
One thing to consider: If this system is primarily an offline one (in other words, not constantly being updated or inserted into - populated by periodic loads of bulk data), you may want to look into using BITMAP indexes. Bitmap indexes differ from regular b-tree indexes in that multiple indexes on a single table can be used simultaneously, and bitmap indexes are much, much smaller on disk than b-trees. They work very well for applications like this - where you will have a variety of constraints that can't be defined at design time. You will only want to put bitmap indexes on columns that have relatively few distinct values - say, one value constitutes no less than 1/1000 of the table - so don't use bitmaps on unique columns.
However, the downside is that bitmap indexes will noticeably degrade the performance of inserts and updates. The best practice for bitmaps is to use them in data warehouse applications, where they are dropped prior to loads and recreated afterwards.
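To make the idea concrete (table and column names are invented), two small bitmap indexes that the optimizer can combine for whatever ad-hoc predicate combination comes in:

CREATE BITMAP INDEX ix_orders_status ON orders (status);   -- few distinct values
CREATE BITMAP INDEX ix_orders_region ON orders (region);   -- few distinct values
-- e.g. WHERE status = 'SHIPPED' AND region = 'EU' can AND the two bitmaps together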
Except in very particular cases, I don't think it is advisable (or even possible) to try to generate an optimized query. My advice is to avoid dynamic SQL if you can: it's hard to read, hard to debug, hard to optimize, and hard to maintain.
First, write a generic query that will work with any parameter sent to your procedure. Based on your example, that would give something like:
SELECT * FROM table WHERE ((v_name IS NULL) OR (table.Name=v_name));
As you see, you could easily add other parameters to this query without using dynamic SQL. This query is much easier to read and debug. Ask your DBA for optimization tips.
Then, if you have a particular set of parameters that you know are often passed together, you could write a specific query for that set, which you can optimize separately. Pseudocode:
IF particular_set
THEN
/* Specific query */
ELSE
/* Generic query */
END IF;
The difficult part is not to end up with too many specific queries here, or you could fall into maintenance hell.
We've had a similar requirement for one of our clients. They have half a dozen tables with millions of rows, and they wanted adhoc search capability on most of the columns.
The solution was a separate package for each table, which would take the search criteria and construct the SQL to run the search. We took advantage of the old system that was being replaced, to discover what the most common types of searches the users were doing, and made sure that those searches ran the best, by tuning the queries that were being generated (supported by the strategic use of indexes). Because each package was only responsible for queries against one table, it could have specific code designed to work with that table (including the odd hint, in a few rare cases).
One question/problem that you'll need to address is: do you hard-code the criteria (e.g. WHERE SURNAME='SMITH') or use bind variables? Using bind variables reduces hard parsing, which reduces load on the database server; however, it can be impractical to use bind variables when the SQL is dynamically generated. The way we ended up going was to set CURSOR_SHARING=FORCE (which has its own disadvantages), which was a reasonable compromise in our case.
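To illustrate the trade-off, a sketch only; the variable and column names are made up, not our actual code:

-- literal: every distinct surname value causes a new hard parse
v_sql := v_sql || ' AND surname = ''' || p_surname || '''';

-- bind placeholder: cursors can be shared, but OPEN ... FOR ... USING needs the
-- bind list to be known when the cursor is opened, which is awkward when the set
-- of predicates is itself dynamic; hence CURSOR_SHARING=FORCE as the compromise
v_sql := v_sql || ' AND surname = :b_surname';
OPEN search_cursor FOR v_sql USING p_surname;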
Read http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6711305251199