Using Neo4j 3.3.2 on Windows 10 64 bit.
Running a simple query on nodes in a database with about 250,000 nodes total.
UNWIND {LIST} AS i MATCH (ig:Ingrd {name: i.NAME}) SET ig.cas = i.CAS
If I create the following index before initially loading the nodes, the query above runs slowly:
CREATE INDEX ON :Ingrd(name)
If I drop that index and instead create the following constraint, the query runs fast:
CREATE CONSTRAINT ON (ig:Ingrd) ASSERT ig.name IS UNIQUE
With the index, the query runs at about 15 transactions/s.
With the unique constraint, it runs at about 7,000 transactions/s.
If I profile, the only difference is 'NodeIndexSeek' vs. 'NodeUniqueIndexSeek'.
Any ideas why it runs 400x slower?
Thanks in advance.
Update: Added images. Everything looks like it should be fine, but when the query is run with 1,000 updates in the list to unwind, the unique constraint is 400x faster.
Update 2: This is run in a single thread, and it is reproducible when run multiple times back to back. The actual query is edited above. The index/constraint above is the only one for this node. For this node (Ingrd) the name field is actually unique, so adding a constraint solved this specific issue, but the problem still exists for other nodes. I can reproduce it for other nodes with non-unique search fields as well.
The performance problem seems to be related to UNWIND. If I skip UNWIND and instead run batched transactions that use plain parameters, it works fine with a non-unique index.
UNWIND {LIST} AS i MATCH (ig:Ingrd {name: i.NAME}) SET ig.cas = i.CAS
is slow.
MATCH (ig:Ingrd {name: {NAME}}) SET ig.cas = {CAS}
is fast.
I switched to UNWIND to try to improve performance, but that does not work with non-unique indexes; using UNWIND is >100x slower.
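For reference, a minimal sketch of the batched call with the shape of the list parameter spelled out; the example values are placeholders I've added, not from the original data:
// The {LIST} parameter is a list of maps, e.g. (placeholder values):
// [{NAME: 'water', CAS: '7732-18-5'}, {NAME: 'ethanol', CAS: '64-17-5'}]
UNWIND {LIST} AS i
MATCH (ig:Ingrd {name: i.NAME})
SET ig.cas = i.CAS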
Related
I added an index to an existing table to check the performance. The table has 1.5 million records. Without the index the cost is "58645"; once the index is created the cost drops to "365". To test this, I have several times marked the index as "unusable" and then altered and rebuilt it. As of yesterday the Oracle explain plan showed the index being used. But today, after marking the index unusable and rebuilding it, the explain plan no longer shows an index scan, even though performance is still faster than before. I have dropped and re-created the index, but the issue remains: fetching is fast, yet the explain plan shows the index is not being used and the cost is back to "58645". I am stuck with this.
Often, when you create a new index or rebuild one from scratch, it doesn't show up in the explain plan and sometimes isn't used for a while either. To correct the explain plan, statistics should be gathered on the index.
Use DBMS_STATS.GATHER_INDEX_STATS, or DBMS_STATS.GATHER_TABLE_STATS with the cascade option.
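For example (a hedged sketch; the schema, table, and index names are placeholders):
-- Gather statistics on the index itself.
EXEC DBMS_STATS.GATHER_INDEX_STATS(ownname => 'MY_SCHEMA', indname => 'MY_INDEX');
-- Or gather table statistics and cascade to its indexes.
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'MY_SCHEMA', tabname => 'MY_TABLE', cascade => TRUE);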
Blocks of data are cached in the BUFFER_POOL, which will affect your results such that:
1. Run query;
2. Change index;
3. Run query - buffered data from step 1 will skew the performance;
4. Flush the buffer pool;
5. Run query - now you get a truer measure of how "fast" the query is.
Did you flush the buffer?
ALTER SYSTEM FLUSH BUFFER_POOL;
The title is probably not very clear, so let me explain.
I want to do an in-process join (Node.js) on 2 tables*, Session and SessionAction (1-N).
Since these tables are rather big (millions of records each), my idea was to get slices based on an orderBy on sessionId (which they both share) and walk through both tables in lock-step batches.
This however proves to be awfully slow. I'm using pseudocode as follows for both tables to get the batches:
r.table('x').orderBy({index: 'sessionId'}).filter(row => row('sessionId').gt(start).and(row('sessionId').lt(y)))
It seems that even though I'm essentially filtering on an attribute, sessionId, which has an index, the query planner is not smart enough to see this, and every query does a complete table scan to do the orderBy before filtering afterwards (or so it seems).
Of course, this is incredibly wasteful, but I don't see another option. E.g.:
Ordering after filtering is not supported by RethinkDB.
Getting a slice of the ordered table doesn't work either, since slice enumeration (i.e. the xth through the yth record), for lack of a better word, doesn't line up between the 2 tables.
Questions:
Is my approach indeed expected to be slow, due to having to do a table scan at each iteration/batch?
If so, how could I design my queries to get it working faster?
*) It's too involved to do it using RethinkDB ReQL only.
filter is never indexed in RethinkDB. (In general a particular command will only use a secondary index if you pass index as one of its optional arguments.) You can write that query like this to avoid scanning over the whole table:
r.table('x').orderBy({index: 'sessionId'}).between(start, y, {index: 'sessionId'})
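Building on that, a hedged sketch of how the batched walk might look from the Node.js driver; the connection handle, batch size, and the assumption that sessionId is unique within Session are mine, not part of the original answer:
const r = require('rethinkdb');
const BATCH = 10000; // assumed batch size

async function forEachSessionBatch(conn, handleBatch) {
  let lastId = r.minval; // start before the smallest sessionId
  for (;;) {
    const cursor = await r.table('Session')
      .orderBy({index: 'sessionId'})
      .between(lastId, r.maxval, {index: 'sessionId', leftBound: 'open'})
      .limit(BATCH)
      .run(conn);
    const rows = await cursor.toArray();
    if (rows.length === 0) break;              // no more data
    await handleBatch(rows);                   // e.g. fetch the matching SessionAction slice and join in-process
    lastId = rows[rows.length - 1].sessionId;  // resume after the last sessionId seen
  }
}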
I've encountered a situation that I can't understand.
I've created a patch for the DB; it drops constraints (primary keys) and then drops the indexes that are bound to those constraints. It worked flawlessly a few times on the test environment, but when we finally ran it on prod, it crashed on the first index. The test DB was re-created from production a few times (though I don't know exactly how it was done), and there weren't any problems. How is it possible that the error didn't occur while we were testing?
It is possible that dropping a constraint also drops the supporting index, depending on how the constraint and index were created in the first place, so your script ought to check for that.
It is more robust to create non-unique indexes and then add primary and unique constraints that use those indexes, so that you can drop a constraint without losing the benefits of having an index in place.
The difference in behaviour when dropping constraints based on an existing or system-generated index is documented here: http://docs.oracle.com/database/121/SQLRF/clauses002.htm#CJAGEBIG
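A hedged sketch of that pattern, with placeholder table, column, and constraint names:
-- Create a non-unique index first, then let the constraint use it.
CREATE INDEX emp_id_ix ON emp (emp_id);
ALTER TABLE emp ADD CONSTRAINT emp_pk PRIMARY KEY (emp_id) USING INDEX emp_id_ix;
-- Dropping the constraint now leaves emp_id_ix in place.
ALTER TABLE emp DROP CONSTRAINT emp_pk;
-- Alternatively, when the constraint's index was system-generated, KEEP INDEX asks Oracle to retain it.
ALTER TABLE emp DROP CONSTRAINT emp_pk KEEP INDEX;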
Sorry if this is a dumb question, but do I need to reindex my table every time I insert rows, or does the new row get indexed when it is added?
From the manual
Once an index is created, no further intervention is required: the system will update the index when the table is modified
http://postgresguide.com/performance/indexes.html
I think the index does get updated when you insert rows; the sort order in the index is maintained as you insert data. Hence there can be performance issues or downtime on a table if you try to add a large number of rows at once.
On top of the other answers: PostgreSQL is a top-notch relational database. I'm not aware of any relational database system where indices are not updated automatically.
It seems to depend on the type of index. For example, according to https://www.postgresql.org/docs/9.5/brin-intro.html, for BRIN indexes:
When a new page is created that does not fall within the last summarized range, that range does not automatically acquire a summary tuple; those tuples remain unsummarized until a summarization run is invoked later, creating initial summaries. This process can be invoked manually using the brin_summarize_new_values(regclass) function, or automatically when VACUUM processes the table.
Although this seems to have changed in version 10.
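For illustration, a hedged sketch of triggering that summarization by hand (the index and table names are placeholders):
-- Summarize any page ranges the BRIN index has not yet covered.
SELECT brin_summarize_new_values('my_brin_index'::regclass);
-- VACUUM also summarizes new ranges while it processes the table.
VACUUM my_table;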
Using Informatica 9.1.0
Scenario
Get the Dimension key generated and inserted into the Fact table as part of the Fact load.
I have to load the Fact table with a dimension key along with other columns. The dimension record is created from within the same mapping. There are five different sessions using the same mapping, and they execute simultaneously to load the Fact table. In this case I'm using a dynamic lookup with 'Synchronize dynamic cache' enabled to get unique dimension records generated from the 5 sessions based on some conditions. The dimension ID is generated using the Sequence-ID in the associated expression of the lookup. When a single session is run on its own it works perfectly fine. But when the sessions are run in parallel, it starts to show unique key violation errors, as random sessions try to insert a sequence value that is already there.
To fix the issue I had to enable a persistent lookup cache and set a cache file name prefix. But I did not find this solution, or this issue, in any of the forums or INFA communities, so I'm not sure whether this is the right way of doing it or whether this is a bug of some kind.
Please let me know if you have had a similar issue or have different thoughts.
Thanks in advance
One other possible solution I can think of is to have the database generate the key from a sequence instead of using Informatica's sequencer. The database should be capable of avoiding any unique key violations.
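A hedged sketch of that alternative, using Oracle syntax and placeholder object names:
-- Let the database hand out dimension keys instead of the Informatica sequencer.
CREATE SEQUENCE dim_key_seq START WITH 1 INCREMENT BY 1 CACHE 100;
-- Each of the five parallel sessions draws its key from the same sequence, so duplicates cannot occur.
INSERT INTO dim_table (dim_key, dim_name)
VALUES (dim_key_seq.NEXTVAL, :dim_name);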