I have started working with Neo4j recently and I have a performance problem with a MERGE query used to build my graph.
I have a CSV file with 100,000 records and want to load the data from this file.
My query for loading is as follows:
//Script to import global Actors data
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///D:/MOT/test_data.csv" AS row
MERGE (c:Country {Name:row.Country})
MERGE (a:Actor {Name: row.ActorName, Aliases: row.Aliases, Type:row.ActorType})
My system configuration:
8.00 GB RAM and Core i5-3330 CPU.
My Neo4j config is as follows:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=50M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
mapped_memory_page_size=1048576
label_block_size=60
array_block_size=120
node_auto_indexing=False
string_block_size=120
When I run this query in the Neo4j browser it takes more than a day. Would you please help me solve the problem? Please let me know, for example, whether I should change my JVM configuration, change my query, or something else, and how?
To increase the speed of MERGE queries you should create indexes on your MERGE properties:
CREATE INDEX ON :Country(Name)
CREATE INDEX ON :Actor(Name)
If you have unique node properties, you can increase performance even more by using uniqueness constraints instead of normal indexes:
CREATE CONSTRAINT ON (node:Country) ASSERT node.Name IS UNIQUE
CREATE CONSTRAINT ON (node:Actor) ASSERT node.Name IS UNIQUE
In general your query will be faster if you MERGE on a single, indexed property only:
//Script to import global Actors data
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///D:/MOT/test_data.csv" AS row
MERGE (c:Country {Name:row.Country})
MERGE (a:Actor {Name: row.ActorName})
// if necessary, you can set properties here
ON CREATE SET a.Aliases = row.Aliases, a.Type = row.ActorType
As already answered on the Google group.
It should just take a few seconds.
I presume:
you use Neo4j 2.3.2?
you created indexes/constraints for the things you merge on?
you configured your Neo4j instance to run with at least 4G of heap? (see the config sketch after this list)
you are using PERIODIC COMMIT?
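For reference, a minimal sketch of where heap and page cache are configured in a 2.3 install (example values for an 8 GB machine, adjust as needed):
# conf/neo4j-wrapper.conf
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096
# conf/neo4j.properties
dbms.pagecache.memory=2g
As far as I remember, the neostore.*.mapped_memory settings you posted belong to older releases; on 2.2/2.3 the page cache is sized with dbms.pagecache.memory instead.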
I suggest that you run a profile on your statement to see where the biggest issues show up.
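For example (a sketch; just prefix the statement, and ideally run it against a smaller sample of the CSV first):
PROFILE
LOAD CSV WITH HEADERS FROM "file:///D:/MOT/test_data.csv" AS row
MERGE (c:Country {Name:row.Country})
MERGE (a:Actor {Name: row.ActorName})
ON CREATE SET a.Aliases = row.Aliases, a.Type = row.ActorType;
The operators with the most db hits will tell you whether the MERGEs are using the indexes or falling back to label scans.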
Otherwise, I would very much recommend splitting it up, e.g. like this:
CREATE CONSTRAINT ON (c:Country) ASSERT c.Name IS UNIQUE;
CREATE CONSTRAINT ON (o:Organization) ASSERT o.Name IS UNIQUE;
CREATE CONSTRAINT ON (a:Actor) ASSERT a.Name IS UNIQUE;
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.Country as Country
MERGE (c:Country {Name:Country});
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.AffiliationTo as AffiliationTo
MERGE (o:Organization {Name: AffiliationTo});
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MERGE (a:Actor {Name: row.ActorName}) ON CREATE SET a.Aliases=row.Aliases, a.Type=row.ActorType;
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.Country as Country, row.ActorName as ActorName
MATCH (c:Country {Name:Country})
MATCH (a:Actor {Name:ActorName})
MERGE(c)<-[:IS_FROM]-(a);
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MATCH (o:Organization {Name: row.AffiliationTo})
MATCH (a:Actor {Name: row.ActorName})
MERGE (a)-[r:AFFILIATED_TO]->(o)
ON CREATE SET r.Start=row.AffiliationStartDate, r.End=row.AffiliationEndDate;
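One convenient way to run these statements in sequence (a sketch, assuming you save them to a file called import.cql and use the neo4j-shell that ships with 2.3):
bin/neo4j-shell -file import.cql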
Related
End goal:
Be able to identify a flowfile as an insert or update.
Idea:
Bring in a record from the source,
then compare the PK in the flowfile to the PKs in a sheet, file, or TBD output of a query against the target DB that pulls all PKs for that view. Then I can route that file as an insert or update. This is for speed purposes.
Question for Stack:
Is there a way I could compare the PK of my current flowfile to that of a flowfile of all values generated earlier (I'd only generate this flowfile every 6, 12, 18, or 24 hours)?
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///sample.csv" AS row
WITH row
MATCH (server:Server { name:row.`SOURCE_HOST`, source:'sample' })
MATCH (server__1:Server__1 { name:row.`TARGET_HOST`, source:'sample' })
MERGE (server)-[:DEPENDS_ON]->(server__1)
This query is taking a very long time to create the relationships.
There are 1875 Server and Server__1 nodes each.
Thanks in advance.
The best approach to defining graph schemas is to have a unique id for each entity in your graph. This means that you could look up your nodes by their unique id. That way you can define a unique constraint for that property, which will speed up the query execution. For example, if the name property of the server were enough to look up the Server node, you could first define a unique constraint:
CREATE CONSTRAINT constraint_name
ON (s:Server) ASSERT s.name IS UNIQUE;
Then, when you search for the server, the query will perform much better.
MATCH (server:Server {name:row.`SOURCE_HOST`})
This is the best approach as far as I have seen. If you don't have a unique property, you could create one by combining two properties, in your case name + source. If that is not an option for you, you can create an index on both the name and source properties, or a composite index covering the two, to optimize performance; on the Enterprise edition you could additionally declare a node key constraint over the pair.
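For example, a minimal sketch in 4.x syntax (the index names are just placeholders; adjust the labels and properties to your model):
CREATE INDEX server_name_source FOR (s:Server) ON (s.name, s.source);
CREATE INDEX server1_name_source FOR (s:Server__1) ON (s.name, s.source);
With these in place, the two MATCH clauses in your LOAD CSV become index seeks instead of label scans, which is usually where the time goes.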
I implemented the mapping as below. Can someone suggest whether this is a good approach or not?
The old records were copied from production, so no flag was provided for them; only new records will have a flag.
Source data:
col1  col2  col3  DML_FLAG
1     a     123   NULL (old record)
2     b     456   I
3     c     678   U
Mapping:
Source --> SQ --> EXP --> LKP (on target, to identify new vs. update) --> EXP --> RTR (for insert and update) --> UPD (for updates) --> Target
For the first-time load I have to load all records, i.e. a full load (old records where DML_FLAG is null, plus new records).
From the second run onwards I have to capture only the changed records from the source; for this I am using mapping variables.
My question: the I and U flags are already available in the source, yet I am still using a LKP. Without the lookup, I could simply use DML_FLAG with two groups, I and U, in the RTR.
But I need to refresh the data every 30 minutes. If, within those 30 minutes, a record is inserted (I) and then updated, the flag changes to 'U' in the source while that record is not yet available in the target. In that case, how can I capture that new record with flag 'U' without the LKP?
Can someone suggest how I can do this without a lookup?
From what I understand of your question, you want to make sure you apply the insert in your target before you apply the update to that same record - is this correct? If so, just use a target load plan with inserts routed to a separate alias of the same target, placed higher in the load order than the updates.
Whether this is correct as a design choice depends on what the target DB is. For a data warehouse fact table you would usually insert all records, be they inserts or updates, because you would be reporting on the event rather than the record state. For a dimension table it would depend on your slowly changing dimension strategy.
I would like to add a compressed index to the Oracle Applications workflow table hr.pqh_ss_transaction_history in order to access specific types of workflows (process_name) and workflows for specific people (selected_person_id).
There are lots of repeating values in process_name although the data is skewed. I would however want to access the TFG_HR_NEW_HIRE_PLACE_JSP_PRC and TFG_HR_TERMINATION_JSP_PRC process types.
"PROCESS_NAME","CNT"
"HR_GENERIC_APPROVAL_PRC",40347
"HR_PERSONAL_INFO_JSP_PRC",39284
"TFG_HR_NEW_HIRE_PLACE_JSP_PRC",18117
"TFG_HREMPSTS_TERMS_CHG_JSP_PRC",14076
"TFG_HR_TERMINATION_JSP_PRC",8764
"HR_ADV_INDIVIDUAL_COMP_PRC",4907
"TFG_HR_SIT_NOAPP",3979
"TFG_YE_TAX_PROV",2663
"HR_TERMINATION_JSP_PRC",1310
"HR_CHANGE_PAY_JSP_PRC",953
"TFG_HR_SIT_EXIT_JSP_PRC",797
"HR_SIT_JSP_PRC",630
"HR_QUALIFICATION_JSP_PRC",282
"HR_CAED_JSP_PRC",250
"TFG_HR_EMP_TERM_JSP_PRC",211
"PER_DOR_JSP_PRC",174
"HR_AWARD_JSP_PRC",101
"TFG_HR_SIT_REP_MOT",32
"TFG_HR_SIT_NEWPOS_NIB_JSP_PRC",30
"TFG_HR_SIT_NEWPOS_INBU_JSP_PRC",28
"HR_NEW_HIRE_PLACE_JSP_PRC",22
"HR_NEWHIRE_JSP_PRC",6
selected_person_id would obviously be more selective. Unfortunately there are 3774 nulls for this column and the highest count after that is 73 for one person. A lot of people would only have 1 row. The total row count is 136963.
My query would be in this format:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, :p_person_id) = :p_person_id
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date
I am on Oracle 12c release 1.
I assume it would be a good idea to put a non-compressed b-tree index on selected_person_id, since the values returned would fall into the "less than 5% of the total rows" scenario, but how do you handle the nulls in the column, which would not go into the index, when you select using nvl(psth.selected_person_id, :p_person_id) = :p_person_id? Is there a more efficient way to write the SQL, and how should you create this index?
For process_name I would like to use a compressed b-tree index. I am assuming that the statement is
CREATE INDEX idxname ON pqh_ss_transaction_history(process_name) COMPRESS
where there would be an implicit second column for the rowid. Is it safe to rely on the rowid here, since normally it is not advised to use rowids? Is the skewed data an issue (most of the time I would be selecting on the high-volume side)? I don't understand how compressed indexes can be efficient: for b-tree indexes you would normally want to return less than about 5% of the data, otherwise a full table scan is actually more efficient. How does the compressed index return so many rowids and then look up the table by those rowids faster than a full table scan?
Or, since the optimizer will only be able to use one of the two indexes, should I rather create an uncompressed function-based index with selected_person_id and process_name concatenated?
Perhaps you could create this index:
CREATE INDEX idxname ON pqh_ss_transaction_history
(process_name, NVL(selected_person_id,-1)) COMPRESS 1
Then change your query to:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, -1) in (:p_person_id,-1)
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date
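To check that the index is actually picked up, you could look at the plan, e.g. (a sketch, assuming the usual DBMS_XPLAN access):
EXPLAIN PLAN FOR
select psth.item_key, psth.creation_date, psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, -1) in (:p_person_id, -1)
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
You should see an INDEX RANGE SCAN on the new index (possibly under an INLIST ITERATOR for the two NVL values) rather than a full table scan.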
I have an AWS DynamoDB table called "Users", whose hash key/primary key is "UserID", which consists of emails. It has two attributes, the first called "Daily Points" and the second "TimeSpendInTheApp". Now I need to run a query or scan on the table that will give me the top 50 users with the highest points and the top 50 users who have spent the most time in the app. This query will be executed only once a day by a cron-triggered AWS Lambda. I am trying to find the best solution for this query or scan. For me, cost is more important than speed or efficiency. Maintaining a global secondary index or a local secondary index on points can be a costly operation, as I would have to assign read and write units for those indexes, which I want to avoid. The "Users" table will have a maximum of 100,000 to 150,000 records and on average about 50,000 records. What are my best options? Please suggest.
I am thinking my first option is to scan the whole table with a filter expression for records above a certain number of points (5,000 for example). After this scan, if 50 or more records are found, simply sort them and take the top 50. If the scan returns no or very few results, reduce the filter threshold (3,000 for example) and scan again; if a threshold (2,500 for example) returns far too many records, say 5,000 or more, adjust the threshold again. Is this even feasible? I guess it would also need to handle pagination. Is it advisable to scan a table which has 50,000 records?
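Roughly, the scan-and-filter idea above would look like this in boto3 (a sketch; the table and attribute names are the ones described above, and the thresholds are just placeholders):
from boto3.dynamodb.conditions import Attr
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")

def top_n(attribute, threshold, n=50):
    # Scan with a filter; every scanned item still consumes read capacity.
    items = []
    kwargs = {"FilterExpression": Attr(attribute).gte(threshold)}
    while True:  # follow LastEvaluatedKey to handle pagination
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    # Sort the filtered items client-side and keep the top n.
    items.sort(key=lambda item: item.get(attribute, 0), reverse=True)
    return items[:n]

top_points = top_n("Daily Points", 5000)
top_time = top_n("TimeSpendInTheApp", 3600)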
Any advice or suggestion will be helpful. Thanks in advance.
Firstly, creating indexes for the above use case doesn't simplify the process, as it doesn't provide a solution for aggregation or sorting.
I would export the data to Hive and run the queries there rather than writing code to determine the result, especially as it is a batch executed only once per day.
Something like the below:
Create the Hive table:
CREATE EXTERNAL TABLE hive_users(userId string, dailyPoints bigint, timeSpendInTheApp bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Users",
"dynamodb.column.mapping" = "userId:UserID,dailyPoints:Daily_Points,timeSpendInTheApp:TimeSpendInTheApp");
Queries:
SELECT dailyPoints, userId from hive_users sort by dailyPoints desc;
SELECT timeSpendInTheApp, userId from hive_users sort by timeSpendInTheApp desc;
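If you only need the top 50 of each list, you could also push the cut-off into Hive itself, e.g. (a sketch; note that sort by only orders rows within each reducer, whereas order by gives a total ordering at the cost of a single reducer):
SELECT userId, dailyPoints FROM hive_users ORDER BY dailyPoints DESC LIMIT 50;
SELECT userId, timeSpendInTheApp FROM hive_users ORDER BY timeSpendInTheApp DESC LIMIT 50;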
Hive Reference