How to enhance neo4j batch insertion? - jdbc

I have an Oracle DB of roughly 20 million record. I used the BatchInserter to insert the data in my model.
The problem is that I have to loop over a result set containing the whole 20 million data to get the needed properties to insert. But it take too long time to do just the loop process.
Anyone tried something like that? and What is the best way to do it in an optimum time?

Can you share more details? Where do you have to loop?
Check http://neo4j.org/develop/import for some options.
If you have JDBC you can also drive the import directly from your JDBC results.
Just loop twice over the results, once for nodes and once for rels.

Related

Spring read query concurrent executinon in multiple threads

I Have a Spring boot project where I would like to execute a specific query in a database from x different threads while preventing different threads from reading the same database entries. So far I was able to run the query in multiple threads but had no luck on finding a way to "split" the read load. My code so far is as follows:
#Async
#Transactional
public CompletableFuture<Book> scanDatabase() {
final List<Book> books = booksRepository.findAllBooks();
return CompletableFuture.completedFuture(books);
}
Any ideas on how should I approach this?
There are plenty of ways to do that.
If you have a numeric field in the data that is somewhat random you can add a condition to your where clause like ... and some_value % :N = :i with :N being a parameter for the number of threads and :i being the index of the specific thread (0 based).
If you don't have a numeric field you can create one by using a hash function and apply it on some other field in order to turn it into something numeric. See your database specific documentation for available hash functions.
You could use an analytic function like ROW_NUMBER() to create a numeric value to be use in the condition.
You could query the number of rows in a first query and then query a the right Slice using Spring Datas pagination feature.
And many more variants.
They all have in common that the complete set of rows must not change during the processing, otherwise you may get rows queried multiple times or not at all.
If you can't guarantee that you need to mark the records to be processed by a thread before actually selecting them, for example by marking them in an extra field or by using a FOR UPDATE clause in your query.
And finally there is the question if this is really what you need.
Querying the data in multiple threads probably doesn't make the querying part faster since it makes the query more complex and doesn't speed up those parts that typically limit the throughput: network between application and database and I/O in the database.
So it might be a better approach to select the data with one query and iterate through it, passing it on to a pool of thread for processing.
You also might want to take a look at Spring Batch which might be helpful with processing large amounts of data.

MarkLogic 8 - Reporting and Aggregation from Large Collection

Say I have a collection with 100 million records/documents in it.
I want to create a series of reports that involve summing of values in certain columns and grouping by various columns.
What references for XQuery and/or MarkLogic can anyone point me to that will allow me to do this quickly?
I saw cts:avg-aggregate which looks fine. But then I need to group as well..
Also, since I am dealing with a large amount of data and it will take some time to go through it all, I am thinking about setting this up as a job that runs at night to update the report.
I thought of using corb to run through the records and then do something with the output from that. Is this the right approach with MarkLogic and reporting?
Perhaps this guide would help:
http://developer.marklogic.com/blog/group-by-the-marklogic-way
You have several options which are discussed above:
cts:estimate
cts:element-value-co-occurrences
cts:value-tuples + cts:frequency

Specifing no of records to delete in Tibco JDBC Update activity

how to specify no of records to delete in Tibco JDBC Update activity in batch update mode.
Actually I need to delete 25 million of records from the database so I wrote Tibco code to do the same and it is taking lot of time .. So I am planning to use Batch mode in Delete query so I don't know how to specify no of records in JDBC Update activity.
Help me if any one has any idea.. thanks
From the docs for the Batch Update checkbox:
This field is only meaningful if there are prepared parameters in the
SQL statement (see Prepared Parameters).
In which case the input will be an array of records. It will execute the statement once for each record.
To avoid running out of memory, you will still need to iterate over the 25mil, but you can iterate in groups of 1000 or 10000.
If this is not something you would do often (deleting 25M rows, sounds pretty one-off), an alternative is to use BW to create a file containing the delete statements and then giving the file to a DBA to execute.
please use subset feature of jdbc palette!! Let me know if you face any issues?
I would suggest two points:
If this is an one time activity then it is not adviced to use Tibco BW code for that. SQL script should be the better alternative.
When you say 25 million records- what criteria is this based on. It can be achieved through subset iteration .But there should be proper load testing in the Pre - Prod environment to check that the process is not causing any memory/DB issue.
You can also try using SQL procedure and invoking the same through BW.

(TSQL) INSERT doubling time of the query

I have a quite complex multi-join TSQL SELECT query that runs for about 8 seconds and returns about 300K records. Which is currently acceptable. But I need to reuse results of that query several times later, so I am inserting results of the query into a temp table. Table is created in advance with columns that match output of SELECT query. But as soon as I do INSERT INTO ... SELECT - execution time more than doubles to over 20 seconds! Execution plans shows that 46% of the query cost goes to "Table Insert" and 38% to Table Spool (Eager Spool).
Any idea why this is happening and how to speed it up?
Thanks!
The "Why" of it hard to say, we'd need a lot more information. (though my SWAG would be that it has to do with logging...)
However, the solution, 9 times out of 10 is to use SELECT INTO to make your temp table.
I would start by looking at standard tuning itmes. Is disk performing? Are there sufficient resources (IOs, RAM, CPU, etc)? Is there a bottleneck in the RDBMS? Does sound like the issue but what is happening with locking? Does other code give similar results? Is other code performant?
A few things I can suggest based on the information you have provided. If you don't care about dirty reads, you could always change the transaction isolation level (if you're using MS T-SQL)
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
select ...
This may speed things up on your initial query as locks will not need to be done on the data you are querying from. If you're not using SQL server, do a google search for how to do the same thing with the technology you are using.
For the insert portion, you said you are inserting into a temp table. Does your database support adding primary keys or indexes on your temp table? If it does, have a dummy column in there that is an indexed column. Also, have you tried to use a regular database table with this? Depending on your set up, it is possible that using that will speed up your insert times.

SSIS File Load WAY TOO SLOW in Large Destination Table

this is my first question, I've searched a lot of info from different sites but none of them where conslusive.
Problem:
Daily I'm loading a flat file with an SSIS Package executed in a scheduled job in SQL Server 2005 but it's taking TOO MUCH TIME(like 2 1/2 hours) and the file just has like 300 rows and its a 50 MB file aprox. This is driving me crazy, because is affecting the performance of my server.
This is the Scenario:
-My package is just a Data Flow Task that has a Flat File Source and an OLE DB Destination, thats all!!!
-The Data Access Mode is set to FAST LOAD.
-Just have 3 indexes in the table and are nonclustered.
-My destination table has 366,964,096 records so far and 32 columns
-I haven't set FastParse in any of the Output columns yet.(want to try something else first)
So I've just started to make some tests:
-Rebuild/Reorganize the indexes in the destination table(they where way too fragmented), but this didn't help me much
-Created another table with the same structure but whitout all the indexes and executed the Job with the SSIS package loading to this new table and IT JUST TOOK LIKE 1 MINUTE !!!
So I'm confused, is there something I'm Missing???
-Is the SSIS package writing all the large table in a Buffer and the writing it on Disk? Or why the BIG difference in time ?
-Is the index affecting the insertion time?
-Should I load the file to this new table as a temporary table and then do a BULK INSERT to the destination table with the records ordered? 'Cause I though that the Data FLow Task was much faster than BULK INSERT, but at this point I don't know now.
Greetings in advance.
One thing I might look at is if the large table has any triggers which are causing it to be slower on insert. Also if the clustered index is on a field that will require a good bit of rearranging of the data during the load, that could cause an issues as well.
In SSIS packages, using a merge join (which requires sorting) can cause slownesss, but from your description it doesn't appear you did that. I mention it only in case you were doing that and didn't mention it.
If it works fine without the indexes, perhaps you should look into those. What are the data types? How many are there? Maybe you could post their definitions?
You could also take a look at the fill factor of your indexes - especially the clustered index. Having a high fill factor could cause excessive IO on your inserts.
Well I Rebuild the indexes with another fill factor (80%) like Sam told me, and the time droped down significantly. It took 30 minutes instead of almost 3hours!!!
I will keep with the tests to fine tune the DB. Also I didnt have to create a clustered index,I guess with the clustered the time will drop a lot more.
Thanks to all, wish that this helps to someone in the same situation.

Resources