How to prevent timeout error when importing large CSV data set into Memgraph? - memgraphdb

I'm trying to load a rather large CSV file. It has more than 700K entries and it times out every time when I try to import the data.
I am currently using a for loop over the data to load it, but it's quite time-consuming.

You can try changing the query execution timeout flag in your configuration settings:
--query-execution-timeout-sec=180
The default setting is 180 seconds; you can set a larger value. If you set it to 0, there is no time limit for query execution.
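If raising the timeout alone is not enough, a single LOAD CSV query is usually far faster than inserting rows one by one from a client loop. A minimal sketch, assuming the neo4j Python driver (Memgraph speaks Bolt), Memgraph on the default Bolt port without auth, and a CSV path visible to the server; the file path, label, and property names are placeholders:

from neo4j import GraphDatabase

# Connect over Bolt; Memgraph is Bolt-compatible, so the neo4j driver works.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))

# One server-side bulk load instead of 700K client round-trips.
load_query = """
LOAD CSV FROM '/import/data.csv' WITH HEADER AS row
CREATE (:Entry {id: row.id, name: row.name});
"""

with driver.session() as session:
    session.run(load_query)

driver.close()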

Related

Databricks Delta table write performance slow

I am running everything in Databricks (everything is under the assumption that the data is a PySpark DataFrame).
The scenario is:
I have 40 files read as Delta files in ADLS, and then I apply a series of transformation functions (through a loop, FIFO flow). At the end, I write them as Delta files in ADLS.
df.write.format("delta").mode('append').save(...)
Each file has about 10k rows, and the whole process takes about 1 hour.
I am curious if anyone can answer the questions below:
Is a loop a good approach for applying those transformations? Is there a better way to apply those functions to all files at once, in parallel?
What is the typical average time to load a Delta table for a file with 10k rows?
Any suggestions for improving the performance?
You said you run everything in Databricks.
Assuming you are using the latest version of Delta:
Set delta.autoCompact
Set shuffle partitions to auto
Set delta.deletedFileRetentionDuration
Set delta.logRetentionDuration
When you write the DataFrame, use partitionBy
When you write the DataFrame, you may want to repartition, but you don't have to
You may want to set maxRecordsPerFile in your writer options
Show us the code, as it seems like your processing code is the bottleneck.
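For illustration, a hedged sketch of what those suggestions look like in PySpark on Databricks; the path, partition column, retention intervals, and file-size cap below are placeholders, not values from the question:

# Hedged sketch; `spark` is the SparkSession that Databricks notebooks provide.
spark.conf.set("spark.sql.shuffle.partitions", "auto")                # let AQE pick shuffle partitions
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")  # compact small files on write

# Table-level retention properties (run once per target table).
spark.sql("""
  ALTER TABLE delta.`/mnt/adls/target_table`
  SET TBLPROPERTIES (
    'delta.deletedFileRetentionDuration' = 'interval 7 days',
    'delta.logRetentionDuration'         = 'interval 30 days'
  )
""")

(df.repartition("event_date")              # optional: repartition before the write
   .write
   .format("delta")
   .mode("append")
   .partitionBy("event_date")              # partition by a low-cardinality column
   .option("maxRecordsPerFile", 50000)     # cap rows per output file
   .save("/mnt/adls/target_table"))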

How can I avoid repeated/duplicate test data, or skip already executed data, without deleting it from the CSV file?

We have a constraint in our application: for the test data provided in the JMeter execution (using the CSV Data Set Config element), we are not supposed to provide duplicate test data, and duplicates are not accepted in any of the fields. So we kept unique test data (up to 8K entries for 8K concurrent users) for all the fields in CSV format.
Here I have a manual intervention: after each test execution (i.e., 100 users, 1000 users, up to 8000 users), we need to delete the corresponding rows (matching the users in the thread group) from the CSV file; otherwise the duplicate data will be fetched in the next execution and the result will fail.
Here are my questions:
1. How can I avoid repeated/duplicate test data, or skip already executed data, without deleting it from the CSV file?
2. During JMeter test execution with CSV files, how can we specify that data should be taken from particular rows, for example the 101st, 1001st, and 7999th rows (in a file that contains 8000 rows of data)?
The easiest option would be using the HTTP Simple Table Server; its READ command has a KEEP=FALSE attribute, so you will be able to feed your test with unique data without having to physically remove it from the original CSV file.
You can install the HTTP Simple Table Server plugin using the JMeter Plugins Manager.
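For illustration only, a rough sketch of what the READ call looks like, assuming the Simple Table Server is running on its default port 9191 and the data file is named users.csv (both assumptions); in a real test plan you would issue this from an HTTP Request sampler rather than from Python:

import requests

# Hedged sketch of the Simple Table Server READ command; host, port and file name are assumptions.
resp = requests.get(
    "http://localhost:9191/sts/READ",
    params={"READ_MODE": "FIRST", "KEEP": "FALSE", "FILENAME": "users.csv"},
)
print(resp.text)  # with KEEP=FALSE the returned line is not served again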
In general, if your test doesn't need to be repeatable, then instead of keeping the data in the CSV file you can consider generating it on the fly using JMeter Functions such as:
__Random()
__RandomString()
__RandomDate()
__UUID()
etc.

JMeter CSV data set split into threads (users)

What I want to do:
I would like to test the behavior of the system for 50 users. Each user has to do the same action X times,
with different input (X depends on how many records I have in the CSV file; so if the file contains 1000 records, each user will do the action 20 times).
What I actually did to do that:
I set up a CSV Data Set Config (with a CSV file containing 1000 lines) and, of course, set the Number of Threads to 50.
What is my problem:
Now I'm not quite sure how to share the CSV file so that each user gets a unique pool of lines from the file (so each user will have their own unique lines from the CSV).
What can I do to workaround:
I could copy the thread group to make 50 thread groups and give each one a separate CSV file, but that sounds ridiculous...
Given you set the following values in the CSV Data Set Config:
Recycle on EOF: False
Stop thread on EOF: True
Sharing mode: All threads
then each thread (virtual user) will fetch new value(s) from the CSV file, which guarantees the uniqueness of the test data.
You can check this yourself using the __threadNum() function and the ${__jm__Thread Group__idx} variable.
More information: CSV Data Set Config in Sharing Mode - Made Easy
In the thread group, under Thread Properties, we can set:
number of threads = 50
ramp-up period = 1
loop count = 20
So here, each thread, after a second, will take the next line from the CSV file and execute it.
This way the same CSV file will be shared among the different threads.
I would recommend creating multiple CSV files for your test plan and assigning the variables accordingly for smooth execution of the script. Using the same CSV file cannot solve the problem, as there are times when a few threads execute much faster and others are slower; in that case, actions will start replicating between different threads.
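If you do go the multiple-files route, a small helper script like the following can pre-split the source file into one CSV per thread group; the file names, header handling, and thread count are placeholders, not part of the original question:

import csv

# Hypothetical helper: split input.csv into one CSV slice per thread group.
THREADS = 50

with open("input.csv", newline="") as src:
    rows = list(csv.reader(src))

header, data = rows[0], rows[1:]          # assumes the first line is a header
chunk = len(data) // THREADS              # e.g. 1000 data rows / 50 threads = 20 rows each

for i in range(THREADS):
    with open(f"users_{i + 1}.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(data[i * chunk:(i + 1) * chunk])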

Pull Data from Hive to SQL Server without duplicates using Apache NiFi

Sorry, I'm new to Apache NiFi. I made a data flow for pulling data from Hive and storing it in SQL. There is no error in my data flow; the only problem is that it's pulling data repeatedly.
My data flow consists of the following:
SelectHiveQL
SplitAvro
ConvertAvroToJson
ConvertJsonTOSQL
PutSQL
For example, my table in Hive has only 20 rows, but when I run the data flow and check my table in MS SQL, it has saved 5,000 rows. SelectHiveQL pulled the data repeatedly.
What do I need to do so it will only pull 20 rows, or just the exact number of rows in my Hive table?
Thank you
SelectHiveQL (like many NiFi processors) runs on a user-specified schedule. To get a processor to only run once, you can set the run schedule to something like 30 sec, then start and immediately stop the processor. The processor will be triggered once, and stopping it does not interrupt that current execution, it just causes it not to be scheduled again.
Another way might be to set the run schedule to something very large, such that it would only execute once per some very long time interval (days, years, etc.)

Increase efficiency of Sqoop export from HDFS

I am trying to export data using Sqoop from files stored in HDFS to Vertica. For around 10k records, the files get loaded within a few minutes. But when I try to run crores (tens of millions) of records, it loads around 0.5% within 15 minutes or so. I have tried increasing the number of mappers, but it did not improve efficiency. Even setting the chunk size to increase the number of mappers does not increase the number.
Please help.
Thanks!
As you are using batch export, try increasing the records per transaction and records per statement using the following properties:
sqoop.export.records.per.statement: aggregates multiple rows into one single INSERT statement.
sqoop.export.records.per.transaction: how many INSERT statements will be issued per transaction.
I hope this solves the issue.
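As a rough sketch of how those properties are passed (they are generic Hadoop -D options, so they go right after the tool name); the connection string, table, export directory, and mapper count below are placeholders:

import subprocess

# Hedged sketch: invoke Sqoop export with the batch-tuning properties above.
# Requires Sqoop on PATH; connection details and paths are placeholders.
subprocess.run([
    "sqoop", "export",
    "-Dsqoop.export.records.per.statement=100",    # rows combined into one INSERT
    "-Dsqoop.export.records.per.transaction=100",  # INSERTs per commit
    "--connect", "jdbc:vertica://vertica-host:5433/exampledb",
    "--username", "dbuser",
    "--table", "target_table",
    "--export-dir", "/data/export",
    "--num-mappers", "8",
], check=True)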
Most MPP/RDBMS systems have Sqoop connectors to exploit parallelism and increase the efficiency of data transfer between HDFS and the MPP/RDBMS. However, it seems Vertica has taken this approach: http://www.vertica.com/2012/07/05/teaching-the-elephant-new-tricks/
https://github.com/vertica/Vertica-Hadoop-Connector
Is this a "wide" dataset? It might be a sqoop bug https://issues.apache.org/jira/browse/SQOOP-2920 if number of columns is very high (in hundreds), sqoop starts choking (very high on cpu). When number of fields is small, it's usually other way around - when sqoop is bored and rdbms systems can't keep up.