Is there a way to limit a SparkR data frame to 1,000,000 rows? The data file is big, so processing is taking a long time. Thanks
Did you try limit? You can use it in a SQL statement or as a SparkR command.
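For example, a minimal sketch, shown in PySpark syntax with the SparkR equivalents in the comments (myTable is a placeholder name):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# SQL form -- SparkR: df <- sql(sqlContext, "SELECT * FROM myTable LIMIT 1000000")
df = spark.sql("SELECT * FROM myTable LIMIT 1000000")

# API form -- SparkR: df <- limit(df, 1000000)
df = spark.table("myTable").limit(1000000)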
I am running everything in Databricks (assume the data is a PySpark DataFrame).
The scenario is:
I have 40 files read as Delta files in ADLS, then I apply a series of transformation functions (through a loop, in FIFO order). Finally, I write the result as Delta files in ADLS:
df.write.format("delta").mode('append').save(...)
Each file is about 10k rows, and the whole process takes about 1 hour.
I am curious whether anyone can answer the questions below:
Is a loop a good approach to apply those transformations? Is there a better way to apply those functions to all files in parallel at once?
What is the typical average time to load a Delta table for a file with 10k rows?
Any suggestions for improving the performance?
You said you run everything in Databricks.
Assuming you are using a recent version of Delta Lake:
Set delta.autoCompact
Set shuffle partitions to auto
Set delta.deletedFileRetentionDuration
Set delta.logRetentionDuration
Use partitionBy when you write the DataFrame
You may also want to repartition when you write, but you don't have to
You may want to set maxRecordsPerFile in your writer options
A sketch of these settings is shown below. Also, show us the code, as it seems like your processing code is the bottleneck.
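A rough sketch of those settings in PySpark, assuming spark and df are already defined as in your job (the table path /mnt/out/my_table, the partition column load_date, and the retention intervals are placeholder values, not taken from your post):

# Session-level settings
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "auto")  # "auto" is Databricks-specific

# Table-level retention properties
spark.sql("""
    ALTER TABLE delta.`/mnt/out/my_table`
    SET TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = 'interval 7 days',
        'delta.logRetentionDuration' = 'interval 30 days'
    )
""")

# Writer-level options
(df.repartition("load_date")                 # optional; you don't have to
   .write.format("delta")
   .mode("append")
   .partitionBy("load_date")
   .option("maxRecordsPerFile", 100000)
   .save("/mnt/out/my_table"))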
I use PDI (Kettle) to extract data from MongoDB to Greenplum. I tested extracting the data from MongoDB to a file, and it was fast, about 10,000 rows per second. But when extracting into Greenplum, it is only about 130 rows per second.
I modified the following Greenplum parameters, but there was no significant improvement.
gpconfig -c log_statement -v none
gpconfig -c gp_enable_global_deadlock_detector -v on
And if I increase the number of output table steps, it seems to hang and no data is inserted for a long time. I don't know why.
How can I increase the performance of inserting data from MongoDB into Greenplum with PDI (Kettle)?
Thank you.
There are a variety of factors that could be at play here.
Is PDI loading via an ODBC or JDBC connection?
What is the size of data? (row count doesn't really tell us much)
What is the size of your Greenplum cluster (# of hosts and # of segments per host)?
Is the table you are loading into indexed?
What is the network connectivity between Mongo and Greenplum?
The best bulk-load performance with data integration tools such as PDI, Informatica PowerCenter, IBM DataStage, etc. will be achieved using Greenplum's native bulk-loading utilities, gpfdist and gpload.
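A minimal sketch of the gpfdist route (host names, port, file locations, and table names are placeholders): start gpfdist on the host that holds the extracted files, then load through an external table:

gpfdist -d /data/export -p 8081 &

-- then, in Greenplum:
CREATE READABLE EXTERNAL TABLE ext_stage (LIKE public.target_table)
LOCATION ('gpfdist://etl-host:8081/*.csv')
FORMAT 'CSV';

INSERT INTO public.target_table SELECT * FROM ext_stage;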
Greenplum loves batches.
a) You can modify the batch size in the transformation settings with "Nr of rows in rowset".
b) You can modify the commit size in the Table output step.
I think a and b should match.
Find your optimum values (for example, we use 1000 for rows with big JSON objects inside).
Now, try using the following connection property:
reWriteBatchedInserts=true
It will rewrite the SQL from single inserts into batched inserts. It increased insert performance about ten times in my scenario.
https://jdbc.postgresql.org/documentation/94/connect.html
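In PDI you can add this property on the connection's Options tab, or append it to the JDBC URL directly, for example (host and database name are placeholders):

jdbc:postgresql://gp-master:5432/mydb?reWriteBatchedInserts=true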
Thank you guys!
I am reading a table with 1 billion records in Spark from Hive, and this table has date and country columns as partitions. It runs for a very long time since we are doing many transformations on it. If I change the Hive table file format to Parquet, will there be any performance gain? Any suggestions for improving performance?
Changing from ORC to Parquet may not improve performance.
It depends on the type of data you have: if you are working with nested objects, you need to use Parquet, since ORC is not good for that.
But to get some improvement, I suggest a few steps that can help with your data in Hive.
Check the number of files in Hive.
One common thing that can create big problems for Hive queries is the number of files in each partition and the size of those files. If you are using Spark to store the data, I suggest you check the size of the files and whether they are stored close to your Hadoop block size. If not, try the CONCATENATE command to solve that problem, as you can see here.
Predicate Pushdown
This is where Hive and ORC files can give you the best performance when querying the data. I suggest you run an ANALYZE command to force the creation of statistics for your table; this updates the Hive Metastore with relevant information about the data and will improve performance. Check here for details.
Ordered Data
If possible, try to store your data ordered by some column, and do your filters and other operations on that column. Your joins can also be improved by this. A sketch combining these steps is shown below.
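A rough sketch of those steps from Spark (my_db.my_table, country, and event_date are placeholder names; the CONCATENATE command is Hive-side, so it is shown only as a comment):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Compact small ORC files in a partition -- run this one from Hive, not Spark SQL:
#   ALTER TABLE my_db.my_table PARTITION (country='US') CONCATENATE;

# Refresh table and column statistics in the Hive Metastore
spark.sql("ANALYZE TABLE my_db.my_table COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE my_db.my_table COMPUTE STATISTICS FOR COLUMNS country, event_date")

# Store the data ordered by the columns you filter and join on
df = spark.table("my_db.my_table")
(df.sortWithinPartitions("country", "event_date")
   .write.mode("overwrite")
   .format("orc")
   .saveAsTable("my_db.my_table_sorted"))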
I have a dataframe I got by making a query using a SQLContext:
> df <- sql(sqlContext, "SELECT * FROM myTable")
When I try to get its size:
> object.size(df)
1024 bytes
I know that it is not the real size of the dataframe, probably because it's distributed over Spark nodes. To get the real size I need to collect it:
> localDf <- collect(df)
> object.size(localDf)
45992 bytes
Sometimes the dataframe is too big to fit in the local memory. Is there a simple way to know the actual size of a dataframe without bringing it locally?
One way to do this is to use the Spark Web UI. Under the Executors tab you can look at Storage Memory.
I actually found a satisfactory workaround for this problem. I set the following Spark configuration when loading the SparkContext:
spark.driver.maxResultSize=1m
With this setting, when the result is bigger than 1 MB, Spark throws an org.apache.spark.SparkException, so I caught it and returned an error message.
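A minimal sketch of that workaround in PySpark terms (the original question is SparkR, where collect(df) raises the same SparkException; 1m is just the example value from above):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "1m")  # cap results returned to the driver
         .getOrCreate())

df = spark.sql("SELECT * FROM myTable")

try:
    local_rows = df.collect()  # equivalent of SparkR collect(df)
except Exception as err:  # org.apache.spark.SparkException surfaces here when the limit is exceeded
    print("Result is larger than spark.driver.maxResultSize; not collecting it locally:", err)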
I am trying to export data with Sqoop from files stored in HDFS to Vertica. For around 10k rows the files get loaded within a few minutes. But when I try to run crores (tens of millions) of rows, only about 0.5% is loaded within 15 minutes or so. I have tried to increase the number of mappers, but it does not improve efficiency. Even setting the chunk size to increase the number of mappers does not increase it.
Please help.
Thanks!
As you are using batch export, try increasing the records-per-statement and records-per-transaction parameters using the following properties:
sqoop.export.records.per.statement: aggregates multiple rows into one single INSERT statement.
sqoop.export.records.per.transaction: how many INSERT statements will be issued per transaction.
I hope this solves the issue; an example invocation is shown below.
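For example, a sketch of a Sqoop invocation with those properties (the connection string, credentials, paths, and values are placeholders; the -D options must come right after the export keyword):

sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.records.per.transaction=100 \
  --connect jdbc:vertica://vertica-host:5433/mydb \
  --username myuser \
  --password-file /user/me/vertica.password \
  --table target_table \
  --export-dir /data/export_dir \
  --num-mappers 8 \
  --batch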
Most MPP/RDBMS systems have Sqoop connectors to exploit parallelism and increase the efficiency of data transfer between HDFS and the MPP/RDBMS. However, it seems Vertica has taken this approach: http://www.vertica.com/2012/07/05/teaching-the-elephant-new-tricks/
https://github.com/vertica/Vertica-Hadoop-Connector
Is this a "wide" dataset? It might be a sqoop bug https://issues.apache.org/jira/browse/SQOOP-2920 if number of columns is very high (in hundreds), sqoop starts choking (very high on cpu). When number of fields is small, it's usually other way around - when sqoop is bored and rdbms systems can't keep up.