Performance issue when joining multiple DataFrames

I have code that does left joins with multiple data frames, because the attributes built from each of these data frames need to be positioned at different places in the final JSON file I'm trying to write. The code also has to grow as new elements are added. With the current approach, the job takes almost 3-4 hours and finally aborts due to performance issues.
What is a better way to address this performance issue?
lkp_df1, lkp_df2, etc. are the lookup data frames, and

main_df = main_df.join(lkp_df1, keys, "left")
  .select(...)

is the pattern I have in the code.

Please post your entire code. Try using persist or checkpoint, and also check the number of partitions and the data distribution across the cluster.
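In case it helps, a minimal sketch of those suggestions in Spark Scala, assuming the lookup DataFrames are small enough to broadcast; mainDf, lkpDf1, lkpDf2 and the join column "id" are placeholders, not the asker's real schema:

import org.apache.spark.sql.functions.broadcast
import org.apache.spark.storage.StorageLevel

// Broadcast the small lookup DataFrames so each left join becomes a
// broadcast hash join instead of a full shuffle.
val enriched = mainDf
  .join(broadcast(lkpDf1), Seq("id"), "left")
  .join(broadcast(lkpDf2), Seq("id"), "left")

// Persist before the result is reused, and materialize it once so later
// actions (e.g. writing the final JSON) do not recompute the whole join chain.
enriched.persist(StorageLevel.MEMORY_AND_DISK)
enriched.count()

// Check the partition count and look for skew before writing out the JSON.
println(s"partitions: ${enriched.rdd.getNumPartitions}")

If the chain of joins keeps growing, spark.sparkContext.setCheckpointDir(...) together with enriched.checkpoint() can be used in the same spirit to truncate the lineage.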

Related

Spark Rapids: Simple HashAggregate Example

Hi all, I am new to Spark Rapids. I was going through the basic introduction to Spark Rapids, where I found a figure (attached) explaining the difference between CPU- and GPU-based query plans for a hash aggregate example. Everything in the plans is clear to me except the last phase, which converts back to the row format. Can anyone please explain the reason behind this?
I do not see the referenced figure, but I suspect what is happening in your particular query comes down to one of two possible cases.
If your query is performing some kind of collection of the data back to the driver (e.g.: .show or .collect in Scala or otherwise directly displaying the query results) then the columnar GPU data needs to be converted back to rows before being returned to the driver. Ultimately the driver is working with RDD[InternalRow] which is why a transition from RDD[ColumnarBatch] needs to occur in those cases.
If your query ends by writing the output to files (e.g.: to Parquet or ORC) then the plan often shows a final GpuColumnarToRow transition. Spark's Catalyst optimizer automatically inserts ColumnarToRow transitions when it sees operations that are capable of producing columnar output (i.e.: RDD[ColumnarBatch]) and then the plugin updates those transitions to GpuColumnarToRow when the previous node will operate on the GPU. However in this case the query node is a data write command, and those produce no output in the query plan sense. Output is directly written to files when the node is executed instead of sending the output to a downstream node for further processing. Therefore this is a degenerate transition in practice, as the data write command sends no data to the columnar-to-row transition. I filed an issue against the RAPIDS Accelerator to clean up that degenerate transition, but it has no impact on query performance.
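As a small illustration (not from the original post), the transition can be seen by explaining a toy aggregation; the Gpu* operator names only appear when the RAPIDS Accelerator plugin is enabled, and the query itself is arbitrary:

// Arbitrary hash aggregate example.
val df = spark.range(0, 1000000L).selectExpr("id % 10 AS k", "id AS v")
val agg = df.groupBy("k").sum("v")

// Collecting results to the driver (show/collect) requires rows, so the plan
// ends with a (Gpu)ColumnarToRow transition above the aggregate.
agg.explain()
agg.show()

// Writing to Parquet executes a data write command; the executed plan (visible
// in the Spark UI SQL tab) still shows the transition, but it is the degenerate
// one described above and receives no data at execution time.
agg.write.mode("overwrite").parquet("/tmp/agg_example")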

Serializing SDO_GEOMETRY type to text really slow

I am trying for a couple of days now to extract SDO_GEOMETRY records from an Oracle table into a CSV file via Microsoft Azure Data Factory (gen2). My select statement looks like this:
select t.MY_GEOM.get_WKT() from my_table t
where the MY_GEOM column is of type SDO_GEOMETRY. It works, but it's really, really slow: about 2 hours to pull 74,000 records via this method.
Without that conversion (so, a plain select without .get_WKT()) it takes about 32 seconds, but of course the result is rubbish and unusable.
Is there some way to speed up the process? My guess it's that the problem is on the server side, but I'm not a DBA and don't have direct access to it. I can connect to it via SQL Developer or from Data Factory.
The data contained there is just some LINESTRING(x1 y1, x2 y2, ...)
I also tried running SDO_UTIL.TO_WKTGEOMETRY to convert it, but it's equally slow.
If you have any suggestions, please let me know.
Kind regards,
Tudor
As far as I know, ADF imposes no additional burden on data sources or sinks, so it looks like the performance bottleneck is on the database side, in the get_WKT() method.
Of course, you could refer to the tuning guides in this link to improve your transfer performance, especially Parallel copy. For each copy activity run, Azure Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store; that is based on the DIUs.
I found a nice solution while searching for different approaches. As stated in some comments above, the solution that works for me consists of two steps:
Split the SDO_GEOMETRY LINESTRING entry into its coordinates via the following select statement:
SELECT t.id, nt.COLUMN_VALUE AS coordinates, rownum FROM my_table t, TABLE(t.MY_GEOM.SDO_ORDINATES) nt
I just use it in a plain Copy Activity in Azure Data Factory to save my raw files as CSVs into a Data Lake. The files are quite large, about 4 times bigger than the final version created by the next step.
Aggregate the coordinates back into a string via some Databricks Scala Spark code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Merge the collected coordinate strings into one comma-separated string.
val mergeList = udf { strings: Seq[String] => strings.mkString(", ") }

// Collect the coordinates per id in rownum order, keep the complete (longest)
// list per id, and join it into the final coordinate string.
val result = df
  .withColumn("collected",
    collect_list($"coordinates").over(Window.partitionBy("id").orderBy("rownum")))
  .groupBy("id")
  .agg(max($"collected").as("collected"))
  .withColumn("final_coordinates", mergeList($"collected"))
  .select("id", "final_coordinates")

val outputFilePrefix = s"$dataLakeFolderPath/$tableName"
val tmpOutputFolder = s"$outputFilePrefix.tmp"

// Write a single CSV part file into a temporary folder, then copy it to its
// final name and remove the temporary folder. partition_path points at the
// part-*.csv file produced inside tmpOutputFolder.
result
  .coalesce(1)
  .write
  .option("header", "true")
  .csv(tmpOutputFolder)

dbutils.fs.cp(partition_path, s"$outputFilePrefix.csv")
dbutils.fs.rm(tmpOutputFolder, recurse = true)
The final_coordinates column contains my coordinates in the proper order (I had some issues with this). And I can plainly save the file back into my storage account. In the end, I only keep the proper CSV file that I am interested in.
As I said, it's quite fast: it takes about 2.5 minutes for the first step and a couple of seconds for the second one, compared to 2 hours, so I'm quite happy with this solution.

Hadoop vs Cassandra: Which is better for the following scenario?

There is a situation in our systems in which the user can view and "close" a report. After they close it, the report is moved to a temporary table inside the database where it is kept for 24 hours, and then moved to an archives table (where the report is stored for the next 7 years). At any point during the 7 years, a user can "reopen" the report and work on it. The problem is that the archives storage is getting large, and finding/reopening reports tends to be time consuming. I also need to get statistics on the archives from time to time (i.e. report dates, clients, average length "opened", etc.). I want to use a big data approach but I am not sure whether to use Hadoop, Cassandra, or something else. Can someone provide some guidelines on how to get started and decide what to use?
If your archive is large and you'd like to get reports from it, you won't be able to use just Cassandra, as it has no easy means of aggregating the data. You'll end up collocating Hadoop and Cassandra on the same nodes.
From my experience, archives (write once, read many) are not the best use case for Cassandra if you're doing a lot of writes (we've tried it as the backend for a backup system). Depending on your compaction strategy, you'll pay either in space or in IOPS for that. Added changes are propagated through the SSTable hierarchies, resulting in a lot more writes than the original change.
It is not possible to answer your question in full without knowing other variables: how much hardware (servers, their RAM/CPU/HDD/SSD) are you going to allocate? What is the size of each 'report' entry? How many reads/writes do you usually serve daily? How large is your archive storage now?
Cassandra might work fine. Keep two tables, reports and reports_archive. Define the schema using a TTL of 24 hours and 7 years:
CREATE TABLE reports (
  ...
) WITH default_time_to_live = 86400;           -- 24 hours

CREATE TABLE reports_archive (
  ...
) WITH default_time_to_live = 220752000;       -- 86400 * 365 * 7, i.e. 7 years
Use the new Time Window Compaction Strategy (TWCS) to minimize write amplification. It could be advantageous to store the report metadata and report binary data in separate tables.
For roll-up analytics, use Spark with Cassandra. You don't mention the size of your data, but roughly speaking 1-3 TB per Cassandra node should work fine. Using RF=3 you'll need at least three nodes.
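For the roll-up part, a sketch with the DataStax spark-cassandra-connector could look like the following; the keyspace and the column names (client_id, closed_at, open_duration_hours) are invented for illustration, not taken from the question:

import org.apache.spark.sql.functions._

// Read the archive table through the connector and let Spark do the
// aggregation that Cassandra cannot do on its own.
val reports = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "reporting", "table" -> "reports_archive"))
  .load()

reports
  .groupBy(col("client_id"), year(col("closed_at")).as("year"))
  .agg(count("*").as("reports"), avg("open_duration_hours").as("avg_open_hours"))
  .show()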

What is the capacity of a BluePrism Internal Work Queue?

I am working in Blue Prism Robotic Process Automation and trying to load an Excel sheet with more than 100k records (it might go upwards of 300k in some cases).
I am trying to load the internal work queue of Blue Prism, but I get the error quoted below:
'Load Data Into Queue' ERROR: Internal : Exception of type 'System.OutOfMemoryException' was thrown.
Is there a way to avoid this problem, perhaps by freeing up more memory?
I plan to process records one by one from the queue and put them into new Excel sheets categorically. Loading all that data into a collection and looping over it may be memory-consuming, so I am trying to find a more efficient way.
I welcome any and all help/tips.
Thanks!
Basic Solution:
Break up the number of Excel rows you are pulling into your Collection data item at any one time. The thresholds for this will depend on your resource's system memory and architecture, as well as the structure and size of the data in the Excel Worksheet. I've been able to move 50k 10-column rows from Excel to a Collection and then into the Blue Prism queue very quickly.
You can set this up by specifying the Excel Worksheet range to pull into the Collection data item, and then shift that range each time the Collection has been successfully added to the queue.
After each successful addition to the queue and/or before you shift the range and/or at a predefined count limit you can then run a Clean Up or Garbage Collection action to free up memory.
You can do all of this with the provided Excel VBO and an additional Clean Up object.
Keep in mind:
Even breaking it up, looping over a Collection this large to amend the data will be extremely expensive and slow. The most efficient way to make changes to the data will be at the Excel Workbook level or when it is already in the Blue Prism queue.
Best Bet: esqew's alternative solution is the most elegant and probably your best bet.
Jarrick hit it on the nose in that Work Queue items should provide the bot with information on what they are to be working on and a Control Room feedback space, but not the actual work data to be implemented/manipulated.
In this case you would want to just use the item's Worksheet row number and/or some unique identifier from a single Worksheet column as the queue item data, so that the bot can provide Control Room feedback on the status of the item. If this information is predictable enough in format, there should be no need to move any data from the Excel Worksheet to a Collection and then into a Work Queue; rather, simply build the queue based on that data predictability.
Conversely you can also have the bot build the queue "as it happens", in that once it grabs the single row data from the Excel Worksheet to work it, can as well add a queue item with the row number of the data. This will then enable Control Room feedback and tracking. However, this would, in almost every case, be a bad practice as it would not prevent a row from being worked multiple times unless the bot checked the queue first, at which point you've negated the speed gains you were looking to achieve in cutting out the initial queue building in the first place. It would also be impossible to scale the process for multiple bots to work the Excel Worksheet data efficiently.
This is a common issue for RPA, especially when working with large Excel files. As far as I know, there are no 100% solutions, only methods that reduce the symptoms. I have run into this problem several times and these are the ways I would try to handle it:
Set stage logging to Disabled or Errors Only.
Don't log parameters on action stages (especially ones that work with the Excel files).
Run the garbage collection process.
See if it is possible to avoid reading Excel files into BP collections and instead use OLEDB to query the file.
See if it is possible to increase the RAM on the machines.
If they’re using the 32-bit version of the app, then it doesn’t really matter how much memory you feed it, Blue Prism will cap out at 2 GB.
This may be because of the BP Server, as memory is shared between processes and the work queue. A better option is to use two bots and multiple queues to avoid the memory error.
If you're using Excel documents or CSV files, you can use the OLEDB object to connect to and query against them as if they were a database. You can use SQL syntax to limit the number of rows that are returned at a time and paginate through them until you've reached the end of the document.
For starters, you are making incorrect use of the Work Queue in Blue Prism. The Work Queue should not be used to store this type and amount of data. (please read the BP documentation on Work Queues thoroughly).
Solving the issue at hand, being the misuse, requires two changes:
Only store references in your Item Data which point to the Excel file containing the data.
If you're consulting this much data many times, perhaps convert the file into a CSV and write a VBO that queries the data directly in the CSV.
The first change is not just a recommendation, but as your project progresses and IT Architecture and InfoSec comes into play, it will be mandatory.
As for the CSV VBO, take a look at C#; it will make your life a lot easier than loading all this data into BP (time-consuming, unreliable, ...).

MongoID where queries map_reduce association

I have an application that aggregates data from different social network sites. The back-end processes are done in Java and working great.
Its front end is a Rails application. The deadline was 3 weeks for some analytics filter and report tasks; there are still a few days left and it is almost completed.
When I started, I implemented map/reduce for different states and it worked great over 100,000 records on my local machine.
Then my colleague gave me the current updated database, which has 2.7 million records. My expectation was that it would still run well, since I specify a date range and filter before the map_reduce execution; my belief was that map/reduce would only run over the result set of that filter, but that is not the case.
Example
I have a query that just shows stats for records loaded in the last 24 hours.
The result comes back as 0 records found, but only after 200 seconds with the 2.7 million records; before, it came back in milliseconds.
CODE EXAMPLE BELOW
filter is a hash of conditions expected to be applied before map_reduce
map function
reduce function
SocialContent.where(filter).map_reduce(map, reduce).out(inline: true).entries
Suggestions please: what would be the ideal solution in the remaining time frame, as the database is growing exponentially day by day?
I would suggest you look at a few different things:
Does all your data still fit in memory? You have a lot more records now, which could mean that MongoDB needs to go to disk a lot more often.
M/R cannot make use of indexes. You have not shown your Map and Reduce functions, so it's not possible to point out mistakes. Update the question with those functions and what they are supposed to do, and I'll update the answer.
Look at using the Aggregation Framework instead, it can make use of indexes, and also run concurrently. It's also a lot easier to understand and debug. There is information about it at http://docs.mongodb.org/manual/reference/aggregation/
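For comparison, here is the same filter-then-group shape as an aggregation pipeline, sketched with the MongoDB Scala driver rather than Mongoid; the database and collection names, the created_at/state fields, and the 24-hour window are assumptions based on the question:

import java.time.Instant
import java.time.temporal.ChronoUnit
import java.util.Date
import org.mongodb.scala._
import org.mongodb.scala.model.Aggregates.{filter, group}
import org.mongodb.scala.model.Filters.gte
import org.mongodb.scala.model.Accumulators.sum

val client = MongoClient()
val coll = client.getDatabase("mydb").getCollection("social_contents")

// $match first so an index on created_at can be used, then $group by state.
val last24h = Date.from(Instant.now().minus(24, ChronoUnit.HOURS))
val pipeline = Seq(
  filter(gte("created_at", last24h)),
  group("$state", sum("count", 1))
)

coll.aggregate(pipeline).toFuture()

Unlike map/reduce, the $match stage here can use the created_at index, which is what makes the "last 24 hours" query cheap again.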
