I've seen developers run into this problem for years. I have searched many forums and the official POI documentation, but I still haven't found an answer.
The problem is this: I have tried the following two snippets:
Workbook wb = WorkbookFactory.create(new File("spreadsheet.xlsx"));
and
File file = new File("C:\\spreadsheet.xlsx");
OPCPackage opcPackage = OPCPackage.open(file.getAbsolutePath());
XSSFWorkbook workbook = new XSSFWorkbook(opcPackage);
Either approach takes about 5-6 minutes (if the application doesn't run out of memory) to process a simple, fairly small spreadsheet.xlsx file (200 KB).
What do I need to do to fix this? (I'm using Apache POI 3.9)
/*****************************/
The process takes a long time in the following location:
public class XSSFSheet extends POIXMLDocumentPart implements Sheet {
    ...
    protected void read(InputStream is) throws IOException {
        try {
            worksheet = WorksheetDocument.Factory.parse(is).getWorksheet();  // <-- slow line
        } catch (XmlException e) {
            throw new POIXMLException(e);
        }
    }
    ...
I can't debug any further, and VisualVM points to the same place.
One factor that might be contributing to the load time is that the data has been pasted into the worksheet in a way that makes the used range include every row, i.e. when you check the sheet's UsedRange row count it returns more than 1,000,000 rows. I'm not sure how this happens, but I found that I needed to perform an intermediary step: prior to loading the workbook, I 'cleaned' it with a VBA script. The workbook has around 20 sheets of around 5,000 rows each, each filled out by a different part of the business, and it takes a fairly long time (maybe 4 minutes) to load, but that is acceptable in this case. Before I added the cleaning stage it ran for over 30 minutes, which was not acceptable.
A user runs the process I am referring to by pressing two buttons. The first cleans, the second does the rest. The first process is triggered using Runtime.getRuntime().exec and creates an empty text file; the second process will not run unless that text file is there.
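Not part of the original answer, but for reference: one way to sidestep the expensive WorksheetDocument DOM parse shown above is POI's event (SAX) API, which streams the sheet XML instead of materializing it. The following is only a minimal sketch that counts rows in the first sheet; the file path is a placeholder, and it assumes the XSSFReader event API is available in POI 3.9.

import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SheetRowCounter {
    public static void main(String[] args) throws Exception {
        OPCPackage pkg = OPCPackage.open("C:\\spreadsheet.xlsx"); // placeholder path
        try {
            XSSFReader reader = new XSSFReader(pkg);
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            final int[] rowCount = {0};
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String name, Attributes attrs) {
                    // Count <row> elements as they stream past instead of building a DOM.
                    if ("row".equals(local) || "row".equals(name)) {
                        rowCount[0]++;
                    }
                }
            });
            InputStream sheet = reader.getSheetsData().next(); // first sheet only
            try {
                parser.parse(new InputSource(sheet));
            } finally {
                sheet.close();
            }
            System.out.println("Rows in first sheet: " + rowCount[0]);
        } finally {
            pkg.close();
        }
    }
}

Running something like this against the problem file should also confirm the phantom used-range issue described above (a row count near 1,000,000 for a 200 KB file).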
I have a simple Apps Script that fetches the first row from a data sheet and populates it into another sheet in the same spreadsheet. It usually executes within 1-3 seconds.
For the last few days I have been observing that the script takes a long time at getSheetByName() for the first sheet. Once the first sheet has been fetched, the next sheet takes no time. The logs below show it took more than 90 seconds just to execute getSheetByName() for the first sheet (Calling Dashboard). The second sheet is fetched almost instantaneously, along with the rest of the script. This happens randomly after several executions and it is affecting our work.
I have tried SpreadsheetApp.flush(), but that does not help when this happens.
I am wondering whether there is a better way of handling this, or whether I have missed anything. I have gone through several online resources but could not find any guidance on this kind of issue.
I am attaching my script; any help will be very much appreciated!
(Logs snapshot attached as an image.)
function fetchNextCallBack() {
  Logger.log("Start Function");
  const myGooglSheet = SpreadsheetApp.getActiveSpreadsheet();
  Logger.log("Active Spreadsheet initiated");
  //SpreadsheetApp.flush();
  const shUserForm = myGooglSheet.getSheetByName("Calling Dashboard");
  Logger.log("Calling Dashboard Initiated");
  const datasheet = myGooglSheet.getSheetByName("Call Backs");
  Logger.log("Call Backs Initiated");

  shUserForm.getRange("C8:C22").clearContent();
  shUserForm.getRange("F10:F18").clearContent();
  shUserForm.getRange("M4:M6").clearContent();
  Logger.log("Dashboard cleared");

  const values = datasheet.getRange("A3:N3").getValues();
  Logger.log("Call back Data fetched");

  shUserForm.getRange("C8").setValue(values[0][5]);   // vehicle no
  shUserForm.getRange("C10").setValue(values[0][3]);  // mobile no
  shUserForm.getRange("C12").setValue(values[0][2]);  // customer name
  shUserForm.getRange("F12").setValue(values[0][4]);  // model
  shUserForm.getRange("C14").setValue(values[0][1]);  // call type
  shUserForm.getRange("F14").setValue(values[0][6]);  // service type
  shUserForm.getRange("F20").setValue(values[0][13]); // cre
  shUserForm.getRange("C18").setValue(values[0][11]); // appt date
  shUserForm.getRange("F18").setValue(values[0][12]); // appt slot
  shUserForm.getRange("F10").setValue("REMINDER CALL");
  Logger.log("Call back Data populated in dashboard");
}
The script is expected to complete within 1-3 seconds. However, it sometimes abruptly takes more than 90 seconds, and it always gets stuck on the following line:
getSheetByName()
The code you show seems fine. Chances are that the spreadsheet is on the heavy side, and that makes accessing the data slow. You can confirm whether that is the case by making a copy of the spreadsheet and deleting everything on every tab, leaving say 25 blank rows on each tab. Then test the code in the blank spreadsheet and see if the same thing happens.
To improve spreadsheet performance, see these optimization tips.
Can anyone help me find the right API to improve write performance?
We use the MultipleOutputs<ImmutableBytesWritable, Result> class to write data we read from a table, and we use the newly created file as a backup. We face a write-performance issue with MultipleOutputs: it takes nearly 5 seconds for every 10,000 records we write.
This is the code we use:
Result[] results = ... ; // results read from another table
// MultipleOutputs must be created from the task context (typically in setup())
MultipleOutputs<ImmutableBytesWritable, Result> mos =
        new MultipleOutputs<ImmutableBytesWritable, Result>(context);
for (Result res : results) {
    mos.write(new ImmutableBytesWritable(res.getRow()), res, baseoutputpath);
}
We get a batch of 10,000 rows and write them in a loop, with baseoutputpath changing depending on the Result content.
We see a performance dip when writing through MultipleOutputs, and we suspect it might be due to writing in a loop.
Is there any other API in MapR-DB or HBase that pushes data to the database with fewer RPC calls by buffering up to a certain limit?
We write data as records, so a plain file-system write class would not work for us.
Please note that we use a MapReduce job to do all of the above.
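No answer was posted for this one, but since the question explicitly asks about a buffering HBase/MapR-DB API: if writing into a table (rather than a file) is acceptable, the HBase client's BufferedMutator batches mutations client-side and sends them in fewer RPCs. The following is only a sketch, assuming an HBase 1.x-style client; the table name, column family/qualifier, and buffer size are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedBackupWriter {
    public static void writeBatch(Result[] results) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        BufferedMutatorParams params =
                new BufferedMutatorParams(TableName.valueOf("backup_table")) // placeholder table
                        .writeBufferSize(8L * 1024 * 1024); // flush roughly every 8 MB
        try (Connection conn = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator = conn.getBufferedMutator(params)) {
            for (Result res : results) {
                Put put = new Put(res.getRow());
                // Copy one placeholder column; in practice copy every cell of the Result.
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"),
                        res.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q")));
                mutator.mutate(put); // buffered locally, sent in batched RPCs
            }
            mutator.flush(); // push anything still buffered
        }
    }
}

If the output must remain a file rather than a table, this sketch does not apply as-is.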
I need to develop a streaming application that reads session logs from several sources.
The batch interval would be on the order of 5 minutes.
The problem is that the files I get in each batch vary enormously in size. In one batch I may get a file of about 10 MB, and in another batch files of around 20 GB.
I want to know if there is any approach to handle this. Is there any limit on the size of the RDDs a file stream can generate for each batch?
Can I limit Spark Streaming to read only a fixed amount of data into the RDD in each batch?
As far as I know, there is no direct way to limit that. Which files are considered is controlled by the private isNewFile function in the file stream implementation. Based on that code I can think of one workaround.
Use a filter function to limit the number of files read: for any files beyond the first 10, return false, and use the touch command to update their timestamps so they are considered in a later window.
import org.apache.hadoop.fs.Path

var globalCounter = 10

val filterF: Path => Boolean = (file: Path) => {
  globalCounter -= 1
  if (globalCounter >= 0) {
    true // consider only the first 10 files
  } else {
    // touch the file so its timestamp is updated and it is picked up in a later window
    false
  }
}
I have a Hadoop job that has ~60k S3 input paths. This job takes about 45 minutes to start. The same job, with only ~3k S3 input paths, starts almost instantly.
Why does having a large number of input paths cause the job to take so long to start?
The answer has to do with how FileInputFormat.addInputPath(...) is implemented. If you take a look at the source below, you'll see that it's actually doing a string concatenation to save all of these paths into the configuration. Calling addInputPaths(...) just calls addInputPath, so there are no savings there. I ended up calling FileInputFormat.setInputPaths(Job, Path[]), which skips the 60k+ string concatenations by building that part of the configuration once.
As climbage mentioned, there will need to be 60k+ calls to S3 to build the splits. It turns out that the S3 calls were taking less time than the string concatenation. My jobs went from taking 45 minutes to start down to less than 20.
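For illustration only (not from the original post), here is a minimal sketch of collecting the paths up front and handing them to setInputPaths in a single call; the job name and the inputUris list are placeholders.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathSetup {
    // inputUris: the ~60k "s3n://bucket/key" strings gathered elsewhere (placeholder)
    static Job newJob(Configuration conf, List<String> inputUris) throws Exception {
        Job job = Job.getInstance(conf, "backup-job"); // job name is a placeholder
        Path[] paths = new Path[inputUris.size()];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(inputUris.get(i));
        }
        // One call, one string build, instead of 60k incremental concatenations.
        FileInputFormat.setInputPaths(job, paths);
        return job;
    }
}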
For those who don't want to go combing through the source, here's the implementation of FileInputFormat.addInputPath() in Hadoop 2.5.1:
public static void addInputPath(Job job, Path path) throws IOException {
    Configuration conf = job.getConfiguration();
    path = path.getFileSystem(conf).makeQualified(path);
    String dirStr = StringUtils.escapeString(path.toString());
    String dirs = conf.get(INPUT_DIR);
    conf.set(INPUT_DIR, dirs == null ? dirStr : dirs + "," + dirStr);
}
and FileInputFormat.setInputPaths() in Hadoop 2.5.1:
public static void setInputPaths(Job job, Path... inputPaths) throws IOException {
    Configuration conf = job.getConfiguration();
    Path path = inputPaths[0].getFileSystem(conf).makeQualified(inputPaths[0]);
    StringBuffer str = new StringBuffer(StringUtils.escapeString(path.toString()));
    for (int i = 1; i < inputPaths.length; i++) {
        str.append(StringUtils.COMMA_STR);
        path = inputPaths[i].getFileSystem(conf).makeQualified(inputPaths[i]);
        str.append(StringUtils.escapeString(path.toString()));
    }
    conf.set(INPUT_DIR, str.toString());
}
One of the first things that FileInputFormat does during MapReduce initialization is determine the input splits. This is done by creating a list of every input file and its information (such as file size). I imagine that 60k API calls to S3 for file information aren't fast. 45 minutes seems extraordinarily slow; there may be some rate limiting going on as well?
Sorry for reopening an old question, but I recently came across a similar issue.
The core of it is that, in your case, Hadoop will make 60K calls to AWS.
To work around this, one can use wildcards:
FileInputFormat.addInputPath(job, new Path("path_to_a_folder/prefix*"));
This will generate only one AWS call to list the directory path_to_a_folder and then filter by the prefix.
I hope this helps whoever finds this question.
I have a mobile app that is using LinqToDatasets to update/insert into a SQL Server CE 3.5 file.
My Code looks like this:
// All the MyClass updates
MyTableAdapter myTableAdapter = new MyTableAdapter();

foreach (MyClassToInsert myClass in updates.MyClassChanges)
{
    // Update the row if it is already there
    int result = myTableAdapter.Update(myClass.FirstColumn,
                                       myClass.SecondColumn,
                                       myClass.FirstColumn);

    // If the row was not there then insert it.
    if (result == 0)
    {
        myTableAdapter.Insert(myClass.FirstColumn, myClass.SecondColumn);
    }
}
This code is used to keep the handheld database in sync with the server database. The problem is that on a full update (the first time, for example) there are a lot of updates (about 125), and that makes this code (and more loops like it) take a very long time; I have three such loops that take over 30 seconds each.
Is there a faster or better way to do updates/inserts like this?
(I did see this Codeplex Project, but I could not see how to make it work with both updates and inserts.)
You should always use SqlCeResultSet for data access on mobile devices to get maximum performance and minimal memory usage. Identify the data to be inserted and handle the inserts with code like the SqlCeBulkCopy sample; handle the updates with similar code that uses the Seek and Update methods of the SqlCeResultSet.