Performance issue while using MicroStream

I just started learning MicroStream. After going through the examples published in the MicroStream GitHub repository, I wanted to test its performance with an application that handles more data.
Application source code is available here.
Instructions to run the application and the problems I faced are available here
To summarize, these are my observations:
While loading a file with 2.8+ million records, processing takes 5 minutes.
While calculating statistics based on the loaded data, the application fails with an OutOfMemoryError.
Why is MicroStream trying to load all the data (4 GB) into memory? Am I doing something wrong?

MicroStream is not like a traditional database; it starts from the concept that all data lives in memory. An object graph is persisted to disk (or other media) when you store it through the StorageManager.
In your case, all data is in one list, so accessing that list reads every record from disk. The Lazy reference is not useful the way you have used it, because it only guards access to the single list that holds all the data.
Some optimizations that you can introduce:
Split the data based on vendorId or day, using a Map<String, Lazy<List>>.
Once a Map value has been processed, remove it from memory again by clearing the lazy reference. See https://docs.microstream.one/manual/5.0/storage/loading-data/lazy-loading/clearing-lazy-references.html
Increase the number of channels to speed up reading and writing the data. See https://docs.microstream.one/manual/5.0/storage/configuration/using-channels.html
Don't store the object graph every 10000 lines; store it once at the end of the loading.
Hope this helps you solve the issues you have at the moment.
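The partitioning idea above can be sketched in plain Java. This is only an illustration under assumptions: TripRecord, its fields, and PartitionedRoot are hypothetical names, and plain lists stand in for the Lazy references so the sketch runs without the MicroStream dependency.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type; the field names are assumptions, not taken from the application.
final class TripRecord {
    final String vendorId;
    final double amount;

    TripRecord(String vendorId, double amount) {
        this.vendorId = vendorId;
        this.amount = amount;
    }
}

// Hypothetical root object: one list per vendorId instead of a single list with every record.
final class PartitionedRoot {
    // In MicroStream each value would be wrapped in a Lazy reference;
    // plain lists are used here so the sketch runs without the library.
    private final Map<String, List<TripRecord>> byVendor = new LinkedHashMap<>();

    void add(TripRecord r) {
        byVendor.computeIfAbsent(r.vendorId, k -> new ArrayList<>()).add(r);
    }

    // Computes a statistic while touching only one partition at a time.
    Map<String, Double> totalPerVendor() {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Map.Entry<String, List<TripRecord>> e : byVendor.entrySet()) {
            double sum = 0;
            for (TripRecord r : e.getValue()) {
                sum += r.amount;
            }
            totals.put(e.getKey(), sum);
            // With Lazy values, this is the point where the partition
            // could be cleared from memory before the next one is loaded.
        }
        return totals;
    }
}
```

Because each statistic only walks one partition at a time, only that partition needs to be resident in memory once the values are lazy.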

Related

CFSpreadSheet functions using up memory for large data sets

We have a ColdFusion application that runs a large query (up to 100k rows) and then displays it in HTML. The UI then offers an Export button that triggers writing the report to an Excel spreadsheet in .xlsx format using the cfspreadsheet tag and spreadsheet functions, in particular spreadsheetSetCellValue for building out row and column values, and spreadsheetFormatRow and spreadsheetFormatCell for formatting. The ssObj is then written to a file using:
<cfheader name="Content-Disposition" value="attachment; filename=OES_#sel_rtype#_#Dateformat(now(),"MMM-DD-YYYY")#.xlsx">
<cfcontent type="application/vnd-ms.excel" variable="#ssObj#" reset="true">
where ssObj is the spreadsheet object. We are seeing file sizes of about 5-10 MB.
However, the memory usage for creating this report and writing the file jumps by about 1 GB. The compounding problem is that the Java GC does not release the memory right away after the export completes. When multiple users run and export this type of report, memory keeps climbing until it reaches the allocated heap size, and the server's performance degrades to the point that it brings the server down. A reboot is usually necessary to clear it out.
Is this normal/expected behavior, or how should we be dealing with this issue? Is it possible to easily release the memory used by this operation on demand after the export has completed, so that others running the report readily get access to the freed-up space for their reports? Is this type of memory usage for a 5-10 MB file common with the cfspreadsheet functions and writing the object out?
We have tried temporarily removing the expensive formatting functions and still the memory usage is large for the creation and writing of the .xlsx file. We have also tried using the spreadsheetAddRows approach and the cfspreadsheet action="write" query="queryname" tag passing in a query object but this too took up a lot of memory.
Why do these functions use so much memory? What is the optimal way to generate Excel spreadsheet files without running out of memory?
I should add that the server is running in an Apache/Tomcat container on Windows and we are using CF2016.
How much memory do you have allocated to your CF instance?
How many instances are you running?
Why are you allowing anyone to view 100k records in HTML?
Why are you allowing anyone to export that much data on the fly?
We had issues of this sort (CF and memory) at my last job. Large file uploads consumed memory, large excel exports consumed memory, it's just going to happen. As your application's user base grows, you'll hit a point where these memory hogging requests kill the site for other users.
Start with your memory settings. You might get a boost across the board by doubling or tripling what the app is allotted. Also, make sure you're on the latest version of the supported JDK for your version of CF. That can make a huge difference too.
Large file uploads would impact the performance of the instance making the request. This meant that others on the same instance doing normal requests were waiting for those resources needlessly. We dedicated a pool of instances to only handle file uploads. Specific URLs were routed to these instances via a load balancer and the application was much happier for it.
That app also handled an insane amount of data, and users constantly wanted "all of it". We had to limit search results and certain data sets to reduce the amount shown on screen. The DB was quite happy with that decision. Data exports were moved to a queue so the large Excel files could be built outside of normal page requests. Maybe users got their data immediately, maybe they waited a while to get a notification. Either way, the application performed better across the board.
Presumably a bit late for the OP, but since I ended up here others might too. Whilst there is plenty of general memory-related sound advice in the other answer+comments here, I suspect the OP was actually hitting a genuine memory leak bug that has been reported in the CF spreadsheet functions from CF11 through to CF2018.
When generating a spreadsheet object and serving it up with cfheader+cfcontent without writing it to disk, even with careful variable scoping, the memory never gets garbage collected. So if your app runs enough Excel exports using this method then it eventually maxes out memory and then maxes out CPU indefinitely, requiring a CF restart.
See https://tracker.adobe.com/#/view/CF-4199829 - I don't know if he's on SO but credit to Trevor Cotton for the bug report and this workaround:
Write spreadsheet to temporary file,
read spreadsheet from temporary file back into memory,
delete temporary file,
stream spreadsheet from memory to user's browser.
So given a spreadsheet object that was created in memory with spreadsheetNew() and never written to disk, then this causes a memory leak:
<cfheader name="Content-disposition" value="attachment;filename=#arguments.fileName#" />
<cfcontent type="application/vnd.ms-excel" variable = "#SpreadsheetReadBinary(arguments.theSheet)#" />
...but this does not:
<cfset local.tempFilePath = getTempDirectory()&CreateUUID()&arguments.filename />
<cfset spreadsheetWrite(arguments.theSheet, local.tempFilePath, "", true) />
<cfset local.theSheet = spreadsheetRead(local.tempFilePath) />
<cffile action="delete" file="#local.tempFilePath#" />
<cfheader name="Content-disposition" value="attachment;filename=#arguments.fileName#" />
<cfcontent type="application/vnd.ms-excel" variable = "#SpreadsheetReadBinary(local.theSheet)#" />
It shouldn't be necessary, but Adobe don't appear to be in a hurry to fix this, and I've verified that this works for me in CF2016.

Writing large amount of data using Akka in more efficient way

I've implemented a Scala Akka application that streams 4 different types of data from a biomodule sensor (ECG, EEG, breath and general data). These data (timestamp and value) are typically stored in 4 different CSV files. However, sometimes I have to store each sample in two different files with different timestamps, so the application writes to 8 different CSV files at the same time.
Initially I implemented one Akka actor responsible for persisting data, which receives the path of the file to write to, the timestamp and the value. However, this was a bottleneck, since the number of samples I need to store is large (e.g. one ECG sample is received every 4 ms). As a result, even for a very short experiment, this actor finished recording 1-2 minutes after the experiment was over.
I've also tried 4 actors for the 4 different message types, with the idea of distributing the work. I didn't notice a significant improvement in performance.
I'm wondering if someone has an idea how to improve the performance. Is it better to use one actor for storing files, a few actors, or is it most efficient to have one actor per file? Or maybe it doesn't make any difference? Could I improve my code for storing data?
This is my method responsible for storing data:
def processValue(sample: WaveformValue): Unit = {
  val csvfilewriter = new PrintWriter(new BufferedWriter(new FileWriter(sample.filepath, true)))
  csvfilewriter.append(sample.timestamp.toString)
  csvfilewriter.append(",")
  csvfilewriter.append(sample.value.toString)
  csvfilewriter.append("\r\n")
  csvfilewriter.flush()
  csvfilewriter.close()
}
It seems to me that your bottleneck is I/O -- disk access. It looks like you are opening, writing to, and closing a file for each sample, which is very expensive. I would suggest:
Open each file just once, and close it at the end of all processing. You might need to store the file in a member variable, or if you have an arbitrary collection of files, store them in a map in a member variable.
Don't flush after every sample write.
Use buffered writes for each file writer. This avoids flushing data to the filesystem with every write, which involves a system call and waiting for the data to be written to disk. I see that you're already doing this, but the benefit is lost since you are flushing/closing the file after each sample anyway.
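The three suggestions above could look roughly like the following in Java (the same java.io and java.nio classes are usable from Scala). SampleWriters is a hypothetical helper name, and the sketch assumes it is owned by a single actor, so the map is not accessed concurrently.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: keeps one buffered writer open per file path.
final class SampleWriters implements AutoCloseable {
    private final Map<String, BufferedWriter> writers = new HashMap<>();

    // Appends one "timestamp,value" line; the file is opened only on first use.
    void write(String filePath, long timestamp, double value) throws IOException {
        BufferedWriter w = writers.get(filePath);
        if (w == null) {
            w = Files.newBufferedWriter(Paths.get(filePath), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            writers.put(filePath, w);
        }
        w.write(timestamp + "," + value + "\r\n");
        // No flush here: the buffer is written out when it fills up or on close().
    }

    // Flushes and closes every writer once, at the end of the recording.
    @Override
    public void close() throws IOException {
        for (BufferedWriter w : writers.values()) {
            w.close();
        }
        writers.clear();
    }
}
```

In an actor, the helper would live in a member variable and close() would be called when the actor stops, so each file is opened and closed once per recording rather than once per sample.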

slow-loading persistent store coordinator in core data

I have been developing a Cocoa app with Core Data. Initially everything seemed fine, but as I added data to the application, I found that the initial data window took ages to load. To fix that, I moved to another startup window that didn't have the data, so start-up was snappy. However, no matter what I do, my first fetch AND my first attempt to load a data window (with table views) are always slow. (That is, if I fetch slowly and then ask for the data window, both will be slow the first time around.) After that, performance is acceptable.
I traced through my application and found that while I can quickly step through the program, no matter what, the step that retrieves the persistent store coordinator is incredibly slow ... 15 - 20 seconds can elapse with a spinning beach ball.
I've read elsewhere that I might want to denormalize the data. I don't think that will be sufficient. An earlier version was far less "interconnected" between the entities, and it still was a slug at startup. Now I'm looking at entities that may have as high as 18,000 managed objects. Some of the relations are essential to having the data work correctly.
I've also read about the option of employing a separate managed object context in the background. The problem with this is that even this background context would take too long to be usable. If the user tries to run a search, he or she will still be waiting forever for that context to load. I might buy myself a few seconds while the user decides what to type in to the search field, but I can't afford to stall for 25 seconds.
I noticed that once data is imported into the persistent store, even searches on a table that is not related to others (and only has 1000 objects) still take ages to load. The reason seems to be that it's the coordinator retrieval itself that's slow, not the actual fetch or the context.
Can anyone point me in the right direction on how to resolve this? Thanks!
Before you create your data model:
If you’re storing large objects such as photos, audio or video, you need to be very careful with your model design.
The key point to remember is that when you bring a managed object into a context, you’re bringing all of its data into memory.
If large photos are within managed objects cut from the same entity that drives a table-view, performance will suffer. Even if you’re using a fetched results controller, you could still be loading over a dozen high-resolution images at once, which isn’t going to be instant.
To get around this issue, attributes that will hold large objects should be split off into a related entity. This way the large objects can remain in the persistent store and can be represented by a fault instead, until they really are needed.
If you need to display photos in a table view, you should use auto-generated thumbnail images instead.
Read the whole article
You might be getting ahead of yourself thinking PSC is the culprit.
There is more going on behind the scenes with CoreData than is readily obvious -- PSC is very flexible and must be directed.
A realistic approach for the data size you specified (18K) is to focus on modularizing the logic of your fetch request templates and validation for specific size cases (think small medium large XtraLarge, etc.).
The suggestion to denormalize your data does not take into account the overhead of getting your data into a fully denormalized state, plus a (sometimes) unintended side effect of denormalization is sparsity (unless you have a very specific model, of course).
Since you usually do not know beforehand what data will be accessed and modified, make a one-to-many relationship between your central task and any subtasks. This will free up some constraints on your data access.
You can always give your end users the option to choose how they want to handle the larger datasets.

Clearing and freeing memory

I am developing a Windows application using C# .NET. It is in fact a plug-in that is installed into a DBMS. The purpose of this plug-in is to read all the records (a record is an object) in the DBMS that match the provided criteria and transfer them to my local file system as XML files. My problem is related to memory usage. Everything is working fine, but each time I read a record it occupies memory, and after a certain limit the plug-in stops working because it runs out of memory.
I am dealing with around 10k-20k records (objects). Are there any memory-related methods in C# to clear the memory of each record as soon as it is written to the XML file? I tried all the basic memory handling methods like clear(), flush(), gc() and finalize(), but to no avail.
Please consider the following:
A record is an object; I cannot change this and use other, more efficient data structures.
Each time I read a record I write it to XML, and I repeat this again and again.
C# is a garbage collected language. Therefore, to reclaim memory used by an object, you need to make sure all references to that object are removed so that it is eligible for collection. Specifically, this means you should remove the objects from any data structures that are holding references to them after you're done doing whatever you need to do with them.
If you get a little more specific about what type of data structures you're using we can probably give a more specific answer.

5GB file to read

I have a design question. I have a 3-4 GB data file, ordered by time stamp. I am trying to figure out what the best way is to deal with this file.
I was thinking of reading this whole file into memory, then transmitting this data to different machines and then running my analysis on those machines.
Would it be wise to upload this into a database before running my analysis?
I plan to run my analysis on different machines, so doing it through database would be easier but if I increase the number machines to run my analysis on the database might get too slow.
Any ideas?
Update:
I want to process the records one by one. Basically I am trying to run a model on timestamped data, but I have various models, so I want to distribute the work so that the whole process runs overnight every day. I want to make sure that I can easily increase the number of models without decreasing system performance. That is why I am planning to distribute the data to all the machines running the models (each machine will run a single model).
You can also access the file directly on the hard disk and read a small chunk at a time. Java has a class called RandomAccessFile for this, and the same concept is available in other languages.
Whether you load the data into a database for analysis should be governed purely by the requirements. If you can read the file and process it as you go, there is no need to store it in a database. But if the analysis requires data from many different areas of the file, then a database would be a good idea.
You do not need the whole file into memory, just the data you need for analysis. You can read every line and store only the needed parts of the line and additionally the index where the line starts in file, so you can find it later if you need more data from this line.
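A minimal Java sketch of the offset index described above, using RandomAccessFile. LineIndex is a hypothetical helper name; the sketch stores only the byte offset where each line starts and assumes single-byte (ASCII or Latin-1) text, because RandomAccessFile.readLine() maps each byte directly to a character.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: remembers where each line starts instead of keeping the lines.
final class LineIndex {
    private final List<Long> offsets = new ArrayList<>();
    private final String path;

    LineIndex(String path) throws IOException {
        this.path = path;
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long start = file.getFilePointer();
            // readLine() is unbuffered and slow on big files; it keeps the sketch short.
            while (file.readLine() != null) {
                offsets.add(start);
                start = file.getFilePointer();
            }
        }
    }

    int lineCount() {
        return offsets.size();
    }

    // Seeks straight to the stored offset and reads just that one line.
    String line(int n) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offsets.get(n));
            return file.readLine();
        }
    }
}
```

For a multi-gigabyte file the index costs one long per line, which is far smaller than the data itself; the fields needed for the analysis could be stored next to each offset in the same pass.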
Would it be wise to upload this into a database before running my analysis?
Yes.
I plan to run my analysis on different machines, so doing it through database would be easier but if I increase the number machines to run my analysis on the database might get too slow.
Don't worry about it, it will be fine. Just introduce a marker so that the rows processed by each computer are identified.
I'm not sure I fully understand all of your requirements, but if you need to persist the data (refer to it more than once,) then a db is the way to go. If you just need to process portions of these output files and trust the results, you can do it on the fly without storing any contents.
Only store the data you need, not everything in the files.
Depending on the analysis needed, this sounds like a textbook case for using MapReduce with Hadoop. It will support your requirement of adding more machines in the future. Have a look at the Hadoop wiki: http://wiki.apache.org/hadoop/
Start with the overview, get the standalone setup working on a single machine, and try doing a simple analysis on your file (e.g. start with a "grep" or something). There is some assembly required but once you have things configured I think it could be the right path for you.
I had a similar problem recently, and just as @lalit mentioned, I used RandomAccessFile against my file on the hard disk.
In my case I only needed read access to the file, so I launched a bunch of threads, each starting at a different point in the file. That got the job done and really improved my throughput, since each thread could spend a good amount of time blocked on processing while other threads were reading the file.
A program like the one I mentioned should be very easy to write, just try it and see if the performance is what you need.
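One way such a program might look in Java is sketched below. ChunkedReader and its method names are hypothetical, the per-line work is reduced to counting, and the file is split into equal byte ranges with each thread handling the lines that start inside its range.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: each thread reads its own byte range of the same file.
final class ChunkedReader {
    // Counts the lines whose first byte lies in [start, end).
    static long countLines(String path, long start, long end) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            if (start > 0) {
                // Step back one byte and discard up to the next line break, so the
                // thread begins at the first line that starts at or after 'start'.
                file.seek(start - 1);
                file.readLine();
            }
            long count = 0;
            while (file.getFilePointer() < end && file.readLine() != null) {
                count++; // real per-record processing would go here
            }
            return count;
        }
    }

    // Splits the file into byte ranges and processes them on a fixed thread pool.
    static long countLinesParallel(String path, int threads)
            throws IOException, InterruptedException, ExecutionException {
        long length;
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            length = file.length();
        }
        long chunk = Math.max(1, (length + threads - 1) / threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (long start = 0; start < length; start += chunk) {
                final long s = start;
                final long e = Math.min(length, start + chunk);
                Callable<Long> task = () -> countLines(path, s, e);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each thread opens its own RandomAccessFile, so the file pointers do not interfere with one another. Whether this actually improves throughput depends on the disk and on how much time the per-record processing takes.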
