Data warehouses and atomicity rarely coexist - etl

This came up in the context of a humorous tweet about wrapping a data warehouse overnight load in a transaction, and how this would bloat the log file and eat up disk space. I'm not trying to disprove it, but rather to understand it better: to me it seems to imply that a partial load (due to an error) should be allowed to complete, which would mean that the DW would not accurately reflect the source system(s).
The only way I can make sense of it is if the incomplete records were loaded into an intermediate staging layer in the DW, but not processed further until a subsequent overnight load completes them.
I tried to research this further but without success, so I would be really grateful for any advice.

When an error happens during the loading of a DW you could:
stop the load and roll back (either to the start of the load of a single target object, of a group of objects, or of the whole DW)
stop the load and leave the DW as it is at that point
log the error and continue the load (either of the failing target or of other objects in the DW)
Which option you choose depends entirely on your particular circumstances, and you might have many different strategies in use at different points in your ETL pipeline, depending on the number of errors. For example:
The error may allow you to continue to load other dims/facts without affecting them
Your business might prefer a fact table to be loaded minus one failing record rather than missing a complete day's data until the error is fixed (the sketch below illustrates this log-and-continue approach)
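To make the log-and-continue option concrete, here is a minimal JDBC sketch. The stg_orders, fact_orders and etl_errors tables, the column names and the connection URL are all hypothetical placeholders, not from the original question; a savepoint per row lets a single bad record fail without aborting the surrounding transaction:

import java.sql.*;

// Minimal sketch of the "log the error and continue" strategy.
// Each row gets its own savepoint: a failure rolls back only that row,
// is recorded in an error table, and the load carries on.
public class ResilientLoad {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/dw", "etl", "secret")) {
            con.setAutoCommit(false);
            try (Statement read = con.createStatement();
                 ResultSet rs = read.executeQuery("SELECT id, amount FROM stg_orders");
                 PreparedStatement ins = con.prepareStatement(
                         "INSERT INTO fact_orders (id, amount) VALUES (?, ?)");
                 PreparedStatement err = con.prepareStatement(
                         "INSERT INTO etl_errors (id, message) VALUES (?, ?)")) {
                while (rs.next()) {
                    Savepoint sp = con.setSavepoint();
                    try {
                        ins.setLong(1, rs.getLong("id"));
                        ins.setBigDecimal(2, rs.getBigDecimal("amount"));
                        ins.executeUpdate();
                    } catch (SQLException rowFailure) {
                        con.rollback(sp);                  // undo just this row
                        err.setLong(1, rs.getLong("id"));  // log it...
                        err.setString(2, rowFailure.getMessage());
                        err.executeUpdate();               // ...and keep loading
                    }
                }
                con.commit();  // good rows and the error log land together
            }
        }
    }
}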

Related

Lost Duration while Debugging Apex CPU time limit exceeded

I'm open to posting the code in this section to work through the optimization, but it's a bit lengthy and complex, so instead I'm hoping that somebody can assist me with a few debugging questions I have. My goal is to find out what is causing my Apex CPU Time Limit Exceeded issue.
When using the Debug Log in its basic or normal layout, I receive the message
Maximum CPU Time: 15062 out of 10,000 ** Close to Limit
I've optimized and rewritten various loops and queries several times now, and in each case this number ends up around the same value, which leads me to believe it is lying to me and that my actual usage far exceeds that number. So on my journey I switched the Log Panels of the Developer Console to Analysis in hopes of isolating exactly which loop, method, or area of the code is giving me a headache.
This leads me to my main question and problem.
Execution Tree, Performance Tree & Executed Units
All show me that my durations are UNDER the 10,000 ms allowance. My largest consumer is 3,556.19 ms, which is used by a wrapper class I created: the constructor contains a fair amount of logic building a fairly complicated wrapper that spans 5-7 custom objects. Still, even with those 3,000 ms, the remainder of the process shows negligible times, bringing my total to around 4,000 ms. Again, my question is: why am I unable to see or find what is consuming all my time?
Incorrect Iteration Data
In addition, on the Performance tree there is a column of data that shows the number of iterations for each method. I know that my production org has 81 objects that would essentially call the constructor of my custom wrapper object, i.e. my constructor SHOULD be called 81 times, but instead it is called 32 times. So my other question is: can I rely on the iteration data in that column? Or does it stop counting at a certain point because it was iterating so many times? It's possible that one of my objects is corrupted or causing an infinite loop somehow, but I don't want to dig through all the data in search of that conclusion if it's a known issue that the iteration data is not accurate anyway.
System.Debug in the Production org
The last question is why my System.debug() lines are not displaying in my Developer Console on the production org. I've added several breadcrumbs throughout the code that would help me isolate which objects are making it through and which are not; however, I cannot view System.debug messages in any layout outside of my sandbox.
Sorry for the wealth of questions but I did want to give an honest effort to better understand the debugging process in Salesforce. If this is a lost cause I'm happy to start sharing some code as well but hopefully some debugging tips can get me to the solution.
It's likely your debug log got truncated; see "Each debug log must be 20 MB or smaller. If it exceeds this amount, you won’t see everything you need." in https://trailhead.salesforce.com/en/content/learn/modules/apex_basics_dotnet/debugging_diagnostics
Download the log and search for text similar to "skipped 123456 bytes of detailed log" to confirm; some System.debug statements will simply not show up.
You might have to fine-tune the log levels (don't log validation rules and workflows; don't log every single variable assignment with the "FINE" level, etc.). You might have to set all flags to NONE and then track only the one class/trigger that you suspect (see https://help.salesforce.com/articleView?id=code_debug_log_classes.htm&type=5 and https://salesforce.stackexchange.com/questions/214380/how-are-we-supposed-to-use-debug-logs-for-a-specific-apex-class-only)
If it's truncated, it's possible the analysis tools give up (I had mixed luck with the console, to be honest; sometimes https://apextimeline.herokuapp.com/ is great for an overview, but it'll also fail to parse a 20 MB log).
When all else fails you can load the log into Notepad++ (or any editor of your choice), find the lines related to method entry/exit (you might need a regular-expression search), take these filtered lines to Excel, play with "text to columns" and just look at the timing manually, to see if there's a record that causes the spike. It could be record #10 that's the problem; the fact that limits are exhausted on #32 of 81 doesn't mean much. A search like [METHOD_ENTRY|METHOD_EXIT]MyTriggerHandler.onBeforeUpdate could be a good start. But the first thing is to make sure the log is not truncated.
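If the log is complete, the Notepad++/Excel step can also be automated. Here is a rough sketch (plain Java, run outside Salesforce on the downloaded log) that pairs METHOD_ENTRY/METHOD_EXIT events and prints the ten biggest time consumers, inclusive of nested calls. It assumes the standard "HH:mm:ss.SSS (nanosElapsed)|EVENT|...|MethodName" debug-log line layout, so adjust the regex if your log differs:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Pairs METHOD_ENTRY/METHOD_EXIT events from an Apex debug log and sums
// the elapsed nanoseconds per method (inclusive of nested calls).
public class ApexLogTimer {
    private static final Pattern EVENT =
            Pattern.compile("\\((\\d+)\\)\\|METHOD_(ENTRY|EXIT)\\|.*\\|([^|]+)$");

    public static void main(String[] args) throws IOException {
        Deque<Long> starts = new ArrayDeque<>();   // entry timestamps (nanos)
        Deque<String> names = new ArrayDeque<>();  // matching method names
        Map<String, Long> totals = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = EVENT.matcher(line);
            if (!m.find()) continue;
            if (m.group(2).equals("ENTRY")) {
                starts.push(Long.parseLong(m.group(1)));
                names.push(m.group(3));
            } else if (!starts.isEmpty()) {
                totals.merge(names.pop(),
                        Long.parseLong(m.group(1)) - starts.pop(), Long::sum);
            }
        }
        totals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.printf("%,15d ns  %s%n",
                        e.getValue(), e.getKey()));
    }
}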

An issue of partial insertion of data into the target when the job fails

We have a 17-record data set in one of the source tables, with erroneous data in the 14th record, which causes the job to fail. Because the commit size is set to 10 in the tMysqlOutput component, only 10 records are inserted into the target before the job fails. In the next execution, after correcting the error record, the job fetches all 17 records and runs successfully, which leaves duplicates in the target.
What we tried:
To overcome this, we tried the tMysqlRollback component, together with the tMysqlConnection and tMysqlCommit components.
Q1: Is there any other option to use tMysqlRollback without using the tMysqlConnection and tMysqlCommit components?
We explored the tMysqlRollback and tMysqlCommit components in the documentation:
https://help.talend.com/reader/QgrwjIQJDI2TJ1pa2caRQA/7cjWwNfCqPnCvCSyETEpIQ
but we are still looking for clues on how to design the above process in an efficient manner.
Q2: We'd also like to know about RAM usage and disk space consumption from a performance perspective.
Any help would be much appreciated.
No, the only way to do transactions in Talend is to open a connection using tMysqlConnection, then either commit using tMysqlCommit or roll back using tMysqlRollback.
Without knowing what you're doing in your job (lookups, transformations, etc.), it's hard to advise you on RAM consumption and performance. But if you only have a source-to-target flow, then RAM consumption should be minimal (make sure you enable streaming on the tMysqlInput component). If another database is your source, then RAM consumption depends on how that database's driver is configured (JDBC drivers usually accept a parameter telling them to fetch only a certain number of records at a time).
Lookups and components that process data in memory (tSortRow, tUniqRow, tAggregateRow, etc.) are what cause memory issues, but it's possible to tweak their usage (using disk, among other methods).
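For context on why those three components travel together, this is roughly the JDBC pattern they wrap (a sketch, not Talend's generated code; the table name, credentials and the 17-row loop are placeholders standing in for the job above). The connection component owns the transaction, and commit/rollback are just calls on that same shared connection, which is why tMysqlRollback cannot be used on its own:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// One shared connection with auto-commit off: commit only after every
// row succeeded (tMysqlCommit), roll back otherwise (tMysqlRollback).
public class AllOrNothingLoad {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/target", "etl", "secret")) {
            con.setAutoCommit(false);              // tMysqlConnection, auto-commit off
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO target_table (id, payload) VALUES (?, ?)")) {
                for (int id = 1; id <= 17; id++) { // the 17 source records
                    ps.setInt(1, id);
                    ps.setString(2, "row " + id);
                    ps.executeUpdate();            // record 14 failing throws here
                }
                con.commit();                      // all 17 rows land, or none
            } catch (SQLException e) {
                con.rollback();                    // no partial load left behind
                throw e;
            }
        }
    }
}

With the commit deferred to the end, a failure on record 14 leaves the target empty instead of half-loaded, so the corrected re-run cannot create duplicates.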

Slow-loading persistent store coordinator in Core Data

I have been developing a Cocoa app with Core Data. Initially everything seemed fine, but as I added data to the application, I found that the initial data window took ages to load. To fix that, I moved to another startup window that didn't have the data, so start-up was snappy. However, no matter what I do, my first fetch AND my first attempt to load a data window (with tables views) are always slow. (That is, if I fetch slowly and then ask for the data window, both will be slow the first time around.) After that, performance is acceptable.
I traced through my application and found that while I can quickly step through the program, no matter what, the step that retrieves the persistent store coordinator is incredibly slow: 15-20 seconds can elapse with a spinning beach ball.
I've read elsewhere that I might want to denormalize the data. I don't think that will be sufficient; an earlier version was far less "interconnected" between the entities, and it was still a slug at startup. Now I'm looking at entities that may have as many as 18,000 managed objects. Some of the relations are essential to having the data work correctly.
I've also read about the option of employing a separate managed object context in the background. The problem with this is that even this background context would take too long to be usable. If the user tries to run a search, he or she will still be waiting forever for that context to load. I might buy myself a few seconds while the user decides what to type in to the search field, but I can't afford to stall for 25 seconds.
I noticed that once data is imported into the persistent store, even searches on a table that is not related to others (and only has 1,000 objects) still take ages to load. The reason seems to be that it's the coordinator retrieval itself that's slow, not the actual fetch or the context.
Can anyone point me in the right direction on how to resolve this? Thanks!
Before you create your data model:
If you’re storing large objects such as photos, audio or video, you need to be very careful with your model design.
The key point to remember is that when you bring a managed object into a context, you’re bringing all of its data into memory.
If large photos live in the same entity that drives a table view, performance will suffer. Even if you're using a fetched results controller, you could still be loading over a dozen high-resolution images at once, which isn't going to be instant.
To get around this issue, attributes that will hold large objects should be split off into a related entity. This way the large objects can remain in the persistent store and can be represented by a fault instead, until they really are needed.
If you need to display photos in a table view, you should use auto-generated thumbnail images instead.
Read the whole article
You might be getting ahead of yourself in thinking the persistent store coordinator (PSC) is the culprit.
There is more going on behind the scenes with Core Data than is readily obvious; the PSC is very flexible and must be directed.
A realistic approach for the data size you specified (18K objects) is to focus on modularizing the logic of your fetch request templates and validation for specific size cases (think small, medium, large, extra-large, etc.).
The suggestion to denormalize your data does not take into account the overhead of getting your data into a fully denormalized state; plus a (sometimes) unintended side effect of denormalization is sparsity (unless you have a very specific model, of course).
Since you usually do not know beforehand what data will be accessed and modified, make a one-to-many relationship between your central task and any subtasks. This will free up some constraints on your data access.
You can always give your end users the option to choose how they want to handle the larger datasets.

How to deactivate safe mode in the mongo shell?

The short question is in the title: I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
The long question, for those willing to know the context:
I am working on a huge set of data like
{
    _id: ObjectId("azertyuiopqsdfghjkl"),
    stringdate: "2008-03-08 06:36:00"
}
and some other fields; there are about 250M documents like that (the whole database with its indexes weighs 36 GB). I want to convert the date into a real ISODate field. I searched a bit for how I could run an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but did not find how to make this work, so I resolved to write a script that takes the documents one after the other and makes an update to set a new field whose value is new Date(stringdate). The query uses the _id, so the default index is used.
The problem is that it takes a very long time. I already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added. I also set an index on a relevant field to process the database chunk by chunk. Finally, I ran several concurrent mongo clients on both the server and my workstation to ensure that the limiting factor is the availability of the database lock and not any other factor like CPU or network costs.
I monitored the whole thing with mongotop, mongostat and the web monitoring interfaces, which confirmed that the write lock is taken 70% of the time. I am a bit disappointed that MongoDB does not have more precise granularity in its write lock; why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have sharded the collection onto a dozen shards, even while staying on the same server, because there would have been individual locks on each shard.
But since I can't change the current database structure right now, I searched for how to improve performance so that I at least spend 90% of my time writing to mongo (up from 70% currently). I figured out that since I run my script in the default mongo shell, every time I make an update there is also a getLastError() call afterwards, which I don't want, because there is a 99.99% chance of success, and even in case of failure I can still make an aggregation request after the end of the big process to retrieve the individual exceptions.
I don't think I would gain that much performance by deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behavior, but not the procedure for changing it. Any suggestions?
I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) ( http://docs.mongodb.org/manual/reference/method/db.getLastError/ ) to do what you want but it won't help.
This is because, for one:
make a script that takes the documents one after the other and makes an update to set a new field whose value is new Date(stringdate)
When using the shell in a non-interactive mode, like within a loop, it doesn't actually call getLastError(). As such, downing your write concern to 0 will do nothing.
I already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since there is the problem of data relocation when a new field is added.
I did tell people, when they asked about this stuff, to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug, but I do. That's an unfortunate side effect of being right when you were told you were wrong.
mongostat and the web monitoring interfaces, which confirmed that the write lock is taken 70% of the time
That's because of all the movement in your documents; that's kinda hard to fix.
I am a bit disappointed that MongoDB does not have more precise granularity in its write lock
The write lock doesn't actually denote the concurrency of MongoDB, this is another common misconception that stems from the transactional SQL technologies.
Write locks in MongoDB are mutexes, for one.
Not only that, but there are numerous rules which dictate that operations will yield to queued operations under certain circumstances: one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have got yourself stuck between a rock and a hard place, and there is no easy way out. This does happen.
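That said, when the same conversion is driven from a driver rather than the shell, the write concern genuinely applies. Here is a sketch with the MongoDB Java driver using unacknowledged (w:0) writes; the database name "mydb" is a placeholder, while the collection and field names follow the question:

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import java.text.SimpleDateFormat;
import java.util.Date;

// Converts stringdate into a real Date field with unacknowledged (w:0)
// writes, so there is no acknowledgement round trip per update.
public class StringDateToIsoDate {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> data = client.getDatabase("mydb")
                    .getCollection("data")
                    .withWriteConcern(WriteConcern.UNACKNOWLEDGED); // "safe mode" off
            for (Document doc : data.find()) {
                Date parsed = fmt.parse(doc.getString("stringdate"));
                data.updateOne(new Document("_id", doc.get("_id")),
                        Updates.set("date", parsed));   // fire and forget
            }
        }
    }
}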

5GB file to read

I have a design question. I have a 3-4 GB data file, ordered by time stamp. I am trying to figure out what the best way is to deal with this file.
I was thinking of reading the whole file into memory, transmitting the data to different machines, and then running my analysis on those machines.
Would it be wise to upload this into a database before running my analysis?
I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running the analysis, the database might get too slow.
Any ideas?
Update:
I want to process the records one by one. Basically, I am trying to run a model on timestamped data, but I have various models, so I want to distribute the work so that the whole process runs overnight every day. I want to make sure that I can easily increase the number of models without degrading system performance, which is why I am planning to distribute the data to all the machines running the models (each machine will run a single model).
You can also access the file on the hard disk itself, reading a small chunk at a time. Java has something called RandomAccessFile for this, and the same concept is available in other languages as well.
Whether you want to load it into a database and do the analysis there should be purely governed by the requirement. If you can read the file and keep processing it as you go, there is no need to store it in a database. But if your analysis requires data from all the different areas of the file, then a database would be a good idea.
You do not need the whole file in memory, just the data you need for analysis. You can read every line and store only the needed parts of the line, plus the offset where the line starts in the file, so you can find it later if you need more data from that line.
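A small sketch of that idea in Java: one pass builds a slim index of (key field, byte offset) pairs, and a later seek() retrieves any full line on demand. The file name and the "timestamp is the first tab-separated field" layout are assumptions:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Keep only what you need from each line, plus its byte offset,
// so the full line can be re-read later without holding the file in memory.
public class OffsetIndexedReader {
    record LineRef(String timestamp, long offset) {}

    public static void main(String[] args) throws IOException {
        List<LineRef> index = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile("big-file.dat", "r")) {
            long offset = 0;
            String line;
            while ((line = raf.readLine()) != null) {
                index.add(new LineRef(line.split("\t", 2)[0], offset)); // key field only
                offset = raf.getFilePointer();
            }
            // Later: jump straight back to any line for its full contents.
            if (index.size() > 42) {
                raf.seek(index.get(42).offset());
                System.out.println(raf.readLine());
            }
        }
    }
}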
Would it be wise to upload this into a database before running my analysis ?
yes
I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running the analysis, the database might get too slow.
Don't worry about it, it will be fine. Just introduce a marker so the rows processed by each computer are identified.
I'm not sure I fully understand all of your requirements, but if you need to persist the data (refer to it more than once), then a DB is the way to go. If you just need to process portions of these output files and trust the results, you can do it on the fly without storing any contents.
Only store the data you need, not everything in the files.
Depending on the analysis needed, this sounds like a textbook case for using MapReduce with Hadoop. It will support your requirement of adding more machines in the future. Have a look at the Hadoop wiki: http://wiki.apache.org/hadoop/
Start with the overview, get the standalone setup working on a single machine, and try doing a simple analysis on your file (e.g. start with a "grep" or something). There is some assembly required, but once you have things configured I think it could be the right path for you.
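As a first experiment along those lines, a map-only "grep" is about as small as a Hadoop job gets. A minimal sketch against the classic MapReduce API, where the "ERROR" pattern and the argument paths are placeholders:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only "grep": emits every input line containing a fixed pattern.
public class GrepJob {
    public static class GrepMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            if (value.toString().contains("ERROR")) {   // placeholder pattern
                ctx.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "grep");
        job.setJarByClass(GrepJob.class);
        job.setMapperClass(GrepMapper.class);
        job.setNumReduceTasks(0);                       // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}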
I had a similar problem recently and, as @lalit mentioned, I used the RandomAccessFile reader against my file located on the hard disk.
In my case I only needed read access to the file, so I launched a bunch of threads, each starting at a different point in the file. That got the job done and really improved my throughput, since each thread could spend a good amount of time blocked doing some processing while the other threads were reading the file.
A program like the one I mentioned should be very easy to write; just try it and see if the performance is what you need.
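For that thread-per-slice reader, a sketch might look like this (file name, thread count and the process() body are placeholders): each worker seeks to its slice, skips the partial first line, and stops once it passes its boundary.

import java.io.IOException;
import java.io.RandomAccessFile;

// Thread-per-slice scan of one big file: each worker opens its own
// RandomAccessFile, seeks to its starting offset and reads lines until
// it crosses into the next slice.
public class ParallelFileScan {
    public static void main(String[] args) throws Exception {
        final String path = "big-file.dat";
        final int threads = 4;
        final long size;
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            size = f.length();
        }
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final long start = size * i / threads;
            final long end = size * (i + 1) / threads;
            workers[i] = new Thread(() -> {
                try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                    raf.seek(start);
                    if (start > 0) raf.readLine();   // skip the partial first line
                    String line;
                    while (raf.getFilePointer() < end
                            && (line = raf.readLine()) != null) {
                        process(line);               // the model/analysis goes here
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
    }

    private static void process(String line) { /* placeholder for real work */ }
}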
