PDI: Returning the result of a SELECT statement to the data stream - transformation

Using PDI (Kettle) I am filling the entry stage of my database using a CSV Input and a Table Output step. This works fine; however, I also want to make sure that the data that was just inserted fulfills certain criteria, e.g. fields not being NULL, etc.
Normally this would be a job for database constraints, but we want to keep the data in the database even if it's faulty (for debugging purposes; it is a pain trying to debug a .csv file...). As it is just a staging table anyway, it doesn't cause any trouble for integrity, etc.
So to do just that, I wrote some SELECT Count(*) as test123 ... statements that instantly show whether something is wrong and are easy to handle (if the value of test123 is 0 all is good, otherwise the job needs to be aborted).
I am executing these statements using an Execute SQL Statements step within a PDI transformation. I expected the result to be automatically passed to my data stream, so I also used a Copy rows to result step to pass it up to the executing job.
This is the point where the problem is most likely located.
I think the result of the SELECT statement was not automatically passed to my data stream, because when I do a Simple evaluation in the main job using the variable ${test123} (which I thought would be implicitly created by executing SELECT Count(*) as test123 ...), I never get the expected result.
I couldn't really find any clues to this problem in the PDI documentation so I hope that someone here has some experience with PDI and might be able to help. If something is still unclear, just hint at it and I will edit the post with more information.
best regards
Edit:
This is a simple model of my main job:
Start --> Load data (Transformation) --> Check data (Transformation) --> Simple Evaluation --> ...

You are mixing up a few concepts, if I read your post correctly.
You don't need an Execute SQL script step; this is a job for the Table input step.
Just type your query in the Table input and you can preview your data and see it coming from the step into the data stream by using the preview on a subsequent step. The Execute SQL script is not an input step, which means it will not add external data to your data stream.
The output fields are not Variables. A Variable is set using the Set Variables step, which takes a single input row and maps a specific field to a variable, which can be persisted at parent job or root job levels. Fields are just that: fields. They are passed from one step to the next through hops and eventually to the parent job if you have a Copy rows to result step, but they are NOT variables.

Related

Accessing report output table in Oracle APEX

I have written an SQL query to return an Interactive report in APEX, and I would like to create another (classic) report on the same page that summarizes that report. The report gives test results, and I want to summarize it by grade level (so x # of As, y Bs, etc)
It's a complicated query that takes a while to run, so I'd like to avoid just running it again and using aggregation functions.
Does APEX store that report output in a variable or table, the way it stores bind variables (something like :TABLE_NAME), that I could just run a query on? I haven't been able to find anything on that in the documentation or by searching.
Thanks!
One option is to
enter parameters you use to filter data in that complex query
instead of running the report directly, create a push button and a process on it
that process should insert the result into a global temporary table (GTT) you'd, of course, have to create
why? Because it can be shared by multiple users, while everyone sees only their own data
interactive report would fetch & display data from the GTT
classic report would summarize data from the GTT

Spring Batch - how to avoid re-loading (writing) data that was loaded in the previous run

I have a basic Spring Batch app which is trying to load data from a CSV file into MySQL. The program does load the file into the DB during the first run. However, when I accidentally re-ran the job/app, it threw a primary key violation (for the right reasons).
What is the best way to avoid reloading data that is already present on the target system? When the batch job is scheduled, if for any good reason the source file has not changed since the previous run, I want to see a "0 records processed" message rather than a primary key violation error. Hope it makes sense.
more information:
Thanks. I have probably not understood the answer, so let me explain my requirement in a better way. I have a file containing data from an external data source (say, new hire data) with a fixed name of hire.csv. The file should be updated with the delta changes for every run. As there is a possibility of a manual error of not removing all loaded rows, some new hires from the previous run could also be present in the current run. Is there a mechanism available within the ItemReader or ItemProcessor to skip records that are already present in the target DB? I can do "insert into tb where not in (select from tb)", but that runs for every row, which I don't want to use. Hope it is clear now. Thanks again.
However, when I accidentally re-ran the job/app, it threw a primary key violation (for the right reasons). What is the best way to avoid reloading data that is already present on the target system?
The file you are ingesting should be an (identifying) job parameter. This way, when the first run succeeds, the job instance is complete and cannot be run again. This is by design in Spring Batch for this very use case: preventing a job from accidentally being executed twice.
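For illustration, here is a minimal sketch of launching such a job with the file path as an identifying parameter. The parameter name input.file and the class names are made up for this example; the JobLauncher and Job are assumed to come from your existing Spring Batch configuration.

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class HireImportLauncher {

    private final JobLauncher jobLauncher; // wired from your Spring Batch configuration
    private final Job hireImportJob;       // hypothetical job bean, named for illustration only

    public HireImportLauncher(JobLauncher jobLauncher, Job hireImportJob) {
        this.jobLauncher = jobLauncher;
        this.hireImportJob = hireImportJob;
    }

    public void launch(String inputFile) throws Exception {
        // The file path is an identifying parameter, so the same file yields the
        // same job instance; once that instance completes, it cannot run again.
        JobParameters params = new JobParametersBuilder()
                .addString("input.file", inputFile) // identifying by default
                .toJobParameters();
        jobLauncher.run(hireImportJob, params);
    }
}
```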
Edit: Add further options based on comments
If deleting the file is an option, then you can use a job listener or a final step to delete the file after ingesting it (a sketch of such a listener follows below). With this option you need to add a second identifying parameter (since the file name is always hire.csv) to make sure you have a different job instance for each run. This option does not require having a different file name for each run.
If the file can be renamed to hire-${timestamp}.csv and will be unique, then deleting the file after ingesting it and using a single job parameter with the filename is enough.
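A sketch of such a listener might look like the following; the parameter name input.file and the class name are illustrative, not anything Spring Batch provides out of the box.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

// Hypothetical listener that removes hire.csv once the job has completed
// successfully, so the next delta drop starts from a clean slate.
public class DeleteInputFileListener implements JobExecutionListener {

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // nothing to do before the job starts
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        if (jobExecution.getStatus() == BatchStatus.COMPLETED) {
            // "input.file" is the assumed identifying job parameter holding the file path
            String inputFile = jobExecution.getJobParameters().getString("input.file");
            try {
                Files.deleteIfExists(Paths.get(inputFile));
            } catch (IOException e) {
                throw new IllegalStateException("Could not delete " + inputFile, e);
            }
        }
    }
}
```

To still get a distinct job instance per run while the file name stays hire.csv, the launcher shown earlier could additionally add something like .addLong("run.timestamp", System.currentTimeMillis()) as the second identifying parameter.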
Side note: I have seen people use a business key to identify records in the input file and an item processor that queries the database and filters out items that have already been ingested. This works for small datasets but performs poorly with large datasets due to the additional query for each item.
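A rough sketch of that approach (not a recommendation for large files) could look like this; the table name hire, the column employee_id, and the map-based item type are placeholders for whatever your reader actually produces.

```java
import java.util.Map;

import javax.sql.DataSource;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

// Illustrative only: drops items whose business key already exists in the
// target table. Returning null tells Spring Batch to filter the item out,
// so the writer never sees it and no duplicate insert is attempted.
public class ExistingRecordFilter implements ItemProcessor<Map<String, Object>, Map<String, Object>> {

    private final JdbcTemplate jdbcTemplate;

    public ExistingRecordFilter(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    @Override
    public Map<String, Object> process(Map<String, Object> item) {
        // "hire" and "employee_id" are assumed table/column names for illustration.
        Integer count = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM hire WHERE employee_id = ?",
                Integer.class, item.get("employee_id"));
        return (count != null && count > 0) ? null : item;
    }
}
```

Note that this costs one extra query per input row, which is exactly why it only pays off for small delta files.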

Correct SSIS executing tasks in random order

This is a repost from 8 years ago, since the solutions provided there didn't work for me; maybe now there are more alternatives for me and the other people who had that problem and couldn't solve it either.
I have six Data Flow Tasks as shown in the following screenshot:
They execute in a different order every time I run them, and the first one even executes twice. I've recreated the tasks, hoping it was SSIS executing them in creation order.
They run in a random order each time I execute the package despite the Precedence Constraints, so I decided to recreate the WHOLE package. That failed as well.
It simply feels like Microsoft is messing with me, since I can't find another explanation.
Any help provided will be a relief for me if my post is not voted as Redundant.
/Edited in order to add info/
My real problem is SSIS not inserting data in a defined order; it just executes the inserts as it pleases, and I need the data to be natively stored in a specified order. I've done it before, I just don't get why this time is different. I could run an ORDER BY to get the data the way I want, except I'm not the one who will be accessing the data, and I can only hope that whoever extracts and prints the data notices that.
The biggest issue, however, is SSIS executing a random task twice, as I can't have duplicate data for any reason; it will later be used for summarizing as well. (I suspect this is connected to the random execution order, since the person who posted the original question had exactly the same issue as me.)
The real way to notice these issues is not by looking at the SSIS processes, but by looking at the data stored in the DB. Sorry if I was unclear about my problem.
The SSIS log doesn't show you the tasks in the order that they run in. In your screenshot above it looks like it put them in alphabetical order, in fact.
Just because Abril is above Enero in your execution log doesn't mean that Abril ran first and Enero ran second.
Addendum based on comments below:
You are under the misconception that if you INSERT data into a database in a certain order, then when you SELECT that data without specifying an ORDER BY, you will get the data back in the order it was inserted. This turns out not to be the case. The ONLY guaranteed way to get data from a database in a certain order is to use an ORDER BY clause when you SELECT it.
Let me be perfectly clear about this. When you say "I get my data from March being listed first than my data from January, meaning it was inserted first", you are wrong.
As for why your January data seems to be getting inserted twice, we would need to see the details of all the working parts: the original source data, the destination data before the insert, the destination data after the insert, and the SSIS package that does the insert. Without enough information to reproduce the issue ourselves, there is no way we can help you understand why it is happening in your package.

Kettle transformations lock database tables?

I have a main transformation calling a couple of sub transformations using Single threader steps in Pentaho Kettle. The sub transformations write to an Oracle database.
The main transformation looks like this:
Data Frame --> Single threader 1 --> Single threader 2 --> Single threader 3
The sub transformations called by the single threaders look like this:
In --> A bunch of modifications and checks --> Write to DB
|
v
Out
The reason that the Out step is directly connected to the In step is that I want to return the row unmodified, but I need to modify it to get the right data to insert into the DB. Each sub transformation updates one or two tables.
All the steps in all the transformations use the same database connection, that is a shared resource between the files. The "Make transformation database transactional" checkbox is checked in all transformations.
My problem is this: when I run the main transformation, a number of the tables are not updated even though there are no error messages. Those tables also remain locked after the transformation has finished, and stay locked until I quit Spoon. Which tables are affected might change from time to time - sometimes there are many, sometimes just one, sometimes none. I have not been able to detect any pattern.
Am I doing something wrong? What can I do to prevent this from happening?

How can I limit memory usage when generating a CSV from a large resultset?

I have a web application in Spring that has a functional requirement for generating a CSV/Excel spreadsheet from a result set coming from a large Oracle database. The expected rows are in the 300,000 - 1,000,000 range. Time to process is not as large of an issue as keeping the application stable -- and right now, very large result sets cause it to run out of memory and crash.
In a normal situation like this, I would use pagination and have the UI display a limited number of results at a time. However, in this case I need to be able to produce the entire set in a single file, no matter how big it might be, for offline use.
I have isolated the issue to the ParameterizedRowMapper being used to convert the result set into objects, which is where I'm stuck.
What techniques might I be able to use to get this operation under control? Is pagination still an option?
A simple answer:
Use a JDBC recordset (or something similar, with an appropriate array/fetch size) and write the data to a LOB, either a temporary one or back into the database.
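A minimal sketch of that row-at-a-time approach with plain JDBC might look like this; the query, the fetch size, and the destination Writer (which could wrap a file or a temporary CLOB) are placeholders.

```java
import java.io.Writer;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public final class CsvResultSetStreamer {

    private CsvResultSetStreamer() {
    }

    // Streams the result set row by row instead of materializing it as a list
    // of mapped objects. The query and fetch size are placeholders; the Writer
    // could wrap a file, a temporary CLOB, or any other destination.
    public static void writeCsv(Connection connection, Writer out) throws Exception {
        try (PreparedStatement statement = connection.prepareStatement(
                "SELECT id, name, amount FROM big_table")) {
            statement.setFetchSize(500); // ask the driver for 500 rows per round trip
            try (ResultSet rs = statement.executeQuery()) {
                while (rs.next()) {
                    out.write(rs.getString("id") + "," + rs.getString("name")
                            + "," + rs.getString("amount") + "\n");
                }
            }
            out.flush();
        }
    }
}
```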
Another choice:
Use PL/SQL in the database to write a file in CSV format for your recordset using UTL_FILE. As the file will be on the database server, not on the client, use UTL_SMTP or JavaMail via Java stored procedures to mail the file. After all, I'd be surprised if someone was going to watch the hourglass turn over repeatedly while waiting for a million-row recordset to be generated.
Instead of loading the entire file into memory, you can process each row individually and use an output stream to send the output directly to the web browser. E.g. with the Servlet API, you can get the output stream from ServletResponse.getOutputStream() and then simply write the resulting CSV lines to that stream.
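As an illustration of that idea in a Spring context (assuming Spring's JdbcTemplate, since the question already uses Spring JDBC), the handler below writes each row to the servlet output stream as it is read; the query, column names, fetch size, and class name are made up for the example.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

// Illustrative sketch: each row goes straight to the response, so only one
// row at a time is held in memory instead of the whole mapped result set.
public class CsvExportHandler {

    private final JdbcTemplate jdbcTemplate;

    public CsvExportHandler(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
        this.jdbcTemplate.setFetchSize(500); // keep the driver from buffering everything
    }

    public void export(HttpServletResponse response) throws IOException {
        response.setContentType("text/csv");
        response.setHeader("Content-Disposition", "attachment; filename=\"export.csv\"");
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(response.getOutputStream(), StandardCharsets.UTF_8));
        // RowCallbackHandler is invoked once per row; nothing is accumulated.
        jdbcTemplate.query("SELECT id, name, amount FROM big_table",
                (RowCallbackHandler) rs -> out.println(
                        rs.getString("id") + "," + rs.getString("name")
                                + "," + rs.getString("amount")));
        out.flush();
    }
}
```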
I would push back on those requirements; they sound pretty artificial.
What happens if your application fails, or the power goes out before the user looks at that data?
From your comment above, it sounds like you know the answer: you need filesystem or Oracle access in order to do your job.
You are being asked to generate some data - something that is not repeatable by SQL?
If it were repeatable, you would just send pages of data back to the user at a time.
Since this report, I'm guessing, has something to do with the current state of your data, you need to store that result somewhere if you can't stream it out to the user. I'd write a stored procedure in Oracle; it's much faster not to send data back and forth across the network. If you have special tools, or it's just easier, it sounds like there's nothing wrong with doing it on the Java side instead.
Can you schedule this report to run once a week?
Have you considered the performance of an Excel spreadsheet with 1,000,000 rows?
