Sequential file reads incorrect file data in IBM DataStage 8.1

The scenario I am facing is as follows:
A sequential file named A.txt is generated with, for example, 100 rows. The same file is required as input in another job, but when that job reads it, it counts 140 rows even though physically only 100 rows are there. I have analysed this a lot in DataStage, verifying the column properties, the delimiters, and a project comparison, but I still cannot find the cause.
Any help with this problem would be highly appreciated.
Thanks

Please check the file name: confirm that the file written to and the file read from are really the same. Also, was the record count checked in Unix/Linux? At times the monitor in the Director might not show the current count.
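For reference, a quick way to verify the physical row count outside DataStage (the path below is a placeholder, and this assumes each row ends with a newline):
wc -l /path/to/A.txt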
Thanks

Please check the name of the file in the output stage and in the input stage.

Related

How to include column names from an Oracle source in a flat file (destination) in DataStage?

I'm a total beginner working my way toward becoming a good ETL developer, and I use IBM InfoSphere DataStage. I'm able to transfer/import data from databases (Oracle) to a sequential file (CSV), but I also want to include the column names.
Is there a way to do it? I don't have anyone teaching me; I'm learning on my own.
So any idea would be very helpful.
Thanks!
In the Sequential File stage, simply set the option "First Line is Column Names" to True, then compile and run the job; you will see the column names in the CSV file.

NiFi: how to get maximum timestamp from first column?

NiFi version 1.5
I have a CSV file that arrives the first time like:
datetime,a.DLG,b.DLG,c.DLG
2019/02/04 00:00,86667,98.5,0
2019/02/04 01:00,86567,96.5,0
I used ListFile -> FetchFile to get the CSV file.
Ten minutes later, I get the appended CSV file:
datetime,a.DLG,b.DLG,c.DLG
2019/02/04 00:00,86667,98.5,0
2019/02/04 01:00,86567,96.5,0
2019/02/04 02:00,86787,99.5,0
2019/02/04 03:00,86117,91.5,0
Here, how do I get only the new records (the last two)? I do not want to process the first two records, which have already been processed.
My thought process is that we need to get the maximum datetime, store it in an attribute, and use QueryRecord, but I do not know which processor to use to get the maximum datetime.
Is there any better solution?
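A minimal sketch of the QueryRecord idea described above, purely as an illustration: the attribute name max.datetime is hypothetical, and it assumes the datetime strings compare correctly in the yyyy/MM/dd HH:mm format.
SELECT MAX("datetime") AS max_datetime FROM FLOWFILE
SELECT * FROM FLOWFILE WHERE "datetime" > '${max.datetime}'
The first query computes the current maximum from the incoming records; the second keeps only the rows newer than a previously stored attribute value.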
This is currently an open issue (NIFI-6047) but there has been a community contribution to address it, so you may see the DetectDuplicateRecord processor in an upcoming release of NiFi.
There may be a workaround: split up the CSV rows, create a compound key using ExtractText, and then use DetectDuplicate.
This doesn't seem like a task that is best solved in NiFi, as you need to keep state about what you have already processed. An alternative would be to delete what you have already processed; then you can assume that whatever is in the file has not been processed yet.
Here, how do I get only the new records (the last two)? I do not want to process the first two records, which have already been processed.
From my understanding, the actual question is 'how to process/ingest CSV rows as they are written to the file?'.
Description of the 'TailFile' processor from the NiFi documentation:
"Tails" a file, or a list of files, ingesting data from the file as it
is written to the file. The file is expected to be textual. Data is
ingested only when a new line is encountered (carriage return or
new-line character or combination)
This solution is appropriate when you don't want to move or delete the actual file.

How does CONCATENATE in the ALTER TABLE command in Hive work?

I am trying to understand how exactly ALTER TABLE CONCATENATE in Hive works.
I saw this link, How does Hive 'alter table <table name> concatenate' work?, but all I got from it is that for ORC files the merge happens at the stripe level.
I am looking for a detailed explanation of how CONCATENATE works. For example, I initially had 500 small ORC files in HDFS. I ran Hive ALTER TABLE CONCATENATE and the files merged into 27 bigger files. Subsequent runs of CONCATENATE reduced the number of files to 16, and I finally ended up with two large files (using Hive 0.12). So I wanted to understand:
How exactly does CONCATENATE work? Does it look at the existing number of files as well as their size? How does it determine the number of output ORC files after concatenation?
Are there any known issues with using CONCATENATE? We are planning to run it once a day in the maintenance window.
Is using CTAS an alternative to CONCATENATE, and which is better? Note that my requirement is to reduce the number of ORC files (ingested through NiFi) without compromising read performance.
Any help is appreciated, and thanks in advance.
Concatenated file size can be controlled with the following two values:
set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;
These values should be set based on your HDFS/MapR-FS block size.
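With those settings in place, the concatenation itself is invoked as usual; for example (table and partition names below are placeholders):
ALTER TABLE my_orc_table PARTITION (dt='2019-02-04') CONCATENATE;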
As commented by @leftjoin, it is indeed the case that you can get different output files for the same underlying data.
This is discussed more in the linked HCC thread but the key point is:
Concatenation depends on which files are chosen first.
Note that having files of different sizes should not be a problem in normal situations.
If you want to streamline your process, then depending on how big your data is, you may also want to batch it a bit before writing to HDFS. For instance, by setting the batch size in NiFi.
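As for the CTAS alternative raised in the question, a minimal sketch might look like the statement below; the table names are placeholders, and note that a plain CTAS cannot create a partitioned table, so partitioning and table properties would have to be recreated separately.
CREATE TABLE my_orc_table_compacted STORED AS ORC AS SELECT * FROM my_orc_table;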

Combining different data flows and creating a .txt file by sorting the output

I have a requirement: I am trying to combine several data flows with Talend in order to create a .txt file. In my case the input flows are DB tables. I am able to create the output file "prova.txt", but in this file some fields of the 2nd and 3rd tables are missing and I don't know why. I checked with tLogRow and the problem seems to be in tHashInput_1. In the three tHashOutput components the rows are logged correctly with all fields.
Below, my job:
Components tHashOutput_2, tHashOutput_3, tHashInput_1 are linked to tHashOutput_1.
Am I doing something wrong? Could anyone help me?
Thank you in advance!
Assuming the schema is the same for all tHashOutput components, I attached an image for your problem.
Here, give all tFileOutputDelimited components the same file name and the same schema, and use the append option. It will append the data from all three tables into the same file.
An alternative is to use tUnite, again assuming the schema is the same for all tHashOutput components.
Example: Using tUnite
Regards!

Get the filename of a record in Hive

Is it possible to get the filename of a record in Hive? That would be incredibly helpful for debugging.
In my particular case, I have incorrect values in a table that is mapped to a folder with more than 100 large files. Using grep is very inefficient.
Hive supports virtual columns, for example INPUT__FILE__NAME, which gives the input file's name for a mapper task.
Have a look at the documentation here; it provides some examples of how to do this.
Unfortunately, I'm unable to test this right now. Let me know whether it works.
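A quick illustration of how the virtual column can be used to locate the files that hold the bad rows (table, column, and value below are placeholders):
SELECT INPUT__FILE__NAME, some_column FROM my_table WHERE some_column = 'bad_value';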
