To count the rows of a CSV file, we can use the Get Files Rows Count step in ETL. How can we find the number of columns of a CSV file?
Just read the first row of the CSV file using Text-File-Input, setting header rows to 0 so that the header line is read as data. Usually, the first row contains the field names. If you read the whole row into a single field, you can use Split-Field-To-Rows to get a single field name per row, and the number of rows tells you the number of fields. There are other ways, but this one easily prepares for a subsequent metadata injection, if that's what you have in mind.
There is no need for metadata injection. In Split-Field-To-Rows, check "Include rownum in output" and give that field a name. Then apply Sort rows on that field in descending order and use Sample rows to keep the first row; its row number is the number of fields present in the file.
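Outside of PDI, the same idea (read only the first row, split it, count the pieces) can be sketched in a few lines of Python. This is only an illustration of the logic, not a PDI configuration; the function name `count_columns` and the sample data are assumptions made for the example.

```python
import csv
import io


def count_columns(lines, delimiter=","):
    """Return the number of fields in the first row of a CSV source.

    `lines` is any iterable of text lines, e.g. an open file.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, None)  # None when the source is empty
    return 0 if header is None else len(header)


# Hypothetical sample data standing in for a real file.
sample = io.StringIO("name,value,age\n10,v1,12\n11,v2,13\n")
print(count_columns(sample))  # 3
```

Using the `csv` module rather than a plain `split(",")` keeps the count right when a field name is quoted and contains the delimiter.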
I have two text files with only one column each.
I need to take the column from each of the text files and create a new text file with the two columns separated by tabs.
These columns have no relation (ID) but are in the same order as each other.
I could do that in Excel, but there are more than 200 thousand lines, which it does not accept.
How can I do it in Pentaho?
Use two Text file input steps to read both files.
After that, add an Add constants step to each stream to create the same column with some value; make sure the constant value is identical in both streams.
Use Stream lookup or Merge join to join the two streams on that constant column.
Generate the output file.
You can read both files with Text file input, add "row number" in each stream, which gives you two streams of 2 fields each. Then you can Merge join both streams on Row number, and finally a Select fields step to clean up the output so that only the two relevant fields are kept. Then Text file output to write it.
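The row-number join described above amounts to pairing line 1 with line 1, line 2 with line 2, and so on. A minimal Python sketch of that pairing, assuming both inputs are iterables of text lines; the name `paste_columns` is made up for the example, and unlike an inner Merge join this version pads the shorter input with an empty field instead of dropping the unmatched rows.

```python
import io
from itertools import zip_longest


def paste_columns(left_lines, right_lines, out, sep="\t"):
    """Write one output line per input position: left + sep + right."""
    for left, right in zip_longest(left_lines, right_lines, fillvalue=""):
        # Strip only the line terminator so inner whitespace is preserved.
        out.write(left.rstrip("\r\n") + sep + right.rstrip("\r\n") + "\n")


# Hypothetical in-memory inputs standing in for the two text files.
first = io.StringIO("alpha\nbeta\n")
second = io.StringIO("1\n2\n")
result = io.StringIO()
paste_columns(first, second, result)
print(result.getvalue(), end="")
```

Because the inputs are consumed line by line, the number of lines is not a practical limit.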
I am using the Text file input step in Pentaho Data Integration to load rows into my database. I need to create a unique ID for each row so I can identify duplicates later on in my transformation. I tried to create an ID by concatenating 3 columns into one, but some rows will always be the same due to how the file is generated. I do have "true" duplicates, so it has been hard to get them identified separately. Is there any other way of identifying each row so I can make it my primary key and avoid duplicates?
Thank you!
If your problem is non-unique rows, identify them with Memory Group By: set the grouping fields and don't specify an aggregate function. Once you have the unique rows, assign them a sequence and voilà!
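The group-then-number idea can be illustrated in plain Python. This is a sketch of the logic only, not of the PDI steps; `number_unique_rows` is a hypothetical name, and rows are assumed to be sequences of hashable values.

```python
def number_unique_rows(rows):
    """Collapse identical rows and give each distinct row a sequence number.

    Returns a list of (sequence, row) pairs in order of first appearance.
    """
    ids = {}
    for row in rows:
        key = tuple(row)
        if key not in ids:
            ids[key] = len(ids) + 1  # next sequence value, starting at 1
    return [(seq, key) for key, seq in ids.items()]


# Hypothetical rows; the first and third are true duplicates.
rows = [("a", 1), ("b", 2), ("a", 1)]
print(number_unique_rows(rows))  # [(1, ('a', 1)), (2, ('b', 2))]
```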
I want to read a CSV file, enrich each row with some data from an external system, and then write the new enriched CSV to some directory.
To get the data from the external system, I need to pass each row one by one and get the new columns back.
But to query the external system with each row, I also need to pass a value that I get from the external system by sending it all the values of a particular column.
E.g., my CSV file is:
name, value, age
10,v1,12
11,v2,13
So to enrich it, I first need to compute the total age (i.e. 12 + 13) and fetch the value for that total from the external system, and then I need to send that value with each row to the external system to get the enriched data.
I am doing this with Spring Batch, but with FlatFileItemReader I can read only one line at a time. How can I refer to the whole column before that?
Please help.
Thanks
There are two ways to do this.
OPTION 1
Go for this option if you are okay with storing all the records in memory. It depends entirely on how many records you need to read to calculate the total age.
Reader (custom reader):
Write the logic to read one line at a time.
Return null from read() only once all the lines needed to calculate the total age have been read.
NOTE: A reader will keep calling the read() method until it returns null.
Processor: You will get the full list of records. Calculate the total age.
Connect to the external system and get the value. Build the records that need to be written and return them from the process method.
NOTE: You can return all the records with a particular field modified, or merge them into a single record. It is entirely your choice.
Writer: Write the records.
OPTION 2
Go for this if Option 1 is not feasible.
Step 1: Read all the lines, calculate the total age, and pass the value to the next step.
Step 2: Read all the lines again, update the records as required, and write them out.
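The two-pass shape of Option 2 can be sketched independently of Spring Batch. The Python below only illustrates the control flow; `fetch_token` and `fetch_extra` are hypothetical callables standing in for the two calls to the external system, and the output column name `extra` is likewise an assumption.

```python
import csv
import io


def enrich_rows(csv_text, fetch_token, fetch_extra):
    """Two passes over the same CSV: aggregate first, then enrich each row."""
    # Pass 1: read every line just to compute the total of the age column.
    first_pass = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    total = sum(int(row["age"]) for row in first_pass)

    # One call to the external system with the aggregate.
    token = fetch_token(total)

    # Pass 2: read the lines again and enrich them one by one.
    enriched = []
    second_pass = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    for row in second_pass:
        row["extra"] = fetch_extra(token, row)
        enriched.append(row)
    return enriched


data = "name, value, age\n10,v1,12\n11,v2,13\n"
# Stub callables in place of the real external system.
rows = enrich_rows(data, lambda total: total, lambda tok, r: f"{r['name']}-{tok}")
print([r["extra"] for r in rows])  # ['10-25', '11-25']
```

Only the running total is held between the passes, so memory use does not grow with the file size, at the cost of reading the file twice.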
For example, an HBase table has columnFamilyA, columnFamilyB and columnFamilyC, and for some rows columnFamilyA does not have any column in it. I would like to scan the table and return only the rows that have at least one column in columnFamilyA.
What kind of filter should I use? I checked SingleColumnValueFilter, but it seems to work only with a specific column rather than a column family. I need all rows where columnFamilyA contains at least one column: not just the data in columnFamilyA, but the entire row.
If you need only the data from columnFamilyA, you can use the addFamily method on Get or Scan objects.
Or you can do a scan of a scan: first scan for the columnFamilyA columns only, then get the full rows for the row keys returned by that first scan.
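The "scan of a scan" approach might look like the sketch below. It assumes a table object exposing `scan(columns=[...])` and `rows(keys)`, which mirrors the shape of a Python HBase client such as happybase; that interface is an assumption, since the answer itself only mentions the Java Get/Scan API. The `InMemoryTable` class is a made-up stand-in so the example runs without a cluster.

```python
class InMemoryTable:
    """Tiny stand-in for an HBase table, for illustration only."""

    def __init__(self, data):
        self._data = data  # {row_key: {b"family:qualifier": value}}

    def scan(self, columns=None):
        # Like a family-restricted scan: rows with no cells in the
        # requested families are not returned at all.
        for key in sorted(self._data):
            cells = {
                col: val
                for col, val in self._data[key].items()
                if columns is None
                or any(col.startswith(fam + b":") for fam in columns)
            }
            if cells:
                yield key, cells

    def rows(self, keys):
        return [(k, self._data[k]) for k in keys if k in self._data]


def rows_with_family(table, family):
    """First scan collects the row keys, second fetch returns entire rows."""
    keys = [key for key, _ in table.scan(columns=[family])]
    return table.rows(keys)


table = InMemoryTable({
    b"r1": {b"columnFamilyA:x": b"1", b"columnFamilyB:y": b"2"},
    b"r2": {b"columnFamilyB:y": b"3"},
})
print([key for key, _ in rows_with_family(table, b"columnFamilyA")])  # [b'r1']
```

For a large table the key list from the first scan should be fetched in batches rather than collected all at once.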
I am building a transformation in Pentaho Data Integration and I have a list of files in a directory on my SFTP server. These files are named in the FILE_YYYYMMDDHHIISS.txt format, and my directory looks like this:
mydirectory
FILE_20130701090000.txt
FILE_20130701170000.txt
FILE_20130702090000.txt
FILE_20130702170000.txt
FILE_20130703090000.txt
FILE_20130703170000.txt
My problem is that I need to get the last file in this list according to its creation date, to pass it to another transformation step...
How can I do this in Pentaho Data Integration?
In fact this is quite simple because your file names can be sorted textually, and the max in the sort list will be your most recent file.
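To see why textual ordering is enough here, consider this small Python sketch (not a PDI step). It relies on the timestamp in the name being fixed-width and ordered from year down to second, as in the listing in the question; `latest_file` is a hypothetical helper name.

```python
def latest_file(names):
    """Return the textually greatest name, or None for an empty list."""
    return max(names, default=None)


names = [
    "FILE_20130701090000.txt",
    "FILE_20130703170000.txt",
    "FILE_20130702090000.txt",
]
print(latest_file(names))  # FILE_20130703170000.txt
```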
Since a list of files is likely short, you can use a Memory Group by step. A grouping step needs a separate column by which to aggregate. If you only have one column and you want to find the max in the entire set, you can add a grouping column with an Add Constants step, and configure it to add a column with, say, an integer 1 in every row.
Configure your Memory Group by to group on the column of 1s, and use the file name column as the subject. Then simply select the Maximum grouping type. This will produce a single row with your grouping column, the file name field removed and the aggregate column containing your max file name. It would look something like this: