How to read multiple successive tables from ONE file in R? - multiple-tables

What should I do to read multiple successive tables from ONE file (e.g. in CSV format), each table separated by a blank line and with its headers on top, like:
X1 X2 X3
1 2 3
4 5 6
(blank line)
Y1 Y2 Y3
2 4 5
3 7 9

This is fairly trivial. Apart from the multithreading tag, which currently has no connection to the body of the question, this can be achieved with a correct design of the process.
Try drawing the scheme on a sheet of paper and/or using pseudocode. (You didn't even mention the implementation language.) Use a status variable to track which table you are reading. At the beginning, set it to the first table. When you find that a line is blank, change the status variable to the next table.
On each data row you read from the file, check the status variable and apply the proper decoding.
Finally, return all the tables, perform DB actions, and so on...
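The steps above can be sketched in code. This is a minimal illustration in Python (the question's language is R, so treat this as pseudocode for the state-driven design; the function and variable names are my own):

```python
def read_tables(path):
    """Split a file into successive tables separated by blank lines.

    Each table starts with a header row; columns are whitespace-separated.
    Returns a list of (header, rows) pairs.
    """
    tables = []
    header, rows = None, []          # status: which table/phase we are in
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:             # blank line: move on to the next table
                if header is not None:
                    tables.append((header, rows))
                header, rows = None, []
            elif header is None:     # first non-blank line of a table is its header
                header = line.split()
            else:                    # data row: decode according to the current table
                rows.append([float(x) for x in line.split()])
    if header is not None:           # flush the last table
        tables.append((header, rows))
    return tables
```

In R the same idea applies: split the lines on blanks and call `read.table` on each chunk.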

Related

Google Sheet Query: Select misses data when there are different data type in a column?

I have a table like this:
a  b  c
1  2  abc
2  3  4.00
note c2 is text while c3 is a number.
When I do
=QUERY(A1:C,"select *")
The result is like
a  b  c
1  2
2  3  4.00
The text "abc" in C2 is missing. You can see the live sheet here:
https://docs.google.com/spreadsheets/d/1UOiP1JILUwgyYUsmy5RzQrpGj7opvPEXE46B3xfvHoQ/edit?usp=sharing
How to deal with this issue?
QUERY is very useful, but it has one main limitation: it can only handle one kind of data per column. The other data is left blank. There are usually ways to try to overcome this from inside the QUERY, but I've found them unfruitful. What you can do instead is just use:
={A:C}
You can then work with filters on your own, but as a step-by-step way to adapt the main features of QUERY: if you need to add conditions, use LAMBDA, INDEX and FILTER.
For example, to check where A is not null:
=LAMBDA(quer,FILTER(quer,INDEX(quer,,1)<>""))({A:C}) --> with INDEX(quer,,1), I've accessed the first column
Where B is greater than one cell and less than another:
=LAMBDA(quer,FILTER(quer,INDEX(quer,,2)>D1,INDEX(quer,,2)<D2))({A:C})
For sorting and limiting the number of items, use SORTN. For example, if you want to sort by the 3rd column and limit to the 5 highest values in that column:
=LAMBDA(quer,SORTN(FILTER(quer,INDEX(quer,,1)<>""),5,1,3,0))({A:C})
Or, to limit to 5 elements without sorting use ARRAY_CONSTRAIN:
=ARRAY_CONSTRAIN(LAMBDA(quer,FILTER(quer,INDEX(quer,,1)<>""))({A:C}),5)
There are other options: you can use REGEXMATCH and similar functions to emulate QUERY's features without missing data. Let me know!
shenkwen,
If you are comfortable with adding a Google Apps Script to your sheet to give you a custom function, I have a QUERY replacement function that supports all standard SQL SELECT syntax. It doesn't analyze the column data to try to force it to one type based on which is the most common data in the column - so this is not an issue.
The custom function code is one file and is at:
https://github.com/demmings/gsSQL/tree/main/dist
After you save it, you have a new function available in your sheet. In your example, the syntax would be
=gsSQL("select a,b,c from testTable", {{"testTable", "F150:H152", 60, true}})
If your data is on a separate tab called 'testTable' (or whatever you want), the second parameter is not required.
I have typed your example data into my test sheet (see line 150):
https://docs.google.com/spreadsheets/d/1Zmyk7a7u0xvICrxen-c0CdpssrLTkHwYx6XL00Tb1ws/edit?usp=sharing

Loading a fixed width format file into R with different row / variable lengths

I am attempting to load the following data into R - using read.fwf:
sample of data
20100116600000100911600000014000006733839
20100116600000100912100000019600005648935
20100116600000100929100000000210000080787
20100116600000100980400000000090000000000
3010011660000010070031144300661101000
401001166000001000000001001
1010011660000020016041116664001338615001338115150001000000000000000000000000010001000100000000000000000000162002117592200001051
20100116600000200036300000001000005692222
However, the first number in each row indicates which variables are coded in that line, and the rows have different lengths, so the 'widths' vector for lines starting with 1 is different from the 'widths' vector for lines starting with 2, and so on.
Is there a way I can do this without having to read the data in four times? (Note that each case has differing numbers of rows too)
Thank you, Maria
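One way to avoid reading the data four times is to read the file line by line and slice each line according to its leading record-type digit. A minimal sketch, in Python for illustration (the width lists below are invented placeholders, not the real record layouts; in R the same dispatch could be done with readLines and substring):

```python
# Hypothetical widths per record type -- replace with the real layouts.
WIDTHS = {
    "1": [1, 9, 4, 6, 10],
    "2": [1, 9, 5, 10, 10, 6],
    "3": [1, 9, 7, 10, 10],
    "4": [1, 9, 8, 9],
}

def parse_line(line):
    """Slice one fixed-width line using the widths for its record type,
    which is given by the first character of the line."""
    widths = WIDTHS[line[0]]
    fields, pos = [], 0
    for w in widths:
        fields.append(line[pos:pos + w])
        pos += w
    return fields
```

Each parsed row can then be appended to a per-type list, giving one data frame per record type in a single pass.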

Talend loop for each record

Hi, I am designing a data generation job.
My job is something like this:
tRowGenerate --> tMap --> tFileOutputDelimited.
Let's say my tRowGenerate produces 5 columns with 2 records. I want to iterate over these records, i.e. for each record I want to iterate a certain number of times:
for record 1, iterate 5 times to produce further data.
for record 2, iterate 3 times to produce further data.
Please suggest how to apply this "multiply by xi" logic, where xi can change for each record.
Thanks!
If you want to loop over the data generated by the tRowGenerator, you can use a tLoop where you put the call to your business rule to determine the number of loops or when to stop looping.
An example job might look like:
Logic of the flow:
row1 is a main connection taking the generated values to the tFlowToIterate, which stores them in global variables;
the iterate link activates the tLoop, which can use the values stored in the global vars to activate your business rule (to get the number of loops or to ask whether to continue or stop);
the tLoop activates the tJavaFlex, which uses the stored global vars to produce the output you like and passes it to the tFileOutputDelimited with a main link (row2).
You have to activate the append flag on the tFileOutputDelimited to keep the data from the different loops. If you need to, you can add a tFileDelete at the beginning to empty the output file before a new processing round.
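The per-record loop-count logic itself is simple. Here it is sketched in plain Python, purely for illustration (in the Talend job this logic would live in the tLoop condition and the tJavaFlex body; the function names are my own):

```python
def expand(records, times):
    """Repeat each record times(record) times, like a tLoop driven
    by a per-record business rule."""
    out = []
    for rec in records:
        for _ in range(times(rec)):   # the business rule decides the loop count
            out.append(rec)
    return out
```

For the question's example, `expand(["record1", "record2"], {"record1": 5, "record2": 3}.get)` yields 8 output rows.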

Create Rows depending on count in Informatica

I am new to the Informatica PowerCenter tool and working on an assignment.
I have input data in a flat file.
data.csv contains
A,2
B,3
C,2
D,1
And the required output, output.csv, should be:
A
A
B
B
B
C
C
D
That means I need to create output rows depending on the value in the second column. I tried it using a Java transformation and got the result.
Is there any other way to do it?
Please help.
Java transformation is a very good approach, but if you insist on an alternative implementation, you can use a helper table and a Joiner transformation.
Create a helper table and populate it with an appropriate number of rows (you need to know the maximum value that may appear in the input file).
There is one row with COUNTER=1, two rows with COUNTER=2, three rows with COUNTER=3, etc.
Use a Joiner transformation to join data from the input file and the helper table - since the latter contains multiple rows for a single COUNTER value, the input rows will be multiplied.
COUNTER
-------------
1
2
2
3
3
3
4
4
4
4
(...)
Depending on your RDBMS, you may be able to produce the contents of the helper table using a SQL query in a source qualifier.
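The joiner trick can be sketched outside Informatica as well. A minimal Python illustration of how joining on a non-unique COUNTER column multiplies the input rows (the data values come from the question):

```python
# Input rows from data.csv: (value, counter)
data = [("A", 2), ("B", 3), ("C", 2), ("D", 1)]

# Helper table: n rows with COUNTER=n, up to the known maximum count.
max_count = max(c for _, c in data)
helper = [c for c in range(1, max_count + 1) for _ in range(c)]
# helper == [1, 2, 2, 3, 3, 3]

# Join on COUNTER: each input row matches COUNTER rows of the helper
# table, so it appears COUNTER times in the output.
output = [v for v, c in data for h in helper if h == c]
```

This reproduces the required output.csv contents: A, A, B, B, B, C, C, D.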

Loading the target tables

Suppose I have a source with seven records. The first three must go into 3 target instances, and the 4th record again has to go into the first target. How can I achieve this?
Here is one way to achieve this result.
I use a Sequence transformation to generate a series of numbers (starting with 1, incrementing by 1).
I then route the table rows into one of the three targets based on this sequence number, using MOD(NEXTVAL,3), which will result in 0, 1 or 2. Here are the three groups for the Router:
Group 1 : MOD(NEXTVAL,3)=0
Group 2 : MOD(NEXTVAL,3)=1
Group 3 : MOD(NEXTVAL,3)=2
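The routing logic can be checked with a quick sketch (Python for illustration; names are my own):

```python
def route(records):
    """Assign each record to one of three targets using MOD(NEXTVAL, 3),
    like the Router groups above."""
    targets = {0: [], 1: [], 2: []}
    for nextval, rec in enumerate(records, start=1):  # sequence starts at 1
        targets[nextval % 3].append(rec)
    return targets
```

With seven records, records 1, 4 and 7 all land in the same group (MOD = 1), so the 4th record goes back to the target that received the 1st, exactly as the question requires.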
Also, could you please explain why you need the table to be loaded into multiple instances?
I have never really come across such scenarios before.
