In Pentaho Data Integration can I output conditionally? - pentaho-data-integration

I need to output a different CSV file every 100 rows. For example, if there are 305 rows in a stream, I'd need to output a CSV for rows one through 100, 101 to 200, 201 to 300, and 301 to 305.
I have a column with the last row number, and I built a page-number variable that increments every 100 rows. I then tried searching online, since I can't yet conceptualize a solution.
var numberOfInvoicePages = Math.ceil(Number(lastRow) / 300);
if (rowNumber % 300 == 0) {
    pageNumber += 1;
}
I expect to get a CSV titled ${baseTitle} ${pageNumber} for each page; as for the actual results, I don't yet know how to build this.

In the Text File output step you can adjust after how many rows the output will split to another file, under the option 'Split every ... rows'.
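If you do want to compute the page number yourself (for instance to build a ${baseTitle} ${pageNumber} file name), here is a minimal sketch of the arithmetic, assuming a 1-based row number and the 100-rows-per-page split described in the question:

```javascript
// Hypothetical helper (not a PDI built-in): map a 1-based row number
// to its page number, with pageSize rows per page.
function pageNumber(rowNumber, pageSize) {
  return Math.ceil(rowNumber / pageSize);
}

// Rows 1-100 land on page 1, 101-200 on page 2, and 301-305 on page 4.
```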


Google Sheets add a Permanent timestamp

I am setting up a sheet where a person will be able to check a checkbox, in different times, depending on the progress of a task. So, there are 5 checkboxes per row, for a number of different tasks.
Now, the idea is that, when you check one of those checkboxes, a message builds up in the few next cells coming after. So, the message is built in 3 cells. The first cell is just text, the second one is the date, and the third one is time. Also, those cells have 5 paragraphs each (one per checkbox).
The problem comes when I try to make that timestamp stay as it was when it was entered. As it is right now, the time changes every time I update any part of the Google Sheet.
I set up my formulas as follows:
For the text message:
=IF($C4=TRUE,"Insert text 1 here","")&CHAR(10)&IF($E4=TRUE, "Insert text 2 here","")&CHAR(10)&IF($G4=TRUE, "Insert text 3 here","")&CHAR(10)&IF($I4=TRUE, "Insert text 4 here","")&CHAR(10)&IF($K4=TRUE, "Insert text 5 here","")
For the date:
=IF($C4=TRUE,(TEXT(NOW(),"mmm dd yyyy")),"")&CHAR(10)&IF($E4=TRUE,(TEXT(NOW(),"mmm dd yyyy")),"")&CHAR(10)&IF($G4=TRUE,(TEXT(NOW(),"mmm dd yyyy")),"")&CHAR(10)&IF($I4=TRUE,(TEXT(NOW(),"mmm dd yyyy")),"")&CHAR(10)&IF($K4=TRUE,(TEXT(NOW(),"mmm dd yyyy")),"")
And for the time:
=IF($C4=TRUE,(TEXT(NOW(),"HH:mm")),"")&CHAR(10)&IF($E4=TRUE,(TEXT(NOW(),"HH:mm")),"")&CHAR(10)&IF($G4=TRUE,(TEXT(NOW(),"HH:mm")),"")&CHAR(10)&IF($I4=TRUE,(TEXT(NOW(),"HH:mm")),"")&CHAR(10)&IF($K4=TRUE,(TEXT(NOW(),"HH:mm")),"")
I would appreciate it greatly if anyone could help me get this to work so that the date and time are inserted after checking those boxes and don't change again.
I noticed your struggle with the continuously changing date and time. I had the same struggle, and I found a solution that works nicely for my case, but it needs a little more "dirty work" with Apps Script.
Some background for my case:
- I have multiple sheets in the spreadsheet that should generate the timestamp
- I want to skip my first sheet, without generating a timestamp in it
- I want every edit to generate a timestamp, even values pasted in from Excel
- I want the timestamps to be individual: each row has its own timestamp, precise to the second
- I don't want a total refresh of the entire sheet's timestamps when I am editing any other row
- I have a column with a MUST FILL value that decides whether the timestamp needs to be generated for that particular row
- I want the timestamp on one dedicated column only
function timestamp() {
  const ss = SpreadsheetApp.getActiveSpreadsheet();
  const totalSheet = ss.getSheets();
  // Start at index 1 to skip the first sheet.
  for (let a = 1; a < totalSheet.length; a++) {
    let sheet = ss.getSheets()[a];
    let range = sheet.getDataRange();
    let values = range.getValues();

    // Count contiguous rows that have a value in the first column.
    function autoCount() {
      let rowCount;
      for (let i = 0; i < values.length; i++) {
        rowCount = i;
        if (values[i][0] === '') {
          break;
        }
      }
      return rowCount;
    }

    let rowNum = autoCount();
    for (let j = 1; j < rowNum + 1; j++) {
      // Column 7 holds the timestamp; only fill it if it is still empty.
      if (sheet.getRange(j + 1, 7).getValue() === '') {
        sheet.getRange(j + 1, 7).setValue(new Date()).setNumberFormat("yyyy-MM-dd hh:mm:ss");
      }
    }
  }
}
Explanation
- First, I made a const totalSheet with getSheets() and ran it in a for loop, to find the total number of sheets in the spreadsheet. Note that I wrote let a=1; JavaScript indices start at 0, so starting at 1 skips the first sheet and runs from the second sheet onwards.
- Then you will notice let sheet = ss.getSheets()[a] inside the loop. Note that you should not use const when the value of a variable keeps changing; let works fine instead.
- Then you will see the function autoCount(). It runs a for loop to count the number of rows that have values in them. The check if (values[i][0] === '') walks the sheet's values looking at row i and column 0. Here 0 means the first column of the sheet and i is the row; it works like indexing a 2-D array, much like a pandas DataFrame.
- Running autoCount() gives the number of edited rows; I store the result in the rowNum variable.
- Then pass that rowNum into a new for loop, and use if (sheet.getRange(j+1,7).getValue() === '') to find the rows that do not yet have a timestamp. Note that the 7 here means the 7th column of the sheet, which is where I want the timestamp.
- Inside that for loop, setValue writes the date in the specified format ("yyyy-MM-dd hh:mm:ss"). You are free to change it to any style you like.
- Oh, and remember to deploy a trigger with event type On Change. That is not limited to edits, but covers all kinds of changes, including paste.
Lastly, please take note of my background requirements above before deciding whether this solution works for your case. Cheers, and happy coding!

how to add data to one table from a large number of tables

I have a problem with my project. I have more than 45 spreadsheets with 6 sheets each. My script must find matching rows between those tables and another table, then insert each matching row into the table of matches. I finished it, but there is one problem: the standard quota is 100 requests per 100 seconds. I tried to stay under it with time.sleep(1) after requests, but with more than 45 tables it takes a long time to find all the matching rows.
if i == x:
    dopler_cell_list = dopler.range(f'A{str(len(lol2) + length)}:AI{str(len(lol2) + length)}')
    time.sleep(1)
    for cell in dopler_cell_list:
        cell.value = output_cell_list[count].value
        time.sleep(1)
        count += 1
How can I make it faster?
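One way to cut the number of requests (a sketch under the question's assumptions: `dopler` is a gspread worksheet and `output_cell_list` holds the source cells) is to set the cell values locally and push them all in a single `update_cells` call, instead of one request per cell:

```python
# Sketch: assigning cell.value is purely local; only range() and
# update_cells() talk to the API, so each matched row costs two
# requests instead of one request per cell.
def copy_values(src_cells, dst_cells):
    """Copy values between cell lists in memory (no API calls)."""
    for dst, src in zip(dst_cells, src_cells):
        dst.value = src.value
    return dst_cells

# With gspread (names from the question, untested against a live sheet):
#   dopler_cell_list = dopler.range(f'A{row}:AI{row}')  # 1 API request
#   copy_values(output_cell_list, dopler_cell_list)     # 0 API requests
#   dopler.update_cells(dopler_cell_list)               # 1 API request
```

With the per-cell `time.sleep(1)` gone and only two requests per row, the quota is much easier to stay under.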

Using PageIndex, why parquet does not skip unnecessary pages?

Using parquet-mr#1.11.0, I have a schema such as:
message page {
  required binary url (STRING);
  optional binary content (STRING);
}
I'm doing a single-row lookup by url to retrieve the associated content.
Rows are ordered by url.
The file was created with:
parquet.block.size: 256 MB
parquet.page.size: 10 MB
Using parquet-tools, I was able to verify that I do indeed have the column index and/or offsets for my columns:
column index for column url:
Boudary order: ASCENDING
null count min max
page-0 0 http://materiais.(...)delos-de-curriculo https://api.quero(...)954874/toogle_like
page-1 0 https://api.quero(...)880/toogle_dislike https://api.quero(...)ior-online/encceja
page-2 0 https://api.quero(...)erior-online/todos https://api.quero(...)nte-em-saude/todos
offset index for column url:
offset compressed size first row index
page-0 4 224274 0
page-1 224278 100168 20000
page-2 324446 67778 40000
column index for column content:
NONE
offset index for column content:
offset compressed size first row index
page-0 392224 504412 0
page-1 896636 784246 125
page-2 1680882 641212 200
page-3 2322094 684826 275
[... truncated ...]
page-596 256651848 183162 53100
Using a reader configured as:
AvroParquetReader
    .<GenericRecord>builder(HadoopInputFile.fromPath(path, conf))
    .withFilter(FilterCompat.get(
        FilterApi.eq(
            FilterApi.binaryColumn(urlKey),
            Binary.fromString(url)
        )
    ))
    .withConf(conf)
    .build();
Thanks to the column index and offset index, I was expecting the reader to read only 2 pages:
- the one containing the url matching min/max, using the column index;
- then the one containing the matching row index for content, using the offset index.
But what I see is that the reader reads and decodes hundreds of pages (~250 MB) for the content column. Am I missing something about how PageIndex is supposed to work in parquet-mr?
Looking at the 'loading page' and 'skipping record' log lines, it seems to build the whole record before applying the filter on url, which, in my opinion, defeats the purpose of PageIndex.
I tried to look online and dive into how the reader works, but I could not find anything.
Edit:
I found an open PR from 2015 on parquet-column hinting that the current reader (at the time, at least) does indeed build the whole record, with all the required columns, before applying the predicate:
https://github.com/apache/parquet-mr/pull/288
But I fail to see, in this context, the purpose of the column offsets.
It turns out that, even though this is not what I expected from reading the spec, it is working as intended.
From this issue I quote:
The column url has 3 pages. Your filter finds out that page-0 matches. Based on the offset index it is translated to the row range [0..19999]. Therefore, we need to load page-0 for the column url and all the pages are in the row range [0..19999] for column content.
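The quoted row-range logic can be sketched as follows (a toy reconstruction, not parquet-mr code): each page's row range comes from the first row indices in the offset index, and every content page overlapping the qualifying range must be loaded.

```python
def pages_in_row_range(first_row_indices, total_rows, lo, hi):
    """Return indices of pages whose row range overlaps [lo, hi].

    first_row_indices: first row index of each page (from the offset index).
    total_rows: total number of rows in the row group.
    """
    pages = []
    for i, start in enumerate(first_row_indices):
        # A page ends where the next page begins, or at the last row.
        end = first_row_indices[i + 1] - 1 if i + 1 < len(first_row_indices) else total_rows - 1
        if start <= hi and end >= lo:
            pages.append(i)
    return pages

# url pages start at rows 0, 20000, 40000: only page-0 matches the predicate,
# so the qualifying row range is [0..19999]. But content pages are tiny
# (first rows 0, 125, 200, 275, ...), so hundreds of them overlap that range.
```

This is why a wide qualifying row range on one column forces many small pages of the other column to be read, even with PageIndex working as intended.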

How can I more efficiently find the height of a table using Python

I am using openpyxl to copy data from an Excel spreadsheet. The data is a table for an inventory database, where each row is an entry in the database. I read the table one row at a time using a for loop. In order to determine the range of the for loop, I wrote a function that examines each cell in the table to find the height of the table.
Code:
def find_max(self, sheet, row, column):
    max_row = 0
    cell_top = sheet.cell(row = row - 1, column = column)
    while cell_top.value != None:
        cell = sheet.cell(row = row, column = column)
        max = 0
        while cell.value != None or sheet.cell(row = row + 1, column = column).value != None:
            row += 1
            max = max + 1
            cell = sheet.cell(row = row, column = column)
        if max > max_row:
            max_row = max
        cell_top = sheet.cell(row = row, column = column + 1)
    return max_row
To summarize the function: I move to the next column in the worksheet and then iterate through every cell in that column, keeping track of its height, until there are no more columns. The catch is that the function has to find two empty cells in a row for the condition to fail. In a previous version I used a similar approach but only checked one column, stopping as soon as I found a blank cell. I had to change it so the program would still run if the user forgot to fill out a column. This function works okay for a small table, but on a table with several hundred entries it makes the program run much slower.
My question is this: what can I do to make this more efficient? I know nesting a while loop like that makes a program take longer, but I do not see how to get around it. I have to make the program as foolproof as possible, so I need to check more than one column to stop user errors from breaking it.
This is untested, but every time I've used openpyxl, I iterate over all rows like so:
for row in active_worksheet:
    do_something_to(row)

so you could count like:

count = 0
for row in active_worksheet:
    count += 1
EDIT: a better solution is in this question: Is it possible to get an Excel document's row count without loading the entire document into memory?
Read-only mode works row-by-row on the source, so you probably want to hook into that. Alternatively, you could pass the cells of a worksheet into something like a pandas DataFrame, which has indices for empty cells.
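A sketch of that single-pass idea (the function name and parameters are mine, not openpyxl's): walk the rows once, and stop at the first row whose leading cells are all empty, checking two columns so one forgotten column does not end the count early.

```python
def table_height(rows, check_cols=2):
    """Count contiguous data rows; stop at the first row whose first
    check_cols cells are all empty (tolerates one forgotten column)."""
    height = 0
    for row in rows:
        if all(cell is None or cell == '' for cell in row[:check_cols]):
            break
        height += 1
    return height

# With openpyxl read-only mode this is one pass over the sheet, e.g.:
#   wb = load_workbook('inventory.xlsx', read_only=True)
#   height = table_height(wb.active.iter_rows(values_only=True))
```

This replaces the nested while loops with a single O(rows) scan, which should scale to tables with hundreds of entries.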

How to apply rowspan and colspan to cell in Dhtmlx Grid view

My dhtmlxgrid view, using JSON data, looks like the one below.
code     desc          Qtytype  w1   w2
Part A   Part A desc   Demand   100  200
                       issued   150  100
                       stock    200  200
                       F/C      100  250
Part B   Part B desc   Demand   100  200
                       issued   200  100
                       stock    300  200
                       F/C      100  250
I want to apply rowspan and colspan at the cell level. Could anyone suggest how to apply spans at the cell level to produce the view above? Thanks in advance.
There are two ways:
a) after data loading, you can use the js api to set the necessary colspans and rowspans

grid.load("some.url", function(){
    grid.setRowspan(1, 0, 4); // 1 - id of row
    grid.setRowspan(1, 1, 4); // 1 - id of row
});
b) you can define rowspans directly in the data; in the case of xml it will be

<row id="1"><cell rowspan="4">Part A</cell>

As far as I know, a similar syntax should be available for json, but it is buggy in the current version (3.5) and works for xml only.
After the data is loaded, you can apply it like this:

dfGrid.forEachRow(function(id){
    if (id % 4 == 0) {
        dfGrid.forEachCell(id, function(cellObj, ind){
            if (ind <= 3) {
                dfGrid.setRowspan(id, ind, 4);
            }
        });
    }
});
4 denotes how many rows the rowspan should cover.
3 denotes how many columns the rowspan should be applied to.
Note: the id should start from a multiple of 4. The value varies with the number of rows to span; e.g. 3 means the id should start from a multiple of 3.
Thank you.
