How to add data to one table from a large number of tables - gspread

I have a problem with my project. I have more than 45 tables with 6 sheets in each. My script must find the rows that appear both in those tables and in another table, then insert each matching row into a table for matching rows. It works, but it has one problem: the standard quota is 100 requests per 100 seconds. I tried to work around this by calling time.sleep(1) after each request, but with more than 45 tables it takes a long time to find all the matching rows.
if i == x:
    dopler_cell_list = dopler.range(f'A{str(len(lol2) + length)}:AI{str(len(lol2) + length)}')
    time.sleep(1)
    for cell in dopler_cell_list:
        cell.value = output_cell_list[count].value
        time.sleep(1)
        count += 1
How can I make it faster?
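One possible direction, sketched under assumptions rather than taken from the post: in gspread, assigning to cell.value only changes a local object, so the sleep inside the loop delays the script without saving any quota. The calls that count are range() and the write-back. The helper below fetches the destination row once, fills it locally, and sends it with a single update_cells() call. The name copy_row and the pause parameter are made up for illustration, and the sketch assumes dest_ws is a gspread worksheet and source_cells is a list of cells such as output_cell_list.

```python
import time


def copy_row(dest_ws, row_number, source_cells, pause=1.0):
    """Copy one row into dest_ws with two requests instead of one per cell."""
    # One read request: fetch the destination cells for columns A..AI.
    dest_cells = dest_ws.range(f"A{row_number}:AI{row_number}")
    # Assigning .value only edits the local Cell objects; nothing is sent yet.
    for dest, src in zip(dest_cells, source_cells):
        dest.value = src.value
    # One write request: push all changed cells in a single batch.
    dest_ws.update_cells(dest_cells)
    # Pause once per row rather than once per cell.
    time.sleep(pause)
    return len(dest_cells)
```

With this shape each copied row costs two requests, so the pause can be tuned per row instead of per cell.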

Related

How do I add specific values of columns to create new columns?

I have a dataset that I want to reshape in order to perform a repeated-measures ANOVA. My dataset has the form:
set.seed(32)
library(tibble)
id <- rep(1:2, each = 3)
y_0 <- rep(rnorm(2, mean = 50, sd = 10), each = 3)
time <- rep(c(1, 2, 3), times = 2)
c <- rep(rnorm(2, mean = 10, sd = 12), each = 3)
data <- tibble(id, y = y_0, t = time, c)
I want to bring the dataset into the form used for a repeated-measures ANOVA, meaning one row per id and three new columns: y_1 for y+c at time 1, y_2 for y+c at time 2, and y_3 for y+c at time 3. Can anyone provide some assistance?

How to slice the dataset in Python in specific intervals

I have a dataset with n rows. How can I access a fixed number of rows at a fixed interval through the whole dataset using Python?
For example, in a 100-row dataset I want to access 10 rows, skip the next 10, and so on, like 1:10, 20:30, 40:50, 60:70, 80:90.
I could think of something like this:
df.iloc[np.array([int(x/10) for x in df.index]) % 2 == 0]
It takes the index of the dataframe, divides it by 10 and casts it to an int. This basically just removes the last digit in this example.
With the modulo statement the first 10 rows are True, the next 10 False and so on. This is then used with iloc to get just the lines with the True value.
This requires a continuously increasing integer index (0, 1, 2, ...). If, for example, some rows were already filtered out, this is not the case; reset_index can be used to reset the index.
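As a hedged variation on the answer above, the mask can be built from row positions instead of index values, which avoids the dependence on the index. The sketch below uses plain Python lists as a stand-in for the dataset; block_mask is a made-up helper name.

```python
def block_mask(n_rows, block=10):
    """Return booleans that keep `block` rows, skip `block` rows, and repeat."""
    # i // block numbers the blocks 0, 1, 2, ...; even-numbered blocks are kept.
    return [(i // block) % 2 == 0 for i in range(n_rows)]


rows = list(range(100))  # stand-in for a 100-row dataset
mask = block_mask(len(rows))
selected = [row for row, keep in zip(rows, mask) if keep]
# With pandas, the same mask could be applied by position: df.iloc[mask]
```

Because the mask depends only on position, it gives the same selection whether or not rows were filtered out earlier.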

How to understand part and partition of ClickHouse?

I see that ClickHouse created multiple directories for each partition key.
Documentation says the directory name format is: partition name, minimum number of data block, maximum number of data block and chunk level. For example, the directory name is 201901_1_11_1.
I think it means that the directory is a part which belongs to partition 201901, has the blocks from 1 to 11 and is on level 1. So we can have another part whose directory is like 201901_12_21_1, which means this part belongs to partition 201901, has the blocks from 12 to 21 and is on level 1.
So I think a partition is split into different parts.
Am I right?
Parts are pieces of a table that store rows. One part = one folder with columns.
Partitions are virtual entities. They have no physical representation, but you can say that these parts belong to the same partition.
Select does not care about partitions.
Select is not aware of partitioning keys, because each part has special files minmax_{PARTITIONING_KEY_COLUMN}.idx.
These files contain the min and max values of these columns in this part.
These min/max values are also kept in memory, in the list of parts (a C++ vector).
create table X (A Int64, B Date, K Int64,C String)
Engine=MergeTree partition by (A, toYYYYMM(B)) order by K;
insert into X values (1, today(), 1, '1');
cd /var/lib/clickhouse/data/default/X/1-202002_1_1_0/
ls -1 *.idx
minmax_A.idx <-----
minmax_B.idx <-----
primary.idx
SET send_logs_level = 'debug';
select * from X where A = 555;
(SelectExecutor): MinMax index condition: (column 0 in [555, 555])
(SelectExecutor): Selected 0 parts by date
SelectExecutor checked the in-memory part list and found 0 parts because minmax_A.idx = (1, 1) and this select needed (555, 555).
CH does not store partitioning key values.
So for example toYYYYMM(today()) = 202002 but this 202002 is not stored in a part or anywhere.
minmax_B.idx stores (18302, 18302) (2020-02-10 == select toInt16(today()))
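The pruning described above can be pictured with a small toy model. This is only a sketch of the idea, not ClickHouse code: the Part class and select_parts function are made-up names, and the min/max values are the ones from the example.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    minmax: dict  # column name -> (min, max) of that column within the part


def select_parts(parts, column, value):
    """Keep only parts whose [min, max] range for `column` can contain `value`."""
    return [p for p in parts if p.minmax[column][0] <= value <= p.minmax[column][1]]


parts = [Part("1-202002_1_1_0", {"A": (1, 1), "B": (18302, 18302)})]
print(len(select_parts(parts, "A", 555)))  # 0: 555 lies outside (1, 1)
print(len(select_parts(parts, "A", 1)))    # 1
```

The check uses only the stored ranges, so no partition value such as 202002 is needed to skip a part.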
In my case, I used groupArray() and arrayEnumerate() for ranking in a POPULATE query. I assumed that POPULATE would run the query on the new data per partition (in my case, toStartOfDay(Date)). The total sum of the newly inserted data is correct, but groupArray() does not work correctly.
I think this happens because, when a part is inserted, CH applies groupArray() and ranks within that part immediately, and only afterwards merges the parts into one partition, so I do not get the exact final result of groupArray() and arrayEnumerate().
In summary,
Merge[groupArray(part_1) + groupArray(part_2)]
is different from
groupArray(Partition)
with
Partition = part_1 + part_2
The solution I tried is to insert the new data as a single block, using groupArray() to reduce the new data to fewer rows than max_insert_block_size=1048576. That worked correctly, but it is hard to insert one day of new data as a single part, because populating one day of data (almost 150-200 million rows) uses too much memory for the query.
Do you have another solution for POPULATE with groupArray() on newly inserted data, such as forcing CH to apply POPULATE per partition, after all parts have been merged into one partition, rather than per part?
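The mismatch described in this comment can be shown with a toy model. It does not call ClickHouse; the rank helper is a made-up stand-in for ranking with groupArray() and arrayEnumerate(), and the values are invented.

```python
def rank(values):
    """Pair each value with its 1-based position in sorted order."""
    return [(v, i) for i, v in enumerate(sorted(values), start=1)]


part_1 = [30, 10]
part_2 = [20]

per_part = rank(part_1) + rank(part_2)  # ranked inside each part, then combined
whole = rank(part_1 + part_2)           # ranked once over the whole partition

print(per_part)  # [(10, 1), (30, 2), (20, 1)]
print(whole)     # [(10, 1), (20, 2), (30, 3)]
```

Ranks computed inside each part restart at 1, so combining them afterwards cannot reproduce the ranking over the full partition.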

How can I more efficiently find the height of a table using Python

I am using openpyxl to copy data from an Excel spreadsheet. The data is a table for an inventory database, where each row is an entry in the database. I read the table one row at a time using a for loop. In order to determine the range of the for loop, I wrote a function that examines each cell in the table to find the height of the table.
Code:
def find_max(self, sheet, row, column):
    max_row = 0
    cell_top = sheet.cell(row = row - 1, column = column)
    while cell_top.value != None:
        cell = sheet.cell(row = row, column = column)
        max = 0
        while cell.value != None or sheet.cell(row = row + 1, column = column).value != None:
            row += 1
            max = max + 1
            cell = sheet.cell(row = row, column = column)
        if max > max_row:
            max_row = max
        cell_top = sheet.cell(row = row, column = column + 1)
    return max_row
To summarize the function, I move to the next column in the worksheet and then iterate through every cell in that sheet, keeping track of its height until there are no more columns. The catch about this function is that it has to find two empty cells in a row in order to fail the condition. In a previous version I used a similar approach, but only used one column and stopped as soon as I found a blank cell. I had to change it so the program would still run if the user forgot to fill out a column. This function works okay for a small table, but on a table with several hundred entries this makes the program run much slower.
My question is this: what can I do to make this more efficient? I know nesting a while loop like that makes a program take longer, but I do not see how to get around it. I have to make the program as foolproof as possible, so I need to check more than one column to stop user errors from failing the program.
This is untested, but every time I've used openpyxl, I iterate over all rows like so:
for row in active_worksheet:
    do_something_to(row)
so you could count like:
count = 0
for row in active_worksheet:
    count += 1
EDIT: A better solution is in this question: "Is it possible to get an Excel document's row count without loading the entire document into memory?"
Read-only mode works row-by-row on the source, so you probably want to hook into it. Alternatively, you could pass the cells of a worksheet into something like a pandas DataFrame, which has indices for empty cells.
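Building on the row-by-row idea, here is a minimal sketch of finding the table height in a single pass. It assumes the rows arrive as tuples of cell values, which is what openpyxl's sheet.iter_rows(values_only=True) yields; table_height is a made-up name, and the sample rows are invented.

```python
def table_height(rows):
    """Return the 1-based index of the last row holding any non-empty cell."""
    last = 0
    for index, row in enumerate(rows, start=1):
        # A row counts as filled if at least one of its cells has a value.
        if any(value is not None for value in row):
            last = index
    return last


# Stand-in for sheet.iter_rows(values_only=True)
rows = [("id", "name"), (1, "bolt"), (None, None), (3, None), (None, None)]
print(table_height(rows))  # 4
```

Checking every cell of a row means a column the user forgot to fill in, or a blank row in the middle, does not cut the table short.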

Checking to see if a user has filled in a table in Matlab

Can the expression if (size(cost,1) == 2 && size(limit,1) == 2) be used? I want to take the data from the cost table and the limit table. The cost table is 4 by 3 and the limit table is 4 by 2. I want to take the data (entered by the user) from the limit table. I have this code:
if P1 < limit(1,1)
    P1 = limit(1,1);
    lambdanew = P1*2*cost(1,3) + cost(1,2);
end
My program runs only if the user has entered data into the limit table. If the user has not entered the data, I get this error:
Index exceeds matrix dimensions.
Error in ==> fyp_editor>Mybutton_Callback at 100
if P1 < limit(1,1)
So my question is: how can I write an if statement for the limit table that handles the case where the user did not enter the data?
Is it limit(0), limit = 0 or limit == 0?
Can you initialize the limit table somehow so you know it exists but that the user didn't enter any information in it? If limit table is 4 by 2, try limit = zeros(4,2). Hope that helps.
If you want to make sure that limit is an array of size (4,2), you can do the following:
if ~all(size(limit)==[4 2])
    h = errordlg('please fill in all values for "limit"');
    uiwait(h)
    return
end
Thus, the user gets an error message popping up, after which the callback stops executing.
