Is there a way to return data from ClickHouse not by rows but by columns?
So instead of a result in the following form for columns a and b:
    a  b
    1  2
    3  4
    5  6
I'd get a transposed result
    a: 1, 3, 5
    b: 2, 4, 6
The point is that I want to access the data per column, e.g. iterate over everything in column a.
I was checking the available output formats; Arrow would do, but it is not supported by my platform for now.
I'm looking for the most efficient way. For example, since ClickHouse already stores data in columns, it should not have to assemble the data into rows only for me to transfer the result back into columns with array functions afterwards. I'm not very familiar with the internals, but I was wondering whether I could somehow skip transposing rows entirely if the data is already stored in columns.
Obviously there is no easy way to do it.
And a bigger issue is that it goes against the SQL model.
You can use the native protocol, although you will get the columns in blocks of about 65k rows each:
col_a 65k values, col_b 65k values, col_a next 65k values, col_b next 65k values, and so on.
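As an illustration of that answer, here is a minimal sketch using the Python clickhouse-driver package, which speaks the native protocol; the host, table, and column names are placeholders, and columnar=True asks the driver to hand back the result as a list of columns rather than a list of rows:

    from clickhouse_driver import Client  # native TCP protocol client

    client = Client(host='localhost')  # placeholder host

    # columnar=True returns a list of columns instead of a list of rows,
    # so no row-to-column transposition is needed on the client side.
    col_a, col_b = client.execute(
        'SELECT a, b FROM my_table',  # placeholder table and columns
        columnar=True,
    )

    for value in col_a:  # iterate over everything in column a
        print(value)

As far as I can tell, the server still sends the data in blocks (on the order of 65k rows by default), but the driver concatenates them, so each column comes back as a single sequence.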
Related
I maintain a table in Oracle that contains several hundred thousand rows, including a priority column which indicates, for each row, its importance according to the needs of the system.
    ID  BRAND  COLOR  VALUE  SIZE  PRIORITY  EFFECTIVE_DATE_FROM  EFFECTIVE_DATE_TO
    1   BL     BLUE   58345  12    1         10/07/2022           NULL
    2   TK     BLACK  4455   1     1         10/07/2022           NULL
    3   TK     RED    16358  88    2         11/01/2022           NULL
    4   WRA    RED    98     10    6         18/07/2022           NULL
    5   BL     BLUE   20942  18    7         02/06/2022           NULL
At any given moment thousands more rows may enter the table, and it is necessary to SELECT from it the 1000 rows with the highest priority.
Although the naive solution is to SELECT using ORDER BY PRIORITY ASC, we find that the process takes a long time when the table contains a very large number of rows (say, over 2 million records).
One solution proposed so far is to divide the table into 2 different tables, so that records with priority 1 are inserted into Table A and all other records into Table B, and then to SELECT using a UNION between the two tables.
This way we save the ORDER BY process since Table A always contains priority 1, and it remains to sort only the data in Table B, which is expected to be smaller than the overall large table.
On the other hand, it was also suggested to leave the large table in place, and perform the SELECT using PARTITION BY on the priority column.
I searched the web for differences in speed or efficiency between the two options but did not find any, and I am debating how to proceed. Which of the options is preferable if we focus on efficiency and time complexity?
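Not from the original discussion, but for comparison, a minimal sketch of the plain single-table approach, assuming Oracle 12c+ row-limiting syntax and a hypothetical table named ITEMS with the columns shown above; with an index on PRIORITY, the optimizer can often satisfy the top-N query by reading the first 1000 index entries instead of sorting the whole table:

    -- hypothetical index supporting the top-N query
    CREATE INDEX ix_items_priority ON items (priority);

    -- top-N by priority (priority 1 = most important);
    -- with the index above, Oracle can usually stop after 1000 rows
    SELECT *
    FROM   items
    ORDER  BY priority ASC
    FETCH  FIRST 1000 ROWS ONLY;

The same query against a table list-partitioned on PRIORITY would only need to touch the lowest-priority partitions, which is roughly what the second suggestion amounts to.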
I am trying to extract a table to a flat file using Python and the ssas_api package, which allows me to run DAX queries from Python code.
The table is fairly big, and because of that a simple EVALUATE tablename query times out after 1 hour.
I want to split the query into smaller ones, iterating over the table in chunks of, let's say, 20k rows.
I could do the first chunk using TOPN but what about the next ones?
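One possible pattern (a sketch, not something from the original post): if the table has a column with unique, sortable values, say tablename[ID] here (a hypothetical key column), you can page through it with TOPN by filtering out the rows already fetched, carrying the last ID of each chunk into the next query:

    // first chunk: the 20k rows with the lowest IDs
    EVALUATE TOPN ( 20000, tablename, tablename[ID], ASC )

    // next chunk: the 20k lowest IDs above the last ID of the previous chunk
    // (replace 123456 with the largest [ID] value returned by that chunk)
    EVALUATE TOPN (
        20000,
        FILTER ( tablename, tablename[ID] > 123456 ),
        tablename[ID],
        ASC
    )

Repeating this in a loop from Python and concatenating the chunks keeps each individual query small enough that it should stay well under the timeout.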
Let's say I have a Google sheet with tab1 and tab2
In tab1 I have 2000 rows and 20 columns filled with data, and 1000 rows are empty, so I have 3000 rows in total.
In tab2 I have a few formulas like vlookup and some if functions.
The options I can think of are:
I can name the range of the data in tab1 and use that in the formula(s) and if the range expands, I can edit the range
I can use a whole-column reference like B:B
I can delete the empty rows and use B:B
what is the fastest way?
All three of those options have no real-world effect on overall performance, given that you have only 3000 rows across 20 columns. The biggest impact on performance comes from QUERYs, IMPORTRANGEs and ARRAYFORMULAs when they are fed a huge amount of data (10000+ rows), or when you have extensive calculations with multiple sub-steps consisting of whole virtual arrays.
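Purely for illustration (the sheet name, named range, and column letters are hypothetical), the three options could look like this in a VLOOKUP:

    Named range:         =VLOOKUP(A2, DataRange, 5, FALSE)
    Whole-column range:  =VLOOKUP(A2, tab1!A:T, 5, FALSE)
    Fixed range:         =VLOOKUP(A2, tab1!A1:T2000, 5, FALSE)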
I have a data frame with 9 columns and many rows. I want to filter all the rows that have observations greater than 3.0 in at least 3 columns. Which conditional statements should I use to subset my data frame?
Since I am a n00b, I only came up with this:
data_frame[data_frame > 3,]
Obviously, this gives me all the rows for which all values are > 2, regardless of what I actually need.
Thanks!
I figured that you could also combine logical operators:
data[rowSums(data>2)>=3,]
Like this, you can subset from the data frame the rows in which an observation higher than 2 occurs three or more times, without having to specify the columns.
The logical operator, in this case, is the brain. I had used sum(rowSums(data)) > x, where x is the limit value times the number of columns available.
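A self-contained sketch of that rowSums() approach on made-up data (the data frame, seed, and values are invented for illustration, using the "greater than 3.0 in at least 3 columns" condition from the question):

    # toy data frame with 9 numeric columns and 10 rows
    set.seed(1)
    data <- as.data.frame(matrix(runif(90, min = 0, max = 5), ncol = 9))

    # data > 3 gives a logical matrix (TRUE where an observation exceeds 3);
    # rowSums() counts the TRUEs per row; keep rows with at least 3 of them
    filtered <- data[rowSums(data > 3) >= 3, ]

    nrow(filtered)  # how many rows satisfy the condition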
I'm considering a table design in Oracle/SQL where many rows are related to each other and will always be queried together, but contain different information (although many of the columns will hold similar values across the related rows).
In that sense, it seems to me that it would be more efficient to somehow compress several rows into a single row in Oracle that contains a single common recordID and is always stored together on disk since they are always inserted, deleted, queried and extracted together. For this type of table, is there some sort of Row Compression that can be used so that these related rows aren't treated as individual rows for better performance?
Updated: An example would be as follows
    Field1  Field2  Field3
    1       1       A
    1       2       B
    1       3       C
    2       4       D
    2       5       E
In this example, I would always insert and query the first three rows together (because they share Field1 values). They are separate pieces of data, but they are never separated from each other. Is there some way to insert, store, index and extract them as a group while keeping them as separate data rows?
If you are on Exadata then you can go for Exadata Hybrid Columnar Compression.
http://www.oracle.com/technetwork/database/exadata/ehcc-twp-131254.pdf
If you are not on Exadata you can still use OLTP compression.
http://allthingsoracle.com/compression-in-oracle-part-3-oltp-compression/
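A minimal sketch of what enabling OLTP compression could look like, using the example columns above (the table name is made up) and the 11g-era COMPRESS FOR OLTP clause; in 12c and later the equivalent clause is ROW STORE COMPRESS ADVANCED:

    -- create the table with OLTP (advanced row) compression enabled
    CREATE TABLE related_rows (
        field1 NUMBER,
        field2 NUMBER,
        field3 VARCHAR2(10)
    ) COMPRESS FOR OLTP;

    -- or switch it on for an existing table (applies to newly inserted rows)
    ALTER TABLE related_rows COMPRESS FOR OLTP;

Keep in mind that compression shrinks the storage taken by repeated column values; it does not by itself guarantee that the related rows end up physically next to each other.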