Using parquet-mr 1.11.0, I have a schema such as:
message page {
  required binary url (STRING);
  optional binary content (STRING);
}
I'm doing a single-row lookup by url to retrieve the associated content.
Rows are ordered by url.
The file was created with:
parquet.block.size: 256 MB
parquet.page.size: 10 MB
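As an illustration only (not the exact writer code used), these two properties map onto the AvroParquetWriter builder roughly like this; schema and path are placeholders:
ParquetWriter<GenericRecord> writer = AvroParquetWriter
    .<GenericRecord>builder(path)
    .withSchema(schema)                   // Avro schema of the 'page' records
    .withRowGroupSize(256 * 1024 * 1024)  // parquet.block.size: 256 MB
    .withPageSize(10 * 1024 * 1024)       // parquet.page.size: 10 MB
    .build();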
Using parquet-tools I was able to verify that I do indeed have column indexes and/or offset indexes for my columns:
column index for column url:
Boundary order: ASCENDING
null count min max
page-0 0 http://materiais.(...)delos-de-curriculo https://api.quero(...)954874/toogle_like
page-1 0 https://api.quero(...)880/toogle_dislike https://api.quero(...)ior-online/encceja
page-2 0 https://api.quero(...)erior-online/todos https://api.quero(...)nte-em-saude/todos
offset index for column url:
offset compressed size first row index
page-0 4 224274 0
page-1 224278 100168 20000
page-2 324446 67778 40000
column index for column content:
NONE
offset index for column content:
offset compressed size first row index
page-0 392224 504412 0
page-1 896636 784246 125
page-2 1680882 641212 200
page-3 2322094 684826 275
[... truncated ...]
page-596 256651848 183162 53100
Using a reader configured as:
ParquetReader<GenericRecord> reader = AvroParquetReader
    .<GenericRecord>builder(HadoopInputFile.fromPath(path, conf))
    .withFilter(FilterCompat.get(
        FilterApi.eq(
            FilterApi.binaryColumn(urlKey),
            Binary.fromString(url))))
    .withConf(conf)
    .build();
Thanks to the column index and offset index I was expecting the reader to read only 2 pages:
the url page whose min/max range contains the value, found via the column index;
then the page containing the matching row index for content, found via the offset index.
But what I see is that the reader reads and decodes hundreds of pages (~250 MB) for the content column. Am I missing something about how the page index is supposed to work in parquet-mr?
Looking at the 'loading page' and 'skipping record' log lines, the reader seems to build the whole record before applying the filter on url, which, in my opinion, defeats the purpose of the page index.
I tried to look online and dive into how the reader works but I could not find anything.
Edit:
I found an open PR from 2015 against parquet-column hinting that the reader (at the time, at least) does indeed build the whole record with all the required columns before applying the predicate:
https://github.com/apache/parquet-mr/pull/288
But in this context I fail to see the purpose of the offset indexes.
It turns out that, even though this is not what I expected from reading the specs, it is working as intended.
From this issue I quote:
The column url has 3 pages. Your filter finds out that page-0 matches. Based on the offset index it is translated to the row range [0..19999]. Therefore, we need to load page-0 for the column url and all the pages are in the row range [0..19999] for column content.
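To make that concrete, here is a small standalone sketch (plain Java, not parquet-mr internals) of the logic described in the quote: given the firstRowIndex values from the offset index of content, it computes which content pages overlap the row range [0..19999] selected by the url filter. The total row count used below is a placeholder.
import java.util.ArrayList;
import java.util.List;

public class RowRangeSketch {
    // Pages whose row ranges intersect [from, to] must be loaded.
    static List<Integer> pagesForRowRange(long[] firstRowIndex, long totalRows, long from, long to) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < firstRowIndex.length; i++) {
            long pageFrom = firstRowIndex[i];
            long pageTo = (i + 1 < firstRowIndex.length ? firstRowIndex[i + 1] : totalRows) - 1;
            if (pageFrom <= to && pageTo >= from) {
                pages.add(i);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // First row indexes of the first few content pages, taken from the offset index
        // above (truncated); each content page only holds roughly 75-125 rows.
        long[] contentFirstRowIndex = {0, 125, 200, 275 /* , ... */};
        // The url filter matched page-0 only, i.e. the row range [0..19999].
        // 60000 is a placeholder for the total row count of the row group.
        System.out.println(pagesForRowRange(contentFirstRowIndex, 60000, 0, 19999));
        // With the full offset index this selects every content page whose row range lies
        // below row 20000 -- hundreds of pages, hence the ~250 MB read.
    }
}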
I'm learning about PowerBuilder, and I don't know how to use these: DWItemStatus, GetNextModified, ModifiedCount, GetItemStatus, NotModified!, DataModified!, New!, NewModified!.
Please help me.
Thanks for reading!
These relate to the status of rows in a datawindow. Generally the rows are retrieved from a database, but this doesn't always have to be the case - data can be imported from a text file, XML, JSON, etc. as well.
DWItemStatus - these values are constants that describe how the data would be changed in the database.
Values are:
NotModified! - data unchanged since retrieved
DataModified! - data in one or more columns has changed
New! - row is new but no values have been assigned
NewModified! - row is new and at least one value has been assigned to a column.
So in terms of SQL, a row which is not modified would not generate any SQL to the DBMS. A DataModified row would typically generate an UPDATE statement. New and NewModified rows would typically generate INSERT statements.
GetNextModified is a method to search a set of rows in a datawindow to find the modified rows within that set. The method takes a row parameter and a buffer parameter. The datawindow buffers are Primary!, Filter!, and Delete!. In general you would only look at the Primary buffer.
ModifiedCount is a method to determine the number of rows which have been modified in a datawindow. Note that deleting a row is not considered a modification. To find the number of rows deleted, use the DeletedCount method.
GetItemStatus is a method to get the status of a column within a row in a datawindow. It takes the parameters row, column (name or number), and DWBuffer.
So now an example of using this:
// loop through rows checking for changes
long ll
datawindow ldw

IF dw_dash.ModifiedCount() > 0 THEN
    ll = dw_dash.GetNextModified(0, Primary!)
    ldw = dw_dash
    DO WHILE ll > 0
        // watch value changed
        IF ldw.GetItemStatus(ll, 'watch', Primary!) = DataModified! THEN
            event we_post_item(ll, 'watch', ldw)
        END IF
        // followup value changed
        IF ldw.GetItemStatus(ll, 'followupdate', Primary!) = DataModified! THEN
            event we_post_item(ll, 'followupdate', ldw)
        END IF
        ll = ldw.GetNextModified(ll, Primary!)
    LOOP
    ldw.ResetUpdate() // reset the modified flags
END IF
In this example we first check whether any row in the datawindow has been modified. Then we get the first modified row and check whether either the 'watch' or 'followupdate' column was changed. If it was, we trigger an event to do something. We then loop to the next modified row, and so on. Finally we reset the modified flags, so the rows now show as not modified.
I see that ClickHouse created multiple directories for each partition key.
The documentation says the directory name format is: partition name, minimum data block number, maximum data block number, and chunk level. For example, the directory name is 201901_1_11_1.
I think it means that the directory is a part which belongs to partition 201901, has the blocks from 1 to 11 and is on level 1. So we can have another part whose directory is like 201901_12_21_1, which means this part belongs to partition 201901, has the blocks from 12 to 21 and is on level 1.
So I think a partition is split into different parts.
Am I right?
Parts are pieces of a table which store rows. One part = one folder with columns.
Partitions are virtual entities. They don't have a physical representation. But you can say that these parts belong to the same partition.
SELECT does not care about partitions.
SELECT is not aware of partitioning keys, BECAUSE each part has special files minmax_{PARTITIONING_KEY_COLUMN}.idx.
These files contain the min and max values of these columns in that part.
These minmax_ values are also kept in memory in the (C++ vector) list of parts.
create table X (A Int64, B Date, K Int64, C String)
Engine=MergeTree partition by (A, toYYYYMM(B)) order by K;
insert into X values (1, today(), 1, '1');
cd /var/lib/clickhouse/data/default/X/1-202002_1_1_0/
ls -1 *.idx
minmax_A.idx <-----
minmax_B.idx <-----
primary.idx
SET send_logs_level = 'debug';
select * from X where A = 555;
(SelectExecutor): MinMax index condition: (column 0 in [555, 555])
(SelectExecutor): Selected 0 parts by date
SelectExecutor checked the in-memory part list and found 0 parts, because minmax_A.idx = (1,1) and this select needs (555, 555).
CH does not store partitioning key values.
So, for example, toYYYYMM(today()) = 202002, but this 202002 is not stored in the part or anywhere else.
minmax_B.idx stores (18302, 18302) (2020-02-10 == select toInt16(today()))
In my case, I had used groupArray() and arrayEnumerate() for ranking in POPULATE. I thought that POPULATE would run the query with the new data over the whole partition (in my case: toStartOfDay(Date)). The total sum of the newly inserted data is correct, but the groupArray() function does not work correctly.
I think this happens because when one part is inserted, CH applies groupArray() and the ranking to that part immediately and only then merges the parts into one partition, so I don't get the final result of groupArray() and arrayEnumerate() that I expect.
In summary, merging
[groupArray(part_1) + groupArray(part_2)]
is different from
groupArray(Partition)
with
Partition = part_1 + part_2
The solution I tried is to insert the new data as a single block, e.g. using groupArray() to reduce the new data to a number of rows lower than max_insert_block_size = 1048576. That works correctly, but it is hard to insert one day of new data as one part, because it uses too much memory for the query when populating one day of data (almost 150M-200M rows).
But do you have another solution for POPULATE with groupArray() on newly inserted data, such as forcing CH to apply POPULATE per partition (after all the parts are merged into one partition), not per part?
I have the following formula working in an existing sheet (Sheet2), filtering data from the Source sheet:
=filter({{Source!A1:F115},{Source!R1:R115},{Processed!T1:T115}},Source!Q1:Q115=w2)
But when a new row is entered in the Source sheet, it breaks with the error:
filter has mismatched range size. Expected row count 1, column count 1. Actual row count 116, column count 1.
When I check, the formula has become:
=filter({{Source!A1:F116},{Source!R1:R116},{Processed!T1:T116}},Source!Q1:Q115=w2)
How can I fix this?
Try not to include the end row:
=FILTER({{Source!A1:F}, {Source!R1:R}, {Processed!T1:T}}, Source!Q1:Q=W2)
If that's not an option, you can try to freeze it:
=FILTER({{INDIRECT("Source!A1:F115")}, {INDIRECT("Source!R1:R115")},
{INDIRECT("Processed!T1:T115")}}, Source!Q1:Q115=W2)
Or you can try something crazy like:
=FILTER({{Source!A1:F115}, {Source!R1:R115}, {Processed!T1:T115}},
INDIRECT("Source!Q1:Q"&COUNTA(Source!R1:R))=W2)
I have data, for example, like the following:
I need to match the input provided for the Content and Range fields against the data to return the matching rows. As you can see, the Content field is a collection of strings and the Range field is a range between two numbers. I am looking at hashing the data, to be matched against the hashed input. I was thinking of iterating through the collection of individual strings, hashing each one, and storing the hash codes for the Content field. For the Range field I was looking at using interval trees. But the challenge then is: when I hash the Content input and the Range input, how will I find whether that hash code is present among the hash codes generated for the collection of strings in the Content field, and likewise for the Range field?
Please do let me know if there are any other alternate ways in which this can be achieved. Thanks.
There is a simple solution to your problem: Inverted Index.
For each item in Content, create an inverted index that maps 'Content' to 'RowID', i.e. create another table of 2 columns: Content (string), RowIDs (comma-separated string).
For your first row, add the entries {Azd, 1}, {Zax, 1}, {Gfd, 1}..., {Mni, 1} to that table. For the second row, add entries for the new Content strings. For a Content string already present in the first row ('Gfd', for example), just append the new row id to the entry you created for the first row. So Gfd's row will look like {Gfd, 1,2}.
When done processing, you will have the table that will have 'Content' strings mapped to all the rows in which this content string is present.
Do the same inverted indexing for mapping 'Range' to 'RowID' and create another table of Range (int), RowIDs (comma-separated string).
Now, you will have a table whose rows will tell which range is present in which row ids.
Finally, for each query that you have to process, get the corresponding Content and Range rows from the inverted index tables and take the intersection of those comma-separated lists. That gives you your answer.
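A rough sketch of this approach in Java (hypothetical names and data; it uses sets of row ids instead of comma-separated strings, and the Range index simply enumerates the integer values of each range, which is fine for small ranges):
import java.util.*;

public class InvertedIndexSketch {
    static final Map<String, Set<Integer>> contentIndex = new HashMap<>();
    static final Map<Integer, Set<Integer>> rangeIndex = new HashMap<>();

    // Index one row: its id, its Content strings and its Range [lo, hi].
    static void addRow(int rowId, List<String> content, int lo, int hi) {
        for (String s : content) {
            contentIndex.computeIfAbsent(s, k -> new HashSet<>()).add(rowId);
        }
        for (int v = lo; v <= hi; v++) {
            rangeIndex.computeIfAbsent(v, k -> new HashSet<>()).add(rowId);
        }
    }

    // Query: rows whose Content contains the string AND whose Range covers the value.
    static Set<Integer> query(String s, int value) {
        Set<Integer> result = new HashSet<>(contentIndex.getOrDefault(s, Collections.emptySet()));
        result.retainAll(rangeIndex.getOrDefault(value, Collections.emptySet()));
        return result;
    }

    public static void main(String[] args) {
        addRow(1, Arrays.asList("Azd", "Zax", "Gfd", "Mni"), 10, 20);
        addRow(2, Arrays.asList("Gfd", "Qwe"), 15, 30);
        System.out.println(query("Gfd", 18)); // prints [1, 2]
    }
}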
Can anyone explain to me the following lines from the Cassandra 2.1.15 WordCount example?
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "3");
CqlConfigHelper.setInputCql(job.getConfiguration(), "select * from " + COLUMN_FAMILY + " where token(id) > ? and token(id) <= ? allow filtering");
How do I define concrete values which will be used to replace "?" in the query?
And what is meant by page row size?
How do I define concrete values which will be used to replace "?" in the query?
You don't. These parameterized values are set by the splits created by the input format. They are set automatically but can be adjusted (to a degree) by adjusting the split size.
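For completeness, here is a hedged sketch of where that tuning happens (the method names below are the ones from the Cassandra 2.1 Hadoop package; treat the exact values as placeholders):
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.hadoop.mapreduce.Job;

public class SplitTuning {
    static void configure(Job job) {
        // Approximate number of rows per input split; smaller values mean more splits,
        // i.e. more (and smaller) token ranges substituted into the "?" placeholders.
        ConfigHelper.setInputSplitSize(job.getConfiguration(), 10000);
        // Number of CQL rows fetched per request within a split (the same setting as in the question).
        CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
    }
}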
And what is meant by page row size?
Page row size determines the number of CQL Rows retrieved in a single request by a mapper during execution. If a C* partition contains 10000 CQL rows and the page row size is set to 1000, it will take 10 requests to retrieve all of the data.