Cassandra WordCount Hadoop

Can anyone explain to me the following lines from the Cassandra 2.1.15 WordCount example?
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "3");
CqlConfigHelper.setInputCql(job.getConfiguration(), "select * from " + COLUMN_FAMILY + " where token(id) > ? and token(id) <= ? allow filtering");
How do I define concrete values which will be used to replace "?" in the query?
And what is meant by page row size?

How do I define concrete values which will be used to replace "?" in the query?
You don't. These parameterized values are set by the splits created by the input format. They are set automatically but can be adjusted (to a degree) by adjusting the split size.
And what is meant by page row size?
Page row size determines the number of CQL Rows retrieved in a single request by a mapper during execution. If a C* partition contains 10000 CQL rows and the page row size is set to 1000, it will take 10 requests to retrieve all of the data.
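For illustration, a minimal sketch of how those two knobs fit into the job setup. This is an assumption-laden sketch, not the WordCount example itself: the Hadoop 2 style Job.getInstance, the use of Cassandra's ConfigHelper.setInputSplitSize to influence split size, and all numeric values are illustrative.
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class WordCountConfigSketch {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        Configuration conf = job.getConfiguration();

        // How many CQL rows a mapper fetches per request while reading its split.
        CqlConfigHelper.setInputCQLPageRowSize(conf, "1000");

        // Approximate rows per split; smaller splits mean more, narrower token
        // ranges substituted for the "?" placeholders in the input CQL.
        ConfigHelper.setInputSplitSize(conf, 64 * 1024);
        return job;
    }
}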

Related

How to understand part and partition of ClickHouse?

I see that ClickHouse creates multiple directories for each partition key.
The documentation says the directory name format is: partition name, minimum data block number, maximum data block number, and chunk level. For example, the directory name is 201901_1_11_1.
I think it means that the directory is a part which belongs to partition 201901, has the blocks from 1 to 11 and is on level 1. So we can have another part whose directory is like 201901_12_21_1, which means this part belongs to partition 201901, has the blocks from 12 to 21 and is on level 1.
So I think partition is split into different parts.
Am I right?
Parts are pieces of a table that store rows. One part = one folder with column files.
Partitions are virtual entities: they have no physical representation. But you can say that a set of parts belongs to the same partition.
A SELECT does not care about partitions and is not aware of partitioning keys.
Pruning still works BECAUSE each part has special files minmax_{PARTITIONING_KEY_COLUMN}.idx.
These files contain the min and max values of those columns in that part.
These minmax values are also kept in memory, in the server's (C++ vector) list of parts.
create table X (A Int64, B Date, K Int64, C String)
Engine=MergeTree partition by (A, toYYYYMM(B)) order by K;
insert into X values (1, today(), 1, '1');
cd /var/lib/clickhouse/data/default/X/1-202002_1_1_0/
ls -1 *.idx
minmax_A.idx <-----
minmax_B.idx <-----
primary.idx
SET send_logs_level = 'debug';
select * from X where A = 555;
(SelectExecutor): MinMax index condition: (column 0 in [555, 555])
(SelectExecutor): Selected 0 parts by date
SelectExecutor checked the in-memory part list and found 0 matching parts, because minmax_A.idx holds (1, 1) while this select needs (555, 555).
ClickHouse does not store the computed partitioning-key values.
So, for example, toYYYYMM(today()) = 202002, but this 202002 is not stored in the part or anywhere else.
minmax_B.idx stores (18302, 18302), the raw underlying Date value (18302 == select toInt16(today()) on 2020-02-10).
In my case, I used groupArray() and arrayEnumerate() for ranking inside a POPULATE. I thought POPULATE would run the query against the whole partition of new data (in my case: toStartOfDay(Date)); the total sum of the newly inserted data is correct, but the groupArray() result is not.
I think this happens because when one part is inserted, ClickHouse applies groupArray() and the ranking to each part immediately, and only then merges the parts within a partition, so I never get the final result of groupArray() and arrayEnumerate().
In summary, after a merge:
[groupArray(part_1) + groupArray(part_2)] is different from
groupArray(Partition)
with
Partition = part_1 + part_2
The workaround I tried is to insert the new data as a single block, i.e. using groupArray() to reduce the new data to fewer rows than max_insert_block_size = 1048576. It gives correct results, but it's hard to insert one day of new data as a single part, because querying a whole day while populating (almost 150M-200M rows) uses too much memory.
But do you have another solution for POPULATE with groupArray() over newly inserted data, such as forcing ClickHouse to apply it per partition, not per part, after all the parts have merged into one partition?
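For reference, a minimal sketch (hypothetical table and view names, not from the thread) that reproduces the per-part behavior: a materialized view's query runs on each inserted block, so two inserts yield two groupArray results instead of one.
CREATE TABLE events (d Date, v Int64)
ENGINE = MergeTree PARTITION BY d ORDER BY v;

-- The MV query is applied per inserted block (per part), not per merged partition.
CREATE MATERIALIZED VIEW events_by_day
ENGINE = MergeTree PARTITION BY d ORDER BY d
AS SELECT d, groupArray(v) AS vs FROM events GROUP BY d;

INSERT INTO events VALUES ('2020-02-10', 1), ('2020-02-10', 2); -- part_1
INSERT INTO events VALUES ('2020-02-10', 3), ('2020-02-10', 4); -- part_2

SELECT * FROM events_by_day;
-- two rows: vs = [1,2] and vs = [3,4], not one row with [1,2,3,4]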

Loop over attribute values for executing SQL in Nifi

I would like to know how I may accomplish the following use case in a NiFi flow:
I would like to execute a SQL query for a date range in a loop, where the date ranges are provided as a list of attribute values.
For example, if my list of attribute values is 2013-01-01 2013-02-01 2013-03-01, I would like to execute the SQL operation in a loop such that:
select * from where startdate>=2013-01-01 and enddate<2013-02-01
followed by:
select * from where startdate>=2013-02-01 and enddate<2013-03-01
I have a rough idea of the flow but can't implement it concretely:
UpdateAttribute (containing list of date values) -> SplitText-> RouteOnAttribute -> ExecuteSQL
Thanks
In NiFi 1.8.0, you can use DuplicateFlowFile for this (via NIFI-5454). You can start with UpdateAttribute to add the count of delineated values in your list (let's assume it is an attribute called datelist), perhaps setting list.count to
${allDelineatedValues(${datelist}, " "):count()}
Then in DuplicateFlowFile you can set Number of Copies to ${list.count:minus(1)}. Each flow file downstream will have a copy.index attribute set (the original having index 0), so you can use that in ReplaceText in conjunction with getDelimitedField(), perhaps setting the content to the following:
select * from myTable where
startdate >= ${datelist:getDelimitedField(${copy.index:plus(1)})} and
enddate < ${datelist:getDelimitedField(${copy.index:plus(2)})}
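One caveat (an assumption to verify against your NiFi version's Expression Language guide): getDelimitedField() splits on commas by default, so for the space-delimited datelist above you may need to pass the delimiter explicitly, e.g. ${datelist:getDelimitedField(${copy.index:plus(1)}, ' ')}. With that in place, the copy with copy.index = 0 produces the first query from the question.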

How do I split birt dataset column into multiple rows

My datasource has a column that contains a comma-separated list of numbers.
I want to create a dataset that takes those numbers and turns them into groupings to use in a bar chart.
requirements
numbers will be between 0-17 inclusive
groupings: 0-2,3-5,6-10,11-17
x-axis labels have to be the groupings
y-axis is the percent of rows that contain that grouping
note that because each row can contribute to multiple columns the percentages can add up to > 100%
Any help you can offer would be awesome... I'm very new to BIRT and have been stuck on this for a couple of days now.
Not sure that I understand the requirements exactly, but your basic question "split dataset column into multiple rows" can be solved either using a scripted dataset or with pure SQL (depending on your DB).
Either way, you will need a second dataset (e.g. your data model is master-detail), and in your layout you will need something like:
Table/List "Master" bound to the master DS
Table/List "Detail" bound to the detail DS
The detail DS needs the comma-separated result column from the master DS as an input parameter of type "String".
Doing this with a scripted dataset is quite easy IFF you understand Javascript AND you understand how scripted datasets work: Create a report variable "myValues" of type object with a default value of null and a second report variable "myValuesIndex" of type integer with a default value of 0.
(Note: this is all untested!)
Create the dataset "detail" as a scripted DS, with one input parameter "csv" of type String and one output parameter "value" of type String.
In the open event of the scripted DS, code:
vars["myValues"] = this.getInputParameterValue("csv").split(",");
vars["myValuesIndex"] = 0;
In the fetch event, code:
var i = vars["myValuesIndex"];
var len = vars["myValues"].length;
if (i < len) {
    row["value"] = vars["myValues"][i];
    vars["myValuesIndex"] = i + 1;
    return true;
} else {
    return false;
}
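(In the fetch event, returning true delivers one row to the dataset; returning false signals that there are no more rows.)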
For example, for the master DS result row with csv = "1,2,3-4,foo", the detail DS will result in 4 rows with
value = "1"
value = "2"
value = "3-4"
value = "foo"
Using an Oracle DB, this can be done without Javascript. The detail DS (with the same input parameter as above) would then look like:
select t.value as value from table(split(?)) t
For the definition of the split function, see RedFilter's answer on
Is there a function to split a string in PL/SQL?
If you get ORA-22813, you should change the original definition
create or replace type split_tbl as table of varchar2(32767);
to
create or replace type split_tbl as table of varchar2(4000);
as mentioned on https://community.oracle.com/thread/2288603?tstart=0
It's also possible with pure SQL in 11g using regexp_substr (see the same page).
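A sketch of that pure-SQL variant (shown with a named bind variable :csv for readability; in the BIRT dataset you would use ? and bind the same parameter twice):
select regexp_substr(:csv, '[^,]+', 1, level) as value
from dual
connect by regexp_substr(:csv, '[^,]+', 1, level) is not null;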
Create parameters in the scripted dataset. You have to pass or link the actual dataset values to the scripted dataset's parameters through the DataSet Parameter Binding, after assigning the scripted dataset to the table.

sort and partition using random fields not just the first k fields

I'm using Hadoop Streaming to do a job, and I've run into a problem.
The input file to the mapper has, say, 3 fields in each line. I know that the mapper's output will be sorted and partitioned before the data is fed to the reducer, and my problem is:
1. Can I sort/partition the data using the 3rd field?
2. Can I sort the data using the whole line?
PS:
AFAIK, the sort key or partition key should be the first k fields of each line, right? If so, does that mean I should move those fields to the front of the line in the mapper?
The mapper's output is sorted only on the basis of the key.
So, say you have an input record: field1, field2, field3
1) If you do not want the first field to be your key and your 3rd field can serve as the key, you do not need to do anything else; just emit it as the key, like this:
output.collect(new Text(field3), new Text(field1 + "," + field2)); // Old API
context.write(new Text(field3), new Text(field1 + "," + field2)); // New API
2) Similarly, you can make the whole line the key with an empty value, which results in the data being sorted by the entire line. Something like the following can be done (using NullWritable rather than a literal null, which would fail during serialization):
output.collect(new Text(field1 + "," + field2 + "," + field3), NullWritable.get()); // Old API
context.write(new Text(field1 + "," + field2 + "," + field3), NullWritable.get()); // New API
No, it does not matter at all in which sequence the fields appear in the input file as far as sorting is concerned; it depends only on what you emit from the mapper as its output.
But if you need field1 as your key in the mapper output and want a secondary sort on field3, then read: How to do a secondary sort on values?
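Since the question is about Hadoop Streaming specifically, both cases can also be handled without a custom Java mapper, via the key-field options. A sketch, not tested; the option names are the Hadoop 2.x documented ones, while the jar path, comma separator, and field indexes are assumptions to adapt. Treating all 3 fields as the key sorts by the whole line; the -k3,3 comparator/partitioner options sort and partition on the 3rd field.
bash> hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D stream.num.map.output.key.fields=3 \
    -D mapreduce.map.output.key.field.separator=, \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options=-k3,3 \
    -D mapreduce.partition.keypartitioner.options=-k3,3 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input in -output out -mapper cat -reducer cat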

Hive cluster by vs order by vs sort by

As far as I understand:
sort by only sorts within the reducer;
order by orders things globally, but shoves everything into one reducer;
cluster by intelligently distributes stuff into reducers by the key hash and does a sort by.
So my question is: does cluster by guarantee a global order? distribute by puts the same keys into the same reducers, but what about adjacent keys?
The only document I can find on this is here and from the example it seems like it orders them globally. But from the definition I feel like it doesn't always do that.
A shorter answer: yes, CLUSTER BY guarantees global ordering, provided you're willing to join the multiple output files yourself.
The longer version:
ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.
SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, but doesn't sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.
CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.
Make sense? So CLUSTER BY is basically the more scalable version of ORDER BY.
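For quick reference, the four variants as queries (a sketch on a hypothetical table t with column x):
SELECT * FROM t ORDER BY x;      -- one reducer, one globally sorted file
SELECT * FROM t SORT BY x;       -- sorted within each reducer; ranges may overlap
SELECT * FROM t DISTRIBUTE BY x; -- rows routed to reducers by x, output not sorted
SELECT * FROM t CLUSTER BY x;    -- shorthand for DISTRIBUTE BY x plus SORT BY x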
Let me clarify first: clustered by only distributes your keys into different buckets; clustered by ... sorted by gets the buckets sorted.
With a simple experiment (see below) you can see that you will not get global order by default. The reason is that the default partitioner splits keys using hash codes regardless of actual key ordering.
However you can get your data totally ordered.
Motivation is "Hadoop: The Definitive Guide" by Tom White (3rd edition, Chapter 8, p. 274, Total Sort), where he discusses TotalOrderPartitioner.
I will answer your TotalOrdering question first, and then describe several sort-related Hive experiments that I did.
Keep in mind: what I'm describing here is a proof of concept; I was able to handle a single example using Cloudera's CDH3 distribution.
Originally I hoped that org.apache.hadoop.mapred.lib.TotalOrderPartitioner would do the trick. Unfortunately it did not, because it looks like Hive partitions by value, not key. So I patched it (it should be a subclass, but I did not have time for that):
Replace
public int getPartition(K key, V value, int numPartitions) {
    return partitions.findPartition(key);
}
with
public int getPartition(K key, V value, int numPartitions) {
    return partitions.findPartition(value);
}
Now you can set (patched) TotalOrderPartitioner as your Hive partitioner:
hive> set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
hive> set total.order.partitioner.natural.order=false;
hive> set total.order.partitioner.path=/user/yevgen/out_data2;
I also used
hive> set hive.enforce.bucketing = true;
hive> set mapred.reduce.tasks=4;
in my tests.
File out_data2 tells TotalOrderPartitioner how to bucket values.
You generate out_data2 by sampling your data. In my tests I used 4 buckets and keys from 0 to 10. I generated out_data2 using an ad-hoc approach:
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.hive.ql.io.HiveKey;
import org.apache.hadoop.fs.FileSystem;

public class TotalPartitioner extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new TotalPartitioner(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        Path partFile = new Path("/home/yevgen/out_data2");
        FileSystem fs = FileSystem.getLocal(getConf());
        HiveKey key = new HiveKey();
        NullWritable value = NullWritable.get();
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, getConf(), partFile, HiveKey.class, NullWritable.class);
        key.set(new byte[]{1, 3}, 0, 2); // partition at 3; the leading 1 came from Hive -- do not know why
        writer.append(key, value);
        key.set(new byte[]{1, 6}, 0, 2); // partition at 6
        writer.append(key, value);
        key.set(new byte[]{1, 9}, 0, 2); // partition at 9
        writer.append(key, value);
        writer.close();
        return 0;
    }
}
Then I copied the resulting out_data2 to HDFS (into /user/yevgen/out_data2).
With these settings I got my data bucketed/sorted (see the last item in my experiment list).
Here are my experiments.
Create sample data
bash> echo -e "1\n3\n2\n4\n5\n7\n6\n8\n9\n0" > data.txt
Create basic test table:
hive> create table test(x int);
hive> load data local inpath 'data.txt' into table test;
Basically this table contains values from 0 to 9 without order.
Demonstrate how table copying works (really, how the mapred.reduce.tasks parameter behaves; it sets the MAXIMAL number of reduce tasks to use):
hive> create table test2(x int);
hive> set mapred.reduce.tasks=4;
hive> insert overwrite table test2
select a.x from test a
join test b
on a.x=b.x; -- dummy join to force non-trivial map-reduce
bash> hadoop fs -cat /user/hive/warehouse/test2/000001_0
1
5
9
Demonstrate bucketing. You can see that keys are assigned at random without any sort order:
hive> create table test3(x int)
clustered by (x) into 4 buckets;
hive> set hive.enforce.bucketing = true;
hive> insert overwrite table test3
select * from test;
bash> hadoop fs -cat /user/hive/warehouse/test3/000000_0
4
8
0
Bucketing with sorting. Results are partially sorted, not totally sorted
hive> create table test4(x int)
clustered by (x) sorted by (x desc)
into 4 buckets;
hive> insert overwrite table test4
select * from test;
bash> hadoop fs -cat /user/hive/warehouse/test4/000001_0
1
5
9
You can see that values are sorted in ascending order, even though the table was declared sorted by (x desc). Looks like a Hive bug in CDH3?
Getting partially sorted data without a cluster by statement:
hive> create table test5 as
select x
from test
distribute by x
sort by x desc;
bash> hadoop fs -cat /user/hive/warehouse/test5/000001_0
9
5
1
Use my patched TotalOrderPartitioner:
hive> set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
hive> set total.order.partitioner.natural.order=false;
hive> set total.order.partitioner.path=/user/training/out_data2;
hive> create table test6(x int)
clustered by (x) sorted by (x) into 4 buckets;
hive> insert overwrite table test6
select * from test;
bash> hadoop fs -cat /user/hive/warehouse/test6/000000_0
1
2
0
bash> hadoop fs -cat /user/hive/warehouse/test6/000001_0
3
4
5
bash> hadoop fs -cat /user/hive/warehouse/test6/000002_0
7
6
8
bash> hadoop fs -cat /user/hive/warehouse/test6/000003_0
9
CLUSTER BY does not produce global ordering.
The accepted answer (by Lars Yencken) misleads by stating that the reducers receive non-overlapping ranges. As Anton Zaviriukhin correctly points out with the BucketedTables documentation, CLUSTER BY is basically DISTRIBUTE BY (the same as bucketing) plus SORT BY within each bucket/reducer. DISTRIBUTE BY simply hashes and mods into buckets, and while the hash function may preserve order (hash of i > hash of j if i > j), the mod of the hash value does not.
Here's a better example showing overlapping ranges
http://myitlearnings.com/bucketing-in-hive/
As I understand it, the short answer is no: you'll get overlapping ranges.
From SortBy documentation:
"Cluster By is a short-cut for both Distribute By and Sort By."
"All rows with the same Distribute By columns will go to the same reducer."
But there is no statement there that Distribute By guarantees non-overlapping ranges.
Moreover, from DDL BucketedTables documentation:
"How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets."
I suppose that Cluster By in a Select statement uses the same principle to distribute rows between reducers, because its main use is populating bucketed tables with data.
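As a concrete instance of that formula: for an int column Hive's hash is the value itself, so with num_buckets = 2 a value a lands in bucket a mod 2, which is exactly the even/odd split observed in the experiment below.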
I created a table with 1 integer column "a", and inserted numbers from 0 to 9 there.
Then I set the number of reducers to 2:
set mapred.reduce.tasks = 2;
And selected data from this table with a Cluster By clause:
select * from my_tab cluster by a;
And received the result I expected:
0
2
4
6
8
1
3
5
7
9
So the first reducer (number 0) got the even numbers (because their value mod 2 gives 0)
and the second reducer (number 1) got the odd numbers (because their value mod 2 gives 1).
So that's how "Distribute By" works.
And then "Sort By" sorts the results inside each reducer.
Use case: when there is a large dataset, one should go for sort by, since with sort by all of the reducers sort the data internally before it is clubbed together, which improves performance. With order by, performance for a larger dataset suffers because all of the data is passed through a single reducer, which increases the load and hence makes the query take longer to execute.
Please see the example below, run on an 11-node cluster.
(The original answer showed screenshots of the Order By, Sort By, and Cluster By example outputs here.)
What I observed: the figures for sort by, cluster by, and distribute by are the SAME, but the internal mechanism is different. In DISTRIBUTE BY, rows with the same column value go to the same reducer, e.g. DISTRIBUTE BY(City): Bangalore data goes to one reducer, Delhi data to another.
Cluster by is per-reducer sorting, not global. Many books also describe it incorrectly or confusingly. It has a particular use case: say you distribute each department to a specific reducer and then sort by employee name within each department, but do not care about the order of department numbers; then cluster by is the construct to use, and it is more performant because the workload is distributed among the reducers. See the sketch below.
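A sketch of that use case (hypothetical table and column names; note that distributing on one column while sorting within each reducer on another strictly calls for DISTRIBUTE BY ... SORT BY, of which CLUSTER BY is the special case where both column lists match):
select dept_no, emp_name
from employees
distribute by dept_no
sort by emp_name;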
SortBy: N or more sorted files with overlapping ranges.
OrderBy: single output, i.e. fully ordered.
Distribute By: each of N reducers gets non-overlapping ranges of the column, but the output of each reducer is not sorted.
For more information: http://commandstech.com/hive-sortby-vs-orderby-vs-distributeby-vs-clusterby/
ClusterBy: referring to the same example as above, if we use Cluster By x, the two reducers will further sort the rows on x.
If I understood it correctly:
1. sort by - only sorts the data within the reducer.
2. order by - orders things globally by pushing the entire data set to a single reducer. If we have a lot of (skewed) data, this process will take a lot of time.
3. cluster by - intelligently distributes stuff into reducers by the key hash and does a sort by, but does not guarantee global ordering. One key (k1) can be placed into two reducers: the 1st reducer may get 10K rows of k1 data, the second one might get 1K rows.
