HBase: Getting a row key based on a column qualifier value - filter

I have an HBase table and know the column family, column qualifier, and value. I need to get the row key. I tried SingleColumnValueFilter, but it took around 2 hours to scan the entire table, which contains billions of records. I need a better approach that returns the row key in minutes rather than hours. Please help me out.
Below is my code:
SingleColumnValueFilter colValFilter = new SingleColumnValueFilter(Bytes.toBytes(<CF>),
Bytes.toBytes(<CQ>), CompareFilter.CompareOp.EQUAL,
new BinaryComparator(Bytes.toBytes(<VALUE>)));

Related

Too many or too few rows when using different engines when creating one table from another

I'm trying to create one table from another using
CREATE TABLE IF NOT EXISTS new_data ENGINE = ReplicatedReplacingMergeTree('/clickhouse/fedor/tables/{shard}/subfolder/new_data', '{replica}')
ORDER BY created_at
SETTINGS index_granularity = 8192, allow_nullable_key=TRUE
AS
SELECT *
FROM table
WHERE column IS NOT NULL
When I use
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/fedor/tables/{shard}/subfolder/new_data', '{replica}'),
I get only around 7-9% of the number of rows returned by the SELECT...FROM...WHERE query.
When I use
ENGINE = ReplicatedMergeTree('/clickhouse/fedor/tables/{shard}/subfolder/new_data', '{replica}')
I get 3 times more rows than expected (I assume every row occurs exactly 3 times).
I would like to have the exact number of rows, without losses and without duplication.
ReplicatedReplacingMergeTree with ORDER BY created_at
will collapse all rows that share the same created_at value into a single row.
How did you delete the existing table data before creating the table with
ReplicatedMergeTree('/clickhouse/fedor/tables/{shard}/subfolder/new_data'...)?
Did you use DROP TABLE new_data SYNC?
Which engine does the source table (table) use?
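The collapsing behaviour of ReplacingMergeTree mentioned above can be illustrated with a small Java sketch. This is not ClickHouse code; it only mimics what a fully merged ReplacingMergeTree keeps, namely one row per distinct sorting-key value. The Row class and the collapseBySortingKey helper are hypothetical names made up for this illustration, and in ClickHouse itself the deduplication happens during background merges rather than at insert time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReplacingSketch {
    // Hypothetical row: createdAt plays the role of the ORDER BY key,
    // payload stands for all the other columns.
    static final class Row {
        final String createdAt;
        final String payload;

        Row(String createdAt, String payload) {
            this.createdAt = createdAt;
            this.payload = payload;
        }
    }

    // Keep only the last-seen row for each sorting-key value,
    // which is the effect of a complete merge without a version column.
    static List<Row> collapseBySortingKey(List<Row> rows) {
        Map<String, Row> lastByKey = new LinkedHashMap<>();
        for (Row row : rows) {
            lastByKey.put(row.createdAt, row);
        }
        return new ArrayList<>(lastByKey.values());
    }
}
```

If many source rows share a created_at value, the result has as many rows as there are distinct created_at values, which would explain a much smaller row count than the SELECT returns.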

Deletion is slow in oracle DB

We have a table that doesn't have much data. The table has 3 partitions and we are deleting data in one partition only.
delete from table AB partition(A) where id=value;
Here id has an index as well, but the delete is still slow.
The datatype of id is VARCHAR2, and the value is a number.
Please help me to understand why the delete statement is slow.
I don't think the index is of much use in this case. The database has to evaluate every single row in the partition to see whether it matches id=value. Typically this will be a full table scan, and no index will be used, so how long it takes depends entirely on the number of rows in the partition. But maybe I did not understand the question properly: I presumed "value" is a column in the same table, like ID.

How to sample for each group in hive?

I have a large table in Hive that has 1.5 billion+ values. One of the columns is category_id, which has ~20 distinct values. I want to sample the table such that I have 1 million values for each category.
I checked out "Random sample table with Hive, but including matching rows" and "Hive: Creating smaller table from big table", and I figured out how to get a random sample from the entire table, but I'm still unable to figure out how to get a sample for each category_id.
I understand you want to sample your table into multiple files. You might want to check Hive bucketing or dynamic partitions to balance your records between multiple folders/files.
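For the per-category requirement itself, one general technique is reservoir sampling with a separate reservoir for each category_id. The Java sketch below is an in-memory illustration of that idea only, not Hive code; the class name, method names, and the use of plain strings for rows are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class PerGroupSampler {
    private final int perGroup;
    private final Random random;
    private final Map<String, List<String>> reservoirs = new HashMap<>();
    private final Map<String, Long> seen = new HashMap<>();

    PerGroupSampler(int perGroup, long seed) {
        this.perGroup = perGroup;
        this.random = new Random(seed);
    }

    // Offer one row; each category keeps a uniform sample of at most perGroup rows.
    void offer(String categoryId, String row) {
        long count = seen.merge(categoryId, 1L, Long::sum);
        List<String> reservoir = reservoirs.computeIfAbsent(categoryId, k -> new ArrayList<>());
        if (reservoir.size() < perGroup) {
            reservoir.add(row); // reservoir not full yet: always keep the row
            return;
        }
        // Pick a slot in [0, count); the row is kept only if the slot
        // falls inside the reservoir, i.e. with probability perGroup / count.
        long slot = (long) (random.nextDouble() * count);
        if (slot < perGroup) {
            reservoir.set((int) slot, row);
        }
    }

    Map<String, List<String>> samples() {
        return reservoirs;
    }
}
```

Categories with fewer rows than the requested sample size are simply returned in full. The same pass-once logic is what a per-group sampling query has to achieve, whatever engine runs it.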

comparing data in two tables taking time

I need to query table 1 to find all orders and their created dates (the key is order number and date).
In table 2 (the key is also order number and date), I need to check whether the order exists for a given date.
For this I am scanning table 1 and, for each record, checking whether it exists in table 2. Is there a better way to do this?
In this situation, in which the key is identical for both tables, it makes sense to have a single table in which you store the data for both Table 1 and Table 2. That way you can do a single scan on your data and know straight away whether the data exists for both criteria.
Even more so, if you want to use this data in MapReduce, you would simply scan that single table. If you only want to get the relevant rows, you could define a filter on the Scan. For example, in the case where you will not be populating rows at all in Table 2, you would simply use a ColumnPrefixFilter.
If, however, you do need to keep this data separately in 2 tables, you could pre-split the tables with the same region boundaries for both tables. This will be helpful for the query you are aiming for: load all rows in Table 1 for which a row exists in Table 2. Essentially this would be a map-side join. You could define multiple inputs in your MapReduce job, and since the region borders are the same, the splits will be such that each mapper gets the corresponding rows from both tables. You would probably need to implement your own MultipleInput format for that (the MultiTableInputFormat class recently introduced in 0.96 does not seem to do that map-side join).
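The map-side join described above relies on both inputs arriving sorted by the same key, so that matches can be found in one pass instead of one lookup per record. Here is a minimal Java sketch of that merge step. It is not HBase or MapReduce code: plain strings stand in for the row keys (order number plus date), the class and method names are hypothetical, and it assumes both lists are sorted in the same order and contain no duplicate keys.

```java
import java.util.ArrayList;
import java.util.List;

final class SortedKeyMerge {
    // Return the keys present in both sorted lists, using a single pass over each.
    static List<String> keysInBoth(List<String> sortedKeys1, List<String> sortedKeys2) {
        List<String> matches = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < sortedKeys1.size() && j < sortedKeys2.size()) {
            int cmp = sortedKeys1.get(i).compareTo(sortedKeys2.get(j));
            if (cmp == 0) {
                matches.add(sortedKeys1.get(i)); // key exists in both tables
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // key only in table 1 so far: advance table 1
            } else {
                j++; // key only in table 2 so far: advance table 2
            }
        }
        return matches;
    }
}
```

The cost is proportional to the combined length of the two inputs, which is the advantage over checking table 2 once for every record of table 1.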

Split data table into multiple data tables on the basis of dates in a column

I have a specific issue with a data table: I have to split it dynamically on the basis of the dates in one of its columns.
For example: [linked image not included]
Now, according to the dates, I should get 3 data tables, each named after its date.
I cannot use LINQ, so how can I resolve this problem?
Thanks.
You can sort the table by the date column, then go over the rows and create a new table for every new date.
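The sort-then-scan idea can be sketched as follows. The original data table and its column names are not shown, so this is a minimal Java illustration under assumptions: rows are modelled as string arrays, dateColumn is a hypothetical column index, and dates are assumed to be stored as text. The same loop carries over to any table API that lets you sort and iterate rows.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SplitByDate {
    // Returns one list of rows per distinct date, keyed (named) by that date.
    static Map<String, List<String[]>> split(List<String[]> rows, int dateColumn) {
        // Sort a copy so that rows with the same date become adjacent.
        // With ISO dates (yyyy-MM-dd) this order is also chronological.
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing((String[] row) -> row[dateColumn]));

        Map<String, List<String[]>> tables = new LinkedHashMap<>();
        String currentDate = null;
        List<String[]> currentTable = null;
        for (String[] row : sorted) {
            String date = row[dateColumn];
            if (!date.equals(currentDate)) {
                // A new date starts here: open a new table for it.
                currentDate = date;
                currentTable = new ArrayList<>();
                tables.put(date, currentTable);
            }
            currentTable.add(row);
        }
        return tables;
    }
}
```

Input with three distinct dates yields three entries in the returned map, each holding only the rows for its date.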
