Why is ClickHouse writing batch records slowly?

I use clickhouse-jdbc to write data to a distributed "all" table in ClickHouse (3 hosts, 3 shards, 1 replica).
Committing in batches of 5,000 with a PreparedStatement takes 6,280 s for 1,000,000 records.
...
ps.setString(68, dateTimeStr);
ps.setDate(69, date);
ps.addBatch();
System.out.println("i: " + i);
if (i % 5000 == 0 || i == maxRecords) {
    System.out.println(new java.util.Date());
    ps.executeBatch();
    System.out.println(new java.util.Date());
    // ps.execute();
    conn.commit();
    System.out.println("commit: " + new java.util.Date());
}
...
Is there a better way to insert one hundred million records a day?

Yes, inserts into a Distributed engine can be slow because of the extra work that has to happen on each insert operation (fsyncing the data to the specific shard, etc.).
You can try to tune some settings, which are described in the link above.
However, I found it much more convenient and faster to write into the underlying table directly. That requires you to take care of sharding yourself, i.e. spreading the data across all of your nodes to your own taste.
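For illustration only, here is a minimal sketch of that approach with clickhouse-jdbc, assuming a local underlying table named events_local on each shard, a two-column schema, and a simple hash on the first column as the sharding rule; the hostnames, table, and columns are placeholders, not taken from the question.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class LocalShardWriter {
    // One JDBC URL per shard: writes go to the local (underlying) table on each host,
    // bypassing the Distributed engine entirely. Hostnames and table name are placeholders.
    private static final String[] SHARD_URLS = {
        "jdbc:clickhouse://host1:8123/db",
        "jdbc:clickhouse://host2:8123/db",
        "jdbc:clickhouse://host3:8123/db"
    };

    public static void write(List<String[]> rows) throws Exception {
        Connection[] conns = new Connection[SHARD_URLS.length];
        PreparedStatement[] stmts = new PreparedStatement[SHARD_URLS.length];
        for (int s = 0; s < SHARD_URLS.length; s++) {
            conns[s] = DriverManager.getConnection(SHARD_URLS[s]);
            stmts[s] = conns[s].prepareStatement("INSERT INTO events_local (key, value) VALUES (?, ?)");
        }
        int batchSize = 50000; // ClickHouse generally prefers fewer, larger batches
        int i = 0;
        for (String[] row : rows) {
            int shard = Math.floorMod(row[0].hashCode(), SHARD_URLS.length); // you own the sharding rule
            stmts[shard].setString(1, row[0]);
            stmts[shard].setString(2, row[1]);
            stmts[shard].addBatch();
            if (++i % batchSize == 0) {
                for (PreparedStatement ps : stmts) ps.executeBatch();
            }
        }
        for (int s = 0; s < SHARD_URLS.length; s++) {
            stmts[s].executeBatch();
            stmts[s].close();
            conns[s].close();
        }
    }
}
With this layout the "all" Distributed table is still useful for reads, while writes go straight to each shard's local table in large batches.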

Related

Spring batch Remote partitioning : Pushing Huge data in kafka during partition

I have implemented Spring Batch remote partitioning. Now I have to split 10 billion IDs across partitions. The IDs will be fetched from Elasticsearch and pushed into partitions, which in turn will be pushed into Kafka.
@Override
public Map<String, ExecutionContext> partition(int gridSize) {
    Map<String, ExecutionContext> map = new HashMap<>(gridSize);
    AtomicInteger partitionNumber = new AtomicInteger(1);
    try {
        for (int i = 0; i < n; i++) {
            List<Integer> ids = // fetch ids from elastic
            map.put("partition" + partitionNumber.getAndIncrement(), context);
        }
        System.out.println("Partitions Created");
    } catch (IOException e) {
        e.printStackTrace();
    }
    return map;
}
I cannot fetch and push all IDs into the map at once, otherwise I will run out of memory. I want IDs to be pushed into the queue and only then have the next IDs fetched.
Can this be done with Spring Batch?
If you want to use partitioning, you have to find a way to partition the input dataset with a given key. Without a partition key, you can't really use partitioning (with or without Spring Batch).
If your IDs are defined by a sequence that can be divided into partitions, you don't have to fetch 10 billion IDs, partition them, and put each partition (i.e. all IDs of each partition) in the execution context of workers. What you can do is find the max ID, create ranges of IDs, and assign them to distinct workers (see the sketch after this example). For example:
Partition 1: 0 - 10000
Partition 2: 10001 - 20000
etc
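A minimal sketch of such a range-based partitioner, assuming the max ID is known up front (e.g. from a single Elasticsearch query) and that each worker reads minId/maxId from its execution context; the class and key names are illustrative, not taken from the original code.
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class IdRangePartitioner implements Partitioner {

    private final long maxId; // looked up once, instead of fetching 10 billion IDs

    public IdRangePartitioner(long maxId) {
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>(gridSize);
        long rangeSize = (maxId + gridSize - 1) / gridSize; // ceiling division
        for (int p = 0; p < gridSize; p++) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", p * rangeSize);                            // inclusive lower bound
            context.putLong("maxId", Math.min((p + 1) * rangeSize - 1, maxId)); // inclusive upper bound
            partitions.put("partition" + (p + 1), context);
        }
        return partitions;
    }
}
Each worker step then queries only the IDs in its own [minId, maxId] range, so nothing close to the full ID set is ever held in memory on the manager side.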
If your IDs are not defined by a sequence and cannot be partitioned by range, then you need to find another key (or a composite key) that allows you to partition the data based on some other criterion. Otherwise, (remote) partitioning is not an option for you.

Add field to existing documents over million records

Scenario
We have over 5 million documents in a bucket, all of them nested JSON with a simple UUID key. We want to add one extra field to ALL of the documents.
Example
ee6ae656-6e07-4aa2-951e-ea788e24856a
{
    "field1": "data1",
    "field2": {
        "nested_field1": "data2"
    }
}
After adding extra field
ee6ae656-6e07-4aa2-951e-ea788e24856a
{
    "field1": "data1",
    "field3": "data3",
    "field2": {
        "nested_field1": "data2"
    }
}
It has only one primary index: CREATE PRIMARY INDEX idx ON bucket.
Problem
It takes ages. We tried it with N1QL (UPDATE bucket SET field3 = "data3") and also with sub-document mutations, but all of it takes hours. It's written in Go, so we could put it into goroutines, but it still takes too much time.
Question
Is there any solution to reduce that time?
As you need to add a new field, not modify any existing field, it is better to use the SDK's sub-document (SUBDOC) API than a N1QL UPDATE (which is a whole-document update and requires fetching the document).
The best option is to use N1QL to get the document keys, then use the SDK SUBDOC API to add the field you need. You can use the reactive API (asynchronously).
You have 5M documents and a primary index, so use the following keyset-pagination loop:
val = ""
In a loop:
    SELECT RAW META().id FROM mybucket WHERE META().id > $val LIMIT 10000;
    SDK SUBDOC update for the returned keys
    val = last value from the SELECT
https://blog.couchbase.com/offset-keyset-pagination-n1ql-query-couchbase/
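For illustration, a hedged sketch of that loop with the Couchbase Java SDK 3.x (the question's system is written in Go, so treat this purely as a sketch of keyset pagination plus a sub-document upsert; the bucket name and credentials are placeholders):
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.MutateInSpec;
import com.couchbase.client.java.query.QueryOptions;
import com.couchbase.client.java.query.QueryResult;
import java.util.Collections;
import java.util.List;

public class AddFieldToAllDocs {
    public static void main(String[] args) {
        Cluster cluster = Cluster.connect("couchbase://127.0.0.1", "user", "password");
        Collection col = cluster.bucket("mybucket").defaultCollection();

        String lastKey = ""; // keyset cursor, starts before the first key
        while (true) {
            QueryResult result = cluster.query(
                "SELECT RAW META().id FROM mybucket WHERE META().id > $val ORDER BY META().id LIMIT 10000",
                QueryOptions.queryOptions().parameters(JsonObject.create().put("val", lastKey)));
            List<String> ids = result.rowsAs(String.class);
            if (ids.isEmpty()) break;
            for (String id : ids) {
                // Sub-document upsert: adds field3 without fetching and rewriting the whole document.
                List<MutateInSpec> specs = Collections.singletonList(MutateInSpec.upsert("field3", "data3"));
                col.mutateIn(id, specs);
            }
            lastKey = ids.get(ids.size() - 1); // next page starts after the last key seen
        }
        cluster.disconnect();
    }
}
The same loop can be driven through the reactive API to issue the mutateIn calls concurrently instead of one at a time.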
The Eventing Service can be quite performant for these sorts of enrichment tasks. Even a low-end system should be able to do 5M rows in under two (2) minutes.
// Note src_bkt is an alias to the source bucket for your handler
// in read+write mode supported for version 6.5.1+, this uses DCP
// and can be 100X more performant than N1QL.
function OnUpdate(doc, meta) {
    // optional filter to be more selective
    // if (!doc.type || doc.type !== "mytype") return;
    // test if we already have the field we want to add
    if (doc.field3) return;
    doc.field3 = "data3";
    src_bkt[meta.id] = doc;
}
For more details on Eventing, refer to https://docs.couchbase.com/server/current/eventing/eventing-overview.html. I typically enrich 3/4 of a billion documents. The Eventing function will also run faster (enrich more documents per second) if you increase the number of workers in your Eventing function's settings from, say, 3 to 16, provided you have 8+ physical cores on your Eventing node.
I tested the above Eventing function and it enriches 5M documents (modeled on your example) on my non-MDS single node couchbase test system (12 cores at 2.2GHz) in just 72 seconds. Obviously if you have a real multi node cluster it will be faster (maybe all 5M docs in just 5 seconds).

Hbase quickly count number of rows

Right now I implement the row count over a ResultScanner like this:
for (Result rs = scanner.next(); rs != null; rs = scanner.next()) {
    number++;
}
If the data reaches millions of rows, the computation takes a long time. I want to compute the count in real time; I don't want to use MapReduce.
How can I quickly count the number of rows?
Use RowCounter in HBase
RowCounter is a mapreduce job to count all the rows of a table. This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency. It will run the mapreduce all in a single process but it will run faster if you have a MapReduce cluster in place for it to exploit.
$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename>
Usage: RowCounter [options]
<tablename> [
--starttime=[start]
--endtime=[end]
[--range=[startKey],[endKey]]
[<column1> <column2>...]
]
You can use the count method in HBase to count the number of rows. But yes, counting rows of a large table can be slow.
count 'tablename' [interval]
The return value is the number of rows.
This operation may take a LONG time (Run ‘$HADOOP_HOME/bin/hadoop jar
hbase.jar rowcount’ to run a counting mapreduce job). Current count is shown
every 1000 rows by default. Count interval may be optionally specified. Scan
caching is enabled on count scans by default. Default cache size is 10 rows.
If your rows are small in size, you may want to increase this
parameter.
Examples:
hbase> count 't1'
hbase> count 't1', INTERVAL => 100000
hbase> count 't1', CACHE => 1000
hbase> count 't1', INTERVAL => 10, CACHE => 1000
The same commands also can be run on a table reference. Suppose you had a reference to table 't1', the corresponding commands would be:
hbase> t.count
hbase> t.count INTERVAL => 100000
hbase> t.count CACHE => 1000
hbase> t.count INTERVAL => 10, CACHE => 1000
If you cannot use RowCounter for whatever reason, then a combination of these two filters should be an optimal way to get a count:
FirstKeyOnlyFilter() AND KeyOnlyFilter()
The FirstKeyOnlyFilter will result in the scanner only returning the first column qualifier it finds, as opposed to the scanner returning all of the column qualifiers in the table, which will minimize the network bandwidth. What about simply picking one column qualifier to return? This would work if you could guarantee that column qualifier exists for every row, but if that is not true then you would get an inaccurate count.
The KeyOnlyFilter will result in the scanner only returning the column family, and will not return any value for the column qualifier. This further reduces the network bandwidth, which in the general case wouldn't account for much of a reduction, but there can be an edge case where the first column picked by the previous filter just happens to be an extremely large value.
I tried playing around with scan.setCaching but the results were all over the place. Perhaps it could help.
I had 16 million rows between a start and stop row, on which I did the following pseudo-empirical testing:
With FirstKeyOnlyFilter and KeyOnlyFilter activated:
With caching not set (i.e., the default value), it took 188 seconds.
With caching set to 1, it took 188 seconds.
With caching set to 10, it took 200 seconds.
With caching set to 100, it took 187 seconds.
With caching set to 1000, it took 183 seconds.
With caching set to 10000, it took 199 seconds.
With caching set to 100000, it took 199 seconds.
With FirstKeyOnlyFilter and KeyOnlyFilter disabled:
With caching not set (i.e., the default value), it took 309 seconds.
I didn't bother to do proper testing on this, but it seems clear that the FirstKeyOnlyFilter and KeyOnlyFilter are good.
Moreover, the cells in this particular table are very small - so I think the filters would have been even better on a different table.
Here is a Java code sample:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
public class HBaseCount {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "my_table");
        Scan scan = new Scan(
            Bytes.toBytes("foo"), Bytes.toBytes("foo~")
        );
        if (args.length == 1) {
            scan.setCaching(Integer.valueOf(args[0]));
        }
        System.out.println("scan's caching is " + scan.getCaching());
        FilterList allFilters = new FilterList();
        allFilters.addFilter(new FirstKeyOnlyFilter());
        allFilters.addFilter(new KeyOnlyFilter());
        scan.setFilter(allFilters);
        ResultScanner scanner = table.getScanner(scan);
        int count = 0;
        long start = System.currentTimeMillis();
        try {
            for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
                count += 1;
                if (count % 100000 == 0) System.out.println(count);
            }
        } finally {
            scanner.close();
        }
        long end = System.currentTimeMillis();
        long elapsedTime = end - start;
        System.out.println("Elapsed time was " + (elapsedTime / 1000F));
    }
}
Here is a pychbase code sample:
from pychbase import Connection
c = Connection()
t = c.table('my_table')
# Under the hood this applies the FirstKeyOnlyFilter and KeyOnlyFilter
# similar to the happybase example below
print t.count(row_prefix="foo")
Here is a Happybase code sample:
from happybase import Connection
c = Connection(...)
t = c.table('my_table')
count = 0
for _ in t.scan(filter='FirstKeyOnlyFilter() AND KeyOnlyFilter()'):
count += 1
print count
Thanks to @Tuckr and @KennyCason for the tip.
Use the HBase rowcount map/reduce job that's included with HBase
Simple, effective, and efficient way to count rows in HBase:
Whenever you insert a row, trigger this API, which will increment that particular cell:
Htable.incrementColumnValue(Bytes.toBytes("count"), Bytes.toBytes("details"), Bytes.toBytes("count"), 1);
To check the number of rows in that table, just use the "Get" or "Scan" API for that particular row 'count'.
Using this method you can get the row count in less than a millisecond.
To count the HBase table record count on a proper YARN cluster you have to set the MapReduce job queue name as well:
hbase org.apache.hadoop.hbase.mapreduce.RowCounter -Dmapreduce.job.queuename=<your queue name with SUBMIT access> <TABLE_NAME>
You can use the coprocessor mechanism, which has been available since HBase 0.92. See Coprocessor and AggregateProtocol and the example.
Two ways worked for me to get the row count of an HBase table with speed.
Scenario #1
If the HBase table size is small, then log in to the HBase shell with a valid user and execute:
>count '<tablename>'
Example
>count 'employee'
6 row(s) in 0.1110 seconds
Scenario #2
If the HBase table size is large, then execute the inbuilt RowCounter MapReduce job:
Log in to the Hadoop machine with a valid user and execute:
/$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter '<tablename>'
Example:
/$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'employee'
....
....
....
Virtual memory (bytes) snapshot=22594633728
Total committed heap usage (bytes)=5093457920
org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
ROWS=6
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
If you're using a scanner, try to have it return the fewest qualifiers possible. In fact, the qualifier(s) that you do return should be the smallest (in byte size) you have available. This will speed up your scan tremendously.
Unfortunately this will only scale so far (millions to billions?). To take it further, you can do this in real time, but you will first need to run a MapReduce job to count all rows.
Store the Mapreduce output in a cell in HBase. Every time you add a row, increment the counter by 1. Every time you delete a row, decrement the counter.
When you need to access the number of rows in real time, you read that field in HBase.
There is no fast way to count the rows otherwise in a way that scales. You can only count so fast.
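A minimal sketch of that counter pattern with the HBase client API (the counters table, column family, and qualifier here are made-up names for illustration):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowCountCounter {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(config);
             Table counters = connection.getTable(TableName.valueOf("counters"))) {

            // Call this alongside every insert into the data table (+1), and with -1 on delete.
            counters.incrementColumnValue(
                Bytes.toBytes("my_table"), // row: one counter row per data table
                Bytes.toBytes("c"),        // column family
                Bytes.toBytes("rows"),     // qualifier
                1L);

            // Reading the current count is a single Get, i.e. effectively real time.
            Result r = counters.get(new Get(Bytes.toBytes("my_table")));
            long rowCount = Bytes.toLong(r.getValue(Bytes.toBytes("c"), Bytes.toBytes("rows")));
            System.out.println("row count = " + rowCount);
        }
    }
}
The counter is only as accurate as your discipline in incrementing and decrementing it on every write path, which is why the MapReduce job is still needed to seed (or re-seed) the initial value.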
You can find a sample example here:
/**
 * Used to get the number of rows of the table
 * @param tableName
 * @param familyNames
 * @return the number of rows
 * @throws IOException
 */
public long countRows(String tableName, String... familyNames) throws IOException {
    long rowCount = 0;
    Configuration configuration = connection.getConfiguration();
    // Increase RPC timeout, in case of a slow computation
    configuration.setLong("hbase.rpc.timeout", 600000);
    // Default is 1, set to a higher value for faster scanner.next(..)
    configuration.setLong("hbase.client.scanner.caching", 1000);
    AggregationClient aggregationClient = new AggregationClient(configuration);
    try {
        Scan scan = new Scan();
        if (familyNames != null && familyNames.length > 0) {
            for (String familyName : familyNames) {
                scan.addFamily(Bytes.toBytes(familyName));
            }
        }
        rowCount = aggregationClient.rowCount(TableName.valueOf(tableName), new LongColumnInterpreter(), scan);
    } catch (Throwable e) {
        throw new IOException(e);
    }
    return rowCount;
}
Go to the HBase home directory and run this command:
./bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'namespace:tablename'
This will launch a MapReduce job and the output will show the number of records in the HBase table.
You could try the HBase API methods:
org.apache.hadoop.hbase.client.coprocessor.AggregationClient

SQL Server 2008 R2 Express DataReader performance

I have a database that contains 250,000 records. I am using a DataReader to loop the records and export to a file. Just looping the records with a DataReader and no WHERE conditions is taking approx 22 minutes. I am only selecting two columns (the id and a nvarchar(max) column with about 1000 characters in it).
Does 22 minutes sound correct for SQL Server Express? Would the 1 GB of RAM or 1 CPU have an impact on this?
22 minutes sounds way too long for a single basic (non-aggregating) SELECT against 250K records (even 22 seconds sounds awfully long for that to me).
To say why, it would help if you could post some code and your schema definition. Do you have any triggers configured?
With 1K characters in each record (2KB), 250K records (500MB) should fit within SQL Express' 1GB limit, so memory shouldn't be an issue for that query alone.
Possible causes of the performance problems you're seeing include:
Contention from other applications
Having rows that are much wider than just the two columns you mentioned
Excessive on-disk fragmentation of either the table or the DB MDF file
A slow network connection between your app and the DB
Update: I did a quick test. On my machine, reading 250K 2KB rows with a SqlDataReader takes under 1 second.
First, create a test table with 256K rows (this only took about 30 seconds):
CREATE TABLE dbo.data (num int PRIMARY KEY, val nvarchar(max))
GO
DECLARE @txt nvarchar(max)
SET @txt = N'put 1000 characters here....'
INSERT dbo.data VALUES (1, @txt);
GO
INSERT dbo.data
    SELECT num + (SELECT COUNT(*) FROM dbo.data), val FROM dbo.data
GO 18
Test web page to read data and display the statistics:
using System;
using System.Collections;
using System.Data.SqlClient;
using System.Text;

public partial class pages_default
{
    protected override void OnLoad(EventArgs e)
    {
        base.OnLoad(e);
        using (SqlConnection conn = new SqlConnection(DAL.ConnectionString))
        {
            using (SqlCommand cmd = new SqlCommand("SELECT num, val FROM dbo.data", conn))
            {
                conn.Open();
                conn.StatisticsEnabled = true;
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                    }
                }
                StringBuilder result = new StringBuilder();
                IDictionary stats = conn.RetrieveStatistics();
                foreach (string key in stats.Keys)
                {
                    result.Append(key);
                    result.Append(" = ");
                    result.Append(stats[key]);
                    result.Append("<br/>");
                }
                this.info.Text = result.ToString();
            }
        }
    }
}
Results (ExecutionTime in milliseconds):
IduRows = 0
Prepares = 0
PreparedExecs = 0
ConnectionTime = 930
SelectCount = 1
Transactions = 0
BytesSent = 88
NetworkServerTime = 0
SumResultSets = 1
BuffersReceived = 66324
BytesReceived = 530586745
UnpreparedExecs = 1
ServerRoundtrips = 1
IduCount = 0
BuffersSent = 1
ExecutionTime = 893
SelectRows = 262144
CursorOpens = 0
I repeated the test with SQL Enterprise and SQL Express, with similar results.
Capturing the "val" element from each row increased ExecutionTime to 4093 ms (string val = (string)reader["val"];). Using DataTable.Load(reader) took about 4600 ms.
Running the same query in SSMS took about 8 seconds to capture all 256K rows.
Your results from running exec sp_spaceused myTable provide a potential hint:
rows = 255,000
reserved = 1994320 KB
data = 1911088 KB
index_size = 82752 KB
unused = 480 KB
The important thing to note here is reserved = 1994320 KB, meaning your table takes up roughly 1.9 GB. When reading fields that are not indexed (NVARCHAR(MAX) cannot be indexed), SQL Server must read the entire row into memory before restricting the columns, so you easily run past the 1 GB RAM limit.
As a simple test, delete the last (or first) 150k rows and try the query again to see what performance you get.
A few questions:
Does your table have a clustered index on the primary key (is it the id field or something else)?
Are you sorting on a column that is not indexed, such as the nvarchar(max) field?
In the best scenario for you, your PK is id and also the clustered index, and you either have no ORDER BY or you ORDER BY id.
Assuming your nvarchar(max) field is named comments:
SELECT id, comments
FROM myTable
ORDER BY id
This will work fine, but it will require you to read all the rows into memory (it will only do one pass over the table). Since comments is NVARCHAR(MAX) and cannot be indexed and the table is about 2 GB, SQL Server will have to load the table into memory in parts.
Likely what is happening is you have something like this:
SELECT id, comments
FROM myTable
ORDER BY comment_date
Where comment_date is an additional field that is not indexed. The behaviour in this case would be that SQL would be unable to actually sort the rows all in memory and it will end up having to page the table in and out of memory several times likely causing the problem you are seeing.
A simple solution in this case is to add an index to comment_date.
But suppose that is not possible, as you only have read access to the database; then another solution is to copy the data you want into a table variable first:
DECLARE @T TABLE
(
    id BIGINT,
    comments NVARCHAR(MAX),
    comment_date date
)
INSERT INTO @T SELECT id, comments, comment_date FROM myTable
SELECT id, comments
FROM @T
ORDER BY comment_date
If this doesn't help then additional information is required; please post your actual query along with your entire table definition and what the indexes are.
Beyond all of this, run the following after you restore backups to rebuild indexes and statistics; you could just be suffering from corrupted statistics (which happens when you back up a fragmented database and then restore it to a new instance):
EXEC [sp_MSforeachtable] @command1="RAISERROR('UPDATE STATISTICS(''?'') ...',10,1) WITH NOWAIT UPDATE STATISTICS ? "
EXEC [sp_MSforeachtable] @command1="RAISERROR('DBCC DBREINDEX(''?'') ...',10,1) WITH NOWAIT DBCC DBREINDEX('?')"
EXEC [sp_MSforeachtable] @command1="RAISERROR('UPDATE STATISTICS(''?'') ...',10,1) WITH NOWAIT UPDATE STATISTICS ? "

Azure Table Storage - PartitionKey and RowKey selection to use between query

I am a total newbie with Azure! The purpose is to return rows based on the timestamp stored in the RowKey. As there is a transaction cost with each query, I want to minimize the number of transactions/queries whilst maintaining performance.
These are the proposed Partition and Row Keys:
Partition Key: TextCache_(AccountID)_(ParentMessageId)
Row Key: (DateOfMessage)_(MessageId)
Legend:
AccountId - is an integer
ParentMessageId - The parent messageId if there is one, blank if it is the parent
DateOfMessage - Date the message was created - format will be DateTime.Ticks.ToString("d19")
MessageId - the unique Id of the message
I would like to get back, from a single query, the rows and any child rows that are > or < DateOfMessage_MessageId.
Can this be done via my proposed PartitionKeys and RowKeys?
i.e. (in pseudo code)
var results = ctx.PartitionKey.StartsWith(TextCache_AccountId)
&& ctx.RowKey > (TimeStamp)_MessageId
Secondly, if I have a number of accounts and only want to return the first 10, could it be done via a single query?
i.e. (in pseudo code)
var results = (
    (
        ctx.PartitionKey.StartsWith(TextCache_(AccountId1))
        && ctx.RowKey > (TimeStamp1)_MessageId1
    )
    ||
    (
        ctx.PartitionKey.StartsWith(TextCache_(AccountId2))
        && ctx.RowKey > (TimeStamp2)_MessageId2
    ) ...
)
.Take(10)
The short answer to your questions is yes, but there are some things you need to watch for.
Azure table storage doesn't have a direct equivalent of .StartsWith(). If you're using the storage library in combination with LINQ you can use .CompareTo() (> and < don't translate properly). This means that if you run a search for account 1 and ask the query to return 1000 results, but there are only 600 results for account 1, the last 400 results will be for account 10 (the next account number lexically). So you'll need to be a bit smart about how you deal with your results.
If you padded out the account id with leading 0s you could do something like this (pseudo code here as well):
ctx.PartitionKey > "TextCache_0000000001"
&& ctx.PartitionKey < "TextCache_0000000002"
&& ctx.RowKey > "123465798"
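For illustration only, here is how that padded range filter might look with the current azure-data-tables Java SDK (the original pseudo code targets the older .NET LINQ layer; the connection string, table name, and key values below are placeholders):
import com.azure.data.tables.TableClient;
import com.azure.data.tables.TableClientBuilder;
import com.azure.data.tables.models.ListEntitiesOptions;
import com.azure.data.tables.models.TableEntity;

public class MessageQuery {
    public static void main(String[] args) {
        TableClient table = new TableClientBuilder()
            .connectionString("<storage connection string>")
            .tableName("TextCache")
            .buildClient();

        // Range scan: all partitions for account 0000000001, RowKey after the given ticks_messageId.
        ListEntitiesOptions options = new ListEntitiesOptions().setFilter(
            "PartitionKey ge 'TextCache_0000000001' and PartitionKey lt 'TextCache_0000000002'"
            + " and RowKey gt '0634000000000000000_msg1'");

        for (TableEntity entity : table.listEntities(options, null, null)) {
            System.out.println(entity.getPartitionKey() + " / " + entity.getRowKey());
        }
    }
}
The ge/lt pair on PartitionKey is the range trick standing in for the missing StartsWith().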
Something else to bear in mind is that queries to Azure Tables return their results in PartitionKey then RowKey order. So in your case messages without a ParentMessageId will be returned before messages with a ParentMessageId. If you're never going to query this table by ParentMessageId I'd move this to a property.
If TextCache_ is just a string constant, it's not adding anything by being included in the PartitionKey unless this will actually mean something to your code when it's returned.
While your second query will run, I don't think it will produce what you're after. If you want the first ten rows in DateOfMessage order, it won't work (see my point above about sort order). If you ran this query as it is and account 1 had 11 messages, it would return only the first 10 messages related to account 1, regardless of whether account 2 had an earlier message.
While trying to minimise the number of transactions you use is good practice, don't be too concerned about it. The cost of running your worker/web roles will dwarf your transaction costs. 1,000,000 transactions will cost you $1 which is less than the cost of running one small instance for 9 hours.
