Add a field to existing documents across millions of records - Go

Scenario
We have over 5 million documents in a bucket, each containing nested JSON keyed by a simple UUID. We want to add one extra field to ALL of the documents.
Example
ee6ae656-6e07-4aa2-951e-ea788e24856a
{
    "field1": "data1",
    "field2": {
        "nested_field1": "data2"
    }
}
After adding the extra field
ee6ae656-6e07-4aa2-951e-ea788e24856a
{
    "field1": "data1",
    "field3": "data3",
    "field2": {
        "nested_field1": "data2"
    }
}
The bucket has only one primary index: CREATE PRIMARY INDEX idx ON bucket.
Problem
It takes ages. We tried it with N1QL: UPDATE bucket SET field3 = "data3". We also tried sub-document mutations. Both take hours. The tool is written in Go, so we can put the work into goroutines, but it still takes too much time.
Question
Is there any solution to reduce that time?

Since you need to add a new field rather than modify an existing one, the SDK's sub-document (SUBDOC) API is a better choice than a N1QL UPDATE (which rewrites the whole document and requires fetching it first).
The best option is to use N1QL to get the document keys, then use the SDK SUBDOC API to add the field you need. You can use the reactive API (asynchronously).
Since you have 5M documents and a primary index, use the following keyset pagination:
val = ""
in a loop:
    SELECT RAW META().id FROM mybucket WHERE META().id > $val LIMIT 10000;
    apply the SDK SUBDOC update to each returned key
    val = the last key returned by the SELECT
https://blog.couchbase.com/offset-keyset-pagination-n1ql-query-couchbase/
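A minimal Go sketch of this keyset pagination plus sub-document loop, assuming the gocb v2 SDK, placeholder connection details, and sequential mutations (in practice you would fan the MutateIn calls out across goroutines, or use the reactive/async API, as suggested above):

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/couchbase/gocb/v2"
)

func main() {
    // Connection details and bucket name are placeholders.
    cluster, err := gocb.Connect("couchbase://localhost", gocb.ClusterOptions{
        Username: "Administrator",
        Password: "password",
    })
    if err != nil {
        log.Fatal(err)
    }
    bucket := cluster.Bucket("mybucket")
    if err := bucket.WaitUntilReady(5*time.Second, nil); err != nil {
        log.Fatal(err)
    }
    collection := bucket.DefaultCollection()

    lastKey := ""
    for {
        rows, err := cluster.Query(
            "SELECT RAW META().id FROM mybucket WHERE META().id > $val ORDER BY META().id LIMIT 10000",
            &gocb.QueryOptions{NamedParameters: map[string]interface{}{"val": lastKey}},
        )
        if err != nil {
            log.Fatal(err)
        }
        count := 0
        for rows.Next() {
            var id string
            if err := rows.Row(&id); err != nil {
                log.Fatal(err)
            }
            // Sub-document upsert: only field3 travels over the wire,
            // no full-document fetch or rewrite.
            if _, err := collection.MutateIn(id, []gocb.MutateInSpec{
                gocb.UpsertSpec("field3", "data3", nil),
            }, nil); err != nil {
                log.Printf("mutate %s: %v", id, err)
            }
            lastKey = id
            count++
        }
        if err := rows.Err(); err != nil {
            log.Fatal(err)
        }
        if count == 0 {
            break // no more keys to process
        }
    }
    fmt.Println("done")
}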

The Eventing Service can be quite performant for this sort of enrichment task. Even a low-end system should be able to do 5M rows in under two (2) minutes.
// Note: src_bkt is an alias to the source bucket for your handler
// in read+write mode (supported in version 6.5.1+). This uses DCP
// and can be 100X more performant than N1QL.
function OnUpdate(doc, meta) {
    // optional filter to be more selective
    // if (doc.type !== "mytype") return;
    // skip documents that already have the field we want to add
    if (doc.field3) return;
    doc.field3 = "data3";
    src_bkt[meta.id] = doc;
}
For more details on Eventing refer to https://docs.couchbase.com/server/current/eventing/eventing-overview.html. I typically enrich 3/4 of a billion documents this way. The Eventing function will also run faster (enrich more documents per second) if you increase the number of workers in your Eventing function's settings from, say, 3 to 16, provided you have 8+ physical cores on your Eventing node.
I tested the above Eventing function and it enriched 5M documents (modeled on your example) on my non-MDS single-node Couchbase test system (12 cores at 2.2 GHz) in just 72 seconds. Obviously, if you have a real multi-node cluster it will be faster (maybe all 5M docs in just 5 seconds).

Related

Search by values in Redis cache - Secondary Indexing

I am new to Redis. I want to search by one or multiple values that come from an API.
For example, let's say that I want to store some securities data as below:
Value1
{
    "isin": "isin123",
    "id_bb_global": "BBg12345676",
    "cusip": "cusip123",
    "sedol": "sedol123",
    "cpn": "0.09",
    "cntry": "US",
    "144A": "xyz",
    "issue_cntry": "UK"
}
Value2
{
    "isin": "isin222",
    "id_bb_global": "BBG222",
    "cusip": "cusip222",
    "sedol": "sedol222",
    "cpn": "1.0",
    "cntry": "IN",
    "144A": "Y",
    "issue_cntry": "DE"
}
...
...
I want to search by cusip, or by cusip and id_bb_global, or by ISIN plus exchange, or by sedol.
e.g. a search query of {"isin":"isin222", "cusip":"cusip222"} should return the whole matching value.
What is the best way to store this kind of data structure in Redis, and what is the best way to query it quickly?
When you insert data, you can create sets to maintain a secondary index.
{
    "isin": "isin123",
    "id_bb_global": "BBg12345676",
    "cusip": "cusip123",
    "sedol": "sedol123",
    "cpn": "0.09",
    "cntry": "US",
    "144A": "xyz",
    "issue_cntry": "UK"
}
For example, with the above data, if you want to filter by isin and cusip, you can create the respective sets isin:isin123 and cusip:cusip123 and add the item's id to both of those sets.
Later on, if you want to find items that are in both isin:isin123 and cusip:cusip123, you just have to run SINTER on those two sets.
Or, if you want to find items that are in either isin:isin123 OR cusip:cusip123, you can union them with SUNION.
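A minimal Go sketch of this set-based secondary index, assuming the github.com/redis/go-redis/v9 client and made-up key names (sec:1, isin:isin123, cusip:cusip123):

package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // address is a placeholder

    // Store the full record as a hash under a hypothetical item id.
    rdb.HSet(ctx, "sec:1", map[string]interface{}{
        "isin": "isin123", "cusip": "cusip123", "id_bb_global": "BBg12345676",
    })

    // Maintain one set per attribute value, each holding item ids.
    rdb.SAdd(ctx, "isin:isin123", "sec:1")
    rdb.SAdd(ctx, "cusip:cusip123", "sec:1")

    // AND: items matching both isin and cusip.
    ids, err := rdb.SInter(ctx, "isin:isin123", "cusip:cusip123").Result()
    if err != nil {
        panic(err)
    }
    for _, id := range ids {
        rec, _ := rdb.HGetAll(ctx, id).Result()
        fmt.Println(id, rec)
    }

    // OR: items matching either value.
    union, _ := rdb.SUnion(ctx, "isin:isin123", "cusip:cusip123").Result()
    fmt.Println(union)
}

Keep in mind that if an attribute value changes you must also remove the id from the old set, otherwise the index goes stale.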

How to achieve dimensional charting on large dataset?

I have successfully used a combination of crossfilter, dc and d3 to build multivariate charts for smaller datasets.
My current system handles 1.5 million txns a day, and I want to use the same combination to show dimensional charts on this larger dataset (spanning 6 months). I cannot push data of this size to the frontend for obvious reasons.
The txn data has second-level granularity, but this level of granularity is not required in the visualization. If the txn data can be rolled up to day-level granularity at the backend and the day-based aggregation pushed to the frontend, it can drastically reduce the IO traffic and the size of the data handed to crossfilter/dc, and dc can then show its visualization magic.
Taking the above idea forward, I decided to reduce the size of the data by lowering the granularity of the time-series data from milliseconds to days, pre-aggregating the data across various dimensions using the GROUP BY query below (this is similar to what crossfilter does at the frontend):
SELECT TRUNC(DATELOGGED) AS DTLOGGED, CODE, ACTION, COUNT(*) AS TXNCOUNT,
       GROUPING_ID(TRUNC(DATELOGGED), CODE, ACTION) AS GROUPING_ID
FROM AAAA
GROUP BY GROUPING SETS(TRUNC(DATELOGGED), (TRUNC(DATELOGGED), CURR_CODE), (TRUNC(DATELOGGED), ACTION));
Sample output of these rows:
Rows in which aggregation is done by (TRUNC(DATELOGGED), CODE) share GROUPING_ID 1, and rows aggregated by (TRUNC(DATELOGGED), ACTION) share GROUPING_ID 2.
//group by DTLOGGED, CODE
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"144","ACTION":"", "TXNCOUNT":69,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"376","ACTION":"", "TXNCOUNT":20,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"144","ACTION":"", "TXNCOUNT":254,"GROUPING_ID":1},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"376","ACTION":"", "TXNCOUNT":961,"GROUPING_ID":1},
//group by DTLOGGED, ACTION
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"ENROLLED_PURCHASE", "TXNCOUNT":373600,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"UNENROLLED_PURCHASE", "TXNCOUNT":48978,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"ENROLLED_PURCHASE", "TXNCOUNT":402311,"GROUPING_ID":2},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"UNENROLLED_PURCHASE", "TXNCOUNT":54910,"GROUPING_ID":2},
//group by DTLOGGED
{"DTLOGGED":"2013-08-03T07:00:00.000Z","CODE":"","ACTION":"", "TXNCOUNT":460732,"GROUPING_ID":3},
{"DTLOGGED":"2013-08-04T07:00:00.000Z","CODE":"","ACTION":"", "TXNCOUNT":496060,"GROUPING_ID":3}];
Questions:
These rows are disjoint, i.e. unlike usual rows, a single row does not have valid values for both CODE and ACTION.
After a selection is made in one of the graphs, the redrawing effect either removes the other graphs or shows no data on them.
Please give me any troubleshooting help, or suggest better ways to solve this.
http://jsfiddle.net/universallocalhost/5qJjT/3/
There are a couple of things going on in this question, so I'll try to separate them:
Crossfilter works with tidy data
http://vita.had.co.nz/papers/tidy-data.pdf
This means that you will need to come up with a naive method of filling in the nulls you're seeing (or, if need be, omit the nulled values in your initial query of the data). If you want to get really fancy, you could even infer the null values based on other data. Whatever your solution, you need to make your data tidy prior to putting it into crossfilter.
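If you would rather separate the null-padded rows before they ever reach the browser, here is a minimal Go sketch of splitting the GROUPING SETS output by GROUPING_ID server-side; the Row struct and field names are assumptions based on the sample output above, not part of the original answer.

package main

import "fmt"

// Row mirrors one record of the GROUP BY output shown in the question.
type Row struct {
    DTLogged   string
    Code       string
    Action     string
    TxnCount   int
    GroupingID int
}

// splitByGrouping separates the disjoint GROUPING SETS rows into one tidy
// slice per grouping id, so each slice can feed its own chart or crossfilter.
func splitByGrouping(rows []Row) map[int][]Row {
    out := make(map[int][]Row)
    for _, r := range rows {
        out[r.GroupingID] = append(out[r.GroupingID], r)
    }
    return out
}

func main() {
    rows := []Row{
        {"2013-08-03", "144", "", 69, 1},
        {"2013-08-03", "", "ENROLLED_PURCHASE", 373600, 2},
        {"2013-08-03", "", "", 460732, 3},
    }
    for id, group := range splitByGrouping(rows) {
        fmt.Println("grouping", id, "->", len(group), "rows")
    }
}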
Groups and Filtering Operations
txnVolByCurrcode = txnByCurrcode.group().reduceSum(function(d) {
    if (d.GROUPING_ID === 1) {
        return d.TXNCOUNT;
    } else {
        return 0;
    }
});
This is a filtering operation done inside the reduction. This is something you should separate: allow that filtering to occur elsewhere (in the visual, in crossfilter itself, or in the query on the data).
This means your reduceSum becomes:
var txnVolByCurrcode = txnByCurrcode.group().reduceSum(function(d) {
    return d.TXNCOUNT;
});
And if you would like the user to select which group to display:
var groupId = cfdata.dimension(function(d) { return d.GROUPING_ID; });
var groupIdGroup = groupId.group(); // this is an interesting name

dc.pieChart("#group-chart")
    .width(250)
    .height(250)
    .radius(125)
    .innerRadius(50)
    .transitionDuration(750)
    .dimension(groupId)
    .group(groupIdGroup)
    .renderLabel(true);
For an example of this working:
http://jsfiddle.net/b67pX/

Why Mutation does not insert values for existing columns

I am loading initial data (a URL list for a crawler) into Cassandra with status crawled=0. Then, using Hadoop, I crawl all the links and try to change crawled from 0 to something else, for example 1, 2, or 3. When I check in the Cassandra CLI with get ColumnFamily['www.somedomain.com'], the value of the crawled column remains the same. If I do not mention the crawled column during the initial import, it is added correctly. This is only one part of the algorithm, and I need further updates of this column from other Map/Reduce jobs, etc.
The Thrift and Cassandra API documentation say that there are only inserts and deletions, and that an insert should work as an update.
The crawled column is of UTF8 type.
The Mutation is built like this:
private static Mutation getMutationCrawled(Text crawledVal)
{
    Text column = new Text();
    column.set("crawled");

    Column c = new Column();
    c.setName(ByteBuffer.wrap(Arrays.copyOf(column.getBytes(), column.getLength())));
    c.setValue(ByteBuffer.wrap(crawledVal.getBytes()));
    c.setTimestamp(System.currentTimeMillis());

    Mutation m = new Mutation();
    m.setColumn_or_supercolumn(new ColumnOrSuperColumn());
    m.column_or_supercolumn.setColumn(c);
    return m;
}
Cassandra resolves conflicts using the timestamp of the mutation, with the largest timestamp winning. You can set the timestamp to whatever value you want, but the convention is to set it in microseconds. In the example above, you set the timestamp with
c.setTimestamp(System.currentTimeMillis());
Most likely the initial import code that populated the values set its timestamps in microseconds. Those microsecond timestamps are larger than your millisecond timestamps, so your updates are being ignored.
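To make the scale difference concrete, here is a tiny Go illustration (the pitfall itself is language-independent; in the Java code above a common fix is to stamp the mutation with System.currentTimeMillis() * 1000):

package main

import (
    "fmt"
    "time"
)

func main() {
    now := time.Now()
    fmt.Println("milliseconds:", now.UnixMilli()) // roughly 1.7e12
    fmt.Println("microseconds:", now.UnixMicro()) // roughly 1.7e15, three orders of magnitude larger
    // A column written now with a millisecond timestamp still "loses" to a
    // column written earlier with a microsecond timestamp, so Cassandra keeps
    // the old value and the update appears to be ignored.
}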

Azure Table Storage - PartitionKey and RowKey selection to use between query

I am a total newbie with Azure! The purpose is to return rows based on the timestamp stored in the RowKey. As there is a transaction cost with each query, I want to minimize the number of transactions/queries whilst maintaining performance.
These are the proposed Partition and Row Keys:
Partition Key: TextCache_(AccountID)_(ParentMessageId)
Row Key: (DateOfMessage)_(MessageId)
Legend:
AccountId - is an integer
ParentMessageId - The parent messageId if there is one, blank if it is the parent
DateOfMessage - Date the message was created - format will be DateTime.Ticks.ToString("d19")
MessageId - the unique Id of the message
I would like a single query to return the rows, and any child rows, whose key is > or < DateOfMessage_MessageId.
Can this be done via my proposed PartitionKeys and RowKeys?
i.e. (in pseudo code)
var results = ctx.PartitionKey.StartsWith(TextCache_AccountId)
&& ctx.RowKey > (TimeStamp)_MessageId
Secondly, if I have a number of accounts and only want to return the first 10 rows, could it be done via a single query?
i.e. (in pseudo code)
var results = (
    (
        ctx.PartitionKey.StartsWith(TextCache_(AccountId1))
        && ctx.RowKey > (TimeStamp1)_MessageId1
    )
    ||
    (
        ctx.PartitionKey.StartsWith(TextCache_(AccountId2))
        && ctx.RowKey > (TimeStamp2)_MessageId2
    ) ...
)
.Take(10)
The short answer to your questions is yes, but there are some things you need to watch for.
Azure table storage doesn't have a direct equivalent of .StartsWith(). If you're using the storage library in combination with LINQ you can use .CompareTo() (> and < don't translate properly). This means that if you run a search for account 1 and ask the query to return 1000 results, but there are only 600 results for account 1, the last 400 results will be for account 10 (the next account number lexically). So you'll need to be a bit smart about how you deal with your results.
If you padded out the account id with leading 0s you could do something like this (pseudo code here as well):
ctx.PartitionKey > "TextCache_0000000001"
&& ctx.PartitionKey < "TextCache_0000000002"
&& ctx.RowKey > "123465798"
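As a concrete illustration of this padded-key range, here is a minimal Go sketch that just builds the OData filter string for one account; the helper name, padding width and tick string are assumptions, not part of the original answer.

package main

import "fmt"

// buildAccountFilter builds an OData filter for one account's messages newer
// than a given (DateOfMessage)_(MessageId) row key, using zero-padded account
// ids so the PartitionKey range stays within that account. Hypothetical helper.
func buildAccountFilter(accountID int64, ticksD19, messageID string) string {
    lower := fmt.Sprintf("TextCache_%010d", accountID)
    upper := fmt.Sprintf("TextCache_%010d", accountID+1)
    rowKey := ticksD19 + "_" + messageID
    return fmt.Sprintf("PartitionKey gt '%s' and PartitionKey lt '%s' and RowKey gt '%s'",
        lower, upper, rowKey)
}

func main() {
    // "0000000000000123456" stands in for DateTime.Ticks.ToString("d19")
    fmt.Println(buildAccountFilter(1, "0000000000000123456", "msg-001"))
}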
Something else to bear in mind is that queries to Azure Tables return their results in PartitionKey then RowKey order. So in your case messages without a ParentMessageId will be returned before messages with a ParentMessageId. If you're never going to query this table by ParentMessageId I'd move this to a property.
If TextCache_ is just a string constant, it's not adding anything by being included in the PartitionKey unless this will actually mean something to your code when it's returned.
While your second query will run, I don't think it will produce what you're after. If you want the first ten rows in DateOfMessage order, it won't work (see my point above about sort order). If you ran this query as it is and account 1 had 11 messages, it would return only the first 10 messages related to account 1, regardless of whether account 2 had an earlier message.
While trying to minimise the number of transactions you use is good practice, don't be too concerned about it. The cost of running your worker/web roles will dwarf your transaction costs: 1,000,000 transactions cost $1, which is less than the cost of running one small instance for 9 hours.

Efficient way to delete multiple rows in HBase

Is there an efficient way to delete multiple rows in HBase or does my use case smell like not suitable for HBase?
There is a table say 'chart', which contains items that are in charts. Row keys are in the following format:
chart|date_reversed|ranked_attribute_value_reversed|content_id
Sometimes I want to regenerate chart for a given date, so I want to delete all rows starting from 'chart|date_reversed_1' till 'chart|date_reversed_2'. Is there a better way than to issue a Delete for each row found by a Scan? All the rows to be deleted are going to be close to each other.
I need to delete the rows, because I don't want one item (one content_id) to have multiple entries which it will have if its ranked_attribute_value had been changed (its change is the reason why chart needs to be regenerated).
Being an HBase beginner, perhaps I am misusing rows for something that columns would handle better -- if you have design suggestions, cool! Or maybe the charts are better generated in a file (i.e., no HBase for output)? I'm using MapReduce.
Firstly, coming to the point of range deletes: there is no range delete yet in HBase, AFAIK. But there is a way to delete more than one row at a time through the HTableInterface API: simply form a Delete object for each row key from the scan, put them in a List, and call the API -- done! To make the scan faster, do not include any column family in the scan result, as all you need is the row key for deleting whole rows.
Secondly, about the design. My understanding of the requirement is: there are contents with a content id, charts are generated against each content and stored, there can be multiple charts per content across dates depending on the rank, and we want the last generated content's chart to show at the top of the table.
For that assumption of the requirement I would suggest using three tables: auto_id, content_charts and generated_order. The row key for content_charts would be the content id, and the row key for generated_order would be a long that is auto-decremented using the HTableInterface API (use -1 as the increment amount, and initialize the value to Long.MAX_VALUE in the auto_id table at first start-up of the app, or manually). Now, if you want to delete the chart data, simply clean the column family using a delete, put back the new data, and then put into the generated_order table; this way the latest insertion will also be at the top of the latest-insertion table, which holds the content id as a cell value. If you want to ensure generated_order has only one entry per content, save the generated_order id first, store that value in content_charts when putting, and, before deleting the column family, first delete the row from generated_order. This way you can look up the charts for a content with at most 2 gets, and no scan is required for the charts.
I hope this is helpful.
You can use the BulkDeleteProtocol which uses a Scan that defines the relevant range (start row, end row, filters).
See here
I ran into the same situation, and this is my code to implement what you want:
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("Family"));
scan.setStartRow(structuredKeyMaker.key(startDate));
scan.setStopRow(structuredKeyMaker.key(endDate + 1));

try {
    ResultScanner scanner = table.getScanner(scan);
    // EntityIteratorWrapper is a simple iterator that maps rows to my own entity type, not so important here
    Iterator<Entity> entityIterator = new EntityIteratorWrapper(scanner.iterator(), EntityMapper.create());

    List<Delete> deletes = new ArrayList<Delete>();
    int bufferSize = 10000000; // in-memory buffer so I don't run out of memory, as I have a huge amount of data
    int counter = 0;
    while (entityIterator.hasNext()) {
        if (counter < bufferSize) {
            // KeyMaker extracts the row key as byte[] from my entity
            deletes.add(new Delete(KeyMaker.key(entityIterator.next())));
            counter++;
        } else {
            table.delete(deletes);
            deletes.clear();
            counter = 0;
        }
    }
    if (deletes.size() > 0) {
        table.delete(deletes);
        deletes.clear();
    }
    scanner.close();
} catch (IOException e) {
    e.printStackTrace();
}
