bigtable.SampleRowKeys returns a single key - Go

I'm trying to write code that does a full table scan in Go by using the bigtable.Table.SampleRowKeys RPC method. The table has around 7M rows (verified with cbt), yet the call returns a single key, whereas the documentation says:
// SampleRowKeys returns a sample of row keys in the table. The returned row keys will delimit contiguous sections of
// the table of approximately equal size, which can be used to break up the data for distributed tasks like MapReduce.
Am I missing something?

It turns out that the returned keys are split points, so if the call returns a single key, say [k1], then the ranges to scan are [("", k1), (k1, "")].
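For illustration, here is a minimal sketch (in Python, independent of any particular client library) of turning whatever SampleRowKeys returns into the ranges to scan; sample_keys is assumed to be the list of returned keys:

def ranges_from_samples(sample_keys):
    # N sampled keys delimit N+1 contiguous ranges covering the whole table;
    # "" stands for the open start/end of the table.
    boundaries = [""] + list(sample_keys) + [""]
    return list(zip(boundaries[:-1], boundaries[1:]))

# ranges_from_samples(["k1"]) -> [("", "k1"), ("k1", "")]
# Each (start, end) pair can then be handed to a separate worker for scanning.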

Related

DynamoDB with AWS Lambda: ordering Scan results in descending order

I am trying to create an AWS Lambda function using Node.js that scans records from DynamoDB, but it gives me the records in random order. I would like to fetch the 5 records most recently added to the table, i.e. sort by Timestamp so I can get the latest 5 records. Does anyone have an idea? Please help me out.
DynamoDB deliberately does not support ordering in its Scan operation; ordering is supported in Query operations.
To get the behavior you want you can do the following (with one caveat, see below):
Make sure that each record on your table has an attribute (let's call it x) which always holds the same value (does not matter which value, let's say the value is always "y")
Define a global secondary index on your table. The key of that index should use x as the partition key (aka "hash key") and the timestamp field as the sort key.
Then you can issue a Query action on that index. "Query results are always sorted by the sort key value" (see here), which is exactly what you need.
The caveat: this means that your index will hold all records of your table under the same partition key. This goes against best practices of dynamodb (see Choosing the Right DynamoDB Partition Key). It will not scale for large tables (more than tens of GB).
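For reference, a minimal boto3 sketch of that query; the table name, the index name, and the attribute names x and Timestamp are placeholders matching the setup described above:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("my_table")

resp = table.query(
    IndexName="x-Timestamp-index",            # GSI: partition key x, sort key Timestamp
    KeyConditionExpression=Key("x").eq("y"),  # every item stores the constant "y" in x
    ScanIndexForward=False,                   # newest first (descending by sort key)
    Limit=5,                                  # only the 5 most recent items
)
latest_five = resp["Items"]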

RethinkDB: count unique rows within grouped data

I'm trying to count all unique rows within grouped data, i.e., how many unique rows exist within each group.
Although groupedData.distinct().count() works for relatively small numbers of rows, running it on ~200k rows, as in my case, fails with an "over size limit" error.
I understand why this happens, yet I can't come up with a more efficient way of doing it - is there one?
In my experience, count is an expensive operation in RethinkDB, especially when it requires iterating over the whole data set. I struggled with this myself for a while.
To my understanding, when you pass grouped data to distinct, it creates an array (because the grouped data is a sequence), and arrays are subject to the 100,000-element limit.
To work around this, we have to use a stream and count the stream instead. We cannot use group because it returns a group of streams - in other words, an array of streams - again, to my understanding.
So here is how I solve it:
Create an index on the field I want to group by
Call distinct on the table with that index
Map the resulting stream, passing each value into a count over getAll with the same index
An example query
r.table('t').distinct({index: 'index_name'})
  .map(function(value) {
    return {group: value, total: r.table('t').getAll(value, {index: 'index_name'}).count()};
  })
With this, everything is a stream and we can lazily iterate over the result set to get the count of each group.

Mapping through two data sets with Hadoop

Suppose I have two key-value data sets--Data Sets A and B, let's call them. I want to update all the data in Set A with data from Set B where the two match on keys.
Because I'm dealing with such large quantities of data, I'm using Hadoop MapReduce. My concern is that to do this key matching between A and B, I would need to load all of Set A (a lot of data) into the memory of every mapper instance. That seems rather inefficient.
Would there be a recommended way to do this that doesn't require repeating the work of loading in A every time?
Some pseudocode to clarify what I'm currently doing:
Load in Data Set A   # This seems like the expensive step to always be doing
Foreach key/value in Data Set B:
    If key is in Data Set A:
        Update Data Set A
According to the documentation, the MapReduce framework includes the following steps:
Map
Sort/Partition
Combine (optional)
Reduce
You've described one way to perform your join: loading all of Set A into memory in each Mapper. You're correct that this is inefficient.
Instead, observe that a large join can be partitioned into arbitrarily many smaller joins if both sets are sorted and partitioned by key. MapReduce sorts the output of each Mapper by key in step (2) above. Sorted Map output is then partitioned by key, so that one partition is created per Reducer. For each unique key, the Reducer will receive all values from both Set A and Set B.
To finish your join, the Reducer only needs to output the key with the updated value from Set B if one exists, and otherwise the key with the original value from Set A. To distinguish between values from Set A and Set B, try setting a flag on the value output by the Mapper.
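As a rough illustration, here is a Hadoop Streaming-style sketch of that reduce-side join in Python. It assumes tab-separated key/value input lines and that each mapper is told which set it is reading (here via a command-line argument, which is a simplification; a real job would derive this from the input file name or job configuration):

# mapper.py - tag each record with its source so the reducer can tell the sets apart
import sys

tag = sys.argv[1]                       # "A" or "B" (simplification, see note above)
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    print(key + "\t" + tag + "\t" + value)

# reducer.py - lines arrive sorted by key; for each key, prefer the value from Set B
import sys
from itertools import groupby

def parse(line):
    key, tag, value = line.rstrip("\n").split("\t", 2)
    return key, tag, value

for key, group in groupby((parse(l) for l in sys.stdin), key=lambda t: t[0]):
    a_value, b_value = None, None
    for _, tag, value in group:
        if tag == "A":
            a_value = value
        else:
            b_value = value
    # emit the updated value from Set B if it exists, otherwise the original from Set A
    print(key + "\t" + (b_value if b_value is not None else a_value))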
All of the answers posted so far are correct - this should be a Reduce-side join... but there's no need to reinvent the wheel! Have you considered Pig, Hive, or Cascading for this? They all have joins built-in, and are fairly well optimized.
This video tutorial by Cloudera gives a great description of how to do a large-scale Join through MapReduce, starting around the 12 minute mark.
Here are the basic steps he lays out for joining records from file B onto records from file A on key K, with pseudocode. If anything here isn't clear, I'd suggest watching the video as he does a much better job explaining it than I can.
In your Mapper:
For K from file A:
    tag K to identify it as a Primary Key
    emit <K, value of K>
For K from file B:
    tag K to identify it as a Foreign Key
    emit <K, record>
Write a Sorter and Grouper which will ignore the PK/FK tagging, so that your records are sent to the same Reducer regardless of whether they are a PK record or a FK record and are grouped together.
Write a Comparator which will compare the PK and FK keys and send the PK first.
The result of this step will be that all records with the same key will be sent to the same Reducer and be in the same set of values to be reduced. The record tagged with PK will be first, followed by all records from B which need to be joined. Now, the Reducer:
value_of_PK = values[0]              # the first value is the value of your primary key
for value in values[1:]:
    value.replace(FK, value_of_PK)   # replace the foreign key with the primary key's value
    emit <key, value>
The result of this will be file B, with all occurrences of K replaced by the value of K in file A. You can also extend this to effect a full inner join, or to write out both files in their entirety for direct database storage, but those are pretty trivial modifications once you get this working.

Windows Azure Paging Large Datasets Solution

I'm using Windows Azure Table Storage to store millions of entities. I'm trying to figure out the best solution that easily allows for the following:
1) a search for an entity will retrieve that entity and at least (pageSize) entities on either side of it
2) if there are more entities beyond (pageSize) entities on either side of that entity, then "page next" or "page previous" links are shown; this continues until either the start or the end is reached
3) the order is reverse chronological order
I've decided that the PartitionKey will be the Title provided by the user, as each container is unique in the system. The RowKey uses Steve Marx's lexicographical algorithm:
http://blog.smarx.com/posts/using-numbers-as-keys-in-windows-azure
which when converted to javascript instead of c# looks like this:
pad(new Date(100000000 * 86400000).getTime() - new Date().getTime(), 19) + "_" + uuid()
uuid() is a JavaScript function that returns a GUID, and pad() left-pads with zeros up to 19 characters in length. So records in the system look something like this:
PK RK
TEST 0008638662595845431_ecf134e4-b10d-47e8-91f2-4de9c4d64388
TEST 0008638662595845432_ae7bb505-8594-43bc-80b7-6bd34bb9541b
TEST 0008638662595845433_d527d215-03a5-4e46-8a54-10027b8e23f8
TEST 0008638662595845434_a2ebc3f4-67fe-43e2-becd-eaa41a4132e2
This pattern allows for every new entity inserted to be at the top of the list which satisfies point number 3 above.
With a nice way of adding new records to the system, I thought I would then create a mechanism that looks at the first half of the RowKey, i.e. the 0008638662595845431_ part, and does a greater-than or less-than comparison depending on the direction from the already-found item. In other words, to get the row immediately before 0008638662595845431 I would do a query like so:
var tableService = azure.createTableService();
var minPossibleDateTimeNumber = pad(new Date(-100000000 * 86400000).getTime() - new Date().getTime(), 19);
tableService.getTable('testTable', function (error) {
    if (error === null) {
        var query = azure.TableQuery
            .select()
            .from('testTable')
            .where('PartitionKey eq ?', 'TEST')
            .and('RowKey gt ?', minPossibleDateTimeNumber + '_')
            .and('RowKey lt ?', '0008638662595845431_')
            .and('Deleted eq ?', 'false');

        // run the query; entities holds the page of results and the continuation
        // token is set when more than 1000 rows match
        tableService.queryEntities(query, function (error, entities, continuationToken) {
            // ...
        });
    }
});
If the results returned number more than 1000 and Azure gives me a continuation token, then I thought I would remember the last item's RowKey, i.e. the number part 0008638662595845431. So the next query would use the remembered value as its starting value, and so on.
I am using the Windows Azure Node.js SDK and the language is JavaScript.
Can anybody see gotchas or problems with this approach?
I do not see how this can work effectively and efficiently, especially to get the rows for a previous page.
To be efficient, the prefix of your "key" needs to be a serially incrementing or decrementing value, instead of being based on a timestamp. A timestamp-generated value would have duplicates as well as holes, making the mapping from page size to row count inefficient at best and difficult to determine at worst.
Also, this potential algorithm is dependent on a single partition key, destroying table scalability.
The challenge here is to have a method of generating a serially incremented key. One solution is to use a SQL database and perform an atomic update on a single row, such that an incrementing or decrementing value is produced in sequence - something like UPDATE … SET X = X + 1 and return X, perhaps via a stored procedure.
The key could then be a zero-left-padded, serially generated number, split such that, say, the first N digits of the number form the partition key and the remaining M digits form the row key.
For example
PKey RKey
00001 10321
00001 10322
….
00954 98912
Now, since the rows are in sequence it is possible to write a query with the exact key range for the page size.
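As a rough sketch of the key-generation step (using SQLite in Python to stand in for the SQL database; the table name and digit counts are made up for illustration):

import sqlite3

conn = sqlite3.connect("counter.db")
conn.execute("CREATE TABLE IF NOT EXISTS counter (x INTEGER)")
if conn.execute("SELECT COUNT(*) FROM counter").fetchone()[0] == 0:
    conn.execute("INSERT INTO counter VALUES (0)")
conn.commit()

def next_serial():
    # UPDATE ... SET X = X + 1 and read the new value back, inside one transaction
    with conn:
        conn.execute("UPDATE counter SET x = x + 1")
        return conn.execute("SELECT x FROM counter").fetchone()[0]

def make_keys(serial, pk_digits=5, rk_digits=14):
    # zero-pad the serial number and split it into (PartitionKey, RowKey)
    s = str(serial).zfill(pk_digits + rk_digits)
    return s[:pk_digits], s[pk_digits:]

pkey, rkey = make_keys(next_serial())   # e.g. ("00000", "00000000000001")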
Caveat: there is a small risk of a failure occurring between generating a serial key and writing to table storage, in which case there may be holes in the table. However, your paging algorithm should be able to detect and work around such instances quite easily, by specifying a page size slightly larger than necessary or by retrying with an adjusted range.

Removing duplicates from a huge file

We have a huge chunk of data and we want to perform a few operations on it. Removing duplicates is one of the main operations.
Ex.
a,me,123,2631272164
yrw,wq,1237,123712,126128361
yrw,dsfswq,1323237,12xcvcx3712,1sd26128361
These are three entries in a file, and we want to remove duplicates on the basis of the first column, so the third row should be deleted. Each row may have a different number of columns, but the column we are interested in will always be present.
An in-memory operation doesn't look feasible.
Another option is to store the data in a database and remove duplicates there, but again, that's not a trivial task.
What design should I follow to dump the data into a database and remove duplicates?
I am assuming that people must have faced such issues and solved them.
How do we usually solve this problem?
PS: Please consider this as a real life problem rather than interview question ;)
If the number of keys is also infeasible to load into memory, you'll have to do a stable (order-preserving) external merge sort to sort the data, followed by a linear scan to do the duplicate removal. Alternatively, you could modify the external merge sort to eliminate duplicates while merging the sorted runs.
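As a rough sketch of the second variant (eliminating duplicates while merging the sorted runs), assuming the key is the first comma-separated column and that chunk_lines lines fit in memory; note the output ends up sorted by key rather than in the original row order:

import heapq
import itertools
import tempfile

def first_col(line):
    return line.split(",", 1)[0]

def external_dedup(in_path, out_path, chunk_lines=1000000):
    # Pass 1: sort chunks that fit in memory by the first column and spill them to temp files
    runs = []
    with open(in_path) as f:
        while True:
            chunk = list(itertools.islice(f, chunk_lines))
            if not chunk:
                break
            chunk.sort(key=first_col)        # stable, so the first occurrence stays first
            run = tempfile.TemporaryFile("w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)

    # Pass 2: merge the sorted runs, keeping only the first line seen for each key
    last = None
    with open(out_path, "w") as out:
        for line in heapq.merge(*runs, key=first_col):
            if first_col(line) != last:
                out.write(line)
                last = first_col(line)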
I guess efficiency/elegance isn't an issue here since this isn't an interview question. Write a hack Python script that creates one table with the first field as the primary key. Parse the file and just insert the records into the database, wrapping the insert in a try/except statement. Then perform a SELECT * on the table, parse the data, and write it back to a file line by line.
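A rough version of that hack with SQLite in Python, assuming the key is the first comma-separated column (the file and table names are placeholders):

import sqlite3

conn = sqlite3.connect("dedup.db")
conn.execute("CREATE TABLE IF NOT EXISTS rows (key TEXT PRIMARY KEY, line TEXT)")

with open("input.txt") as f:
    for line in f:
        key = line.split(",", 1)[0]
        try:
            conn.execute("INSERT INTO rows VALUES (?, ?)", (key, line))
        except sqlite3.IntegrityError:
            pass                             # duplicate key: keep the first row, drop this one
conn.commit()

with open("output.txt", "w") as out:
    for (line,) in conn.execute("SELECT line FROM rows"):
        out.write(line)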
If you go down the database route, you can load the csv into a database and use 'duplicate key update'
using mysql:-
Create a table with columns to match your data (you may be able to get away with just two columns - id and data - with a unique key on id)
dump the data in using something along the lines of
LOAD DATA LOCAL INFILE 'rs.txt' REPLACE INTO TABLE data_table FIELDS TERMINATED BY ',';
You should then be able to dump the data back out in CSV format without duplicates.
If the number of unique keys isn't extremely high, you could simply do this
(pseudocode, since you didn't mention a language):
Set keySet;
while (not end_of_input_file)
    read line from input file
    if first column is not in keySet
        add first column to keySet
        write line to output file
end while
If the input is sorted or can be sorted, then one could do the following, which only needs to store one value in memory:
import sys

r = read_row()
if r is None:
    sys.exit()            # empty input
last = r[0]
write_row(r)
while True:
    r = read_row()
    if r is None:
        sys.exit()        # done
    if r[0] != last:      # input is sorted, so a new key means a new unique row
        write_row(r)
        last = r[0]
Otherwise:
What I'd do is keep a set of the first column values that I have already seen and drop the row if it is in that set.
import sys

S = set()
while True:
    r = read_row()
    if r is None:
        sys.exit()        # done
    if r[0] not in S:     # first time we see this key: keep the row
        write_row(r)
        S.add(r[0])
This will stream over the input using only memory proportional to the size of the set of values from the first column.
If you need to preserve order in your original data, it MAY be sensible to create new data that is a tuple of position and data, then sort on the data you want to de-dup. Once you've sorted by data, de-duplication is (essentially) a linear scan. After that, you can re-create the original order by sorting on the position-part of the tuple, then strip it off.
Say you have the following data: a, c, a, b
With a pos/data tuple, sorted by data, we end up with: 0/a, 2/a, 3/b, 1/c
We can then de-duplicate, trivially being able to choose either the first or last entry to keep (we can also, with a bit more memory consumption, keep another) and get: 0/a, 3/b, 1/c.
We then sort by position and strip that: a, c, b
This would involve three linear scans over the data set and two sorting steps.
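A small in-memory illustration of that idea in Python (for a huge file the two sort steps would of course be external sorts, such as the external merge sort described earlier):

data = ["a", "c", "a", "b"]

tagged = sorted(enumerate(data), key=lambda t: t[1])   # sort (position, value) pairs by value
deduped = []
last = object()                                        # sentinel that equals nothing
for pos, value in tagged:                              # linear scan: keep the first entry per value
    if value != last:
        deduped.append((pos, value))
        last = value

deduped.sort(key=lambda t: t[0])                       # restore the original order by position
result = [value for _, value in deduped]               # -> ['a', 'c', 'b']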
