Efficient set operations in mapreduce - hadoop

I have inherited a mapreduce codebase which mainly calculates the number of unique user IDs seen over time for different ads. To me it doesn't look like it is being done very efficiently, and I would like to know if anyone has any tips or suggestions on how to do this kind of calculation as efficiently as possible in mapreduce.
We use Hadoop, but I'll give an example in pseudocode, without all the cruft:
map(key, value):
  ad_id = ...   // extract from value
  user_id = ... // extract from value
  collect(ad_id, user_id)

reduce(ad_id, user_ids):
  unique_user_ids = new Set()
  foreach (user_id in user_ids):
    unique_user_ids.add(user_id)
  collect(ad_id, unique_user_ids.size)
It's not much code, and it's not very hard to understand, but it's not very efficient. Every day we get more data, and so every day we need to look at all the ad impressions from the beginning to calculate the number of unique user IDs for that ad, so each day it takes longer, and uses more memory. Moreover, without having actually profiled the code (not sure how to do that in Hadoop) I'm pretty certain that almost all of the work is in creating the set of unique IDs. It eats enormous amounts of memory too.
I've experimented with non-mapreduce solutions, and have gotten much better performance (but the question there is how to scale it in the same way that I can scale with Hadoop), but it feels like there should be a better way of doing it in mapreduce than the code I have. It must be a common enough problem for others to have solved.
How do you implement the counting of unique IDs in an efficient manner using mapreduce?

The problem is that the code you inherited was written with the mindset "I'll determine the unique set myself" instead of "let's leverage the framework to do it for me".
I would do something like this (pseudocode) instead:
map(key, value):
  ad_id = ...   // extract from value
  user_id = ... // extract from value
  collect(ad_id & user_id, unused dummy value)

reduce(ad_id & user_id, unused dummy values):
  output(ad_id, 1)   // one unique user_id

map(ad_id, 1):       // --> identity mapper!
  collect(ad_id, 1)

reduce(ad_id, set of a lot of '1's):
  summarize
  output(ad_id, number of unique user_ids)
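For concreteness, here is a rough sketch of the first (de-duplication) job using Hadoop's Java API. It assumes tab-delimited input with ad_id and user_id in the first two fields, which is an assumption - adapt the parsing to your record format. The second pass is then just an identity mapper feeding a summing reducer (Hadoop's stock IntSumReducer can serve as that second reducer).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupJob {

    // Emit the composite key (ad_id, user_id); the shuffle de-duplicates for us.
    public static class DedupMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text compositeKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            String adId = fields[0];    // assumption: field positions
            String userId = fields[1];
            compositeKey.set(adId + "\t" + userId);
            context.write(compositeKey, NullWritable.get());
        }
    }

    // Each distinct (ad_id, user_id) pair arrives here exactly once,
    // so emitting (ad_id, 1) gives one record per unique user per ad.
    public static class DedupReducer
            extends Reducer<Text, NullWritable, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void reduce(Text compositeKey, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            String adId = compositeKey.toString().split("\t")[0];
            context.write(new Text(adId), ONE);
        }
    }
}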

Niels' solution is good, but for an approximate alternative that is closer to the original code and uses only one map-reduce phase, just replace the set with a Bloom filter. Membership queries on a Bloom filter have a small probability of false positives, but the resulting size estimates are very accurate.
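A minimal sketch of that reducer loop, assuming Guava's BloomFilter is available (any Bloom filter implementation would do); the expected-insertions figure and the 1% false-positive rate are tuning assumptions, and false positives make the count a slight underestimate.

import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class ApproxUniqueCount {

    // Counts approximately-unique user IDs in bounded memory.
    static long approxUniqueCount(Iterable<String> userIds, long expectedInsertions) {
        BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), expectedInsertions, 0.01);
        long count = 0;
        for (String userId : userIds) {
            if (!seen.mightContain(userId)) {   // "definitely not seen before"
                seen.put(userId);
                count++;
            }
        }
        return count;
    }
}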

Related

Intersection of Intervals in Apache Pig

In Hadoop I have a collection of datapoints, each including a "startTime" and "endTime" in milliseconds. I want to group on one field then identify each place in the bag where one datapoint overlaps another in the sense of start/end time. For example, here's some data:
0,A,0,1000
1,A,1500,2000
2,A,1900,3000
3,B,500,2000
4,B,3000,4000
5,B,3500,5000
6,B,7000,8000
which I load and group as follows:
inputdata = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
grouped = GROUP inputdata BY where;
The ideal result here would be
(1,2)
(4,5)
I have written some bad code to generate an individual tuple for each second with some rounding, then do a set intersection, but this seems hideously inefficient, and in fact it still doesn't quite work. Rather than debug a bad approach, I want to work on a good approach.
How can I reasonably efficiently get tuples like (id1,id2) for the overlapping datapoints?
I am thoroughly comfortable writing a Java UDF to do the work for me, but it seems as though Pig should be able to do this without needing to resort to a custom UDF.
This is not an efficient solution, and I recommend writing a UDF to do this.
Self-join the dataset with itself to get a cross product of all the combinations. In Pig, it's difficult to join something with itself, so you act as if you are loading two separate datasets. After the cross product, you end up with data like
1,A,1500,2000,1,A,1500,2000
1,A,1500,2000,2,A,1900,3000
.....
At this point, you need to satisfy four conditions:
the "where" field matches
the two ids from the self-join don't match (so you don't get the same datapoint intersecting with itself)
the start time of the second datapoint is greater than or equal to the start time of the first
the start time of the second datapoint is less than or equal to the end time of the first
The code below should work; it might have a syntax error somewhere since I couldn't test it, but it should help you write what you need.
inputdataone = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
inputdatatwo = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
crossProduct = CROSS inputdataone, inputdatatwo;
crossProduct =
  FOREACH crossProduct
  GENERATE inputdataone::id AS id_one,
    inputdatatwo::id AS id_two,
    ((inputdataone::id != inputdatatwo::id
      AND inputdataone::where == inputdatatwo::where
      AND inputdatatwo::start - inputdataone::start >= 0
      AND inputdatatwo::start - inputdataone::end <= 0) ? 1 : 0) AS intersect;
find_intersect = FILTER crossProduct BY intersect == 1;
final =
FOREACH find_intersect
GENERATE id_one,
id_two;
Crossing large sets inflates the data.
A naive solution without crossing would be to partition the intervals (for example into time buckets) and only check for intersections within each partition.
I am working on a similar problem and will provide a code sample when I am done.
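For reference, a rough Java sketch of the no-cross approach within one group: sort by start time, then compare each interval only with later ones that start before it ends. Intervals that merely touch at an endpoint are treated as non-overlapping, which matches the expected output above.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OverlapFinder {

    static class Interval {
        final long id, start, end;
        Interval(long id, long start, long end) { this.id = id; this.start = start; this.end = end; }
    }

    // Returns (id1, id2) pairs of overlapping intervals within a single group.
    static List<long[]> overlappingPairs(List<Interval> group) {
        List<Interval> sorted = new ArrayList<>(group);
        sorted.sort(Comparator.comparingLong((Interval iv) -> iv.start));
        List<long[]> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            for (int j = i + 1; j < sorted.size(); j++) {
                // Later intervals start even later, so none of them can overlap i either.
                if (sorted.get(j).start >= sorted.get(i).end) break;
                pairs.add(new long[] { sorted.get(i).id, sorted.get(j).id });
            }
        }
        return pairs;
    }
}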

Creating DAX peer measure

The scenario:
We are an insurance brokerage company. Our fact table is the Claim Metrics Current table. This table has one row per ClaimSID, so COUNTROWS over it gives the correct count of unique claims. The table also has a ClientSID and an IndustrySID. The relationship between client and industry is that one industry can have multiple clients, while one client can belong to only one industry.
Now, let us consider a fact called claimlagdays, which is present in the table at the granularity of claimsid.
Now, one requirement is that, we need to find out "peer" sum(claimlagdays). This, for a particular client, is basically calculated as:
sum(claimlagdays) for the industry of the client being filtered (minus) sum(claimlagdays) for this particular client. Let's call this measure A.
Similar to the above, we need to calculate the "peer" claim count, which is the claim count for the industry of the client being filtered (minus) the claim count for this particular client.
Let's call this measure B.
In the final calculation, we need to divide A by B, to get the "peer" average lag days.
So basically, the hard part here is this: find the industry of the particular client which is being filtered for, and then apply this filter to the fact table (Claim Metrics Current) to find the total claim count/other metric for this industry only. Then, of course, subtract the client figure from this industry figure to get the "peer" measure. This has to be done for each row, keeping intact any other filters which might be applied in the slicers (date/business unit, etc.).
There are a couple of other static filters which need to be considered; they live in other tables, such as Claim Type (= Indemnity/Medical) and Claim Status (= Closed).
My solution:
For measure B
I tried creating a calculated column, as:
Claim Count_WC_MO_Industry=COUNTROWS(FILTER(FILTER('Claim Metrics Current',RELATED('Claim WC'[WC Claim Type])="Medical" && RELATED('Coverage'[Coverage Code])="WC" && RELATED('Claim Status'[Status Code])="CL"),EARLIER('Claim Metrics Current'[IndustrySID])='Claim Metrics Current'[IndustrySID]))
Then I created the measure
Claim Count - WC MO Peer:=CALCULATE(SUM([Claim Count_WC_MO_Industry])/[Claim - Count])- [Claim - Count WC MO]
{I did a SUM because the tabular model doesn't directly allow me to use a calculated column as a measure without any aggregation. And that wouldn't make sense anyway, since the model wouldn't understand which row to take.}
The second part of the above measure is obviously, the claim count of the particular client, with the above-mentioned filters.
Problem with my solution:
The figures are all wrong. I am not getting a client-wise or year-wise separation of the industry counts or the peer counts. I am only getting a sum of all the industry counts in the measure.
My suspicion is that this is happening because of the sum which is being done. However, I don't really have a choice, do I, as I can't use a calculated column as a measure without some aggregation...
Please let me know if you think the information provided here is not sufficient and if you'd like me to furnish some data (dummy). I would be glad to help.
So assuming that you are filtering for the specific client via a frontend, it sounds like you just want
ClientLagDays :=
CALCULATE (
SUM ( 'Claim Metrics Current'[Lag Days] ),
Static Filters Here
)
Just your base measure of appropriate client lag days, including your static filters.
IndustryLagDays :=
CALCULATE (
[ClientLagDays],
ALL ( 'Claim Metrics Current'[Client] ),
VALUES ( 'Claim Metrics Current'[IndustrySID] )
)
This removes the filter on client but retains the filter on Industry to get the industry-wide total of lag days.
PeerLagDays:=[IndustryLagDays]-[ClientLagDays]
Straightforward enough.
And then repeat for claim counts, and then take [PeerLagDays] / [PeerClaimCount] for your [Average Peer Lag Days].

Pig: Count number of keys in a map

I'd like to count the number of keys in a map in Pig. I could write a UDF to do this, but I was hoping there would be an easier way.
data = LOAD 'hbase://MARS1'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'A:*', '-loadKey true -caching=100000')
AS (id:bytearray, A_map:map[]);
In the code above, I want to basically build a histogram of id and how many items in column family A that key has.
In that hope, I tried c = FOREACH data GENERATE id, COUNT(A_map); but that unsurprisingly didn't work.
Or, perhaps someone can suggest a better way to do this entirely. If I can't figure this out soon I'll just write a Java MapReduce job or a Pig UDF.
SIZE should apparently work for you (I haven't tried it myself); applied to a map it returns the number of keys, so it can replace COUNT in your FOREACH:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SIZE

Windows Azure Paging Large Datasets Solution

I'm using Windows Azure Table Storage to store millions of entities, however I'm trying to figure out the best solution that easily allows for the following:
1) a search on an entity will retrieve that entity and at least (pageSize) entities either side of it
2) if there are more entities beyond (pageSize) entities either side of it, then "page next" or "page previous" links are shown; this continues until either the start or the end is reached
3) the order is reverse chronological
I've decided that the PartitionKey will be the Title provided by the user, as each container is unique in the system. The RowKey uses Steve Marx's lexicographical algorithm:
http://blog.smarx.com/posts/using-numbers-as-keys-in-windows-azure
which when converted to javascript instead of c# looks like this:
pad(new Date(100000000 * 86400000).getTime() - new Date().getTime(), 19) + "_" + uuid()
uuid() is a JavaScript function that returns a GUID, and pad() left-pads with zeros up to 19 characters in length. So records in the system look something like this:
PK RK
TEST 0008638662595845431_ecf134e4-b10d-47e8-91f2-4de9c4d64388
TEST 0008638662595845432_ae7bb505-8594-43bc-80b7-6bd34bb9541b
TEST 0008638662595845433_d527d215-03a5-4e46-8a54-10027b8e23f8
TEST 0008638662595845434_a2ebc3f4-67fe-43e2-becd-eaa41a4132e2
This pattern allows for every new entity inserted to be at the top of the list which satisfies point number 3 above.
With a nice way of adding new records to the system, I thought I would then create a mechanism that looks at the first half of the RowKey, i.e. the 0008638662595845431_ part, and does a greater-than or less-than comparison depending on the paging direction relative to the already-found item. In other words, to get the row immediately before 0008638662595845431 I would do a query like so:
var tableService = azure.createTableService();
var minPossibleDateTimeNumber = pad(new Date(-100000000 * 86400000).getTime() - new Date().getTime(), 19);
tableService.getTable('testTable', function (error) {
  if (error === null) {
    var query = azure.TableQuery
      .select()
      .from('testTable')
      .where('PartitionKey eq ?', 'TEST')
      .and('RowKey gt ?', minPossibleDateTimeNumber + '_')
      .and('RowKey lt ?', '0008638662595845431_')
      .and('Deleted eq ?', 'false');
    // ... run the query against tableService here
  }
});
If the results returned number more than 1000 and Azure gives me a continuation token, then I thought I would remember the last item's RowKey, i.e. the number part 0008638662595845431. The next query would then use that remembered value as its starting value, and so on.
I am using the Windows Azure Node.js SDK and the language is JavaScript.
Can anybody see gotchas or problems with this approach?
I do not see how this can work effectively and efficiently, especially to get the rows for a previous page.
To be efficient, the prefix of your "key" needs to be a serially incrementing or decrementing value, instead of being based on a timestamp. A timestamp-generated value would have duplicates as well as holes, making the mapping from page size to row count at best inefficient and at worst difficult to determine.
Also, this potential algorithm is dependent on a single partition key, destroying table scalability.
The challenge here would be to have a method of generating a serially incremented key. One solution is to use a SQL database and perform an atomic update on a single row, such that an incrementing or decrementing value is produced in sequence. Something like UPDATE … SET X = X + 1 and return X. Maybe using a stored procedure.
So the key could be a zero-left-padded, serially generated number, split so that, say, the first N digits of the number are the partition key and the remaining M digits are the row key.
For example
PKey RKey
00001 10321
00001 10322
….
00954 98912
Now, since the rows are in sequence it is possible to write a query with the exact key range for the page size.
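To make the key scheme concrete, here is an illustrative sketch (in Java rather than Node.js) of splitting a zero-padded serial into PartitionKey/RowKey and computing the serial range one page covers; the 10-digit/5-digit split and one-based serials are assumptions.

public class SerialKeys {

    // Assumption: 10-digit serials; first 5 digits -> PartitionKey, last 5 -> RowKey.
    static String[] toKeys(long serial) {
        String padded = String.format("%010d", serial);
        return new String[] { padded.substring(0, 5), padded.substring(5) };
    }

    // Page n (0-based, newest first) of size pageSize covers serials
    // [maxSerial - (n + 1) * pageSize + 1, maxSerial - n * pageSize].
    static long[] pageRange(long maxSerial, int n, int pageSize) {
        long hi = maxSerial - (long) n * pageSize;
        long lo = Math.max(1, hi - pageSize + 1);
        return new long[] { lo, hi };
    }
}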
Caveat. There is a small risk of a failure occurring between generating a serial key and writing to table storage, in which case there may be holes in the table. However, your paging algorithm should be able to detect and work around such instances quite easily, by specifying a page size slightly larger than necessary or by retrying with an adjusted range.

What would be the best algorithm to find an ID that is not used from a table that has the capacity to hold a million rows

To elaborate ..
a) A table (BIGTABLE) has the capacity to hold a million rows, with a primary key as the ID (random and unique).
b) What algorithm can be used to arrive at an ID that has not been used so far? This number will be used to insert another row into table BIGTABLE.
Updated the question with more details..
c) This table already has about 100K rows, and the primary key is not set as an identity.
d) Currently, a random number is generated as the primary key and a row is inserted into this table; if the insert fails, another random number is generated. The problem is that it sometimes goes into a loop: the numbers generated are pretty random, but unfortunately they already exist in the table. If we retry the random number generation after some time, it works.
e) The Sybase rand() function is used to generate the random number.
Hope this addition to the question helps clarify some points.
The question is of course: why do you want a random ID?
One case where I encountered a similar requirement, was for client IDs of a webapp: the client identifies himself with his client ID (stored in a cookie), so it has to be hard to brute force guess another client's ID (because that would allow hijacking his data).
The solution I went with, was to combine a sequential int32 with a random int32 to obtain an int64 that I used as the client ID. In PostgreSQL:
CREATE FUNCTION lift(integer, integer) returns bigint AS $$
SELECT ($1::bigint << 31) + $2
$$ LANGUAGE SQL;
CREATE FUNCTION random_pos_int() RETURNS integer AS $$
select floor((lift(1,0) - 1)*random())::integer
$$ LANGUAGE sql;
ALTER TABLE client ALTER COLUMN id SET DEFAULT
lift((nextval('client_id_seq'::regclass))::integer, random_pos_int());
The generated IDs are 'half' random, while the other 'half' guarantees you cannot obtain the same ID twice:
select lift(1, random_pos_int()); => 3108167398
select lift(2, random_pos_int()); => 4673906795
select lift(3, random_pos_int()); => 7414644984
...
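The same composition in Java, for illustration: the sequence number occupies the high bits and the randomness the low 31 bits, so two IDs can never collide as long as the sequence only moves forward.

import java.util.Random;

public class HalfRandomIds {

    // Mirrors the SQL lift(): (sequential << 31) + a 31-bit random value.
    static long makeId(int sequential, Random rng) {
        long randomPart = rng.nextInt() & 0x7FFFFFFFL;   // keep 31 random bits
        return ((long) sequential << 31) + randomPart;
    }
}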
Why is the unique ID random? Why not use IDENTITY?
How was the ID chosen for the existing rows?
The simplest thing to do is probably (Select Max(ID) from BIGTABLE) and then make sure your new "Random" ID is larger than that...
EDIT: Based on the added information I'd suggest that you're screwed.
If it's an option: Copy the table, then redefine it and use an Identity Column.
If, as another answer speculated, you do need a truly random identifier: make your PK two fields, an identity field and then a random number.
If you simply can't change the table's structure, checking whether the ID exists before trying the insert is probably your only recourse.
There isn't really a good algorithm for this. You can use this basic construct to find an unused id:
int id;
do {
    id = generateRandomId();
} while (doesIdAlreadyExist(id));
doSomethingWithNewId(id);
Your best bet is to make your key space big enough that the probability of collisions is extremely low, then don't worry about it. As mentioned, GUIDs will do this for you. Or, you can use a pure random number as long as it has enough bits.
This page has the formula for calculating the collision probability.
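For reference, the standard birthday-problem approximation: drawing n IDs uniformly at random from a key space of size N gives a collision probability of roughly 1 - e^(-n(n-1)/(2N)). With a million rows and a 64-bit random key that works out to about 2.7 x 10^-8, i.e. negligible.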
A bit outside of the box.
Why not pre-generate your random numbers ahead of time? That way, when you insert a new row into bigtable, the check has already been made. That would make inserts into bigtable a constant time operation.
You will have to perform the checks eventually, but that could be offloaded to a second process that doesn’t involve the sensitive process of inserting into bigtable.
Or go generate a few billion random numbers, and delete the duplicates, then you won't have to worry for quite some time.
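A minimal sketch of the pool idea, assuming a hypothetical existsInBigTable() lookup: a background task keeps a queue of already-checked IDs topped up, and the insert path just takes the next one.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadLocalRandom;

public class IdPool {

    private final BlockingQueue<Long> pool = new LinkedBlockingQueue<>(10_000);

    // Run this in a background thread: it blocks once the pool is full.
    void refill() throws InterruptedException {
        while (true) {
            long candidate = ThreadLocalRandom.current().nextLong();
            if (!existsInBigTable(candidate)) {
                pool.put(candidate);
            }
        }
    }

    // The insert path: effectively constant time while the pool stays topped up.
    long nextId() throws InterruptedException {
        return pool.take();
    }

    private boolean existsInBigTable(long id) {
        return false;   // placeholder: query BIGTABLE here
    }
}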
Make the key field UNIQUE and IDENTITY and you won't have to worry about it.
If this is something you'll need to do often you will probably want to maintain a live (non-db) data structure to help you quickly answer this question. A 10-way tree would be good. When the app starts it populates the tree by reading the keys from the db, and then keeps it in sync with the various inserts and deletes made in the db. So long as your app is the only one updating the db the tree can be consulted very quickly when verifying that the next large random key is not already in use.
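A sketch of that idea, with a TreeSet standing in for the suggested 10-way tree: load the existing keys once at startup, keep the set in sync with inserts and deletes, and consult it before trying a candidate key.

import java.util.NavigableSet;
import java.util.TreeSet;

public class KeyIndex {

    private final NavigableSet<Long> usedIds = new TreeSet<>();

    // e.g. populated from the table's existing IDs at application startup.
    public KeyIndex(Iterable<Long> existingIds) {
        for (Long id : existingIds) usedIds.add(id);
    }

    public synchronized void onInsert(long id)  { usedIds.add(id); }
    public synchronized void onDelete(long id)  { usedIds.remove(id); }
    public synchronized boolean isFree(long id) { return !usedIds.contains(id); }
}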
Pick a random number, check if it already exists, if so then keep trying until you hit one that doesn't.
Edit: Or
better yet, skip the check and just try to insert the row with different IDs until it works.
First question: Is this a planned database or an already functional one? If it already has data inside, then the answer by bmdhacks is correct. If it is a planned database, here is the second question:
Does your primary key really need to be random? If the answer is yes, then use a function to create a random ID from a known seed, together with a counter that tracks how many IDs have been created. Each ID created increments the counter.
If you keep the seed secret (i.e., declared private), then no one else should be able to predict the next ID.
If ID is purely random, there is no algorithm to find an unused ID in a similarly random fashion without brute forcing. However, as long as the bit-depth of your random unique id is reasonably large (say 64 bits), you're pretty safe from collisions with only a million rows. If it collides on insert, just try again.
Depending on your database you might have the option of using either a sequence (Oracle) or an autoincrement (MySQL, MS SQL, etc.). As a last resort, do a SELECT MAX(id) + 1 as the new ID - just be careful of concurrent requests so you don't end up with the same max ID twice - wrap it in a lock together with the upcoming insert statement.
I've seen this done so many times before via brute force, using random number generators, and it's always a bad idea. Generating a random number outside of the db and attempting to see if it exists will put a lot of strain on your app and database. And it could lead to two processes picking the same ID.
Your best option is to use MySQL's autoincrement ability. Other databases have similar functionality. You are guaranteed a unique id and won't have issues with concurrency.
It is probably a bad idea to scan every value in that table every time looking for a unique value. I think the way to do this would be to have a value in another table, lock on that table, read the value, calculate the value of the next id, write the value of the next id, release the lock. You can then use the id you read with the confidence your current process is the only one holding that unique value. Not sure how well it scales.
Alternatively use a GUID for your ids, since each newly generated GUID is supposed to be unique.
Is it a requirement that the new ID also be random? If so, the best answer is just to loop over (randomize, test for existence) until you find one that doesn't exist.
If the data just happens to be random, but that isn't a strong constraint, you can just use SELECT MAX(idcolumn), increment in a way appropriate to the data, and use that as the primary key for your next record.
You need to do this atomically, so either lock the table or use some other concurrency control appropriate to your DB configuration and schema. Stored procs, table locks, row locks, SELECT...FOR UPDATE, whatever.
Note that in either approach you may need to handle failed transactions. You may theoretically get duplicate key issues in the first (though that's unlikely if your key space is sparsely populated), and you are likely to get deadlocks on some DBs with approaches like SELECT...FOR UPDATE. So be sure to check and restart the transaction on error.
First check if Max(ID) + 1 is not taken and use that.
If Max(ID) + 1 exceeds the maximum then select an ordered chunk at the top and start looping backwards looking for a hole. Repeat the chunks until you run out of numbers (in which case throw a big error).
if the "hole" is found then save the ID in another table and you can use that as the starting point for the next case to save looping.
Skipping the reasoning behind the task itself, the only algorithm that
will give you an ID not already in the table,
can be used to insert a new row into the table, and
will leave the table still containing random unique IDs
is to generate a random number and then check whether it's already used.
The best approach in that case is to generate a random number and do a SELECT to see if it exists, or just try to insert it if your database errors out sanely. Depending on the range of your key versus how many records there are, this could take only a small amount of time. It can also spike and isn't consistent at all.
Would it be possible to run some queries on BIGTABLE and see if there are any ranges that could be exploited? E.g. between 100,000 and 234,000 there are no IDs yet, so we could add IDs there?
Why not append the current date in seconds to the output of your random number generator? That way, the only way to get an identical ID is if two users are created in the same second and are given the same random number by your generator.
