Wikipedia pages co-edit graph extraction using Hadoop

I am trying to build the graph of Wikipedia co-edited pages using Hadoop. The raw data contains the list of edits, i.e. it has one row per edit telling who edited what:
# revisionId pageId userId
1 1 10
2 1 11
3 2 10
4 3 10
5 4 11
I want to extract a graph, in which each node is a page, and there is a link between two pages if at least one editor edited both pages (the same editor). For the above example, the output would be:
# edges: pageId1,pageId2
1,2
1,3
1,4
2,3
I am far from being an expert in Map/Reduce, but I think this has to be done in two jobs:
The first job extracts the list of edited pages for each user.
# userId pageId1,pageId2,...
10 1,2,3
11 1,4
The second job takes the output above, and simply generates all pairs of pages that each user edited (these pages have thus been edited by the same user, and will therefore be linked in the graph). As a bonus, we can actually count how many users co-edited each page, to get the weight of each edge.
# pageId1,pageID2 weight
1,2 1
1,3 1
1,4 1
2,3 1
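For concreteness, a minimal sketch of what such a second-job mapper could look like (simplified; input lines are assumed to be "userId<TAB>pageId1,pageId2,..." from the first job):
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one ("pageId1,pageId2", 1) record per pair of pages edited by the same user.
public class PagePairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] pages = value.toString().split("\t")[1].split(",");
        Arrays.sort(pages); // canonical order so (a,b) and (b,a) become the same key
        for (int i = 0; i < pages.length; i++) {
            for (int j = i + 1; j < pages.length; j++) {
                context.write(new Text(pages[i] + "," + pages[j]), ONE);
            }
        }
    }
}
The reducer then simply sums the 1s per pair to get the edge weight.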
I implemented this using Hadoop, and it works. The problem is that the map phase of the second job is really slow (actually, the first 30% are OK, but then it slows down quite a lot). The reason I came up with is that some users have edited many pages, so the mapper has to generate a lot of these pairs as output. Hadoop thus has to spill to disk, rendering the whole thing pretty slow.
My questions are thus the following:
For those of you who have more experience than I with Hadoop: am I doing it wrong? Is there a simpler way to extract this graph?
Can disk spills be the reason why the map phase of the second job is pretty slow? How can I avoid this?
As a side note, this runs fine with a small sample of the edits. It only gets slow with GBs of data.

Apparently, this is a common problem known as combinations/cross-correlation/co-occurrences, and there are two patterns to solve it using Map/Reduce, Pairs or Stripes:
Map Reduce Design Patterns :- Pairs & Stripes
MapReduce Patterns, Algorithms, and Use Cases (Cross-correlation section)
Pairs and Stripes
The way I presented in my question is the pairs approach, which usually generates much more data. The stripes approach benefits more from a combiner, and gave better results in my case.
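For reference, a rough sketch of what a stripes-style mapper can look like for this problem (simplified; it assumes the same "userId<TAB>pageId1,pageId2,..." input as the pairs job):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For every page a user edited, emit one "stripe": a map from each co-edited page to a count of 1.
public class PageStripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] pages = value.toString().split("\t")[1].split(",");
        for (String page : pages) {
            MapWritable stripe = new MapWritable();
            for (String other : pages) {
                if (!other.equals(page)) {
                    stripe.put(new Text(other), new IntWritable(1));
                }
            }
            context.write(new Text(page), stripe);
        }
    }
}
A combiner and the reducer then merge stripes for the same page by summing the per-page counts element-wise, so far fewer (but larger) records cross the shuffle.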

Related

Normalize SPARK RDD partitions using reduceByKey(numPartitions) or repartition

Using Spark 2.4.0.
My production data is extremely skewed, so one of the tasks was taking 7x longer than everything else.
I tried different strategies to normalize the data so that all executors worked equally -
spark.default.parallelism
reduceByKey(numPartitions)
repartition(numPartitions)
My expectation was that all three of them should partition evenly; however, playing with some dummy non-production data on Spark Local/Standalone suggests that options 1 and 2 normalize better than option 3.
Data as below (I am trying to do a simple reduce of the balance per account+ccy combination):
account}date}ccy}amount
A1}2020/01/20}USD}100.12
A2}2010/01/20}SGD}200.24
A2}2010/01/20}USD}300.36
A1}2020/01/20}USD}400.12
The expected result should be [A1-USD,500.24], [A2-SGD,200.24], [A2-USD,300.36]. Ideally these should be partitioned into 3 different partitions.
javaRDDWithoutHeader
.mapToPair((PairFunction<Balance, String, Integer>) balance -> new Tuple2<>(balance.getAccount() + balance.getCcy(), 1))
.mapToPair(new MyPairFunction())
.reduceByKey(new ReductionFunction())
Code to check partitions
System.out.println("b4 = " +pairRDD.getNumPartitions());
System.out.println(pairRDD.glom().collect());
JavaPairRDD<DummyString, BigDecimal> newPairRDD = pairRDD.repartition(3);
System.out.println("Number of partitions = " +newPairRDD.getNumPartitions());
System.out.println(newPairRDD.glom().collect());
Option 1: Doing nothing
Option 2: Setting spark.default.parallelism to 3
Option 3: reduceByKey with numPartitions = 3
Option 4: repartition(3)
For Option 1
Number of partitions = 2
[
[(DummyString{account='A2', ccy='SGD'},200.24), (DummyString{account='A2', ccy='USD'},300.36)],
[(DummyString{account='A1', ccy='USD'},500.24)]
]
For option 2
Number of partitions = 3
[
[(DummyString{account='A1', ccy='USD'},500.24)],
[(DummyString{account='A2', ccy='USD'},300.36)],
[(DummyString{account='A2', ccy='SGD'},200.24)]]
For option 3
Number of partitions = 3
[
[(DummyString{account='A1', ccy='USD'},500.24)],
[(DummyString{account='A2', ccy='USD'},300.36)],
[(DummyString{account='A2', ccy='SGD'},200.24)]
]
For option 4
Number of partitions = 3
[
[],
[(DummyString{account='A2', ccy='SGD'},200.24)],
[(DummyString{account='A2', ccy='USD'},300.36), (DummyString{account='A1', ccy='USD'},500.24)]
]
Conclusion: options 2 (spark.default.parallelism) and 3 (reduceByKey(numPartitions)) normalized much better than option 4 (repartition).
The results are fairly deterministic; I never saw option 4 normalize into 3 partitions.
Questions:
Is reduceByKey(numPartitions) much better than repartition, or
is this just because the sample data set is so small, or
is this behavior going to be different when we submit via a YARN cluster?
I think there are a few things running through the question, which makes it harder to answer.
Firstly there are the partitioning and parallelism related to the data at rest and thus when read in; without re-boiling the ocean, here is an excellent SO answer that addresses this: How spark read a large file (petabyte) when file can not be fit in spark's main memory. In any event, there is no hashing or anything going on, just "as is".
Also, RDDs are not well optimized compared to DFs.
Various operations in Spark cause shuffling after an Action is invoked:
reduceByKey will cause less shuffling, since it uses local per-partition aggregation and hashing for the final aggregation, which is more efficient
repartition does as well; it uses randomness
partitionBy(new HashPartitioner(n)), etc., which you do not allude to
reduceByKey(aggr. function, N partitions), which oddly enough appears to be more efficient than a repartition first
Your latter comment alludes to data skewness, typically: too many entries hash to the same "bucket"/partition for the reduceByKey. Alleviate this by one of the following:
In general, try with a larger number of partitions up front (when reading in), using suitable hashing - but I cannot see your transforms and methods here, so we leave this as general advice.
Or, in some cases, "salt" the key by adding a random suffix, reduceByKey, then strip the suffix and reduceByKey again to "unsalt" back to the original key (see the sketch below). Whether this pays off depends on the extra time taken vs. leaving it as is or performing the other options.
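To make the salting idea concrete, a minimal sketch (assuming a JavaPairRDD<String, BigDecimal> keyed by account+ccy; variable and bucket names are illustrative, and scala.Tuple2 plus java.util.concurrent.ThreadLocalRandom are assumed to be imported):
// Salt the key, reduce partially, strip the salt, reduce again.
final int SALT_BUCKETS = 8;  // tune to the observed skew

JavaPairRDD<String, BigDecimal> partiallyReduced = keyedRdd
    .mapToPair(t -> new Tuple2<>(
        t._1() + "#" + ThreadLocalRandom.current().nextInt(SALT_BUCKETS), t._2()))
    .reduceByKey(BigDecimal::add);   // the hot key is now spread across SALT_BUCKETS sub-keys

JavaPairRDD<String, BigDecimal> totals = partiallyReduced
    .mapToPair(t -> new Tuple2<>(t._1().substring(0, t._1().lastIndexOf('#')), t._2()))
    .reduceByKey(BigDecimal::add);   // second, much smaller reduce restores the original key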
repartition(n) applies random ordering, so you shuffle and then need to shuffle again. Unnecessary, imo. As another post shows (see the comments on your question), it looks like unnecessary work, but these are old-style RDDs.
All of this is easier to do with DataFrames, BTW.
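For example, a rough DataFrame version of the same aggregation might look like this (a sketch only: the file name and column types are illustrative, and col/sum are the static functions from org.apache.spark.sql.functions):
Dataset<Row> df = spark.read()                    // spark is an existing SparkSession
    .option("header", "true")
    .option("delimiter", "}")
    .csv("balances.txt");                         // hypothetical path

Dataset<Row> totals = df
    .groupBy(col("account"), col("ccy"))
    .agg(sum(col("amount").cast("decimal(20,2)")).alias("balance"));

totals.show();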
As we are not privy to your complete coding, hope this helps.

Find Top 10 Most Frequently Visited URLs, data is stored across network

Source: Google Interview Question
Given a large network of computers, each keeping log files of visited urls, find the top ten most visited URLs.
Have many large <string (url) -> int (visits)> maps.
Calculate < string (url) -> int (sum of visits among all distributed maps), and get the top ten in the combined map.
Main constraint: The maps are too large to transmit over the network. Also can't use MapReduce directly.
I have now come across quite a few questions of this type, where processing needs to be done over large distributed systems. I can't think of or find a suitable answer.
All I could think of is brute force, which in some way or other violates the given constraint.
It says you can't use map-reduce directly, which is a hint that the author of the question wants you to think about how map-reduce works, so we will just mimic the actions of map-reduce:
Pre-processing: let R be the number of servers in the cluster; give each server a unique id from 0,1,2,...,R-1.
(map) For each (string, count) pair, send the tuple to the server whose id is hash(string) % R.
(reduce) Once step 2 is done (simple control communication), produce the (string, count) of the top 10 strings per server. Note that the tuples considered are those sent in step 2 to this particular server.
(map) Each server will send its top 10 to one server (let it be server 0). This should be fine; there are only 10*R of those records.
(reduce) Server 0 will yield the top 10 across the network.
Notes:
The problem with this algorithm, like most big-data algorithms that don't use frameworks, is handling failing servers. MapReduce takes care of that for you.
The above algorithm can be translated into a 2-phase map-reduce algorithm fairly straightforwardly.
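To make steps 2 and 3 concrete, here is a small plain-Java sketch of the routing rule and of the per-server top-10 selection (class and method names are illustrative):
import java.util.*;
import java.util.stream.*;

class TopUrls {
    // Step 2: which server owns a given URL.
    static int ownerServer(String url, int numServers) {
        return Math.floorMod(url.hashCode(), numServers);
    }

    // Step 3: after a server has summed the counts for the URLs it owns,
    // keep only its local top 10.
    static List<Map.Entry<String, Long>> localTopTen(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .collect(Collectors.toList());
    }
}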
In the worst case, any algorithm that does not require transmitting the whole frequency table is going to fail: we can construct a trivial case where the global top 10 all sit at the bottom of every individual machine's list.
If we assume that the frequencies of the URIs follow Zipf's law, we can come up with effective solutions. One such solution follows.
Each machine sends its top-K elements; K depends solely on the available bandwidth. One master machine aggregates the frequencies and finds the 10th largest aggregated frequency, "V10" (note that this is a lower bound: since the global top 10 may not be in the top K of every machine, the sums are incomplete).
In the next step, every machine sends a list of URIs whose local frequency is at least V10/M (where M is the number of machines). The union of all such lists is sent back to every machine. Each machine, in turn, sends back its frequencies for this particular list. The master aggregates these into the final top-10 list.
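A sketch of the master's side of this two-round protocol, under the assumptions above (class and method names are illustrative; the per-machine counts are modelled as plain maps):
import java.util.*;
import java.util.stream.*;

class ZipfTopTen {
    // Round 1: merge the local top-K counts and take the 10th largest partial sum
    // as a lower bound V10 on the global 10th frequency.
    static long lowerBoundV10(List<Map<String, Long>> localTopK) {
        Map<String, Long> partialSums = new HashMap<>();
        for (Map<String, Long> m : localTopK)
            m.forEach((uri, c) -> partialSums.merge(uri, c, Long::sum));
        return partialSums.values().stream()
                .sorted(Comparator.reverseOrder())
                .skip(9).findFirst().orElse(0L);   // 10th largest partial sum
    }

    // Round 2: each machine has reported exact counts for every URI whose local
    // frequency is at least V10 / M; sum them and take the global top 10.
    static List<String> globalTopTen(List<Map<String, Long>> secondRound) {
        Map<String, Long> exactTotals = new HashMap<>();
        for (Map<String, Long> m : secondRound)
            m.forEach((uri, c) -> exactTotals.merge(uri, c, Long::sum));
        return exactTotals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10).map(Map.Entry::getKey).collect(Collectors.toList());
    }
}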

Hadoop Controlling Split Size for Similar Sections of Input Data

I've got a situation where my input data looks something like the following.
AA1
AA2
AA3
AA4
BB1
BB2
BB3
CC1
CC2
CC3
CC4
CC5
CC6
What I want to do is split the data up into InputSplits where each split covers a section of the strings that begin with certain leading letters. For example the 1st input split would be all strings that start with "AA", the 2nd split would be those that start with "BB", etc.
I want to do it this way because my data needs to be together like that in order for the reduce phase to correctly operate.
What I've been playing around with so far is writing my own InputFormat and RecordReader classes to do this; however, I see in some examples (http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat) that the splits are already created by the time the reader gets to them. I believe I run the risk of having splits that do not align correctly with the boundaries between strings.
In order to make this work fully, do I have to implement my own version of the InputFormat getSplits function? If I do this, do I run a risk of distributing my splits across machines in a fashion that does not take advantage of machine locality? Finally, is there a better way to do this in general?
Any help is appreciated. Thanks,
mj
EDIT 0
I'm including more information per the request of several commentators.
The objective of my program is to compare strings that belong in groups to find the overlap between those strings and record which strings together share that overlap. Consider the following example.
AAAA
AAAB
AAAC
AAB
BAAA
All the strings that share an "A" at the beginning have some overlap that is common between them. The one that starts with "B" obviously doesn't. When it comes to actually discovering what the specific overlap is, and building those groups, if I'm looking at "AAAA", I need to compare all the way down to "BAAA" and no further. My concern is that the InputSplits will chop up my data such that certain strings won't be compared against each other and I will have missing/incomplete groups. I was hoping to use the Map step (or reading of data) to split the problem into these groups and then allow the Reduce step to calculate the groups and return results.
I've got millions of strings like this and it takes a while on a single machine. I've logically implemented a ton of "tricks" to streamline the process and make it run fast. I was hoping Hadoop could step in and help and make it even faster.
Joe K - to answer your question, I don't know the extent of the overlap between all strings. The overlap can differ; for example, AA1 can overlap 2 characters all the way through AA4, but if AB5 were present, only 1 character would overlap. The strings can vary greatly in length, so you may get huge overlap in other instances. Also, detecting what exactly the overlap is, is what I wanted to do in the reduce phase. That's what my whole objective was.
I don't know if the shuffle/sort phase will correctly distribute the adjacent strings to the same reducer or if there will be breaks. My ignorance of the process unfortunately is tripping me up here.
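For concreteness, the kind of map step I have in mind (just a sketch of the idea, not working code from my project) would emit a leading prefix of each string as the key, so that the shuffle/sort delivers all strings sharing that prefix to the same reduce call:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Key each string by its first PREFIX_LEN characters; the shuffle then groups
// all strings with the same leading characters into one reducer call.
public class PrefixGroupMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int PREFIX_LEN = 2;  // e.g. "AA", "BB", "CC"

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String s = value.toString().trim();
        if (s.isEmpty()) return;
        String prefix = s.substring(0, Math.min(PREFIX_LEN, s.length()));
        context.write(new Text(prefix), new Text(s));
    }
}
Whether a fixed-length prefix is the right grouping key for my real overlap definition is exactly the part I'm unsure about.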
EDIT 0 END

How do you implement ranking and sorting in map/reduce?

I'm learning the Java map/reduce API in Hadoop and trying to wrap my head around thinking in map/reduce. Here's a sample program I'm writing against Apache HTTP server log files; it has two phases (each implemented as an M/R job and then chained together):
Count the number of times each IP address accessed the server
Find the top 5 IP addresses (most requests)
Phase 1 seems pretty trivial; it's a simple counting implementation in map/reduce, and it emits something like the following:
192.168.0.2 4
10.0.0.2 7
127.0.0.1 3
...etc
This output would feed into the mapper of the second map/reduce job.
Now I'm confused about how to implement the top 5 in a parallel way. Since reducers are sequential in nature, I'm guessing there would only be one reducer that goes over the full list to sort it, right? How do you go about solving step #2 in a parallel way?
First of all, if the output of the first job is small enough that you don't need to parallelize it, consider:
hadoop fs -cat joboutput/part-* | sort -k2 -n | head -n5
This will likely be faster than sending it all to one reducer in many cases!
Sorting in Hadoop is pretty rough when you try to get away from using only 1 reducer. If you are interested in sorting, try checking out TotalOrderPartitioner. By searching the web for that you should find some examples. The fundamental solution is that you have to partition your values into ascending-value bins with a custom partitioner. Then, each bin is sorted naturally. You output, and you have a sorted set.
The hard part is figuring out how to put data into which bins.
If you are interested in top-5 specifically (or top-50, whatever), there is an interesting way to do that. The basic premise is that you take the top 5 in each mapper, then take the top 5 of those top 5's in the reducer. Each mapper effectively sends its top five to the reducer to compete for the true top five, kind of like a tournament. You are guaranteed to have the true top 5 among what reaches the reducer; you just need to weed the rest out.
To keep track of the top-5 in both the mapper and reducer, I like to use a TreeMap. Basically, keep inserting values, and keep truncating it to the top 5. In the Mapper#cleanup method, write out the top 5 records (don't write out during the map itself). Do the same for the reducer.
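A sketch of that mapper (assuming the first job's output lines are "<ip><TAB><count>"; note that a plain TreeMap keyed on the count silently drops ties, which may or may not matter to you):
import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TopFiveMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private final TreeMap<Long, String> topFive = new TreeMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        String[] parts = value.toString().split("\t");
        topFive.put(Long.parseLong(parts[1]), value.toString());
        if (topFive.size() > 5) {
            topFive.remove(topFive.firstKey());  // evict the smallest count
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Only the surviving five records are ever written out.
        for (String record : topFive.descendingMap().values()) {
            context.write(NullWritable.get(), new Text(record));
        }
    }
}
The reducer does the same thing over the (at most) 5 x number-of-mappers records it receives.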
I'll plug Apache Pig here for something like this. It might not be as effective as the options above, but it sure is easier to code.
loaded = LOAD 'joboutput/' USING PigStorage('\t') AS (ip:chararray, cnt:int);
sorted = ORDER loaded BY cnt DESC;
top = LIMIT sorted 5;
dump top;
Sorry that something as simple as sorting is not as straightforward as you might have imagined in Hadoop. Some things are going to be easy (e.g., the ip counting that you did) and others are going to be hard (sorting, joins). Just the nature of the beast.

Finding sets that have specific subsets

I am a graduate student of physics and I am working on writing some code to sort several hundred gigabytes of data and return slices of that data when I ask for it. Here is the trick: I know of no good method for sorting and searching data of this kind.
My data essentially consists of a large number of sets of numbers. These sets can contain anywhere from 1 to n numbers within them (though in 99.9% of the sets, n is less than 15) and there are approximately 1.5 ~ 2 billion of these sets (unfortunately this size precludes a brute force search).
I need to be able to specify a set with k elements and have every set with k+1 elements or more that contains the specified subset returned to me.
Simple Example:
Suppose I have the following sets for my data:
(1,2,3)
(1,2,3,4,5)
(4,5,6,7)
(1,3,8,9)
(5,8,11)
If I were to give the request (1,3) I would have the sets: (1,2,3),
(1,2,3,4,5), and (1,3,8,9).
The request (11) would return the set: (5,8,11).
The request (1,2,3) would return the sets: (1,2,3) and (1,2,3,4,5)
The request (50) would return no sets:
By now the pattern should be clear. The major difference between this example and my data is that the sets within my data are larger, the numbers used for each element of the sets run from 0 to 16383 (14 bits), and there are many many many more sets.
If it matters, I am writing this program in C++, though I also know Java, C, some assembly, some Fortran, and some Perl.
Does anyone have any clues as to how to pull this off?
edit:
To answer a couple questions and add a few points:
1.) The data does not change. It was all taken in one long set of runs (each broken into 2 gig files).
2.) As for storage space: the raw data takes up approximately 250 gigabytes. I estimate that after processing and stripping off a lot of extraneous metadata that I am not interested in, I could knock that down to anywhere from 36 to 48 gigabytes, depending on how much metadata I decide to keep (without indices). Additionally, if in my initial processing of the data I encounter enough sets that are the same, I might be able to compress the data yet further by adding counters for repeat events rather than simply repeating the events over and over again.
3.) Each number within a processed set actually contains at LEAST two numbers 14 bits for the data itself (detected energy) and 7 bits for metadata (detector number). So I will need at LEAST three bytes per number.
4.) My "though in 99.9% of the sets, n is less than 15" comment was misleading. In a preliminary glance through some of the chunks of the data I find that I have sets that contain as many as 22 numbers but the median is 5 numbers per set and the average is 6 numbers per set.
5.) While I like the idea of building an index of pointers into files, I am a bit leery because for requests involving more than one number I am left with the semi-slow task (at least I think it is slow) of finding the set of all pointers common to the lists, i.e. finding the intersection of the pointer lists for a given group of numbers.
6.) In terms of resources available to me, I can muster approximately 300 gigs of space after I have the raw data on the system (the remainder of my quota on that system). The system is a dual-processor server with 2 quad-core AMD Opterons and 16 gigabytes of RAM.
7.) Yes, 0 can occur; it is an artifact of the data acquisition system when it does, but it can occur.
Your problem is the same as that faced by search engines: "I have a bajillion documents. I need the ones which contain this set of words." You just have (very conveniently) integers instead of words, and smallish documents. The solution is an inverted index. Introduction to Information Retrieval by Manning et al. is (at that link) available free online, is very readable, and goes into a lot of detail about how to do this.
You're going to have to pay a price in disk space, but it can be parallelized, and should be more than fast enough to meet your timing requirements, once the index is constructed.
Assuming a random distribution of 0-16383, with a consistent 15 elements per set, and two billion sets, each element would appear in approximately 1.8M sets. Have you considered (and do you have the capacity for) building a 16384x~1.8M (30B entries, 4 bytes each) lookup table? Given such a table, you could query which sets contain (1) and (17) and (5555) and then find the intersections of those three ~1.8M-element lists.
My guess is as follows.
Assume that each set has a name or ID or address (a 4-byte number will do if there are only 2 billion of them).
Now walk through all the sets once, and create the following output files:
A file which contains the IDs of all the sets which contain '1'
A file which contains the IDs of all the sets which contain '2'
A file which contains the IDs of all the sets which contain '3'
... etc ...
If there are 16 entries per set, then on average each of these 16384 files will contain the IDs of roughly 2 million sets; with each ID being 4 bytes, this would require about 2^37 bytes (128 GB) of storage.
You'll do the above once, before you process requests.
When you receive requests, use these files as follows:
Look at a couple of numbers in the request
Open up a couple of the corresponding index files
Get the list of all sets which exist in both these files (there are only a couple of million IDs in each file, so this shouldn't be difficult)
See which of these few sets satisfy the remainder of the request
My guess is that if you do the above, creating the indexes will be (very) slow and handling requests will be (very) quick.
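A sketch of the one-time index build described above, with the per-value files modelled as in-memory lists to keep it short (in practice each list would be an append-only file of 4-byte IDs):
import java.util.*;

class IndexBuilder {
    static final int NUM_VALUES = 16384;           // values run from 0 to 16383

    static List<List<Integer>> build(Iterable<int[]> allSets) {
        List<List<Integer>> index = new ArrayList<>(NUM_VALUES);
        for (int v = 0; v < NUM_VALUES; v++) index.add(new ArrayList<>());
        int setId = 0;
        for (int[] set : allSets) {
            for (int value : set) {
                index.get(value).add(setId);       // IDs are appended in ascending order
            }
            setId++;
        }
        return index;                              // index.get(v) = IDs of sets containing v
    }
}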
I have recently discovered methods that use space-filling curves to map multi-dimensional data down to a single dimension. One can then index the data based on its 1D index. Range queries can be easily carried out by finding the segments of the curve that intersect the box that represents the query, and then retrieving those segments.
I believe that this method is far superior to building the insane indexes suggested above because, after looking at it, the index would be as large as the data I wished to store, which is hardly a good thing. A somewhat more detailed explanation of this can be found at:
http://www.ddj.com/184410998
and
http://www.dcs.bbk.ac.uk/~jkl/publications.html
Make 16383 index files, one for each possible search value. For each value in your input set, write the file position of the start of the set into the corresponding index file. It is important that each of the index files contains the same number for the same set. Now each index file will consist of ascending indexes into the master file.
To search, start reading the index files corresponding to each search value. If you read an index that's lower than the index you read from another file, discard it and read another one. When you get the same index from all of the files, that's a match - obtain the set from the master file, and read a new index from each of the index files. Once you reach the end of any of the index files, you're done.
If your values are evenly distributed, each index file will contain 1/16383 of the input sets. If your average search set consists of 6 values, you will be doing a linear pass over 6/16383 of your original input. It's still an O(n) solution, but your n is a bit smaller now.
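A sketch of that merge-style intersection, with iterators over ascending IDs standing in for the index files (the logic is the same whether the IDs stream from memory or from disk):
import java.util.*;

class IndexIntersect {
    // Given one ascending iterator of set IDs per searched value,
    // return the IDs present in every list.
    static List<Long> intersect(List<Iterator<Long>> lists) {
        List<Long> matches = new ArrayList<>();
        long[] current = new long[lists.size()];
        for (int i = 0; i < lists.size(); i++) {
            if (!lists.get(i).hasNext()) return matches;   // an empty list means no matches
            current[i] = lists.get(i).next();
        }
        while (true) {
            final long max = Arrays.stream(current).max().getAsLong();
            if (Arrays.stream(current).allMatch(v -> v == max)) {
                matches.add(max);
                // advance every list past the match
                for (int i = 0; i < lists.size(); i++) {
                    if (!lists.get(i).hasNext()) return matches;
                    current[i] = lists.get(i).next();
                }
            } else {
                // advance the lists that are behind the current maximum
                for (int i = 0; i < lists.size(); i++) {
                    while (current[i] < max) {
                        if (!lists.get(i).hasNext()) return matches;
                        current[i] = lists.get(i).next();
                    }
                }
            }
        }
    }
}
Here one iterator is opened per value in the request, so the IDs that survive the intersection are exactly the matching sets.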
P.S. Is zero an impossible result value, or do you really have 16384 possibilities?
Just playing devil's advocate for an approach which combines brute force + index lookup:
Create an index with the min, max and number of elements of each set.
Then apply brute force, excluding sets whose max is less than the max of the set being searched or whose min is greater than the min of the set being searched.
In the brute-force pass, also exclude sets whose element count is less than that of the set being searched.
95% of your searches would really be brute-forcing a much smaller subset. Just a thought.
