Embarrassingly parallel example - parallel-processing

I am totally new to the world of parallel programming. Suppose there is a Java program called Task A that contains 10 subtasks T1...T10, where T1 reads data from table A1, T2 from table A2, ..., T10 from table A10, all using the same key, say employee_id. Is this problem called embarrassingly parallel? I was thinking of employing 10 threads, one for each subtask.
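A minimal sketch of the 10-thread fan-out described in the question, in Python rather than Java (the structure is the same); read_employee_rows is a hypothetical placeholder for whatever per-table query each subtask would run:

from concurrent.futures import ThreadPoolExecutor

TABLES = ['A%d' % i for i in range(1, 11)]          # tables A1 .. A10

def read_employee_rows(table, employee_id):
    # placeholder: run "SELECT ... FROM <table> WHERE employee_id = ?" here
    return []

def run_task_a(employee_id):
    # each subtask touches a different table and shares nothing with the others,
    # so the reads can run concurrently without any coordination between them
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = {t: pool.submit(read_employee_rows, t, employee_id) for t in TABLES}
        return {t: f.result() for t, f in futures.items()}

results = run_task_a('E12345')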

Related

Optimize hive query to avoid JOIN

My question is similar to this one, except I want to know whether I can do it in one query. This is what I have working, but as we all know, joins are expensive. Is there a better HQL way to do this?
select a.tbl1,b.tbl2
from
(
select count(*) as tbl1 from tbl1
) a
join
(
select count(*) as tbl2 from tbl2
) b ON 1=1
Yes, joins are expensive
When it is said that joins are expensive, this typically refers to the situation where you have many records in multiple tables that need to be matched with each other.
By that description your join is not expensive, as you only join two sets with one record each.
But you are probably looking at overhead
Perhaps you have noticed that the individual counts take significantly less time than the command you use to count and combine the results. This is because map and reduce operations have significant overhead (which can be 30 seconds per stage).
You can play around a bit to see whether you hit a plan that does not incur much overhead, but it could well be that you are out of luck, as Hive does not scale down that well.
If it is not critical for you to keep the counts as separate columns, you can use a UNION ALL operation and work with them in row format:
select 'tbl1', count(*) from tbl1
UNION ALL
select 'tbl2', count(*) from tbl2;
This allows you to avoid the extra MAPJOIN operator in your original query. Technically, you end up with one less mapper in the final execution plan.
Update
In up-to-date Hadoop distributions you will not see much performance difference between the UNION and MAPJOIN approaches, as these operations are optimized within the preceding jobs. But keep in mind that on older cluster versions, or depending on some configuration properties, the MAPJOIN could be converted into a separate job.

Vertica query optimization

I want to optimize a query in a Vertica database. I have a table like this
CREATE TABLE data (a INT, b INT, c INT);
and a lot of rows in it (billions)
I fetch some data using this query
SELECT b, c FROM data WHERE a = 1 AND b IN ( 1,2,3, ...)
but it runs slowly. The query plan shows something like this
[Cost: 3M, Rows: 3B (NO STATISTICS)]
The same is shown when I perform explain on
SELECT b, c FROM data WHERE a = 1 AND b = 1
It looks like a scan over some part of the table. In other databases I can create an index to make such a query really fast, but what can I do in Vertica?
Vertica does not have a concept of indexes. If this is a query that you feel is run frequently enough, you would want to create a query-specific projection, for example using the Database Designer. Each time you create a projection, the data is physically copied and stored on disk.
I would recommend reviewing projection concepts in the documentation.
If you see a NO STATISTICS message in the plan, you can run ANALYZE_STATISTICS on the object.
For further optimization, you might want to use a JOIN rather than IN. Consider using partitions if appropriate.
Creating good projections is the "secret-sauce" of how to make Vertica perform well. Projection design is a bit of an art-form, but there are 3 fundamental concepts that you need to keep in mind:
1) SEGMENTATION: For every row, this determines which node to store the data on, based on the segmentation key. This is important for two reasons: a) DATA SKEW -- if data is heavily skewed then one node will do too much work, slowing down the entire query. b) LOCAL JOINS - if you frequently join two large fact tables, then you want the data to be segmented the same way so that the joined records exist on the same nodes. This is extremely important.
2) ORDER BY: If you are performing frequent FILTER operations in the where clause, such as in your query WHERE a=1, then consider ordering the data by this key first. Ordering will also improve GROUP BY operations. In your case, you would order the projection by columns a then b. Ordering correctly allows Vertica to perform MERGE joins instead of HASH joins which will use less memory. If you are unsure how to order the columns, then generally aim for low to high cardinality which will also improve your compression ratio significantly.
3) PARTITIONING: By partitioning your data on a column which is frequently used in queries, such as transaction_date, you allow Vertica to perform partition pruning, which reads much less data. It also helps during insert operations, since only one small ROS container is affected instead of the entire file.

How to filter a very, very large file

I have a very large unsorted file, 1000GB, of ID pairs
ID:ABC123 ID:ABC124
ID:ABC123 ID:ABC124
ID:ABC123 ID:ABA122
ID:ABC124 ID:ABC123
ID:ABC124 ID:ABC126
I would like to filter the file for
1) duplicates
example
ABC123 ABC124
ABC123 ABC124
2) reverse pairs (discard the second occurrence)
example
ABC123 ABC124
ABC124 ABC123
After filtering, the example file above would look like
ID:ABC123 ID:ABC124
ID:ABC123 ID:ABA122
ID:ABC124 ID:ABC126
Currently, my solution is this
my %hash;
while (my $line = <FH>) {
    chomp $line;                       # remove \n
    my ($id1, $id2) = split / /, $line;
    if (exists $hash{$id1.$id2} || exists $hash{$id2.$id1}) {
        next;
    }
    else {
        $hash{$id1.$id2} = undef;      # store it in a hash
        print "$line\n";
    }
}
which gives me the desired results for smaller lists, but takes up too much memory for larger lists, as I am storing the hash in memory.
I am looking for a solution that will take less memory to implement.
Some thoughts I have are
1) save the hash to a file, instead of memory
2) multiple passes over the file
3) sorting and uniquing the file with unix sort -u -k1,2
After posting on the Computer Science Stack Exchange, an external sort algorithm was suggested.
You could use MapReduce for this task.
MapReduce is a framework for batch processing that lets you easily distribute your work among several machines and use parallel processing without having to take care of synchronization and fault tolerance yourself.
map(id1, id2):
    if id1 < id2:
        yield (id1, id2)
    else:
        yield (id2, id1)

reduce(id1, list<ids>):
    ids = hashset(ids)      // fairly small per id
    for each id2 in ids:
        yield (id1, id2)
The MapReduce implementation will allow you to distribute your work over several machines with very little extra programming work required.
This algorithm also requires a linear (and fairly small) number of traversals over the data, with a fairly small amount of extra memory, assuming each ID is associated with a small number of other IDs.
Note that this will alter the order within the pairs (making the first ID the second in some cases).
If the original order of the IDs matters, you can solve that fairly easily with an extra field.
Also note that the order of the data is altered, and there is no way to avoid this when using MapReduce.
For better efficiency, you might want to add a combiner, which in this case will do the same job as the reducer, but whether it actually helps depends a lot on the data.
Hadoop is an open-source framework that implements MapReduce and is widely used in the community.
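For instance, with Hadoop Streaming the mapper and reducer above could be written as two small Python scripts along these lines (an untested sketch; it assumes one space-separated pair per input line and makes the whole reordered pair the key, so the reducer only has to drop adjacent duplicates instead of building a hash set per ID):

#!/usr/bin/env python
# mapper.py: put the two IDs of each pair into a canonical (sorted) order;
# emitting no tab makes the whole line the key for the shuffle/sort phase
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) != 2:
        continue                      # skip malformed lines
    id1, id2 = sorted(fields)
    print('%s %s' % (id1, id2))

#!/usr/bin/env python
# reducer.py: input arrives sorted by key, so duplicate pairs are adjacent
import sys

previous = None
for line in sys.stdin:
    line = line.rstrip('\n')
    if line != previous:              # first occurrence of this pair
        print(line)
    previous = line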
Depending on the details of your data (see my comment on the question) a Bloom filter may be a simple way to get away with two passes. In the first pass insert every pair into the filter after ordering the first and the second value and generate a set of possible duplicates. In the second pass filter the file using the set of possible duplicates. This obviously requires that the set of (possible) duplicates is not itself large.
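A rough sketch of that two-pass idea (my illustration, not the answer's code; the input/output file names, the filter size M, and the probe count K are assumptions that have to be tuned to the real data and memory budget):

import hashlib

M = 8 * (1 << 30)                 # bits in the Bloom filter (here: a 1 GiB bit array)
K = 4                             # hash probes per pair
bits = bytearray(M // 8)

def canonical_key(line):
    a, b = line.split()
    return a + ' ' + b if a <= b else b + ' ' + a

def probes(key):
    return [int.from_bytes(hashlib.sha256(('%d:%s' % (i, key)).encode()).digest()[:8], 'big') % M
            for i in range(K)]

# pass 1: insert every (ordered) pair into the filter; pairs the filter claims
# to have seen already (true repeats plus false positives) become candidates
maybe_dup = set()
with open('pairs.txt') as f:
    for line in f:
        key = canonical_key(line)
        idx = probes(key)
        if all(bits[i >> 3] & (1 << (i & 7)) for i in idx):
            maybe_dup.add(key)
        for i in idx:
            bits[i >> 3] |= 1 << (i & 7)

# pass 2: only the candidates need exact tracking; every other pair is
# provably unique and is written straight through
seen = set()
with open('pairs.txt') as f, open('filtered.txt', 'w') as out:
    for line in f:
        key = canonical_key(line)
        if key in maybe_dup:
            if key in seen:
                continue
            seen.add(key)
        out.write(line)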
Given the characteristics of the data set - up to around 25 billion unique pairs at roughly 64 bits per pair - the result will be on the order of 200 GB. So you either need a lot of memory, many passes, or many machines. Even a Bloom filter will have to be huge to yield an acceptable error rate.
sortbenchmark.org can provide some hints on what is required, because the task is not too different from sorting. The 2011 winner used 66 nodes, each with 2 quad-core processors, 24 GiB of memory, and sixteen 500 GB disks, and sorted 1,353 GB in 59.2 seconds.
As an alternative to rolling your own clever solution, you could load the data into a database and then use SQL to get the subset that you need. Many great minds have already solved the problem of querying 'big data', and 1000GB is not really that big, all things considered...
Your approach is almost fine; you just need to move your hashes to disk instead of keeping them in memory. But let's go step by step.
Reorder IDs
It's inconvenient to work with records whose IDs may appear in either order. So, if possible, reorder the IDs, or, if not, create an additional key for each record that holds the correct order. I will assume you can reorder the IDs (I'm not very good with Bash, so my code will be in Python):
with open('input.txt') as file_in, open('reordered.txt', 'w') as file_out:
    for line in file_in:
        reordered = ' '.join(sorted(line.split()))   # reorder IDs within the pair
        file_out.write(reordered + '\n')
Group records by hash
You cannot filter all records at once, but you can split them into a reasonable number of parts. Each part can be uniquely identified by the hash of the records in it, e.g.:
N_PARTS = 1000
with open('reordered.txt') as file_in:
    for line in file_in:
        part_id = hash(line) % N_PARTS   # part_id will be between 0 and (N_PARTS - 1)
        with open('part-%04d.txt' % part_id, 'a') as part_file:
            part_file.write(line)        # line already ends with '\n'
The choice of hash function is important here. I used Python's standard hash() (modulo N_PARTS), but you may need another function that gives a close-to-uniform distribution of records over hash values. If the hash function works more or less OK, then instead of one large 1 TB file you will get 1000 small files of roughly 1 GB each. And the most important thing is that you have a guarantee that no two identical records end up in different parts.
Note that opening and closing a part file for every line isn't really a good idea, since it generates countless system calls. A better approach would be to keep the files open (you may need to increase your ulimit -n), use batching, or even write to a database; this is up to the implementation, and I will keep the code simple for the purpose of demonstration.
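For example, a minimal sketch of the keep-the-files-open variant (my addition; it assumes the open-files limit allows at least N_PARTS descriptors and reuses the file names from the snippets above):

N_PARTS = 1000
# open every part file once, in write mode, instead of reopening it per line
part_files = [open('part-%04d.txt' % i, 'w') for i in range(N_PARTS)]
with open('reordered.txt') as file_in:
    for line in file_in:
        part_files[hash(line) % N_PARTS].write(line)
for part_file in part_files:
    part_file.close()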
Filter each group
Files of roughly a gigabyte are much easier to work with, aren't they? You can load each one into memory and easily remove duplicates with a hash set:
with open('output.txt', 'w') as file_out:
    for i in range(N_PARTS):                        # for each part
        unique = set()                              # records of this part only
        with open('part-%04d.txt' % i) as part_file:
            for line in part_file:                  # for each line
                unique.add(line)
        for record in unique:
            file_out.write(record)
This approach uses some heavy I/O and three passes over the data, but it is linear in time and uses a configurable amount of memory (if the parts are still too large for a single machine, just increase N_PARTS).
If this were me, I'd take the database route described by @Tom in another answer. I'm using Transact-SQL here, but it seems that most of the major SQL databases have similar windowing/ranking ROW_NUMBER() implementations (except MySQL).
I would probably take a two-sweep approach, first rewriting the id1 and id2 columns into a new table so that the "lowest" value is in id1 and the highest in id2.
This means that the subsequent task is to find the dupes in this rewritten table.
Initially, you would need to bulk-copy your source data into the database, or generate a whole bunch of insert statements. I've gone with inserts here, but would favour a bulk insert for big data. Different databases have different means of doing the same thing.
CREATE TABLE #TestTable
(
id int,
id1 char(6) NOT NULL,
id2 char(6) NOT NULL
)
insert into
#TestTable (id, id1, id2)
values
(1, 'ABC123', 'ABC124'),
(2, 'ABC123', 'ABC124'),
(3, 'ABC123', 'ABA122'),
(4, 'ABC124', 'ABC123'),
(5, 'ABC124', 'ABC126');
select
id,
(case when id1 <= id2
then id1
else id2
end) id1,
(case when id1 <= id2
then id2
else id1
end) id2
into #correctedTable
from #TestTable
create index idx_id1_id2 on #correctedTable (id1, id2, id)
;with ranked as
(select
ROW_NUMBER() over (partition by id1, id2 order by id) dupeRank,
id,
id1,
id2
from #correctedTable)
select id, id1, id2
from ranked where dupeRank = 1
drop table #correctedTable
drop table #TestTable
Which gives us the result:
3 ABA122 ABC123
1 ABC123 ABC124
5 ABC124 ABC126
I'm not trying to answer the question, merely adding my €0.02 to the other answers.
A must-do, to me, is to split the task into multiple smaller tasks, as was already suggested: both the control flow and the data structures.
Think of the way merge sort was used with tape drives to sort data volumes bigger than memory and bigger than random-access disk. In today's terms it would mean that the storage is distributed across multiple (networked) disks or networked disk sectors.
There are already languages and even operating systems that support this kind of distribution at different granularities. Some 10 years ago I had my hot candidates for this kind of task, but I don't remember the names, and things have changed since then.
One of the first was the distributed Linda operating system, with parallel processors attached and disconnected as needed. Its basic coordination structure was a huge distributed tuple-space data structure, where processors read and wrote tasks and results.
A more recent approach with a similar distribution of work is multi-agent systems (the Czech Wikipedia article perhaps contains more links).
Related Wikipedia articles are Parallel computing, Supercomputer operating systems, and the List of concurrent and parallel programming languages.
I don't mean to say that you should buy some processor time on a supercomputer and run the computation there. I'm listing them as algorithmic concepts to study.
There will often be free or open-source software solutions that allow you to do the same thing in the small, starting with cheap software and available hardware. For example, back at university in 1990 we used the night time in the computer lab to calculate ray-traced 3D images. It was a very computationally expensive process, as for every pixel you must cast a "ray" and calculate its collisions with the scene model. On one machine, with a scene containing some glass and mirrors, it ran at about one pixel per second (C++ and optimized assembly code). At the lab we had some ~15 PCs available, so the final time could be reduced ~15 times (i386, i486, and an image of 320x200 in 256 colors). The image was split into standalone tasks, computed in parallel, and then merged into one. The approach scaled well at the time, and a similar approach would help you today.
There always was and always will be something like "big data": data so big that it does not fit into RAM and does not fit on disk, and cannot be computed on one computer in a reasonable time.
Such tasks have been solved successfully since the very first days of computing. Terms like B-tree, tape drive, seek time, Fortran, Cobol, and IBM AS/400 come from that era. If you work like the engineers of those times, then you will surely come up with something smart :)
EDIT: actually, you are probably looking for External Sorting
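For concreteness, a rough external-sort sketch of that "sort, then unique" idea (my illustration, not the poster's code; the file names and chunk size are assumptions, and each pair is put into a canonical order first so that reversed pairs collapse as well):

import heapq
import os
import tempfile

CHUNK_LINES = 10 * 1000 * 1000                 # tune to the available memory

def canonical(line):
    a, b = line.split()
    return (a, b) if a <= b else (b, a)

def dump_run(pairs):
    # sort one in-memory chunk and write it out as a temporary "run" file
    pairs.sort()
    run = tempfile.NamedTemporaryFile('w', suffix='.run', delete=False)
    run.writelines('%s %s\n' % p for p in pairs)
    run.close()
    return run.name

def make_runs(path):
    runs, buf = [], []
    with open(path) as f:
        for line in f:
            buf.append(canonical(line))
            if len(buf) >= CHUNK_LINES:
                runs.append(dump_run(buf))
                buf = []
    if buf:
        runs.append(dump_run(buf))
    return runs

def merge_unique(runs, out_path):
    # k-way merge of the sorted runs, dropping adjacent duplicates
    run_files = [open(r) for r in runs]
    with open(out_path, 'w') as out:
        previous = None
        for line in heapq.merge(*run_files):
            if line != previous:
                out.write(line)
            previous = line
    for f in run_files:
        f.close()
    for r in runs:
        os.unlink(r)

merge_unique(make_runs('pairs.txt'), 'unique_pairs.txt')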

How to optimize big table read and outer join in pig

I am joining a big table with 3 other tables,
A = join small table by (f1,f2) RIGHT OUTER , massiveTable by (f1,f2) ;
B = join AnotherSmall by (f3) RIGHT OUTER , A by (f3) ;
C = join AnotherSmall by (f4) , B by (f4) ;
The small tables may not fit in memory. This forces a billion-object read three times, which is time consuming, so I was wondering whether the rereading can be avoided and the process made more efficient?
Thanks in advance.
If you design your big table in HBase to have three column families, i.e. splitting f1 and f2 from f3 and from f4, you should be able to avoid the unnecessary reads.
Also, if you think about it, you don't re-read so much as read a different part of each record: first f1 and f2, then f3, and finally f4.

Pig Inner Join produces a Job with 1 hanging reducer

I have a Pig script that I've been working on that has an inner join between two different data sources. This join happens to be the first MapReduce-causing operation, with the only operations beforehand being filters and foreachs. When the join is executed, everything goes through the map phase perfectly and fast, but when it comes to the reduce phase, all the reducers but one finish quickly. That one just sits there at the reduce part of the phase, chugging over data at a very, very slow pace, to the point that it can take an hour or more just waiting on that one reducer to complete. I have tried increasing the number of reducers as well as switching to a skewed join, but nothing seems to help.
Any ideas for things to look into?
I also ran an explain to see if I could spot anything, but that just shows a simple single-job flow with nothing unusual.
Likely what is happening is that a single key has a huge number of instances on both sides, and the join is exploding.
For example, if you join:
x,4          x,'f'
x,5          x,'g'
x,6     X    x,'h'
y,7          x,'i'
... you will get 12 pairs for x! So you can imagine that if you have 1,000 rows with one key in one data set and 2,000 rows with the same key in the other, you will get 2 million pairs just from those rows. The single reducer unfortunately has to take the brunt of this explosion.
Adding reducers or using a skew join isn't going to help here, because at the end of the day, a single reducer needs to handle this one big explosion of pairs.
Here are a few things to check:
It sounds like only a single join key is causing this issue, since only one reducer is getting hammered. The common culprit is NULL. Can the join column in either of these relations be NULL? If so, it will cause a huge explosion! Try filtering out NULLs on the foreign key of both relations before running the join and see if there is a difference. Or, instead of NULL, perhaps you have some sort of default value or a single value that shows up a lot.
Try to figure out how many of each key there actually are, and what the explosion will look like. Something like this (warning: I'm not actually testing this code, hopefully it works):
A1 = LOAD ...  -- load data set 1
B1 = GROUP A1 BY fkey1;
C1 = FOREACH B1 GENERATE group AS fkey1, COUNT_STAR(A1) AS cnt1;

A2 = LOAD ...  -- load data set 2
B2 = GROUP A2 BY fkey2;
C2 = FOREACH B2 GENERATE group AS fkey2, COUNT_STAR(A2) AS cnt2;

D = JOIN C1 BY fkey1, C2 BY fkey2;               -- join the two sets of counts
E = FOREACH D GENERATE C1::fkey1 AS fkey, (cnt1 * cnt2) AS cnt;  -- multiply out the counts
F = ORDER E BY cnt DESC;                         -- largest explosions first
STORE F INTO ...
Similarly, it may have nothing to do with an explosion. One of your relations might just have a single key a ton of times. For example, in the word count example, the reducer that ends up with the word "the" is going to have a lot more counting to do than the one that gets "zebra". I don't think this is the case here since only one of your reducers is getting hammered, which is why I think #1 is probably the case.
If you see a huge count for one of the keys, that is why, and you also know which key is causing the issue.
