pig - need tips after performance tuning gone wrong - performance

I have a Pig script that took around 10 minutes to finish and I thought that there was still room for some performance improvement.
So, I started by putting the JOINs and GROUPs in a nested FOREACH and also putting the previous FILTERs inside the same FOREACH.
I also added using 'replicated'.
The problem now is that instead of taking 10 minutes, it's taking over 30 minutes.
Is there a place that has best practices and performance improvement tips besides PIG's documentation?
So that you can get a better picture, here's some code:
--before
previous_join = JOIN A by id, B by id --for symplification
filtering = FILTER previous_join BY ((year_min > 1995 ? year_min - 1 : year_min) <= list_year and (year_max > 2015 ? year_max - 1 : year_max) >= list_year);
final_filtered = FOREACH filtering GENERATE user_id as user_id, list_year;
--after
final_filtered = FOREACH (JOIN A by id, B by id) {
tmp = FILTER group BY ((A::year_min > 1995 ? A::year_min - 1 : A::year_min) <= B::list_year and (A::year_max > 2015 ? A::year_max - 1 : A::year_max) >= B::list_year and A::premium == 'true');
GENERATE A::user_id AS user_id, B::list_year AS list_year;
};
Am I doing something wrong or is this the wrong approach?
Thanks.

In prior case [before] you are performing filter and projection after the join is performed.
It will be helpful if you calculate time log for each operation and identify the bottleneck operation.
Can you also try splitting your filter statements in multiple relations rather than just one and check the difference in filter timing?
filter_by_min_year = FILTER previous_join BY ((A::year_min > 1995 ? A::year_min - 1 : A::year_min) <= B::list_year);
filter_by_max_year = FILTER filter_by_min_year BY (A::year_max > 2015 ? A::year_max - 1 : A::year_max) >= B::list_year);
Overall you want to find ids(+some more columns) with A::year_min <=B::list_year and A::year_max >= B::list_year
Instead of performing join on raw A & B, you can try using projections on both of them to contain only columns needed for join and later operations.
A-projected = foreach A generate id, year_min, year_max;
B-projected = foreach B generate id, list_year;
C = join A-projected by id, B-projected by id USING 'replicated';
If any of A-projected or B-projected is a small set that can be loaded in memory use replicated join, I am assuming B-projected to be a smaller set than A-projected.
If this doesnt apply to your case, please skip this option.
Also you can try setting the number of reducers to be used for this join by using PARALLEL keyword.
After applying filter you will get a list of required id's that you can use to fetch other information from A or B.
Also consider tweaking MapReduce properties like io.sort.mb, mapred.job.shuffle.input.buffer.percent etc.
Hope this helps.

Related

Linq query increase performance efficient query

int mappedCount = (from product in products
from productMapping in DbContext.ProductCategoryMappings
.Where(x => product.TenantId == x.TenantId.ToString() &&
x.ProductId.ToString().ToUpper() == product.ProductGuid.ToUpper())
join tenantCustMapping in DbContext.TenantCustCategories
on productMapping.Value equals tenantCustMapping.Id
select 1).ToList().Sum();
I need to increase the performance.
When mapping product two tables each item having multiple product
If you want to increase performance, you'd need to know the volumes of data that are getting sent around. How many "products" are in your variable.
It may be quicker to update your products list to contain integers / guids and send that to your database rather than send strings that the database has to run ToUpper() on before comparing them.
something like:
var convertedList = products.Select( new {TenantId = int.parse(product.TenantId), productId = Guid.Parse(x.ProductId)}
Then sending that to your Db, and comparing them directly
I think changing "Select 1).ToList().Sum()" to ".Count()" will improve performance. Even if not, it'll help readability.

Pig case statement for finding no. Of events in a specific period of time

Pig case statement for finding no. Of events in a specific period of time.
There is a dataset which is like a movie data base bearing movies, rating, duration of movie, year of release.
The question is that how do u find the no. Of movies released during 10 years of span?
The dataset is comma separated.
Movie = load '/home/movie/movies.txt' using PigStorage(',') as (movieid:int, moviename:chararray, yearrelease:int, ratingofmovie:float, moviedurationinsec:float);
movies_released_between_2000_2010 = filter Movie by yearofrelease >2000 and yearofrelease < 2010;
result = foreach movies_released_between_2000_2010 generate moviename,yearofrelease;
dump result;
year_count = FOREACH movie GENERATE (case when year>2000 and year<2010 then 1 else 0 end) as year_flag,movie_name;
year_grp = GROUP year_count BY year_flag;
movie_count_out = FOREACH year_grp GENERATE group,COUNT(year_flag);
The above example can help you give an understanding of the solution, there might be some syntax errors tough. If you need to group on the basis of decade then you can use a substring function on top of year and get the specific range.

Linq query returns duplicate results when .Distinct() isn't used - why?

When I use the following Linq query in LinqPad I get 25 results returned:
var result = (from l in LandlordPreferences
where l.Name == "Wants Student" && l.IsSelected == true
join t in Tenants on l.IsSelected equals t.IsStudent
select new { Tenant = t});
result.Dump();
When I add .Distinct() to the end I only get 5 results returned, so, I'm guessing I'm getting 5 instances of each result when the above is used.
I'm new to Linq, so I'm wondering if this is because of a poorly built query? Or is this the way Linq always behaves? Surely not - if I returned 500 rows with .Distinct(), does that mean without it there's 2,500 returned? Would this compromise performance?
It's a poorly built query.
You are joining LandlordPreferences with Tenants on a boolean value instead of a foreign key.
So, most likely, you have 5 selected land lords and 5 tenants that are students. Each student will be returned for each land lord: 5 x 5 = 25. This is a cartesian product and has nothing to do with LINQ. A similar query in SQL would behave the same.
If you would add the land lord to your result (select new { Tenant = t, Landlord = l }), you would see that no two results are actually the same.
If you can't fix the query somehow, Distinct is your only option.

PIG how to count a number of rows in alias

I did something like this to count the number of rows in an alias in PIG:
logs = LOAD 'log'
logs_w_one = foreach logs generate 1 as one;
logs_group = group logs_w_one all;
logs_count = foreach logs_group generate SUM(logs_w_one.one);
dump logs_count;
This seems to be too inefficient. Please enlighten me if there is a better way!
COUNT is part of pig see the manual
LOGS= LOAD 'log';
LOGS_GROUP= GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
Arnon Rotem-Gal-Oz already answered this question a while ago, but I thought some may like this slightly more concise version.
LOGS = LOAD 'log';
LOG_COUNT = FOREACH (GROUP LOGS ALL) GENERATE COUNT(LOGS);
Be careful, with COUNT your first item in the bag must not be null. Else you can use the function COUNT_STAR to count all rows.
Basic counting is done as was stated in other answers, and in the pig documentation:
logs = LOAD 'log';
all_logs_in_a_bag = GROUP logs ALL;
log_count = FOREACH all_logs_in_a_bag GENERATE COUNT(logs);
dump log_count
You are right that counting is inefficient, even when using pig's builtin COUNT because this will use one reducer. However, I had a revelation today that one of the ways to speed it up would be to reduce the RAM utilization of the relation we're counting.
In other words, when counting a relation, we don't actually care about the data itself so let's use as little RAM as possible. You were on the right track with your first iteration of the count script.
logs = LOAD 'log'
ones = FOREACH logs GENERATE 1 AS one:int;
counter_group = GROUP ones ALL;
log_count = FOREACH counter_group GENERATE COUNT(ones);
dump log_count
This will work on much larger relations than the previous script and should be much faster. The main difference between this and your original script is that we don't need to sum anything.
This also doesn't have the same problem as other solutions where null values would impact the count. This will count all the rows, regardless of if the first column is null or not.
USE COUNT_STAR
LOGS= LOAD 'log';
LOGS_GROUP= GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT_STAR(LOGS);
Here is a version with optimization.
All the solutions above would require pig to read and write full tuple when counting, this script below just write '1'-s
DEFINE row_count(inBag, name) RETURNS result {
X = FOREACH $inBag generate 1;
$result = FOREACH (GROUP X ALL PARALLEL 1) GENERATE '$name', COUNT(X);
};
The use it like
xxx = row_count(rows, 'rows_count');
What you want is to count all the lines in a relation (dataset in Pig Latin)
This is very easy following the next steps:
logs = LOAD 'log'; --relation called logs, using PigStorage with tab as field delimiter
logs_grouped = GROUP logs ALL;--gives a relation with one row with logs as a bag
number = FOREACH LOGS_GROUP GENERATE COUNT_STAR(logs);--show me the number
I have to say it is important Kevin's point as using COUNT instead of COUNT_STAR we would have only the number of lines which first field is not null.
Also I like Jerome's one line syntax it is more concise but in order to be didactic I prefer to divide it in two and add some comment.
In general I prefer:
numerito = FOREACH (GROUP CARGADOS3 ALL) GENERATE COUNT_STAR(CARGADOS3);
over
name = GROUP CARGADOS3 ALL
number = FOREACH name GENERATE COUNT_STAR(CARGADOS3);

Max/Min for whole sets of records in PIG

I have a set set of records that I am loading from a file and the first thing I need to do is get the max and min of a column.
In SQL I would do this with a subquery like this:
select c.state, c.population,
(select max(c.population) from state_info c) as max_pop,
(select min(c.population) from state_info c) as min_pop
from state_info c
I assume there must be an easy way to do this in PIG as well but I'm having trouble finding it. It has a MAX and MIN function but when I tried doing the following it didn't work:
records=LOAD '/Users/Winter/School/st_incm.txt' AS (state:chararray, population:int);
with_max = FOREACH records GENERATE state, population, MAX(population);
This didn't work. I had better luck adding an extra column with the same value to each row and then grouping them on that column. Then getting the max on that new group. This seems like a convoluted way of getting what I want so I thought I'd ask if anyone knows a simpler way.
Thanks in advance for the help.
As you said you need to group all the data together but no extra column is required if you use GROUP ALL.
Pig
records = LOAD 'states.txt' AS (state:chararray, population:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group
GENERATE
FLATTEN(records.(state, population)), MAX(records.population);
Input
CA 10
VA 5
WI 2
Output
(CA,10,10)
(VA,5,10)
(WI,2,10)

Resources