MapReduce output shows decimal numbers in a format like 5.7491844E7 - hadoop

My Java MapReduce output shows numbers in scientific notation, e.g. 5.7491844E7. How can I get the values printed in plain decimal form?
Output:
Baby 5.7491844E7
Books 5.7450452E7
CDs 5.7410388E7
Cameras 5.7299596E7
Children's Clothing 5.7625016E7
Computers 5.7314756E7
Consumer Electronics 5.7453124E7
Crafts 5.7418584E7
DVDs 5.7649004E7
Garden 5.7539812E7
Health and Beauty 5.7480924E7
Men's Clothing 5.7621152E7
Music 5.749592E7
Pet Supplies 5.7197412E7
Sporting Goods 5.75983E7
Toys 5.7464028E7
Video Games 5.7513068E7
Women's Clothing 5.7434656E7

You can use any of the options below (value is the number to print).
DecimalFormat (java.text.DecimalFormat):
DecimalFormat df = new DecimalFormat("#");
df.setMaximumFractionDigits(8);
System.out.println(df.format(value));
printf:
System.out.printf("%.9f", value);
System.out.println();
Convert to BigDecimal (java.math.BigDecimal) and call toPlainString():
// BigDecimal.valueOf avoids the long binary expansion that new BigDecimal(double) can print
System.out.println(BigDecimal.valueOf(value).toPlainString());
String.format:
System.out.println(String.format("%.12f", value));
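As a rough, self-contained sketch of two of these options: the class and method names below are made up for illustration, and the sample value is taken from the output in the question. In an actual reducer you would presumably emit the formatted string (for example wrapped in a Text) rather than print it, but that part depends on your job and is not shown.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical demo class; not part of the original job.
public class PlainNumberDemo {

    // Formats a double without an exponent, keeping at most 8 fraction digits.
    // Locale.ROOT symbols keep the digits and separator independent of the machine's locale.
    public static String withDecimalFormat(double value) {
        DecimalFormat df = new DecimalFormat("#", DecimalFormatSymbols.getInstance(Locale.ROOT));
        df.setMaximumFractionDigits(8);
        return df.format(value);
    }

    // Goes through the shortest decimal representation of the double,
    // then prints it without an exponent.
    public static String withBigDecimal(double value) {
        return BigDecimal.valueOf(value).toPlainString();
    }

    public static void main(String[] args) {
        double value = 5.7491844E7;                    // sample value from the question
        System.out.println(value);                     // default toString: scientific notation
        System.out.println(withDecimalFormat(value));  // plain notation
        System.out.println(withBigDecimal(value));     // plain notation
    }
}
```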

Calculate percentage when an item can be assigned to multiple categories

This might be a stupid question, but I'm blanking at the moment and can't find the answer on Google.
I have the following example table
customer  bought
A         food
B         food,drink
C         drink
D         drink
Now, how do I calculate the percentage of customers that bought food/drink over total customers? What would be the best way to calculate this?
Solution 1:
% food = #customers who bought food / #total unique customers = 2 / 4 = 50%
% drink = #customers who bought drink / #total unique customers = 3 / 4 = 75%
The problem here is that the percentages add up to more than 100%.
Solution 2 - count customer B twice, once for drink and once for food:
% food = #customers who bought food / #total customers = 2 / 5 = 40%
% drink = #customers who bought drink / #total customers = 3 / 5 = 60%
The total is 100%, but is this the correct way to calculate the percentage in this case?
The issue is obviously that one customer can buy both food and drink and I'm not sure how to handle this case. Any help would be appreciated. Thank you!
UPDATE:
Thanks for the answer. It makes sense to remove the customers who bought both products altogether. But now I'm wondering what happens if there are more than 2 categories.
Example as follows; now we have an additional product (ice cream) in the mix:
customer  bought
A         food
B         food,drink
C         drink
D         drink
E         drink,ice cream
F         ice cream
G         food,drink,ice cream
I guess with the same logic we can remove customer G since they bought all the products? And how should we handle customer E and B?
I think that what you are looking for is simply to exclude the customers who purchased both products.
If you are looking for customers who purchased only one specific product, then:
3 unique customers bought a single unique item
1/3 food => about 33%
2/3 drink => about 67%
The customers who bought both products don't matter in these calculations, since they add the same amount to both cases.
EDIT:
I used Excel to help me out. I hope that is fine.
I added your data to an Excel sheet and created a crosstab pivot.
I've set the Customers as columns and the Products as rows, in order to see what each individual customer has purchased via the values field Count of Product.
I've changed Count of Product to the % of Total calculation, in order to show, for each product, the percentage split between the customers. So, for example, if customer B bought both drink and food, he will be shown as 50% in both corresponding rows. If he bought all 3, then 33%.
The grand total column contains the final percentage for each product row item. Excel calculates it the same way as stated in some comments: it counts the products total instead of the customers total, which makes sense, since the purchased products are the data set being consumed, not the customers themselves.
If we swap the rows and columns around and do the same calculations, we see individual percentages for each customer (what percentage of the products each of them purchased), and we can sum them up for each product. The result is the same.
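To make the two denominators from the question concrete, here is a small sketch using the four-customer table. The class and method names are invented for this illustration. The two numbers answer different questions: "share of customers" can add up to more than 100% because one customer can appear in several categories, while "share of purchases" always adds up to 100%.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration only.
public class CategoryShare {

    // The four-customer example table from the question.
    public static Map<String, List<String>> sample() {
        Map<String, List<String>> purchases = new LinkedHashMap<>();
        purchases.put("A", Arrays.asList("food"));
        purchases.put("B", Arrays.asList("food", "drink"));
        purchases.put("C", Arrays.asList("drink"));
        purchases.put("D", Arrays.asList("drink"));
        return purchases;
    }

    // Number of distinct customers who bought each category.
    public static Map<String, Integer> buyersPerCategory(Map<String, List<String>> purchases) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> items : purchases.values()) {
            for (String item : new HashSet<>(items)) {   // a customer counts once per category
                counts.merge(item, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Denominator = unique customers (B counted once overall).
    public static double shareOfCustomers(Map<String, List<String>> purchases, String category) {
        int buyers = buyersPerCategory(purchases).getOrDefault(category, 0);
        return 100.0 * buyers / purchases.size();
    }

    // Denominator = customer/category pairs (B counted once per category bought).
    public static double shareOfPurchases(Map<String, List<String>> purchases, String category) {
        Map<String, Integer> counts = buyersPerCategory(purchases);
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        return 100.0 * counts.getOrDefault(category, 0) / total;
    }

    public static void main(String[] args) {
        Map<String, List<String>> p = sample();
        for (String c : buyersPerCategory(p).keySet()) {
            System.out.printf("%s: %.0f%% of customers, %.0f%% of purchases%n",
                    c, shareOfCustomers(p, c), shareOfPurchases(p, c));
        }
    }
}
```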

Total sum based on Expression Detail Items (SUM IF Line Items)

I have a table showing Turnover.
I used the same column but did a SUM IIF on the detail lines as follows:
=SUM(IIF(Fields!ID_Country.Value = 17,Fields!Turnover.Value,0))
and
=SUM(IIF(Fields!ID_Country.Value = 23,Fields!Turnover.Value,0))
17 = Great Britain
23 = Ireland
I then want to do a total per column based on my detail rows above. I just cannot find the right way to do it.
I tried:
SUM(=SUM(IIF(Fields!ID_Country.Value = 17,Fields!Turnover.Value,0)))
But this does not work.
Please help as I am really stuck.
Thank you!
Because Turnover for GB is in pounds and for Ireland it is in euros, I want to split my columns and then do a final SUM each for GB and Ireland.
I expect to see the grand total per column for GB and Ireland, summing my detail rows that contain the expression calculation (SUM IIF).
Table (image not included): circled in red is where I expect to see my total.
Data structure: (image not included)

Is MNL the right model to use when the choice options vary across observations?

In a survey of 100 people, I am asking each person to choose between product A and product B. I ask each person this question 3 times, but each time I present a different set of products. Say, first time, Person 1 is asked to choose between 'Phone 1' and 'Phone 2', given certain attributes of each phone. The second time the choice is again 'Phone 1' vs. 'Phone 2', but a different set of attributes for each phone.
A person is presented three attributes associated with the two phone alternatives every time the question is asked. So, each time between Phone 1 and Phone 2, the attributes of the phone such as cost, memory and camera pixels are presented so that user can choose which set of attributes is most attractive, Phone 1's or Phone 2's.
Overall, 3*100 = 300 responses; 3 responses per person. Each time, the attributes cost, memory and camera pixels are presented and the user is asked to choose the feature set they prefer.
My goal is to analyze how users value features of a phone vs. cost of the phone.
In this scenario, can I use an MNL, even though each time I asked the person a question, I only presented two choices? My understanding is that MNL is used when (a) there are multiple choices and (b) the choice options do not change across observations, i.e. each person is asked to choose between multiple products, say A, B, C, and A, B, C do not change across observations.
In the scenario described above, the two choices varied across the three times the same person was asked the question. If not MNL, should I instead create a binary logit model, given that the user only had to choose between two options when the question was asked (even though he was asked the question three times)? If I can use binary logit, should I be concerned that the choice set of products changes across observations? Or should I let the attributes defined in each of the rows address the differences in product choices across observations?
I have setup the data as follows (thinking I can do MNL but may be I should set it up differently and use another modeling approach?):
I am working on designing and analyzing a similar survey, but mine is related to transportation. I am a beginner and still new to the whole concept; however, I will offer some advice and references that may be helpful.
First point: I have come across three models, as follows, from a useful video on YouTube:
MNL refers to the multinomial logit model. MNL is used with alternative-invariant regressors (for example, the salary of the survey participant, or his/her gender …).
The conditional logit model is used with alternative-invariant (gender, salary, education level …) and alternative-variant regressors (cost of the product, memory, camera pixels …).
The mixed logit model uses random parameters. It is also used with alternative-invariant (gender, salary, education level …) and alternative-variant regressors (cost of the product, memory, camera pixels …).
Note regarding alternative-invariant and alternative-variant regressors:
The gender of the person participating in the survey will NOT vary between Product A and Product B, so it is an alternative-invariant regressor. The price of the product, however, could vary between Product A and Product B, so it is called an alternative-variant regressor.
Based on the above, I assume you need to use a conditional logit model or a mixed logit model.
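For reference, the textbook form of the conditional logit choice probability is sketched below. The notation ($x_{ij}$ for the attribute vector of alternative $j$ shown to respondent $i$, $C_i$ for the choice set, $\beta$ for the coefficients) is introduced here and does not come from the question. It may help with the question of two alternatives: with only two options in each choice set, the same formula reduces to a binary logit on the attribute differences.

```latex
% Conditional logit: probability that respondent i picks alternative j
% from choice set C_i, given attributes x_{ij} (e.g. cost, memory, camera pixels).
P(y_i = j) = \frac{\exp(x_{ij}^{\top}\beta)}{\sum_{k \in C_i} \exp(x_{ik}^{\top}\beta)}

% Special case with exactly two alternatives (1 and 2):
P(y_i = 1) = \frac{1}{1 + \exp\bigl(-(x_{i1} - x_{i2})^{\top}\beta\bigr)}
```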
For me, I couldn’t find a separate function in R for the conditional logit model or the mixed logit model. The same mlogit function is used for all of them; see the examples below from the help of the mlogit package:
library(mlogit)
# data preparation as in the package help (mlogit.data is the older interface)
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
# a pure "multinomial model"
summary(mlogit(mode ~ 0 | income, data = Fish))
# a pure "conditional" model
summary(mlogit(mode ~ price + catch, data = Fish))
# a "mixed" model
m <- mlogit(mode ~ price + catch | income, data = Fish)
summary(m)
# same model with charter as the reference level
m <- mlogit(mode ~ price + catch | income, data = Fish, reflevel = "charter")
From the examples above, I think (but am NOT sure) that the manual of the mlogit package says "mixed" when you use both alternative-invariant and alternative-variant regressors, "conditional" when you have only alternative-variant regressors, and "multinomial" when you have only alternative-invariant regressors.
Second point: There is something called “panel data”, which applies when you ask the same person to choose one product from each of several choice sets. Same person here means that the person-level variables in your model (gender, salary, education level …) stay the same across that person's responses. Check this: https://en.wikipedia.org/wiki/Panel_data
To use panel techniques please refer to help in mlogit package in R. I am quoting from it the following:
“panel only relevant if rpar is not NULL and if the data are repeated observations of the same unit ; if TRUE, the mixed-logit model is estimated using panel techniques”
So in my understanding, if you want to use the panel techniques you have to use random draws because panel will be true and rpar will not be NULL.
Moreover, for an example of using panel data, please refer to the example below from “Estimation of multinomial logit models in R : The mlogit Packages” by Yves Croissant:
data("Train", package = "mlogit")
Tr <- mlogit.data(Train, shape = "wide", varying = 4:11, choice = "choice", sep = "_", opposite = c("price", "time", "change", "comfort"), alt.levels=c("A", "B"), id.var ="id")
Train.ml <- mlogit(choice ~ price + time + change + comfort, Tr)
Train.mxlc <- mlogit(choice ~ price + time + change + comfort, Tr, panel = TRUE, rpar = c(time = "cn", change = "n", comfort = "ln"), correlation = TRUE, R = 100, halton = NA)
Train.mxlu <- update(Train.mxlc, correlation = FALSE)
I hope that is useful to you.

pig distinct atom

Suppose my data looks like this with columns named food, action, and population:
pizzas eatenBy humans
pizzas eatenBy collegeKids
pizzas eatenBy everyOne
pizzas grownBy farmers
sprouts grownBy sproutFarmers
sprouts grownBy humans
How can I write a Pig Latin script to produce ONLY a unique food & action, with any valid population from the distinct food & action group?
i.e., the only output I'd like from the above data would be this (though the population of the 1st and 3rd lines could be different):
pizzas eatenBy everyOne
pizzas grownBy farmers
sprouts grownBy sproutFarmers
Thank you,
I don't know how you'd do this with DISTINCT (which would be more efficient than what I'm about to suggest), but you could do this:
food = load 'foodInput' AS (foodType,action,population);
foodGrouped = GROUP food by (foodType,action);
foodLimited = foreach foodGrouped {
    limited = LIMIT food 1;
    GENERATE FLATTEN(limited.(foodType,action,population));
};
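For readers less familiar with Pig, the idea behind the script (group by food and action, then keep a single row per group) can be sketched in plain Java. This is only an illustration of the logic, not a replacement for the Pig script; the class and method names are made up, and unlike LIMIT in Pig, this version always keeps the first row seen for each group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "one row per (food, action) group".
public class FirstPerGroup {

    // The sample rows from the question: food, action, population.
    public static List<String[]> sample() {
        return Arrays.asList(
                new String[] {"pizzas", "eatenBy", "humans"},
                new String[] {"pizzas", "eatenBy", "collegeKids"},
                new String[] {"pizzas", "eatenBy", "everyOne"},
                new String[] {"pizzas", "grownBy", "farmers"},
                new String[] {"sprouts", "grownBy", "sproutFarmers"},
                new String[] {"sprouts", "grownBy", "humans"});
    }

    // Keeps the first row encountered for each distinct (food, action) pair.
    public static List<String[]> firstPerGroup(List<String[]> rows) {
        Map<String, String[]> seen = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = row[0] + "\t" + row[1];   // composite group key
            seen.putIfAbsent(key, row);            // later rows of the same group are ignored
        }
        return new ArrayList<>(seen.values());
    }

    public static void main(String[] args) {
        for (String[] row : firstPerGroup(sample())) {
            System.out.println(String.join(" ", row));
        }
    }
}
```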

achieving a complex sort via Linq to Objects

I've been asked to apply conditional sorting to a data set and I'm trying to figure out how to achieve this via LINQ. In this particular domain, purchase orders can be marked as primary or secondary. The exact mechanism used to determine primary/secondary status is rather complex and not germane to the problem at hand.
Consider the data set below.
Purchase Order  Ship Date  Shipping Address  Total
6               1/16/2006  Tallahassee FL    500.45
19.1            2/25/2006  Milwaukee WI      255.69
5.1             4/11/2006  Chicago IL        199.99
8               5/16/2006  Fresno CA         458.22
19              7/3/2006   Seattle WA        151.55
5               5/1/2006   Avery UT          788.52
5.2             8/22/2006  Rice Lake MO      655.00
Secondary POs are those with a decimal number and primary PO's are those with an integer number. The requirement I'm dealing with stipulates that when a user chooses to sort on a given column, the sort should only be applied to primary POs. Secondary POs are ignored for the purposes of sorting, but should still be listed below their primary PO in ship date descending order.
For example, let's say a user sorts on Shipping Address ascending. The data would be sorted as follows. Notice that if you ignore the secondary POs, the data is sorted by Address ascending (Avery, Fresno, Seattle, Tallahassee)
Purchase Order  Ship Date  Shipping Address  Total
5               5/1/2006   Avery UT          788.52
--5.2           8/22/2006  Rice Lake MO      655.00
--5.1           4/11/2006  Chicago IL        199.99
8               5/16/2006  Fresno CA         458.22
19              7/3/2006   Seattle WA        151.55
--19.1          2/25/2006  Milwaukee WI      255.69
6               1/16/2006  Tallahassee FL    500.45
Is there a way to achieve the desired effect using the OrderBy extension method? Or am I stuck (better off) applying the sort to the two data sets independently and then merging into a single result set?
public IList<PurchaseOrder> ApplySort(bool sortAsc)
{
    var primary = purchaseOrders.Where(po => po.IsPrimary)
                                .OrderBy(po => po.ShippingAddress).ToList();
    var secondary = purchaseOrders.Where(po => !po.IsPrimary)
                                  .OrderByDescending(po => po.ShipDate).ToList();
    //merge 2 lists somehow so that secondary POs are inserted after their primary
}
Have you seen the ThenBy and ThenByDescending methods?
purchaseOrders.Where(po => po.IsPrimary).OrderBy(po => po.ShippingAddress).ThenByDescending(x=>x.ShipDate).ToList();
I'm not sure whether this will fit your needs, because I don't quite understand how the final list should look (po.IsPrimary and !po.IsPrimary are confusing me).
The solution for your problem is GroupBy.
First order your objects according to the selected column:
var ordered = purchaseOrders.OrderBy(po => po.ShippingAddress);
Then you need to group your orders according to the primary order. I assumed the order number is a string, so I created a string IEqualityComparer like so:
class OrderComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        x = x.Contains('.') ? x.Substring(0, x.IndexOf('.')) : x;
        y = y.Contains('.') ? y.Substring(0, y.IndexOf('.')) : y;
        return x.Equals(y);
    }
    public int GetHashCode(string obj)
    {
        return obj.Contains('.') ? obj.Substring(0, obj.IndexOf('.')).GetHashCode() : obj.GetHashCode();
    }
}
and use it to group the orders:
var grouped = ordered.GroupBy(po => po.Order, new OrderComparer());
The result is a tree-like structure ordered by the ShippingAddress column and grouped by the primary order ID.
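One caveat, as far as I can tell: because the secondary POs are included in the initial ordering, a group's position is decided by whichever of its members sorts first, and that may be a secondary PO. An alternative is to sort only the primaries and then attach each primary's secondaries in ship date descending order. The sketch below shows that logic in Java rather than LINQ; the Po class, its fields and the method names are assumptions made for this illustration, and the PO number is assumed to be a string as in the answer above.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: sort primaries, then list each primary's secondaries below it.
public class PoSort {

    // Minimal stand-in for the PurchaseOrder type from the question.
    public static final class Po {
        final String number;
        final LocalDate shipDate;
        final String address;

        Po(String number, LocalDate shipDate, String address) {
            this.number = number;
            this.shipDate = shipDate;
            this.address = address;
        }

        boolean isPrimary() { return number.indexOf('.') < 0; }

        // "5.2" belongs to primary "5"; a primary belongs to itself.
        String primaryNumber() {
            int dot = number.indexOf('.');
            return dot < 0 ? number : number.substring(0, dot);
        }
    }

    // The sample data set from the question (totals omitted).
    public static List<Po> sample() {
        return Arrays.asList(
                new Po("6", LocalDate.of(2006, 1, 16), "Tallahassee FL"),
                new Po("19.1", LocalDate.of(2006, 2, 25), "Milwaukee WI"),
                new Po("5.1", LocalDate.of(2006, 4, 11), "Chicago IL"),
                new Po("8", LocalDate.of(2006, 5, 16), "Fresno CA"),
                new Po("19", LocalDate.of(2006, 7, 3), "Seattle WA"),
                new Po("5", LocalDate.of(2006, 5, 1), "Avery UT"),
                new Po("5.2", LocalDate.of(2006, 8, 22), "Rice Lake MO"));
    }

    // Returns PO numbers: primaries by address ascending, each followed by
    // its secondaries in ship date descending order.
    public static List<String> sortByAddress(List<Po> orders) {
        Map<String, List<Po>> secondaries = orders.stream()
                .filter(po -> !po.isPrimary())
                .collect(Collectors.groupingBy(Po::primaryNumber));

        List<String> result = new ArrayList<>();
        orders.stream()
                .filter(Po::isPrimary)
                .sorted(Comparator.comparing((Po po) -> po.address))
                .forEach(primary -> {
                    result.add(primary.number);
                    secondaries.getOrDefault(primary.number, Collections.emptyList()).stream()
                            .sorted(Comparator.comparing((Po po) -> po.shipDate).reversed())
                            .forEach(child -> result.add(child.number));
                });
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sortByAddress(sample()));
    }
}
```

On the sample data this yields the order shown in the question: 5, 5.2, 5.1, 8, 19, 19.1, 6.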
