I have the following data:
primary,first,second
1,393440.09,354096.08
1,4410533.33,3969479.99
1,-4803973.41,-4323576.07
I have to aggregate and sum the first and second columns. Below is the script I am executing:
data_load = load '<filelocation>' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as (primary:double, first:double, second:double);
dataAgrr = group data_load by primary;
sumData = FOREACH dataAgrr GENERATE
group as data,
SUM(data_load.first) as first,
SUM(data_load.second) as second,
SUM(data_load.primary) as primary;
After executing, the output below is produced:
(1.0,0.009999999951105565,-5.820766091346741E-11,3.0)
But manually adding the second column (354096.08 + 3969479.99 - 4323576.07) gives 0.
Pig uses a Java double internally, so the sums pick up binary floating-point rounding error. Testing with the sample code below:
import java.math.BigDecimal;
public class TestSum {
public static void main(String[] args) {
double d1 = 354096.08;
double d2 = 3969479.99;
double d3 = -4323576.07;
System.err.println("Total in double is " + ((d3 + d2 ) + d1));
BigDecimal bd1 = new BigDecimal("354096.08");
BigDecimal bd2 = new BigDecimal("3969479.99");
BigDecimal bd3 = new BigDecimal("-4323576.07");
System.err.println("Total in BigDecimal is " + bd3.add(bd2).add(bd1));
}
}
This produces
Total in double is -5.820766091346741E-11
Total in BigDecimal is 0.00
If you need better precision, you may want to try using bigdecimal instead of double in your script.
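A rough sketch of what that change could look like, assuming a Pig version (0.12 or later) where bigdecimal is supported by SUM; note that bigdecimal arithmetic is considerably slower than double:
data_load = load '<filelocation>' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as (primary:bigdecimal, first:bigdecimal, second:bigdecimal);
dataAgrr = group data_load by primary;
-- SUM over bigdecimal values avoids the binary floating-point rounding error seen above
sumData = FOREACH dataAgrr GENERATE group as data, SUM(data_load.first) as first, SUM(data_load.second) as second;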
Related
I'm writing a variable dimensioned as Decimal to an Access DB. The value calculated is 1600.91. The field in Access is set to: Field Size - Decimal, Format - Fixed, Precision - 12, Scale - 2, Decimal Places - 2
But the number that ends up in the table is 1600.90. Even if I hard-code the variable to 1600.91, Access shows 1600.90. Other rows in that column have both decimals showing correctly.
The setup for the Access connection is as follows:
Public con As New System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source=|DataDirectory|\SPJ_CRVM.accdb")
Public daAnswer As New OleDbDataAdapter("SELECT * FROM [answer]", con)
Public cbAnswer As New OleDbCommandBuilder(daAnswer)
Public dtAnswer As New DataTable
Public drAnswer As DataRow
cbAnswer.quoteprefix = "["
cbAnswer.quotesuffix = "]"
daAnswer.Fill(dtAnswer)
drAnswer = dtAnswer.NewRow()
The code is as follows:
Dim ldec_gmp as Decimal
ldec_gmp = Math.Round(NUM / PVP * 12 / M, 4, MidpointRounding.AwayFromZero)
drAnswer("guar_maturity_premium") = ldec_gmp
And writing to Access is done here:
dtAnswer.Rows.Add(drAnswer)
daAnswer.Update(dtAnswer)
Update:
NUM & PVP are both dimensioned as Decimal; M is dimensioned at Integer
In the example row:
NUM = 41506.813854184606481174528411
PVP = 25.927068677092090950619294386
M = 12
ldec_gmp = 1600.9065 when rounding to 4 decimal places with MidpointRounding.AwayFromZero
ldec_gmp = 1600.91 when rounding to 2 decimal places with MidpointRounding.AwayFromZero
Note that 1600.9065 truncated (rather than rounded) to two decimal places is exactly 1600.90, which matches what ends up in the table.
What am I doing wrong?
I am analyzing cluster user log files with the following code in Pig:
t_data = load 'log_flies/*' using PigStorage(',');
A = foreach t_data generate $0 as (jobid:int),
$1 as (indexid:int), $2 as (clusterid:int), $6 as (user:chararray),
$7 as (stat:chararray), $13 as (queue:chararray), $32 as (projectName:chararray), $52 as (cpu_used:float), $55 as (efficiency:float), $59 as (numThreads:int),
$61 as (numNodes:int), $62 as (numCPU:int),$72 as (comTime:int),
$73 as (penTime:int), $75 as (runTime:int), $52/($62*$75) as (allEff: float), SUBSTRING($68, 0, 11) as (endTime: chararray);
---describe A;
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
B = group A by user;
f_data = foreach B {
grp = group;
count = COUNT(A);
avg = AVG(A.cpu_used);
generate FLATTEN(grp), count, avg;
};
f_data = limit f_data 10;
dump f_data;
The code works for GROUP and COUNT, but when I include AVG and SUM, it shows this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
iterator for alias f_data
I checked the data types and all are fine. Do you have any suggestions on where I went wrong? Thank you in advance for your help.
It's a syntax error. Read http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach (section: Nested foreach) for details.
Pig Script
A = LOAD 'a.csv' USING PigStorage(',') AS (user:chararray, cpu_used:float);
B = GROUP A BY user;
C = FOREACH B {
cpu_used_bag = A.cpu_used;
GENERATE group AS user, AVG(cpu_used_bag) AS avg_cpu_used, SUM(cpu_used_bag) AS total_cpu_used;
};
Input : a.csv
a,3
a,4
b,5
Output :
(a,3.5,7.0)
(b,5.0,5.0)
Your Pig script is full of errors:
Do not use the same alias on both sides of the =.
When loading, use PigLoader() as (mention your schema appropriately).
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
Change this to:
F = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
f_data = limit f_data 10;
Change the left-hand f_data to some other name.
Stop making your life complex.
General rules for debugging a Pig script:
Run in local mode.
Dump after every line.
I wrote a sample Pig script to mimic yours (working):
t_data = load './file' using PigStorage(',') as (jobid:int,cpu_used:float);
C = foreach t_data generate jobid, cpu_used ;
B = group C by jobid ;
f_data = foreach B {
count = COUNT(C);
sum = SUM(C.cpu_used);
avg = AVG(C.cpu_used);
generate FLATTEN(group), count,sum,avg;
};
never_f_data = limit f_data 10;
dump never_f_data;
Any call using jdbcTemplate.queryForList returns a list of Maps that have NULL values for all columns. The columns should have string values.
I do get the correct number of rows when compared to the result I get when I run the same query in a native SQL client.
I am using the JDBC-ODBC bridge, and the database is MS SQL Server 2008.
I have the following code in my DAO:
public List internalCodeDescriptions(String listID) {
List rows = jdbcTemplate.queryForList("select CODE, DESCRIPTION from CODE_DESCRIPTIONS where LIST_ID=? order by sort_order asc", new Object[] {listID});
//debugcode start
try {
Connection conn1 = jdbcTemplate.getDataSource().getConnection();
Statement stat = conn1.createStatement();
boolean sok = stat.execute("select code, description from code_descriptions where list_id='TRIGGER' order by sort_order asc");
if(sok) {
ResultSet rs = stat.getResultSet();
ResultSetMetaData rsmd = rs.getMetaData();
String columnname1=rsmd.getColumnName(1);
String columnname2=rsmd.getColumnName(2);
int type1 = rsmd.getColumnType(1);
int type2 = rsmd.getColumnType(2);
String tn1 = rsmd.getColumnTypeName(1);
String tn2 = rsmd.getColumnTypeName(2);
log.debug("Testquery gave resultset with:");
log.debug("Column 1 -name:" + columnname1 + " -typeID:"+type1 + " -typeName:"+tn1);
log.debug("Column 2 -name:" + columnname2 + " -typeID:"+type2 + " -typeName:"+tn2);
int i=1;
while(rs.next()) {
String cd=rs.getString(1);
String desc=rs.getString(2);
log.debug("Row #"+i+": CODE='"+cd+"' DESCRIPTION='"+desc+"'");
i++;
}
} else {
log.debug("Query execution returned false");
}
} catch(SQLException se) {
log.debug("Something went haywire in the debug code:" + se.toString());
}
log.debug("Original jdbcTemplate list result gave:");
Iterator<Map<String, Object>> it1= rows.iterator();
while(it1.hasNext()) {
Map mm = (Map)it1.next();
log.debug("Map:"+mm);
String code=(String)mm.get("CODE");
String desc=(String)mm.get("description");
log.debug("CODE:"+code+" : "+desc);
}
//debugcode end
return rows;
}
As you can see, I've added some debugging code to list the results from queryForList, and I also obtain the connection from the jdbcTemplate object and use it to send the same query using the basic JDBC methods (listID='TRIGGER').
What is puzzling me is that the log outputs something like this:
Testquery gave resultset with:
Column 1 -name:code -typeID:-9 -typeName:nvarchar
Column 2 -name:description -typeID:-9 -typeName:nvarchar
Row #1: CODE='C1' DESCRIPTION='BlodoverxF8rin eller bruk av blodprodukter'
Row #2: CODE='C2' DESCRIPTION='Kodetilfelle, hjertestans/respirasjonstans'
Row #3: CODE='C3' DESCRIPTION='Akutt dialyse'
...
Row #58: CODE='S14' DESCRIPTION='Forekomst av hvilken som helst komplikasjon'
...
Original jdbcTemplate list result gave:
Map:(CODE=null, DESCRIPTION=null)
CODE:null : null
Map:(CODE=null, DESCRIPTION=null)
CODE:null : null
...
58 repetitions total.
Why does the result from the queryForList method return NULL in all columns for every row? How can I get the result I want using jdbcTemplate.queryForList?
The xF8 should be the letter ø, so I have some encoding issues, but I can't see how that could cause all values - including strings without any strange letters (see row #2) - to turn into NULL values in the list of maps returned from the jdbcTemplate.queryForList method.
The same code ran fine on another server against a MySQL Server 5.5 database using the jdbc driver for MySQL.
The issue was resolved by using the MS SQL Server JDBC driver rather than the JDBC-ODBC bridge. I don't know why it didn't work with the bridge, though.
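For reference, a minimal sketch of that swap on the Spring side, assuming the Microsoft driver jar is on the classpath; the host, port, database name and credentials are placeholders:
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

// Use the Microsoft SQL Server JDBC driver instead of the JDBC-ODBC bridge.
DriverManagerDataSource ds = new DriverManagerDataSource();
ds.setDriverClassName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
ds.setUrl("jdbc:sqlserver://dbhost:1433;databaseName=mydb"); // placeholder host/db
ds.setUsername("user");     // placeholder
ds.setPassword("password"); // placeholder
JdbcTemplate jdbcTemplate = new JdbcTemplate(ds);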
I need help. I have an ArrayList of objects. Each object contains multiple fields; for this question I am interested in two date fields (date_panne and date_mise_en_marche) and two time fields (heure_panne and heure_mise_en_marche).
I would like to obtain the sum of the differences between (date_panne, heure_panne) and (date_mise_en_marche, heure_mise_en_marche) to give the total failure time.
If someone can help me I will be grateful. This is my function:
public String disponibile() throws Exception {
int nbreArrets = 0;
List<Intervention> allInterventions = interventionDAO.fetchAllIntervention();
List<Intervention> listInterventions = new ArrayList<Intervention>();
for (Intervention currentIntervention : allInterventions) {
if (currentIntervention.getId_machine() == this.intervention.getId_machine()
&& currentIntervention.getDate_panne().compareTo(getProductionStartDate()) >= 0
&& currentIntervention.getDate_panne().compareTo(getProductionEndDate()) <= 0) {
listInterventions.add(currentIntervention);
}
}
savedInterventionList = listInterventions;
return "successView" ;
}
Assuming that the dates are truncated to the day and are of type java.util.Date, and that the times only contain hours, minutes, seconds and milliseconds and are also of type Date, start by creating a method like:
private Date combine(Date dateOnly, Date timeOnly) {
    // Start from the day in dateOnly...
    Calendar dateCalendar = Calendar.getInstance();
    dateCalendar.setTime(dateOnly);
    Calendar timeCalendar = Calendar.getInstance();
    timeCalendar.setTime(timeOnly);
    // ...then add the time-of-day fields from timeOnly on top of it.
    dateCalendar.add(Calendar.HOUR_OF_DAY, timeCalendar.get(Calendar.HOUR_OF_DAY));
    dateCalendar.add(Calendar.MINUTE, timeCalendar.get(Calendar.MINUTE));
    dateCalendar.add(Calendar.SECOND, timeCalendar.get(Calendar.SECOND));
    dateCalendar.add(Calendar.MILLISECOND, timeCalendar.get(Calendar.MILLISECOND));
    return dateCalendar.getTime();
}
Now it's simply a matter of looping through the interventions you want to sum, computing the difference between the combined dates in milliseconds, and adding them up:
long totalMillis = 0L;
for (Intervention intervention : interventions) {
Date marche = combine(intervention.getDateMiseEnMarche(), intervention.getTimeMiseEnMarche());
Date panne = combine(intervention.getDatePanne(), intervention.getTimePanne());
long differenceInMillis = marche.getTime() - panne.getTime();
totalMillis += differenceInMillis;
}
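To present totalMillis in a readable form, one option is java.util.concurrent.TimeUnit; the formatting below is just an example:
import java.util.concurrent.TimeUnit;

// Break the accumulated milliseconds down into hours, minutes and seconds.
long hours = TimeUnit.MILLISECONDS.toHours(totalMillis);
long minutes = TimeUnit.MILLISECONDS.toMinutes(totalMillis) % 60;
long seconds = TimeUnit.MILLISECONDS.toSeconds(totalMillis) % 60;
System.out.println(String.format("Total failure time: %dh %02dm %02ds", hours, minutes, seconds));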
I'm a new Pig user.
I have an existing schema which I want to modify. My source data is as follows with 6 columns:
Name Type Date Region Op Value
-----------------------------------------------------
john ab 20130106 D X 20
john ab 20130106 D C 19
john ab 20130106 D T 8
john ab 20130106 E C 854
john ab 20130106 E T 67
john ab 20130106 E X 98
and so on. Each Op value is always C, T or X.
I basically want to split my data in the following way into 7 columns:
Name Type Date Region OpX OpC OpT
----------------------------------------------------------
john ab 20130106 D 20 19 8
john ab 20130106 E 98 854 67
Basically, split the Op column into 3 columns: one for each Op value. Each of these columns should contain the appropriate value from the Value column.
How can I do this in Pig?
One way to achieve the desired result:
IN = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
date:int, region:chararray, op:chararray, value:int);
A = order IN by op asc;
B = group A by (name, type, date, region);
C = foreach B {
bs = STRSPLIT(BagToString(A.value, ','),',',3);
generate flatten(group) as (name, type, date, region),
bs.$2 as OpX:chararray, bs.$0 as OpC:chararray, bs.$1 as OpT:chararray;
}
describe C;
C: {name: chararray,type: chararray,date: int,region: chararray,OpX:
chararray,OpC: chararray,OpT: chararray}
dump C;
(john,ab,20130106,D,20,19,8)
(john,ab,20130106,E,98,854,67)
Update:
If you want to skip the order by, which adds an additional reduce phase to the computation, you can prefix each value with its corresponding op in tuple v, then sort the tuple fields using a custom UDF to get the desired OpX, OpC, OpT order:
register 'myjar.jar';
A = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
date:int, region:chararray, op:chararray, value:int);
B = group A by (name, type, date, region);
C = foreach B {
v = foreach A generate CONCAT(op, (chararray)value);
bs = STRSPLIT(BagToString(v, ','),',',3);
generate flatten(group) as (name, type, date, region),
flatten(TupleArrange(bs)) as (OpX:chararray, OpC:chararray, OpT:chararray);
}
where TupleArrange in myjar.jar is something like this:
..
import java.io.IOException;
import java.util.Arrays;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
public class TupleArrange extends EvalFunc<Tuple> {
private static final TupleFactory tupleFactory = TupleFactory.getInstance();
@Override
public Tuple exec(Tuple input) throws IOException {
try {
Tuple result = tupleFactory.newTuple(3);
Tuple inputTuple = (Tuple) input.get(0);
String[] tupleArr = new String[] {
(String) inputTuple.get(0),
(String) inputTuple.get(1),
(String) inputTuple.get(2)
};
Arrays.sort(tupleArr); //ascending
result.set(0, tupleArr[2].substring(1));
result.set(1, tupleArr[0].substring(1));
result.set(2, tupleArr[1].substring(1));
return result;
}
catch (Exception e) {
throw new RuntimeException("TupleArrange error", e);
}
}
@Override
public Schema outputSchema(Schema input) {
return input;
}
}