TestNG split dataProvider between parallel methods - parallel-processing

I have a dataProvider which is reading the data from a text file.
@DataProvider(name = "dynamicDP", parallel = true)
public Iterator<Object[]> matchIDs() throws IOException {
    final List<Object[]> list = new ArrayList<>();
    for (final String line : Files.readAllLines(Paths.get("C:\\mypath"),
            StandardCharsets.UTF_8)) {
        list.add(new Object[]{ line });
    }
    return list.iterator();
}
My text file is really simple; it only contains the data below (each pair of letters is on a separate line):
AA BB CC DD EE FF GG HH II KK
Here is my test class:
public class dataProviderParallelTest {

    @Test(dataProvider = "dynamicDP")
    public void verifyDPdata(String comingFromDP) {
        System.out.printf("%nDP#1..: " + comingFromDP);
    }

    @Test(dataProvider = "dynamicDP")
    public void verifyDPdata2(String comingFromDP) {
        System.out.printf("%nDP#2..: " + comingFromDP);
    }
}
Here is the output:
[TestNG] Running:
C:\projects\test\currentTest.xml
DP#1..: AA
DP#2..: BB
DP#1..: BB
DP#2..: AA
DP#1..: CC
DP#2..: CC
DP#1..: DD
DP#1..: EE
DP#2..: EE
DP#2..: DD
DP#1..: FF
DP#2..: FF
DP#1..: GG
DP#1..: HH
DP#2..: HH
DP#2..: GG
DP#1..: II
DP#2..: II
DP#1..: KK
DP#2..: KK
===============================================
Regression
Total tests run: 20, Failures: 0, Skips: 0
===============================================
And here is my XML file that I'm using to start my test:
<!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd" >
<suite name="Regression" parallel="methods" thread-count="2" data-provider-thread-count="2">
<test name="smokeTest11">
<classes>
<class name="regression.bo.dataProviderParallelTest"/>
</classes>
</test>
</suite>
What I have tried:
I've read this article: cedricBlog
and this Stack Overflow post: stackOverFlow
What I am trying to achieve:
I am trying to share the data between the two threads. At the moment all I have achieved is both threads executing the same data provided by the DP. My aim is to split the data between the two methods and get output like this (DP data shared between the 2 methods):
DP#1..: AA
DP#2..: BB
DP#1..: DD
DP#2..: EE
DP#1..: CC
DP#2..: GG
DP#1..: KK
DP#1..: HH
DP#2..: II
DP#2..: FF
Is this even possible or am I missing something? Thanks in advance for your help!

Take the method as a parameter to your data provider. Divide your data into two lists, whichever way you like, and based on the method name return one list to each.
For example, the following code prints the name of the test method inside its @DataProvider and returns a different list depending on that name:
@DataProvider(name = "dynamicDP", parallel = true)
public Object[][] dynamicDP(Method m) {
    System.out.println(m.getName());
    // divide your data into two lists (list1, list2), whichever way you like
    if (m.getName().equals("Met1")) {
        return list1;
    } else {
        return list2;
    }
}
@Test(dataProvider = "dynamicDP")
public void test1(String s) {
}

@Test(dataProvider = "dynamicDP")
public void test2(String s) {
}
HTH
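A minimal, self-contained sketch of that idea (the class name SplitDataProviderTest and the even/odd split are assumptions, not part of the original answer): the provider reads the file once and hands even-numbered lines to verifyDPdata and odd-numbered lines to verifyDPdata2.
import java.io.IOException;
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

public class SplitDataProviderTest {

    @DataProvider(name = "dynamicDP", parallel = true)
    public Object[][] splitByMethod(Method m) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("C:\\mypath"), StandardCharsets.UTF_8);
        List<Object[]> rows = new ArrayList<>();
        // even-numbered lines go to verifyDPdata, odd-numbered lines to verifyDPdata2
        boolean wantEven = m.getName().equals("verifyDPdata");
        for (int i = 0; i < lines.size(); i++) {
            if ((i % 2 == 0) == wantEven) {
                rows.add(new Object[]{ lines.get(i) });
            }
        }
        return rows.toArray(new Object[0][]);
    }

    @Test(dataProvider = "dynamicDP")
    public void verifyDPdata(String comingFromDP) {
        System.out.printf("%nDP#1..: %s", comingFromDP);
    }

    @Test(dataProvider = "dynamicDP")
    public void verifyDPdata2(String comingFromDP) {
        System.out.printf("%nDP#2..: %s", comingFromDP);
    }
}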

Okay, posting my solution for reference. In short it is this: count the lines in the file, and start reading one line at a time (from 10 different threads) until you reach the EOF. Here it is:
public volatile int currentLine = 0;
public static Object writeLock = new Object();
public static Object readLock = new Object();
public long currentThread = Thread.currentThread().getId();

@Test(invocationCount = 50)
public void readOneLineGetID() throws IOException {
    countLines();
    if (currentLine == noOfLines) {
        throw new SkipException("%nSkipping this test method as we got all the records we need.");
    }
    long threadID = Thread.currentThread().getId();
    synchronized (readLock) {
        if (currentLine < noOfLines) {
            System.out.printf("%nCurrent thread is..: " + threadID);
            readASpecificLineFromATextFile(currentLine);
            System.out.printf("%n----------------------------------------------------------");
        }
    }
    synchronized (writeLock) {
        currentLine++;
    }
}
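The snippet above relies on countLines(), noOfLines and readASpecificLineFromATextFile(), which are not shown in the post. A minimal sketch of what those members might look like (the bodies, and re-reading the file on every call, are assumptions purely for illustration; they belong in the same test class and need the java.nio.file and java.nio.charset imports):
public int noOfLines = 0;

// Assumed helper: counts the lines in the input file.
public void countLines() throws IOException {
    noOfLines = Files.readAllLines(Paths.get("C:\\mypath"), StandardCharsets.UTF_8).size();
}

// Assumed helper: prints a single (zero-based) line from the input file.
public void readASpecificLineFromATextFile(int lineNumber) throws IOException {
    String line = Files.readAllLines(Paths.get("C:\\mypath"), StandardCharsets.UTF_8).get(lineNumber);
    System.out.printf("%nLine %d..: %s", lineNumber, line);
}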
So I have 10 of these methods and I am pushing my tests concurrently onto a grid hub which then uses 10 different data providers to feed the nodes.
Only minor point is the invocationCount: ideally I should divide the line count by 10 and set the invocation count on every single method accordingly, but as I don't have time to re-invent the wheel and this runs very fast (I'm normally only dealing with 200-line files), I decided to skip the remaining invocations once the line number reaches EOF ;)
Running it (10 methods, each with an invocation count of 50) finds the 10 mock-up lines from my file and skips the rest. After some tinkering with the sync, beautiful!!! :)

Related

Return results from data table in a sequence using LINQ

I'm fetching rows from an Excel sheet in my application that holds attendance records from the biometric machine. In order to get the best result I have to remove the redundant data. For that I have to manage check-in and check-out timings at regular intervals: first the check-in time for entering, then the check-out time for lunch, then a check-in again for returning, and last a check-out for going home. Meanwhile the rows in Excel contain multiple check-ins and check-outs, as the employee tends to do both more than once.
I have managed to get the records from Excel and add them to a DataTable. Now, for the sequencing and sorting part, I'm struggling to achieve my desired result. Below is my code.
protected void btnSaveAttendance_Click(object sender, EventArgs e)
{
    try
    {
        if (FileUpload1.HasFile && Path.GetExtension(FileUpload1.FileName) == ".xls")
        {
            using (var excel = new OfficeOpenXml.ExcelPackage(FileUpload1.PostedFile.InputStream))
            {
                var tbl = new DataTable();
                var ws = excel.Workbook.Worksheets.First();
                var hasHeader = true; // adjust accordingly

                // add DataColumns to DataTable
                foreach (var firstRowCell in ws.Cells[1, 1, 1, ws.Dimension.End.Column])
                    tbl.Columns.Add(hasHeader ? firstRowCell.Text
                                              : String.Format("Column {0}", firstRowCell.Start.Column));

                // add DataRows to DataTable
                int startRow = hasHeader ? 2 : 1;
                for (int rowNum = startRow; rowNum <= ws.Dimension.End.Row; rowNum++)
                {
                    var wsRow = ws.Cells[rowNum, 1, rowNum, ws.Dimension.End.Column];
                    DataRow row = tbl.NewRow();
                    foreach (var cell in wsRow)
                        row[cell.Start.Column - 1] = cell.Text;
                    tbl.Rows.Add(row);
                }

                var distinctNames = (from row in tbl.AsEnumerable()
                                     select row.Field<string>("Employee Code")).Distinct();
                DataRow[] dataRows = tbl.Select().OrderBy(u => u["Employee Code"]).ToArray();
                var ss = dataRows.Where(p => p.Field<string>("Employee Code") == "55").ToArray();
            }
        }
    }
    catch (Exception ex) { }
}
The result I'm getting is:
Employee Code Employee Name Date Time In / Out
55 Alex 12/27/2018 8:59 IN
55 Alex 12/27/2018 8:59 IN
55 Alex 12/27/2018 13:00 OUT
55 Alex 12/27/2018 13:00 OUT
55 Alex 12/27/2018 13:48 IN
55 Alex 12/27/2018 13:49 IN
55 Alex 12/27/2018 18:08 OUT
And I want to have first IN, then OUT, then IN, then OUT; this would iterate four times to generate the result.
Expected result is:
Employee Code Employee Name Date Time In / Out
55 Alex 12/27/2018 8:59 IN
55 Alex 12/27/2018 13:00 OUT
55 Alex 12/27/2018 13:48 IN
55 Alex 12/27/2018 18:08 OUT
Can you try doing a GroupBy on the result, like below?
ss=ss.GroupBy(x=>x.DateTime).ToArray();
Then build some logic for the case where your result has two successive INs or OUTs, roughly like the sample below.
Here I treated In as the field name.
var tt = new List<DataRow>();
for (int i = 0; i < ss.Count(); i++)
{
    // keep a row only when its In/Out status differs from the last row kept
    if (tt.Count == 0 || tt.Last().Field<string>("In") != ss[i].Field<string>("In"))
        tt.Add(ss[i]);
}

Pig sum fails with +ve and -ve values

I have the data below:
primary,first,second
1,393440.09,354096.08
1,4410533.33,3969479.99
1,-4803973.41,-4323576.07
I have to group by the primary column and sum the first and second columns. Below is the script I am executing:
data_load = load <filelocation> using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as (primary:double, first:double, second:double);
dataAgrr = group data_load by primary;
sumData = FOREACH dataAgrr GENERATE
group as data,
SUM(data_load.first) as first,
SUM(data_load.second) as second,
SUM(data_load.primary) as primary;
After executing, the output below is produced:
(1.0,0.009999999951105565,-5.820766091346741E-11,3.0)
But manually adding the second column's values (354096.08, 3969479.99, -4323576.07) gives 0.
Pig uses Java "double" internally.
Testing with the sample code below:
import java.math.BigDecimal;

public class TestSum {
    public static void main(String[] args) {
        double d1 = 354096.08;
        double d2 = 3969479.99;
        double d3 = -4323576.07;
        System.err.println("Total in double is " + ((d3 + d2) + d1));

        BigDecimal bd1 = new BigDecimal("354096.08");
        BigDecimal bd2 = new BigDecimal("3969479.99");
        BigDecimal bd3 = new BigDecimal("-4323576.07");
        System.err.println("Total in BigDecimal is " + bd3.add(bd2).add(bd1));
    }
}
This produces
Total in double is -5.820766091346741E-11
Total in BigDecimal is 0.00
If you need better precision, you may want to try using "bigdecimal" instead of "double" in your script.

How do I transform a parameter in Pig?

I need to process a dataset in Pig which is available once per day at midnight. Therefore I have an Oozie coordinator that takes care of the scheduling and spawns a workflow every day at 00:00.
The file names follow the URI scheme
hdfs://${dataRoot}/input/raw${YEAR}${MONTH}${DAY}${HOUR}.avro
where ${HOUR} is always '00'.
Each entry in the dataset contains a UNIX timestamp and I want to filter out those entries which have a timestamp before 11:45pm (23:45). As I need to run on datasets from the past, the value of the timestamp defining the threshold needs to be set dynamically according to the day currently processed. For example, processing the dataset from December, 12th 2013 needs the threshold 1418337900. For this reason, setting the threshold must be done by the coordinator.
To the best of my knowledge, there is no possibility to transform a formatted date into a UNIX timestamp in EL. I came up with a quite hacky solution:
The coordinator passes date and time of the threshold to the respective workflow which starts the parameterized instance of the Pig script.
Excerpt of the coordinator.xml:
<property>
<name>threshold</name>
<value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -15, 'MINUTE'), 'yyyyMMddHHmm')}</value>
</property>
Excerpt of the workflow.xml:
<action name="foo">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<script>${applicationPath}/flights.pig</script>
<param>jobInput=${jobInput}</param>
<param>jobOutput=${jobOutput}</param>
<param>threshold=${threshold}</param>
</pig>
<ok to="end"/>
<error to="error"/>
</action>
The Pig script needs to convert this formatted datetime into a UNIX timestamp. Therefore, I have written a UDF:
import java.io.IOException;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UnixTime extends EvalFunc<Long> {

    private long myTimestamp = 0L;

    private static long convertDateTime(String dt, String format)
            throws IOException {
        DateFormat formatter;
        Date date = null;
        formatter = new SimpleDateFormat(format);
        try {
            date = formatter.parse(dt);
        } catch (ParseException ex) {
            throw new IOException("Illegal Date: " + dt + " format: " + format);
        }
        return date.getTime() / 1000L;
    }

    public UnixTime(String dt, String format) throws IOException {
        myTimestamp = convertDateTime(dt, format);
    }

    @Override
    public Long exec(Tuple input) throws IOException {
        return myTimestamp;
    }
}
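For reference only (not part of the original post), the same yyyyMMddHHmm-to-epoch-seconds conversion can be written with java.time. Note that the SimpleDateFormat above parses in the JVM's default time zone; the sketch below (the class name UnixTimeHelper is made up) assumes the timestamps are UTC, so the ZoneOffset may need adjusting:
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class UnixTimeHelper {
    // Convert e.g. "201312122345" to epoch seconds, assuming the timestamp is UTC.
    public static long toEpochSeconds(String dt, String pattern) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern);
        return LocalDateTime.parse(dt, fmt).toEpochSecond(ZoneOffset.UTC);
    }
}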
In the Pig script, a DEFINE alias is created, initializing the UDF with the input from the coordinator/workflow. Then you can filter on the timestamps.
DEFINE THRESH mystuff.pig.UnixTime('$threshold', 'yyyyMMddHHmm');
d = LOAD '$jobInput' USING PigStorage(',') AS (time: long, value: chararray);
f = FILTER d BY time <= THRESH();
...
The problem that I have leads me to the more general question of whether it is possible to transform an input parameter in Pig and use it again as some kind of constant.
Is there a better way to solve this problem, or is my approach needlessly complicated?
Edit: TL;DR
After more searching I found someone with the same problem:
http://grokbase.com/t/pig/user/125gszzxnx/survey-where-are-all-the-udfs-and-macros
Thanks Gaurav for recommending the UDFs in piggybank.
It seems that there is no performant solution without using declare and a shell script.
You can put the Pig script into a Python script and pass the value.
#!/usr/bin/python
import sys
import time
from org.apache.pig.scripting import Pig

P = Pig.compile("""
d = LOAD '$jobInput' USING PigStorage(',') AS (time: long, value: chararray);
f = FILTER d BY time <= $thresh;
""")

jobInput = ...  # whatever you defined
thresh = ...    # the threshold as a UNIX timestamp (what the UDF computed before)

Q = P.bind({'thresh': thresh, 'jobInput': jobInput})
results = Q.runSingle()

if not results.isSuccessful():
    raise Exception("Pig job failed")

Aparapi add sample

I'm studying Aparapi (https://code.google.com/p/aparapi/) and I see strange behaviour in one of the included samples.
The sample is the first one, "add". Building and executing it is OK. I also put in the following code to test whether the GPU is really used:
if(!kernel.getExecutionMode().equals(Kernel.EXECUTION_MODE.GPU)){
System.out.println("Kernel did not execute on the GPU!");
}
and it works fine.
But, if I try to change the size of the array from 512 to a number greater than 999 (for example 1000), I have the following output:
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize[0] = 1000, localSize[0] = 128
Apr 18, 2013 1:31:01 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###
JTP
Kernel did not execute on the GPU!
Here's my code:
final int size = 1000;

final float[] a = new float[size];
final float[] b = new float[size];

for (int i = 0; i < size; i++) {
    a[i] = (float) (Math.random() * 100);
    b[i] = (float) (Math.random() * 100);
}

final float[] sum = new float[size];

Kernel kernel = new Kernel() {
    @Override public void run() {
        int gid = getGlobalId();
        sum[gid] = a[gid] + b[gid];
    }
};

Range range = Range.create(size);
kernel.execute(range);

System.out.println(kernel.getExecutionMode());
if (!kernel.getExecutionMode().equals(Kernel.EXECUTION_MODE.GPU)) {
    System.out.println("Kernel did not execute on the GPU!");
}
kernel.dispose();
I tried specifying the size using
Range range = Range.create(size, 128);
as suggested in a Google group, but nothing changed.
I'm currently running on Mac OS X 10.8 with Java 1.6.0_43. Aparapi version is the latest (2012-01-23).
Am I missing something? Any ideas?
Thanks in advance
Aparapi inherits a 'Grid Style' of implementation from OpenCL. When you specify a range of execution (say 1024), OpenCL will break this 'range' into groups of equal size. Possibly 4 groups of 256, or 8 groups of 128.
The group size must be a factor of range (so assert(range%groupSize==0)).
By default Aparapi internally selects the group size.
But you are choosing to fully specify the range and group size by using
Range r = Range.create(n, 128)
You are responsible for ensuring that n % 128 == 0.
From the error, it looks like you chose Range.create(1000, 128).
Sadly 1000 % 128 != 0, so this range will fail.
If you specify
Range r = Range.create(n)
Aparapi will choose a valid group size by finding the highest common factor of n.
Try dropping the 128 as the second arg.
Gary
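If you do need to pin the group size yourself, a common OpenCL-style workaround (a sketch, not part of the original answer) is to round the global size up to the next multiple of the group size and guard the padded work-items inside the kernel:
final int size = 1000;
final int groupSize = 128;
// round the global size up to the next multiple of the group size (1000 -> 1024)
final int globalSize = ((size + groupSize - 1) / groupSize) * groupSize;

final float[] a = new float[size];
final float[] b = new float[size];
final float[] sum = new float[size];

Kernel kernel = new Kernel() {
    @Override public void run() {
        int gid = getGlobalId();
        if (gid < size) {          // guard: padded work-items do nothing
            sum[gid] = a[gid] + b[gid];
        }
    }
};

kernel.execute(Range.create(globalSize, groupSize));
kernel.dispose();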

Pig Changing Schema to required type

I'm a new Pig user.
I have an existing schema which I want to modify. My source data is as follows with 6 columns:
Name Type Date Region Op Value
-----------------------------------------------------
john ab 20130106 D X 20
john ab 20130106 D C 19
jphn ab 20130106 D T 8
jphn ab 20130106 E C 854
jphn ab 20130106 E T 67
jphn ab 20130106 E X 98
and so on. Each Op value is always C, T or X.
I basically want to split my data in the following way into 7 columns:
Name Type Date Region OpX OpC OpT
----------------------------------------------------------
john ab 20130106 D 20 19 8
john ab 20130106 E 98 854 67
Basically split the Op column into 3 columns: each for one Op value. Each of these columns should contain appropriate value from column Value.
How can I do this in Pig?
One way to achieve the desired result:
IN = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
date:int, region:chararray, op:chararray, value:int);
A = order IN by op asc;
B = group A by (name, type, date, region);
C = foreach B {
bs = STRSPLIT(BagToString(A.value, ','),',',3);
generate flatten(group) as (name, type, date, region),
bs.$2 as OpX:chararray, bs.$0 as OpC:chararray, bs.$1 as OpT:chararray;
}
describe C;
C: {name: chararray,type: chararray,date: int,region: chararray,OpX:
chararray,OpC: chararray,OpT: chararray}
dump C;
(john,ab,20130106,D,20,19,8)
(john,ab,20130106,E,98,854,67)
Update:
If you want to skip the order by, which adds an additional reduce phase to the computation, you can prefix each value with its corresponding op in tuple v. Then sort the tuple fields using a custom UDF to get the desired OpX, OpC, OpT order:
register 'myjar.jar';
A = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
date:int, region:chararray, op:chararray, value:int);
B = group A by (name, type, date, region);
C = foreach B {
v = foreach A generate CONCAT(op, (chararray)value);
bs = STRSPLIT(BagToString(v, ','),',',3);
generate flatten(group) as (name, type, date, region),
flatten(TupleArrange(bs)) as (OpX:chararray, OpC:chararray, OpT:chararray);
}
where TupleArrange in myjar.jar is something like this:
import java.io.IOException;
import java.util.Arrays;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TupleArrange extends EvalFunc<Tuple> {

    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        try {
            Tuple result = tupleFactory.newTuple(3);
            Tuple inputTuple = (Tuple) input.get(0);
            String[] tupleArr = new String[] {
                (String) inputTuple.get(0),
                (String) inputTuple.get(1),
                (String) inputTuple.get(2)
            };
            Arrays.sort(tupleArr); // ascending: C.., T.., X..
            result.set(0, tupleArr[2].substring(1)); // X value
            result.set(1, tupleArr[0].substring(1)); // C value
            result.set(2, tupleArr[1].substring(1)); // T value
            return result;
        }
        catch (Exception e) {
            throw new RuntimeException("TupleArrange error", e);
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        return input;
    }
}
