Sort documents by effective date

Each document (an InputStream) has a field called effective date. I need all these individual documents combined into one document, sorted by effective date. Here is what I have so far:
import java.util.Properties;
import java.io.InputStream;

// Current script: copies each document through unchanged, with no sorting.
for (int i = 0; i < dataContext.getDataCount(); i++) {
    InputStream is = dataContext.getStream(i);
    Properties props = dataContext.getProperties(i);
    dataContext.storeStream(is, props);
}
Thanks
Nag

Add your documents to an ArrayList, then use List.sort(Comparator) with a comparator that compares the effective dates. After that, iterate over the list with a for-each loop and add the documents to your output.
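For instance, here is a minimal sketch in the same scripting style as the question. The property key "effectiveDate" is a placeholder, use whatever key actually carries the date in your process, and this assumes the value is stored in a lexicographically sortable form such as yyyy-MM-dd:

import java.io.InputStream;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Properties;

// Collect each stream together with its properties so the pair stays intact while sorting.
List<Object[]> docs = new ArrayList<>();
for (int i = 0; i < dataContext.getDataCount(); i++) {
    docs.add(new Object[] { dataContext.getStream(i), dataContext.getProperties(i) });
}

// Sort by the (placeholder) effective-date property; assumes every document has it set.
docs.sort(Comparator.comparing((Object[] d) -> ((Properties) d[1]).getProperty("effectiveDate")));

// Store the documents back in sorted order.
for (Object[] doc : docs) {
    dataContext.storeStream((InputStream) doc[0], (Properties) doc[1]);
}

If you need one literal combined document rather than a sorted batch, you could concatenate the sorted streams (for example with java.io.SequenceInputStream) into a single stream before storing it.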

Related

How to count distinct fields in the MongoDB Java API

I need to find the count of a distinct field. I used MongoCollection.distinct(), which returns a DistinctIterable, but that has no size method. To find the size I have to iterate the DistinctIterable and count. Is there any method by which I can find the number of distinct field values without iterating?
MongoCollection<Document> collection = db.getCollection("test");
DistinctIterable<String> disIterable = collection.distinct("name", String.class);
int count = 0;
Iterator<String> iterator = disIterable.iterator();
while (iterator.hasNext()) {
    iterator.next();
    count = count + 1;
}
Try this:
public long returnSize() {
    MongoCollection<Document> collection = db.getCollection("test");
    DistinctIterable<String> disIterable = collection.distinct("name", String.class);
    // into() drains the iterable into the list for you, then size() gives the count.
    return disIterable.into(new ArrayList<>()).size();
}
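Note that into() still pulls every distinct value to the client just to count them. If the number of distinct values is large, an aggregation can do the counting server-side instead; here is a sketch assuming MongoDB 3.4+ (for the $count stage) and the same collection and field as above:

import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoCollection;

public long countDistinct(MongoCollection<Document> collection) {
    // $group collapses each distinct "name" into one document; $count counts the groups.
    Document result = collection.aggregate(Arrays.asList(
            new Document("$group", new Document("_id", "$name")),
            new Document("$count", "count")))
            .first();
    return result == null ? 0 : result.getInteger("count");
}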

Get statistical properties of a list of values stored in JSON with Spark

I have my data stored in a JSON format using the following structure:
{"generationId":1,"values":[-36.0431,-35.913,...,36.0951]}
I want to get the distribution of the spacing (the differences between consecutive numbers) between the values, averaged over the files (generationIds).
The first lines in my Zeppelin notebook are:
import org.apache.spark.sql.SparkSession
val warehouseLocation = "/user/hive/warehouse"
val spark = SparkSession.builder().appName("test").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
val jsonData = spark.read.json("/user/hive/warehouse/results/*.json")
jsonData.createOrReplaceTempView("results")
I just now realized, however, that this was not a good idea. The data from the JSON above now looks like this:
val gen_1 = spark.sql("SELECT * FROM results WHERE generationId = 1")
gen_1.show()
+------------+--------------------+
|generationId| values|
+------------+--------------------+
| 1|[-36.0431, -35.91...|
+------------+--------------------+
All the values are in the same field.
Do you have any idea how to approach this issue in a different way? It does not necessarily have to be Hive; any Spark-related solution is OK.
The number of values can be ~10000, and may grow later. I would like to plot this distribution together with an already known function (simulation vs. theory).
This recursive function, which is not terribly elegant and certainly not battle-tested, calculates the consecutive differences. It handles any collection size, returning an empty result below two elements, but it is not tail-recursive, so a very long list risks a stack overflow; a stack-safe alternative is l.sliding(2).collect { case Seq(a, b) => Math.abs(a - b) }.toSeq:
def differences(l: Seq[Double]): Seq[Double] = {
  if (l.size < 2) {
    Seq.empty[Double] // nothing left to diff
  } else {
    // difference of the first two elements, then recurse on the tail,
    // so each adjacent pair is visited exactly once
    val values = l.take(2)
    Seq(Math.abs(values.head - values(1))) ++ differences(l.tail)
  }
}
Given such a function, you could apply it in Spark like this (map on a DataFrame needs the implicit encoders, hence the import; the column indices assume the inferred schema order generationId, values):
import spark.implicits._
jsonData.map(r => (r.getLong(0), differences(r.getSeq[Double](1))))

Java 8 stream average of object property in collection

I'm new to Java, so if this has already been answered somewhere else, I either didn't know enough to search for the right terms or I just couldn't understand the answers.
So the question being:
I have a bunch of objects in a list:
try (Stream<String> logs = Files.lines(Paths.get(args))) {
    return logs.map(LogLine::parseLine).collect(Collectors.toList());
}
And this is how the properties are added:
LogLine line = new LogLine();
line.setUri(matcher.group("uri"));
line.setRequestDuration(matcher.group("requestDuration"));
....
How do I transform logs so that I end up with a list where each distinct "uri" appears only once, together with the average requestDuration?
Example:
object1.uri = 'uri1', object1.requestDuration = 20;
object2.uri = 'uri2', object2.requestDuration = 30;
object3.uri = 'uri1', object3.requestDuration = 50;
Result:
object1.uri = 'uri1', 35;
object2.uri = 'uri2', 30;
Thanks in advance!
Take a look at Collectors.groupingBy and Collectors.averagingDouble. In your case, you could use them as follows:
Map<String, Double> result = logLines.stream()
    .collect(Collectors.groupingBy(
        LogLine::getUri,
        TreeMap::new,
        Collectors.averagingDouble(LogLine::getRequestDuration)));
The Collectors.groupingBy method does what you want. It is overloaded, so you can specify the function that returns the key to group elements by, the factory that creates the returned map (a TreeMap here, because you want the entries ordered by key, in this case the URI), and a downstream collector, which collects the elements grouped under each key.
Note that Collectors.averagingInt also produces a Double average; it just takes a ToIntFunction, so if getRequestDuration returns an int it saves the conversion to double, but it will not give you an Integer result.
This assumes LogLine has getUri() and getRequestDuration() methods.
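If you then want the output in the shape shown in the question, iterating the resulting map is enough:

result.forEach((uri, avgDuration) ->
    System.out.println(uri + ", " + avgDuration));

With the sample data above this prints uri1, 35.0 and uri2, 30.0 (the averages are doubles).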

How to handle skewed data while grouping in Pig

I am doing a group-by operation in which one reduce task runs very long. Below is a sample code snippet and a description of the issue:
inp = LOAD 'input' USING PigStorage('|') AS (f1,f2,f3,f4,f5);
grp_inp = GROUP inp BY (f1,f2) PARALLEL 300;
Since there is skew in the data, i.e. too many values for one key, one reducer runs for 4 hours while all the other reduce tasks complete in a minute or so.
What can I do to fix this issue? Are there alternative approaches? Any help would be greatly appreciated. Thanks!
A few things to check:
1> Filter out records where both f1 and f2 are NULL (if any).
2> Try to use the Hadoop combiner by implementing the Algebraic interface, if possible:
https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch10s02.html
3> Use a custom partitioner to distribute the data across reducers by another key.
Here is the sample code I used to partition my skewed data after a join (the same can be used after a group as well):
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.io.NullableTuple;
import org.apache.pig.impl.io.PigNullableWritable;

public class KeyPartitioner extends Partitioner<PigNullableWritable, Writable> {

    /**
     * The key contains the value of the current group/join key, and the Writable
     * value contains all fields of the tuple. Here the field at index 5 is used
     * for partitioning because it was known to have evenly distributed values.
     **/
    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        Tuple valueTuple = (Tuple) ((NullableTuple) value).getValueAsPigType();
        try {
            if (valueTuple.size() > 5) {
                Object hashObj = valueTuple.get(5);
                Integer keyHash = Integer.parseInt(hashObj.toString());
                return Math.abs(keyHash) % numPartitions;
            } else if (valueTuple.size() > 1) {
                // Fall back to hashing the second field when the tuple is short.
                return Math.abs(valueTuple.get(1).hashCode()) % numPartitions;
            }
        } catch (NumberFormatException | ExecException ex) {
            Logger.getLogger(KeyPartitioner.class.getName()).log(Level.SEVERE, null, ex);
        }
        // Last resort: hash the key itself.
        return Math.abs(key.hashCode()) % numPartitions;
    }
}
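To make Pig pick up the partitioner, REGISTER the jar containing the compiled class and attach it to the group via the PARTITION BY clause (com.example.KeyPartitioner below is a placeholder for wherever you package the class above):

grp_inp = GROUP inp BY (f1,f2) PARTITION BY com.example.KeyPartitioner PARALLEL 300;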

Processing: How can I find the number of times two fields are equal in a CSV file?

I'm learning Processing for the first time, and I've been tasked with dealing with data, but it's been terribly confusing for me.
For every line of a CSV file (apart from the header), I want to compare two specific columns, i.e. ListA vs ListB.
For example, with the data below:
ListA,ListB
Male,Yes
Male,No
Female,Yes
Male,Yes
For example, I want to count all instances where a value in ListA is "Male" AND the corresponding value in ListB is "Yes". In this scenario, I should get the value 2 for the two rows where this is true.
How would I do that?
For now, I have a 2D String array of the data in the CSV file. From that I managed to assign specific columns as ListA and ListB. I tried using sort but it would only sort one list and not both at the same time.
Current relevant code:
for (int i = 1; i < lines.length; i++) {
  listA[i-1] = csv[i][int(whichA)];
  listB[i-1] = csv[i][int(whichB)];
}
lA = Arrays.asList(listA);
lB = Arrays.asList(listB);
Not sure if this code really helps makes things clearer though. :/
Any help would be appreciated. Thank you.
Something like this should do what you need. Processing is Java underneath, so string comparison uses equals() rather than ==:
int numRows = 0;
for (int i = 0; i < listA.length; i++) {
  if (listA[i].equals("Male") && listB[i].equals("Yes")) {
    numRows++;
    // add to a new collection here if you need the matching rows
  }
}
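For what it's worth, Processing's built-in Table class can do the same without the intermediate arrays; here is a sketch assuming your CSV has a header row, with "data.csv" and the column names taken from the example:

Table table = loadTable("data.csv", "header");
int numRows = 0;
for (TableRow row : table.rows()) {
  // count rows where both columns match the target values
  if (row.getString("ListA").equals("Male") && row.getString("ListB").equals("Yes")) {
    numRows++;
  }
}
println(numRows); // 2 for the sample data above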
