I want to run a spark job (spark v1.5.1) over some generated S3 paths containing avro files. I'm loading them with:
val avros = paths.map(p => sqlContext.read.avro(p))
Some of the paths will not exist though. How can I get Spark to ignore those missing paths? Previously I've used this answer, but I'm not sure how to apply that approach with the new DataFrame API.
Note: I'm ideally looking for a similar approach to the linked answer that just makes input paths optional. I don't particularly want to have to explicitly check for the existence of paths in S3 (since that's cumbersome and may make development awkward), but I guess that's my fallback if there's no clean way to implement this now.
I would use the Scala Try type to handle the possibility of failure when reading a directory of avro files. With Try we can make the possibility of failure explicit in our code and handle it in a functional manner:
object Main extends App {
  import scala.util.{Success, Try}
  import org.apache.spark.{SparkConf, SparkContext}
  import com.databricks.spark.avro._

  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("example"))
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  // the first path exists, the second one doesn't
  val paths = List("/data/1", "/data/2")

  // Wrap the attempt to read the paths in a Try, then use collect to filter
  // and map with a single partial function.
  val avros =
    paths
      .map(p => Try(sqlContext.read.avro(p)))
      .collect {
        case Success(df) => df
      }

  // Do whatever you want with your list of dataframes
  avros.foreach { df =>
    println(df.collect())
  }

  sc.stop()
}
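If you do end up needing the fallback of checking for the paths explicitly (as mentioned in the question), a minimal sketch using the Hadoop FileSystem API could look like the following; how the S3 filesystem is configured is an assumption of this sketch and may need adjusting for your setup:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Keep only the paths that actually exist before reading them.
val existingPaths = paths.filter { p =>
  val fs = FileSystem.get(new URI(p), sc.hadoopConfiguration)
  fs.exists(new Path(p))
}
val avros = existingPaths.map(p => sqlContext.read.avro(p))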
I'm building a transpiler and need to understand the protobuf/Go scope lookup system. I've been trying to google for docs with no luck.
Q: Is there a shared package scope lookup that you can do when importing types in Go/protobufs?
Here is the example I'm asking about:
proto1:

package cosmos.crypto.keyring.v1;
...
message Ledger {
  hd.v1.BIP44Params path = 1;
}

proto2:

package cosmos.crypto.hd.v1;

message BIP44Params {
  ...
}
There are two syntaxes I've seen that do make sense so far:
full scope:

message Ledger {
  cosmos.crypto.hd.v1.BIP44Params path = 1;
}

Or I've also seen versions like this, completely unscoped:

message Ledger {
  BIP44Params path = 1;
}

But the style I'm actually seeing is partially scoped:

message Ledger {
  hd.v1.BIP44Params path = 1;
}
Is the reason they can leave off the cosmos.crypto prefix that these two packages share cosmos.crypto at the root of their package names?
Or is it a more generic scope lookup based on the import?
Any insight or reading links appreciated :)
I'm not sure I fully get the question, but I will try to answer; let me know if you need me to change anything.
It is a combination of both: you need the package and you need to import the .proto file. Let me explain. If you have two files defined like this:
proto1.proto

syntax = "proto3";
package cosmos.crypto.keyring.v1;

message Ledger {
  hd.v1.BIP44Params path = 1;
}

proto2.proto

syntax = "proto3";
package cosmos.crypto.hd.v1;

message BIP44Params {}
trying to compile will give you an error saying that "hd.v1.BIP44Params" is not defined. This is because proto1.proto is not aware of any other definitions. Now, if you add import "proto2.proto"; to proto1.proto, it becomes aware of the BIP44Params definition and also takes its package into account.
With that package in scope, proto1.proto is able to refer to the type as:
cosmos.crypto.hd.v1.BIP44Params - which is pretty self-explanatory
hd.v1.BIP44Params - because the two packages match up to the hd part
but it should NOT be able to access it as:
BIP44Params - because there is no such type defined in the cosmos.crypto.keyring.v1 package
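To make that concrete, this version of proto1.proto compiles, with only the import added:

syntax = "proto3";
package cosmos.crypto.keyring.v1;

import "proto2.proto";

message Ledger {
  hd.v1.BIP44Params path = 1;
}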
Hope that's clear
How can we convert a List into a HashBasedTable in Java 8?
My current code looks like this:
import org.glassfish.jersey.internal.guava.HashBasedTable;
import org.glassfish.jersey.internal.guava.Table;
List<ApplicationUsage> appUsageFromDB = computerDao.findAllCompAppUsages(new HashSet<>(currentBatch));
Table<String, String, Integer> table = HashBasedTable.create();
for (ApplicationUsage au : appUsageFromDB) {
    table.put(au.getId(), au.getName(), au);
}
I need to store a composite key in this and later fetch values by that key.
If those internals are at least Guava 21, you could do it via their own collector, but I do not see anything wrong with what you are doing with a simple loop.
Table<String, String, ApplicationUsage> result =
        appUsageFromDB.stream()
                .collect(ImmutableTable.toImmutableTable(
                        ApplicationUsage::getId,
                        ApplicationUsage::getName,
                        Function.identity()));
First, you should never rely on internal packages; add Guava to your project explicitly instead. You can use the Tables#toTable collector if you want a mutable table as a result, otherwise the immutable one presented in @Eugene's answer is just fine:
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;
import com.google.common.collect.Tables;
// ...
Table<String, String, ApplicationUsage> table2 = appUsageFromDB.stream()
        .collect(Tables.toTable(
                ApplicationUsage::getId,
                ApplicationUsage::getName,
                au -> au,
                HashBasedTable::create));
Also, your code doesn't compile, because the table declares Integer as its value type, but you're putting ApplicationUsage in your loop. Change the types (and the third argument of the table collector) accordingly if needed.
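For reference, here is the question's loop with the value type fixed so it compiles, plus a lookup by the composite (row, column) key; someId and someName are placeholders for whatever key pair you fetch with:

Table<String, String, ApplicationUsage> table = HashBasedTable.create();
for (ApplicationUsage au : appUsageFromDB) {
    table.put(au.getId(), au.getName(), au);
}

// Later: fetch by the composite key (row = id, column = name).
ApplicationUsage usage = table.get(someId, someName);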
In my Scalding job, I have code like this:
import com.twitter.scalding._
import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

class MyJob(args: Args) extends Job(args) {
  FileInputFormat.setInputPathFilter(???, classOf[MyFilter])

  // ... rest of job ...
}

class MyFilter extends PathFilter {
  def accept(path: Path): Boolean = true
}
My problem is that the first argument of the FileInputFormat.setInputPathFilter method needs to be of type org.apache.hadoop.mapreduce.Job. How can I access the Hadoop job object in my Scalding job?
Disclaimer
There is no way to get hold of the Job class itself. But you can (though you should never do this!) get hold of the JobConf. After that you will be able to use FileInputFormat.setInputPathFilter from the old mapred (v1) API (the one that takes an org.apache.hadoop.mapred.JobConf), which will let you achieve the filtering.
But I suggest you not do this; read the end of the answer.
How can you do this?
Override the stepStrategy method of scalding.Job to provide a FlowStepStrategy. For example, this implementation changes the name of the mapreduce job:
import java.util
import cascading.flow.{Flow, FlowStep, FlowStepStrategy}
import org.apache.hadoop.mapred.JobConf

override def stepStrategy: Option[FlowStepStrategy[_]] = Some(new FlowStepStrategy[AnyRef] {
  override def apply(flow: Flow[AnyRef], predecessorSteps: util.List[FlowStep[AnyRef]], step: FlowStep[AnyRef]): Unit =
    step.getConfig match {
      case conf: JobConf =>
        // here you can modify the JobConf of each job
        conf.setJobName(...)
      case _ =>
    }
})
Why should one not do this?
Accessing the JobConf to add path filtering will only work if you are using specific Sources and will break if you are using others. You will also be mixing different levels of abstraction. And I'm not even getting into how you are supposed to know which JobConf you actually need to modify (most Scalding jobs I have seen are multi-step).
How should one resolve this problem?
I suggest you look closely at the type of Source you are using. I am pretty sure there is a way to apply path filtering there, during or before Pipe (or TypedPipe) construction, along the lines of the sketch below.
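As a rough illustration of that idea (a sketch only, not your Source's exact API; the job class, argument keys and filtering predicate are placeholders), you can resolve the paths you want before constructing the pipe:

import com.twitter.scalding._

class MyFilteredJob(args: Args) extends Job(args) {
  // Decide up front which paths you actually want, instead of injecting a
  // PathFilter into the Hadoop configuration behind Scalding's back.
  val wantedPaths: List[String] = args.list("input").filterNot(_.endsWith(".tmp"))

  TypedPipe.from(MultipleTextLineFiles(wantedPaths: _*))
    .map(_.length)
    .write(TypedTsv[Int](args("output")))
}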
I want to do sentence detection using OpenNLP and Hadoop. I have implemented it successfully in plain Java and want to implement the same on the MapReduce platform. Can anyone help me out?
I have done this two different ways.
One way is to push your sentence detection model out to each node into a standard dir (i.e. /opt/opennlpmodels/), read the serialized model in at the class level of your mapper class, and then use it as needed in your map or reduce function.
Another way is to put the model in a database or in the distributed cache (as a blob or something; I have used Accumulo to store document categorization models this way before). Then, at the class level, make the connection to the database and read the model back as a ByteArrayInputStream; there is a rough sketch of the distributed-cache variant at the end of this answer.
I have used Puppet to push out the models, but use whatever you typically use to keep files up to date on your cluster.
Depending on your Hadoop version, you may be able to sneak the model in as a property at job setup, so that only the master (or wherever you launch jobs from) needs to have the actual model file on it. I've never tried this.
If you need to know how to actually use the OpenNLP sentence detector, let me know and I'll post an example.
HTH
import java.io.File;
import java.io.FileInputStream;

import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentenceDetection {

    SentenceDetector sd;

    public Span[] getSentences(String docTextFromMapFunction) throws Exception {
        if (sd == null) {
            sd = new SentenceDetectorME(new SentenceModel(
                    new FileInputStream(new File("/standardized-on-each-node/path/to/en-sent.zip"))));
        }

        /**
         * this gives you the actual sentences as a string array
         */
        // String[] sentences = sd.sentDetect(docTextFromMapFunction);

        /**
         * this gives you the spans (the char indexes of the start and end of each
         * sentence in the doc)
         */
        Span[] sentenceSpans = sd.sentPosDetect(docTextFromMapFunction);

        /**
         * you can do this as well to get the actual sentence strings based on the spans
         */
        // String[] spansToStrings = Span.spansToStrings(sentenceSpans, docTextFromMapFunction);

        return sentenceSpans;
    }
}
HTH... just make sure the file is in place. There are more elegant ways of doing this but this works and it's simple.
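For the distributed-cache route mentioned above, here is a rough sketch using the Hadoop 2.x mapreduce API; the class names, model path and job wiring are placeholders, so treat it as a starting point rather than a drop-in job:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.net.URI;

import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CachedModelSentenceJob {

    public static class SentenceMapper extends Mapper<LongWritable, Text, Text, Text> {
        private SentenceDetector sd;

        @Override
        protected void setup(Context context) throws IOException {
            // The "#en-sent.bin" fragment on the cache URI below makes Hadoop
            // symlink the cached file into the task's working directory.
            sd = new SentenceDetectorME(new SentenceModel(
                    new FileInputStream(new File("en-sent.bin"))));
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String sentence : sd.sentDetect(value.toString())) {
                context.write(new Text("sentence"), new Text(sentence));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sentence-detection");
        job.setJarByClass(CachedModelSentenceJob.class);
        job.setMapperClass(SentenceMapper.class);
        // Ship the model with the job; the HDFS path is a placeholder.
        job.addCacheFile(new URI("hdfs:///models/en-sent.bin#en-sent.bin"));
        // Input/output formats, paths, and output key/value classes omitted here.
    }
}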
We recently started using protocol buffers at the company I work for, and I was wondering what the best practice is regarding a message that holds other messages as fields.
Is it common to write everything in one big proto file, or is it better to separate the different messages into different files and import the messages you need in the main file?
For example:
Option 1:

message A {
  message B {
    required int32 id = 1;
  }

  repeated B ids = 1;
}
Option 2:

import "B.proto";

message A {
  repeated B ids = 1;
}

And in a different file (B.proto):

message B {
  required int32 id = 1;
}
It depends on your dataset and the usage.
If your data set is small, you should prefer option 1. It leads to less coding for serialization and deserialization.
If your data set is big, you should prefer option 2. If the file is too big, you can't load it completely into memory, and it will be very slow if you only need one piece of information but have to read all the information in the file.
Maybe this is helpful.