Access hdfs file from spark worker node


I am working on a Spark application that needs to read and update an object stored as a file in HDFS, but I can't figure out how to do it.
If I create a FileSystem object named hdfs and use it like this:
boolean fileExists = hdfs.exists(new org.apache.hadoop.fs.Path(filePath));
if (fileExists){
JavaRDD<MyObject> modelRDD = sc.objectFile(filePath);
}
I get:
ERROR Executor: Exception in task 110.0 in stage 1.0 (TID 112)
java.lang.NullPointerException
This part of the code runs on a worker, so I assume it fails because the worker has no access to the SparkContext. In that case, how can I access this HDFS file?
The HDFS file resides on the driver node. I could replace HDFS with Hive and store the data as a byte array there, but the HiveContext is not accessible from a worker node either.
Here is the full code for better understanding:
public class MyProgram {
private static JavaSparkContext sc;
private static HiveContext hiveContext;
private static String ObjectPersistenceDir = "/metadata/objects";
private static org.apache.hadoop.fs.FileSystem hdfs;
private static String NameNodeURI = "hdfs://<mymachineurl>:9000";
// create and maintain a cache of objects for every run session
//private static HashMap<String, MyObject> cacheObjects;
public static void main(String ... args) {
System.out.println("Inside constructor: creating Spark context and Hive context");
System.out.println("Starting Spark context and SQL context");
sc = new JavaSparkContext(new SparkConf());
hiveContext = new HiveContext(sc);
//cacheObjects= new HashMap<>();
//DataFrame loadedObjects= hiveContext.sql("select id, filepath from saved_objects where name = 'TEST'");
//List<Row> rows = loadedObjects.collectAsList();
//for(Row row : rows){
// String key = (String) row.get(0) ;
// String value = (String) row.get(1);
// JavaRDD<MyObject> objectRDD = sc.objectFile(value);
// cacheObjects.put(key, objectRDD.first());
//}
DataFrame partitionedDF = hiveContext.sql("select * from mydata");
String partitionColumnName = "id";
JavaRDD<Row> partitionedRecs = partitionedDF.repartition(partitionedDF.col(partitionColumnName)).javaRDD();
FlatMapFunction<Iterator<Row>, MyObject> flatMapSetup = new FlatMapFunction<java.util.Iterator<Row>, MyObject>() {
List<MyObject> lm_list = new ArrayList<>();
MyObject object = null;
@Override
public List<MyObject> call(java.util.Iterator<Row> it) throws Exception {
// for every row, create a record and update the object
while (it.hasNext()) {
Row row = it.next();
if (object == null) {
String objectKey = "" + id;
//object = cacheObjects.get(objectKey);
String objectPath = ObjectPersistenceDir + "/" + "TEST" + "/" + id;
JavaRDD<MyObject> objectRDD = sc.objectFile(objectPath);
object = objectRDD.collect().get(0);
// object not in cache means not already created
if (object == null) {
ObjectDef objectDef = new ObjectDef("TEST");
object = new MyObject(objectDef);
}
}
/*
/ some update on object
*/
String objectKey = "" + id ;
cacheObjects.put(objectKey, object);
// Algorithm step 2.6 : to save in hive, add to list
lm_list.add(object);
} // while Has Next ends
return lm_list;
} // Call -- Iterator ends
};//); //Map Partition Ends
//todo_nidhi put all objects in collectedObject back to hive
List<MyObject> collectedObject = partitionedRecs.mapPartitions(flatMapSetup).collect();
}

Related

Apache Storm: Topology submission exception: [x] subscribes from non-existent stream

Sorry if this has already been answered; I searched but had no success. There are some similar questions, but none of them helped. I have the following problem:
603 [main] WARN b.s.StormSubmitter - Topology submission exception:
Component: [escribirFichero] subscribes from non-existent stream:
[default] of component [buscamosEnKlout]
Exception in thread "main" java.lang.RuntimeException:
InvalidTopologyException(msg:Component:
[escribirFichero] subscribes from non-existent stream:
[default] of component [buscamosEnKlout])
I can't understand why I get this exception. I declare the bolt "buscamosEnKlout" before "escribirFichero" subscribes to it. After the topology I'll show the essential lines of the bolts. I know the spout is OK because I verified it by trial and error.
The code of my topology is:
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.stats.RollingWindow;
import backtype.storm.topology.BoltDeclarer;
import backtype.storm.topology.TopologyBuilder;
import bolt.*;
import spout.TwitterSpout;
import twitter4j.FilterQuery;
public class TwitterTopologia {
private static String consumerKey = "xxx1";
private static String consumerSecret = "xxx2";
private static String accessToken = "yyy1";
private static String accessTokenSecret="yyy2";
public static void main(String[] args) throws Exception {
/**************** SETUP ****************/
String remoteClusterTopologyName = null;
if (args!=null) { ... }
TopologyBuilder builder = new TopologyBuilder();
FilterQuery tweetFilterQuery = new FilterQuery();
tweetFilterQuery.track(new String[]{"Vacaciones","Holy Week", "Semana Santa","Holidays","Vacation"});
tweetFilterQuery.language(new String[]{"en","es"});
TwitterSpout spout = new TwitterSpout(consumerKey, consumerSecret, accessToken, accessTokenSecret, tweetFilterQuery);
KloutBuscador buscamosEnKlout = new KloutBuscador();
FileWriterBolt fileWriterBolt = new FileWriterBolt("idUsuarios.txt");
builder.setSpout("spoutLeerTwitter",spout,1);
builder.setBolt("buscamosEnKlout",buscamosEnKlout,1).shuffleGrouping("spoutLeerTwitter");
builder.setBolt("escribirFichero",fileWriterBolt,1).shuffleGrouping("buscamosEnKlout");
Config conf = new Config();
conf.setDebug(true);
if (args != null && args.length > 0) {
conf.setNumWorkers(3);
StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
}
else {
conf.setMaxTaskParallelism(3);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("twitter-fun", conf, builder.createTopology());
Thread.sleep(460000);
cluster.shutdown();
}
}
}
The bolt KloutBuscador, registered as "buscamosEnKlout", contains the following code:
String text = tuple.getStringByField("id");
String cadenaUrl;
cadenaUrl = "http://api.klout.com/v2/identity.json/twitter?screenName=";
cadenaUrl += text.replaceAll("\\[", "").replaceAll("\\]","");
cadenaUrl += "&key=" + kloutKey;
URL url = new URL(cadenaUrl);
HttpURLConnection c = (HttpURLConnection) url.openConnection();
...
c.setRequestMethod("GET");
c.setRequestProperty("Content-length", "0");
c.setUseCaches(false);
c.setAllowUserInteraction(false);
c.connect();
int status = c.getResponseCode();
StringBuilder sb = new StringBuilder();
switch (status) {
case 200:
case 201:
BufferedReader br = new BufferedReader(new InputStreamReader(c.getInputStream()));
String line;
while ((line = br.readLine()) != null) sb.append(line + "\n");
br.close();
}
JSONObject jsonResponse = new JSONObject(sb.toString());
//getJSONArray("id");
String results = jsonResponse.toString();
_collector.emit(new Values(text,results));
And the second bolt, FileWriterBolt, registered as "escribirFichero", is the following:
public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
_collector = outputCollector;
try {
writer = new PrintWriter(filename, "UTF-8");...}...}
public void execute(Tuple tuple) {
writer.println((count++)+":::"+tuple.getValues());
//+"+++"+tweet.getUser().getId()+"__FINAL__"+tweet.getUser().getName()
writer.flush();
// Confirm that this tuple has been treated.
//_collector.ack(tuple);
}
If I skip the Klout bolt and only write the output of the spout, it works. I don't understand why the Klout bolt causes this failure.

Your buscamosEnKlout bolt needs to declare the format of the tuples it emits, as well as the streams it emits to. A bolt that declares no output fields has no "default" stream, which is exactly what the exception reports when escribirFichero subscribes to it. You most likely haven't implemented declareOutputFields correctly in that bolt. It should contain something like declarer.declare(new Fields("your-text-field", "your-results-field")).

Gson: How do I deserialize an inner JSON object to a map if the property name is not fixed?

My client retrieves JSON content as below:
{
"table": "tablename",
"update": 1495104575669,
"rows": [
{"column5": 11, "column6": "yyy"},
{"column3": 22, "column4": "zzz"}
]
}
The keys inside the rows array are not fixed. I want to retrieve each key and value and store them in a Map using Gson 2.8.x.
How can I configure Gson to deserialize this?
Here is my idea:
public class Dataset {
private String table;
private long update;
private List<Rows> lists; // <-- a little confused here
// or: private List<HashMap<String, Object>> lists;
// setters/getters
}
public class Rows {
private HashMap<String, Object> map;
....
}
Dataset k = gson.fromJson(jsonStr, Dataset.class);
log.info(k.getRows().size()); // <-- I got two null objects
Thanks.

Gson does not support this out of the box. It would be best if you could make the property name fixed. If not, there are a few options that might help:
1. If the property name is fixed (rows), just rename the Dataset.lists field to Dataset.rows.
2. If the set of possible names is known in advance, tell Gson to accept alternative names using @SerializedName.
3. If the set of possible names is really unknown and may change in the future, you might want to make it fully dynamic using a custom TypeAdapter (streaming mode; requires less memory, but is harder to use) or a custom JsonDeserializer (object mode; requires more memory to store intermediate tree views, but is easy to use) registered with GsonBuilder.
For option #2, simply list the alternative names:
@SerializedName(value = "lists", alternate = "rows")
final List<Map<String, Object>> lists;
For option #3, bind a downstream List<Map<String, Object>> type adapter and detect the name dynamically. Note that I omit the Rows class deserialization strategy for simplicity; I believe you might want to remove the Rows class in favor of a simple Map<String, Object>. Another note: declare the type as Map and try not to specify a collection implementation. Hash maps are unordered, but telling Gson you are dealing with a Map lets it pick an ordered map such as LinkedTreeMap (a Gson internal) or LinkedHashMap, which might be important for datasets.
// Type tokens are immutable and can be declared constants
private static final TypeToken<String> stringTypeToken = new TypeToken<String>() {
};
private static final TypeToken<Long> longTypeToken = new TypeToken<Long>() {
};
private static final TypeToken<List<Map<String, Object>>> stringToObjectMapListTypeToken = new TypeToken<List<Map<String, Object>>>() {
};
private static final Gson gson = new GsonBuilder()
.registerTypeAdapterFactory(new TypeAdapterFactory() {
@Override
public <T> TypeAdapter<T> create(final Gson gson, final TypeToken<T> typeToken) {
if ( typeToken.getRawType() != Dataset.class ) {
return null;
}
// If the actual type token represents the Dataset class, then pick the bunch of downstream type adapters
final TypeAdapter<String> stringTypeAdapter = gson.getDelegateAdapter(this, stringTypeToken);
final TypeAdapter<Long> primitiveLongTypeAdapter = gson.getDelegateAdapter(this, longTypeToken);
final TypeAdapter<List<Map<String, Object>>> stringToObjectMapListTypeAdapter = gson.getDelegateAdapter(this, stringToObjectMapListTypeToken);
// And compose the bunch into a single dataset type adapter
final TypeAdapter<Dataset> datasetTypeAdapter = new TypeAdapter<Dataset>() {
@Override
public void write(final JsonWriter out, final Dataset dataset) {
// Omitted for brevity
throw new UnsupportedOperationException();
}
@Override
public Dataset read(final JsonReader in)
throws IOException {
in.beginObject();
String table = null;
long update = 0;
List<Map<String, Object>> lists = null;
while ( in.hasNext() ) {
final String name = in.nextName();
switch ( name ) {
case "table":
table = stringTypeAdapter.read(in);
break;
case "update":
update = primitiveLongTypeAdapter.read(in);
break;
default:
lists = stringToObjectMapListTypeAdapter.read(in);
break;
}
}
in.endObject();
return new Dataset(table, update, lists);
}
}.nullSafe(); // Making the type adapter null-safe
@SuppressWarnings("unchecked")
final TypeAdapter<T> typeAdapter = (TypeAdapter<T>) datasetTypeAdapter;
return typeAdapter;
}
})
.create();
final Dataset dataset = gson.fromJson(jsonReader, Dataset.class);
System.out.println(dataset.lists);
The code above would then print:
[{column5=11.0, column6=yyy}, {column3=22.0, column4=zzz}]

Not understanding the path in distributed path

There are two things I don't understand in the code below:
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration())
I don't understand whether the URI path has to exist in HDFS. Correct me if I am wrong.
And what does p.getName().equals() do in the code below?
public class MyDC {
public static class MyMapper extends Mapper < LongWritable, Text, Text, Text > {
private Map < String, String > abMap = new HashMap < String, String > ();
private Text outputKey = new Text();
private Text outputValue = new Text();
protected void setup(Context context) throws
java.io.IOException, InterruptedException {
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
for (Path p: files) {
if (p.getName().equals("abc.dat")) {
BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
String line = reader.readLine();
while (line != null) {
String[] tokens = line.split("\t");
String ab = tokens[0];
String state = tokens[1];
abMap.put(ab, state);
line = reader.readLine();
}
}
}
if (abMap.isEmpty()) {
throw new IOException("Unable to load abbreviation data.");
}
}
protected void map(LongWritable key, Text value, Context context)
throws java.io.IOException, InterruptedException {
String row = value.toString();
String[] tokens = row.split("\t");
String inab = tokens[0];
String state = abMap.get(inab);
outputKey.set(state);
outputValue.set(row);
context.write(outputKey, outputValue);
}
}
public static void main(String[] args)
throws IOException, ClassNotFoundException, InterruptedException {
Job job = new Job();
job.setJarByClass(MyDC.class);
job.setJobName("DCTest");
job.setNumReduceTasks(0);
try {
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration());
} catch (Exception e) {
System.out.println(e);
}
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}

The idea of the Distributed Cache is to make some static data available to a task node before the task starts executing.
The file has to be present in HDFS so that it can then be added to the Distributed Cache (and copied to each task node).
DistributedCache.getLocalCacheFiles returns all the cache files present on that task node. With if (p.getName().equals("abc.dat")) { you pick out the cache file that your application should process.
Please refer to the docs below:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#DistributedCache
https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/filecache/DistributedCache.html#getLocalCacheFiles(org.apache.hadoop.conf.Configuration)

DistributedCache is an API for making a file or a group of files available locally on every node where the job's tasks run. One example of using DistributedCache is a map-side join.
DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration()) adds the abc.dat file to the cache. The cache can hold any number of files, and p.getName().equals("abc.dat") selects the one you need. Hadoop represents file locations as Path objects throughout MapReduce. For example:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Path(args[0]) is the first argument (the input file location) that you pass when running the jar, and Path(args[1]) is the second argument, the output file location. Both are handled as Path objects.
In the same way, every file you add to the cache ends up in a Path array, which you retrieve with the code below.
Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
It returns all the files present in the cache, and you find your file by name with p.getName().equals().
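To make the explanation above concrete, here is a minimal plain-Java sketch of what setup() and map() in the question do, with the Hadoop types left out so it runs on its own. The class name LookupJoinSketch, the helper names loadLookup and joinRow, and the sample data are illustrative assumptions, not part of the original job; a StringReader stands in for the locally cached abc.dat.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch (no Hadoop dependency) of the lookup-and-join pattern in the question.
class LookupJoinSketch {

    // Mirrors setup(): read "abbreviation<TAB>state" lines into an in-memory map.
    static Map<String, String> loadLookup(Reader source) {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] tokens = line.split("\t");
                if (tokens.length < 2) {
                    continue; // assumption: skip malformed lines rather than fail
                }
                lookup.put(tokens[0], tokens[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lookup;
    }

    // Mirrors map(): join one input row against the lookup on its first column.
    // Returns null when the key is not in the lookup.
    static String joinRow(Map<String, String> lookup, String row) {
        String key = row.split("\t")[0];
        String state = lookup.get(key);
        return state == null ? null : state + "\t" + row;
    }

    public static void main(String[] args) {
        // Hypothetical contents standing in for abc.dat.
        Map<String, String> lookup =
                loadLookup(new StringReader("CA\tCalifornia\nNY\tNew York\nbroken-line\n"));
        System.out.println(joinRow(lookup, "CA\t42"));
        System.out.println(joinRow(lookup, "TX\t7"));
    }
}
```

In the real mapper the Reader would be a FileReader opened on the Path returned by getLocalCacheFiles. Skipping short lines is a choice made in this sketch; the original code would throw on tokens[1] instead.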

Implement Hadoop Map with JavaPairRDD as Spark Way

I have an RDD:
JavaPairRDD<Long, ViewRecord> myRDD
which is created via the newAPIHadoopRDD method. I have an existing map function that I want to implement the Spark way:
LongWritable one = new LongWritable(1L);
protected void map(Long key, ViewRecord viewRecord, Context context)
throws IOException ,InterruptedException {
String url = viewRecord.getUrl();
long day = viewRecord.getDay();
tuple.getKey().set(url);
tuple.getValue().set(day);
context.write(tuple, one);
};
PS: tuple is derived from:
KeyValueWritable<Text, LongWritable>
and can be found here: TextLong.java

I don't know what tuple is, but if you just want to map each record to a tuple with key (url, day) and value 1L, you can do it like this:
JavaPairRDD<Tuple2<String, Long>, Long> result = myRDD
.values()
.mapToPair(viewRecord -> {
String url = viewRecord.getUrl();
long day = viewRecord.getDay();
return new Tuple2<>(new Tuple2<>(url, day), 1L);
});
// Java 7 style
JavaPairRDD<Pair, Long> result = myRDD
.values()
.mapToPair(new PairFunction<ViewRecord, Pair, Long>() {
@Override
public Tuple2<Pair, Long> call(ViewRecord record) throws Exception {
String url = record.getUrl();
Long day = record.getDay();
return new Tuple2<>(new Pair(url, day), 1L);
}
}
);

Any suggestions for reading two different dataset into Hadoop at the same time?

Dear Hadoop users:
I'm new to Hadoop and have recently been trying to implement an algorithm.
This algorithm needs to compute a matrix that represents the rating difference for every pair of songs. I have already done this, and the output is a 600000*600000 sparse matrix stored in HDFS. Let's call this dataset A (size = 160G).
Now I need to read the users' profiles to predict their ratings for a specific song. So I need to read the users' profiles first (5G in size), let's call this dataset B, and then do the calculation using dataset A.
But I don't know how to read the two datasets from a single Hadoop program. Or can I read dataset B into RAM and then do the calculation? (I guess I can't, because HDFS is a distributed system and I can't read dataset B into a single machine's memory.)
Any suggestions?

You can use two map functions; each one can process one dataset if you need different processing for each. You need to register each mapper with your job configuration. For example:
public static class FullOuterJoinStdDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String person_name, book_title,file_tag="person_book#";
private String emit_value = new String();
//emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text,Text>output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] person_detail = line.split(",");
person_name = person_detail[0].trim();
book_title = person_detail[1].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
person_name = "student name missing";
}
emit_value = file_tag + person_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static class FullOuterJoinResultDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String author_name, book_title,file_tag="auth_book#";
private String emit_value = new String();
// emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text,Text> output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] author_detail = line.split(",");
author_name = author_detail[1].trim();
book_title = author_detail[0].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
author_name = "Not Appeared in Exam";
}
emit_value = file_tag + author_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static void main(String args[])
throws Exception
{
if(args.length !=3)
{
System.out.println("Input or output file missing");
System.exit(-1);
}
Configuration conf = new Configuration();
String [] argum = new GenericOptionsParser(conf,args).getRemainingArgs();
conf.set("mapred.textoutputformat.separator", ",");
JobConf mrjob = new JobConf(conf); // pass conf so the separator setting above is used
mrjob.setJobName("Inner_Join");
mrjob.setJarByClass(FullOuterJoin.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[0]),TextInputFormat.class,FullOuterJoinStdDetMapper.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[1]),TextInputFormat.class,FullOuterJoinResultDetMapper.class);
FileOutputFormat.setOutputPath(mrjob,new Path(argum[2]));
mrjob.setReducerClass(FullOuterJoinReducer.class);
mrjob.setOutputKeyClass(Text.class);
mrjob.setOutputValueClass(Text.class);
JobClient.runJob(mrjob);
}
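The reducer (FullOuterJoinReducer) is not shown above. As a rough idea of what it might do with the tagged values, here is a plain-Java sketch without Hadoop types. The class name TaggedJoinSketch, the reduce signature, the "null" placeholder and the comma-separated output are all assumptions made for illustration; only the two tag strings come from the mappers above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch (no Hadoop dependency) of a reduce step over tagged values.
class TaggedJoinSketch {
    static final String PERSON_TAG = "person_book#"; // tag added by the first mapper
    static final String AUTHOR_TAG = "auth_book#";   // tag added by the second mapper

    // Receives every value emitted for one book title and pairs persons with authors.
    // A side with no values is filled with "null", which gives full-outer-join behaviour.
    static List<String> reduce(String bookTitle, Iterable<String> taggedValues) {
        List<String> persons = new ArrayList<>();
        List<String> authors = new ArrayList<>();
        for (String value : taggedValues) {
            if (value.startsWith(PERSON_TAG)) {
                persons.add(value.substring(PERSON_TAG.length()));
            } else if (value.startsWith(AUTHOR_TAG)) {
                authors.add(value.substring(AUTHOR_TAG.length()));
            }
        }
        if (persons.isEmpty()) {
            persons.add("null");
        }
        if (authors.isEmpty()) {
            authors.add("null");
        }

        // Emit one joined record per (person, author) combination for this title.
        List<String> joined = new ArrayList<>();
        for (String person : persons) {
            for (String author : authors) {
                joined.add(bookTitle + "," + person + "," + author);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical values as the two mappers would emit them for one key.
        System.out.println(reduce("Dune",
                Arrays.asList("person_book#Ann", "auth_book#Herbert", "person_book#Bob")));
        System.out.println(reduce("Emma", Arrays.asList("person_book#Cy")));
    }
}
```

The tag prefix is what lets a single reducer tell the two inputs apart once MultipleInputs has merged them under the same key. Note that this buffers all values for a key in memory, which is fine for a sketch but may need rethinking for very large groups.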

Hadoop allows you to use different map input formats for different folders. So you can read from several data sources and then cast to a specific type in the map function, i.e. in one case you get (String, User), in the other (String, SongSongRating), and your map signature is (String, Object).
The second step is choosing the recommendation algorithm: join the data in such a way that the aggregator has just enough information to calculate the recommendation.
