Write data to local disk in each datanode - hadoop

I want to store some values from the map task on the local disk of each data node. For example,
public void map (...) {
    // Process
    List<Object> cache = new ArrayList<Object>();
    // Add values to cache
    // Serialize cache to a local file on this data node
}
How can I store this cache object on the local disk of each data node? If I write it inside the map function as above, won't the performance be terrible because of the extra I/O?
In other words, is there a way to wait until the map task on a data node has finished completely and only then write the cache to local disk? Or does Hadoop provide a function for this?

Please see the example below. The created file will end up somewhere under the directories used by the NodeManager for containers. This is controlled by the configuration property yarn.nodemanager.local-dirs in yarn-site.xml, or by the default inherited from yarn-default.xml, which is under /tmp.
Please also see @Chris Nauroth's answer, which says that this is just for debugging purposes and is not recommended as a permanent production configuration; it clearly describes why.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // do some hadoop stuff, like counting words
    String path = "newFile.txt";
    try {
        File f = new File(path);
        f.createNewFile();
    } catch (IOException e) {
        System.out.println("Message easy to look up in the logs.");
        System.err.println("Error easy to look up in the logs.");
        e.printStackTrace();
        throw e;
    }
}
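If the goal is to avoid writing to disk on every map() call and instead persist the cache once the task has consumed all of its input, one option is the Mapper's cleanup(Context) hook, which the framework calls once at the end of each map task. Below is a minimal sketch, assuming the cached values are strings; the class name and the "cache.ser" file name are placeholders, not taken from the question:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CachingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Accumulated in memory while the task is running.
    private final List<String> cache = new ArrayList<>();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Normal map work goes here; just remember what should be persisted.
        cache.add(value.toString());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once when the map task has processed all of its input,
        // so the serialization cost is paid a single time per task.
        // "cache.ser" is a placeholder; it is created in the container's
        // local working directory on the data node.
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream("cache.ser"))) {
            out.writeObject(new ArrayList<>(cache));
        }
    }
}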

Related

Kafka state store not available in distributed environment

I have a business application with the following versions:
spring-boot (2.2.0.RELEASE)
spring-kafka (2.3.1-RELEASE)
spring-cloud-stream-binder-kafka (2.2.1-RELEASE)
spring-cloud-stream-binder-kafka-core (3.0.3-RELEASE)
spring-cloud-stream-binder-kafka-streams (3.0.3-RELEASE)
We have around 20 batches. Each batch uses 6-7 topics to handle the business. Each service has its own state store to maintain the status of the batch, i.e. whether it is running or idle.
We use the code below to query the store:
@Autowired
private InteractiveQueryService interactiveQueryService;

public ReadOnlyKeyValueStore<String, String> fetchKeyValueStoreBy(String storeName) {
    while (true) {
        try {
            log.info("Waiting for state store");
            return new ReadOnlyKeyValueStoreWrapper<>(interactiveQueryService.getQueryableStore(storeName,
                    QueryableStoreTypes.<String, String>keyValueStore()));
        } catch (final IllegalStateException e) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e1) {
                e1.printStackTrace();
            }
        }
    }
}
When we deploy the application on one instance (a Linux machine), everything works fine. When we deploy the application on two instances, we observe the following:
The state store is available on one instance and the other doesn't have it.
When the request is processed by the instance that has the state store, everything is fine.
If the request lands on the instance that does not have the state store, the application waits in the while loop indefinitely (the code snippet above).
While the instance without the store is waiting indefinitely, if we kill the other instance, the above code returns the store and processing works perfectly.
We have no clue what we are missing.
When you have multiple Kafka Streams processors running with interactive queries, the code that you showed above will not work the way you expect. It only returns results if the key that you are querying is on the same server. In order to fix this, you need to add the property spring.cloud.stream.kafka.streams.binder.configuration.application.server: <server>:<port> on each instance. Make sure to change the server and port to the correct ones on each instance. Then you have to write code similar to the following:
org.apache.kafka.streams.state.HostInfo hostInfo = interactiveQueryService.getHostInfo("store-name",
        key, keySerializer);

if (interactiveQueryService.getCurrentHostInfo().equals(hostInfo)) {
    // query from the store that is locally available
} else {
    // query from the remote host
}
Please see the reference docs for more information.
Here is a sample code that demonstrates that.
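As a rough illustration of the "query from the remote host" branch: one common approach is to have every instance expose a small HTTP endpoint over its local store and forward the lookup to the host returned by getHostInfo(). The endpoint path, the class name, and the response handling below are assumptions for the sketch, not part of the binder API:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.apache.kafka.streams.state.HostInfo;

public class RemoteStoreClient {

    private final HttpClient client = HttpClient.newHttpClient();

    // Assumes each instance exposes GET /state/{store}/{key} backed by its local store.
    public String fetchFromRemote(HostInfo hostInfo, String storeName, String key) throws Exception {
        // hostInfo comes from interactiveQueryService.getHostInfo(storeName, key, keySerializer)
        URI uri = URI.create("http://" + hostInfo.host() + ":" + hostInfo.port()
                + "/state/" + storeName + "/" + key);
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}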

Integrate key-value database with Spark

I'm having trouble understanding how Spark interacts with storage.
I would like to make a Spark cluster that fetches data from a RocksDB database (or any other key-value store). However, at this moment, the best I can do is fetch the whole dataset from the database into memory in each of the cluster nodes (into a map for example) and build an RDD from that object.
What do I have to do to fetch only the necessary data (like Spark does with HDFS)? I've read about Hadoop Input Format and Record Readers, but I'm not completely grasping what I should implement.
I know this is kind of a broad question, but I would really appreciate some help to get me started. Thank you in advance.
Here is one possible solution. I assume you have a client library for the key-value store (RocksDB in your case) that you want to access.
KeyValuePair is a bean class representing one key-value pair from your key-value store.
Classes
/* Lazy iterator to read from the key-value store */
class KeyValueIterator implements Iterator<KeyValuePair> {

    public KeyValueIterator() {
        // TODO initialize your custom reader using the java client library
    }

    @Override
    public boolean hasNext() {
        // TODO
    }

    @Override
    public KeyValuePair next() {
        // TODO
    }
}
class KeyValueReader implements FlatMapFunction<KeyValuePair, KeyValuePair> {

    @Override
    public Iterator<KeyValuePair> call(KeyValuePair keyValuePair) throws Exception {
        // ignore the dummy 'keyValuePair' object and read everything from the store
        return new KeyValueIterator();
    }
}
Create KeyValue RDD
/*list with a dummy KeyValuePair instance*/
ArrayList<KeyValuePair> keyValuePairs = new ArrayList<>();
keyValuePairs.add(new KeyValuePair());
JavaRDD<KeyValuePair> keyValuePairRDD = javaSparkContext.parallelize(keyValuePairs);
/*Read one key-value pair at a time lazily*/
keyValuePairRDD = keyValuePairRDD.flatMap(new KeyValueReader());
Note:
The above solution creates an RDD with two partitions by default (one of them will be empty). Increase the number of partitions before applying any transformation on keyValuePairRDD to distribute the processing across executors.
Different ways to increase partitions:
keyValuePairRDD.repartition(partitionCounts)
//OR
keyValuePairRDD.partitionBy(...)
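For concreteness, the TODOs in KeyValueIterator could be filled in with the RocksJava client roughly as follows. The database path, the assumed KeyValuePair(String, String) constructor, and the byte-to-String conversion are illustrative assumptions, not part of the answer above:

import java.util.Iterator;

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

/* Sketch: lazy iterator backed by a local RocksDB instance */
class RocksDbKeyValueIterator implements Iterator<KeyValuePair> {

    private final RocksIterator rocksIterator;

    public RocksDbKeyValueIterator(String dbPath) {
        RocksDB.loadLibrary();
        try {
            // open the store read-only; dbPath points at the local RocksDB directory
            RocksDB db = RocksDB.openReadOnly(dbPath);
            rocksIterator = db.newIterator();
            rocksIterator.seekToFirst();
        } catch (RocksDBException e) {
            throw new RuntimeException("Could not open RocksDB at " + dbPath, e);
        }
    }

    @Override
    public boolean hasNext() {
        return rocksIterator.isValid();
    }

    @Override
    public KeyValuePair next() {
        // assumes a (key, value) constructor on the KeyValuePair bean
        KeyValuePair pair = new KeyValuePair(new String(rocksIterator.key()),
                new String(rocksIterator.value()));
        rocksIterator.next();
        return pair;
    }
}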

Accessing Distributed Cache in Pig StoreFunc

I have looked at all the other threads on this topic and still have not found an answer...
Put simply, I want to access hadoop distributed cache from a Pig StoreFunc, and NOT from within a UDF directly.
Relevant PIG code lines:
DEFINE CustomStorage KeyValStorage('param1','param2','param3');
...
STORE BLAH INTO '/path/' USING CustomStorage();
Relevant Java Code:
public class KeyValStorage<M extends Message> extends BaseStoreFunc /* ElephantBird Storage which inherits from StoreFunc */ {
    ...
    public KeyValStorage(String param1, String param2, String param3) {
        ...
        try {
            InputStream is = new FileInputStream(configName);
            try {
                prop.load(is);
            } catch (IOException e) {
                System.out.println("PROPERTY LOADING FAILED");
                e.printStackTrace();
            }
        } catch (FileNotFoundException e) {
            System.out.println("FILE NOT FOUND");
            e.printStackTrace();
        }
    }
    ...
}
configName is the name of the LOCAL file that I should be able to read from distributed cache, however, I am getting a FileNotFoundException. When I use the EXACT same code from within a PIG UDF directly, the file is found, so I know the file is being shipped via distributed cache. I set the appropriate param to make sure this happens:
<property><name>mapred.cache.files</name><value>/path/to/file/file.properties#configName</value></property>
Any ideas how I can get around this?
Thanks!
StoreFunc's constructor is called both on the frontend and on the backend. When it is called on the frontend (before the job is launched), you'll get a FileNotFoundException because at this point the files from the distributed cache have not yet been copied to the nodes' local disks.
You may check whether you are on the backend (when the job is being executed) and load the file only in that case, e.g.:
DEFINE CustomStorage KeyValStorage('param1','param2','param3');
set mapreduce.job.cache.files hdfs://host/user/cache/file.txt#config
...
STORE BLAH INTO '/path/' USING CustomStorage();
public KeyValStorage(String param1, String param2, String param3) {
    ...
    try {
        if (!UDFContext.getUDFContext().isFrontend()) {
            InputStream is = new FileInputStream("./config");
            BufferedReader br = new BufferedReader(new InputStreamReader(is));
            ...
            ...
}
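Putting the question's property loading together with the backend check, the relevant logic might look roughly like this. It is shown as a standalone helper rather than the full ElephantBird-based StoreFunc; prop, the ./config symlink name, and the log messages come from the snippets above, while the class name and method are a sketch:

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.pig.impl.util.UDFContext;

public class CacheConfigLoader {

    // Loads the distributed-cache properties file, but only on the backend,
    // where the file has actually been localized next to the task.
    public static Properties load() {
        Properties prop = new Properties();
        if (!UDFContext.getUDFContext().isFrontend()) {
            try (InputStream is = new FileInputStream("./config")) {
                prop.load(is);
            } catch (FileNotFoundException e) {
                System.out.println("FILE NOT FOUND");
                e.printStackTrace();
            } catch (IOException e) {
                System.out.println("PROPERTY LOADING FAILED");
                e.printStackTrace();
            }
        }
        return prop;
    }
}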

Twitter crawler: why does the memory grow?

I have been trying to crawl Twitter via the Streaming API and by filtering the retrieved tweets by keywords/hashtags/users.
Here is my example using HBC (although the same problem happens with Twitter4J):
// After connection:
final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);
StatusesFilterEndpoint filterQuery = new StatusesFilterEndpoint();
filterQuery.followings(myListOfUserIDs);
filterQuery.trackTerms(myListOfKeywordsAndHashtags);

final ExecutorService executor = Executors.newFixedThreadPool(4);

Runnable tweetAnalyzer = defineRunnable(queue);
for (int i = 0; i < NUM_THREADS; i++)
    executor.execute(tweetAnalyzer);
where the analyzer tweetAnalyzer is returned by:
private Runnable defineRunnable(final BlockingQueue<String> queue) {
    return new Runnable() {
        @Override
        public void run() {
            while (true) {
                try {
                    System.out.println(queue.take());
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    };
}
However, the process continues to grow in memory.
Two questions:
How to design this crawler properly, so that it does not grow in memory and does not saturate the RAM?
How to select the best queue length (here set to 10000) so that it does not saturate? I have seen that with this length the queue is continuously full of tweets (it never goes empty), and I am able to crawl 700 tweets/min, which is huge.
Thank you in advance.
It's a bit hard to determine from the snippets that you provide. Do you register StatusesFilterEndpoint correctly?
I would recommend that you write a separate thread to monitor the size of the queue.
Obviously you are not able to process all the Twitter messages you download. So you can only:
reduce the number of tweets you download by filtering more aggressively
sample the input by throwing away every n-th message
use a faster machine, although for the tweetAnalyzer you show in the question this might not help
deploy on a cluster
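As a minimal sketch of the queue-monitoring suggestion above (the logging interval, the class name, and the use of a scheduled executor are my own choices, not part of HBC):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class QueueMonitor {

    /** Periodically logs how full the tweet queue is, so growth is visible early. */
    public static ScheduledExecutorService monitor(final BlockingQueue<String> queue, final int capacity) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                int size = queue.size();
                System.out.println("tweet queue: " + size + "/" + capacity
                        + " (" + (100 * size / capacity) + "% full)");
            }
        }, 0, 10, TimeUnit.SECONDS);
        return scheduler;
    }
}

Calling QueueMonitor.monitor(queue, 10000) once after the queue is created makes the trend visible: a queue that stays pinned at capacity means the consumers cannot keep up, which matches the behaviour described in the question.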

Where does hadoop mapreduce framework send my System.out.print() statements ? (stdout)

I want to debug a MapReduce script and, without going into much trouble, tried to put some print statements in my program. But I can't seem to find them in any of the logs.
Actually, stdout only shows the System.out.println() output of the non-MapReduce classes.
The System.out.println() output from the map and reduce phases can be seen in the logs. An easy way to access the logs is:
http://localhost:50030/jobtracker.jsp -> click on the completed job -> click on map or reduce task -> click on task number -> task logs -> stdout logs.
Hope this helps.
Another way is through the terminal:
1) Go into your Hadoop_Installation directory, then into "logs/userlogs".
2) Open your job_id directory.
3) Check directories with _m_ if you want the mapper output, or _r_ if you're looking for reducers.
Example: in Hadoop 0.20.2:
> ls ~/hadoop-0.20.2/logs/userlogs/attempt_201209031127_0002_m_000000_0/
log.index stderr stdout syslog
The above means:
Hadoop_Installation: ~/hadoop-0.20.2
job_id: job_201209031127_0002
_m_: map task, "map number": _000000_
4) Open stdout if you used System.out.println, or stderr if you wrote to System.err.
P.S. Other Hadoop versions might have a slightly different hierarchy, but they should all be under $Hadoop_Installation/logs/userlogs.
On a Hadoop cluster with YARN, you can fetch the logs, including stdout, with:
yarn logs -applicationId application_1383601692319_0008
For some reason, I've found this to be more complete than what I see in the web interface. The web interface did not list the output of System.out.println() for me.
To capture your stdout output and log messages, you can use the Apache Commons Logging framework in your mapper and reducer.
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    public static final Log log = LogFactory.getLog(MyMapper.class);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Goes to the stdout file of the task
        System.out.println("Map key " + key);
        // Goes to the syslog file of the task
        log.info("Map key " + key);
        if (log.isDebugEnabled()) {
            log.debug("Map key " + key);
        }
        context.write(key, value);
    }
}
After most of the options above did not work for me, I realized that on my single node cluster, I can use this simple method:
static private PrintStream console_log;
static private boolean node_was_initialized = false;

private static void logPrint(String line) {
    if (!node_was_initialized) {
        try {
            console_log = new PrintStream(new FileOutputStream("/tmp/my_mapred_log.txt", true));
        } catch (FileNotFoundException e) {
            return;
        }
        node_was_initialized = true;
    }
    console_log.println(line);
}
Which, for example, can be used like:
public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    logPrint("map input: key-" + key.toString() + ", value-" + value.toString());
    // actual impl of 'map'...
}
After that, the prints can be viewed with: cat /tmp/my_mapred_log.txt.
To get rid of prints from prior Hadoop runs, you can simply run rm /tmp/my_mapred_log.txt before running Hadoop again.
Notes:
The solution by Rajkumar Singh is likely better if you have the time to download and integrate a new library.
This could work for multi-node clusters if you have a way to access "/tmp/my_mapred_log.txt" on each worker node machine.
If for some strange reason you already have a file named "/tmp/my_mapred_log.txt", consider changing the name (just make sure to give an absolute path).
