saveAsTextFile has null output in Spark Streaming via Databricks - spark-streaming

The following Scala statement, part of a Spark Streaming module, 1) creates the part-xxxxx files, 2) but they are all empty (in Databricks). I'm wondering why this is, since when the output is sent to the console it is displayed correctly.
QS.foreachRDD(q => {
  val file = q.map(_.toUpper + "...")
  file.saveAsTextFile("/QS/filexxx")
})
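For reference, here is a minimal sketch of the same pattern, assuming QS is a DStream[String]; the per-batch output path and the emptiness check are additions of mine (not part of the question), but they make it easy to see whether a given micro-batch actually carried data, since empty batches produce only empty part-xxxxx files:

QS.foreachRDD { (rdd, time) =>
  // Skip batches that carried no data; empty batches would otherwise
  // produce directories containing only empty part-xxxxx files.
  if (!rdd.isEmpty()) {
    // Assumes the elements are strings, hence toUpperCase.
    val upper = rdd.map(_.toUpperCase + "...")
    // One output directory per batch, so later batches do not collide
    // with or overwrite earlier ones.
    upper.saveAsTextFile(s"/QS/filexxx-${time.milliseconds}")
  }
}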

Related

How to see print() results in Tarantool Docker container

I am using the tarantool/tarantool:2.6.0 Docker image (the latest at the moment) and writing Lua scripts for the project. I'm trying to find out how to see the results of calling the print() function. It's quite difficult to debug my code without print() working.
In the Tarantool console, print() has no effect either.
Using simple print()
The docs say that print() writes to stdout, but I don't see any results when I watch the container's logs with docker logs -f <CONTAINER_NAME>.
I also tried setting the container's log driver to local. Then a print did show up in the container's logs, but only once...
The container's /var/log directory is always empty.
Using box.session.push()
box.session.push() works fine in the console, but when I use it in a Lua script:
-- app.lua
function log(s)
    box.session.push(s)
end

-- No effect
log('hello')

function say_something(s)
    log(s)
end

box.schema.func.create('say_something')
box.schema.user.grant('guest', 'execute', 'function', 'say_something')
And then I call say_something() from the Node.js connector like this:
const TarantoolConnection = require('tarantool-driver');
const conn = new TarantoolConnection(connectionData);
const res = await conn.call('update_links', 'hello');
I get an error:
Any suggestions?
Thanx!
I suspect you're missing io.flush() after the print call.
After I added io.flush() after each print call, my messages started showing up in the logs (docker logs -f <CONTAINER_NAME>).
I'd also recommend using the log module for this purpose; it writes to stderr without buffering.
Regarding the error in the connector, I think the Node.js connector simply doesn't support pushes.

What should a morphline for MapReduceIndexerTool look like?

I want to search through a lot of logs (about 1 TB in size, placed on multiple machines) efficiently.
For that purpose, I want to build an infrastructure composed of Flume, Hadoop and Solr. Flume will get the logs from a couple of machines and will put them into HDFS.
Now, I want to be able to index those logs using a MapReduce job so that I can search through them using Solr. I found that MapReduceIndexerTool does this for me, but I see that it needs a morphline.
I know that a morphline, in general, performs a set of operations on the data it takes, but what kind of operations should I perform if I want to use the MapReduceIndexerTool?
I can't find any example of a morphline adapted for this MapReduce job.
Thank you respectfully.
Cloudera has a guide that covers an almost identical use case under morphlines.
In this figure, a Flume Source receives syslog events and sends them to a Flume Morphline Sink, which converts each Flume event to a record and pipes it into a readLine command. The readLine command extracts the log line and pipes it into a grok command. The grok command uses regular expression pattern matching to extract some substrings of the line. It pipes the resulting structured record into the loadSolr command. Finally, the loadSolr command loads the record into Solr, typically a SolrCloud. In the process, raw data or semi-structured data is transformed into structured data according to application modelling requirements.
The use case in that example is what production tools such as MapReduceIndexerTool, Apache Flume Morphline Solr Sink, Apache Flume MorphlineInterceptor, and the Morphline Lily HBase Indexer run as part of their operation.
In general, in a morphline you only need to read your data, convert it to Solr documents, and then call loadSolr to create the index.
For example, this is the morphline file I used with MapReduceIndexerTool to index Avro data into Solr:
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      { readAvroContainer {} }

      {
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            field1_s : /field1
            field2_s : /field2
          }
        }
      }

      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
When run, it reads the Avro container, maps Avro fields to Solr document fields, removes all other fields, and uses the provided Solr connection details to create the index. It's based on this tutorial.
This is the command I'm using to index the files and merge them into the running collection:
sudo -u hdfs hadoop --config /etc/hadoop/conf \
jar /usr/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file /local/path/morphlines_file \
--output-dir hdfs://localhost/mrit/out \
--zk-host localhost:2181/solr \
--collection collection1 \
--go-live \
hdfs:/mrit/in/my-avro-file.avro
Solr should be configured to work with HDFS, and the collection should already exist.
All this setup works for me with Solr 4.10 on CDH 5.7 Hadoop.

Spark on EMR - saveAsTextFile won't write data to local dir

I'm running Spark on EMR (AMI 3.8). When trying to write an RDD to a local file, I get no results on the name/master node.
On my previous EMR cluster (same version of Spark, installed with a bootstrap script instead of as an add-on to EMR), the data would be written to the local dir on the name node. Now I can see it appearing in the "/home/hadoop/test/_temporary/0/task*" directories on the other nodes in the cluster, but only the '_SUCCESS' file on the master node.
How can I get the file to write to the name/master node only?
Here is an example of the command I am using:
myRDD.saveAsTextFile("file:///home/hadoop/test")
I can do this in a roundabout way by pushing to HDFS first and then writing the results to the local filesystem with shell commands, but I would love to hear if others have a more elegant approach.
import scala.sys.process._

import org.apache.spark.rdd.RDD

// Write an RDD to a single local text file by going through HDFS first.
def rddToFile(rdd: RDD[_], filePath: String): Unit = {
  // Set up the bash commands.
  val createFileStr = "hadoop fs -cat " + filePath + "/part* > " + filePath
  val removeDirStr  = "hadoop fs -rm -r " + filePath

  // Remove the HDFS dir in case it already exists.
  Process(Seq("bash", "-c", removeDirStr)).!
  // Save the data to HDFS.
  rdd.saveAsTextFile(filePath)
  // Concatenate the HDFS part files into a local file.
  Process(Seq("bash", "-c", createFileStr)).!
  // Remove the HDFS dir.
  Process(Seq("bash", "-c", removeDirStr)).!
}
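A hypothetical call site, assuming an existing RDD named myRDD (the path is a placeholder):

// Writes the RDD to HDFS under the given path, then concatenates the
// part files into a single local file at that same path on the driver node.
rddToFile(myRDD, "/home/hadoop/test-output")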

MapReduce program not producing the required output in distributed mode

I need some help with my map-reduce code.
The code runs perfectly in Eclipse and in standalone mode, but when I package the code and try running it locally in pseudo-distributed mode, the output is not what I expect.
Map input records = 11
Map output records = 11
Reduce input records = 11
Reduce output records = 0
These are the values I get,
whereas when I run the same code in Eclipse or in standalone mode with the same config and input file I get:
Map input records = 11
Map output records = 11
Reduce input records = 11
Reduce output records = 4
Can anyone tell me what's wrong?
I tried both ways of building the .jar file: Eclipse -> Export -> Runnable JAR, and from the terminal as well (javac -classpath hadoop-core-1.0.4 -d classes mapredcode.java && jar -cvf mapredcode.jar -C classes/ .).
And how do I debug this?
Are you using a combiner?
If yes, is the output of the combiner in the same form as that of the mapper?
In Hadoop, the combiner is run at Hadoop's discretion, and it may not be running in pseudo-distributed mode in your case.
The combiner is essentially a reducer that is used to lower network traffic.
The code should be written so that even if the combiner does not run, the reducer still gets the expected format from the mapper.
Hope it helps.
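To make that concrete, here is a minimal word-count-style sketch (not the asker's code) showing how a combiner is wired in; the essential constraint is that the combiner must accept and emit the same key/value types as the mapper's output, because Hadoop may run it zero, one, or several times:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Emits (token, 1) for every whitespace-separated token in a line.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { t =>
      word.set(t)
      ctx.write(word, one)
    }
}

// Used both as combiner and reducer: it consumes and produces (Text, IntWritable),
// so running it zero, one, or many times before the reduce phase is safe.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    ctx.write(key, new IntWritable(sum))
  }
}

object CombinerDriver {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "combiner-demo")
    job.setJarByClass(classOf[TokenMapper])
    job.setMapperClass(classOf[TokenMapper])
    job.setCombinerClass(classOf[SumReducer]) // optional; Hadoop decides when to run it
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}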

Can we set multiple generic arguments with the -D option in GenericOptionsParser?

I want to pass multiple configuration parameters to my Hadoop job through GenericOptionsParser.
With "-D abc=xyz" I can pass one argument and able to retrieve the same from the configuration object but I am not able to pass the multiple argument.
Is it possible to pass multiple argument?If yes how?
Passed the parameters as -D color=yellow -D number=10
Had the following code in the run() method
String color = getConf().get("color");
System.out.println("color = " + color);
String number = getConf().get("number");
System.out.println("number = " + number);
The following was the output in the console:
color = yellow
number = 10
I recently ran into this issue after upgrading from Hadoop 1.2.1 to Hadoop 2.4.1. The problem was that Hadoop's dependency on commons-cli 1.2 was being omitted due to a conflict with commons-cli 1.1, which was pulled in by Cassandra 2.0.5.
After a quick look through the source, it looks like commons-cli options with an uninitialized number of values (which is what Hadoop's GenericOptionsParser uses) default to a limit of one value in version 1.1 and to no limit in 1.2.
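If a build tool is resolving commons-cli 1.1 from a transitive dependency, one way to force 1.2 is to pin it explicitly and exclude the older one. A sketch for an sbt build (the Cassandra coordinates here are illustrative, matching the versions mentioned above; adapt to whatever build tool you actually use):

// build.sbt (sketch): pin commons-cli 1.2 and keep the transitive 1.1 off the classpath.
libraryDependencies ++= Seq(
  "commons-cli" % "commons-cli" % "1.2",
  ("org.apache.cassandra" % "cassandra-all" % "2.0.5").exclude("commons-cli", "commons-cli")
)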
I hope that helps!
I tested passing multiple parameters by using the -D flag multiple times:
$HADOOP_HOME/bin/hadoop jar /path/to/my.jar -D mapred.heartbeats.in.second=80 -D mapred.map.max.attempts=2 ...
Doing this changed the values to what I specified in the Job's configuration.
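For completeness, a minimal sketch (not the asker's job) of a driver that goes through ToolRunner, which is what makes GenericOptionsParser strip the -D pairs off the command line and put them into the Configuration returned by getConf():

import org.apache.hadoop.conf.{Configuration, Configured}
import org.apache.hadoop.util.{Tool, ToolRunner}

// ToolRunner passes args through GenericOptionsParser, so every
// "-D key=value" pair is available from getConf() inside run().
object ConfEcho extends Configured with Tool {
  override def run(args: Array[String]): Int = {
    println("color  = " + getConf.get("color"))
    println("number = " + getConf.get("number"))
    0
  }

  def main(args: Array[String]): Unit =
    System.exit(ToolRunner.run(new Configuration(), ConfEcho, args))
}

Invoked as, say, hadoop jar my.jar ConfEcho -D color=yellow -D number=10, both values should print, matching the console output shown above.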
