In my Scalding map-reduce code, I want to log certain steps that are happening so that I can debug the map-reduce jobs if something goes wrong.
How can I add logging to my Scalding job?
E.g.
import com.twitter.scalding._
class WordCountJob(args: Args) extends Job(args) {
  //LOG: Starting job at time blah..
  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) { line: String =>
      line.trim.toLowerCase.split("\\W+")
    }
    .groupBy('word) { group => group.size('count) }
    .write(Tsv(args("output")))
  //LOG - ending job at time...
}
Any logging framework will do. You can obviously also use println(): its output will appear in your job's stdout log file in the job history of your Hadoop cluster (in hdfs mode) or in your console (in local mode).
Also consider defining a trap with the addTrap() method for catching erroneous records.
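If you want a proper logging framework rather than println(), a minimal sketch with SLF4J could look like the version below (assumptions: an SLF4J binding such as log4j is already on the cluster classpath, and the @transient lazy logger is there only so it is not dragged into serialized closures). Keep in mind that the job's constructor only defines the pipeline, so messages logged there show up on the submitting JVM, while messages logged inside flatMap run on the mappers and land in the task logs:
import com.twitter.scalding._
import org.slf4j.LoggerFactory

class WordCountJob(args: Args) extends Job(args) {
  // not serialized with the job; re-created lazily on whichever JVM touches it
  @transient private lazy val log = LoggerFactory.getLogger(getClass)

  log.info("Starting job at " + new java.util.Date) // runs on the submitter

  TextLine(args("input"))
    .read
    .flatMap('line -> 'word) { line: String =>
      if (line.trim.isEmpty) log.warn("empty input line") // runs on the mappers, so appears in the task logs
      line.trim.toLowerCase.split("\\W+")
    }
    .groupBy('word) { group => group.size('count) }
    .write(Tsv(args("output")))

  log.info("Pipeline defined at " + new java.util.Date) // still on the submitter
}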
Related
I have created a complex pipeline in which each stage calls a job. I want to see the console output of each job called in a stage, in Jenkins. How can I get it?
The object returned from a build step can be used to query the log like this:
pipeline {
    agent any
    stages {
        stage('test') {
            steps {
                echo 'Building anotherJob and getting the log'
                script {
                    def bRun = build 'anotherJob'
                    echo 'last 100 lines of anotherJob'
                    for (String line : bRun.getRawBuild().getLog(100)) {
                        echo line
                    }
                }
            }
        }
    }
}
The object returned from the build step is a RunWrapper object, and getRawBuild() returns a Run object; judging by that class, there may be other options besides reading the log line by line. For this to work you need to either disable the pipeline sandbox or get script approvals for these methods:
method hudson.model.Run getLog int
method org.jenkinsci.plugins.workflow.support.steps.build.RunWrapper getRawBuild
If you are doing this for many builds, it would be worth putting the code into a pipeline shared library, or defining a function in the pipeline itself.
In Spark 1.6.1 (the code is in Scala 2.10), I am trying to write a DataFrame to a Parquet file:
import sc.implicits._
val triples = file.map(p => _parse(p, " ", true)).toDF()
triples.write.mode(SaveMode.Overwrite).parquet("hdfs://some.external.ip.address:9000/tmp/table.parquet")
When I do it in development mode, everything works fine. It also works fine if I set up a master and one worker in standalone mode in a Docker environment (separate Docker containers) on the same machine. It fails when I try to execute it on a cluster (1 master, 5 workers). If I run it locally on the master, it also works.
When I try to execute it on the cluster, I get the following stack trace:
{
"duration": "18.716 secs",
"classPath": "LDFSparkLoaderJobTest2",
"startTime": "2016-07-18T11:41:03.299Z",
"context": "sql-context",
"result": {
"errorClass": "org.apache.spark.SparkException",
"cause": "Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, curry-n3): java.lang.NullPointerException
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:147)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.abortTask$1(WriterContainer.scala:294)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:271)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)\n\nDriver stacktrace:",
"stack":[
"org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)",
"scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)",
"scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)",
"org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)",
"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)",
"scala.Option.foreach(Option.scala:236)",
"org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)",
"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)",
"org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)",
"org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)",
"org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)",
"org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)",
"org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)",
"org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)",
"org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)",
"org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)",
"org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)",
"org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)",
"org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)",
"org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)",
"org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)",
"org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)",
"org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)",
"org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)",
"LDFSparkLoaderJobTest2$.readFile(SparkLoaderJob.scala:55)",
"LDFSparkLoaderJobTest2$.runJob(SparkLoaderJob.scala:48)",
"LDFSparkLoaderJobTest2$.runJob(SparkLoaderJob.scala:18)",
"spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:268)",
"scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)",
"scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)",
"java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)",
"java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)",
"java.lang.Thread.run(Thread.java:745)"
],
"causingClass": "org.apache.spark.SparkException",
"message": "Job aborted."
},
"status": "ERROR",
"jobId": "54ad3056-3aaa-415f-8352-ca8c57e02fe9"
}
Notes:
The job is submitted via the Spark Jobserver.
The file that needs to be converted to a Parquet file is 15.1 MB in size.
Question:
Is there something I am doing wrong (I followed the docs)?
Or is there another way I can create the Parquet file, so all my workers have access to it?
In your standalone setup only one worker is working with the ParquetRecordWriter, so it works fine.
In the case of a real test, i.e. the cluster (1 master, 5 workers), it fails with the ParquetRecordWriter because you are writing concurrently from multiple workers.
Please try the code below.
import sc.implicits._
val triples = file.map(p => _parse(p, " ", true)).toDF()
triples.write.mode(SaveMode.Append).parquet("hdfs://some.external.ip.address:9000/tmp/table.parquet")
Please see SaveMode.Append ("append"): when saving a DataFrame to a data source, if the data/table already exists, the contents of the DataFrame are expected to be appended to the existing data.
I had not exactly the same, but similar, issues writing DataFrames to Parquet files in cluster mode. Those problems disappeared when deleting the file just before writing, using this convenience function write(..):
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame
..
def main(arg: Array[String]) {
  ..
  val fs = FileSystem.get(sc.hadoopConfiguration)
  ..
  def write(df: DataFrame, fn: String) = {
    val op1 = s"hdfs:///user/you/$fn"
    // recursive delete; returns false (and does nothing) if the path does not exist
    fs.delete(new Path(op1), true)
    df.write.parquet(op1)
  }
  ..
}
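A call site matching the question's naming (the triples DataFrame and the table.parquet file name are taken from the question, purely for illustration) would then be:
write(triples, "table.parquet")  // deletes hdfs:///user/you/table.parquet if present, then writes it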
Give it a go, tell me if it works for you...
I want to make a MapReduce design like this inside one job.
Example: in one job I want:
[Mapper A] ---> [Mapper C]
[Mapper B] ---> [Reducer B]
After that [Reducer B] ---> [Mapper C]
[Mapper C] ---> [Reducer C]
So [Mapper A] & [Reducer B] ---> [Mapper C], and then [Mapper C] continues to [Reducer C]. I want the whole scenario above to run in one job.
It's like routing inside one MapReduce job: I want to route many mappers to a particular reducer, then continue to another mapper and then a reducer again, all inside one job. I need your suggestions.
Thanks.
--Edit starts
To simplify the problem, let's say you have three jobs, JobA, JobB, and JobC, each comprising a map and a reduce phase.
Now you want to use the mapper output of JobA in the mapper task of JobC, so JobC just needs to wait for JobA to finish its map task. You can use the MultipleOutputs class in JobA to preserve/write the map-phase output to a location which JobC can poll for, as sketched just below.
--Edit ends
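For illustration, here is a minimal, hypothetical sketch of that idea (written in Scala against the standard org.apache.hadoop.mapreduce API; the JobAMapper class name, the mapSideCopy named output, and the toy record handling are all made up, so adapt them to your own jobs):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs

// Mapper for JobA: besides its normal output (which feeds JobA's reducer), it writes a
// copy of every record to the named output "mapSideCopy". Those files appear in JobA's
// output directory as mapSideCopy-m-NNNNN, and JobC can poll for them and read them.
class JobAMapper extends Mapper[LongWritable, Text, Text, Text] {
  private var mos: MultipleOutputs[Text, Text] = _

  override def setup(ctx: Mapper[LongWritable, Text, Text, Text]#Context): Unit =
    mos = new MultipleOutputs[Text, Text](ctx)

  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, Text]#Context): Unit = {
    val word = new Text(value.toString.trim)      // placeholder transformation
    ctx.write(word, new Text("1"))                // normal map output, consumed by JobA's reducer
    mos.write("mapSideCopy", word, new Text("1")) // preserved copy for JobC
  }

  override def cleanup(ctx: Mapper[LongWritable, Text, Text, Text]#Context): Unit =
    mos.close()
}

// In JobA's driver (e.g. inside MapperA.getJob(...) below), register the named output:
//   MultipleOutputs.addNamedOutput(job, "mapSideCopy",
//     classOf[org.apache.hadoop.mapreduce.lib.output.TextOutputFormat[Text, Text]],
//     classOf[Text], classOf[Text])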
Programmatically, you can do something like the code below, where getJob() should be defined in the respective MapReduce class; that is where you specify the configuration, DistributedCache, input formats, etc.
main() {
    processMapperA();
    processMapReduceB();
    processMapReduceC();
}

processMapperA() {
    // configure the paths/inputs needed; for example's sake I am taking two paths
    String path1 = "path1";
    String path2 = "path2";
    String[] mapperApaths = new String[]{path1, path2};

    Job mapperAjob = MapperA.getJob(mapperApaths, <some other params you want to pass>);
    mapperAjob.submit();
    mapperAjob.waitForCompletion(true);
}

processMapReduceB() {
    // init input params to job
    .
    .
    Job mapReduceBjob = MapReduceB.getJob(<input params you want to pass>);
    mapReduceBjob.submit();
    mapReduceBjob.waitForCompletion(true);
}

processMapReduceC() {
    // init input params to job
    .
    .
    Job mapReduceCjob = MapReduceC.getJob(<input params you want to pass like outputMapperA, outputReducerB>);
    mapReduceCjob.submit();
    mapReduceCjob.waitForCompletion(true);
}
To gain more control over the workflow, you can consider using Oozie or Spring Batch.
With Oozie you can define workflow.xml files and schedule the execution of each job as required.
Spring Batch can also be used for the same purpose, but it requires some coding and understanding; if you already have a background in it, it can be used straight away.
--Edit starts
Oozie is a workflow management tool; it allows you to configure and schedule jobs.
--Edit Ends
Hope this helps.
I am using the Rscript below as a mapper in Hadoop streaming. I want to see log info/warn messages etc. on the TaskTracker console, or in whatever other log location Oozie provides, but for some reason nothing shows up. My Oozie job completes successfully.
Script
#! /usr/bin/env Rscript
library(methods)
library(utils)
library(devtools)
library(corpcor)
library(getopt)
library(logging)
library(HadoopStreaming)
main <- function() {
  paste("A", 1:50, sep = "")
  input <- file("stdin", open = "r")
  loginfo("CUSTOM ERROR")
  targets <- read.table(file = "meta_reference1.csv", sep = ";")
  print("############################################")
  print(targets)
  close(input)
}
Updated Rscript for testing purposes:
#! /usr/bin/env Rscript
library(methods)
library(utils)
library(devtools)
library(corpcor)
library(getopt)
library(logging)
library(HadoopStreaming)
main <- function() {
  write("prints to stderr", stderr())
  write("prints to stdout", stdout())
}
No log appeared. Please suggest what I should change.
Try using write instead of print; that worked for me.
write("prints to stderr", stderr())
write("prints to stdout", stdout())
I have a hadoop streaming job whose output does not contain key/value pairs. You can think of it as value-only pairs or key-only pairs.
My streaming reducer (a php script) is outputting records separated by newlines. Hadoop streaming treats this as a key with no value, and inserts a tab before the newline. This extra tab is unwanted.
How do I remove it?
I am using hadoop 1.0.3 with AWS EMR. I downloaded the source of hadoop 1.0.3 and found this code in hadoop-1.0.3/src/contrib/streaming/src/java/org/apache/hadoop/streaming/PipeReducer.java :
reduceOutFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t").getBytes("UTF-8");
So I tried passing -D stream.reduce.output.field.separator= as an argument to the job with no luck. I also tried -D mapred.textoutputformat.separator= and -D mapreduce.output.textoutputformat.separator= with no luck.
I've searched Google, of course, and nothing I found worked. One search result even stated there was no argument that could be passed to achieve the desired result (though the Hadoop version in that case was very old).
Here is my code (with added line breaks for readability):
hadoop jar streaming.jar -files s3n://path/to/a/file.json#file.json
-D mapred.output.compress=true -D stream.reduce.output.field.separator=
-input s3n://path/to/some/input/*/* -output hdfs:///path/to/output/dir
-mapper 'php my_mapper.php' -reducer 'php my_reducer.php'
In case it is helpful for others: using the tips above, I was able to do an implementation:
CustomOutputFormat<K, V> extends org.apache.hadoop.mapred.TextOutputFormat<K, V> {....}
with exactly one line of the built-in implementation of 'getRecordWriter' changed to:
String keyValueSeparator = job.get("mapred.textoutputformat.separator", "");
instead of:
String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
After compiling that into a jar and including it in my hadoop streaming call (via the instructions on hadoop streaming), the call looked like:
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar \
-archives 'hdfs:///user/the/path/to/your/jar/onHDFS/theNameOfTheJar.jar' \
-libjars theNameOfTheJar.jar \
-outputformat com.yourcompanyHere.package.path.tojavafile.CustomOutputFormat \
-file yourMapper.py -mapper yourMapper.py \
-file yourReducer.py -reducer yourReducer.py \
-input $yourInputFile \
-output $yourOutputDirectoryOnHDFS
I also included the jar in the folder I issued that call from.
It worked great for my needs (and it created no tabs at the end of the lines after the reducer).
Update: based on a comment implying this is indeed helpful for others, here's the full source of my CustomOutputFormat.java file:
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Progressable;
import org.apache.hadoop.util.ReflectionUtils;
public class CustomOutputFormat<K, V> extends TextOutputFormat<K, V> {

    public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job, String name,
            Progressable progress) throws IOException {
        boolean isCompressed = getCompressOutput(job);
        // Changing the default separator from '\t' to blank
        String keyValueSeparator = job.get("mapred.textoutputformat.separator", ""); // was "\t"
        if (!isCompressed) {
            Path file = FileOutputFormat.getTaskOutputPath(job, name);
            FileSystem fs = file.getFileSystem(job);
            FSDataOutputStream fileOut = fs.create(file, progress);
            return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
        } else {
            Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job,
                    GzipCodec.class);
            // create the named codec
            CompressionCodec codec = ReflectionUtils.newInstance(codecClass, job);
            // build the filename including the extension
            Path file = FileOutputFormat.getTaskOutputPath(job, name + codec.getDefaultExtension());
            FileSystem fs = file.getFileSystem(job);
            FSDataOutputStream fileOut = fs.create(file, progress);
            return new LineRecordWriter<K, V>(new DataOutputStream(
                    codec.createOutputStream(fileOut)), keyValueSeparator);
        }
    }
}
FYI: For your usage context, be sure to check this does not adversely affect hadoop-streaming managed interactions (in terms of separating key vs. value) between your mapper and reducer. To clarify:
From my testing: if you have a tab in every line of your data (with something on each side of it), you can leave the built-in defaults as they are. Streaming will interpret everything before the first tab as your 'key' and everything on that row after it as your 'value'. As such, it does not see a 'null value' and won't append a tab after your reducer's output. (Your final outputs will be sorted on the 'key' that streaming interprets in each row, i.e. whatever it sees occurring before each tab.)
Conversely, if you have no tabs in your data, and you don't override the defaults using the above trick(s), then you'll see the tabs after the run completes, for which the above override becomes a fix.
Looking at the org.apache.hadoop.mapreduce.lib.output.TextOutputFormat source, I see two things:
The write(key, value) method writes a separator if the key or value is non-null.
The separator is always set, using the default (\t), when mapred.textoutputformat.separator returns null (which I'm assuming happens with -D stream.reduce.output.field.separator=).
Your only solution may be to write your own OutputFormat that works around these two issues.
My testing
In a task I had, I wanted to reformat a line from
id1|val1|val2|val3
id2|val1
into:
id1|val1,val2,val3
id2|val1
I had a custom mapper (a Perl script) to convert the lines. For this task, I initially tried to treat it as key-only (or value-only) input, but got the results with the trailing tab.
At first I just specified:
-D stream.map.input.field.separator='|' -D stream.map.output.field.separator='|'
This gave the mapper a key/value pair, since my mapping wanted a key anyway. But the output now had the tab after the first field.
I got the desired output when I added:
-D mapred.textoutputformat.separator='|'
If I didn't set it, or set it to blank,
-D mapred.textoutputformat.separator=
then I would again get a tab after the first field.
It made sense once I looked at the source for TextOutputFormat.
I too had this problem. I was using a Python, map-only job that was basically just emitting lines of CSV data. After examining the output, I noticed the \t at the end of every line.
foo,bar,baz\t
What I discovered is that the mapper and the Python stream are both dealing with key/value pairs. If you don't emit the default separator, the whole line of CSV data is considered the "key", and the framework, which requires a key and a value, slaps on a \t and an empty value.
Since my data was essentially a CSV string, I set the separator for both the stream and the mapred output to a comma. The framework read everything up to the first comma as the key and everything after the first comma as the value. Then, when it wrote the results out to a file, it wrote key, comma, value, which effectively created the output I was after.
foo,bar,baz
In my case, I added the options below to prevent the framework from adding \t to the end of my CSV output...
-D mapred.reduce.tasks=0 \
-D stream.map.output.field.separator=, \
-D mapred.textoutputformat.separator=, \