Oozie generate set of files in directory - hadoop

I'm trying to ingest log files into Hadoop.
I'd like to use Oozie to trigger my ingestion task (written in Spark) and have Oozie pass the file names to my task.
I expect the log files to be set out as:
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/Log1.2.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.1.log
/example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/Log2.2.log
(etc).
So, now I have two problems:
1. How to get Oozie to generate all the file names under /example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1/ and pass them to my app; and
2. How to get Oozie to generate, in parallel, all the file names under /example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log2/ and pass them to a second invocation of my task.

Date/time-based file names can be created with a small Java program, which can be called from the Oozie workflow.xml, something like:
String processedDateString = (new SimpleDateFormat("yyyyMMddhhmmss")).format(new Date(timeInMilis));
and while calling the same jar in the workflow:
<main-class>NameFile.jar</main-class>
<arg>Path=${output_path}</arg>
<arg>Name=${name}</arg>
<arg>processedDate=${(wf:actionData('Rename')['ProcessedDate'])}</arg>
For copying/moving you can use the same Java program with a copy action.
The Log1 and Log2 locations can be specified in job.properties.
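On the Spark side of this, here is a minimal PySpark sketch (hypothetical argument layout; it assumes the Oozie action passes the dated Log1 or Log2 directory to the task as its first argument) of how the task could consume that path:
import sys
from pyspark import SparkContext

# e.g. /example/${YEAR}-${MONTH}-${DAY}-${HOUR}:${MINUTE}/Log1, passed in by the Oozie action
log_dir = sys.argv[1]

sc = SparkContext(appName="log-ingest")  # hypothetical app name
# a glob picks up Log1.log, Log1.1.log, Log1.2.log, ... in one pass
logs = sc.textFile(log_dir + "/*.log")
print(logs.count())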

Related

Having multiple reduce tasks assemble a single HDFS file as output

Is there any low-level API in Hadoop allowing multiple reduce tasks running on different machines to assemble a single HDFS file as the output of their computation?
Something like: a stub HDFS file is created at the beginning of the job, then each reducer creates, as output, a variable number of data blocks and assigns them to this file in a certain order.
The answer is no, that would be an unnecessary complication for a rare use case.
What you should do
option 1 - add some code at the end of your hadoop command
int result = job.waitForCompletion(true) ? 0 : 1;
if (result == 0) { // status code OK
// ls job output directory, collect part-r-XXXXX file names
// create HDFS readers for files
// merge them in a single file in whatever way you want
}
All of the required methods are present in hadoop FileSystem api.
option 2 - add a job to merge files
You can create a generic hadoop job that accepts a directory name as input and passes everything as-is to a single reducer, which merges the results into one output file. Call this job in a pipeline with your main job.
This would work faster for big inputs.
If you want the merged output file locally, you can use the hadoop getmerge command to combine the multiple reduce-task files into a single local output file; the command is shown below.
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

How to use airflow for real time data processing

I have a scenario where I want to process a CSV file and load it into another database:
Cases
pick a CSV file and load it into MySQL with the same name as the CSV
then do some modifications on the loaded rows using a Python task
after that, extract the data from MySQL and load it into some other database
The CSV files arrive from a remote server into a folder on the Airflow server.
We have to pick up these CSV files and process them through a Python script.
Suppose I pick one CSV file; then I need to pass this CSV file to the rest of the operators in a dependency chain like
filename : abc.csv
task1 >> task2 >> task3 >>task4
So abc.csv should be available to all the tasks.
Please tell me how to proceed.
Your scenarios don't have anything to do with real time. This is ingesting on a schedule/interval. Or perhaps you could use a sensor operator to detect data availability.
Implement each of your requirements as functions and call them from operator instances.
Add the operators to a DAG with a schedule appropriate for your incoming feed.
The ways to pass and access params are (see the sketch after this list):
- kwargs passed to the python_callable when initializing an operator
- context['param_key'] in the execute method when extending an operator
- Jinja templates
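For illustration, a minimal sketch of those three mechanisms (hypothetical task ids, file names and param keys, written against the Airflow 1.x-style PythonOperator/BaseOperator APIs used elsewhere in this thread):
from airflow.models import BaseOperator
from airflow.operators.python_operator import PythonOperator

# 1) kwargs handed to the python_callable when the operator is initialized
def process_file(filename, **context):
    print("processing", filename)

process_task = PythonOperator(
    task_id='process_file',
    python_callable=process_file,
    op_kwargs={'filename': 'abc.csv'},
    provide_context=True,
    dag=dag,  # hypothetical DAG object defined elsewhere
)

# 2) context lookup inside execute() when extending an operator
class MyOperator(BaseOperator):
    def execute(self, context):
        return context['execution_date']

# 3) Jinja templates in templated fields, e.g. for a BashOperator:
#    bash_command='python process.py --date {{ ds }} --file {{ params.filename }}'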
relevant...
airflow pass parameter from cli
execution_date in airflow: need to access as a variable
The way tasks communicate in Airflow is using XCOM, but it is meant for small values, not for file content.
If you want your tasks to work with the same csv file you should save it on some location and then pass in the XCOM the path to this location.
We are using the LocalExecutor, so the local file system is fine for us.
We decided to create a folder for each dag with the name of the dag. Inside that folder we generate a folder for each execution date (we do this in the first task, that we always call start_task). Then we pass the path of this folder to the subsequent tasks via Xcom.
Example code for the start_task:
import os
from airflow.operators.python_operator import PythonOperator

# DATE_FORMAT, share_path and dag are defined elsewhere in the DAG file
def start(share_path, **context):
    execution_date_as_string = context['execution_date'].strftime(DATE_FORMAT)
    execution_folder_path = os.path.join(share_path, 'my_dag_name', execution_date_as_string)
    # helper that (re)creates the folder, deleting it first if it already exists
    _create_folder_delete_if_exists(execution_folder_path)
    task_instance = context['task_instance']
    task_instance.xcom_push(key="execution_folder_path", value=execution_folder_path)

start_task = PythonOperator(
    task_id='start_task',
    provide_context=True,
    python_callable=start,
    op_args=[share_path],
    dag=dag
)
The share_path is the base directory for all DAGs; we keep it in the Airflow Variables.
Subsequent tasks can get the execution folder with:
execution_folder_path = task_instance.xcom_pull(task_ids='start_task', key='execution_folder_path')
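For completeness, a hedged sketch of what a downstream task might look like (hypothetical task and file names, same Airflow 1.x-style API as the start_task above):
import os

def process_csv(**context):
    task_instance = context['task_instance']
    # fetch the folder that start_task created and published via XCom
    execution_folder_path = task_instance.xcom_pull(
        task_ids='start_task', key='execution_folder_path')
    csv_path = os.path.join(execution_folder_path, 'abc.csv')  # hypothetical file name
    # ... read / transform csv_path here ...

process_task = PythonOperator(
    task_id='process_task',
    provide_context=True,
    python_callable=process_csv,
    dag=dag,
)

start_task >> process_task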

Working with zips in pyspark

I have n zips in a directory and I want to extract each one of them, then pull some data out of a file or two inside the zips and add it to a graph DB. I have made a sequential Python script for this whole thing, but I am stuck converting it to Spark. All of my zips are in an HDFS directory, and the graph DB is Neo4j. I have yet to learn about connecting Spark with Neo4j, but I am stuck at an earlier step.
I am thinking my code should be along these lines.
# Names of all my zips
zip_names = ["a.zip", "b.zip", "c.zip"]
# function extract_and_populate_graphdb() returns 1 after doing all the work.
# This was done so that a closure can be applied to start the spark job.
sc.parallelize(zip_names).map(extract_and_populate_graphdb).reduce(lambda a, b: a + b)
What I can't work out, in order to test this, is how to extract the zips and read the files within. I was able to read a zip with sc.textFile, but running take(1) on it returned hex data.
So, is it possible to read in a zip and extract the data? Or should I extract the data before putting it into HDFS? Or maybe there's some other approach to deal with this?
Updated answer:
If you'd like to use Gzip compressed files, there are parameters you can set when you configure your Spark shell or Spark job that allow you to read and write compressed data.
--conf spark.hadoop.mapred.output.compress=true \
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
--conf spark.hadoop.mapred.output.compression.type=BLOCK
Add those to the bash script you are currently using to create a shell (e.g. pyspark) and you can read and write compressed data.
Unfortunately, there is no innate support of Zip files, so you'll need to do a bit more legwork to get there.
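For actual .zip archives, one common workaround (a sketch only, assuming each archive fits comfortably in an executor's memory, and using a made-up HDFS path) is to read them as whole binary files and unpack them with Python's zipfile module on the executors:
import io
import zipfile

def extract_zip(path_and_bytes):
    # yield (zip path, member name, member bytes) for every file inside one zip
    path, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for name in archive.namelist():
            yield path, name, archive.read(name)

# binaryFiles returns one (path, content) pair per file, without splitting the files
zips = sc.binaryFiles("hdfs:///path/to/zips/*.zip")  # hypothetical location
entries = zips.flatMap(extract_zip)
print(entries.take(1))
From there, the per-entry parsing (and eventually the Neo4j writes) can be done in a map or foreachPartition over entries.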

Hadoop: How to generate custom reduce output file name?

Now, I use MultipleOutputs.
I would like to remove the suffix string "-00001" from the reducer's output filename, such as "xxxx-[r/m]-00001".
Any ideas?
Thanks.
From the Hadoop javadoc for the write() method of MultipleOutputs:
Output path is a unique file generated for the namedOutput. For example, {namedOutput}-(m|r)-{part-number}
So you need to rename or merge these files on HDFS.
I think you can do it in the job driver: when your job completes, change the file names. You could also do it via terminal commands (see the sketch below).
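As an illustration of the terminal-command route, here is a hedged Python sketch (hypothetical output path, assuming the hdfs CLI is available where the driver runs and that its -ls -C flag is supported) that strips the -r-NNNNN / -m-NNNNN suffix after the job finishes:
import re
import subprocess

output_dir = "/user/me/job-output"  # hypothetical HDFS output directory

# -C prints bare paths, one per line
listing = subprocess.check_output(["hdfs", "dfs", "-ls", "-C", output_dir]).decode()
for path in listing.splitlines():
    name = path.rsplit("/", 1)[-1]
    new_name = re.sub(r"-[rm]-\d{5}$", "", name)  # "xxxx-r-00001" -> "xxxx"
    if new_name != name and new_name:
        # note: this assumes one part file per named output, otherwise the new names collide
        subprocess.check_call(["hdfs", "dfs", "-mv", path, output_dir + "/" + new_name])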

Hadoop preinstalled example Jars

I just successfully set up Hadoop on my local machines. I am following one of the examples in a popular book I just bought. I am trying to get a list of all the Hadoop examples that come with the installation. I type the following command to do so:
bin/hadoop jar hadoop-*-examples.jar
Once I enter this, I am supposed to get a list of Hadoop examples, right? However, all I see is this error message:
Not a valid JAR: /home/user/hadoop/hadoop-*-examples.jar
How do I solve this problem? Is it just a simple permission issue?
This is most probably a configuration issue or the use of an invalid file path.
Most probably the name hadoop-*-examples.jar is not correct, because in my version of Hadoop (1.0.0) the file name is hadoop-examples-1.0.0.jar.
So I ran the following command to list all examples, and it works like a charm:
bin/hadoop jar hadoop-examples-*.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
Also, if I use the same file name pattern as you, I get an error:
bin/hadoop jar hadoop-*examples.jar
Exception in thread "main" java.io.IOException: Error opening job jar: hadoop-*examples.jar
HTH
You must specify the class name of the jar file which you want to use:
hadoop jar pathtojarfile classname arg1 arg2 ..
Example:
hadoop jar example.jar wordcount inputPath outputPath
@Anup: the full or relative path to the jar file is required.
In your case it might be /home/user/hadoop/share/hadoop-*-examples.jar
The complete command from hadoop's directory might be
/home/user/hadoop/bin/hadoop jar /home/user/hadoop/share/hadoop-*-examples.jar
(I used absolute full paths there, but you can use relative paths).
You will find the jar in $HADOOP_HOME/share/hadoop/mapreduce/hadoop-*-examples*.jar
