Concatenate S3 files to read in EMR - hadoop

I have an S3 bucket with log files that I want to concatenate, then use as an input to an EMR job. The log files are in paths like: bucket-name/[date]/product/out/[hour]/[minute-based-file]. I'd like to take all the minute logs in all the hour directories in all the date directories, and concatenate them into one file. I want to use that file as an input to an EMR job. The original log files need to be preserved, and the new combined log file will probably be written to a different S3 bucket.
I tried using hadoop fs -getmerge on the EMR master node via SSH, but got this error:
This file system object (file:///) does not support access to the request path 's3://target-bucket-name/merged.log'
The source S3 bucket has some other files in it, so I don't want to include all of its files. The wildcard match looks like this: s3n://bucket-name/*/product/out/*/log.*.
The purpose is to get around the problem of having tens/hundreds of thousands of small (10k-3mb) input files to EMR, and instead give it one large file that it can split more efficiently.
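The error above arises because `-getmerge` writes its result to the local filesystem, so an `s3://` destination is rejected. A minimal sketch of a two-step workaround follows. It assumes a Hadoop release whose `-getmerge` expands glob patterns in the source path (not verified for every version) and enough local disk for the merged file; `merge_to_s3` is a hypothetical helper name.

```shell
# Hypothetical helper: merge matching S3 objects into one local file,
# then upload that file to a destination URI.
#   $1  source glob, e.g. 's3n://bucket-name/*/product/out/*/log.*'
#   $2  scratch file on the master node's local disk
#   $3  destination URI, e.g. 's3://target-bucket-name/merged.log'
merge_to_s3() {
  src_glob="$1"
  local_file="$2"
  dest_uri="$3"

  # getmerge's destination must be a local path, not s3://
  hadoop fs -getmerge "$src_glob" "$local_file" || return 1

  # A second command copies the merged local file to S3
  hadoop fs -put "$local_file" "$dest_uri"
}
```

It could be called as `merge_to_s3 's3n://bucket-name/*/product/out/*/log.*' /mnt/tmp/merged.log s3://target-bucket-name/merged.log`. The single quotes keep the local shell from expanding the glob, so Hadoop receives the pattern unchanged.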

I ended up just writing a script that wraps some Hadoop filesystem commands to do this.
#!/usr/bin/env ruby
require 'date'
# Merge minute-based log files into daily log files
# Usage: Run on EMR master (e.g. SSH to master then `ruby ~/merge-historical-logs.rb [FROM [TO]]`)
SOURCE_BUCKET_NAME = 's3-logs-bucket'
DESTINATION_BUCKET_NAME = 's3-merged-logs-bucket'
# Optional date inputs
min_date = if ARGV[0]
min_date_args = ARGV[0].split('-').map {|item| item.to_i}
Date.new(*min_date_args)
else
Date.new(2012, 9, 1)
end
max_date = if ARGV[1]
max_date_args = ARGV[1].split('-').map {|item| item.to_i}
Date.new(*max_date_args)
else
Date.today
end
# Setup directories
hdfs_logs_dir = '/mnt/tmp/logs'
local_tmp_dir = './_tmp_merges'
puts "Cleaning up filesystem"
system "hadoop fs -rmr #{hdfs_logs_dir}"
system "rm -rf #{local_tmp_dir}*"
puts "Making HDFS directories"
system "hadoop fs -mkdir #{hdfs_logs_dir}"
# We will progress backwards, from max to min
date = max_date
while date >= min_date
# Format date pieces
year = date.year
month = "%02d" % date.month
day = "%02d" % date.day
# Make a directory in HDFS to store this day's hourly logs
today_hours_dir = "#{hdfs_logs_dir}/#{year}-#{month}-#{day}"
puts "Making today's hourly directory"
system "hadoop fs -mkdir #{today_hours_dir}"
# Break the day's hours into a few chunks
# This seems to avoid some problems when we run lots of getmerge commands in parallel
[*(0..23)].each_slice(8).to_a.each do |hour_chunk|
hour_chunk.each do |_hour|
hour = "%02d" % _hour
# Setup args to merge minute logs into hour logs
source_file = "s3://#{SOURCE_BUCKET_NAME}/#{year}-#{month}-#{day}/product/out/#{hour}/"
output_file = "#{local_tmp_dir}/#{hour}.log"
# Launch each hour's getmerge in the background
full_command = "hadoop fs -getmerge #{source_file} #{output_file}"
puts "Forking: #{full_command}"
fork { system full_command }
end
# Wait for this batch of getmerge commands to finish
Process.waitall
end
# Delete the local temp files Hadoop created
puts "Removing temp files"
system "rm #{local_tmp_dir}/.*.crc"
# Move local hourly logs to hdfs to free up local space
puts "Moving local logs to HDFS"
system "hadoop fs -put #{local_tmp_dir}/* #{today_hours_dir}"
puts "Removing local logs"
system "rm -rf #{local_tmp_dir}"
# Merge the day's hourly logs into a single daily log file
daily_log_file_name = "#{year}-#{month}-#{day}.log"
daily_log_file_path = "#{local_tmp_dir}_day/#{daily_log_file_name}"
puts "Merging hourly logs into daily log"
system "hadoop fs -getmerge #{today_hours_dir}/ #{daily_log_file_path}"
# Write the daily log file to another s3 bucket
puts "Writing daily log to s3"
system "hadoop fs -put #{daily_log_file_path} s3://#{DESTINATION_BUCKET_NAME}/daily-merged-logs/#{daily_log_file_name}"
# Remove daily log locally
puts "Removing local daily logs"
system "rm -rf #{local_tmp_dir}_day"
# Remove the hourly logs from HDFS
puts "Removing HDFS hourly logs"
system "hadoop fs -rmr #{today_hours_dir}"
# Go back in time
date -= 1
end

Related

Hadoop : Using Pig to add text at the end of every line of a hdfs file

We have files in HDFS with raw logs, each individual log is a line as these logs are line separated.
Our requirement is to add a text (e.g. ' 12345') at the end of every log line in these files, using Pig, a hadoop command, or any other MapReduce-based tool.
Please advise.
Thanks
AJ
Load the files so that each log entry goes into one field, i.e. line:chararray, and use CONCAT to add the text to each line. Then store the result in a new log file. If you want to keep individual files, you will have to parameterize the script to load each file and store it into a new file, instead of using a wildcard load.
Log = LOAD '/path/wildcard/*.log' USING TextLoader() AS (line:chararray);
Log_Text = FOREACH Log GENERATE CONCAT(line,'Your Text') AS newline;
STORE Log_Text INTO '/path/NewLog.log';
If your files aren't extremely large, you can do that with a single shell command.
hdfs dfs -cat /user/hdfs/logfile.log | sed 's/$/12345/g' |\
hdfs dfs -put - /user/hdfs/newlogfile.txt

copy file from unc to hdfs using shellscript

I have UNC path folders in this path " //aloha/log/folderlevel1/folderlevel2/"
Each of these level2 folders will have files like "empllog.txt","deptlog.txt","adminlog.txt" and few others files as well.
I want to copy the contents of each of these folders to an HDFS Cloudera cluster if the folder was created in the last 24 hours, and only if all 3 of these files are present. If any one of these files is missing, that folder should not be copied. I also need to preserve the folder structure,
i.e. in HDFS it should be "/user/test/todaydate/folderlevel1/folderlevel2".
I have written the shell script below to copy files to HDFS with a date folder created, but I am not sure how to proceed with the UNC paths and the other criteria.
day=$(date +%Y-%m-%d)
srcdir="/home/test/sparkjops"
stdir="/user/test/$day/"
hadoop dfs -mkdir $day /user/test
for f in ${srcdir}/*
do
if [ $f == "$srcdir/empllog.txt" ]
then
hadoop dfs -put $f $stdir
elif [ $f == "$srcdir/deptlog.txt" ]
then hadoop dfs -put $f $stdir
elif [ $f == "$srcdir/adminlog.txt" ]
then hadoop dfs -put $f $stdir
fi
done
I have tried to change the UNC Path like below . It did not do anything. No error & did not copy the content as well.
srcdir="//aloha/log/*/*"
srcdir='//aloha/log/*/*'
srcdir="\\aloha\log\*\*"
Appreciate all help.
Thanks.
EDIT 1 :
I ran it in debug mode with sh -x, and also with bash -x (just to check), but it returned a file-not-found error, as below:
test@ubuntu:~/sparkjops$ sh -x ./hdfscopy.sh
+ date +%Y-%m-%d
+ day=2016-12-24
+ srcdir= //aloha/logs/folderlevel1/folderlevel2
+ stdir=/user/test/2016-12-24/
+ hadoop dfs -mkdir 2016-12-24 /user/test
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
mkdir: `2016-12-24': File exists
mkdir: `/user/test': File exists
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/empllog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/deptlog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/adminlog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
test@ubuntu:~/sparkjops$
But I am not able to understand why it is not reading from that path. I have also tried different escape sequences (a double slash for each slash, and the slash style used in Windows folder paths), but none of them work; all throw the same error message. I am not sure how to read these files in the script. Any help would be appreciated.
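On Linux the shell does not resolve a path such as `//aloha/log` as a network share, which would explain why the glob in the trace stays unexpanded; the share generally has to be mounted first (for example over CIFS). A rough sketch of the selection logic described in the question follows. It assumes the share is already mounted at a local directory, that `find` supports `-mindepth`/`-maxdepth`, and that modification time is an acceptable stand-in for creation time; `has_required_logs` and `copy_recent_folders` are hypothetical names.

```shell
# Succeeds only if all three required files exist in directory $1.
has_required_logs() {
  for name in empllog.txt deptlog.txt adminlog.txt; do
    [ -f "$1/$name" ] || return 1
  done
}

# Copies every folderlevel1/folderlevel2 directory under the mounted share
# ($1) that was modified in the last 24 hours and holds all required files
# into the HDFS directory $2, keeping the folder structure.
copy_recent_folders() {
  mount_root="$1"
  hdfs_root="$2"
  find "$mount_root" -mindepth 2 -maxdepth 2 -type d -mtime -1 |
  while IFS= read -r dir; do
    has_required_logs "$dir" || continue
    rel="${dir#"$mount_root"/}"        # folderlevel1/folderlevel2
    parent="$hdfs_root/$(dirname "$rel")"
    hdfs dfs -mkdir -p "$parent"
    hdfs dfs -put "$dir" "$parent/"
  done
}
```

With the share mounted at, say, `/mnt/aloha/log`, a call such as `copy_recent_folders /mnt/aloha/log "/user/test/$(date +%Y-%m-%d)"` would place each qualifying folder under `/user/test/todaydate/folderlevel1/`.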

Merging small files in hadoop

I have a directory (Final Dir) in HDFS into which some files (e.g. 10 MB each) are loaded every minute.
After some time I want to combine all the small files into one large file (e.g. 100 MB), but the user keeps pushing files to Final Dir; it is a continuous process.
So the first time, I need to combine the first 10 files into a large file (e.g. large.txt) and save it to Final Dir.
Now my question is: how do I get the next 10 files, excluding the first 10?
Can someone please help me?
Here is one more alternative. It is still the legacy approach pointed out by @Andrew in his comments, but with extra steps: the input folder acts as a buffer that receives the small files, which are periodically moved to a tmp directory, merged, and pushed back to input as one large file.
step 1 : create a tmp directory
hadoop fs -mkdir tmp
step 2 : move all the small files to the tmp directory at a point of time
hadoop fs -mv input/*.txt tmp
step 3 -merge the small files with the help of hadoop-streaming jar
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/abc/tmp" \
-output "/user/abc/output" \
-mapper cat \
-reducer cat
step 4- move the output to the input folder
hadoop fs -mv output/part-00000 input/large_file.txt
step 5 - remove output
hadoop fs -rm -R output/
step 6 - remove all the files from tmp
hadoop fs -rm tmp/*.txt
Create a shell script from step 2 till step 6 and schedule it to run at regular intervals to merge the smaller files at regular intervals (may be for every minute based on your need)
Steps to schedule a cron job for merging small files
step 1: create a shell script /home/abc/mergejob.sh with the help of above steps (2 to 6)
important note: you need to specify the absolute path of hadoop in the script to be understood by cron
#!/bin/bash
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv input/*.txt tmp
wait
/home/abc/hadoop-2.6.0/bin/hadoop jar /home/abc/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/abc/tmp" \
-output "/user/abc/output" \
-mapper cat \
-reducer cat
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv output/part-00000 input/large_file.txt
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm -R output/
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm tmp/*.txt
step 2: schedule the script using cron to run every minute using cron expression
a) edit crontab by choosing an editor
>crontab -e
b) add the following line at the end and exit from the editor
* * * * * /bin/bash /home/abc/mergejob.sh > /dev/null 2>&1
The merge job will be scheduled to run for every minute.
Hope this was helpful.
@Andrew pointed you to a solution that was appropriate 6 years ago, in a batch-oriented world.
But it's 2016, you have a micro-batch data flow running and require a non-blocking solution.
That's how I would do it:
create an EXTERNAL table with 3 partitions, mapped on 3 directories, e.g. new_data, reorg and history
feed the new files into new_data
implement a job to run the batch compaction, and run it periodically
Now the batch compaction logic:
make sure that no SELECT query will be executed while the compaction is running, else it would return duplicates
select all files that are ripe for compaction (define your own criteria) and move them from new_data directory to reorg
merge the content of all these reorg files, into a new file in history dir (feel free to GZip it on the fly, Hive will recognize the .gz extension)
drop the files in reorg
So it's basically the old 2010 story, except that your existing data flow can continue dumping new files into new_data while the compaction is safely running in separate directories. And in case the compaction job crashes, you can safely investigate / clean-up / resume the compaction without compromising the data flow.
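The compaction steps above might be scripted roughly as follows. This is a sketch only: the `compact` function name, the directory arguments, and the rule that every file currently in new_data is ripe are assumptions, and it does nothing to block concurrent SELECT queries.

```shell
# Hypothetical compaction pass over three HDFS directories.
#   $1 new_data directory, $2 reorg directory, $3 history directory
compact() {
  new_dir="$1"
  reorg_dir="$2"
  history_dir="$3"
  stamp=$(date +%Y%m%d%H%M%S)

  # 1. Move the files chosen for compaction out of the landing directory.
  #    The glob is quoted so that HDFS, not the local shell, expands it.
  hdfs dfs -mv "$new_dir/*" "$reorg_dir/" || return 1

  # 2. Concatenate them, gzip on the fly, and write one file to history.
  #    Only the exit status of the final put is checked here.
  hdfs dfs -cat "$reorg_dir/*" | gzip |
    hdfs dfs -put - "$history_dir/merged-$stamp.gz" || return 1

  # 3. Drop the originals only after the merged file has been written.
  hdfs dfs -rm "$reorg_dir/*"
}
```

Because new files keep landing in new_data while steps 2 and 3 work only on reorg, a crash in the middle leaves the incoming flow untouched, which is the point of the three-directory layout.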
By the way, I am not a big fan of the 2010 solution based on a "Hadoop Streaming" job -- on one hand, "streaming" has a very different meaning now; on the other hand, "Hadoop Streaming" was useful in the old days but has since dropped off the radar; on the gripping hand [*] you can do it quite simply with a Hive query, e.g.
INSERT INTO TABLE blahblah PARTITION (stage='history')
SELECT a, b, c, d
FROM blahblah
WHERE stage='reorg'
;
With a couple of SET some.property = somevalue before that query, you can define what compression codec will be applied on the result file(s), how many file(s) you want (or more precisely, how big you want the files to be - Hive will run the merge accordingly), etc.
Look into https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties under hive.merge.mapfiles and hive.merge.mapredfiles (or hive.merge.tezfiles if you use TEZ) and hive.merge.smallfiles.avgsize and then hive.exec.compress.output and mapreduce.output.fileoutputformat.compress.codec -- plus hive.hadoop.supports.splittable.combineinputformat to reduce the number of Map containers since your input files are quite small.
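As an illustration of the settings mentioned above, the query could be launched from a small shell wrapper like the one below. The property names are the ones listed above; the values (a 128 MB average size threshold, the Gzip codec) and the `build_compaction_sql` helper are assumptions chosen for the example, and the table and column names are the placeholders from the query above.

```shell
# Hypothetical helper: prints the SET statements followed by the
# compaction query, ready to be handed to the Hive client.
build_compaction_sql() {
  cat <<'SQL'
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=134217728;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
INSERT INTO TABLE blahblah PARTITION (stage='history')
SELECT a, b, c, d
FROM blahblah
WHERE stage='reorg';
SQL
}

# Usage (needs a Hive client on the PATH):
#   hive -e "$(build_compaction_sql)"
```

Keeping the statements in one function makes it easy to review the exact SQL before a scheduler runs it.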
[*] very old SF reference here :-)

Spark - on EMR saveAsTextFile wont write data to local dir

Running Spark on EMR (AMI 3.8). When trying to write an RDD to a local file, I am getting no results on the name/master node.
On my previous EMR cluster (same version of Spark installed with bootstrap script instead of as an add-on to EMR), the data would write to the local dir on the name node. Now I can see it appearing in "/home/hadoop/test/_temporary/0/task*" directories on the other nodes in the cluster, but only the 'SUCCESS' file on the master node.
How can I get the file to write to the name/master node only?
Here is an example of the command I am using:
myRDD.saveAsTextFile("file:///home/hadoop/test")
I can do this in a roundabout way by pushing to HDFS first and then writing the results to the local filesystem with shell commands, but I would love to hear if others have a more elegant approach.
import org.apache.spark.rdd.RDD
import scala.sys.process._

//rdd to local text file
def rddToFile(rdd: RDD[_], filePath: String): Unit = {
  //setting up bash commands (filePath is used both as the HDFS dir and the local file)
  val createFileStr = "hadoop fs -cat " + filePath + "/part* > " + filePath
  val removeDirStr = "hadoop fs -rm -r " + filePath
  //rm dir in case exists
  Process(Seq("bash", "-c", removeDirStr)).!
  //save data to HDFS
  rdd.saveAsTextFile(filePath)
  //write data to local file
  Process(Seq("bash", "-c", createFileStr)).!
  //rm HDFS dir
  Process(Seq("bash", "-c", removeDirStr)).!
}
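With a `file://` URI each task writes its partition on whichever node runs it, which matches the part files showing up on the other nodes. One possible shortcut for the HDFS round trip, assuming the RDD has already been saved to an HDFS directory, is `hadoop fs -getmerge`, which concatenates the files in that directory into a single local file. `fetch_rdd_output` is a hypothetical helper name.

```shell
# Hypothetical helper: pull a saved RDD from HDFS into one local file.
#   $1 HDFS directory written by saveAsTextFile
#   $2 local file to create on the node where this runs
fetch_rdd_output() {
  hdfs_dir="$1"
  local_file="$2"

  # Concatenate every file in the directory into the local file
  hadoop fs -getmerge "$hdfs_dir" "$local_file" || return 1

  # Remove the HDFS copy once the local file exists
  hadoop fs -rm -r "$hdfs_dir"
}
```

Run on the master node, for example as `fetch_rdd_output /user/hadoop/test /home/hadoop/test.txt`, this leaves the result only on that node.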

How to load data from local machine to hdfs using flume

I am new to Flume. How can I store log files from my local machine in HDFS using Flume?
I am having trouble setting the classpath and the flume.conf file.
Thank you,
ajay
agent.sources = weblog
agent.channels = memoryChannel
agent.sinks = mycluster
## Sources #########################################################
agent.sources.weblog.type = exec
agent.sources.weblog.command = tail -F REPLACE-WITH-PATH2-your.log-FILE
agent.sources.weblog.batchSize = 1
agent.sources.weblog.channels = REPLACE-WITH-CHANNEL-NAME
## Channels ########################################################
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100
agent.channels.memoryChannel.transactionCapacity = 100
## Sinks ###########################################################
agent.sinks.mycluster.type =REPLACE-WITH-CLUSTER-TYPE
agent.sinks.mycluster.hdfs.path=/user/root/flumedata
agent.sinks.mycluster.channel =REPLACE-WITH-CHANNEL-NAME
Save this file as logagent.conf and run it with the command below:
# flume-ng agent -n agent -f logagent.conf &
We need more information to know why things aren't working for you.
The short answer is that you need a Source to read your data from (maybe the spooling directory source), a Channel (memory channel if you don't need reliable storage) and the HDFS sink.
Update
The OP reports receiving the error message, "you must include conf file in flume class path".
You need to provide the conf file as an argument. You do so with the --conf-file parameter. For example, the command line I use in development is:
bin/flume-ng agent --conf-file /etc/flume-ng/conf/flume.conf --name castellan-indexer --conf /etc/flume-ng/conf
The error message reads that way because the bin/flume-ng script adds the contents of the --conf-file argument to the classpath before running Flume.
If you are appending data to your local file, you can use an exec source with "tail -F" command. If the file is static, use cat command to transfer the data to hadoop.
The overall architecture would be:
Source: Exec source reading data from your file
Channel : Either memory channel or file channel
Sink: Hdfs sink where data is being dumped.
Use user guide to create your conf file (https://flume.apache.org/FlumeUserGuide.html)
Once you have your conf file ready, you can run it like this:
bin/flume-ng agent -n $agent_name -c conf -f conf/your-flume-conf.conf
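Putting the pieces above together, a conf file for the exec source, memory channel, and HDFS sink can be generated as sketched below. The agent and component names are carried over from the earlier example; the `write_flume_conf` helper and the paths in the usage comment are placeholders, and `hdfs.fileType = DataStream` is an added assumption so that the sink writes plain text instead of its default SequenceFile format.

```shell
# Hypothetical helper: writes a minimal Flume agent configuration.
#   $1 local log file to tail, $2 HDFS target path, $3 conf file to create
write_flume_conf() {
  log_file="$1"
  hdfs_path="$2"
  conf_file="$3"
  cat > "$conf_file" <<EOF
agent.sources = weblog
agent.channels = memoryChannel
agent.sinks = mycluster

agent.sources.weblog.type = exec
agent.sources.weblog.command = tail -F $log_file
agent.sources.weblog.channels = memoryChannel

agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100
agent.channels.memoryChannel.transactionCapacity = 100

agent.sinks.mycluster.type = hdfs
agent.sinks.mycluster.hdfs.path = $hdfs_path
agent.sinks.mycluster.hdfs.fileType = DataStream
agent.sinks.mycluster.channel = memoryChannel
EOF
}

# Usage:
#   write_flume_conf /var/log/app.log /user/root/flumedata conf/logagent.conf
#   bin/flume-ng agent -n agent -c conf -f conf/logagent.conf
```

Note that a source takes a `channels` property (plural) while a sink takes a single `channel`, and the `-n` value must match the `agent` prefix used in the file.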

Resources