Passing files in different S3 folders as input to mapreduce - hadoop

Our log files are stored in year/month/day/hourly buckets on S3. See below for structure.
How do I pass all the logs for day=20 as input to my MapReduce program?
Eg:
bucket = logs/year=2014/month=8/day=20/hour=1/log1_1.txt
bucket = logs/year=2014/month=8/day=20/hour=2/log2_1.txt
bucket = logs/year=2014/month=8/day=20/hour=2/log2_2.txt
bucket = logs/year=2014/month=8/day=20/hour=2/log2_3.txt
bucket = logs/year=2014/month=8/day=20/hour=3/log3_1.txt
bucket = logs/year=2014/month=8/day=20/hour=4/log4_1.txt

When you say "bucket", do you actually mean distinct S3 buckets, or do you mean folders/directories within a bucket? Creating that many buckets would quickly hit the S3 account limit on the number of buckets you can create.
Assuming you meant folders/directories in the bucket, use s3distcp as a step in your EMR cluster to copy the logs you want to HDFS, and then use the HDFS directory as the input to the MR program.
s3distcp takes a src directory and a srcPattern to filter the items found under src. In your example, you could do:
./elastic-mapreduce --jobflow JobFlowID --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--arg --src --arg s3://logs/ \
--arg --srcPattern --arg '.*day=20.*' \
--arg --dest --arg hdfs:///input/
All of the log files that have day=20 in their path will be copied to the input directory on HDFS of the EMR cluster identified by JobFlowID.
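Since srcPattern is a regular expression applied to each full key, it is worth sanity-checking it locally before launching the step; note the paths in the question contain day=20, not day-20. A minimal stand-in check using grep -E on key names copied from the question:

```shell
# Local sanity check that the pattern matches only the day=20 keys
# (grep -E stands in for the regex s3distcp applies to each key):
printf '%s\n' \
  'logs/year=2014/month=8/day=20/hour=1/log1_1.txt' \
  'logs/year=2014/month=8/day=21/hour=1/log1_1.txt' \
| grep -E '.*day=20.*'
# logs/year=2014/month=8/day=20/hour=1/log1_1.txt
```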

Related

Copy files incrementally from S3 to EBS storage using filters

I wish to move a large set of files from an AWS S3 bucket in one AWS account (source), with systematic filenames that follow this pattern:
my_file_0_0_0.csv
...
my_file_0_7_200.csv
into an S3 bucket in another AWS account (target).
These need to be moved via an EC2 instance (to overcome IAM access restrictions) to an attached EBS volume incrementally (to overcome storage limitations).
Clarification:
in the filenames, there are 3 numbers separated by underscores, like so: _a_b_c, where a is always 0, b starts at 0 and goes up to 7, and c goes from 0 to maximally 200 (not guaranteed it will always reach 200).
(I have an SSH session to the EC2 instance through PuTTY.)
1st iteration:
So what I am trying to do in the first iteration is to copy all files from S3,
that have a name with the following pattern: my_file_0_0_*.csv.
This can be done with the command:
aws s3 cp s3://my_source_bucket_name/my_folder/ . --recursive --exclude "*" --include "my_file_0_0_*" --profile source_user
From here, I upload it to my target bucket with the command:
aws s3 cp . s3://my_target_bucket_name/my_folder/ --recursive --profile source_user
And finally delete the files from the EC2 instance's EBS volume with:
rm *
2nd iteration:
aws s3 cp s3://my_source_bucket_name/my_folder/ . --recursive --exclude "*" --include "my_file_0_1_*" --profile source_user
This time, I only get some of the files with pattern my_file_0_1_*, as their combined file sizes reach 100 GiB, which is the limit of my EBS volume.
Here I run into the issue that the filenames are sorted alphabetically and not numerically by the digits in their names, e.g.:
my_file_0_1_0.csv
my_file_0_1_1.csv
my_file_0_1_10.csv
my_file_0_1_100.csv
my_file_0_1_101.csv
my_file_0_1_102.csv
my_file_0_1_103.csv
my_file_0_1_104.csv
my_file_0_1_105.csv
my_file_0_1_106.csv
my_file_0_1_107.csv
my_file_0_1_108.csv
my_file_0_1_109.csv
my_file_0_1_11.csv
After moving them to the target S3 bucket and removing them from ebs,
the challenge is to move the remaining files with pattern my_file_0_1_* in a systematic way. Is there a way to achieve this, e.g. by using find, grep, awk or similar ?
And do I need to cast some filename-slices to integers first ?
You can use the sort -V command to order the files by their numeric components, then invoke the copy command on each file one by one, or on a list of files at a time.
ls | sort -V
If you're on a GNU system, you can also use ls -v. This won't work on macOS.
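To keep each pass under the 100 GiB EBS limit, one option is to sort the full key list once with sort -V and take a fixed-size slice per iteration. A minimal local sketch, with hardcoded filenames for illustration; in the real workflow the list would come from aws s3 ls and each slice would feed the aws s3 cp / rm cycle described above:

```shell
# Numerically sort version-style names and take the first batch of 3;
# later iterations would skip what was already transferred (e.g. with
# `sed -n '4,6p'` or by tracking a done-list).
printf 'my_file_0_1_%s.csv\n' 100 11 2 0 10 1 | sort -V | head -n 3
# my_file_0_1_0.csv
# my_file_0_1_1.csv
# my_file_0_1_2.csv
```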

how to list all objects in s3 bucket having a specific character in the key using shell script

I have an S3 bucket and below is the directory structure.
bucketname/uid=/year=/month=/day=/files.parquet
In some cases, inside the year directory, I have some temporary objects created by Athena. Ex:
month=11_$folder$
I want to remove all of these objects whose key contains month=11_$folder$.
Currently I am doing this in a loop over all uids. Is there any faster way to do that?
Using the AWS CLI list-objects-v2, you can search for patterns:
aws s3api list-objects-v2 \
--bucket my-bucket \
--query 'Contents[?contains(Key, `month=11_$folder$`)]'
Note this will still query all your objects and only filter what is returned, so if you have more than 1,000 objects in your bucket, you'll need to paginate.
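To go from listing to deletion, one hedged option is to emit the matching keys as text and feed each one to aws s3 rm. The filter step can be sketched locally with grep -F; the bucket name and keys below are placeholders, and in the real pipeline the listing would come from the list-objects-v2 call above with --output text:

```shell
# Turn a key listing into delete commands for the Athena marker objects.
# grep -F treats `month=11_$folder$` as a literal string, so the `$`
# characters need no escaping.
printf '%s\n' \
  'uid=1/year=2020/month=11_$folder$' \
  'uid=1/year=2020/month=11/day=05/part-0.parquet' \
  'uid=2/year=2020/month=11_$folder$' \
| grep -F 'month=11_$folder$' \
| sed 's|^|aws s3 rm s3://bucketname/|'
# aws s3 rm s3://bucketname/uid=1/year=2020/month=11_$folder$
# aws s3 rm s3://bucketname/uid=2/year=2020/month=11_$folder$
```

Piping the generated commands to sh (or using xargs) runs the deletions; review the output first, since a too-loose pattern would delete real data.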

How to make Hadoop Snappy output files the same format as those generated by Spark

We are using Spark, and up until now the output has been PSV files. Now, in order to save space, we'd like to compress the output. To do so, we will change to saving the JavaRDD using the SnappyCodec, like this:
objectRDD.saveAsTextFile(rddOutputFolder, org.apache.hadoop.io.compress.SnappyCodec.class);
We will then use Sqoop to import the output into a database. The whole process works fine.
For previously generated PSV files in HDFS, we'd like to compress them in Snappy format as well. This is the command we tried:
hadoop jar /usr/hdp/2.6.5.106-2/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.5.106-2.jar \
-Dmapred.output.compress=true -Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
-Dmapred.reduce.tasks=0 \
-input input-path \
-output output-path
The command works fine. But the issue is, Sqoop can't parse the Snappy output files.
When we use a command like hdfs dfs -text hdfs-file-name to view the generated files, the output looks like below, with an "index"-like field added to each line:
0 2019-05-02|AMRS||5072||||3540||MMPT|0|
41 2019-05-02|AMRS||5538|HK|51218||1000||Dummy|45276|
118 2019-05-02|AMRS||5448|US|51218|TRADING|2282|HFT|NCR|45119|
I.e., extra values like "0", "41", "118" are added at the beginning of each line. Note that the .snappy files generated by Spark don't have this extra field.
Any idea how to prevent this extra field being inserted?
Thanks a lot!
These are not indexes but rather keys generated by TextInputFormat, as explained here.
The class you supply for the input format should return key/value
pairs of Text class. If you do not specify an input format class, the
TextInputFormat is used as the default. Since the TextInputFormat
returns keys of LongWritable class, which are actually not part of the
input data, the keys will be discarded; only the values will be piped
to the streaming mapper.
And since you do not have any mapper defined in your job, those key/value pairs are written straight out to the file system. So, as the above excerpt hints, you need some sort of mapper that discards the keys. A quick-and-dirty fix is to use something already available as a pass-through, like the shell cat command:
hadoop jar /usr/hdp/2.6.5.106-2/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.5.106-2.jar \
-Dmapred.output.compress=true -Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
-Dmapred.reduce.tasks=0 \
-mapper /bin/cat \
-input input-path \
-output output-path

Hadoop seq directory with index, data and bloom files -- how to read?

New to Hadoop... I have a series of HDFS directories with the naming convention filename.seq. Each directory contains an index, a data and a bloom file. These have binary content and appear to be SequenceFiles (the header starts with SEQ). I want to know the structure/schema. Everything I read refers to reading an individual sequence file, so I'm not sure how to read these or how they were produced. Thanks.
Update: I've tried the recommended tools for streaming & outputting text on the files; none worked:
hadoop fs -text /path/to/hdfs-filename.seq/data | head
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.1.2.jar \
-input /path/to/hdfs-filename.seq/data \
-output /tmp/outputfile \
-mapper "/bin/cat" \
-reducer "/bin/wc -l" \
-inputformat SequenceFileAsTextInputFormat
Error was:
ERROR streaming.StreamJob: Job not successful. Error: NA
The SEQ header confirms that it is a Hadoop sequence file. (One thing that I have never seen is the bloom file that you mentioned.)
The structure/schema of a typical sequence file is:
Header (version, key class, value class, compression, compression codec, metadata)
Record
  Record length
  Key length
  Key
  Value
A sync marker every few hundred bytes or so.
For more details:
see the description here.
Sequence file reader and How to read hadoop sequential file?
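A quick way to verify the header described above is to look at the first bytes: a sequence file starts with the 3-byte magic SEQ followed by a version byte. A minimal sketch, with printf standing in for `hadoop fs -cat /path/to/hdfs-filename.seq/data | head -c 4`, since the real bytes live on HDFS:

```shell
# Dump the 4-byte magic of a sequence file header; od -An -c prints
# the raw bytes as characters (the trailing \006 is the format version).
printf 'SEQ\006' | od -An -c
```

The key and value class names are stored as length-prefixed strings right after this magic, so piping a few hundred bytes of the data file through strings also reveals them.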

Amazon Elastic MapReduce: Output directory

I'm running through Amazon's example of running Elastic MapReduce and keep getting hit with the following error:
Error launching job , Output path already exists.
Here is the command to run the job that I am using:
C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --create --stream \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--input s3://elasticmapreduce/samples/wordcount/input \
--output [A path to a bucket you own on Amazon S3, such as, s3n://myawsbucket] \
--reducer aggregate
The example comes from here.
I'm following Amazon's directions for the output directory. The bucket name is s3n://mp.maptester321mark/. I've looked through all their suggestions for problems on this url.
Here is my credentials.json info:
{
  "access_id": "1234123412",
  "private_key": "1234123412",
  "keypair": "markkeypair",
  "key-pair-file": "C:/Ruby/elastic-mapreduce-cli/markkeypair",
  "log_uri": "s3n://mp-mapreduce/",
  "region": "us-west-2"
}
Hadoop jobs won't clobber directories that already exist. You just need to run:
hadoop fs -rmr <output_dir>
before your job, or just use the AWS console to remove the directory.
Use:
--output s3n://mp.maptester321mark/output
instead of:
--output s3n://mp.maptester321mark/
I suppose EMR creates the output bucket before running, which means your output directory / already exists if you specify --output s3n://mp.maptester321mark/, and that might be the reason why you get this error.
---> If the folder (bucket) already exists, then remove it.
---> If you delete it and you still get the above error, make sure your output looks like s3n://some_bucket_name/your_output_bucket rather than s3n://your_output_bucket/.
It's an issue with EMR! I think it first creates a bucket on the path (some_bucket_name) and then tries to create the (your_output_bucket).