Download files from s3 into Hive based on last modified? - hadoop

I would like to download a set of files whose last modified date falls within a certain time period, say 2015-5-6 to 2015-6-17. The contents of these files will be put directly into a Hive table for further processing.
I know that this is possible, but the approaches I have found work either for only one file or for an entire bucket. I would like to download all files in a bucket whose last modified date falls within a time range.
How can multiple files be downloaded into a Hive table based on the above requirement?

Did you try this?
CREATE EXTERNAL TABLE myTable (key STRING, value INT) LOCATION
's3n://mys3bucket/myDir/*'; or
's3n://mys3bucket/myDir/filename*' (if the file names start with something common)

This is possible using the AWS SDK for Java: a custom UDF or UDTF could be written to list the keys and return their last modified dates using:
S3ObjectSummary.getLastModified();
More info: AWS Java SDK Docs - S3ObjectSummary
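For example, here is a minimal sketch using the AWS SDK for Java (v1) that lists the keys under a prefix and keeps those whose last modified date falls inside the window; the bucket, prefix, and date format are placeholders taken from the question and the answer above:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.text.SimpleDateFormat;
import java.util.Date;

public class S3ModifiedRange {
    public static void main(String[] args) throws Exception {
        // Placeholder bucket/prefix and the date window from the question
        String bucket = "mys3bucket";
        String prefix = "myDir/";
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-M-d");
        Date from = fmt.parse("2015-5-6");
        Date to = fmt.parse("2015-6-17");

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        ListObjectsV2Request req = new ListObjectsV2Request()
                .withBucketName(bucket)
                .withPrefix(prefix);
        ListObjectsV2Result result;
        do {
            result = s3.listObjectsV2(req);
            for (S3ObjectSummary summary : result.getObjectSummaries()) {
                Date modified = summary.getLastModified();
                if (!modified.before(from) && !modified.after(to)) {
                    // Key is in the window: download it, or record it so it can be
                    // staged under one prefix for a Hive external table / LOAD DATA
                    System.out.println(summary.getKey() + " " + modified);
                }
            }
            req.setContinuationToken(result.getNextContinuationToken());
        } while (result.isTruncated());
    }
}

The matching keys can then be copied under a single prefix (or downloaded) so that a Hive external table or LOAD DATA statement picks them up from one location.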

Related

Transfer CSV files from azure blob storage to azure SQL database using azure data factory

I need to transfer around 20 CSV files inside a folder named ActivityPointer in an Azure Blob Storage container to an Azure SQL database in a single Data Factory pipeline. However, ActivityPointer also contains a folder named snapshots. When I use * to select all the CSV files inside ActivityPointer, the snapshots folder is included too, which it should not be. Is there any way to accomplish this? I also can't create another folder to move the snapshots folder into. What can I do now? Can anyone please help me out?
Assuming you want to copy all CSV files within the ActivityPointer folder,
you can use a wildcard expression as below:
provide the path up to the ActivityPointer folder and then *.csv
The Copy data activity also picks up the inner folder when using wildcards (even if we use *.csv in the wildcard file path). So we have to validate whether each child item is a file or a folder. The following demonstrates this.
First, use a Get Metadata activity on the required folder with the field list set to Child items.
Now use this list to iterate through the child items using a For Each activity:
@activity('Get Metadata1').output.childItems
Inside the For Each, use an If Condition activity to check whether the current item is a file. Use the following condition:
@equals(item().type,'File')
When this is true, use a Copy data activity to copy the file to the target table (the False case can be left empty). I have created a file_name parameter in my source dataset and pass its value as @item().name.
This will help you achieve your requirement. In my debug run I had 4 files and 1 folder; the folder is ignored, and the rest are copied into the target table.

Load new files only from FTP to BLOB Azure data factory

I am trying to copy files from an FTP server to Blob storage. The problem is that my pipeline copies all files, including the old ones. I would like to do an incremental load by only copying new files. How do you configure this? By the way, in my FTP dataset the parameters ModifiedStartDate and ModifiedEndDate are not showing. I would also like to configure these dates dynamically.
Thank you!
There's some work to be done in Azure Data Factory to get this to work. What you're trying to do, if I understand correctly, is to Incrementally Load New Files in Azure Data Factory. You can do so by looking up the latest modified date in the destination folder.
In short (see the above linked article for more information):
Use Get Metadata activity to make a list of all files in the Destination folder
Use For Each activity to iterate this list and compare the modified date with the value stored in a variable
If the value is greater than that of the variable, update the variable with that new value
Use the variable in the Copy Activity’s Filter by Last Modified field to filter out all files that have already been copied

Deleting files and directories in a remote hdfs based on their creation date in Java

I want to delete files in our hdfs based on their age (no of days).
The directory structure there has a fixed path followed by id/year/month/date/hour/min as their sub directories.
I am still a beginner here, but the obvious choice looks like iterating through every folder and then deleting.
But we are talking about millions of documents on an hourly basis.
I would like to know the best approach towards this.
based on their creation date in Java
Unclear if the "creation date" means the time the file was written to HDFS, or the date encoded in the file path. I'll assume it's the file path.
we are talking about millions of documents on an hourly basis
Doesn't really matter. You can delete entire folder paths, like a regular filesystem. Just use bash and the hdfs cli. If you need something special, all the CLI filesystem commands are mapped to Java classes.
Delete hdfs folder from java
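For reference, a minimal sketch of that Java route using the HDFS FileSystem API, assuming the fixed-path/id/year/month/date layout from the question; the base path, ID, and date are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteOldHdfsDirs {
    public static void main(String[] args) throws Exception {
        // Placeholder values; in practice compute the cutoff date and loop over IDs/dates
        String fixedPath = "/fixed/path"; // hypothetical base directory
        String id = "123";                // hypothetical id segment
        String datePart = "2015/06/17";   // year/month/date portion of the layout

        Configuration conf = new Configuration(); // picks up cluster config from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path target = new Path(fixedPath + "/" + id + "/" + datePart);
        if (fs.exists(target)) {
            // recursive delete removes the whole hour/minute subtree under that date
            fs.delete(target, true);
        }
        fs.close();
    }
}

The same loop-over-dates idea described below for bash applies here as well.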
If using bash, calculate the date using the date command, subtracting the number of days, and assign it to a variable, let's say d. Make sure it's formatted to match the directory structure.
Ideally, don't just calculate the day. You want years and months to be computed in the date subtraction calculation.
Then simply remove everything in the path
hadoop fs -rm -R "${FIXED_PATH}/id/${d}"
You can delete many dates in a loop - Bash: Looping through dates
The only reason you would need to iterate anything else is if you have dynamic IDs you're trying to remove
Another way would be to create a (partitioned) ACID-enabled Hive table over that data.
Simply execute a delete query similar to the one below (correctly accounting for the date formats):
DELETE FROM t
WHERE CONCAT(year, '-', month, '-', day) < date_sub(current_date(), ${d})
Schedule it in a cron (or Oozie) task to have it repeatedly clean out old data.

Build pipeline from Oracle DB to AWS DynamoDB

I have an Oracle instance running on a standalone EC2 VM, and I want to do two things.
1) Copy the data from one of my Oracle tables into a cloud directory that can be read by DynamoDB. This will only be done once.
2) Then daily I want to append any changes to that source table into the DynamoDB table as another row that will share an id so I can visualize how that row is changing over time.
Ideally I'd like a solution that would be as easy as piping the results of a SQL query into a program that dumps the data into a cloud file system (S3, HDFS?); then I will want to convert that data into a format that can be read by DynamoDB.
So I need these things:
1) A transport device: I want to be able to type something like this on the command line:
sqlplus ... "SQL Query" | transport --output_path --output_type etc etc
2) For the path I need a cloud file system; S3 looks like the obvious choice since I want a turnkey solution here.
3) This last part is a nice-to-have because I can always use a temp directory to hold my raw text and convert it in another step.
I assume the "cloud directory" or "cloud file system" you are referring to is S3? I don't see how it could be anything else in this context, but you are using very vague terms.
Triggering the DynamoDB insert whenever you copy a new file to S3 is pretty simple: just have S3 trigger a Lambda function to process the data and insert it into DynamoDB. I'm not clear on how you are going to get the data into S3, though. If you are just running a cron job to periodically query Oracle and dump the data to a file, which you then copy to S3, then that should work.
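This is not the poster's actual setup, but as a rough sketch of that trigger approach in Java: an S3 ObjectCreated event invokes a Lambda that reads the uploaded dump and writes one DynamoDB item per line. The table name, key attributes, and the simple comma split are all assumptions:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Hypothetical handler: fired by an S3 ObjectCreated event, reads the CSV dump
// produced by the Oracle export and writes one DynamoDB item per line.
public class OracleDumpToDynamo implements RequestHandler<S3Event, String> {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final AmazonDynamoDB ddb = AmazonDynamoDBClientBuilder.defaultClient();
    private final Table table = new DynamoDB(ddb).getTable("oracle_snapshots"); // assumed table name

    @Override
    public String handleRequest(S3Event event, Context context) {
        event.getRecords().forEach(record -> {
            String bucket = record.getS3().getBucket().getName();
            String key = record.getS3().getObject().getKey();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(s3.getObject(bucket, key).getObjectContent()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split(","); // assumes a plain comma-separated dump
                    table.putItem(new Item()
                            .withPrimaryKey("id", cols[0],               // shared id across daily loads
                                            "snapshot_ts", System.currentTimeMillis())
                            .withString("payload", line));
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        return "ok";
    }
}

Packaged with the aws-lambda-java-events and AWS SDK dependencies, this would run each time the periodic Oracle dump lands in the bucket.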
You need to know that you can't append to a file on S3; you would need to write an entire new file each time you push new data to S3. If you want to stream the data somehow, then using Kinesis instead of S3 might be a better option.

how to work on specific part of csv file uploaded into HDFS?

I'm new to Hadoop and I have a question: if I export a relational database into CSV files and then upload them into HDFS, how do I work on a specific part (table) of a file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files for each table and stored in HDFS. I presume you are referring to column(s) of data within the table(s) when you mention 'specific part (table)'. If so, place the individual CSV files into separate file paths, say /user/userName/dbName/tables/table1.csv
Now you can configure the job for the input path and field positions. You may consider using the default input format so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data, as sketched below.
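For illustration, a minimal mapper sketch under those assumptions; the delimiter and the column positions are placeholders:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: with the default TextInputFormat each call receives one CSV line;
// it emits only the columns of interest (here column 2 as the key and column 5 as the value).
public class CsvColumnMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int KEY_COLUMN = 2;   // assumed positions in the exported table
    private static final int VALUE_COLUMN = 5;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(","); // assumes no quoted commas in the export
        if (fields.length > VALUE_COLUMN) {
            context.write(new Text(fields[KEY_COLUMN]),
                          new IntWritable(Integer.parseInt(fields[VALUE_COLUMN].trim())));
        }
    }
}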
Cascading allows you to get started very quickly with MapReduce. It is a framework that lets you set up Taps to access sources (your CSV file) and process them inside a pipeline, say to (for example) add column A to column B and place the sum into column C by selecting them as Fields.
Use the BigTable approach, which means converting your database into one big table.
