Tracking file usage in Minio - minio

Is there any way to track usage of files in a Minio storage?
If I could trigger a script (one that keeps and periodically writes file usage info to a database), then I could find files that have not been used for a very long time (e.g. 1-2 years).
I want to perform periodic cleanup on buckets that are used for temporary files.

This can be accomplished using bucket notification. Read more about it here https://docs.min.io/docs/minio-bucket-notification-guide.html#PostgreSQL
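As a rough illustration of what the receiving script could do, here is a minimal sketch that records the latest event time per object and lists objects that have been idle for too long. It assumes the notification payload follows the S3-compatible event layout (a `Records` list with `eventTime`, `s3.bucket.name` and a URL-encoded `s3.object.key`); the table name `object_usage`, the function names, and the use of SQLite instead of PostgreSQL are assumptions made only to keep the example self-contained.

```python
import json
import sqlite3
from datetime import datetime, timedelta
from urllib.parse import unquote_plus


def open_usage_db(path=":memory:"):
    """Create (if needed) the hypothetical object_usage table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS object_usage ("
        "bucket TEXT, key TEXT, last_used TEXT, PRIMARY KEY (bucket, key))"
    )
    return db


def record_event(db, payload):
    """Store the event time of every object named in one notification payload."""
    for rec in json.loads(payload).get("Records", []):
        bucket = rec["s3"]["bucket"]["name"]
        key = unquote_plus(rec["s3"]["object"]["key"])  # keys arrive URL-encoded
        # The most recently received event overwrites the stored time.
        db.execute(
            "INSERT OR REPLACE INTO object_usage VALUES (?, ?, ?)",
            (bucket, key, rec["eventTime"]),
        )
    db.commit()


def stale_objects(db, now, max_idle_days=365):
    """Return (bucket, key) pairs whose last recorded event is older than the limit."""
    cutoff = (now - timedelta(days=max_idle_days)).strftime("%Y-%m-%dT%H:%M:%S")
    # ISO-8601 UTC timestamps compare correctly as plain strings.
    rows = db.execute(
        "SELECT bucket, key FROM object_usage WHERE last_used < ? "
        "ORDER BY bucket, key",
        (cutoff,),
    )
    return rows.fetchall()
```

Objects that never produced an event will not appear in the table at all, so a cleanup based on this sketch would probably also need creation events (or a one-off listing of the bucket) to seed it.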

Related

Incremental ETL using Glue

I need help processing incremental files.
Scenario: The source team creates a file every hour in S3 (hourly partitioned). I would like to process them every 4 hours. The Glue ETL job will read the S3 files (partitioned hourly), process them, and store the results in different S3 folders.
Note: The Glue ETL job is called from Airflow.
Question: How can I make sure that I only process the incremental files (let's say 4 files in each execution)?
Sounds like a use case for job bookmarks:
For example, your ETL job might read new partitions in an Amazon S3
file. AWS Glue tracks which partitions the job has processed
successfully to prevent duplicate processing and duplicate data in the
job's target data store.
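To make the idea concrete, the following is a small sketch of the bookkeeping a bookmark performs: remember which partitions were processed successfully, and on the next run pick up only the ones not yet seen. This is not Glue's API or its internal implementation; the JSON state file and the function names are hypothetical and exist only for illustration.

```python
import json
from pathlib import Path


def load_bookmark(path):
    """Return the set of partition keys processed by earlier runs."""
    p = Path(path)
    return set(json.loads(p.read_text())) if p.exists() else set()


def select_new(all_keys, processed):
    """Keep only the keys that no earlier run has processed."""
    return sorted(k for k in all_keys if k not in processed)


def commit_bookmark(path, processed, just_done):
    """Persist the state only after the run has succeeded."""
    Path(path).write_text(json.dumps(sorted(processed | set(just_done))))
```

With hourly files and a run every 4 hours, each run would see all keys under the source prefix, select the four new ones, process them, and commit only at the end, so a failed run is retried from the same point.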

Spark EMR S3 Processing Large No of Files

I have around 15000 ORC files in S3, where each file contains a few minutes' worth of data and the size of each file varies between 300 and 700MB.
Since recursively looping through a directory tree laid out in YYYY/MM/DD/HH24/MIN format is expensive, I create a file that contains the list of all S3 files for a given day (objects_list.txt) and pass this list as input to the Spark read API:
import scala.collection.mutable

val file_list = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/objects_list.txt"))
val paths: mutable.Set[String] = mutable.Set[String]()
for (line <- file_list.getLines()) {
  if (line.length > 0 && line.contains("part"))
    paths.add(line.trim)
}
val eventsDF = spark.read.format("orc").option("spark.sql.orc.filterPushdown","true").load(paths.toSeq: _*)
eventsDF.createOrReplaceTempView("events")
The cluster has 10 r3.4xlarge worker machines (each node: 120GB RAM and 16 cores), and the master is an m3.2xlarge.
The problem I am facing is that the Spark read runs endlessly. I see only the driver working while the rest of the nodes do nothing, and I am not sure why the driver is opening each S3 file for reading. AFAIK Spark works lazily, so reading shouldn't happen until an action is called; I think it is listing each file and collecting some metadata associated with it.
But why is only the driver working while all the other nodes do nothing, and how can I make this operation run in parallel on all worker nodes?
I have come across these articles: https://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 and https://gist.github.com/snowindy/d438cb5256f9331f5eec. However, there the entire file contents are read as an RDD, whereas in my use case only the blocks/columns that are actually referenced should be fetched from S3 (columnar access, given that ORC is my storage format). The files in S3 have around 130 columns, but only 20 fields are referenced and processed using the DataFrame APIs.
Sample Log Messages:
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=09/min=00/part-r-00199-e4ba7eee-fb98-4d4f-aecc-3f5685ff64a8.zlib.orc' for reading
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=19/min=00/part-r-00023-5e53e661-82ec-4ff1-8f4c-8e9419b2aadc.zlib.orc' for reading
Only one executor is running, and that is the driver program on one of the task nodes (cluster mode); CPU is at 0% on the rest of the nodes (i.e. the workers). Even after 3-4 hours of processing the situation is the same, given the huge number of files that have to be processed.
Any pointers on how I can avoid this issue, i.e. speed up the load and the processing?
There is a solution based on AWS Glue that can help you.
You have a lot of files in S3, but they are partitioned by timestamp. With Glue you can use your objects in S3 like "Hive tables" in your EMR cluster.
First you need to create an EMR cluster with version 5.8+, where you will be able to see this:
Check both options when setting it up. This allows the cluster to access the AWS Glue Data Catalog.
After this you need to add your root folder to the AWS Glue Catalog. The fastest way to do that is with the Glue Crawler. This tool will crawl your data and create the catalog as you need.
I suggest you take a look here.
After the crawler runs, the metadata of your table will be in the catalog, and you can see it in AWS Athena.
In Athena you can check whether your data was properly identified by the crawler.
This solution will make Spark behave much as it would on a real HDFS, because the metadata will be properly stored in the Data Catalog. Removing the time your app currently spends on "indexing" the files will allow the jobs to run faster.
Working this way I was able to improve the queries, and working with partitions was much better with Glue. So give it a try; it can probably help with the performance.

Is there a way to set a TTL for certain directories in HDFS?

I have the following requirements. I am adding date-wise data to a specific directory in HDFS, and I need to keep a backup of the last 3 sets, and remove the rest. Is there a way to set a TTL for the directory so that the data perishes automatically after a certain number of days?
If not, is there a way to achieve similar results?
This feature is not yet available on HDFS.
There was a JIRA ticket created to support this feature: https://issues.apache.org/jira/browse/HDFS-6382
But, the fix is not yet available.
You need to handle it using a cron job. You can create a job (this could be a simple Shell, Perl or Python script), which periodically deletes the data older than a certain pre-configured period.
This job could:
Run periodically (e.g. once an hour or once a day).
Take as input the list of folders or files that need to be checked, along with their TTL.
Delete any file or folder that is older than the specified TTL.
This can be achieved easily, using scripting.
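A minimal sketch of such a job is shown below. It assumes the usual `hdfs dfs -ls` output layout (permissions, replication, owner, group, size, date, time, path) and that the `hdfs` command is on the PATH; the function names are hypothetical, and the timestamps are compared in the local time of the machine running the script.

```python
import subprocess
from datetime import datetime, timedelta


def expired_paths(ls_output, ttl_days, now):
    """Return the paths in `hdfs dfs -ls` output last modified before now - ttl_days."""
    cutoff = now - timedelta(days=ttl_days)
    expired = []
    for line in ls_output.splitlines():
        parts = line.split(None, 7)
        if len(parts) != 8:  # skips the "Found N items" header and blank lines
            continue
        modified = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
        if modified < cutoff:
            expired.append(parts[7])
    return expired


def purge(directory, ttl_days):
    """List a directory and delete every entry older than the TTL."""
    listing = subprocess.run(
        ["hdfs", "dfs", "-ls", directory],
        check=True, capture_output=True, text=True,
    ).stdout
    for path in expired_paths(listing, ttl_days, datetime.now()):
        subprocess.run(["hdfs", "dfs", "-rm", "-r", path], check=True)
```

A cron entry could then call `purge` once a day for each directory and TTL pair. To keep a fixed number of the latest sets rather than a number of days, the same listing could be sorted by date and everything beyond the newest entries removed.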

What's the proper way to log big data to organize and store it with Hadoop, and query it using Hive?

So basically I have apps on different platforms that are sending logging data to my server. It's a node server that essentially accepts a payload of log entries and it saves them to their respective log files (as write stream buffers, so it is fast), and creates a new log file whenever one fills up.
The way I'm storing my logs is essentially one file per "endpoint", and each log file consists of space separated values that correspond to metrics. For example, a player event log structure might look like this:
timestamp user mediatype event
and the log entry would then look like this
1433421453 bob iPhone play
Based on reading the documentation, I think this format is good for something like Hadoop. The way I think this works is that I will store these logs on a server, then run a cron job that periodically moves these files to S3. From S3, I could use those logs as a source for a Hadoop cluster using Amazon's EMR. From there, I could query it with Hive.
Does this approach make sense? Are there flaws in my logic? How should I be saving/moving these files around for Amazon's EMR? Do I need to concatenate all my log files into one giant one?
Also, what if I add a metric to a log in the future? Will that mess up all my previous data?
I realize I have a lot of questions, that's because I'm new to Big Data and need a solution. Thank you very much for your time, I appreciate it.
If you have a large volume of log dumps that changes periodically, the approach you laid out makes sense. Using EMRFS, you can directly process the logs from S3 (which you probably know).
As you 'append' new log events to Hive, the part files will be produced. So, you don't have to concatenate them ahead of loading them to Hive.
(On day 0, the logs are in some delimited form and loaded to Hive; part files are produced as a result of various transformations. On subsequent cycles, new events/logs will be appended to those part files.)
Adding new fields on an ongoing basis is a challenge. You can create new data structures/sets and Hive tables and join them. But the joins are going to be slow. So, you may want to define fillers/placeholders in your schema.
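One way to picture the filler/placeholder idea is a reader that pads missing trailing fields, so that old rows remain valid after a metric is appended to the format. The sketch below uses the space-separated layout from the question; the extra `duration` column is a hypothetical later addition, and the approach only works if new fields are always added at the end.

```python
# "duration" is a hypothetical metric added after the first logs were written.
SCHEMA = ["timestamp", "user", "mediatype", "event", "duration"]


def parse_line(line, schema=SCHEMA):
    """Split a space-separated log entry and pad missing trailing fields with None."""
    values = line.split()
    values += [None] * (len(schema) - len(values))
    return dict(zip(schema, values))
```

An old entry such as `1433421453 bob iPhone play` then parses with `duration` set to `None`, while newer entries carry the real value.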
If you are going to receive streams of logs (lots of small log files/events) and need to run near real time analytics, then have a look at Kinesis.
(also test drive Impala. It is faster)
.. my 2c.

Map Reduce - How to plan the data files

I would like to use AWS EMR to query large log files that I will write to S3. I can design the files any way I like. The data is created at a rate of 10K entries/minute.
The logs consist of dozens of data points and I'd like to collect data for very long period of time (years) to compare trends etc.
What are the best practices for creating such files that will be stored on S3 and queried by AWS EMR cluster?
What are the optimal file sizes? Should I create separate files, for example on an hourly basis?
What is the best way to name the files?
Should I place them in daily/hourly buckets or all in the same bucket?
What's the best way to handle things like adding some data after a while, or a change in the data structure that I use?
Should I compress things, for example by leaving domain names out of URLs, or keep as much data as possible?
Is there any concept like partitioning (the data is based on 100s of websites, so I could use site ids, for example)? I must be able to query all the data together, or by partition.
Thanks!
In my opinion you should use hourly buckets to store the data in S3 and then use a pipeline to schedule your MR job to clean the data.
Once you have cleaned the data, you can keep it at a location in S3, and then run a data pipeline on an hourly basis, with a lag of 1 hour with respect to your MR pipeline, to load the processed data into Redshift.
Hence at 3am on a given day you will have 3 hours of processed data in S3 and 2 hours of it loaded into the Redshift DB.
To do this you can dedicate 1 machine to running pipelines, and on that machine you can define your shell, Perl, Python or similar script to load the data into your DB.
You can use the AWS bucketing formatter for year, month, date, hour and so on, e.g.
#{format(minusHours(@scheduledStartTime,2),'YYYY')}/mm=#{format(minusHours(@scheduledStartTime,2),'MM')}/dd=#{format(minusHours(@scheduledStartTime,2),'dd')}/hh=#{format(minusHours(@scheduledStartTime,2),'HH')}/*
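For readers who build the same hourly prefix in their own loader script rather than in a pipeline expression, here is a minimal sketch of the equivalent calculation. The function name is hypothetical, and the leading `yyyy=` key is an assumption, since the start of the expression above is cut off.

```python
from datetime import datetime, timedelta


def partition_prefix(scheduled_start, lag_hours=2):
    """Build the hourly partition prefix for the slot lag_hours before the run."""
    t = scheduled_start - timedelta(hours=lag_hours)
    # "yyyy=", "mm=", "dd=" and "hh=" are literal text; only the % codes are expanded.
    return t.strftime("yyyy=%Y/mm=%m/dd=%d/hh=%H")
```

Subtracting the lag before formatting matters around midnight: a run scheduled for 01:00 reads the 23:00 partition of the previous day, not hour "-1" of the current one.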

Resources