I'm writing a program that saves time series data from Kafka into Hadoop, and I designed the directory structure like this:
event_data
|-2016
|  |-01
|     |-data01
|     |-data02
|     |-data03
|-2017
   |-01
      |-data01
Because this is a daemon task, I wrote an LRU-based manager to manage the open files and close inactive files in time to avoid resource leaks. However, the incoming data stream is not sorted by time, so it's very common to reopen an existing file to append new data.
I tried using the FileSystem#append() method to open an OutputStream when the file already exists, but it failed with an error on my HDFS cluster (sorry, I can't provide the specific error because it was several months ago and I have since tried another solution).
Then I used another approach to achieve my goal:
Adding a sequence suffix to the file name when a file with the same name already exists. Now I have a lot of files in my HDFS, and it looks very messy.
My question is: what's the best practice for this situation?
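For readers trying to picture the LRU-based manager described in the question, here is a minimal sketch. It uses the local filesystem as a stand-in for HDFS, and the class name LruFileManager and its methods are hypothetical, not taken from the asker's program. It only illustrates the idea of keeping a bounded set of open handles and reopening in append mode.

```python
import os
import tempfile
from collections import OrderedDict


class LruFileManager:
    """Keeps at most `capacity` files open and closes the least recently used.

    Hypothetical illustration: local files stand in for HDFS output streams.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._open = OrderedDict()  # path -> open handle, oldest first

    def write(self, path, data):
        handle = self._open.pop(path, None)
        if handle is None:
            parent = os.path.dirname(path)
            if parent:
                os.makedirs(parent, exist_ok=True)
            # Append mode reopens an existing file without truncating it.
            handle = open(path, "a", encoding="utf-8")
        self._open[path] = handle  # re-insert as most recently used
        handle.write(data)
        # Evict the least recently used handles beyond the capacity.
        while len(self._open) > self.capacity:
            _, oldest = self._open.popitem(last=False)
            oldest.close()

    def open_paths(self):
        return list(self._open)

    def close_all(self):
        while self._open:
            _, handle = self._open.popitem()
            handle.close()
```

Whether this pattern is viable on a real cluster depends on append being enabled and working there, which is exactly the part the question reports as failing.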
Sorry that this is not a direct answer to your programming problem, but if you're open to all options rather than implementing it yourself, I'd like to share our experience with fluentd and its HDFS (WebHDFS) Output Plugin.
Fluentd is an open source, pluggable data collector with which you can build your data pipeline easily. It reads data from inputs, processes it, and then writes it to the specified outputs. In your scenario, the input is Kafka and the output is HDFS. What you need to do is:
Configure the fluentd input following the fluentd kafka plugin docs; you'll configure the source section with your Kafka/topic info.
Enable WebHDFS and the append operation for your HDFS cluster; the HDFS (WebHDFS) Output Plugin docs explain how to do it.
Configure your match section to write your data to HDFS; there's an example on the plugin docs page. To partition your data by month and day, you can configure the path parameter with time slice placeholders, something like:
path "/event_data/%Y/%m/data%d"
With this option collecting your data, you can then write your MapReduce job to do ETL or whatever you like.
I don't know if this is suitable for your problem; I'm just offering one more option here.
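To see how a path pattern such as the one in this answer maps event times to files, here is a small sketch. It assumes the placeholders follow strftime semantics; the function name hdfs_path_for is hypothetical and the code does not talk to fluentd or HDFS at all.

```python
from datetime import datetime

# The pattern from the answer; %Y, %m and %d are assumed to expand like strftime.
PATH_PATTERN = "/event_data/%Y/%m/data%d"


def hdfs_path_for(event_time):
    """Return the target path for a record with the given timestamp."""
    return event_time.strftime(PATH_PATTERN)


if __name__ == "__main__":
    print(hdfs_path_for(datetime(2016, 1, 3, 12, 30)))
```

Records from the same day resolve to the same path, which is why this approach relies on append being enabled on the cluster.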
Related
I'm new to big data! I have some questions about how to process and store a large number of small files (PDF and PPT/PPTX) in Spark on EMR clusters.
My goal is to save the data (PDF and PPTX) into HDFS (or some other datastore on the cluster), then extract the content of these files with Spark and save it in Elasticsearch or a relational database.
I have read about the small files problem when saving data in HDFS. What is the best way to store a large number of PDF and PPTX files (maximum size 100-120 MB)? I have read about Sequence Files and HAR (Hadoop archives), but I don't understand exactly how either of them works, and I can't figure out which is best.
What is the best way to process these files? I understand that some solutions could be FileInputFormat or CombineFileInputFormat, but again I don't know exactly how they work. I know that I can't run every small file as a separate task, because that would bottleneck the cluster.
Thanks!
If you use Object Stores (like S3) instead of HDFS then there is no need to apply any changes or conversions to your files and you can have them each as a single object or blob (this also means they are easily readable using standard tools and needn't be unpacked or reformatted with custom classes or code).
You can then read the files using Python tools like boto (for S3) or, if you are working with Spark, using the wholeTextFiles or binaryFiles command and then creating a BytesIO (Python) / ByteArrayInputStream (Java) to read them using standard libraries.
When processing the files, you have the distinction between items and partitions. If you have 10000 files, you can create 100 partitions containing 100 files each. Each file will need to be processed one at a time anyway, since the header information is relevant and likely different for each file.
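The items-versus-partitions idea can be sketched without Spark. The example below groups a list of (name, bytes) pairs into partitions and then handles each file on its own by wrapping its bytes in a BytesIO. The helper names partition, read_header, and process_partition are hypothetical, and reading the first four bytes is only a placeholder for real PDF or PPTX parsing.

```python
import io


def partition(items, num_partitions):
    """Split items into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def read_header(raw, length=4):
    """Treat raw bytes as a file object and read the first few bytes."""
    return io.BytesIO(raw).read(length)


def process_partition(files):
    # Each (name, bytes) pair is still handled one file at a time.
    return [(name, read_header(raw)) for name, raw in files]
```

In Spark, the grouping would normally be handled by the number of partitions of the RDD rather than by hand; this sketch only shows the shape of the work.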
Meanwhile, I found some solutions to the small files problem in HDFS. I can use the following approaches:
HDFS Federation helps distribute the load across namenodes: https://hortonworks.com/blog/an-introduction-to-hdfs-federation/
HBase could also be a good alternative if your files are not too large.
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
https://hbase.apache.org/book.html
Apache Ozone, which is an object store like S3 but on-premises. At the time of writing, as far as I know, Ozone is not production ready. https://hadoop.apache.org/ozone/
I have a file that gets aggregated and written into HDFS. This file stays open for an hour before it is closed. Is it possible to process this file using the MapReduce framework while it is open? I tried it, but it's not picking up all the appended data. I can query the data in HDFS and it is available, but not when it is read by MapReduce. Is there any way I could force MapReduce to read an open file? Perhaps by customizing the FileInputFormat class?
You can read what was physically flushed. Since close() performs the final flush of the data, your reads may miss some of the most recent data regardless of how you access it (MapReduce or command line).
As a solution, I would recommend periodically closing the current file and then opening a new one (with an incremented index suffix). You can run your MapReduce job on multiple files. You would still end up with some data missing from the most recent file, but at least you can control that through the frequency of your file "rotation".
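A minimal sketch of this rotation idea follows. It writes to the local filesystem rather than HDFS and rotates after a fixed number of records instead of on a timer; both are simplifying assumptions. The class name RotatingWriter and the suffix format are hypothetical.

```python
import os
import tempfile


class RotatingWriter:
    """Closes the current file after max_records and starts a new indexed one."""

    def __init__(self, directory, prefix, max_records):
        self.directory = directory
        self.prefix = prefix
        self.max_records = max_records
        self._index = 0
        self._count = 0
        self._handle = None

    def _path(self, index):
        return os.path.join(self.directory, f"{self.prefix}.{index:05d}")

    def write(self, record):
        if self._handle is None:
            self._handle = open(self._path(self._index), "w", encoding="utf-8")
        self._handle.write(record + "\n")
        self._count += 1
        if self._count >= self.max_records:
            self.rotate()

    def rotate(self):
        """Close the current file so that readers can see all of its data."""
        if self._handle is not None:
            self._handle.close()
            self._handle = None
            self._index += 1
            self._count = 0

    def closed_files(self):
        # Only fully closed files are safe inputs for a batch job.
        return [self._path(i) for i in range(self._index)]
```

A job would then take closed_files() as its input list, so only the file currently being written is left out.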
I am trying to understand how Hadoop works. Say I have 10 directories on HDFS containing hundreds of files that I want to process with Spark.
In the book Fast Data Processing with Spark:
This requires the file to be available on all the nodes in the cluster, which isn't much of a
problem for a local mode. When in a distributed mode, you will want to use Spark's
addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this. Will Spark create a copy of the file on each node?
What I want is for it to read the files present in that directory (if that directory is present on that node).
Sorry, I am a bit confused about how to handle the above scenario in Spark.
regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")
I am trying to run a MapReduce job on my cluster that only runs on a specific file extension. We have a bunch of heterogeneous data sitting on the cluster, and for this particular job I only want to execute on .jpg files. Is there a way this can be done without restricting it in the mapper? It seems like this should be something easy to do when you execute the job. I'm thinking of something like hadoop fs JobName /users/myuser/data/*.jpg /users/myuser/output.
Your example should work as written, but you'll want to check that you're calling the input format's setInputPaths(Job, String) method, as this will resolve the glob string "/users/myuser/data/*.jpg" into the individual jpg files in /users/myuser/data.
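To illustrate what resolving such a glob means, the sketch below emulates it in plain Python over a list of paths. This is not Hadoop's implementation; the function name expand_glob is hypothetical, and it only handles a wildcard in the final path component.

```python
import fnmatch
import posixpath


def expand_glob(pattern, all_paths):
    """Return the paths directly inside the pattern's directory whose file
    name matches the pattern's last component."""
    directory, name_pattern = posixpath.split(pattern)
    return sorted(
        path
        for path in all_paths
        if posixpath.dirname(path) == directory
        and fnmatch.fnmatchcase(posixpath.basename(path), name_pattern)
    )


if __name__ == "__main__":
    listing = [
        "/users/myuser/data/a.jpg",
        "/users/myuser/data/b.png",
        "/users/myuser/data/sub/c.jpg",
    ]
    print(expand_glob("/users/myuser/data/*.jpg", listing))
```

Only the matching files become job inputs, so the mapper never sees the other extensions.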
I want my MapReduce program to read from the standard input stream (System.in)
For example, in the run() method, how can I make my program read from System.in instead of from a file like this: FileInputFormat.addInputPath(job, new Path("dummy.txt"));
Also, what class should I set for job.setInputFormat(...)?
Use Hadoop Streaming to do this:
http://wiki.apache.org/hadoop/HadoopStreaming
It supports stdin and stdout.
I have not seen such an InputFormat in Hadoop. You will probably have to write System.in to a file from time to time and run the Hadoop job over the saved content every time you get new data.
This situation is common when using Hadoop to process log files that are generated continuously. In such a use case it's wise to collect the log file(s) on a daily or weekly basis and run the Hadoop job over them once you have them.
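The "save the stream, then run the job over the saved content" approach could be sketched as follows. The function name spool, the batch file naming, and batching by line count are all assumptions for illustration; a real setup might batch by time (daily or weekly, as suggested above) and upload each batch to HDFS before launching the job.

```python
import io
import os
import tempfile


def spool(stream, directory, lines_per_file):
    """Copy a line-oriented stream into numbered batch files and return their paths."""
    paths = []
    batch = []

    def flush():
        if batch:
            path = os.path.join(directory, f"batch-{len(paths):05d}.txt")
            with open(path, "w", encoding="utf-8") as out:
                out.writelines(batch)
            paths.append(path)
            batch.clear()

    for line in stream:
        batch.append(line)
        if len(batch) >= lines_per_file:
            flush()
    flush()  # write whatever is left when the stream ends
    return paths
```

Passing sys.stdin as the stream would spool standard input; each returned path is a finished file that a batch job can safely read.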