How to use Hadoop to process just part of the data

I'm a Hadoop newbie and I've run into a problem: some data is stored in Hadoop every day, and I run some processing on it at the same time. This processing may use all of the data, or only a part of it (e.g. only today's data). What is the best way to implement this?
Should I generate a single file per day, or just one file that keeps growing? I don't think Hadoop has a 'filter' mechanism like a MongoDB query, so if I only want to process today's data, isn't it wasteful to scan all of it?
Any advice will help, Thx!
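One way to picture the "one file (or directory) per day" option is a date-partitioned layout. The sketch below uses PySpark (rather than raw MapReduce) with made-up paths and column names, so treat it as an illustration of the idea rather than a prescription:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Daily ingest: tag today's records with a date column and append them to a
# date-partitioned dataset, producing directories like /data/events/date=2024-01-15/
events = spark.read.json("/incoming/2024-01-15/").withColumn("date", lit("2024-01-15"))
events.write.mode("append").partitionBy("date").parquet("/data/events/")

# Process only today's data: the filter on the partition column lets Spark
# skip every other day's directory instead of scanning everything.
today = spark.read.parquet("/data/events/").where("date = '2024-01-15'")

# Process all of the data: just read the parent directory.
all_days = spark.read.parquet("/data/events/")

The same idea works with plain MapReduce or Hive: keep one directory (or Hive partition) per day, point daily jobs at that day's path, and point full runs at the parent directory.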

Related

Data format and database choices with Spark/Hadoop

I am working on structured data (one value per field, the same fields for each row) that I have to put in a NoSQL environment with Spark (as the analysis tool) and Hadoop. However, I am wondering what format to use. I was thinking about JSON or CSV but I'm not sure. What do you think, and why? I don't have enough experience in this field to decide properly.
Second question: I have to analyse this data (stored in HDFS). As far as I know, I have two options for querying it (before the analysis):
Direct reading and filtering. I mean that it can be done with Spark, for example:
data = sqlCtxt.read.json(path_data)
Use HBase/Hive to run a proper query and then process the data.
So I don't know what the standard way of doing all this is and, above all, which will be the fastest.
Thank you in advance!
Use Parquet. I'm not sure about CSV, but definitely don't use JSON. In my personal experience, JSON with Spark was extremely slow to read from storage; after switching to Parquet my read times were much faster (e.g. some small files that took minutes to load as compressed JSON now load in under a second as compressed Parquet).
On top of faster reads, compressed Parquet can be split into partitions by Spark when reading, whereas a compressed JSON file cannot. This means Parquet can be loaded across multiple cluster workers, whereas JSON will be read onto a single node as one partition. That's a problem if your files are large: you'll get OutOfMemory exceptions, and your computations won't be parallelised, so you'll be executing on one node. That isn't the 'Sparky' way of doing things.
Final point: you can use SparkSQL to execute queries on stored Parquet files without having to read them into DataFrames first. Very handy.
Hope this helps :)
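For concreteness, here is a minimal PySpark sketch of the Parquet workflow described above (paths are placeholders, and it assumes a recent Spark with SparkSession rather than the older sqlCtxt from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One-off conversion: read the existing JSON and rewrite it as Parquet.
df = spark.read.json("/data/raw_json/")
df.write.mode("overwrite").parquet("/data/parquet/")

# Later reads come from Parquet and can be split across the cluster's workers.
df = spark.read.parquet("/data/parquet/")

# SparkSQL can also query the Parquet files in place, without registering a
# table or building a DataFrame by hand first.
spark.sql("SELECT COUNT(*) FROM parquet.`/data/parquet/`").show()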

Creating Spark DStreams from log archives

I am new to Spark; looks awesome!
I have gobs of hourly log files from different sources and want to create DStreams from them with a sliding window of ~5 minutes to explore correlations.
I'm just wondering what the best approach to accomplish this might be. Should I chop them up into 5-minute chunks in different directories? How would that naming structure be associated with a particular time slice across different HDFS directories? Do I implement a filter() method that knows about the log record's embedded timestamp?
Suggestions and RTFMs welcome.
thanks!
Chris
You can use Apache Kafka as the DStream source and then try the reduceByKeyAndWindow DStream function; it will create a window according to your required time.
See also: Trying to understand spark streaming windowing
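A rough sketch of what that could look like with the older DStream API (pyspark.streaming), which is where reduceByKeyAndWindow lives; the Kafka endpoint, topic, field layout and durations are all placeholder assumptions, and newer Spark versions would use Structured Streaming instead:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # needs the spark-streaming-kafka package

sc = SparkContext(appName="log-window")
ssc = StreamingContext(sc, batchDuration=60)  # 1-minute batches
ssc.checkpoint("/tmp/checkpoints")            # required for the inverse-reduce window below

# (key, value) pairs from Kafka; the value is the raw log line.
stream = KafkaUtils.createStream(ssc, "zk-host:2181", "log-consumer", {"logs": 1})

# Count events per source over a 5-minute window that slides every minute,
# assuming the source identifier is the first space-separated field.
counts = (stream
          .map(lambda kv: (kv[1].split(" ")[0], 1))
          .reduceByKeyAndWindow(lambda a, b: a + b,
                                lambda a, b: a - b,
                                windowDuration=300,
                                slideDuration=60))
counts.pprint()

ssc.start()
ssc.awaitTermination()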

What's the proper way to log big data to organize and store it with Hadoop, and query it using Hive?

So basically I have apps on different platforms that send logging data to my server. It's a Node server that essentially accepts a payload of log entries and saves them to their respective log files (as write-stream buffers, so it's fast), creating a new log file whenever one fills up.
The way I'm storing my logs is essentially one file per "endpoint", and each log file consists of space separated values that correspond to metrics. For example, a player event log structure might look like this:
timestamp user mediatype event
and the log entry would then look like this
1433421453 bob iPhone play
Based on reading the documentation, I think this format is a good fit for something like Hadoop. The way I think this works is that I will store these logs on a server, then run a cron job that periodically moves the files to S3. From S3, I could use those logs as a source for a Hadoop cluster using Amazon EMR, and from there I could query them with Hive.
Does this approach make sense? Are there flaws in my logic? How should I be saving/moving these files around for Amazon's EMR? Do I need to concatenate all my log files into one giant one?
Also, what if I add a metric to a log in the future? Will that mess up all my previous data?
I realize I have a lot of questions; that's because I'm new to Big Data and need a solution. Thank you very much for your time, I appreciate it.
If you have a large volume of log dumps that changes periodically, the approach you laid out makes sense. Using EMRFS, you can process the logs directly from S3 (as you probably know).
As you 'append' new log events to Hive, part files will be produced, so you don't have to concatenate them before loading them into Hive.
(On day 0, the logs are in some delimited form and get loaded into Hive; part files are produced as a result of various transformations. On subsequent cycles, new events/logs are appended to those part files.)
Adding new fields on an ongoing basis is a challenge. You can create new data structures/sets and Hive tables and join them, but the joins are going to be slow, so you may want to define filler/placeholder columns in your schema.
If you are going to receive streams of logs (lots of small log files/events) and need to run near real time analytics, then have a look at Kinesis.
(also test drive Impala. It is faster)
.. my 2c.
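To make the "query it with Hive" step concrete, here is a rough sketch of an external table over the space-delimited logs once they are in S3. It is written through PySpark's Hive support to stay in the same language as the other snippets on this page; the bucket, table, and column names are invented, and the timestamp/user fields are renamed to avoid clashing with Hive keywords:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# External table over the raw, space-delimited log files already sitting in S3
# (readable directly from EMR via EMRFS).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS player_events (
        event_ts  BIGINT,
        user_name STRING,
        mediatype STRING,
        event     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    LOCATION 's3://my-log-bucket/player-events/'
""")

# Example query: count events per media type.
spark.sql("SELECT mediatype, COUNT(*) AS n FROM player_events GROUP BY mediatype").show()

The same CREATE EXTERNAL TABLE statement can be run in the Hive CLI on the EMR cluster; nothing about the table definition depends on Spark specifically.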

Writing to multiple HCatalog schemas in single reducer?

I have a set of Hadoop flows that were written before we started using Hive. When we added Hive, we configured the data files as external tables. Now we're thinking about rewriting the flows to output their results using HCatalog. Our main motivation to make the change is to take advantage of the dynamic partitioning.
One of the hurdles I'm running into is that some of our reducers generate multiple data sets. Today this is done with side-effect files, so we write out each record type to its own file in a single reduce step, and I'm wondering what my options are to do this with HCatalog.
One option obviously is to have each job generate just a single record type, reprocessing the data once for each type. I'd like to avoid this.
Another option for some jobs is to change our schema so that all records are stored in a single schema. Obviously this option works well if the data was just broken apart for poor-man's partitioning, since HCatalog will take care of partitioning the data based on the fields. For other jobs, however, the types of records are not consistent.
It seems that I might be able to use the Reader/Writer interfaces to pass a set of writer contexts around, one per schema, but I haven't really thought it through (and I've only been looking at HCatalog for a day, so I may be misunderstanding the Reader/Writer interface).
Does anybody have any experience writing to multiple schemas in a single reduce step? Any pointers would be much appreciated.
Thanks.
Andrew
As best I can tell, the proper way to do this is to use the MultiOutputFormat class. The biggest help for me was the TestHCatMultiOutputFormat test in Hive.
Andrew

Hadoop MultipleOutputFormats to HFileOutputFormat and TextOutputFormat

I am running an ETL job with Hadoop where I need to output the valid, transformed data to HBase, and an external index for that data into MySQL. My initial thought is that I could use MultipleOutputFormats to export the transformed data with HFileOutputFormat (key is Text and value is ProtobufWritable), and an index to TextOutputFormat (key is Text and value is Text).
The number of input records for an average-sized job (I'll need the ability to run many at once) is about 700 million.
I'm wondering if A) this seems to be a reasonable approach in terms of efficiency and complexity, and B) how to accomplish this with the CDH3 distribution's API, if possible.
If you're using the old MapReduce API then you can use MultipleOutputs and write to multiple output formats.
However, if you're using the new MapReduce API, I'm not sure that there is a way to do what you're trying to do. You might have to pay the price of running another MapReduce job on the same inputs, but I'll have to do more research before saying for sure. There might be a way to hack the old and new APIs together to allow you to use MultipleOutputs with the new API.
EDIT: Have a look at this post. You can probably implement your own OutputFormat and wrap the appropriate RecordWriters in the OutputFormat and use that to write to multiple output formats.
