Using Hadoop to "bucket" data out with a single run - hadoop

Is it possible to use one Hadoop job run to output data to different directories based on keys?
My use case is server access logs. Say I have them all together, but I want to split them out based on some common URL patterns.
For example,
Anything that starts with /foo/ should go to /year/month/day/hour/foo/file
Anything that starts with /bar/ should go to /year/month/day/hour/bar/file
Anything that doesn't match should go to /year/month/day/hour/other/file
There are two problems here (from my understanding of Map Reduce): first, I'd prefer to just iterate over my data one time, instead of running one "grep" job per URL type I'd like to match. How would I split up the output, though? If I key the first with "foo", second with "bar", and rest with "other" then don't they all still go to the same reducers? How do I tell Hadoop to output them into different files?
The second problem is related (maybe the same?), I need to break output up by the timestamp in the access log line.
I should note that I'm not looking for code to solve this, but rather the proper terminology and high level solution to look into. If I have to do it with multiple runs, that's alright, but I can't run one "grep" for each possible hour (to make a file for that hour), there must be another way?

You need to partition the data just as you describe. Then you need to have multiple output files. See here (Generating Multiple Output files with Hadoop 0.20+).

Related

How to implement the equivalent of the Aggregator EIP in Nifi

I'm very experienced with Apache Camel and EIPs and am struggling to understand how to implement equivalents in Nifi. I understand that Nifi uses a different paradigm (flow based programming) but I don't think what I'm trying to do is unreasonable.
In a nutshell I want the contents of each file to be sent to many rest services and I want to aggregate the responses into a single document which will stored in elasticsearch. I might also do some further processing and cleanup to improve what is stored (but this isn't my immediate issue)
The screenshot is a quick mock-up of what I'm trying to achieve but I don't understand enough about Nifi to know how to implement this pattern correctly.
If you are going to take a single piece of data and then fork to multiple parts of the flow and then converge back, there needs to be a way for MergeContent to know which pieces go together.
There are generally two ways this can be done...
The first is using MergeContent in "defragment mode". Think of this as reversing a split operation that was performed by one of the split processors like SplitText. For example, you split a file of 100 lines into 100 flow files of 1 line each, then do some stuff to each one, then want to converge back. The split processors produce a standard set of split attributes (described in the docs of the processors) and the defragment mode knows how to bin the splits accordingly and merge them back together. This probably doesn't apply to your example since you didn't start with a split processor.
The second approach is the "Correlation Attribute" in MergeConent. This tells merge content to only merge flow files together that have the same value for the attribute specified. In your example, when a file gets picked up by GetFile and sent to 3 InvokeHttp processors, there are 3 flow files created, and they all should have their "filename" attribute set to the name of the file picked up from disk. So telling MergeContent to correlate on filename should do the trick, and probably setting the min and max number of entries to the number you expect like 3, and a maximum time in case one of them fails or hangs.

Is snakemake the right tool to use for handling output mediated workflows

I'm new to trying out snakemake (last week or so) in order to handle less of the small details for workflows, previously I have coded up my own specific workflow through python.
I generated a small workflow which among the steps would use Illumina PE reads and ran Kraken against them. I'd then parse the output of the Kraken output to detect the most common species (within a set of allowable) if a species value wasn't provided (running with snakemake -s test.snake --config R1_reads= R2_reads= species=''.
I have 2 questions.
What is the recommended approach given the dynamic output/input?
Currently my strategy for this is to create a temp file which
contains the detected species and then cat {input.species} it into
other shell commands. This doesn't seem elegant but looking through
the docs I couldn't quite find an adequate alternative. I noticed
PersistentDicts would let me pass variables between run: commands
but I'm unsure if I can use that to load variables into a shell:
section. I also noticed that wrappers could allow me to handle it
however from the point I need that variable on I'd be wrapping the
remainder of my workflow.
Is snakemake the right tool if I want to use the species afterwards to run a set of scripts specific to the species (with multiple species specific workflows)?
Right now my impression on how to solve this problem is to have multiple workflow files for the species and have a run with switch which calls the associated species workflow dependant on the species.
Appreciate any insight on these questions.
-Kim
You can mark output as dynamic (e.g. expecting one file per species). Then, Snakemake will determine the downstream DAG of jobs after those files have been generated. See http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files

Ruby PStore file too large

I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2GB from what I can see) I am not able to write the file to disk anymore and I receive the following error;
Errno::EINVAL: Invalid argument - <filename>
I am aware that this is probably a limitation of IO but I was wondering whether there is a workaround. For example, to read large JSON files, I would first split the file and then read it in parts. Probably the definitive solution should be to switch to a proper database in the backend, but because of some limitations of the specific Ruby (Sketchup) I am using this is not always possible.
I am going to assume that your data has a field that could be used as a crude key.
Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets.
For example, if your data has a name field, you could take the first 1-4 chars of the name, create a file with those chars like rojj-datafile.pstore and add the entry there. Any records with a name starting 'rojj' go in that file.
A more structured version is to take the first char as a directory, then put the file inside that, like r/rojj-datafile.pstore.
Obviously your mechanism for reading/writing will have to take this new file structure into account, and it will undoubtedly end up slower to process the data into the pstores.

Passing variable between python scripts through shell script

I can't think of a way of doing what I am trying to do and hoping for a little advice. I am working with data on a computing cluster, and would like to process individual files on separate computing nodes. The workflow I have right now is something like the following:
**file1.py**
Get files, parameters, other info from user
Then Call: file2.sh
**file2.sh**
Submit file3.py to computing node
**file3.py**
Process input file with parameters given
What I am trying to do is call file2.sh and pass it each input data file one at a time so that there are multiple instances of file3.py running, one per file. Is there a good way to do this?
I suppose that the root of the problem is that if i were to iterate through a list of input files in file1.py I don't know how to then pass that information to file2.sh and then on to file3.py.
From this description, I'd say the the straightforward way is to call file2.sh directly from Python.
status, result = commands.getstatusoutput("file2.sh" + arg_string)
Is that enough of a start to get you moving? Are the nodes conversant enough for one to launch a command directly on another? If not, you may want to consider looking up "interprocess communication" on Linux. If they're not even on the same Internet node, you'll likely need REST commands (post and get operations), from whence things grow more overhead.

How to create hadoop input splits that span two files?

My data input files are all of the same length, but, the records therein may span two files (starting at the end of the first file and finishing at the beginning of the second).
Is it possible to create an inputsplit that would allow me to span those two files?
Is it better to create an entirely new set of files so that records do not span more than one file?
I would definitely ensure your records do not span more than one file: you could, theoretically, write your own input format that takes care of this, but the overhead is likely to be considerable as you are - in having to ensure that you know which files belong together - taking over part of the responsiblity which the jobtracker and name node fulfill for you.
You should be free to tell the jobtracker/name node where the inputs are, and for the processing to be truly parallel, you don't want to then have to take back some of that control: IMHO it would partially defeat the object of using haoop in the first place.

Resources