Spark read with filter

I am working with the Spark Java API. I am trying to read a file from a directory and filter some lines out. My code looks something like this:
final JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaRDD<String> textFile = jsc.textFile("/path/to/some/file");
// First read...
JavaRDD<Msg> parsedMessages = textFile.map(....);
// Then filter
JavaRDD<Msg> queryResults = parsedMessages.filter(....);
Is there a way to combine the read and filter operation into the same operation? Something like read with filter? I have a very specific requirement where I have to query a very large data set, but I get a relatively small result set back. I then have to do a series of transformations and calculations on that filtered data. I don't want to read the whole data set into memory and then filter it out. I don't have that much memory. What I would like to do instead is to filter it at read time so only the lines matching some Regex would be read in. Is this possible to do with Spark?

Spark doesn't execute the code exactly as you write it: transformations are lazy, and Spark builds an execution plan before anything runs. The way this code is written (read, map and filter, with no shuffle in between), Spark will actually perform the read, the map transformation and the filter for each line as it is read, i.e. it won't need all the data in memory at once.

At least with SparkContext.textFile there is no such option, but it shouldn't be a problem. Nothing requires all of the data to reside in memory at any point, other than when you collect results on the driver. Data is read in chunks, and you can decrease the size of an individual split by raising the minPartitions parameter.
My advice is to apply a normal filter operation as early as you can and persist the resulting RDD to avoid recomputation.
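To make the "each line is mapped and filtered as it is read" point concrete, here is a minimal sketch in plain Java. It is an analogy using java.util.stream, not Spark code, and the class and method names are made up for illustration. It records the order in which the map and filter steps see each line:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java analogy (NOT Spark): shows that a lazy map -> filter pipeline
// handles one element end to end before touching the next one.
final class LazyPipelineSketch {

    // Returns the sequence of pipeline events for the given input lines.
    static List<String> trace(List<String> lines, Pattern wanted) {
        List<String> events = new ArrayList<>();
        lines.stream()
                .map(line -> {                       // stand-in for the parse step
                    events.add("map:" + line);
                    return line.trim();
                })
                .filter(line -> {                    // stand-in for the query filter
                    events.add("filter:" + line);
                    return wanted.matcher(line).find();
                })
                .forEach(line -> events.add("keep:" + line));
        return events;
    }

    public static void main(String[] args) {
        // Events for "a1" all appear before any event for "b2".
        System.out.println(trace(List.of("a1", "b2", "a3"), Pattern.compile("^a")));
    }
}
```

The events come out interleaved per line (map, filter, keep for the first line, then the second line, and so on), which is the same shape of execution the answer describes for an RDD read followed by map and filter without a shuffle.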

Related

Apache Nifi MergeContent output data inconsistent?

Fairly new to using nifi. Need help with the design.
I am trying to create a simple flow with dummy csv files(for now) in HDFS dir and prepend some text data to each record in each flowfile.
Incoming files:
dummy1.csv
dummy2.csv
dummy3.csv
contents:
"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8
"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",BarryFrench,293,457.81,208.16,68.02,Nunavut,Appliances,0.58
"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",Barry French,293,46.71,8.69,2.99,Nunavut,Binders and Binder Accessories,0.39
...
Desired output:
d17a3259-0718-4c7b-bee8-924266aebcc7,Mon Jun 04 16:36:56 EDT 2018,Fellowes Recycled Storage Drawers,Allen Rosenblatt,11137,395.12,111.03,8.64,Northwest Territories,Storage & Organization,0.78
25f17667-9216-4f1d-b69c-23403cd13464,Mon Jun 04 16:36:56 EDT 2018,Satellite Sectional Post Binders,Barry Weirich,11202,79.59,43.41,2.99,Northwest Territories,Binders and Binder Accessories,0.39
ce0b569f-5d93-4a54-b55e-09c18705f973,Mon Jun 04 16:36:56 EDT 2018,Deflect-o DuraMat Antistatic Studded Beveled Mat for Medium Pile Carpeting,Doug Bickford,11456,399.37,105.34,24.49,Northwest Territories,Office Furnishings,0.61
The flow: SplitText -> ReplaceText -> MergeContent
(This may be a poor way to achieve what I am trying to get, but I saw somewhere that a UUID is the best bet when it comes to generating a unique session id. So I thought of extracting each line of the incoming data into its own flowfile and generating a UUID for it.)
But somehow, as you can see, the order of the data gets mixed up. The first 3 rows of the output are not the same as in the input. The rows do seem to be present in the output for my test data (50000 entries), just on some other line. Multiple tests show that the order usually changes after the 2001st line.
And yes, I did search for similar issues here and tried using the defragment strategy in the merge, but it didn't work. I would appreciate it if someone could explain what is happening here and how I can get the data in the same order, with a unique session_id and timestamp for each record. Is there some parameter I need to change to get the correct output? I am open to suggestions if there is a better way as well.
First of all thank you for such an elaborate and detailed response. I think you cleared a lot of doubts I had as to how the processor works!
The ordering of the merge is only guaranteed in defragment mode because it will put the flow files in order according to their fragment index. I'm not sure why that wouldn't be working, but if you could create a template of a flow with sample data that showed the problem it would be helpful to debug.
I will try to replicate this method using a clean template again. Could be some parameter problem and the HDFS writer not able to write.
I'm not sure if the intent of your flow is to just re-merge the original CSV that was split, or to merge together several different CSVs. Defragment mode will only re-merge the original CSV, so if ListHDFS picked up 10 CSVs, after splitting and re-merging, you should again have 10 CSVs.
Yes, that is exactly what I need: split the data and join it back into the corresponding files. I don't specifically (yet) need to join the outputs again.
The approach of splitting a CSV down to 1 line per flow file to manipulate each line is a common approach, however it won't perform very well if you have many large CSV files. A more efficient approach would be to try and manipulate the data in place without splitting. This can generally be done with the record-oriented processors.
I used this approach purely instinctively and did not realize it is a common method. Sometimes the data file could be very large, meaning more than a million records in a single file. Won't that be an issue for the I/O in the cluster? Because that would mean each record = one flowfile = one unique UUID. What is a comfortable number of flowfiles that NiFi can handle? (I know it depends on the cluster config, and I will try to get more info about the cluster from the HDP admin.)
What do you mean by "try and manipulate the data in place without splitting"? Can you give an example, a template, or a processor to use?
In this case you would need to define a schema for your CSV which included all the columns in your data, plus the session id and timestamp. Then using an UpdateRecord processor you would use record path expressions like /session_id = ${UUID()} and /timestamp = ${now()}. This would stream the content line by line and update each record and write it back out, keeping it all as one flow file.
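As an illustration of the idea only (this is plain Java, not NiFi and not the UpdateRecord processor), a minimal sketch of "stream line by line, update each record, keep it all as one file" could look like the following. The class and method names are hypothetical, and it assumes one record per line, i.e. no quoted CSV fields that contain line breaks:

```java
import java.util.UUID;
import java.util.stream.Stream;

// Illustration only: prepend a fresh UUID and a timestamp to every line of a
// stream without splitting the input into one unit per record.
final class PrependSessionFields {

    // 'timestamp' is supplied by the caller, e.g. new java.util.Date().toString().
    // The returned stream is lazy and preserves the order of the input lines.
    static Stream<String> enrich(Stream<String> lines, String timestamp) {
        return lines.map(line -> UUID.randomUUID() + "," + timestamp + "," + line);
    }

    public static void main(String[] args) {
        enrich(Stream.of("a,1", "b,2"), new java.util.Date().toString())
                .forEach(System.out::println);
    }
}
```

Because the records never leave their original stream, their order is kept and there is nothing to re-merge afterwards.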
This looks promising. Can you share a simple template pulling files from hdfs>processing>write hdfs files but without splitting?
I am reluctant to share the template due to restrictions. But let me see if I can create a generic template, and I will share it.
Thank you for your wisdom! :)

Analyse phase of Total Order partitioning

From the book MapReduce Design Patterns:
"You need to run it only once if the distribution of your data does not change quickly over time, because the value ranges it produces will continue to perform well."
I could not get what is meant by this statement. Is it a general observation, or can it actually be implemented when using a TotalOrderPartitioner?
Can we somehow ask the TotalOrderPartitioner not to create a partition file and instead use one that has already been created?
Basically, can I skip the analyse phase when using a TotalOrderPartitioner?
It can easily be implemented when using a TotalOrderPartitioner:
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile); // use existing file!!!
// InputSampler.writePartitionFile(job, sampler); // Just comment out this line!!!
Pay attention to this, from the javadoc:
public static void setPartitionFile(Configuration conf, Path p)
// Set the path to the SequenceFile storing the sorted partition keyset.
// It must be the case that for R reduces, there are R-1 keys in the SequenceFile.
If you re-run the sort and your data has changed only slightly, so that the samples still represent it well, you can reuse the existing partition file, since creating it on the client with InputSampler is expensive. But you have to use the same number of reducers as in the job for which InputSampler created the partition file.
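To see why the reducer count has to match the partition file, here is a simplified sketch of how R-1 sorted split keys define R key ranges. It is an illustration in plain Java with made-up names, not the Hadoop implementation, and it assumes String keys in natural order:

```java
import java.util.Arrays;

// Illustration only: map a key to a partition using sorted split points.
// With R reducers there are R - 1 split points, giving R key ranges.
final class SplitPointLookup {

    // 'splitPoints' must be sorted ascending.
    static int partitionFor(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Exact hit at index i: the key starts range i + 1.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is the index of the range the key falls into.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] splitPoints = {"g", "n", "t"};   // 3 keys -> 4 partitions
        for (String key : new String[] {"a", "g", "h", "z"}) {
            System.out.println(key + " -> partition " + partitionFor(key, splitPoints));
        }
    }
}
```

Three split keys can only ever yield partitions 0 to 3, so a partition file sampled for four reducers does not fit a job configured with a different number of reducers.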

Hadoop Map-Reduce OutputFormat for assigning result to in-memory variable (not files)?

(from a Hadoop newbie)
I want to avoid files where possible in a toy Hadoop proof-of-concept example. I was able to read data from non-file-based input (thanks to http://codedemigod.com/blog/?p=120) - which generates random numbers.
I want to store the result in memory so that I can do some further (non-Map-Reduce) business logic processing on it. Essentially:
conf.setOutputFormat(InMemoryOutputFormat.class);
JobClient.runJob(conf);
Map result = conf.getJob().getResult(); // ?
The closest thing that seems to do what I want is store the result in a binary file output format and read it back in with the equivalent input format. That seems like unnecessary code and unnecessary computation (am I misunderstanding the premises which Map Reduce depends on?).
The problem with this idea is that Hadoop has no notion of "distributed memory". If you want the result "in memory" the next question has to be "which machine's memory?" If you really want to access it like that, you're going to have to write your own custom output format, and then also either use some existing framework for sharing memory across machines, or again, write your own.
My suggestion would be to simply write to HDFS as normal, and then for the non-MapReduce business logic just start by reading the data from HDFS via the FileSystem API, i.e.:
FileSystem fs = new JobClient(conf).getFs();
Path outputPath = new Path("/foo/bar");
FSDataInputStream in = fs.open(outputPath);
// read data and store in memory
fs.delete(outputPath, true);
Sure, it does some unnecessary disk reads and writes, but if your data is small enough to fit in-memory, why are you worried about it anyway? I'd be surprised if that was a serious bottleneck.

hadoop - How can i use data in memory as input format?

I'm writing a mapreduce job, and I have the input that I want to pass to the mappers in the memory.
The usual way to pass input to the mappers is via HDFS, with SequenceFileInputFormat or TextInputFormat. These input formats need files in HDFS, which are loaded and split among the mappers.
I can't find a simple way to pass, let's say, a List of elements to the mappers.
I find myself having to write these elements to disk and then use FileInputFormat.
Any solution?
I'm writing the code in Java, of course.
Thanks.
An input format does not have to load data from disk or a file system.
There are also input formats that read data from other systems, such as HBase (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html), where the data is not assumed to sit on disk. It is only assumed to be available via some API on all nodes of the cluster.
So you need to implement an input format that splits the data according to your own logic (since there are no files, this is up to you) and chops the data into records.
Please note that your in-memory data source should be distributed and run on all nodes of the cluster. You will also need some efficient IPC mechanism to pass data from your process to the Mapper process.
I would also be glad to know what use case leads to this unusual requirement.
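As a rough sketch of the "split the data in your own logic" part only, the following plain Java divides an in-memory list into contiguous, near-equal chunks, one per intended mapper. It does not implement the Hadoop InputFormat or InputSplit interfaces and does not address how the data reaches the mapper processes; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: the kind of splitting logic a custom input format
// would need when there are no files to derive splits from.
final class InMemorySplitter {

    // Cuts 'items' into 'numSplits' contiguous chunks whose sizes differ by at most one.
    static <T> List<List<T>> split(List<T> items, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        List<List<T>> splits = new ArrayList<>();
        int n = items.size();
        for (int i = 0; i < numSplits; i++) {
            // long arithmetic avoids int overflow for very large lists
            int from = (int) ((long) n * i / numSplits);
            int to = (int) ((long) n * (i + 1) / numSplits);
            splits.add(new ArrayList<>(items.subList(from, to)));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(split(List.of(1, 2, 3, 4, 5, 6, 7), 3));
    }
}
```

Each chunk would then be turned into records by the record reader belonging to that split.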

Is it possible to use Pig streaming (StreamToPig) in a way that handles multiple lines as a single input tuple?

I'm streaming data in a Pig script through an executable that returns an XML fragment for each line of input I stream to it. That XML fragment happens to span multiple lines, and I have no control whatsoever over the output of the executable I stream to.
In relation to "Use Hadoop Pig to load data from text file w/ each record on multiple lines?", the answer there suggested writing a custom record reader. The problem is that this works fine if you want to implement a LoadFunc that reads from a file, but to be usable with streaming it has to implement StreamToPig. As far as I understand, StreamToPig only lets you read one line at a time.
Does anyone know how to handle such a situation?
If you are absolutely sure, then one option is to manage it inside the streaming solution. That is to say, you build up the tuple yourself, and when you hit whatever your desired size is, you do the processing and return a value. In general, EvalFuncs in Pig have this issue.
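A minimal sketch of that buffering idea, in plain Java rather than against the Pig StreamToPig interface: lines are accumulated until a fragment is complete, and only then is a value handed back. The class name is hypothetical, and the sketch assumes each fragment ends with a known closing tag on its own line; a fixed line count could serve as the trigger instead.

```java
import java.util.Optional;

// Illustration only: collect single lines into one multi-line record.
final class FragmentAssembler {
    private final String endMarker;                       // e.g. the closing tag of the fragment
    private final StringBuilder buffer = new StringBuilder();

    FragmentAssembler(String endMarker) {
        this.endMarker = endMarker;
    }

    // Feed one line; returns the whole fragment once the end marker is seen,
    // otherwise an empty Optional while the fragment is still incomplete.
    Optional<String> accept(String line) {
        if (buffer.length() > 0) {
            buffer.append('\n');
        }
        buffer.append(line);
        if (line.trim().endsWith(endMarker)) {
            String fragment = buffer.toString();
            buffer.setLength(0);                          // reset for the next fragment
            return Optional.of(fragment);
        }
        return Optional.empty();
    }
}
```

A caller would invoke accept once per incoming line and build a tuple only when it returns a non-empty value.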
