Fairly new to using NiFi. Need help with the design.
I am trying to create a simple flow with dummy CSV files (for now) in an HDFS directory, and prepend some text data to each record in each flowfile.
Incoming files:
dummy1.csv
dummy2.csv
dummy3.csv
contents:
"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8
"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",BarryFrench,293,457.81,208.16,68.02,Nunavut,Appliances,0.58
"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",Barry French,293,46.71,8.69,2.99,Nunavut,Binders and Binder Accessories,0.39
...
Desired output:
d17a3259-0718-4c7b-bee8-924266aebcc7,Mon Jun 04 16:36:56 EDT 2018,Fellowes Recycled Storage Drawers,Allen Rosenblatt,11137,395.12,111.03,8.64,Northwest Territories,Storage & Organization,0.78
25f17667-9216-4f1d-b69c-23403cd13464,Mon Jun 04 16:36:56 EDT 2018,Satellite Sectional Post Binders,Barry Weirich,11202,79.59,43.41,2.99,Northwest Territories,Binders and Binder Accessories,0.39
ce0b569f-5d93-4a54-b55e-09c18705f973,Mon Jun 04 16:36:56 EDT 2018,Deflect-o DuraMat Antistatic Studded Beveled Mat for Medium Pile Carpeting,Doug Bickford,11456,399.37,105.34,24.49,Northwest Territories,Office Furnishings,0.61
The flow:
SplitText ->
ReplaceText ->
MergeContent
(This may be a poor way to achieve what I am trying to get, but I saw somewhere that a UUID is the best bet when it comes to generating a unique session id, so I thought of splitting the incoming data into one flowfile per line and generating a UUID for each.)
But somehow, as you can see, the order of the data is getting mixed up: the first 3 rows of the output are not the same as the input. With the test data I am using (50,000 entries), records end up on different lines than in the input; multiple tests show the order usually changes after the 2001st line.
And yes, I did search for similar issues here and tried using the Defragment merge strategy, but it didn't work. I would appreciate it if someone could explain what is happening here and how I can get the data back in the same order, with a unique session_id and timestamp for each record. Is there some parameter I need to change or modify to get the correct output? I am open to suggestions if there is a better way as well.
First of all, thank you for such an elaborate and detailed response. You cleared up a lot of doubts I had about how the processors work!
The ordering of the merge is only guaranteed in defragment mode because it will put the flow files in order according to their fragment index. I'm not sure why that wouldn't be working, but if you could create a template of a flow with sample data that showed the problem it would be helpful to debug.
I will try to replicate this method using a clean template again. It could be some parameter problem, or the HDFS writer not being able to write.
I'm not sure if the intent of your flow is to just re-merge the original CSV that was split, or to merge together several different CSVs. Defragment mode will only re-merge the original CSV, so if ListHDFS picked up 10 CSVs, after splitting and re-merging, you should again have 10 CSVs.
Yes, that is exactly what I need: split and join the data back to their corresponding files. I don't specifically (yet) need to join the outputs again.
The approach of splitting a CSV down to 1 line per flow file to manipulate each line is a common approach, however it won't perform very well if you have many large CSV files. A more efficient approach would be to try and manipulate the data in place without splitting. This can generally be done with the record-oriented processors.
I used this approach purely instinctively and did not realize it is a common method. Sometimes the data file can be very large, meaning more than a million records in a single file. Won't that be an issue with the I/O in the cluster, since that would mean each record = one flowfile = one unique UUID? What is a comfortable number of flowfiles that NiFi can handle? (I know it depends on the cluster config and I will try to get more info about the cluster from the HDP admin.)
What do you mean by "try and manipulate the data in place without splitting"? Can you give an example, template, or processor to use?
In this case you would need to define a schema for your CSV which included all the columns in your data, plus the session id and timestamp. Then using an UpdateRecord processor you would use record path expressions like /session_id = ${UUID()} and /timestamp = ${now()}. This would stream the content line by line and update each record and write it back out, keeping it all as one flow file.
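Roughly, the configuration would look something like the sketch below. The reader/writer names and the surrounding flow are just an illustration of the idea, not a drop-in template:
ListHDFS -> FetchHDFS -> UpdateRecord -> PutHDFS
UpdateRecord:
Record Reader: a CSVReader whose schema lists session_id, timestamp and then your data columns
Record Writer: a CSVRecordSetWriter using the same schema
Replacement Value Strategy: Literal Value
/session_id: ${UUID()}
/timestamp: ${now()}
Because the whole file is processed as a single flow file, nothing is ever split, so there is no ordering to restore afterwards.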
This looks promising. Can you share a simple template that pulls files from HDFS > processes them > writes them back to HDFS, but without splitting?
I am reluctant to share the template due to restrictions, but let me see if I can create a generic template, and I will share it.
Thank you for your wisdom! :)
Hi, I am trying to run a time series cross-validation in a "rolling window" style: i.e. train with 8 weeks of data, test with the next week, then slide 1 week along.
What is the most efficient way of achieving this?
I have split my data file into weekly chunks. So what I was hoping was to pass in multiple files to the --data parameter (I was trying repeated --data).
This doesn't work, but it seems like one can use multiple cache files. AFAIK, this would require me to first create the cache-file chunks out of my text-file chunks. How would I call vw to just create the cache files?
You can pipe the data in on stdin (concatenate all the files with cat). However, since vw does online learning by default, there is no need to do the "rolling window" (and cache files) manually unless you want to use multiple training passes. Just store the model (with --save_resume -f path/to/the.model) and next week continue training with the new data.
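If you do want an explicit rolling-window evaluation (train on 8 weeks, predict the next, slide along), a small driver script around the vw command line is enough. A rough sketch in Python; the weekly file names, model path and prediction paths are placeholders, and it assumes the data is already in vw format:
import os
import subprocess

weeks = ["week_%02d.vw" % i for i in range(1, 13)]   # hypothetical weekly chunks
model = "rolling.model"

def vw_train(data_file):
    # Continue from the saved state if a model already exists
    # (--save_resume keeps everything needed to resume online learning later).
    cmd = ["vw", "-d", data_file, "--save_resume", "-f", model]
    if os.path.exists(model):
        cmd += ["-i", model]
    subprocess.run(cmd, check=True)

def vw_test(data_file, preds_file):
    # -t = test only (no learning), -p = write this week's predictions to a file.
    subprocess.run(["vw", "-t", "-i", model, "-d", data_file, "-p", preds_file],
                   check=True)

for f in weeks[:8]:                          # warm up on the first 8 weeks
    vw_train(f)
for k, f in enumerate(weeks[8:], start=9):
    vw_test(f, "preds_week_%02d.txt" % k)    # evaluate on the unseen week
    vw_train(f)                              # then fold it into the model
The sliding behaviour comes entirely from --save_resume / -i carrying the learner state from one week to the next; with a single pass there is no need for cache files.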
I am training a model in TensorFlow, using an Estimator (tf.contrib.learn.Estimator). When calling estimator.fit, I also pass a ValidationMonitor (tf.contrib.learn.monitors.ValidationMonitor) with an input function. This input function uses the tf.TFRecordReader reader plugged into a batch generator, tf.contrib.learn.io.read_batch_features. It is set up to sweep one epoch when called for validation, by passing num_epochs=1 as a read_batch_features argument.
Is there a way to limit the size of the validation test set without having to delete files in the validation directory?
E.g. if I have 100k records in the directory that I pass to the batch reader, what is the easiest way to limit the number of files it uses for validation, to, say, 1k?
This is all under the assumption that read_batch_features actually does sweep the whole validation directory when doing an evaluation step. I think this assumption is correct, since I am currently using a simple hack of putting a * pattern in the input directory (e.g. input_dir=FLAGS.input_dir + "/test_examples-001*" will limit the files used). It would be nice to have a cleaner way, such that I can precisely specify the number of files used for validation.
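For what it's worth, here is a rough sketch of how you could cap the number of validation files without the * hack: glob the directory yourself and hand only a slice of the file list to read_batch_features (as far as I know, it accepts a list of paths as well as a pattern). The feature spec, label key and file cap below are placeholders for whatever your setup uses:
import tensorflow as tf

def limited_eval_input_fn(input_dir, feature_spec, num_eval_files=10):
    # List the eval shards ourselves and keep only the first few of them.
    files = sorted(tf.gfile.Glob(input_dir + "/test_examples-*"))[:num_eval_files]
    features = tf.contrib.learn.io.read_batch_features(
        file_pattern=files,            # a list of paths instead of a glob string
        batch_size=128,
        features=feature_spec,         # dict of FixedLenFeature/VarLenFeature
        reader=tf.TFRecordReader,
        num_epochs=1,                  # one sweep over the reduced eval set
        randomize_input=False)
    label = features.pop("label")      # assumes the label is stored as "label"
    return features, label
If you want to bound the number of records rather than files, I believe ValidationMonitor also takes an eval_steps argument, so batch_size * eval_steps would cap the count directly.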
I am trying to stream my data to vw in --daemon mode, and would like to obtain at the end the value of the coefficients for each variable.
Therefore I'd like vw in --daemon mode to either:
- send me back the current value of the coefficients for each line of data I send, or
- write the resulting model in the "--readable_model" format.
I know about the dummy-example trick (sending save_<modelname>| ...) to get vw in daemon mode to save the model to a given file, but it isn't enough, as I can't access the coefficient values from that (binary) file.
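For reference, the trick is just sending the daemon a dummy example whose tag starts with save_; roughly, in Python (assuming the default daemon port 26542):
import socket

# Ask a running "vw --daemon" to dump its (binary) model to the given path.
# The path and port are examples; the dummy example is not trained on.
with socket.create_connection(("localhost", 26542)) as conn:
    conn.sendall(b"save_/tmp/current.model|\n")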
Any idea how I could solve my problem?
Unfortunately, on-demand saving of readable models isn't currently supported in the code, but it shouldn't be too hard to add. Open-source software is there for users to improve according to their needs. You may open an issue on GitHub or, better, contribute the change.
See this code line, where only the binary regressor is saved using save_predictor(). One could envision an "rsave" or "saver" tag/command to store the regressor in readable form, as is done in this code line.
As a workaround, you may call vw with --audit and parse every audit line for the feature names and their current weights (see the rough sketch after this list), but this would:
- make vw much slower
- require parsing every line to get the values, rather than getting them on demand
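If you go that route anyway, here is a very rough sketch of the parsing side, shown against a one-off batch run over a saved model. The file names are placeholders, and the audit token format (namespace^feature:hashindex:value:weight...) may differ slightly between vw versions:
import subprocess

# Run vw with --audit and pull (feature, weight) pairs out of the audit output.
proc = subprocess.run(
    ["vw", "-t", "-i", "the.model", "-d", "some_data.vw", "--audit"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)

weights = {}
# Audit lines may land on either stream depending on version, so scan both.
for line in proc.stdout.splitlines() + proc.stderr.splitlines():
    for token in line.split():
        parts = token.split(":")
        # Named features typically look like namespace^feature:hash:value:weight[@...]
        if len(parts) >= 4 and "^" in parts[0]:
            try:
                weights[parts[0]] = float(parts[3].split("@")[0])
            except ValueError:
                pass

print("parsed %d feature weights" % len(weights))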
I have a Simulink model that uses an Embedded MATLAB Function block, and I haven't been able to figure out how to move data between the Embedded MATLAB block and a GUI in real time (i.e. while the model is running). I tried to implement a To Workspace block in my model, but I don't know how to use it correctly.
Does anyone know how to move data from a Simulink block into a GUI in real time?
Non-real-time solution:
If you want to set parameters in a GUI, simulate a model with those parameters, and then display the simulation output in the GUI, there is a good tutorial on blinkdagger.com. One solution they describe is using the SIMSET function to define which workspace the Simulink model interacts with. You should be able to supersede the base workspace so that data is instead sent to and from the workspace of the GUI functions that are calling the Simulink model.
Real-time solution:
As suggested by MikeT, you can use a RuntimeObject. You first have to use the get_param function to get the RuntimeObject from the block:
rto = get_param(obj,'RuntimeObject');
Where obj is either a block pathname or a block-object handle. You can get the pathname of the most recently selected block using the GCB function (in which case you can replace obj with gcb). You can then get the block's output with the following:
blockData = rto.OutputPort(1).Data
One additional caveat from the documentation:
To ensure the Data field contains the correct block output, turn off the Signal storage reuse option (see Signal storage reuse) on the Optimization pane in the Configuration Parameters dialog box.
You would likely end up with a loop or a timer routine running in your GUI that would continuously get the output data from the RuntimeObject for as long as the simulation is running. The documentation also states:
A run-time object exists only while the model containing the block is running or paused. If the model is stopped, get_param returns an empty handle. When you stop or pause a model, all existing handles for run-time objects become empty.
Your loop or timer routine would thus have to keep checking first that the RuntimeObject exists, and either stop (if it doesn't) or get the data from it (if it does). I'm unsure of exactly how to check for existence of a RuntimeObject, but I believe you would either check if the object is empty or if the BlockHandle property of the object is empty:
isempty(rto) % Check if the RuntimeObject is empty
%OR
isempty(rto.BlockHandle) % Check if the BlockHandle property is empty
From your responses, I'm guessing you want to see the results while the simulation is running, is that correct? The blinkdagger.com tutorial lets you view the results of a simulation after it is done, but not while it is running. Do you basically want to embed something like a scope block into your GUI?
There are a few ways to do this; the best is probably using the EML block's run-time object. If you use this, you should be able to look at the output of the EML block while the model is running.