We see an issue on Stream Analytics when using a blob reference input. After the stream is restarted, it outputs doubled values for anything joined to the reference input. I assume this happens because more than one blob is active while the job restarts. Currently we pull the files from a folder path in ADLS structured as Output/{date}/{time}/Output.json, which ends up being Output/2021/04/16/01/25/Output.json. These files have a key that the stream data matches on with:
IoTData
LEFT JOIN kauiotblobref kio
ON kio.ParentID = IoTData.ConnectionString
I don't see any issue with the join itself, but those files are created every minute, on the minute, by an Azure Function. So it may be that when Stream Analytics starts, it grabs the last file and also the one created right after it. (That would be my guess, but I'm not sure how we would fix that.)
Here's a visual in Power BI of the issue (two screenshots: Peak and Trough).
This is easily explained by looking at the Cosmos DB records for the device it's capturing from: there are two entries with the same value, assetID, and timestamp, but a different recordID (which just means Cosmos DB counted them as two separate events). This shouldn't be possible, because a device can't send duplicates with the same timestamp.
This seems to be a core issue with blob reference inputs on Stream Analytics, since the job typically takes more than 1 minute to start. The best fix I've found is to stop the corresponding functions before starting the stream back up. I'm working to automate this through CI/CD pipelines, which is good practice anyway when editing the stream.
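To illustrate the guess above, here is a small Python sketch. It is not Stream Analytics code and I have not confirmed this is what the service does internally. It only shows that a LEFT JOIN against two overlapping reference snapshots doubles every matching row, and that keeping only the newest snapshot per key removes the duplicates. The `snapshot` field and both function names are made up for the example.

```python
from collections import defaultdict


def left_join(events, reference_rows):
    """LEFT JOIN events to reference rows on ConnectionString = ParentID."""
    by_parent = defaultdict(list)
    for ref in reference_rows:
        by_parent[ref["ParentID"]].append(ref)

    joined = []
    for ev in events:
        # An event without a match still produces one row (joined to None).
        matches = by_parent.get(ev["ConnectionString"]) or [None]
        for ref in matches:
            joined.append((ev, ref))
    return joined


def latest_snapshot_only(reference_rows):
    """Keep, per ParentID, only the row from the most recent snapshot path.

    Assumes zero-padded paths such as Output/2021/04/16/01/25/Output.json,
    so plain string comparison orders them chronologically.
    """
    latest = {}
    for ref in reference_rows:
        current = latest.get(ref["ParentID"])
        if current is None or ref["snapshot"] > current["snapshot"]:
            latest[ref["ParentID"]] = ref
    return list(latest.values())


events = [{"ConnectionString": "device-1", "value": 42}]
reference = [
    {"ParentID": "device-1", "snapshot": "Output/2021/04/16/01/25/Output.json"},
    {"ParentID": "device-1", "snapshot": "Output/2021/04/16/01/26/Output.json"},
]

print(len(left_join(events, reference)))                        # both snapshots active
print(len(left_join(events, latest_snapshot_only(reference))))  # newest snapshot only
```

Stopping the function before the restart has the same effect as `latest_snapshot_only`: only one snapshot can be picked up while the job is starting.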
Following the question on how to execute a file dump in icCube, I would like to know if it is possible to:
create a file dump
then use it as a data source
I tried to build a sequence of data views, but I cannot get it to work, and I wonder if it is even possible at all?
(The reason I would like to do it is that my main data source is an odata feed and I need a lot of data manipulation before I can load it. I anticipate that it will be much easier to do these on CSV files.)
This is not possible as the rationale behind the ETL support is to transform data-tables as returned by the data-sources.
I'm fairly new to NiFi and need help with the design.
I am trying to create a simple flow that reads dummy CSV files (for now) from an HDFS directory and prepends some text data to each record in each flowfile.
Incoming files:
dummy1.csv
dummy2.csv
dummy3.csv
contents:
"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8
"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",BarryFrench,293,457.81,208.16,68.02,Nunavut,Appliances,0.58
"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",Barry French,293,46.71,8.69,2.99,Nunavut,Binders and Binder Accessories,0.39
...
Desired output:
d17a3259-0718-4c7b-bee8-924266aebcc7,Mon Jun 04 16:36:56 EDT 2018,Fellowes Recycled Storage Drawers,Allen Rosenblatt,11137,395.12,111.03,8.64,Northwest Territories,Storage & Organization,0.78
25f17667-9216-4f1d-b69c-23403cd13464,Mon Jun 04 16:36:56 EDT 2018,Satellite Sectional Post Binders,Barry Weirich,11202,79.59,43.41,2.99,Northwest Territories,Binders and Binder Accessories,0.39
ce0b569f-5d93-4a54-b55e-09c18705f973,Mon Jun 04 16:36:56 EDT 2018,Deflect-o DuraMat Antistatic Studded Beveled Mat for Medium Pile Carpeting,Doug Bickford,11456,399.37,105.34,24.49,Northwest Territories,Office Furnishings,0.61
The flow:
SplitText
ReplaceText
MergeContent
(This may be a poor way to achieve what I want, but I read somewhere that a UUID is the best bet for generating a unique session id. So I thought of extracting each line of the incoming data into its own flowfile and generating a UUID for it.)
But as you can see, the order of the data gets mixed up: the first 3 rows of the output are not the first 3 rows of the input. The test data I am using (50000 entries) does contain those rows, just at other line positions. Multiple tests show that the order usually changes after the 2001st line.
And yes, I did search for similar issues here and tried using the Defragment strategy in MergeContent, but it didn't work. I would appreciate it if someone could explain what is happening here and how I can get the data in the original order with a unique session_id and timestamp for each record. Is there some parameter I need to change to get the correct output? I am open to suggestions if there is a better way as well.
First of all thank you for such an elaborate and detailed response. I think you cleared a lot of doubts I had as to how the processor works!
The ordering of the merge is only guaranteed in defragment mode because it will put the flow files in order according to their fragment index. I'm not sure why that wouldn't be working, but if you could create a template of a flow with sample data that showed the problem it would be helpful to debug.
I will try to replicate this method using a clean template again. It could be a parameter problem, or the HDFS writer not being able to write.
I'm not sure if the intent of your flow is to just re-merge the original CSV that was split, or to merge together several different CSVs. Defragment mode will only re-merge the original CSV, so if ListHDFS picked up 10 CSVs, after splitting and re-merging, you should again have 10 CSVs.
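As a rough illustration of what defragment mode does, here is a Python sketch (not NiFi code). It models each flow file as a dict carrying the fragment.identifier, fragment.index and fragment.count attributes that SplitText writes. The dict layout and the function name are assumptions made for the example.

```python
from collections import defaultdict


def defragment(flowfiles):
    """Re-merge split lines per original file, ordered by fragment.index.

    Returns {fragment.identifier: merged content}. Bundles that have not
    yet received all of their fragments are left out.
    """
    bundles = defaultdict(list)
    for ff in flowfiles:
        bundles[ff["fragment.identifier"]].append(ff)

    merged = {}
    for identifier, parts in bundles.items():
        expected = int(parts[0]["fragment.count"])
        if len(parts) != expected:
            continue  # incomplete bundle: keep waiting, do not merge
        # Arrival order is irrelevant; only the index decides the order.
        parts.sort(key=lambda ff: int(ff["fragment.index"]))
        merged[identifier] = "\n".join(ff["content"] for ff in parts)
    return merged


# Fragments of two files arriving interleaved and out of order.
arrived = [
    {"fragment.identifier": "dummy1", "fragment.index": "2", "fragment.count": "2", "content": "row B"},
    {"fragment.identifier": "dummy2", "fragment.index": "1", "fragment.count": "1", "content": "row X"},
    {"fragment.identifier": "dummy1", "fragment.index": "1", "fragment.count": "2", "content": "row A"},
]
print(defragment(arrived))
```

The point is that the output order depends only on the fragment attributes, so anything between the split and the merge that drops or overwrites them will break the ordering.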
Yes, that is exactly what I need: split the data and join it back into the corresponding files. I don't specifically (yet) need to join the outputs again.
The approach of splitting a CSV down to 1 line per flow file to manipulate each line is a common approach, however it won't perform very well if you have many large CSV files. A more efficient approach would be to try and manipulate the data in place without splitting. This can generally be done with the record-oriented processors.
I used this approach purely instinctively and did not realize it is a common method. Sometimes the data file could be very large, meaning more than a million records in a single file. Won't that be an issue for I/O in the cluster? It would mean each record = one flowfile = one unique UUID. What is a comfortable number of flowfiles that NiFi can handle? (I know it depends on the cluster config, and I will try to get more info about the cluster from the HDP admin.)
What do you mean by "try and manipulate the data in place without splitting"? Can you give an example, a template, or a processor to use?
In this case you would need to define a schema for your CSV which included all the columns in your data, plus the session id and timestamp. Then using an UpdateRecord processor you would use record path expressions like /session_id = ${UUID()} and /timestamp = ${now()}. This would stream the content line by line and update each record and write it back out, keeping it all as one flow file.
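To make the "in place, without splitting" idea concrete, here is a minimal Python sketch of the same pattern outside NiFi. It is not an UpdateRecord configuration. It streams a CSV row by row, prepends a generated UUID and a timestamp to each record, and writes the result as one output in the original order. The function name and the ISO timestamp format are choices made for the example.

```python
import csv
import io
import uuid
from datetime import datetime, timezone


def prepend_session_fields(src, dst):
    """Stream CSV rows from src to dst, prepending a UUID and a timestamp.

    Rows are handled one at a time, so the whole file is never held in
    memory and the input order is preserved. Returns the row count.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for row in reader:
        session_id = str(uuid.uuid4())
        timestamp = datetime.now(timezone.utc).isoformat()
        writer.writerow([session_id, timestamp, *row])
        count += 1
    return count


sample = io.StringIO(
    '"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3\n'
    '"Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl",Barry French,293\n'
)
result = io.StringIO()
prepend_session_fields(sample, result)
print(result.getvalue())
```

Because nothing is split, there is no merge step and therefore no ordering problem to solve, and a file with a million records stays a single unit of work.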
This looks promising. Can you share a simple template pulling files from hdfs>processing>write hdfs files but without splitting?
I am reluctant to share the template due to restrictions, but let me see if I can create a generic template, and I will share it.
Thank you for your wisdom! :)
I want to know how to (or whether I can) parameterize the parm file name in Informatica.
A little bit of background: I am building a standard map in Informatica, which business users can call directly after selecting the standard filters they want to apply in the map using a GUI.
The parm file name will be given by the business user, and all the filters that the user selected will be in the parm file. The file will be dropped in the parm folder on the Informatica server.
This works fine when only one user is using it at a time.
I also want to find out what I should do when multiple users are working in the GUI, generating parm files, and invoking the Informatica map. How do I get multiple instances of the same map running at the same time?
I hope I am making sense here....
Thanks!!!
You can achieve this by using concurrent execution of the workflow. Read about it and understand how you can implement it.
Once you know how to implement it, have a backend script behind the GUI assign an instance name to each call made through the GUI. For each instance name, you can have an individual parameter file. (I believe there would be a finite set of combinations of variable values in your case.) You can use the command below to call individual instances, either through your GUI or from any other backend code.
pmcmd startworkflow -f %informatica_folder_name%
-paramfile %paramfilepathandname% -rin %instance_name% %workflow_name%
It might sound a bit confusing, but once you understand how concurrent workflows work, you can build on it based on the above input.
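As a sketch of the backend script idea, the Python below builds one pmcmd call per GUI request, with a unique run instance name and a matching parameter file path. It only assembles the argument list and does not run anything. The function name, the naming scheme for instances and parm files, and all example values are hypothetical. The flags shown (-sv, -d, -u, -pv, -f, -paramfile, -rin) are the startworkflow options as I understand them, so check them against your installed pmcmd version, and note that the workflow must be configured for concurrent execution.

```python
import uuid
from pathlib import Path


def build_start_command(workflow, folder, param_dir, gui_user,
                        service, domain, pm_user, pm_password_var):
    """Return (run_instance, param_file, argv) for one GUI request.

    A fresh run instance name is generated for every call, so two users
    starting the same workflow never share an instance or a parm file.
    """
    run_instance = f"{gui_user}_{uuid.uuid4().hex[:8]}"
    param_file = Path(param_dir) / f"{run_instance}.parm"
    argv = [
        "pmcmd", "startworkflow",
        "-sv", service,           # Integration Service name
        "-d", domain,             # domain name
        "-u", pm_user,            # Informatica user
        "-pv", pm_password_var,   # env variable holding the password
        "-f", folder,
        "-paramfile", str(param_file),
        "-rin", run_instance,
        workflow,
    ]
    return run_instance, param_file, argv


instance, parm, argv = build_start_command(
    "wf_standard_map", "BUSINESS_FOLDER", "/infa/parm", "alice",
    "IS_EXAMPLE", "Domain_Example", "svc_user", "PM_PASSWORD",
)
print(" ".join(argv))
```

The GUI would write the selected filters to `parm` first and then run `argv`, for example with `subprocess.run(argv)`.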
This is only possible if you call Informatica from an external tool, not from the Client tools. One way is described by @Utsav; the other is to use Informatica WSH to call a workflow, where you can indicate the parameter file you want to be used with the workflow, as well as the desired instance name.
I think this guide to concurrent workflows may be what you are looking for:
https://kb.informatica.com/howto/6/Pages/17/301264.aspx
I accidentally imported more data into 0xdata's H2O Flow than I intended. How can I delete all my data frames?
I already tried Data -> List All Frames -> Delete, but I get the following error message:
Error evaluating cell
Error calling DELETE/....ink
Object 'nfs:....lnk' not found for argument: key
Is there another way to erase those data frames? Where are those data frames physically stored?
Can you please provide more details about your environment: which version of H2O, and which platform?
I would recommend retrying with the latest H2O (see http://h2o.ai/download).
Agreed with @Michal that more info is needed.
If you're using R (I would recommend using R or Python), use h2o.removeAll(); the Python equivalent is h2o.remove_all().
If you're using the Flow UI, select Data -> List All Frames, then select the check box for all frames, and then click Delete selected frames.