In a Hadoop Cascading flow, I have a number of tuples that are processed and finally sunk into a destination.
Now my requirement is: to sink those tuples into the destination file with certain defined constant String values at the beginning and at the end.
For example: I have following input tuples
10|11|12|13|14|15|16|17|18|19|20
20|21|22|23|24|25|26|27|28|29|30
1|2|3|4|5|6|7|8|9|10
Now I need the output to look like this:
Certain data before those data
10|11|12|13|14|15|16|17|18|19|20
20|21|22|23|24|25|26|27|28|29|30
1|2|3|4|5|6|7|8|9|10
Certain data after those data
I have searched a little through the DelimitedParser class and its methods like joinLine and joinFirstLine, but due to the sparse documentation I am unable to work out exactly how to use them.
It may depend on what "Certain data before those data" means.
If you are using TextDelimited, then you can write the header values in the sink. According to the documentation, header values are not written by default, so you will need to enable this. Another thing to remember is that the header values represent the output fields.
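As a rough sketch (assuming Cascading 2.x on the Hadoop platform; field names, delimiter and path are placeholders, and the exact TextDelimited constructor should be checked against the javadoc):

    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    // Sketch only: field names, delimiter and path are placeholders; check the
    // TextDelimited javadoc for the exact overload that enables header writing.
    Fields outputFields = new Fields("col1", "col2", "col3");
    // the boolean turns on header handling, so the field names are written to the sink
    TextDelimited scheme = new TextDelimited(outputFields, true, "|");
    Tap sink = new Hfs(scheme, "output/path", SinkMode.REPLACE);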
-Amit
I'm using data from Synthea, and it looks like most (all?) of the given and family names I'm getting back end with a three-digit number (e.g. Gregg522). Is this part of the design of Synthea, or am I parsing the data incorrectly? A snippet of the JSON I'm getting back is shown below. If this is part of the design, what is the motivation for ending the name with a number? (I would think this would make the data less realistic.)
Yes, they generally do. It is sometimes nice to be able to see that the patients are fake/synthetic ones. However, this is a setting you can change: In the synthea.properties file, look for the setting "append_numbers_to_person_names" and set it to false.
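For reference, the change amounts to one line in synthea.properties (search for the setting name, as it may appear with a prefix in your version of the file):

    # in synthea.properties -- search for this setting name; it may appear with a
    # prefix in your version of the file
    append_numbers_to_person_names = false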
I am trying to understand the combination of List and Fetch processors.
I have a directory with three JSON files, and I use ListAzureDataLakeStorage to list them. But when I connect a FetchAzureDataLakeStorage processor with which I intend to fetch only one of the files, the Fetch retrieves that same file three times. In summary, it fetches the file whose azure.filename matches the value I put in the File Name property, but as many times as there are files in the listed directory.
I really want to use a single List and connect three Fetches to it, each one fetching a different file, so I can use them in different flows.
In each Fetch I set the "File Name" property to the name of the file that I want to fetch. For example:
File Name: fileName1.json
I have also tried setting "File Name" with the following Expression Language:
File Name: ${azure.filename:equals('fileName1.json')}. But this option causes a 404 empty-body error.
But nothing works. Am I misunderstanding something about using the List and Fetch combination?
If you are statically entering file names and you want to respond to each one differently, then the ListX processors aren't very beneficial to your flow.
The easier option would be to use a GenerateFlowFile processor with the appropriate schedule to trigger a corresponding FetchX processor.
If you're only doing this for 3 files, it's not too much manual overhead. You could also achieve something similar using RouteOnContent/Attribute.
I have a problem: I have some measurement data (like echo measurements) that can potentially have multiple values associated with it. In other words, there is a single measurement, but multiple associated values.
Is there a standard way to represent multiple values for a single measurement as an Observation? If so, what is the best way?
I notice that an Observation can have multiple components. Should I put the LOINC code for my measurement at the Observation level and put each value at the component level? Or do I have to use extensions?
Thanks!
I am not sure exactly what your data looks like, but here are a couple of patterns:
There is the SampledData datatype, which can be used for data streams like an EKG (see the example).
If you have discrete values that are all interpreted together with an observation (they can't stand alone as independent observations), then use components, with Observation.code = code, Observation.value[x] empty, Observation.component.code = code, and Observation.component.value[x] = result value. Here is an example of this pattern (and a rough code sketch follows after the notes below).
In some cases you will have an Observation.value[x] as well.
Note that Observation.component.code is required for each component.
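As a rough illustration, a component-based Observation built with the HAPI FHIR R4 model classes could look something like this (the codes and values are made-up placeholders, and the method names should be checked against the HAPI documentation):

    import org.hl7.fhir.r4.model.CodeableConcept;
    import org.hl7.fhir.r4.model.Coding;
    import org.hl7.fhir.r4.model.Observation;
    import org.hl7.fhir.r4.model.Quantity;

    // Sketch only: the codes and values below are placeholders, not real LOINC codes.
    Observation obs = new Observation();
    obs.setStatus(Observation.ObservationStatus.FINAL);
    // Observation.code identifies the overall measurement; Observation.value[x] stays empty
    obs.setCode(new CodeableConcept().addCoding(
        new Coding("http://loinc.org", "12345-6", "Example echo panel")));

    // each discrete value gets its own component, with its own (required) component code
    Observation.ObservationComponentComponent c1 = obs.addComponent();
    c1.setCode(new CodeableConcept().addCoding(
        new Coding("http://loinc.org", "11111-1", "Example component A")));
    c1.setValue(new Quantity().setValue(4.2).setUnit("cm")
        .setSystem("http://unitsofmeasure.org").setCode("cm"));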
For grouping independent observations together, using components is not appropriate. That grouping is done using DiagnosticReport.result or Observation.related. The DiagnosticReport resource is typically used for reporting diagnostics in response to an order.
I'm new to protocol buffers, and I was wondering whether it is possible to search a protocol buffers binary file and read the data back in a structured format. For example, if a message in my .proto file has 4 fields, I would like to serialize the message, write multiple messages into a file, and then search for a particular field in the file. If I find the field, I would like to read back the message in the same structured format as it was written. Is this possible with protocol buffers? If so, any sample code or examples would be very helpful. Thank you.
You should treat the protobuf library as a serialization protocol, not as an all-in-one library that supports complex operations (such as querying, indexing, or picking out particular data). Google has various libraries on top of the open-sourced portion of protobuf to do so, but they are not released as open source, as they are tied to Google's internal infrastructure. That being said, what you want is certainly possible, yet you will need to write some code.
Anyhow, some of your requirements are:
one file contains multiple serialized messages.
search for a particular field in each serialized message and extract that chunk.
There are several ways to achieve them.
The most popular way for serial read/write is for the file to contain a series of [size, type, serialized output] records. That is, each serialized output is always prefixed by its size and type (either fixed 4/8-byte or variable-length) to help with reading and parsing. So you just repeat this procedure: 1) read size and type, 2) read the binary with the given size, 3) parse it with the given type, 4) go to 1). If you use a union type, or the whole file shares the same type, you may skip the type. You cannot drop the size, as there is no way to know where one output ends by itself. If you want random read/write, another kind of data structure is necessary.
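In the Java protobuf library this size-prefixed pattern already exists as the writeDelimitedTo / parseDelimitedFrom helpers; a sketch (MyMessage is a placeholder for your generated message class) could be:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.List;

    // Sketch only: MyMessage stands in for your generated message class,
    // and "messages" is assumed to be a List<MyMessage> built elsewhere.

    // Write several size-prefixed messages into one file.
    try (FileOutputStream out = new FileOutputStream("messages.bin")) {
        for (MyMessage msg : messages) {
            msg.writeDelimitedTo(out);   // prefixes each message with its varint-encoded size
        }
    }

    // Read them back one by one until end of file (parseDelimitedFrom returns null at EOF).
    try (FileInputStream in = new FileInputStream("messages.bin")) {
        MyMessage msg;
        while ((msg = MyMessage.parseDelimitedFrom(in)) != null) {
            // process msg here
        }
    }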
'Searching for a field' in a binary file is trickier. One way is to read and parse the outputs one by one and check for the existence of the field with HasField(). It is the most obvious and slowest, yet most straightforward, way to do it. If you want to search for a field by number (say, you want to search for 'optional string email = 3;'), and thus search for a binary blob (like 0x1A: field number 3, wire type 2), that is not possible. In a serialized binary stream, field information is saved as merely a number. Without the exact context (the .proto schema or the binary file's structure), the number alone doesn't mean anything. There is no guarantee that 0x1A comes from field information, or from field information of another message type, or is actually the number 26, or part of another number, etc. That is, you need to maintain that information yourself. You may create another file or database with the necessary information to fetch particular messages (like the location of the serialized output that contains a given field).
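For example, a brute-force scan using the Java reflection API could look like this (again MyMessage and the field name "email" are placeholders):

    import com.google.protobuf.Descriptors;
    import java.io.FileInputStream;

    // Sketch only: scans every message in the file and checks whether the optional
    // field "email" (field number 3 in the example above) is set.
    Descriptors.FieldDescriptor emailField =
        MyMessage.getDescriptor().findFieldByName("email");

    try (FileInputStream in = new FileInputStream("messages.bin")) {
        MyMessage msg;
        while ((msg = MyMessage.parseDelimitedFrom(in)) != null) {
            if (msg.hasField(emailField)) {
                System.out.println("found: " + msg.getField(emailField));
            }
        }
    }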
Long story short, what you ask for is beyond what the open-sourced protobuf library itself does, yet you can write it yourself to meet your requirements.
I hope, this is what you are looking for:
http://temk.github.io/protobuf-utils/
This is a command-line utility for searching within a protobuf file.
Is it possible to update an RDF file dynamically from user-generated input through a web form? The exact scenario would be SKOS concept definitions being created and updated through user input to HTML forms.
I was considering XPath, but is there a better / generally accepted / best-practice way of doing this kind of thing?
For this type of thing there are IMO two approaches:
1 - Using Named Graphs in a Triple Store
Rather than editing an actual fixed file, you use a graph which is stored as a named graph in a triple store that supports triple-level updates (i.e. you can change individual triples in a graph). For example, you could use a store like Virtuoso or a Jena-based store (Jena SDB/TDB) to do this; basically any store that supports the SPARUL language or has its own equivalent.
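As a rough illustration using Apache Jena TDB (the store location, graph URI and the inserted triple are all placeholders):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.tdb.TDBFactory;
    import org.apache.jena.update.UpdateAction;

    // Sketch only: the store location, graph URI and triple are placeholders.
    Dataset dataset = TDBFactory.createDataset("/data/tdb");
    String update =
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#> " +
        "INSERT DATA { GRAPH <http://example.org/concepts> { " +
        "  <http://example.org/concept/1> skos:prefLabel \"New label\"@en } }";
    UpdateAction.parseExecute(update, dataset);  // apply the update directly to the store
    dataset.close();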
2 - Using a fixed RDF file and altering it
From your mention of XPath I assume that you are intending to store your file as RDF/XML. While XPath would potentially work for this, it's going to be dependent on the exact serialization of your file and may get very complex. If your app is going to allow users to submit and edit their own files, then there will be no guarantees over how the RDF has been serialized into RDF/XML, so your XPath expressions might not work. If you control all the serialization and processing of the RDF/XML, then you can keep it in a format that your XPath will work on.
From my point of view, the simplest way to do this approach is to load the file into memory using an appropriate RDF library, manipulate it in memory, and then persist the whole thing back to disk when the user is done (or at regular intervals, or whatever is appropriate to your application).
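For instance, with Apache Jena the load/modify/persist cycle is roughly as follows (the file name and the added triple are placeholders in this sketch):

    import java.io.FileOutputStream;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.RDF;

    // Sketch only: file name and resource URIs are placeholders.
    Model model = ModelFactory.createDefaultModel();
    model.read("concepts.rdf");  // RDF/XML is the default serialization

    // apply the user's edits in memory, e.g. add a triple
    Resource concept = model.createResource("http://example.org/concept/1");
    concept.addProperty(RDF.type,
        model.createResource("http://www.w3.org/2004/02/skos/core#Concept"));

    // persist the whole model back to disk when the user is done
    try (FileOutputStream out = new FileOutputStream("concepts.rdf")) {
        model.write(out, "RDF/XML");
    }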