Parsing/expanding escaped array in Stream Job? - user-defined-functions

I'm using an Azure Stream Job to parse incoming JSON data from an IoT Hub.
I'm already using ...
CROSS APPLY GetArrayElements(event.NestedRows) as nestedrows
... to expand and denormalize additional events within each event - works great, no issues.
However, I have a new JSON property that is of type string and it is actually an embedded JSON array. For example:
{
"escapedArray": "[ 1, 2, 3 ]"
}
I'd like to use CROSS APPLY on this array as well, however I don't see any way to parse the string and convert it to a JSON array.
I considered a User Defined Function (UDF), but I read that it can only return scalars, not arrays.
Is there a trick I'm missing inside the Stream Job to parse this string, or do I have to expand it in the event stream, prior to the Stream Job?
(FYI, I have no way to change this stream in the device event source.)
-John

According to the official tutorial, you can create a UDF in Stream Analytics to convert a string to an array.
Define a UDF udf.toJson as below.
// Parse the escaped JSON array string into a real JSON array
function main(arrStr) {
    return JSON.parse(arrStr);
}
Then use the UDF in a query on the escapedArray property to return an array.
SELECT
UDF.toJson(escapedArray) as uarr
INTO
[YourOutputAlias]
FROM
[YourInputAlias]
My test result is shown in the figure below.
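Since your original goal was CROSS APPLY, the parsed array can also be flattened in the same query by feeding the UDF result to GetArrayElements; a sketch along these lines should work (the aliases are illustrative):
SELECT
    arrayElement.ArrayValue AS item
INTO
    [YourOutputAlias]
FROM
    [YourInputAlias] AS event
CROSS APPLY GetArrayElements(UDF.toJson(event.escapedArray)) AS arrayElement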
Hope it helps.

Related

Extract a key or part of a string using Freemarker

I'm creating a JSON message from my ITSM solution, which uses Freemarker as a template tool.
The system returns a list of materials in the form of a string like this:
{ "[ZA00344] Toner Teste", "MATNR" : "[ZA00888] Caneta" }
Is there any built-in to extract the key between [] without using keep_after, keep_before, or similar functions? Those can eventually cause problems.
Thanks.
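One candidate built-in is ?matches, which takes a regular expression and exposes capture groups, avoiding keep_after/keep_before entirely. A sketch (the variable name material is illustrative):
<#assign material = "[ZA00344] Toner Teste">
<#list material?matches(r"\[([^\]]+)\]") as m>
  ${m?groups[1]}  <#-- prints ZA00344 -->
</#list>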

How to pull data using XPath from a script?

Just started learning scraping, and for my test project I'm trying to retrieve the quantity of a certain product in the Scrapy shell by using
response.xpath('//script[contains("quantity")]/text()').extract()
This doesn't work.
Help me understand the right convention for retrieving data from attributes like quantity, category_path, etc.
<script>
window.dataLayer = window.dataLayer || [];
dataLayer.push({"event":"datalayer-initialized","region":"India","account_type":"ecom","customer":{"id":""},"page_type":"Product","product":{"ffr":"csddfas","name":"tote bag by singh","materials":"100% polyester","specs":"Dimensions: 18.5\" x 6.75\"; 24L","color":null,"size":null,"upc":null,"new":false,"brand":null,"season":"HOLIDAY 2017","on_sale":false,"quantity":"158","original_price":100,"price":100,"category_path":
["Mens","Accessories","Backpacks \/ Bags"],"created":"2016-09-07","modified":"2018-02-12",
"colors":["BLACK"],"sizes":["S","M","L","XS","XL","XXL"]}});
</script>
Scrapy selectors have built-in support for regular expressions, and that can help you in this case:
response.xpath('//script[contains(text(),"quantity")]/text()').re(r'"quantity":"(\d+)"')
(you need to update your XPath to select the script's text content; contains() needs text() as its first argument, so your original expression was not good enough)
Another way: you can also use a regular expression to extract the JSON content from the script, parse it into a JSON object, and work with it directly, which is easier, as sketched below.
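For example, a minimal sketch of that approach (the selector and keys follow the question's snippet):
import json
import re

script_text = response.xpath('//script[contains(text(), "quantity")]/text()').extract_first()
payload = re.search(r'dataLayer\.push\((\{.*\})\);', script_text, re.DOTALL).group(1)
data = json.loads(payload)
print(data["product"]["quantity"])       # "158"
print(data["product"]["category_path"])  # ["Mens", "Accessories", "Backpacks / Bags"]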
You are using the css method and giving it an XPath expression.
Try
response.xpath('//script[contains(text(), "quantity")]').extract()
or
response.css('script:contains("quantity")').extract()
And you will need a Regex to extract that JSON string
re.findall(r'(?<=dataLayer\.push\().*(?=\)\;)', your_script_tag_data, re.DOTALL)
import json
import re

javascript = response.xpath('//script[contains(text(), "quantity")]/text()').extract_first()
json_string = re.search(r'dataLayer\.push\((.+?)\);', javascript, re.DOTALL).group(1)
data = json.loads(json_string)
print("Quantity: {0}".format(data["product"]["quantity"]))
In my experience, there is no way to get quantity, category_path, etc. by XPath alone, because they are embedded in JSON; XPath only addresses the XML/HTML structure.
Assuming you already have the parsed document, use:
data = yourXML.xpath('//script//text()')
Now data holds the script text. Then take the string inside dataLayer.push(...) and parse it as JSON; with JSON it's easy to get your information.

How do I make the mapper process the entire file from HDFS

This is the code where I read the file that contains HL7 messages and iterate through them using the HAPI iterator (from http://hl7api.sourceforge.net):
File file = new File("/home/training/Documents/msgs.txt");
InputStream is = new FileInputStream(file);
is = new BufferedInputStream(is);
Hl7InputStreamMessageStringIterator iter =
        new Hl7InputStreamMessageStringIterator(is);
I want to do this inside the map function. Obviously I need to prevent splitting in the InputFormat so that the entire file is read at once as a single value and converted to a String (the file size is 7 KB), because as you know HAPI can only parse a complete message.
I am newbie to all of this so please bear with me.
You will need to implement your own FileInputFormat subclass:
It must override the isSplitable() method to return false, which means the number of mappers will equal the number of input files: one input file per mapper.
You also need to implement the getRecordReader() method (createRecordReader() in the newer mapreduce API). This is exactly the class where you need to put your parsing logic from above.
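A minimal sketch of that pattern, using the newer org.apache.hadoop.mapreduce API (class names are illustrative; the record reader simply hands back the whole file as one value):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits each input file as one (NullWritable, BytesWritable) record.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: one mapper per file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) return false;
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            value.set(contents, 0, contents.length); // the whole file as one value
            processed = true;
            return true;
        }

        @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
In the mapper, new ByteArrayInputStream(value.copyBytes()) can then feed Hl7InputStreamMessageStringIterator directly, and the job is wired up with job.setInputFormatClass(WholeFileInputFormat.class).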
If you do not want your data file to be split, or you want a single mapper to process your entire file (so that one file is processed by only one mapper), then extending FileInputFormat and overriding the isSplitable() method to return false will help you.
For reference (not based on your code):
https://gist.github.com/sritchie/808035
As the input comes from a text file, you can override the isSplitable() method of FileInputFormat. Using this, one mapper will process the whole file.
@Override
protected boolean isSplitable(JobContext context, Path file) {
    return false;
}

Parquet-MR AvroParquetWriter - how to convert data to Parquet (with Specific Mapping)

I'm working on a tool for converting data from a homegrown format to Parquet and JSON (for use in different settings with Spark, Drill and MongoDB), using Avro with Specific Mapping as the stepping stone. I have to support conversion of new data on a regular basis and on client machines which is why I try to write my own standalone conversion tool with a (Avro|Parquet|JSON) switch instead of using Drill or Spark or other tools as converters as I probably would if this was a one time job. I'm basing the whole thing on Avro because this seems like the easiest way to get conversion to Parquet and JSON under one hood.
I used Specific Mapping to profit from static type checking, wrote an IDL, converted that to a schema.avsc, generated classes and set up a sample conversion with specific constructor, but now I'm stuck configuring the writers. All Avro-Parquet conversion examples I could find [0] use AvroParquetWriter with deprecated signatures (mostly: Path file, Schema schema) and Generic Mapping.
AvroParquetWriter has only one non-deprecated constructor, with this signature:
AvroParquetWriter(
Path file,
WriteSupport<T> writeSupport,
CompressionCodecName compressionCodecName,
int blockSize,
int pageSize,
boolean enableDictionary,
boolean enableValidation,
WriterVersion writerVersion,
Configuration conf
)
Most of the parameters are not hard to figure out but WriteSupport<T> writeSupport throws me off. I can't find any further documentation or an example.
Staring at the source of AvroParquetWriter I see GenericData model pop up a few times but only one line mentioning SpecificData: GenericData model = SpecificData.get();.
So I have a few questions:
1) Does AvroParquetWriter not support Avro Specific Mapping? Or does it by means of that SpecificData.get() method? The comment "Utilities for generated Java classes and interfaces." over SpecificData.class seems to suggest that, but how exactly should I proceed?
2) What's going on in the AvroParquetWriter constructor, is there an example or some documentation to be found somewhere?
3) More specifically: the signature of the WriteSupport method asks for 'Schema avroSchema' and 'GenericData model'. What does GenericData model refer to? Maybe I'm not seeing the forest because of all the trees here...
To give an example of what I'm aiming for, my central piece of Avro conversion code currently looks like this:
DatumWriter<MyData> avroDatumWriter = new SpecificDatumWriter<>(MyData.class);
DataFileWriter<MyData> dataFileWriter = new DataFileWriter<>(avroDatumWriter);
dataFileWriter.create(schema, avroOutput);
The Parquet equivalent currently looks like this:
AvroParquetWriter<SpecificRecord> parquetWriter = new AvroParquetWriter<>(parquetOutput, schema);
but this is not more than a beginning and is modeled after the examples I found, using the deprecated constructor, so will have to change anyway.
Thanks,
Thomas
[0] Hadoop - The definitive Guide, O'Reilly, https://gist.github.com/hammer/76996fb8426a0ada233e, http://www.programcreek.com/java-api-example/index.php?api=parquet.avro.AvroParquetWriter
Try AvroParquetWriter.builder:
MyData obj = ...; // an instance of the Avro-generated class
ParquetWriter<MyData> pw = AvroParquetWriter.<MyData>builder(file)
        .withSchema(obj.getSchema())
        .build();
pw.write(obj);
pw.close();
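If you want the Specific mapping stated explicitly, the builder also exposes a withDataModel hook; a sketch under that assumption (the newer org.apache.parquet packages, output path, and codec choice are illustrative):
import org.apache.avro.specific.SpecificData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

MyData obj = ...; // an instance of the generated MyData class, as in your Avro snippet
try (ParquetWriter<MyData> writer =
        AvroParquetWriter.<MyData>builder(new Path("data.parquet"))
            .withSchema(MyData.getClassSchema())  // schema generated from your IDL
            .withDataModel(SpecificData.get())    // Specific rather than Generic mapping
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .build()) {
    writer.write(obj);
}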
Thanks.

Extracting value from complex hash in Ruby

I am using an API (zillow) which returns a complex hash. A sample result is
{"xmlns:xsi"=>"http://www.w3.org/2001/XMLSchema-instance",
"xsi:schemaLocation"=>"http://www.zillow.com/static/xsd/SearchResults.xsd http://www.zillowstatic.com/vstatic/5985ee4/static/xsd/SearchResults.xsd",
"xmlns:SearchResults"=>"http://www.zillow.com/static/xsd/SearchResults.xsd", "request"=>[{"address"=>["305 Vinton St"], "citystatezip"=>["Melrose, MA 02176"]}],
"message"=>[{"text"=>["Request successfully processed"], "code"=>["0"]}],
"response"=>[{"results"=>[{"result"=>[{"zpid"=>["56291382"], "links"=>[{"homedetails"=>["http://www.zillow.com/homedetails/305-Vinton-St-Melrose-MA-02176/56291382_zpid/"],
"graphsanddata"=>["http://www.zillow.com/homedetails/305-Vinton-St-Melrose-MA-02176/56291382_zpid/#charts-and-data"], "mapthishome"=>["http://www.zillow.com/homes/56291382_zpid/"],
"comparables"=>["http://www.zillow.com/homes/comps/56291382_zpid/"]}], "address"=>[{"street"=>["305 Vinton St"], "zipcode"=>["02176"], "city"=>["Melrose"], "state"=>["MA"], "latitude"=>["42.466805"],
"longitude"=>["-71.072515"]}], "zestimate"=>[{"amount"=>[{"currency"=>"USD", "content"=>"562170"}], "last-updated"=>["06/01/2014"], "oneWeekChange"=>[{"deprecated"=>"true"}], "valueChange"=>[{"duration"=>"30", "currency"=>"USD", "content"=>"42749"}], "valuationRange"=>[{"low"=>[{"currency"=>"USD",
"content"=>"534062"}], "high"=>[{"currency"=>"USD", "content"=>"590278"}]}], "percentile"=>["0"]}], "localRealEstate"=>[{"region"=>[{"id"=>"23017", "type"=>"city",
"name"=>"Melrose", "links"=>[{"overview"=>["http://www.zillow.com/local-info/MA-Melrose/r_23017/"], "forSaleByOwner"=>["http://www.zillow.com/melrose-ma/fsbo/"],
"forSale"=>["http://www.zillow.com/melrose-ma/"]}]}]}]}]}]}]}
I can extract a specific value using the following:
result = result.to_hash
p result["response"][0]["results"][0]["result"][0]["zestimate"][0]["amount"][0]["content"]
It seems odd to have to specify the index of each element in this fashion. Is there a simpler way to obtain a named value?
It looks like this should be parsed into XML. According to the Zillow API Docs, it returns XML by default. Apparently, "to_hash" was able to turn this into a hash (albeit, a very ugly one), but you are really trying to swim upstream by using it this way. I would recommend using it as intended (xml) at the start, and then maybe parsing it into an easier to use format (like a JSON/Hash structure) later.
Nokogiri is GREAT at parsing XML! You can use XPath syntax for grabbing elements from the DOM, or even CSS selectors.
For example, to get an array of the "content" in every result:
response = #get xml response from zillow
results = Nokogiri::XML(response).remove_namespaces!
#using css
content_array = results.css("result content")
#same thing using xpath:
content_array = results.xpath("//result//content")
If you just want the content from the first result, you can do this as a shortcut:
content = results.at_css("result content").content
Since it is indeed XML dumped into a hash, you could use JSONPath to query the JSON, as sketched below.
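A minimal sketch using the jsonpath gem (an assumed dependency; the path expression is illustrative):
require 'json'
require 'jsonpath' # the 'jsonpath' gem

# Recursive descent: every "content" nested under an "amount" key
amount = JsonPath.on(result.to_json, '$..amount..content').first
puts amount # => "562170"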
