Azure Data Factory from Cosmos documents to Blob files

Want to use Azure Data Factory to pull Cosmos documents and copy each document to a file (blob) in Storage where the file name == document id with file suffix == json. E.g. document content { id: "0001", name: "gary" } would be a blob named 0001.json with the same content { id: "0001", name: "gary" }.
Have a Cosmos dataset targeting a Cosmos collection of JSON documents, plus a Blob Storage dataset with an absolute file path (container + file name). Using a Copy Data activity in a pipeline, the setup is vanilla with source == Cosmos dataset (no schema) and sink == Blob dataset (no schema), which creates a single line-delimited blob with one line per Cosmos document.
Guessing that a Lookup feeding a ForEach with a Copy Data activity would be right. A Lookup against the Cosmos dataset should yield items (the collection of JSON documents). The ForEach settings ought to be Sequential and something like items == #activity('Find').output.value.id where Find is the name of the Lookup step. The items expression is probably way off (help here). Choosing an activity for the ForEach is the big question. Can Copy Data be bent to act on each item of the ForEach (only datasets available for the source + sink)?

Can Copy Data be bent to act on each item of the ForEach (only datasets available for the source + sink)?
Yes, the Copy activity can be used inside a ForEach activity; you just need to keep the sink dataset's file name dynamic. You can create a parameter on the Blob dataset and set its value from the current item of the ForEach (e.g. @item().id rather than @activity('Find').output.value.id, since inside the loop you reference the current item), so that each iteration produces a new file name.
There is a similar scenario explained in a third-party tutorial by sqlkover.com. You can refer to it for a better understanding.
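A rough sketch of the key expressions, assuming the Lookup activity is named Find, the Blob dataset exposes a FileName parameter, and the Cosmos source allows a query (all of these names are placeholders):
ForEach items:                    @activity('Find').output.value
Sink dataset FileName parameter:  @concat(item().id, '.json')
Copy source query (one document): @concat('SELECT * FROM c WHERE c.id = "', item().id, '"')
The Lookup returns the full result set, the ForEach iterates over it, and each Copy Data run writes one document to a blob named after the current item's id. Keep in mind that the Lookup activity has a row/size cap, so very large collections may need a different approach.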

Related

Search in Liferay

I have created a structure that is used by web content. The structure contains two fields: Area Name and Zip Code. I store data in web content based on this structure.
I want to search the data by the zip code or area name entered by the user, and provide a dropdown so the user can choose the search criterion (by Zip Code / by Area Name).
The problem is that web content data is stored in XML format, so whenever a user searches for a keyword, all results that contain the given text are returned. I want to restrict that.
I am using this method to search the data:
List<JournalArticle> results = JournalArticleLocalServiceUtil.search(
    themeDisplay.getCompanyId(),    // companyId
    themeDisplay.getScopeGroupId(), // groupId
    null,                           // folderIds (List<Long>)
    0,                              // classNameId
    null,                           // version
    null,                           // title
    null,                           // description
    searchkeyword,                  // here put your keywords to search
    new Date(),                     // startDate
    new Date(),                     // endDate
    0,                              // status
    null,                           // reviewDate
    QueryUtil.ALL_POS,
    QueryUtil.ALL_POS,
    null);
You have to use the Liferay search API, which queries Elasticsearch directly, filtering by DDM fields.
Each field of a web content structure is translated to a DDM field on the Elasticsearch side.
There is information about how to query JournalArticle filtering by DDM fields in the following links:
Liferay documentation: https://portal.liferay.dev/docs/7-2/frameworks/-/knowledge_base/f/search-queries-and-filters
Stackoverflow 1: http://stackoverflow.com/questions/24523689/liferay-6-2-custom-fulltext-search-portlet-implementation#24564208
Stackoverflow 2: http://stackoverflow.com/questions/30982986/get-distinct-values-for-a-group-of-fields-from-a-list-of-records
Liferay forums 1: https://community.liferay.com/forums/-/message_boards/message/108179215
Liferay forums 2: https://community.liferay.com/forums/-/message_boards/message/86997787
Liferay forums 3: https://community.liferay.com/forums/-/message_boards/message/84031533
(note: some links relate to version 6.2, but the 7.x queries should be very similar)
In Liferay Portal 7.x, the names of the DDM fields that you have to query are built in the DDMIndexerImpl.encodeName(...) method, see:
https://github.com/liferay/liferay-portal-ee/blob/cba537c39989bbd17606e4de4aa6b9ab9e81b30c/modules/apps/dynamic-data-mapping/dynamic-data-mapping-service/src/main/java/com/liferay/dynamic/data/mapping/internal/util/DDMIndexerImpl.java#L243-L268
DDM field names follow this pattern:
Fields that are configured as keyword: ddm__keyword__{structureId}__{fieldname}_{locale}
Other fields: ddm__{structureId}__{fieldname}_{locale}
Note: in order to get the structureId, you should query DDMStructure filtering by the structureKey. If you hardcode the structureId, you can run into problems when you export/import the structure, because the structureId is recalculated during the import process.
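As an illustration, a rough Java sketch of that approach (the structure key "AREA-STRUCTURE", the field name "zipCode", and the request / themeDisplay / userEnteredZipCode variables are placeholders; the ddmStructureFieldName / ddmStructureFieldValue attributes are the ones the JournalArticle indexer is expected to honor, so verify them against your Liferay version):
long classNameId = PortalUtil.getClassNameId(JournalArticle.class);

// Resolve the structureId from the structure key instead of hardcoding it
DDMStructure structure = DDMStructureLocalServiceUtil.getStructure(
    themeDisplay.getScopeGroupId(), classNameId, "AREA-STRUCTURE");

// Keyword field pattern: ddm__keyword__{structureId}__{fieldname}_{locale}
String ddmFieldName = "ddm__keyword__" + structure.getStructureId()
    + "__zipCode_" + themeDisplay.getLanguageId();

// request is the underlying HttpServletRequest (e.g. PortalUtil.getHttpServletRequest(renderRequest))
SearchContext searchContext = SearchContextFactory.getInstance(request);
searchContext.setAttribute("ddmStructureFieldName", ddmFieldName);
searchContext.setAttribute("ddmStructureFieldValue", userEnteredZipCode);

Indexer<JournalArticle> indexer =
    IndexerRegistryUtil.nullSafeGetIndexer(JournalArticle.class);
Hits hits = indexer.search(searchContext);
Hits then contains only the articles whose Zip Code field matches the entered value, instead of every article that merely contains the text somewhere in its XML.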

How to handle hive/avro schema evolution with new fields added in the middle of existing fields?

I have been told that the only way for Hive to be able to process the addition of new fields to an avro schema is if the new fields are added at the end of the existing fields. Currently our avro generation is alphabetical, so a new field could show up elsewhere in the field list.
So, can Hive handle this or not? I know next to nothing about Hive. I can see there are good explanations of how to add new fields from Avro, but I can't seem to find any info on whether the location of the added field affects Hive's ability to process them.
As an example, see below. How could the new schema be processed in Hive?
Original Schema
{
"type":"record","name":"user",
"fields":[
{"name":"bday","type":"string"},
{"name":"id","type":"long"},
{"name":"name","type":"string"}
]
}
New Schema (Added field in alphabetical order)
{
"type":"record","name":"user",
"fields":[
{"name":"bday","type":"string"},
{"name":"id","type":"long"},
{"name":"gender","type":"string"},
{"name":"name","type":"string"}
]
}
Yes, Hive can handle this because it's the way Avro works:
if both are records:
the ordering of fields may be different: fields are matched by name
That's possible because every Avro file also embeds the schema that was used to write the data, the writer's schema.
So, when you change the schema in Hive (e.g. by modifying the file behind avro.schema.url), you change the reader's schema. All existing files and their writer's schemas remain untouched.
And yes, for every new field added you have to provide a default value (using "default": ...), regardless of field ordering. Otherwise, the reader (Hive) won't be able to parse files written with the original schema.
It is supported. You just have to take care to add a default value for the new fields so that data written with the older schema can still be read.
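For example, the added field in the new schema could be declared with a default; a nullable union with a null default is a common choice (a plain "string" type with an empty-string default would also work):
{"name":"gender","type":["null","string"],"default":null}
With that in place, pointing Hive at the new reader schema (e.g. updating the file referenced by avro.schema.url) is enough: records written with the original schema are read back with gender = null.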

Map JSON string property to ES object

I have a process that imports some data from external sources into Elasticsearch. I use C# and the NEST client.
Some of the classes have string properties that contain JSON. The same property may contain a different JSON schema depending on the source.
I want to index and analyze json objects in these properties.
I tried object type mapping using [ElasticProperty(Type=FieldType.Object)] but it doesn't seem to help.
What is the right way to index and analyze these strings?
E.g. I import objects like the one below and then want to query all START events of customer 9876 that have status rejected. I then want to see how they are distributed over a period of time (using Kibana).
var e = new Event() { id = 123, source = "test-log", input = "{type:'START',params:[{name:'customerid',value:'9876'},{name:'region',value:'EU'}]}", result = "{status:'rejected'}" };

ElasticSearch with Hadoop data duplication issue

I have a requirement as follows:
Whatever data is in Hadoop, I need to make it searchable (and vice versa).
So, for this, I use Elasticsearch, where the elasticsearch-hadoop plug-in can send data from Hadoop to Elasticsearch, and real-time search then becomes possible.
But my question is: isn't this a duplication of data? Whatever data is in Hadoop is duplicated in Elasticsearch, along with its indexing. Is there any way to get rid of this duplication, or is my understanding wrong? I have searched a lot but can't find any clue about this duplication issue.
If you specify an immutable ID for each row in Elasticsearch (e.g. a customerId), re-inserting existing data results only in updates, not duplicates.
Extract from the official documentation about the insertion method (cf. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/configuration.html#_operation):
index (default): new data is added while existing data (based on its id) is replaced (reindexed).
If you have a "customer" dataset in Pig, just store the data like this:
A = LOAD '/user/hadoop/customers.csv' USING PigStorage()
....;
B = FOREACH A GENERATE customerid, ...;
STORE B INTO 'foo/customer' USING org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes = localhost',
    'es.http.timeout = 5m',
    'es.index.auto.create = true',
    'es.input.json = true',
    'es.mapping.id = customerid',
    'es.batch.write.retry.wait = 30',
    'es.batch.size.entries = 500');
-- ,'es.mapping.parent = customer');
To run the search from Hadoop (i.e. query Elasticsearch from Pig), just use the custom loader:
A = LOAD 'foo/customer' USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?me*');

Passing parameters to a couchbase view

I'm looking to search for a particular JSON document in a bucket and I don't know its document ID; all I know is the value of one of the sub-keys. I've looked through the API documentation but am still confused when it comes to my particular use case:
In mongo I can do a dynamic query like:
bucket.get({ "name" : "some-arbritrary-name-here" })
With Couchbase I'm under the impression that you need to create an index (for example on the name property) and use startKey / endKey, but this feels wrong: could you still end up with multiple documents being returned? It would be nice to be able to pass a parameter to the view so that an exact match could be performed. Also, how would we handle multi-dimensional searches, i.e. name and category?
I'd like to do as much of the filtering as possible on the Couchbase instance and ideally narrow it down to one record rather than having to filter after the data comes back to the app tier. Something like passing a dynamic value to the mapping function and only emitting documents that match.
I know you can use LINQ with Couchbase to filter, but if I've read the docs correctly this filtering is still done client-side. Still, if we could narrow the returned dataset down to a sensible subset, client-side filtering wouldn't be such a big deal.
Cheers
So you are correct on one point: you need to create a view (which is indeed an index) to be able to query on the content of the JSON document.
So in your case you have to create a view with this kind of code:
function (doc, meta) {
  if (doc.type == "yourtype") { // just a good practice to type the doc
    emit(doc.name, null);       // the key is the field you want to look up by
  }
}
So this will create an index, distributed across all the nodes of your cluster, that you can now use in your application. You can point to a specific value using the "key" parameter.
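For example (a sketch; the bucket, design document and view names are placeholders), querying the view with an exact key returns only matching documents, and emitting a compound key covers the name + category case:
// Compound key for multi-dimensional, exact-match lookups (name + category)
function (doc, meta) {
  if (doc.type == "yourtype") {
    emit([doc.name, doc.category], null);
  }
}

// Query with an exact key (values are JSON-encoded, URL-encoded in practice):
//   GET /your_bucket/_design/your_ddoc/_view/by_name?key="some-arbritrary-name-here"
//   GET /your_bucket/_design/your_ddoc/_view/by_name_category?key=["some-arbritrary-name-here","books"]
Most Couchbase SDKs expose the same key option on their view query APIs, so the exact match is performed on the cluster rather than client-side.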
