I have a process that imports some of the data from external sources to elasticsearch. I use C# and NEST client.
Some of the classes have string properties that contain JSON. The same property may contain a different JSON schema depending on the source.
I want to index and analyze json objects in these properties.
I tried object type mapping using [ElasticProperty(Type=FieldType.Object)] but it doesn't seem to help.
What is the right way to index and analyze these strings?
E.g. I import objects like the one below and then want to query all START events of customer 9876 that have status rejected. I then want to see how they are distributed over a period of time (using Kibana).
var e = new Event() { id = 123, source = "test-log", input = "{\"type\":\"START\",\"params\":[{\"name\":\"customerid\",\"value\":\"9876\"},{\"name\":\"region\",\"value\":\"EU\"}]}", result = "{\"status\":\"rejected\"}" };
As much as I want to use well-defined structures and schemas in BigQuery, I have a case now when I really need to export rows with a highly dynamic field into BigQuery.
This is the Go client I am using:
https://pkg.go.dev/cloud.google.com/go/bigquery
For brevity, here's my row struct in Go:
type Row struct {
    Name     string   `bigquery:"name"`
    Count    int64    `bigquery:"count"`
    Metadata Metadata `bigquery:"metadata"`
}
This Metadata struct is a slice of objects that have a set of nested fields. Some of those fields are unpredictable, which is why I cannot specify the schema explicitly and tag all the fields.
I have found the relatively new bigquery.JSONFieldType, which in theory is supposed to work in these cases.
Having said that, I still get the error "field metadata is not a record type" when exporting my data to BQ.
This doesn't make a lot of sense because I decided to use this JSON field type in order to avoid specifying the schema for the RECORD type.
Another option would be to use the auto detect feature but it's only available when exporting data from a JSON file in a GCS bucket.
Is there a workaround in this case? Let me know if I should add more details to the question in the comments and I will do so.
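One workaround I am experimenting with, sketched below (the names are mine, and the assumption that a JSON-typed column accepts a pre-marshalled string value is not verified against the client): marshal the dynamic metadata into a plain string field and declare that column as bigquery.JSONFieldType in an explicit schema, so no RECORD schema is required. The stdlib part looks like this:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Row mirrors the struct above, but the dynamic part is carried as a
// JSON-encoded string instead of a nested Go struct. The idea is to pair
// this with an explicit schema whose metadata column has
// Type: bigquery.JSONFieldType, so no RECORD schema is needed.
// (Assumption: the JSON column accepts a pre-marshalled string; not verified.)
type Row struct {
	Name     string `bigquery:"name"`
	Count    int64  `bigquery:"count"`
	Metadata string `bigquery:"metadata"` // JSON-encoded dynamic payload
}

// NewRow marshals arbitrary metadata into the string field.
func NewRow(name string, count int64, metadata any) (Row, error) {
	b, err := json.Marshal(metadata)
	if err != nil {
		return Row{}, err
	}
	return Row{Name: name, Count: count, Metadata: string(b)}, nil
}

func main() {
	r, _ := NewRow("example", 2, map[string]any{"region": "EU", "tags": []string{"a", "b"}})
	fmt.Println(r.Metadata) // map keys are marshalled in sorted order
}
```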
My log POCO has several fixed properties, like user id and timestamp, plus a flexible data bag property, which is a JSON representation of any extra information I'd like to add to the log. The property names inside this data bag could be anything, which raises two questions:
How can I configure the mapping so that the data bag property, which is of type string, would be mapped to a JSON object during the indexing, instead of being treated as a normal string?
Since the data bag object has arbitrary property names, the overall document type could end up with a huge number of properties. Would this hurt search performance?
For the data translation from string to JSON you can use an ingest pipeline with the JSON processor:
https://www.elastic.co/guide/en/elasticsearch/reference/master/json-processor.html
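A minimal pipeline sketch (the pipeline name and field names here are placeholders, not from the question): the json processor parses the string field and stores the result as a structured object in a target field.

```
PUT _ingest/pipeline/parse-databag
{
  "processors": [
    {
      "json": {
        "field": "databag",
        "target_field": "databag_parsed"
      }
    }
  ]
}
```

Index your documents with `?pipeline=parse-databag` (or set the pipeline as the index default) and the parsed object becomes queryable like any other mapped field.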
It depends on your queries. If you use free-text search, then yes, a huge number of fields will slow the query. If you use queries like "field":"value", then no, the number of fields is not a problem for searches. You can find additional information about query optimization here:
https://www.elastic.co/guide/en/elasticsearch/reference/7.15/tune-for-search-speed.html#search-as-few-fields-as-possible
And the question is: what do you mean by "huge number"? 1,000? 10,000? 100,000? As part of optimization I recommend using dynamic templates with this definition: each string field is automatically indexed as "keyword" only, and not as text + keyword. This setting cuts the number of fields in half.
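A dynamic template along those lines looks like this (a sketch; the index and template names are illustrative): every new string field is mapped straight to keyword instead of the default text + keyword pair.

```
PUT my-index
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keyword": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}
```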
I'm starting with ElasticSearch.NET (trying Nest first).
A very basic question: all the search API methods I see (search, get, etc) require specifying a .NET type.
Isn't there a way to specify an index name so the API infers the response type automatically? In other words, is it mandatory to create POCO objects for all indexes we intend to search? (I understand from the documentation that Elasticsearch can infer a document type from an index by using the structure of the first document...)
Isn't there a way to specify an index name so the API infers the response type automatically?
Not currently. We've previously discussed doing something like this based on index patterns, which would be useful to support covariant responses across multiple indices when types are completely removed in the future.
In other words, is it mandatory to create POCO objects for all indexes we intend to search?
No, it's not mandatory. You can specify any type you desire for TDocument in IElasticClient.Search<TDocument>, and the type will be used to:

- determine the type into which to deserialize each _source document
- provide strongly typed access to document fields through their mapping to POCO properties
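For example (a sketch, untested; the index name is illustrative), you can search an index without a dedicated POCO by using a loosely typed TDocument such as a dictionary:

```csharp
var settings = new ConnectionSettings(new Uri("http://localhost:9200"));
var client = new ElasticClient(settings);

// No POCO: each _source is deserialized into a dictionary instead.
var response = client.Search<Dictionary<string, object>>(s => s
    .Index("logs")                  // illustrative index name
    .Query(q => q.MatchAll()));

foreach (var doc in response.Documents)
{
    // doc is a Dictionary<string, object>; you lose strongly typed
    // field access, but no class definition is required up front.
}
```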
It used to be, and the documentation still says: Each PFObject class may only have one key with a PFGeoPoint object.
But in my tests today, I created an object with 2 GeoPoint columns, was able to query on either GeoPoint, and was able to modify and save either GeoPoint. Previously, this would lead to an error like: only 1 ParseGeoPoint object can be stored in a class.
Is this really supported now?
Some additional info: I first have to create the 2 geoPoint columns in the data browser. If they don't exist and my iPhone code tries to save an object with 2 geoPoints, then I get the "only one GeoPoint field may exist in an object". But as long as the 2 columns exist, my client code appears to be able to use both.
As of July 2015, Parse still does not support more than one GeoPoint column on a class. They have, however, fixed the Data Browser to prevent users from creating two GeoPoint columns.
Got this response from Parse (in the Google Group forum):
Hmm, that sounds like a problem with the data browser's mechanism of altering the schema. Could you report a bug? I would not recommend using objects created in this way - the underlying data store can only index one geopoint field per object, so whichever field gets indexed second just will have the index fail and you won't be able to run queries against it.
The solution is to put the second GeoPoint (which you will not be able to search on) into a singleton array.
I'm looking to search for a particular JSON document in a bucket and I don't know its document ID; all I know is the value of one of the sub-keys. I've looked through the API documentation but I'm still confused when it comes to my particular use case:
In mongo I can do a dynamic query like:
bucket.get({ "name" : "some-arbritrary-name-here" })
With Couchbase I'm under the impression that you need to create an index (for example on the name property) and use startKey / endKey, but this feels wrong: could you still end up with multiple documents being returned? It would be nice to be able to pass a parameter to the view so an exact match could be performed. Also, how would we handle multi-dimensional searches, i.e. name and category?
I'd like to do as much of the filtering as possible on the couchbase instance and ideally narrow it down to one record rather than having to filter when it comes back to the App Tier. Something like passing a dynamic value to the mapping function and only emitting documents that match.
I know you can use LINQ with couchbase to filter but if I've read the docs correctly this filtering is still done client-side but at least if we could narrow down the returned dataset to a sensible subset, client-side filtering wouldn't be such a big deal.
Cheers
So you are correct on one point: you need to create a view (an index, indeed) to be able to query on the content of the JSON document.
So in your case you have to create a view with this kind of code:
function (doc, meta) {
  if (doc.type == "yourtype") { // good practice: discriminate documents by type
    emit(doc.name, null);
  }
}
So this will create an index, distributed across all the nodes of your cluster, that you can now use in your application. You can point to a specific value using the "key" parameter.
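For the multi-dimensional case from the question (name and category together), the same mechanism works with a composite array key. Here is a sketch you can run outside Couchbase with a stand-in emit; the type value and field names are illustrative:

```javascript
// A Couchbase view map function is plain JavaScript: the engine calls it
// for each document, and emit(key, value) adds an entry to the index.
// This tiny harness stands in for the view engine so the sketch is runnable.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Composite key: query the view with key=["some-name","some-category"]
// to match on both dimensions at once.
function mapNameCategory(doc, meta) {
  if (doc.type === "yourtype") { // discriminate documents by type, as above
    emit([doc.name, doc.category], null);
  }
}

mapNameCategory({ type: "yourtype", name: "acme", category: "widgets" }, {});
console.log(JSON.stringify(rows[0].key)); // → ["acme","widgets"]
```

Querying with key=["acme","widgets"] then returns exactly the documents emitted under that composite key, so the exact-match filtering happens on the cluster rather than in the app tier.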