How should a TSV file be formatted in DRUID? - hadoop

I am curious how a TSV file should look when we are ingesting data from a local TSV file using DRUID.
Should it just be like:
Please note this is just for testing:
quickstart/sample_data.tsv file:
name	lastname	email	time
Bob	Jones	bobj#gmail.com	1468839687
Billy	Jones	BillyJ#gmail.com	1468839769
Where this part is my dimensions: name, lastname, email
And this part is my actual data:
Bob	Jones	bobj#gmail.com	1468839687
Billy	Jones	BillyJ#gmail.com	1468839769
{
  "type" : "index_hadoop",
  "spec" : {
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "static",
        "paths" : "quickstart/sample_data.tsv"
      }
    },
    "dataSchema" : {
      "dataSource" : "local",
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "hour",
        "queryGranularity" : "none",
        "intervals" : ["2016-07-18/2016-07-18"]
      },
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "tsv",
          "dimensionsSpec" : {
            "dimensions" : [
              "name",
              "lastname",
              "email"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
        }
      },
      "metricsSpec" : [
        {
          "name" : "count",
          "type" : "count"
        },
        {
          "name" : "added",
          "type" : "longSum",
          "fieldName" : "deleted"
        }
      ]
    }
  }
}
I had some questions about my spec file as well, since I was not able to find the answers in the docs. I would appreciate it if someone could answer them for me :)!
1) I noticed in the example spec they added the line "type" : "index_hadoop" at the very top. What would I put for the type if I am ingesting a TSV file from my local computer in the quickstart directory? Also, where can I read about the different values I should put for this "type" key? I didn't find an explanation for that in the docs.
2) Again, there is a type field in the ioConfig: "type" : "hadoop". What would I put for the type if I am ingesting a TSV file from my local computer in the quickstart directory?
3) For the timestampSpec, the time in my TSV file is in GMT. Is there any way I can use this as the format? I read that it should be converted to UTC, so is there a way to convert it to UTC while posting the data to the Overlord, or will I have to change all of those GMT timestamps to UTC myself, similar to this: "time":"2015-09-12T00:46:58.771Z"?

Druid supports two ways of ingesting batch data:
Hadoop Index Task
Index Task
The spec you are referring to is for a Hadoop Index Task, hence its "type" is "index_hadoop" and the ioConfig type is "hadoop".
Here is a sample spec for an Index Task which can read from a local file:
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": ["page", "language"]
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }, {
        "type": "doubleSum",
        "name": "added",
        "fieldName": "added"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2013-08-31/2013-09-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "examples/indexing/",
        "filter": "wikipedia_data.json"
      }
    }
  }
}
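Adapting that to the TSV case from the question, a rough sketch (not a tested spec) could look like the following. It assumes the quickstart directory as baseDir and the tab-separated columns name, lastname, email, time from the sample file; the epoch-second timestamps are handled with the "posix" timestamp format, the "tsv" parseSpec needs the column names listed explicitly (a header row in the file would either have to be removed or handled with the hasHeaderRow option in newer Druid versions), and the interval is widened to 2016-07-18/2016-07-19 so it actually contains the example timestamps:
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "local",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "tsv",
          "columns": ["name", "lastname", "email", "time"],
          "delimiter": "\t",
          "timestampSpec": {
            "column": "time",
            "format": "posix"
          },
          "dimensionsSpec": {
            "dimensions": ["name", "lastname", "email"]
          }
        }
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": ["2016-07-18/2016-07-19"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": "quickstart/",
        "filter": "sample_data.tsv"
      }
    }
  }
}
If that matches your setup, the spec would be POSTed to the Overlord's task endpoint, for example curl -X POST -H 'Content-Type: application/json' -d @spec.json http://localhost:8090/druid/indexer/v1/task (8090 being the usual Overlord port in the quickstart; adjust to your deployment).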

Related

Apache NiFi from unix timestamp to actual date not working

I have the following NiFi flow, with which I am struggling to generate a date out of a unix timestamp. I have not been able to find a solution since last year :(
First of all, I receive a file from a Kafka processor.
The data comes as text and looks as follows:
exclsns1,1671785280,1671785594,1671785608
The next step is to use a ConvertRecord processor and generate a Parquet file out of these incoming files.
For that, I have generated the following schemas:
Record Reader --> CSV Reader:
{
  "type" : "record",
  "name" : "spark_schema",
  "fields" : [ {
    "name" : "excelReader",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "time",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "starttime",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "endtime",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
Record Writer --> Parquet Record Set Writer
{
  "type" : "record",
  "name" : "spark_schema",
  "fields" : [ {
    "name" : "excelReader",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "time",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "starttime",
    "type" : { "type" : "int", "logicalType" : "date" },
    "default" : null
  }, {
    "name" : "endtime",
    "type" : { "type" : "long", "logicalType" : "timestamp-millis" },
    "default" : null
  } ]
}
Notice that I have tried different types for the data, but none of them solved my issue.
The next step is to go into a PartitionRecord processor, in which I use a ParquetReader and the same Parquet Record Set Writer controllers.
Besides that, I have defined 6 properties to help me identify why the data is not converted as expected:
a_endtime --> /endtime
a_endtime_converted --> format(/endtime, "yyyy/MM/dd/HH", "GMT")
a_startime --> /starttime
a_startime_converted --> format(/starttime, "yyyy/MM/dd/HH", "GMT")
a_time --> /time
a_time_converted --> format(/time, "yyyy/MM/dd/HH", "GMT")
However, once the flowfile reaches the success queue after PartitionRecord, I have the following values:
a_endtime --> 1671785608
a_endtime_converted --> 1970/01/20/08
a_startime --> 1671785594
a_startime_converted --> 1970/01/20/08
a_time --> 1671785280
a_time_converted --> 1970/01/20/08
The expected dates for these values would be:
1671785608 = Friday, December 23, 2022 8:53:28 AM
1671785594 = Friday, December 23, 2022 8:53:14 AM
1671785280 = Friday, December 23, 2022 8:48:00 AM
What am I doing wrong that the same date is generated for every value? Has anybody else faced a similar issue and could give me a hint on how to solve it?
Thank you :)
Unix time is counted in seconds since 1/1/1970.
NiFi is based on Java, and Java time is counted in milliseconds since 1/1/1970.
So you just have to multiply your value by 1000 before formatting it as a date. Interpreted as milliseconds, a value like 1671785280 is only about 19 days after the epoch, which is why every field formats to 1970/01/20/08.
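As an illustration, wherever NiFi Expression Language is available (for example in an UpdateAttribute processor; the attribute name time below is just a placeholder for this sketch), the scaling and formatting could look roughly like this:
${time:toNumber():multiply(1000):format("yyyy/MM/dd/HH", "GMT")}
The same idea applies in the record-based processors: the epoch-seconds value has to be scaled to milliseconds before format() will produce the expected 2022 dates.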

Elasticsearch conditional query : search nested if not exists check default value on parent

I'm new to Elasticsearch. I have a question about creating conditional search queries between nested and parent types.
I have the mapping below, named "homes":
PUT /homes/_mapping
{
  "properties": {
    "name": { "type": "text" },
    "ownerID": { "type": "integer" },
    "status": { "type": "text" }, // active or inactive
    "minDays": { "type": "integer" },
    "dayData": {
      "type": "nested",
      "properties": {
        "date": { "type": "date" },
        "minDays": { "type": "integer" },
        "closed": { "type": "boolean" }
      }
    }
  }
}
Homeowners can define a custom minDays for homes day by day. Of course, there is a default value for minDays on the parent.
I want to make a search query by date range.
For example, users can search homes by start and end date. If there is dayData defined with minDays on the start date, that value will be taken; if not, the parent value will be used for filtering.
How can I achieve that?
Update
I have documents like the ones below.
For example, if a user is searching availability for the dates between "2021-01-01" and "2021-01-04", the total day count is "3" for the requested dates. We need to search for homes which have a minDays value greater than or equal to "3".
In this situation:
I need to get docs whose status is "active".
The min day value for _doc/1 is "2", because there is a custom dayData entry with minDays, so this property won't be available.
The min day value for _doc/2 is "4". We get the default value from the parent, because there is no custom dayData entry, so this property will be available.
If the dayData includes a "closed" value of "true" between the searched dates, the home won't be available.
I have tried several search queries to achieve this but failed.
Thank you in advance.
PUT homes/_doc/1
{
  "name" : "Home Test",
  "status" : "active",
  "ownerID" : 1,
  "minDays" : "3",
  "dayData" : [
    {
      "date" : "2021-01-01",
      "minDays" : 2
    },
    {
      "date" : "2021-01-04",
      "minDays" : 5
    },
    {
      "date" : "2021-01-05",
      "closed" : true
    }
  ]
}
PUT homes/_doc/2
{
  "name" : "Home Test 2",
  "status" : "active",
  "ownerID" : 2,
  "minDays" : "4",
  "dayData" : [
    {
      "date" : "2021-05-02",
      "minDays" : 2
    },
    {
      "date" : "2021-05-03",
      "minDays" : 2
    },
    {
      "date" : "2021-05-03",
      "minDays" : 2,
      "closed" : true
    },
    {
      "date" : "2021-05-10",
      "minDays" : 5
    }
  ]
}
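Under the logic described above (the effective minDays is the nested value on the start date when present, otherwise the parent field, the home must be "active", and it must not have a closed day in the requested range), one possible shape for such a query is sketched below. It is untested, the dates are just the example's "2021-01-01"/"2021-01-04", and the range boundaries are a guess at the intended semantics:
GET homes/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "status": "active" } },
        {
          "bool": {
            "should": [
              {
                "nested": {
                  "path": "dayData",
                  "query": {
                    "bool": {
                      "filter": [
                        { "term": { "dayData.date": "2021-01-01" } },
                        { "range": { "dayData.minDays": { "gte": 3 } } }
                      ]
                    }
                  }
                }
              },
              {
                "bool": {
                  "must_not": {
                    "nested": {
                      "path": "dayData",
                      "query": {
                        "bool": {
                          "filter": [
                            { "term": { "dayData.date": "2021-01-01" } },
                            { "exists": { "field": "dayData.minDays" } }
                          ]
                        }
                      }
                    }
                  },
                  "filter": [
                    { "range": { "minDays": { "gte": 3 } } }
                  ]
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ],
      "must_not": [
        {
          "nested": {
            "path": "dayData",
            "query": {
              "bool": {
                "filter": [
                  { "range": { "dayData.date": { "gte": "2021-01-01", "lte": "2021-01-04" } } },
                  { "term": { "dayData.closed": true } }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
The should block with minimum_should_match: 1 is what implements the fallback: either a matching nested entry satisfies the minDays check, or no nested entry exists for the start date and the parent minDays is checked instead.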

Druid hadoop batch supervisor: Could not resolve type id 'index_hadoop' as a subtype of SupervisorSpec

I'm trying to launch a Druid supervisor to ingest Parquet data stored in Hadoop. However, I am getting the following error and I can't find any information about it:
"error":"Could not resolve type id 'index_hadoop' as a subtype of
[simple type, class
io.druid.indexing.overlord.supervisor.SupervisorSpec]: known type ids
= [NoopSupervisorSpec, kafka]\n at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP)
I tried to fix it by loading the hadoop deep storage, parquet and avro extensions in the extensions load list, but this didn't work.
This is my supervisor JSON configuration:
{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "hadoop-batch-timeseries",
      "parser" : {
        "type" : "parquet",
        "parseSpec" : {
          "format" : "parquet",
          "flattenSpec" : {
            "useFieldDiscovery" : true,
            "fields" : []
          },
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "dimensionsSpec" : {
            "dimensions" : [ "installation", "var_id", "value" ],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        {
          "type" : "count",
          "name" : "count"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2018-10-01/2018-11-30" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
      "inputSpec" : {
        "type" : "granularity",
        "dataGranularity" : "day",
        "inputFormat" : "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "inputPath" : "/warehouse/tablespace/external/hive/demo.db/integers",
        "filePattern" : "*.parquet",
        "pathFormat" : "'year'=yyy/'month'=MM/'day'=dd"
      }
    },
    "tuningConfig" : {
      "type" : "hadoop"
    }
  },
  "hadoopDependencyCoordinates" : "3.1.0"
}
I ran into the same issue. I solved it by submitting the spec as a task instead of as a supervisor (the supervisor API only knows about the supervisor types listed in the error, such as kafka, not batch specs):
curl -X POST -H 'Content-Type: application/json' -d @my-spec.json http://my-druid-coordinator-url:8081/druid/indexer/v1/task
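If it helps, the POST returns a task id, and the same indexer API can then be used to check progress (host and port as in the command above; <taskId> is a placeholder):
curl http://my-druid-coordinator-url:8081/druid/indexer/v1/task/<taskId>/status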

Initial script for Elasticsearch

Is it possible to create an initial script for Elasticsearch?
For example, I prepare one JSON file with an index, 20 users, and 20 books.
I want to load it with a single request.
Example file:
PUT eyes
{
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "_doc" : {
      "properties" : {
        "name" : { "type" : "text" },
        "color" : { "type" : "text" }
      }
    }
  }
}
PUT eyes/_doc/1
{
  "name" : "XXX",
  "color" : "red"
}
PUT eyes/_doc/2
{
  "name" : "XXXX",
  "color" : "blue"
}
You can use the bulk API to populate your index in one single call.
https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-bulk.html
PUT /eyes/_doc/_bulk
{"index":{"_id":1}}
{"name":"XXX","color":"red"}
{"index":{"_id":2}}
{"name":"XXX","color":"blue"}
{"index":{"_id":3}}
{"name":"XXX","color":"green"}

Exact phrase match in ElasticSearch

I'm trying to achieve exact phrase search in Elastic, using my existing (full-text) index. When a user searches for, say, "Sanity Testing", the result should bring back all the docs containing "Sanity Testing" (case-insensitive), but not "Sanity Tested".
My mapping:
{
  "doc": {
    "properties": {
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "file": {
            "type": "string",
            "term_vector": "with_positions_offsets",
            "analyzer": "o3analyzer",
            "store": true
          },
          "title" : { "store" : "yes" },
          "date" : { "store" : "yes" },
          "keywords" : { "store" : "yes" },
          "content_type" : { "store" : "yes" },
          "content_length" : { "store" : "yes" },
          "language" : { "store" : "yes" }
        }
      }
    }
  }
}
As I understand it, there's a way to add another index with a "raw" analyzer, but I'm not sure this will work because the search needs to be case-insensitive. I also don't want to rebuild indexes, as there are hundreds of machines with tons of documents already indexed, so it may take ages.
Is there a way to run such a query? I'm currently searching with the following query:
{
  "query": {
    "match_phrase": {
      "file": "Sanity Testing"
    }
  }
}
and it brings me both "Sanity Testing" and "Sanity Tested".
Any help appreciated!
