I am using the JDBC river to pull data into Elasticsearch from an Oracle database.
As mentioned in the link below, a left join can be used to collect multiple values of one column for the same primary-id record into a single JSON array. But if there is only one record after the left join, the river doesn't create an array; it puts the value directly into the JSON field.
This causes a problem for NEST, which can't work out the type of the object.
https://github.com/jprante/elasticsearch-jdbc#structured-objects
So, is there any way to force some fields to be an array even when there is just one value?
There is a way to do this using the bracket notation, as described here: JDBC river Bracket Notation.
So basically, if your SQL query has
Select tag as tag.name from tags
you need to change it to
Select tag as tag[name] from tags
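Assuming the bracket notation behaves as described in the structured-objects link above (the tag value "foo" below is made up, and the exact single-row behavior may depend on the river version), the tags should then come through as a structured object/array rather than a bare string field, roughly:
{
  "tag": [
    { "name": "foo" }
  ]
}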
Hope this helps.
I’m trying to tag my data according to a lookup table.
The lookup table has these fields:
• Key – represents the field name in the data I want to tag.
In the real data the field is a subfield of the “Headers” field.
An example for the “Key” field:
“*Server*” (* is a wildcard)
• Value – represents the wanted value of the field mentioned above.
The value in the lookup table is only a part of a string in the real data value.
An example for the “Value” field:
“Avtech”.
• Vendor – the value I want to add to the real data if a combination of field and value is found in a document.
An example for combination in the real data:
“Headers.Server : Linux/2.x UPnP/1.0 Avtech/1.0”
A match with that document in the look up table will be:
Key= Server (with wildcard on both sides).
Value = Avtech (with wildcards on both sides)
Vendor= Avtech
So basically I’ll need to add a field to that document with the value “Avtech”.
The subfields in “Headers” are dynamic fields that change from document to document.
If a match is not found, I’ll need to add the tag field with the value “Unknown”.
I’ve tried to use the enrich processor, using the lookup table as the source data; the match field would be “Value” and the enrich field would be “Vendor”.
In the enrich processor I didn’t know how to refer to the field, since it’s dynamic, and I wanted to check whether the value appears anywhere in the “Headers” subfields.
Also, I don’t think there will be a match between the “Value” in the lookup table and the value of the Headers subfield, since the “Value” field in the lookup table is a substring with wildcards on both sides.
I could use some help accomplishing what I’m trying to do, and figuring out how to search with wildcards inside an enrich processor.
Or if you have another idea besides the enrich processor, such as parent-child or a terms-lookup mechanism.
Thanks!
Adi.
There are two ways to accomplish this:
Using the combination of Logstash & Elasticsearch
Using only the Elasticsearch ingest node
Constraint: You need to know the position of the Vendor term occurring in the Header field.
Approach 1
If so, you can use the grok filter to extract the term and, based on the term found, do a lookup to get the corresponding value.
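A rough Logstash sketch of that idea, assuming the vendor term can be grokked out of Headers.Server and the lookup table lives in a SQL database reachable via jdbc_streaming (connection string, table and field names below are all hypothetical):
filter {
  grok {
    # pull the token before the last "/version" into vendor_key, e.g. "Avtech"
    match => { "[Headers][Server]" => "%{GREEDYDATA} %{WORD:vendor_key}/%{BASE10NUM}" }
  }
  jdbc_streaming {
    # jdbc_driver_library may also be needed, depending on the driver
    jdbc_driver_class      => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost/lookupdb"
    jdbc_user              => "lookup_user"
    statement              => "SELECT vendor FROM vendor_lookup WHERE value = :key"
    parameters             => { "key" => "vendor_key" }
    target                 => "vendor"
  }
}
A conditional mutate filter could then add "Unknown" when the lookup returns nothing.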
Reference
https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-jdbc_static.html
https://www.elastic.co/guide/en/logstash/current/plugins-filters-jdbc_streaming.html
Approach 2
Create an index consisting of key-value pairs. On the ingest node, create a pipeline that consists of a grok processor followed by an enrich processor. Grok works the same way as in Approach 1, and you seem to have already got the enrich part working.
Reference
https://www.elastic.co/guide/en/elasticsearch/reference/current/grok-processor.html
If you are able to isolate the subfield within the Header where the term of interest is present, it will make things easier for you.
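A rough sketch of such a pipeline, assuming the enrich policy (here called vendor-policy) matches on the lookup index's Value field and returns Vendor, and that grok can isolate the vendor token into a made-up vendor_key field:
PUT _ingest/pipeline/tag-vendor
{
  "processors": [
    {
      "grok": {
        "field": "Headers.Server",
        "patterns": ["%{GREEDYDATA} %{WORD:vendor_key}/%{BASE10NUM}"],
        "ignore_failure": true
      }
    },
    {
      "enrich": {
        "policy_name": "vendor-policy",
        "field": "vendor_key",
        "target_field": "vendor_lookup",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "if": "ctx.vendor_lookup == null",
        "field": "vendor_lookup.Vendor",
        "value": "Unknown"
      }
    }
  ]
}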
I have a CSV file of more than 1M records written in English plus another language. I have to make a UI that takes a keyword, searches through the document, and returns the records where that keyword appears. I look for the keyword in two columns only.
Here is how I implemented it:
First, I made a Postgres database for the data stored in the CSV file. Then I made a classic website where the user can enter a keyword. This is the SQL query that I use (in Spring Boot):
SELECT * FROM table WHERE col1 LIKE %:keyword% OR col2 LIKE %:keyword%;
Right now it is working perfectly fine, but I was wondering how to make the search faster. Was using SQL better than a classic document search?
If the document is only searched once and then thrown away, then loading it into a database is overhead. Instead you can search the file directly using NIO and a parallel stream, which uses multiple threads to search the file concurrently:
// Files.lines expects a Path, and the stream should be closed, e.g. with try-with-resources
List<Record> result;
try (Stream<String> lines = Files.lines(Paths.get("some/path"))) {
    result = lines
            .parallel()
            .unordered()
            .map(l -> lineToRecord(l))
            .filter(r -> r.getCol1().contains(keyword) || r.getCol2().contains(keyword))
            .collect(Collectors.toList());
}
NOTE: you need to provide the lineToRecord() method and the Record class yourself, handle the IOException from Files.lines, and import java.nio.file.* and java.util.stream.* accordingly.
If the document is going to be searched over and over again, then you can think about indexing it. This means pre-processing the document to suit the search requirements; in this case that's the keywords in col1 and col2. An index is like a map in Java, e.g.:
Map<String, Record> col1Index
But since you have LIKE semantics, this is not so easy: it's not as simple as splitting the string on whitespace, since the keyword could match a substring. So in this case it might be best to look for a tool to help; typically this would be something like Solr/Lucene.
Databases can also provide similar functionality, e.g.: https://www.postgresql.org/docs/current/pgtrgm.html
For LIKE queries, you should look at the pg_trgm index type with the gin_trgm_ops operator class. You shouldn't need to change the query at all; just build the index on each column, or maybe one multi-column index.
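For example, assuming the table is called records and the two searched columns are col1 and col2 (run once against the database):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- trigram GIN indexes let Postgres use an index for LIKE '%keyword%' searches
CREATE INDEX records_col1_trgm_idx ON records USING gin (col1 gin_trgm_ops);
CREATE INDEX records_col2_trgm_idx ON records USING gin (col2 gin_trgm_ops);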
I'm new to RethinkDB and I love it, but I found some problems when I tried to optimize my query and make it work on bigger datasets.
The problem is simple.
I need to filter my "event" table by timestamp (row.to) and by tag (row.tags), order by timestamp (row.from), and then slice for pagination.
row.tags has a multi index and works well!
row.from and row.to are start/end time of Event.
The slow query (tested on 100k entries) is this:
r.db("test").table("event")
.getAll(r.args(["148a6e03-b6c3-4092-afa0-3b6d1a4555cd","7008d4b0-d859-49f3-b9e0-2e121f000ddf"]), {"index": "tags"})
.filter(function(row) {return row("to").ge(r.epochTime(1480460400));})
.orderBy(r.asc("from"))
.slice(0,20)
I created an index on 'from' and tried to do
.orderBy(r.asc("from"),{index:'from'})
but I get
e: Indexed order_by can only be performed on a TABLE or TABLE_SLICE in:
I have already read about the problems with index intersection in RethinkDB, but maybe I'm missing something; maybe there is a way of doing this simple task.
Thank you.
The reason RethinkDB complains is this:
getAll returns a selection. When filter is applied to a selection it returns a selection. When orderBy is applied to a selection the index parameter can't be used (it can only be used when orderBy is applied to a table).
orderBy can be applied to a table, sequence or selection. Only when it's applied to a table can the index parameter be used. This makes sense, as the index is updated when rows are added to and removed from the table.
In your case, you are applying orderBy to the result of filter, which is a selection. In order to sort a selection the database needs to:
read all elements into memory (by default the max is 100,000 elements)
sort them using the provided function or field
and it can't use an index in this case.
The way to improve your query might be to sort the table first and then apply the filter. You will be able to use the index in this case.
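A sketch of that reordering, reusing the table, index and values from the question (whether it is actually faster depends on how selective the tag filter is):
r.db("test").table("event")
  // indexed sort is allowed here because orderBy is applied directly to the table
  .orderBy({index: r.asc("from")})
  .filter(function(row) {
    return row("to").ge(r.epochTime(1480460400))
      .and(r.expr(["148a6e03-b6c3-4092-afa0-3b6d1a4555cd",
                   "7008d4b0-d859-49f3-b9e0-2e121f000ddf"])
        .setIntersection(row("tags")).count().gt(0));
  })
  .slice(0, 20)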
I have a question regarding the setup of my Elasticsearch index... I have created a table which I have rivered into an Elasticsearch index. The table is built from a script that queries multiple tables to denormalize the data, making it easier to index by a unique id (a 1:1 ratio).
An example of a set of fields I have is street, city, state, zip, which I can query on. But my question is: should I keep those fields individually indexed, or concatenate them into one big field like address which contains all of the previous fields? Or should I put in the extra time to set up parent-child indexes?
The example use case is: I have a customer with billing info coming in from one direction, and I want to query Elasticsearch to see whether that customer already exists, or at least return the closest result.
I know this question is more conceptual than about programming; I just can't find any information on best practices.
Concatenation
For the first part of your question: I wouldn't concatenate the different fields into one field containing all the information. Having multiple fields gives you the advantage of calculating facets and aggregates on those fields, e.g. how many customers are from a specific city or have a specific zip. You can still use a match or multi_match query to query for information from different fields.
In addition to having the information in separate fields I would use multifields with an analyzed and not_analyzed part (fieldname.raw). This again allows for aggregates, facets and sorting.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
Think of 'New York': if you analyze it, it will be stored as ['New', 'York'] and you will not be able to see all people from 'New York'. What you'd see are all people from 'New' and 'York'.
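For example, with the 0.90-era multi_field syntax from the link above (the city field name is just a placeholder):
"city": {
  "type": "multi_field",
  "fields": {
    "city": { "type": "string", "index": "analyzed" },
    "raw":  { "type": "string", "index": "not_analyzed" }
  }
}
Queries and facets can then use city for full-text matching and city.raw for exact values, sorting and aggregates.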
_all field
There is a special _all field in Elasticsearch which does the concatenation in the background; you don't have to do it yourself. It is possible to enable or disable it.
Parent Child relationship
Concerning whether to use nested objects or a parent-child relationship: I think a parent-child relationship is more appropriate for your case. Nested objects are stored in a 'flattened' way, i.e. the information from the nested objects in arrays is stored as part of one object. Consider the following example:
You have an order for a client:
client: 'Samuel Thomson'
    orderline: 'Strong Thinkpad'
    orderline: 'Light Macbook'
client: 'Jay Rizzi'
    orderline: 'Strong Macbook'
Using nested objects, if you search for clients who ordered a 'Strong Macbook' you'd get both clients. This is because 'Samuel Thomson' and his orders are stored all together, i.e. ['Strong', 'Thinkpad', 'Light', 'Macbook']; there is no distinction between the two orderlines.
By using parent-child documents, the orderlines for the same client are not mixed together and preserve their identity.
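As a rough sketch of the parent-child setup for that example (index and type names are made up, using the pre-2.x mapping syntax):
PUT /shop
{
  "mappings": {
    "client": {
      "properties": { "name": { "type": "string" } }
    },
    "orderline": {
      "_parent": { "type": "client" },
      "properties": { "description": { "type": "string" } }
    }
  }
}
Orderline documents are then indexed with a parent parameter pointing at the client's id, and you can search them with has_child / has_parent queries.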
I'm writing a small torrent indexer in Ruby (here) and would love to support MongoDB as an option for the database. Currently, I have the database set up with a many-to-many relationship between tags and torrents.
How would I format a query that gets all the torrent_ids from a map table that match to all the tags in a given list?
I did this in SQL like this:
select torrent_id, count(*) num from tagmap where tag_id in (tag1, tag2, tag3, tag4) group by torrent_id having num = 4
EDIT: I am right now working only with the collection with torrent_id and tag_id. That's all it has in there. So I'm mapping ids to ids and naught more.
It's better to create a collection holding the mapping of tag_ids to torrent_ids. Whenever you add a torrent, also add the torrent's tags to that mapping collection. The index should be on tag_id.
You can use the following query to get the mapping entries that match any of the given tags:
db.tagmap.find({tag_id:{$in: ['tag1','tag2','tag3','tag4']}});
For the aggregation (group by, count) you need to use MapReduce, for example:
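A rough mapReduce sketch of that SQL query, assuming the mapping collection is called tagmap and holds { torrent_id, tag_id } documents:
var tags = ["tag1", "tag2", "tag3", "tag4"];
var res = db.tagmap.mapReduce(
  function () { emit(this.torrent_id, 1); },            // map: one count per matching mapping row
  function (key, values) { return Array.sum(values); }, // reduce: total matching tags per torrent
  {
    query: { tag_id: { $in: tags } },  // only look at rows for the requested tags
    out: { inline: 1 }
  }
);
// keep only the torrents that matched every requested tag
var torrentIds = res.results
  .filter(function (r) { return r.value === tags.length; })
  .map(function (r) { return r._id; });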