I want to use a UUID as the primary key for one of my very high-volume tables in Apache Derby. Per the Derby docs, I should be using CHAR(16) FOR BIT DATA. My question: since this is a binary column that does not support sorting, how are the indexes ordered and managed? I have read explanations of how fragmentation occurs when out-of-order entries are added, so with a binary PK, since there is no sorting, won't entries simply get appended to the index page one after another? In that case, how does the database engine order and manage the index? What am I missing here? Can I index a binary column storing a UUID?
I referred to these links: http://kccoder.com/mysql/uuid-vs-int-insert-performance/ and How should I use UUID with JavaDB/Derby and JDBC? and http://www.informit.com/articles/article.aspx?p=25862&seqNum=7
We are using Elasticsearch on Amazon as our search engine. I recently discussed upsert tactics with one of our developers.
In my view (I am not a very experienced ES developer), it's OK to have a composite key as _id, e.g. Result-1, Data-2, etc. It helps with upserts and data deduplication. But a concern was raised about the key's datatype: a long key, such as a string, SHA-1 digest, hex value, etc., could affect search performance, so it would be better to use short keys, or to index into ES without a predefined _id and deduplicate via the document body or some specific properties.
I haven't read anything about ID performance, from the official docs to Medium posts and blogs.
Is the concern valid, and should I follow it?
Thank you!
The concern about using custom ID fields applies to the indexing phase: with auto-generated IDs, Elasticsearch can safely index the document without first checking for uniqueness. If you are OK with your indexing rate, then you should be fine.
If you look in the docs under Tune for Search Speed, there is no advice about using auto-generated IDs.
Relevant reads:
Choosing a fast unique identifier (UUID) for Lucene
Tune Elasticsearch for Search Speed
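The reason auto-generated (Flake-style, time-based) IDs index faster than random UUIDs is that consecutive IDs share a long common prefix and arrive in nearly sorted order, which is friendly to Lucene's terms dictionary. A small illustration of the difference (`flake_like_id` is a made-up stand-in, not Elasticsearch's actual ID generator):

```python
import time
import uuid

def flake_like_id(seq: int) -> str:
    """Hypothetical time-prefixed ID: millisecond timestamp plus a sequence
    number. Consecutive IDs share a long common prefix and sort naturally."""
    millis = int(time.time() * 1000)
    return f"{millis:013d}-{seq:06d}"

# Random UUIDs: consecutive values share essentially no prefix, so each
# insert/lookup lands in a different part of the terms index.
random_ids = [str(uuid.uuid4()) for _ in range(3)]

# Time-prefixed IDs: consecutive values are nearly sorted and share a prefix.
sequential_ids = [flake_like_id(i) for i in range(3)]

print(random_ids)
print(sequential_ids)
```

This is exactly the property discussed in the "Choosing a fast unique identifier (UUID) for Lucene" article linked above.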
In my Cassandra database I have a table of users, and I want a function that searches for users by their unique usernames. For that I would need to query all usernames from the user table so that I can filter them server-side: for the input "nark" I should also find the usernames "Mark", "Narkis", and so on, so I can't just use the username as a partition key and look up the exact value.
If I put them all in the same partition, it becomes a hot partition. If I distribute them over multiple partitions, I have to search all of them.
How can I query that efficiently for millions of users? Is there a way to search like that without scanning all usernames?
Thank you for your help!
Cassandra natively is not a good fit for such a use case; even extensive use of secondary indexes will be of minimal help here.
Nevertheless, if you already have all your data in C*, then to achieve such functionality you essentially need an indexing framework on top of it. The most widely used is Apache Solr (built on Lucene). I have seen Solr work like magic for fuzzy searching on C*, though nothing beats something like Elasticsearch for this use case.
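To make the requirement concrete: matching "Mark" for the input "nark" is fuzzy (edit-distance) matching, not prefix or equality lookup. The sketch below shows the behavior in plain Python; the full scan it performs is exactly the O(n) work a search index like Solr or Elasticsearch avoids by precomputing its own data structures:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_find(query: str, usernames, max_dist: int = 1):
    """Return usernames containing the query or within `max_dist` edits of it.
    Scanning every username like this is what you want the index to avoid."""
    q = query.lower()
    return [u for u in usernames
            if q in u.lower() or edit_distance(q, u.lower()) <= max_dist]

users = ["Mark", "Narkis", "Alice", "Bob"]
print(fuzzy_find("nark", users))  # → ['Mark', 'Narkis']
```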
I'm trying to decide how to structure the data in Elasticsearch.
I have a system that produces metrics on a daily basis. I would like to put those metrics into ES so I can do some advanced querying/sorting. I also only care about the most recent data in there. The system producing the data could also deliver it late.
Currently I can think of two options:
I can have one index with a date column containing the date the metric was created. I am unsure, however, how to write the query so that if multiple days' worth of data are in the index, I filter to just the most recent set.
I could also try to split the data into different indexes (recent and past) and have some process that migrates data from the recent index to the past index. I think the challenge there would be downtime while data is being moved and/or added to the recent index.
Thoughts?
A common approach to solving this problem with Elasticsearch is to store the data in one form that allows historic querying, and again in a second form that allows querying the most recent data. For example, if your metric update looked like:
{
  "type": "OperationsPerSecond",
  "name": "Questions",
  "value": 10
}
Then it can be indexed into a current-values index using a composite key constructed from the document (obviously, for this to work you'd need to be able to construct a composite key from your document!). For example, the identity for this document might be the type and name concatenated. You then leverage the update API's upsert support to write your updates to the same document:
POST current_metrics/_update/OperationsPerSecond-Questions
{
  "doc": {
    "type": "OperationsPerSecond",
    "name": "Questions",
    "value": 10
  },
  "doc_as_upsert": true
}
Every time you call this API with the same composite key, it will update the existing document rather than create a new one. This gives you an index that contains only a single document per metric you are monitoring, and you can query that index to get your most recent values.
To store your historic data, you change your primary-key strategy: it would probably be most straightforward to use the index API and let Elasticsearch generate a document ID for you.
POST all_metrics/_doc/
{
  "type": "OperationsPerSecond",
  "name": "Questions",
  "value": 10
}
This API creates a new document for every request made to it. So as long as your data contains something you can use in an Elasticsearch range query, such as a createdDate field with a date-time value, you will be able to query the historic data.
The main thing is: don't worry about duplicating your data for different purposes; Elasticsearch does a good job of compressing this stuff on disk and in memory. Storing data multiple times is called denormalization and is a pretty common technique in data warehousing and big data.
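The dual-write pattern above can be sketched as follows, with in-memory stand-ins for the two indices: `current` keyed by the composite ID (upsert semantics), `history` append-only as if Elasticsearch were auto-generating the _id:

```python
import itertools

current = {}   # composite key -> latest document (upsert semantics)
history = []   # append-only, like auto-generated _id in the historic index

_auto_id = itertools.count()

def record_metric(doc: dict) -> None:
    # Composite key from the document, e.g. "OperationsPerSecond-Questions".
    key = f"{doc['type']}-{doc['name']}"
    current[key] = doc                                  # upsert latest value
    history.append({"_id": next(_auto_id), **doc})      # always append

record_metric({"type": "OperationsPerSecond", "name": "Questions", "value": 10})
record_metric({"type": "OperationsPerSecond", "name": "Questions", "value": 12})

print(current["OperationsPerSecond-Questions"]["value"])  # → 12
print(len(history))                                       # → 2
```

The current index always reflects the latest value per metric, while every update survives in the historic index for range queries.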
I noticed that documents indexed into Elasticsearch using the Kafka Elasticsearch connector have their IDs in the format topic+partition+offset.
I would prefer to use IDs generated by Elasticsearch. It seems topic+partition+offset is not always unique, so I am losing data.
How can I change that?
As Phil says in the comments -- topic-partition-offset should be unique, so I don't see how this is causing data loss for you.
Regardless - you can either let the connector generate the key (as you are doing), or you can define the key yourself (key.ignore=false). There is no other option.
You can use Single Message Transforms with Kafka Connect to derive a key from the fields in your data. Based on your message on the Elasticsearch forum, it looks like there is an id field in your data; if that's guaranteed unique, you could set it as your key, and thus as your Elasticsearch document ID too. Here's an example of defining a key with SMT:
# Add the `id` field as the key using Single Message Transforms
transforms=InsertKey, ExtractId
# `ValueToKey`: copy the `id` field from the record value into the key (as a struct)
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=id
# `ExtractField$Key`: flatten the key struct down to the plain `id` value
transforms.ExtractId.type=org.apache.kafka.connect.transforms.ExtractField$Key
transforms.ExtractId.field=id
(via https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/)
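Conceptually, the two transforms do the following to each record (a plain-Python sketch of the idea, not the actual Connect API):

```python
def value_to_key(record: dict, fields: list) -> dict:
    """ValueToKey: build the record key as a struct of the named value fields."""
    record = dict(record)
    record["key"] = {f: record["value"][f] for f in fields}
    return record

def extract_field_key(record: dict, field: str) -> dict:
    """ExtractField$Key: replace the key struct with one plain field from it."""
    record = dict(record)
    record["key"] = record["key"][field]
    return record

record = {"key": None, "value": {"id": 42, "name": "example"}}
record = value_to_key(record, ["id"])        # key becomes {"id": 42}
record = extract_field_key(record, "id")     # key becomes 42
print(record["key"])  # → 42
```

The connector then uses that key as the Elasticsearch document ID instead of topic+partition+offset.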
@Robin Moffatt: as far as I can see, topic-partition-offset can cause duplicates in the case where you upgrade your Kafka cluster not in a rolling-upgrade fashion but by replacing the cluster wholesale (which is sometimes easier). In that case you can experience data loss because of overwritten documents.
Regarding your excellent example, it can be the solution for many cases, but I'd add another option: maybe you can add an epoch-timestamp element to the key, making it topic-partition-offset-current_timestamp.
What do you think?
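The suggested key could be composed along these lines (names are illustrative, not connector configuration):

```python
import time

def make_doc_id(topic: str, partition: int, offset: int) -> str:
    """Compose an Elasticsearch document ID from the Kafka coordinates plus an
    epoch-millisecond timestamp, so re-ingesting the same offsets from a
    replacement cluster cannot silently overwrite earlier documents."""
    return f"{topic}-{partition}-{offset}-{int(time.time() * 1000)}"

print(make_doc_id("metrics", 0, 12345))
```

Note the trade-off: with a timestamp in the key, redelivery of the same message no longer upserts the same document, so retries produce duplicates instead of overwrites.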
I want to know whether any tool is available for comparing database entries against an Elasticsearch index to find mismatches.
Thanks in advance.
There is a way to do this with the Scrutineer tool, which supports comparing data stored in Elasticsearch against a source of truth, usually a relational database.
After running this tool, you'll get a report of:
records in the source of truth and not in ES (missed create)
records in ES and not in the source of truth (missed delete)
records in ES and the source of truth which are out of sync (missed update)
Basically, this gives you an exact overview of the synchronization state of the two data stores you're comparing (ES + DB).
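The comparison such a tool performs can be sketched as a diff of two id→version maps (an illustration of the idea, not Scrutineer's actual code):

```python
def compare(source: dict, es: dict):
    """Compare two dicts of id -> version: the source of truth vs. the ES
    index. Returns ids missing in ES (missed create), extra in ES (missed
    delete), and present in both with differing versions (missed update)."""
    missed_create = sorted(source.keys() - es.keys())
    missed_delete = sorted(es.keys() - source.keys())
    missed_update = sorted(k for k in source.keys() & es.keys()
                           if source[k] != es[k])
    return missed_create, missed_delete, missed_update

db = {"a": 1, "b": 2, "c": 3}
index = {"b": 2, "c": 9, "d": 4}
print(compare(db, index))  # → (['a'], ['d'], ['c'])
```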
UPDATE 1:
Here is another interesting blog article on the subject: Elasticsearch: Verifying Data Integrity with External Data Stores
UPDATE 2:
Here is yet another interesting blog article on the subject: How to keep Elasticsearch synchronized with a relational database using Logstash
I believe not; this has the potential to be a very taxing operation. However, if you have used a unique PK from your database as the _id of the documents in Elasticsearch, then you could use the following command while iterating through your records:
curl -I 'http://localhost:9200/twitter/tweet/1'
This returns a status code indicating whether the document exists. You can then collect the _ids that don't exist and index those documents into Elasticsearch from your own bespoke script or application.
If this is not the case, the complexity of the problem rises significantly, as do the implications for the cluster.
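Such a bespoke script could be structured like this; the `exists` callable stands in for the HEAD request against a live cluster, stubbed here with a set so the sketch is self-contained:

```python
def find_missing(pks, exists):
    """Iterate over the database primary keys and return those whose documents
    are absent from Elasticsearch. In a real script, `exists(pk)` would issue
    HEAD /index/_doc/<pk> and return True on HTTP 200."""
    return [pk for pk in pks if not exists(pk)]

# Stub: pretend only these documents are already indexed.
indexed = {"1", "2", "4"}
missing = find_missing(["1", "2", "3", "4", "5"], lambda pk: pk in indexed)
print(missing)  # → ['3', '5']
```

The missing IDs are then the ones your script would re-index.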