How best to create a parent-child relationship in Elasticsearch

I have two real-time streams. One contains news articles and the other contains comments about the same articles. I'd like to create a parent-child relationship between each article and that article's comments. Except for the headline, there is no common id. I'd like to use the headline, which exists in both streams, and match the two streams on it every 15 minutes. I am assuming that 15 minutes would be sufficient to handle the delay between the two streams. How would you go about doing this? Any ideas would be appreciated.
A typical message, containing entity_name, source_name, and headline, which comes through Logstash, looks like this:
"Thomson Reuters Corp.","Japan Today","Trump claims victory after forcing NATO crisis talks"
Some typical comments, containing comment and headline, which come through Logstash in a separate pipeline, look like this:
"We applaud Trumps claim ...", "Trump claims victory after forcing NATO crisis talks"
"Nato crisis is important...", "Trump claims victory after forcing NATO crisis talks"
Specifically:
1. Keep indexes separate or create a third index from the first two?
2. How to run 15 min refresh cycles?
3. If there is a better way/tool/data store, please advise.

You can create a common id between comments and article by hashing the headline (supposing you never observe typos).
Yes, keep articles and comments in separate indices.
reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html
Need more specifics on what you mean by matching the streams. Not sure if there's a way to schedule jobs using Elasticsearch Task API... Maybe make a cronjob to do this? You can go through the articles index, hash the headline, and then query for that hash in the comments index.
Seems like you have a solid storage method right now.
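The headline hashing suggested above could be sketched as follows. This is a minimal illustration, not the answerer's code: the helper name headline_id, the choice of SHA-1, and the whitespace/case normalization are all assumptions. It still breaks on typos, as the answer warns.

```python
import hashlib
import re


def headline_id(headline: str) -> str:
    """Derive a join key from a headline (hypothetical helper)."""
    # Collapse whitespace runs and ignore case so that trivial formatting
    # differences between the two streams do not produce different ids.
    normalized = re.sub(r"\s+", " ", headline).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# The article headline arrives with a line break, the comment headline does not.
article_key = headline_id("Trump claims victory after\nforcing NATO crisis talks")
comment_key = headline_id("Trump claims victory after forcing NATO crisis talks")
print(article_key == comment_key)
```

The same key could be stored as a field on both the article and the comment documents, so a periodic job only needs to look up comments by that field.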

Related

Is it possible to total a schema field?

Apologies if this has come up before, but I couldn't find anything.
I am using GraphCMS (highly recommend it) and I have several fields that are floats. I am using them for prices. Each entry in the schema can either be a buy or sell in an enumeration field. I would like to total all the price fields where the entry is a buy, and total all the price fields where the entry is a sell.
I thought there would be something clear in the docs because totalling those fields would be very useful for something like calculating an average score etc. but I can only see docs about counting entries. Have I missed it somewhere?
Surely this is possible?
After speaking with the helpful guys at GraphCMS I have figured out how to do it. This is not a question about GraphCMS, as @puelo suggested; it can be achieved by running a GraphQL query and then programmatically running calculations on the data. It's a simple solution but wasn't immediately apparent to me at first. Hopefully that info will be useful to other newbies.
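For illustration, the client-side calculation might look like the sketch below. The list of entries stands in for the result of a GraphQL query, and the field names "type" and "price" as well as the values BUY and SELL are assumptions, not the actual schema.

```python
from collections import defaultdict

# Hypothetical shape of the entries returned by a GraphQL query.
entries = [
    {"type": "BUY", "price": 10.5},
    {"type": "SELL", "price": 4.0},
    {"type": "BUY", "price": 2.25},
]


def totals_by_type(rows):
    """Sum the price field separately for each value of the enumeration field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["type"]] += row["price"]
    return dict(totals)


totals = totals_by_type(entries)
print(totals)
```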

Datadog distinct-like custom metrics

Given the following scenario:
A lambda receives an event via SQS
The lambda receives a uuid pointing to an entity
The lambda may fail with an error
SQS will retry that particular entity several times
The lambda will be called with different entities thousands of times
Right now we monitor a custom error-count metric like myService.errorType.
Which gives us an exact number of how many times an error occurred - independent from a specific entity: If an entity can't be processed like 100 times, then the metric value will be 100.
What I'd like to have, though, is a distinct metric based on the UUID.
Example:
entity with id 123 fails 10 times
entity with id 456 succeeds
entity with id 789 fails 20 times
Then I'd like to have a metric with the value of 2 - because the processes failed for two entities only (and not for 30, as it would be reported right now).
While searching for a solution I found the possibility of using tags. But as the docs point out they are not meant for such a use-case:
Tags shouldn’t originate from unbounded sources, such as epoch timestamps, user IDs, or request IDs. Doing so may infinitely increase the number of metrics for your organization and impact your billing.
So are there any other possibilities to achieve my goals?
I've solved it now by verifying the status via code and by adding tags to the metrics:
occurrence:first
occurrence:subsequent
This way I can filter in my dashboard for occurrence:first only.
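The answer does not say how the status is verified. One possible way, sketched here as an assumption, is to read the ApproximateReceiveCount attribute that SQS attaches to each record in the Lambda event: a value of 1 means the message is being delivered for the first time. The function name and the tag value occurrence:subsequent are illustrative.

```python
def occurrence_tag(record: dict) -> str:
    """Return the metric tag for one SQS record of a Lambda event."""
    # SQS delivers ApproximateReceiveCount as a string; "1" is the first delivery.
    receive_count = int(record["attributes"]["ApproximateReceiveCount"])
    return "occurrence:first" if receive_count <= 1 else "occurrence:subsequent"


# Simplified records: entity 123 on its first delivery and on its third.
first_delivery = {"body": "123", "attributes": {"ApproximateReceiveCount": "1"}}
third_delivery = {"body": "123", "attributes": {"ApproximateReceiveCount": "3"}}
print(occurrence_tag(first_delivery), occurrence_tag(third_delivery))
```

With this scheme an entity that keeps failing contributes exactly one error tagged occurrence:first, and the tag has only two possible values, so it does not grow the metric's cardinality.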
To make sure things are clear, you have a metric called myService.errorType with a tag entity. This metric is a counter that will increase every time an entity is in error. You will then use this metric query:
sum:myService.errorType{*} by {entity}
When you speak about UUIDs, it seems that the cardinality is small (here you show 3), which means that every hour only a small number of UUIDs will be reported. In that case, adding the UUID to the metric tags is not as critical as adding a user ID, timestamp, etc., which have a limitless number of values.
I would invite you to add this uuid tag, and check the cardinality in the metric summary page to ensure it works.
Then to get the number of UUID concerned by errors, you can use something like:
count_not_null(sum:myService.errorType{*} by {uuid})
Finally, as an alternative, if the cardinality of UUID can go through the roof, I would invite you to work with logs or work with Christopher's solution which seems to limit the cardinality increase as well.

Efficient way to query

My app has a class that saves pictures that users upload. Each object in the class has a city property that holds the name of the city where the picture was taken, and a like property that tracks the number of likes.
I want to be able to send a query that returns one picture per city and each picture should have the highest ranking of likes in the city it belongs to. How can I do that?
One way which I first thought about is doing multiple queries by fetching the most liked picture of a city and save it in an array, and then do the same to other cities.
However, each country has more than one city, thus it's not that efficient.
Parse doesn't support the ordinary operations used in databases. Besides, I tried to use a compound query. Unfortunately, I can't set limit or ordering on the subqueries. Any good solution for this?
It would be easy using group by. Unfortunately, Parse does not support "select distinct" or "group by" features.
As you've suggested, you need to fetch all the cities for each country and, for each one, get the top-rated photo.
BUT, since Parse has strict restrictions on the execution time of a request (3 sec for an event listener, 7 sec for a custom function), I suggest you do this in a background job, saving the top-rated photo for each city in a new table. This way you can easily query the db from the client. Background jobs can run for up to 15 minutes before Parse drops them, so you can run that kind of query without timeouts.
Hope it helps
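The aggregation step of such a background job could be sketched like this. It is plain Python over already fetched rows rather than Parse code; the property names city and like come from the question, while the function name and the sample data are made up.

```python
def top_picture_per_city(pictures):
    """Keep, for every city, the picture with the highest like count."""
    best = {}
    for picture in pictures:
        current = best.get(picture["city"])
        if current is None or picture["like"] > current["like"]:
            best[picture["city"]] = picture
    return best


# Made-up sample rows standing in for the picture class.
pictures = [
    {"id": "a", "city": "Rome", "like": 3},
    {"id": "b", "city": "Rome", "like": 7},
    {"id": "c", "city": "Oslo", "like": 1},
]
top = top_picture_per_city(pictures)
print({city: picture["id"] for city, picture in top.items()})
```

The resulting dictionary holds one row per city, which is what the job would write to the new table.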

How to quickly search book titles?

I have a database of about 200k books. I wish to give my users a way to quickly search a book by the title. Now, some titles might have prefix like A, THE, etc. and also can have numbers in the title, so search for 12 should match books with "12", "twelve" and "dozen" in the title. This will work via AJAX, so I need to make sure database query is really fast.
I assume that most of the users will try to search using some words of the title, so I'm thinking to split all the titles into words and create a separate database table which would map words to titles. However, I fear this might not give the best results. For example, the book title could be some 2 or 3 commonly used words, and I might get a list of books with longer titles that contain all 2-3 words and the one I'm looking for lost like a needle in a haystack. Also, searching for a book with many words in the title might slow down the query because of a lot of OR clauses.
Basically, I'm looking for a way to:
find the results quickly
sort them by relevance.
I assume this is not the first time someone needs something like this, and I'd hate to reinvent the wheel.
P.S. I'm currently using MySQL, but I could switch to anything else if needed.
I think using SOUNDEX is the best way.
SELECT
id,
title
FROM products AS p
WHERE p.title SOUNDS LIKE 'Shaw'
-- This will match 'Saw' etc.
For the best database performance, calculate the SOUNDEX value of your titles once and store it in a new column. You can calculate the soundex with SOUNDEX('Hello').
Example usage:
UPDATE `books` SET `soundex_title` = SOUNDEX(title);
You might want to have a look at Apache Lucene. This is a high-performance, Java-based information retrieval system.
You would create an IndexWriter and index all your titles, and you can add parameters (have a look at the class) linking to the actual book.
When searching, you would need an IndexReader and an IndexSearcher, and use the search() operation on them.
Have a look at the sample at: src/demo and in: http://lucene.apache.org/java/2_4_0/demo2.html
Using information retrieval techniques makes the indexing take longer, but a search will not have to go through most of the titles, so overall you can expect better search performance.
Also, choosing a good Analyzer enables you to ignore words such as "the", "a"...
One solution that would easily accommodate your volume of data and speed requirement is to use the Redis key-value store.
The way I see it, you can go ahead with your solution of mapping titles to keywords and storing them in the form:
keyword : set of book titles
Redis already has a built-in set data type that you can use.
Next, to get the titles of the books that contain the search keywords, you can use the SINTER command, which will perform the set intersection for you.
Everything is done in memory; therefore the response time is very fast.
Also, if you want to save your index, Redis has a number of different persistence/caching mechanisms.
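The keyword-to-titles mapping and the intersection step can be illustrated without a Redis server. The sketch below uses plain Python sets as a stand-in for Redis sets, and set.intersection plays the role of SINTER; the function names and sample titles are hypothetical.

```python
from collections import defaultdict


def build_index(titles):
    """Map each lowercase keyword to the set of titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.lower().split():
            index[word].add(title)
    return index


def search(index, query):
    """Return the titles that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Intersect the keyword sets, as SINTER would for the same keys.
    return set.intersection(*(index.get(word, set()) for word in words))


index = build_index(["A dime in a dozen", "The dozen", "A dime"])
print(search(index, "dime dozen"))
```

This only finds titles containing all query words; ranking by relevance would still need to be added on top.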
Apache Lucene with Solr is definitely a very good option for your problem.
You can link Solr/Lucene to your MySQL database so that it indexes the database directly. Here is a simple tutorial on how to link your MySQL database with Lucene/Solr: http://www.cabotsolutions.com/2009/05/using-solr-lucene-for-full-text-search-with-mysql-db/
Here are the advantages and pains of using Lucene-Solr instead of MySQL full text search: http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html
Keep it simple. Create an index on the title field and use wildcard pattern matching. You can not possibly make it any faster as your bottleneck is not the string matching but the number of strings you want to match against the title.
I just came up with a different idea. You say that some words can be interpreted differently, like 12, twelve, dozen. Instead of creating a query with different interpretations, why not store the different interpretations of the titles in a separate table with a one-to-many relationship to the books? You can then GROUP BY book_id to get unique book titles.
Say the book "A dime in a dozen". In books table it will be:
book_id=356
book_title='A dime in a dozen'
In titles table will be stored:
titles_id=123
titles_book_id=356
titles_title='A dime in a dozen'
--
titles_id=124
titles_book_id=356
titles_title='A dime in a 12'
--
titles_id=125
titles_book_id=356
titles_title='A dime in a twelve'
The query for this:
SELECT b.book_id, b.book_title
FROM books b JOIN titles t ON b.book_id = t.titles_book_id
WHERE t.titles_title LIKE '%twelve%'
GROUP BY b.book_id
Now, insertion becomes a much bigger task, but creating the variants can be done outside the database and inserted in one swoop.
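Creating the variants outside the database might look like the sketch below. The synonym group and the function name are assumptions for illustration; a real synonym list would be much longer and would need care with words that are only sometimes interchangeable.

```python
import re

# Hypothetical synonym groups; each set holds interchangeable words.
SYNONYM_GROUPS = [
    {"12", "twelve", "dozen"},
]


def title_variants(title):
    """Return the title plus the variants made by swapping in each synonym."""
    variants = {title}
    for group in SYNONYM_GROUPS:
        for word in group:
            # Match the word only as a whole word, ignoring case.
            pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
            if pattern.search(title):
                for replacement in group - {word}:
                    variants.add(pattern.sub(replacement, title))
    return variants


variants = title_variants("A dime in a dozen")
print(sorted(variants))
```

Each variant would then become one row in the titles table, all pointing at the same book_id.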

RSS functioning problem

I need to create an RSS feed for our information system, which is written in PHP.
I had no problems with the RSS 2.0 specification, nor with the creation of RSS feed generator. Items for the feed are to be fetched from a large table containing lots of records, so it will take a lot of time to get all the necessary information from this table. Therefore, it is necessary to implement the following scheme:
To show 5 latest items to new subscribers.
For the existing subscribers – to show only those items which have been added since their last view of the feed.
I have no problems with the first condition: I can simply use the LIMIT clause to limit the number of fetched rows. Something like this:
$items = function_select("SELECT * FROM some_table ORDER BY date DESC LIMIT 5");
But this creates the following problem: Suppose there are real feed subscribers who have already read the items from 1 up to 10. After they've been away for some period of time new items have been created; say, 10 new items.
During their next check-in we want them to see all 10 new items. Instead, they will see only the latest 5 (items 16 to 20); items 11 to 15 will be omitted.
I suppose that in order to solve this problem some kind of flag has to be sent to the feed, for example the pubDate of the last fetched item. Twitter's feed uses something similar. However, that link is hand-made. How could it be done another way?
Please let me know if you have any ideas. If you have any example ready (no matter in what language) just share a link with me. I would appreciate it greatly.
Thank you in advance.
Standard RSS feeds don't render different content to different users. They simply always provide the most recent few items (often 10), and rely on the RSS reader to poll them often enough that they don't miss any updates. Unless you have a particularly compelling reason not to do this, this is the simplest and most effective mechanism.
