In the Tweepy streaming API, I want to collect tweets with a location filter. For this I have used the following call:
stream.filter()
I also need to collect tweets without a location filter. For this, I have used the following command.
stream.sample()
Logically, I would expect sample() to retrieve a larger number of tweets, but my observation is that sample() fetches far fewer tweets than filter(). Is it supposed to work this way? How can I fetch all tweets without applying a location filter?
You can indeed collect tweets from a specific location, with a specific hashtag, from a specific user, and so on.
But you can't collect every tweet created. There is too much data to get.
You need to specify a filter, so it depends on which tweets you need to get.
You can use userstream() to get tweets tweeted to you and only you.
You can use filter(track=[somefilter]) to get tweets matching the specified filter.
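For reference, here is a minimal sketch of both calls. It assumes the Tweepy 3.x streaming API; the credentials and the San Francisco bounding box are placeholders, not values from the question.

import tweepy

class MyListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)            # handle each incoming tweet here

    def on_error(self, status_code):
        return status_code != 420     # disconnect on rate-limit errors

# placeholder credentials
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

stream = tweepy.Stream(auth=auth, listener=MyListener())

# location-filtered stream: example bounding box around San Francisco
stream.filter(locations=[-122.75, 36.8, -121.75, 37.8])

# or, instead: a ~1% random sample of all public tweets, no filter applied
# stream.sample()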
I have a microservice with elasticsearch as a backend store. Now I have multiple indices with hundreds of thousands of documents inserted in those indices.
Now I need to expose GET APIs for those indices. GET /employees/get.
I have gone through ES pagination using scroll and search_after. But both of them require meta information like scroll_id and search_after(key) to do pagination.
The concern is that my microservice shouldn't expose these scroll_ids or search_after keys. With the current approach I can list up to 10k docs, but not beyond that. And I don't want users of the microservice to know anything about the backend store. So how can I achieve this with Elasticsearch?
I have the following approaches in mind:
Store the scroll_id in memory and retrieve the results based on it for subsequent queries (a rough sketch of this idea follows these options). The GET query will be as below:
GET /employees/get?page=1 By default each page will have 10k documents.
Implement the scroll API internally behind the GET API and return all matching documents to users. But this increases latency and memory use, because at times I may end up returning 100k docs to the user in a single call.
Expose the GET API with a search string. By default return 10k documents, and then refine the results with the search string as explained below:
Let's say GET /employees/get returns 10k documents and accepts a query_string to narrow down those 10k, like auto-suggestion using n-grams. Then we show the most relevant 10k docs every time. I know this is not actual pagination, but it too solves the problem in a hacky way. This is my Plan B.
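For what it's worth, here is a rough sketch of the first approach, assuming the elasticsearch-py 7.x client. The index name, page size, and the naive in-memory dict are placeholders; a real service would need expiry/eviction for the cached scroll ids.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
PAGE_SIZE = 10000
_scroll_ids = {}   # page number -> scroll_id that produced that page

def get_employees_page(page):
    if page == 1:
        resp = es.search(index="employees", scroll="5m",
                         body={"size": PAGE_SIZE, "query": {"match_all": {}}})
    else:
        prev = _scroll_ids.get(page - 1)
        if prev is None:
            raise ValueError("pages must be fetched in order in this sketch")
        resp = es.scroll(scroll_id=prev, scroll="5m")
    _scroll_ids[page] = resp["_scroll_id"]
    return [hit["_source"] for hit in resp["hits"]["hits"]]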
Edited:
This is my use case:
Return the list of employees of a company. There are more than 100k employees, so I have to return the results in pages: GET /employees/get?from=0&size=1000, then GET /employees/get?from=1001&size=1000, and so on.
But once from + size reaches 10k, ES rejects the query.
Please suggest the ideal way to implement pagination in a microservice with ES as the backend store, without letting the user know about the internals of ES.
I want to use NLP with Elasticsearch. I have been able to achieve one level of this by using the OpenNLP plugin, as mentioned in the comments of this question. I am getting entities like person, organization, location, etc. indexed while inserting documents.
I have a doubt about searching the same information, since I need to process the terms entered by the user at query time. Here is what I have thought of:
Process the query entered by the user using Apache OpenNLP, as specified here.
Extract person, location and organisation names from the previous step, and then run a query against the entities stored in the index (a rough sketch of this follows the list).
I am also thinking of using the Google Knowledge Graph Search API to fetch related information about the entities extracted in the previous steps and then include it in the search query as well. (The reason for doing this is that we want to show results for Delhi in case someone searches for "Capital of India".) We are not going with a synonym-search approach in this case because we want the information to be available dynamically.
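Here is a rough sketch of how steps 1-2 could look with the elasticsearch-py client. extract_entities() is a stand-in for whatever OpenNLP call does the extraction, and the index name and field names (person, location, organisation) are assumptions, not something defined above.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def extract_entities(user_query):
    # stand-in for the Apache OpenNLP call; expected to return something like
    # {"person": [], "location": ["Delhi"], "organisation": []}
    raise NotImplementedError

def entity_search(user_query):
    entities = extract_entities(user_query)
    should = [{"match": {field: value}}
              for field, values in entities.items()
              for value in values]
    body = {"query": {"bool": {"should": should, "minimum_should_match": 1}}}
    return es.search(index="documents", body=body)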
My question is: is there something better we can do to achieve the same? A lot of processing at query time is going to increase the response time.
I'm using Elasticsearch to search across more than 10 million records; most records contain 1 to 25 words. I want to retrieve data from it, but the method I'm using now is drastically slow when retrieving large result sets, as I'm pulling the data from the _source field. I want a method that makes this process faster. I'm free to use another database or anything alongside Elasticsearch. Can anyone suggest some good ideas and examples for this?
I've tried searching for a solution on Google, and one solution I found was pagination. I've already applied it wherever possible, but pagination is not an option when I want to retrieve many (5000+) hits in one query.
Thanks in advance.
Try using scroll
While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.
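For example, with the elasticsearch-py helpers (the index name, query, and the handle_hit() callback are placeholders), the scan() helper drives the scroll API for you and is not limited to a single 10k page:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")

# scan() wraps the scroll API and yields every matching hit in batches
for hit in scan(es, index="my-index",
                query={"query": {"match_all": {}}}, size=1000):
    handle_hit(hit["_source"])   # your own processing logic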
I am about to index tweets coming from Apache NiFi into Elasticsearch via POST, and I want to do the following:
Make the create_at field a date. Should I use a mapping or an index template for this?
Make some fields not analyzed, like hashtags, URLs, etc.
Store not the entire tweet but only some important fields: the text, some of the user information (not all of it), and the hashtags and URLs from the entities (the URLs in the post). I don't need the quoted source, etc.
What should I use in this case? A template? Pre-processing the tweets with some ETL process in order to extract the data I need before indexing into ES?
I am a bit confused and would really appreciate advice.
Thanks in advance.
I guess in your NiFi flow you have something like GetTwitter and PostHTTP configured. NiFi is already some sort of ETL, so you probably don't need another one. However, since you don't want to index the whole JSON coming out of Twitter, you clearly need another NiFi processor in between to select what you want and transform the raw JSON into a more lightweight one. Here is an example of how to do it for Solr, but I'm not sure the same processor exists for Elasticsearch.
This article about streaming Twitter data to Elasticsearch using Logstash shows a possible index template that you could use in order to mold your own (i.e. add the create_at date field if you like).
Since you don't want to index everything, the way to go is clearly to come up with your own mapping, which you can then use in an index template. Using index templates, you will be able to create daily/weekly/monthly Twitter indices as you see fit.
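As a rough illustration only (not a definitive mapping), a template along these lines could be registered with the elasticsearch-py legacy template API; the field names follow the question, and everything else (index pattern, field types) is an assumption to adapt to your actual payload.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.put_template(name="tweets", body={
    "index_patterns": ["tweets-*"],        # e.g. daily indices tweets-2017.01.01
    "mappings": {
        "properties": {
            # you may need an explicit "format" matching Twitter's timestamp string
            "create_at": {"type": "date"},
            "text": {"type": "text"},
            "hashtags": {"type": "keyword"},   # keyword = not analyzed
            "urls": {"type": "keyword"}
        }
    }
})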
I currently have an implementation of a recommender in Mahout using the in-memory recommendation APIs. However, I would like to move to a distributed solution using Hadoop in order to calculate offline recommendations. This is my first time using Hadoop, and I'm looking for clarification on a few concepts and API usages.
Currently, my understanding of Hadoop is minimal, and I think the correct approach is the following:
Use something like Apache Drill to populate HDFS with the user and item data.
Use the recommendation job in Mahout to train on the data from HDFS.
Transform the resulting data in HDFS into index shards to be used by Solr.
Use Solr to provide the recommendations to the user base.
However, I am looking for clarifications on a couple aspects of this design:
How would I utilize a rescorer in the manner that it is used in the in-memory live recommendations?
What is the best manner in which to invoke the recommendations job?
I have other questions besides these two but the answers to these would be a huge help.
You may be talking about the Mahout + Hadoop + Solr recommender. This method handles rescoring in a couple different ways.
The basic recommender can be put together in two ways:
After getting data into HDFS in the form of (user id, item id, preference weight), run the ItemSimilarityJob on the data (use LLR similarity, which is usually best). It will create what is called an indicator matrix. This will be an item-id-by-item-id sparse matrix of values indicating the similarity magnitude between any two items. You must then convert this into values that Solr can index. That means translating the internal Mahout integer IDs into some unique string representation, which is probably what they were at the very beginning. This will look like (item123, item223 item643 item293 item445...) as a CSV, so two Solr fields: the first is an item id, the second is a list of similar items. All ids must be text tokens. Then the query for recommendations is a Solr query made up of item ids that a particular user has shown a preference for, so query = "item223 item344 item445...". Make the query against the field that holds the indicator matrix values. You will get back an ordered list of item IDs.
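As a rough illustration of that query side (this is not part of Mahout itself), using pysolr with assumed core and field names ("items", "id", "indicators"):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/items", timeout=10)

# Each indexed item carries its similar-item list in the indicator field, e.g.
# {"id": "item123", "indicators": "item223 item643 item293 item445"}

# The user's preferred items become the query terms against that field
user_prefs = "item223 item344 item445"
results = solr.search("indicators:(%s)" % user_prefs, rows=10)
for doc in results:
    print(doc["id"])   # ordered list of recommended item ids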
A much easier way that may work for you is to use a tool in the /examples folder of Mahout 1.0-SNAPSHOT or here: https://github.com/pferrel/solr-recommender. It takes in raw log files with unique strings for user and item ids. It does all the work on Hadoop to output CSVs that can be indexed by Solr directly or loaded into a DB as described above.
The way I did the demo site (https://guide.finderbots.com) is to use my Solr web app integration, putting the indicator matrix into a DB and attaching the similar-item list to my collection of items. So item123 got item223 item643 item293 item445... in its indicator field. After you index the collection, the query is then = "item223 item344 item445..." -- the user's preferred items.
Here are three ways to do rescoring:
Mix in metadata with the query. So you could do query = "item223 item344 item445..." against the indicator field AND "SciFi" against the "genre" field. This gives you blended collaborative filtering and metadata in your query and as you can imagine, the recs are based on the user's prefs but skewed towards "SciFi". There are lots of other interesting things you can do once you get item+indicators+metadata into an index.
Filter recs by metadata. You can get recs that are not skewed but filtered, if you want, using the Solr query = "item223 item344 item445..." against the indicator field AND "SciFi" as a filter against the "genre" field. In this case you get nothing but "SciFi", whereas with #1 you would get mostly "SciFi". A query sketch illustrating both follows this list.
Get your ordered list of recs back and rescore them in any way you'd like based on other things you know about the user, context, or items. Often these can be encoded into a Solr query and done with one query, but reordering and filtering can be done after the recs are returned too. You would have to write that code; it is not built in.
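Here is a small pysolr sketch of #1 and #2, continuing the assumed "indicators" and "genre" field names from the sketch above:

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/items", timeout=10)
user_prefs = "item223 item344 item445"

# 1. Skewed: the metadata clause participates in scoring, so SciFi ranks higher
skewed = solr.search('indicators:(%s) OR genre:"SciFi"^2' % user_prefs)

# 2. Filtered: fq restricts results to SciFi without affecting relevance scores
scifi_only = solr.search("indicators:(%s)" % user_prefs, fq='genre:"SciFi"')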
The fun thing is you can mix filters, metadata fields, and user preferences with what Solr calls "boost" values to get all sorts of rescoring. Solr can even use location to query, skew, or filter.
Note: You don't have to worry about Solr shards necessarily. Solr will index most DBs and HDFS directly but only the index is sharded. You shard an index if you have a very big one, you replicate it if you have lots of queries/second (or for failover). Solr queries are generally very fast so I'd worry about that after you have a functioning system since it's a config thing and shouldn't be affected by the rest of your workflow.