Browse all documents and bulk update some of them - elasticsearch

I am using the Jest client for Elasticsearch to browse an index of documents and update one field. My workflow is to run an empty query with paging and check whether I can compute the extra field. If I can, I update the relevant documents in one bulk update.
Pseudo-code
private void process() {
    int from = 0
    int size = this.properties.batchSize
    boolean moreResults = true
    while (moreResults) {
        moreResults = handleBatch(from, size)
        from += size
    }
}

private boolean handleBatch(int from, int size) {
    log.info("Processing records $from to ${from + size}")
    def result = search(from, size)
    int hitCount = 0
    if (result.isSucceeded()) {
        // Check each element and perform an upgrade
        hitCount = result.getJsonObject().getAsJsonObject("hits").getAsJsonArray("hits").size()
    }
    // return true if the query returned at least one item
    return hitCount > 0
}

private SearchResult search(int from, int size) {
    String query = '{ "from": ' + from + ', "size": ' + size + ' }'
    Search search = new Search.Builder(query)
            .addIndex("my-index")
            .addType("my-document")
            .build()
    jestClient.execute(search)
}
I don't get any errors, but when I run the batch several times, it keeps finding "new" documents to upgrade even though the total number of documents hasn't changed. I suspected that updated documents were being processed several times, which I confirmed by checking the processed IDs.
How can I run a query so that only the original documents are processed and my updates don't interfere with the iteration?

Instead of running a normal search (i.e. using from+size), you need to run a scroll search. The main difference is that a scroll freezes a snapshot of the documents as they were at the time of the initial query and iterates over that snapshot. Whatever changes happen after the first scroll query won't be considered.
Using Jest, you need to modify your code to look more like this:
// 1. Initiate the scroll request
Search search = new Search.Builder(searchSourceBuilder.toString())
        .addIndex("my-index")
        .addType("my-document")
        .addSort(new Sort("_doc"))
        .setParameter(Parameters.SIZE, size)
        .setParameter(Parameters.SCROLL, "5m")
        .build();
JestResult result = jestClient.execute(search);

// 2. Get the scroll_id to use in subsequent requests
String scrollId = result.getJsonObject().get("_scroll_id").getAsString();

// 3. Issue scroll search requests until you have retrieved all results
boolean moreResults = true;
while (moreResults) {
    // The page size is set on the initial request, not on the scroll
    SearchScroll scroll = new SearchScroll.Builder(scrollId, "5m").build();
    result = jestClient.execute(scroll);
    // The scroll id can change between requests, so always read it again
    scrollId = result.getJsonObject().get("_scroll_id").getAsString();
    JsonArray hits = result.getJsonObject().getAsJsonObject("hits").getAsJsonArray("hits");
    // Check each element and perform the upgrade here
    moreResults = hits.size() > 0;
}
You need to modify your process and handleBatch methods with the above code. It should be straightforward, let me know if not.
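For the "check each element and perform an upgrade" part, you can then send one bulk request per scroll page. Below is a minimal, untested sketch assuming Jest's Bulk and Update actions; "extraField" and "some-value" are placeholders for the computed field from your question, and it would sit inside the scroll loop above:
// Sketch only: one partial-document update per hit, sent as a single bulk request
Bulk.Builder bulkBuilder = new Bulk.Builder()
        .defaultIndex("my-index")
        .defaultType("my-document");
for (JsonElement hit : hits) {
    String id = hit.getAsJsonObject().get("_id").getAsString();
    // Partial update: only the computed field is touched
    String payload = "{ \"doc\": { \"extraField\": \"some-value\" } }";
    bulkBuilder.addAction(new Update.Builder(payload).id(id).build());
}
BulkResult bulkResult = jestClient.execute(bulkBuilder.build());
Since the scroll iterates over a frozen snapshot, these updates won't cause documents to be revisited.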

Related

Fetch data from the middle of a big stack using searchAfter (jump to a specific page)

I have a large data set, around 25 million records. I am using searchAfter with a PointInTime to walk through the data.
My question: is there a way to skip records beyond the limit of 10,000 set by index.max_result_window and start picking records, for example, from 100,000 up to 105,000?
Right now I am sending multiple requests to Elasticsearch until I reach the desired point, but that is not efficient and it consumes a lot of time.
Here is how I did it: I calculated how many pages I needed for the pagination. The user then sends a request with a page number, e.g. page 3, and only when I reach the desired page do I set the source to true.
This is the best I managed to do to improve performance and reduce the response size for the pages that are not required.
int numberOfPages = Pagination.GetTotalPages(totalCount, _size);
var pitResponse = await _esClient.OpenPointInTimeAsync(content._index, p => p.KeepAlive("2m"));
if (pitResponse.IsValid)
{
    IEnumerable<object> lastHit = null;
    for (int round = 0; round < numberOfPages; round++)
    {
        bool fetchSource = round == requiredPage;
        var response = await _esClient.SearchAsync<ProductionDataItem>(s => s
            .Index(content._index)
            .Size(10000)
            .Source(fetchSource)
            .Query(query)
            .PointInTime(pitResponse.Id)
            .Sort(srt => {
                if (content.Sort == 1) { srt.Ascending(sortBy); }
                else { srt.Descending(sortBy); }
                return srt;
            })
            .SearchAfter(lastHit)
        );
        if (fetchSource)
        {
            itemsList.AddRange(response.Documents.ToList());
            break;
        }
        lastHit = response.Hits.Last().Sorts;
    }
}
// Closing the PIT
await _esClient.ClosePointInTimeAsync(p => p.Id(pitResponse.Id));
Check here: Elasticsearch Pagination Techniques
I think the best way to do it is how I did it above: keep paging via point in time and only load the result when the desired page is reached, using .Source(bool).
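For reference, the same idea sketched in Java against the Elasticsearch high-level REST client; client, INDEX, PAGE_SIZE and requiredPage are placeholders, and the PIT open/close calls are left out for brevity:
// Rough sketch: skip pages with _source disabled, load only the target page
SearchSourceBuilder source = new SearchSourceBuilder()
        .size(PAGE_SIZE)
        .sort("id", SortOrder.ASC);
Object[] lastSort = null;
for (int page = 0; ; page++) {
    boolean isTarget = (page == requiredPage);
    source.fetchSource(isTarget); // only fetch _source on the desired page
    if (lastSort != null) {
        source.searchAfter(lastSort);
    }
    SearchResponse response = client.search(
            new SearchRequest(INDEX).source(source), RequestOptions.DEFAULT);
    SearchHit[] hits = response.getHits().getHits();
    if (hits.length == 0 || isTarget) {
        // hits now holds the documents of the requested page (or we ran out)
        break;
    }
    lastSort = hits[hits.length - 1].getSortValues();
}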

ParseQuery not taking the Where clause into consideration (Windows Phone)

I made a new query to select from the Article class, with a where clause for each selected criterion. However, it keeps returning the whole list every time, even though filter fields are selected!
Here is my code:
ParseQuery<Article> query = new ParseQuery<Article>();
if (souCategorie.SelectedIndex >= 0)
{
    query.WhereEqualTo("idSCategorie", listeSouCategorie.ElementAt(souCategorie.SelectedIndex));
}
if (motcle.Text.Length > 0)
{
    query.WhereContains("nom", motcle.Text);
    // query.WhereContains("description", motcle.Text);
}
if (distance.Text.Length > 0)
{
    if (Convert.ToDouble(distance.Text) > 0)
    {
        Debug.WriteLine(distance.Text);
        ParseGeoPoint geo = new ParseGeoPoint();
        geo.Latitude = geoposition.Coordinate.Latitude;
        geo.Longitude = geoposition.Coordinate.Longitude;
        query.WhereWithinDistance("coordonnees", geo, ParseGeoDistance.FromKilometers(Convert.ToDouble(distance.Text)));
    }
}
IEnumerable<Article> lst = await query.FindAsync();
rechercheResult.DataContext = lst.ToList();
What could possibly be wrong?
I know that queries can do funky stuff when you start using GeoPoints. I would try setting up two queries: one that just queries for objects within a distance, then feed its results into a second query that has the WhereEqualTo and WhereContains calls.
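If it helps, here is a rough, untested sketch of that two-query idea using the Parse Android/Java SDK (the .NET calls are analogous); latitude, longitude, maxDistanceKm, selectedCategorie and keyword are placeholders, while the class and field names come from the question:
// First query: distance constraint only
ParseQuery<ParseObject> geoQuery = ParseQuery.getQuery("Article");
geoQuery.whereWithinKilometers("coordonnees", new ParseGeoPoint(latitude, longitude), maxDistanceKm);
List<ParseObject> nearby = geoQuery.find();

// Collect the ids of the nearby articles
List<String> ids = new ArrayList<>();
for (ParseObject article : nearby) {
    ids.add(article.getObjectId());
}

// Second query: apply the remaining constraints to those ids only
ParseQuery<ParseObject> filtered = ParseQuery.getQuery("Article");
filtered.whereContainedIn("objectId", ids);
filtered.whereEqualTo("idSCategorie", selectedCategorie);
filtered.whereContains("nom", keyword);
List<ParseObject> results = filtered.find();
One thing worth double-checking in the .NET SDK specifically: ParseQuery is immutable there, so calls like WhereEqualTo return a new query that must be reassigned (query = query.WhereEqualTo(...)); dropping the return value leaves the query unfiltered.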

Sorting for Azure DocumentDB

I want to use DocumentDB to store roughly 200,000 documents of the same type. The documents each get an integer id field, and I would like to retrieve them paged, in reverse order (highest id first).
So recently I found out there is no sorting for DocumentDB (see also DocumentDB - query result order). Perhaps it would be better to go for a different database (such as RavenDB); however, time is pressing and I want to avoid the cost of switching to another database.
The question:
I have been looking at implementing my own sorted index of the documents on the client side (ASP Web API 2). I was thinking of creating a SortedList of key (id) and value (document.SelfLink). Then I could create a getter with parameters for count, offset and a predicate to filter the documents. Below I added a quick example.
I just have the feeling this is a bad idea: either slow, too resource-hungry, or better done another way. So I am open to implementation suggestions...
public class SortableDocumentDbRepository
{
    private SortedList _sorted = new SortedList();
    private readonly string _sortedPropertyName;

    private DocumentCollection ReadOrCreateCollection(string databaseLink)
    {
        DocumentCollection col = base.ReadOrCreateCollection(databaseLink);
        var docs = Client.CreateDocumentQuery(Collection.DocumentsLink)
            .AsEnumerable();
        lock (_sorted.SyncRoot)
        {
            foreach (Document doc in docs)
            {
                var propVal = doc.GetPropertyValue<string>(_sortedPropertyName);
                if (propVal != null)
                {
                    _sorted.Add(propVal, doc.SelfLink);
                }
            }
        }
        return col;
    }

    public List<T> GetItems<T>(int count, int offset, Expression<Func<T, bool>> predicate)
    {
        List<T> result = new List<T>();
        lock (_sorted.SyncRoot)
        {
            var values = _sorted.GetValueList();
            for (int i = offset; i < _sorted.Count; i++)
            {
                var queryable = predicate != null
                    ? Client.CreateDocumentQuery<T>(values[i].ToString()).Where(predicate)
                    : Client.CreateDocumentQuery<T>(values[i].ToString());
                T item = queryable.AsEnumerable().FirstOrDefault();
                if (item == null || item.Equals(default(T))) continue;
                result.Add(item);
                if (result.Count >= count) return result;
            }
        }
        return result;
    }
}
Microsoft has implemented Sorting:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-api-sql-query-reference#bk_orderby_clause
Example: SELECT * FROM c ORDER BY c._ts DESC
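With ORDER BY available, the reverse-id paging from the question can now be pushed to the server. A rough sketch with the classic DocumentDB Java SDK (the .NET SDK is analogous; client and collectionLink are placeholders):
// Let the server do the sorting instead of a client-side SortedList
FeedOptions options = new FeedOptions();
options.setPageSize(100); // fetch the result page by page
FeedResponse<Document> response = client.queryDocuments(
        collectionLink,
        "SELECT * FROM c ORDER BY c.id DESC", // highest id first
        options);
for (Document doc : response.getQueryIterable()) {
    // documents arrive in reverse id order
}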
As you mentioned, order by unfortunately isn't implemented yet.
Your approach looks reasonable to me.
I see you are using a predicate to narrow the query result set (pulling 200,000 records for any DB will be costly).
Since it looks like you want to order by id, you can also look into setting up a range index on id, allowing you to perform range queries (e.g. < and >) on the id and further narrow the query result set. There is also a range index included by default on the _ts (timestamp) system property on documents that may be helpful in this context.
See: http://azure.microsoft.com/en-us/documentation/articles/documentdb-indexing-policies/

How to retrieve total view count of large number of pages combined from the GA API

We are interested in the combined statistics of different pages from the Google Analytics Core Reporting API. The only way I found to query statistics for multiple pages at the same time is by creating a filter like so:
ga:pagePath==page?id=a,ga:pagePath==page?id=b,ga:pagePath==page?id=c
And this gets escaped inside the filters parameter of the GET query.
However when the GET query gets over 2000 characters I get the following response:
414. That’s an error.
The requested URL /analytics/v3/data/ga... is too large to process. That’s all we know.
Note that, just like in the example call, the only part that differs per page is a GET parameter in the pagePath, but we have to OR in a new filter clause specifying both the dimension (pagePath) and the part of the path that is always identical.
Is there any way to specify a large number of different pages to query without hitting this limit on the GET query (I can't find any documentation for doing POST requests)? Or are there alternatives to creating batches of at most X different pages per query and adding them up on my end?
Instead of using ga:pagePath as part of a filter, you should use it as a dimension. You can get up to 10,000 rows per query this way and paginate to get all results, then parse the results client-side to get what you need. Additionally, use a filter to scope the results down if possible, based on your site structure or page names.
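For what it's worth, a rough, untested sketch of that dimension-based approach with the Core Reporting API v3 Java client; analytics, PROFILE_ID, the dates and the filter prefix are placeholders:
int startIndex = 1;
Map<String, Long> viewsByPath = new HashMap<>();
while (true) {
    GaData data = analytics.data().ga()
            .get("ga:" + PROFILE_ID, "2015-01-01", "2015-01-31", "ga:pageviews")
            .setDimensions("ga:pagePath")       // one row per page path
            .setFilters("ga:pagePath=~^/page")  // optional: narrow by a common path prefix
            .setMaxResults(10000)
            .setStartIndex(startIndex)
            .execute();
    if (data.getRows() == null || data.getRows().isEmpty()) break;
    for (List<String> row : data.getRows()) {
        // row.get(0) is the page path, row.get(1) the pageview count
        viewsByPath.merge(row.get(0), Long.parseLong(row.get(1)), Long::sum);
    }
    startIndex += data.getRows().size();
}
Summing the per-path counts client-side then gives the combined view count.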
I am sharing some sample code where you can fetch more than 10,000 records of data with the help of paging (ItemsPerPage):
private void GetDataofPpcInfo(DateTime dtStartDate, DateTime dtEndDate, AnalyticsService gas, List<PpcReportData> lstPpcReportData, string strProfileID)
{
    int intStartIndex = 1;
    int intIndexCnt = 0;
    int intMaxRecords = 10000;
    var metrics = "ga:impressions,ga:adClicks,ga:adCost,ga:goalCompletionsAll,ga:CPC,ga:visits";
    var r = gas.Data.Ga.Get("ga:" + strProfileID, dtStartDate.ToString("yyyy-MM-dd"), dtEndDate.ToString("yyyy-MM-dd"), metrics);
    r.Dimensions = "ga:campaign,ga:keyword,ga:adGroup,ga:source,ga:isMobile,ga:date";
    r.MaxResults = 10000;
    r.Filters = "ga:medium==cpc;ga:campaign!=(not set)";
    while (true)
    {
        r.StartIndex = intStartIndex;
        var dimensionOneData = r.Fetch();
        dimensionOneData.ItemsPerPage = intMaxRecords;
        if (dimensionOneData != null && dimensionOneData.Rows != null)
        {
            var enUS = new CultureInfo("en-US");
            intIndexCnt++;
            foreach (var lstFirst in dimensionOneData.Rows)
            {
                var objPPCReportData = new PpcReportData();
                objPPCReportData.Campaign = lstFirst[dimensionOneData.ColumnHeaders.IndexOf(dimensionOneData.ColumnHeaders.FirstOrDefault(h => h.Name == "ga:campaign"))];
                objPPCReportData.Keywords = lstFirst[dimensionOneData.ColumnHeaders.IndexOf(dimensionOneData.ColumnHeaders.FirstOrDefault(h => h.Name == "ga:keyword"))];
                lstPpcReportData.Add(objPPCReportData);
            }
            intStartIndex = intIndexCnt * intMaxRecords + 1;
        }
        else break;
    }
}
The only problematic thing is that your query length shouldn't exceed around 2,000-odd characters.

Google calendar query returns at most 25 entries

I'm trying to delete all calendar entries from today forward. I run a query, then call getEntries() on the query result. getEntries() always returns 25 entries (or fewer, if there are fewer than 25 entries on the calendar). Why aren't all the entries returned? I'm expecting about 80 entries.
As a test, I tried running the query, deleting the 25 entries returned, running the query again, deleting again, etc. This works, but there must be a better way.
Below is the Java code that only runs the query once.
CalendarQuery myQuery = new CalendarQuery(feedUrl);
DateFormat dfGoogle = new SimpleDateFormat("yyyy-MM-dd'T00:00:00'");
Date dt = Calendar.getInstance().getTime();
myQuery.setMinimumStartTime(DateTime.parseDateTime(dfGoogle.format(dt)));
// Make the end time far into the future so we delete everything
myQuery.setMaximumStartTime(DateTime.parseDateTime("2099-12-31T23:59:59"));
// Execute the query and get the response
CalendarEventFeed resultFeed = service.query(myQuery, CalendarEventFeed.class);
// !!! This returns 25 (or less if there are fewer than 25 entries on the calendar) !!!
int test = resultFeed.getEntries().size();
// Delete all the entries returned by the query
for (int j = 0; j < resultFeed.getEntries().size(); j++) {
    CalendarEventEntry entry = resultFeed.getEntries().get(j);
    entry.delete();
}
PS: I've looked at the Data API Developer's Guide and the Google Data API Javadoc. These sites are okay, but not great. Does anyone know of additional Google API documentation?
You can increase the number of results with myQuery.setMaxResults(). There is still an upper cap though, so you can make multiple queries ('paged' results) by varying myQuery.setStartIndex().
http://code.google.com/apis/gdata/javadoc/com/google/gdata/client/Query.html#setMaxResults(int)
http://code.google.com/apis/gdata/javadoc/com/google/gdata/client/Query.html#setStartIndex(int)
Based on the answers from Jim Blackler and Chris Kaminski, I enhanced my code to read the query results in pages. I also do the delete as a batch, which should be faster than doing individual deletions.
I'm providing the Java code here in case it is useful to anyone.
CalendarQuery myQuery = new CalendarQuery(feedUrl);
DateFormat dfGoogle = new SimpleDateFormat("yyyy-MM-dd'T00:00:00'");
Date dt = Calendar.getInstance().getTime();
myQuery.setMinimumStartTime(DateTime.parseDateTime(dfGoogle.format(dt)));
// Make the end time far into the future so we delete everything
myQuery.setMaximumStartTime(DateTime.parseDateTime("2099-12-31T23:59:59"));

// Set the maximum number of results to return for the query.
// Note: A GData server may choose to provide fewer results, but will never provide
// more than the requested maximum.
myQuery.setMaxResults(5000);

int startIndex = 1;
int entriesReturned;
List<CalendarEventEntry> allCalEntries = new ArrayList<CalendarEventEntry>();
CalendarEventFeed resultFeed;

// Run our query as many times as necessary to get all the
// Google calendar entries we want
while (true) {
    myQuery.setStartIndex(startIndex);

    // Execute the query and get the response
    resultFeed = service.query(myQuery, CalendarEventFeed.class);

    entriesReturned = resultFeed.getEntries().size();
    if (entriesReturned == 0) {
        // We've hit the end of the list
        break;
    }

    // Add the returned entries to our local list
    allCalEntries.addAll(resultFeed.getEntries());
    startIndex = startIndex + entriesReturned;
}

// Delete all the entries as a batch delete
CalendarEventFeed batchRequest = new CalendarEventFeed();
for (int i = 0; i < allCalEntries.size(); i++) {
    CalendarEventEntry entry = allCalEntries.get(i);
    BatchUtils.setBatchId(entry, Integer.toString(i));
    BatchUtils.setBatchOperationType(entry, BatchOperationType.DELETE);
    batchRequest.getEntries().add(entry);
}

// Get the batch link URL and send the batch request
Link batchLink = resultFeed.getLink(Link.Rel.FEED_BATCH, Link.Type.ATOM);
CalendarEventFeed batchResponse = service.batch(new URL(batchLink.getHref()), batchRequest);

// Ensure that all the operations were successful
boolean isSuccess = true;
StringBuffer batchFailureMsg = new StringBuffer("These entries in the batch delete failed:");
for (CalendarEventEntry entry : batchResponse.getEntries()) {
    String batchId = BatchUtils.getBatchId(entry);
    if (!BatchUtils.isSuccess(entry)) {
        isSuccess = false;
        BatchStatus status = BatchUtils.getBatchStatus(entry);
        batchFailureMsg.append("\nID: " + batchId + " Reason: " + status.getReason());
    }
}
if (!isSuccess) {
    throw new Exception(batchFailureMsg.toString());
}
There is a small quote on the API page
http://code.google.com/apis/calendar/data/1.0/reference.html#Parameters
Note: The max-results query parameter for Calendar is set to 25 by default, so that you won't receive an entire calendar feed by accident. If you want to receive the entire feed, you can specify a very large number for max-results.
So to get all events from a Google Calendar feed, we do this:
google.calendarurl.com/.../basic?max-results=999999
In the API you can also query with setMaxResults(999999).
I got here while searching for a Python solution; should anyone be stuck the same way, the important line is the fourth:
query = gdata.calendar.service.CalendarEventQuery(cal, visibility, projection)
query.start_min = start_date
query.start_max = end_date
query.max_results = 1000
Unfortunately, Google limits the maximum number of entries you can retrieve per query. This is to keep queries within their governor guidelines (HTTP requests are not allowed to take more than 30 seconds, for example). They've built their whole architecture around this, so you might as well build the paging logic as you have.
