I have a DynamoDB table with 151 records, table size is 666 kilobytes and average item size is 4,410.83 bytes.
This is my table schema:
uid - partition key
version - sort key
status - published
archived - boolean
... // other attributes
I'm doing a scan operation from AWS Lambda:
module.exports.searchPois = async (event) => {
const klass = await getKlass.byEnv(process.env.STAGE, 'custom_pois')
const status = (event.queryStringParameters && event.queryStringParameters.status) || 'published'
const customPois = await klass.scan()
.filter('archived').eq(false)
.and()
.filter('status').eq(status)
.exec()
return customPois
}
This request takes 7+ seconds to fetch. I'm thinking of adding a GSI so I can perform a query operation instead. But before I add it: is scan really this slow, and if I add a GSI, will it fetch faster, say in 1-3 seconds?
Using Scan should be avoided if at all possible. For your use case a GSI would be much more efficient, and a sparse index would be even better: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-general-sparse-indexes.html
That said, for the small number of items you have it should not take 7 seconds. This is likely caused by making infrequent requests to DynamoDB: DynamoDB relies on cached metadata to keep request latency low, and if your requests are infrequent that metadata will not be in the cache, which increases response times.
I suggest you reuse your connections: create your client outside of the Lambda event handler, and keep active traffic on the table.
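A minimal sketch of that client-reuse pattern (the client object below is a hypothetical stand-in for your real DynamoDB/Dynamoose client; only the once-per-container structure matters):

```javascript
// Sketch of connection reuse in a Lambda module. Anything created at module
// scope survives across warm invocations of the same container, so the
// expensive setup (SDK client construction, connection, metadata fetch)
// happens only on cold starts.
let cachedClient = null;

function getClient() {
  if (!cachedClient) {
    // Stand-in for building your real DynamoDB/Dynamoose client.
    cachedClient = { createdAt: Date.now() };
  }
  return cachedClient;
}

// In the handler, warm invocations reuse the same client instance:
// module.exports.searchPois = async (event) => {
//   const client = getClient();
//   ...
// };
```

Combined with periodic traffic on the table (e.g. a scheduled warm-up invocation), this keeps both the Lambda container and DynamoDB's metadata cache warm.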
I have a Spring Boot 2.x project with a big Table in my Cassandra Database. In my Liquibase Migration Class, I need to replace a value from one column in all rows.
For me it's a big performance hit when I try to solve this with:
SELECT * FROM BOOKING
forEach Row
Update Row
This is because of the total number of rows; it's slow even when I select only one column.
Is it possible to make something like "partwise/pagination" loop?
Pseudocode:
Take first 1000 rows
do Update
Take next 1000 rows
do Update
loop.
I'm also happy to hear about any other solution approaches you have.
Must know:
Make sure there is a way to group the updates by partition. If you try a batch update on 1000 rows that are not in the same partition, the coordinator of the request will suffer: you are moving the load from your client to the coordinator, when you want to parallelize the writes instead. A batch update in Cassandra has nothing to do with the one in relational databases.
For fine-grained operations like this you want to go back to using the drivers directly, with CassandraOperations and CqlSession, for maximum control.
There is a way to paginate with Spring Data Cassandra using Slice, but you do not have control over how the operations are implemented.
Spring Data Cassandra core
int size = 1000;   // page size
int page = 5;      // target page (example value)
int currPage = 0;
Slice<MyEntity> slice = myEntityRepo.findAll(CassandraPageRequest.first(size));
while (slice.hasNext() && currPage < page) {
    slice = myEntityRepo.findAll(slice.nextPageable());
    currPage++;
}
slice.getContent(); // content of the requested page
Drivers:
// Prepare Statements to speed up queries
PreparedStatement selectPS = session.prepare(QueryBuilder
    .selectFrom("myEntity").all()
    .build()
    .setPageSize(1000)                     // 1000 rows per page
    .setTimeout(Duration.ofSeconds(10)));  // 10s timeout
PreparedStatement updatePS = session.prepare(QueryBuilder
.update("mytable")
.setColumn("myColumn", QueryBuilder.bindMarker())
.whereColumn("myPK").isEqualTo(QueryBuilder.bindMarker())
.build()
.setConsistencyLevel(ConsistencyLevel.ONE)); // Fast writes
// Paginate
ResultSet page1 = session.execute(selectPS.bind());
Iterator<Row> page1Iter = page1.iterator();
while (0 < page1.getAvailableWithoutFetching()) {
    Row row = page1Iter.next();
    session.executeAsync(updatePS.bind(...)); // bind the new value and the row's PK
}
// Resume from where the first page left off
ByteBuffer pagingStateAsBytes =
    page1.getExecutionInfo().getPagingState();
ResultSet page2 = session.execute(
    selectPS.bind().setPagingState(pagingStateAsBytes));
You could of course include this pagination in a loop and track progress.
I was reading through the docs to learn pagination approaches for Apollo. This is the simple example where they explain the paginated read function:
https://www.apollographql.com/docs/react/pagination/core-api#paginated-read-functions
Here is the relevant code snippet:
const cache = new InMemoryCache({
typePolicies: {
Query: {
fields: {
feed: {
read(existing, { args: { offset, limit }}) {
// A read function should always return undefined if existing is
// undefined. Returning undefined signals that the field is
// missing from the cache, which instructs Apollo Client to
// fetch its value from your GraphQL server.
return existing && existing.slice(offset, offset + limit);
},
// The keyArgs list and merge function are the same as above.
keyArgs: [],
merge(existing, incoming, { args: { offset = 0 }}) {
const merged = existing ? existing.slice(0) : [];
for (let i = 0; i < incoming.length; ++i) {
merged[offset + i] = incoming[i];
}
return merged;
},
},
},
},
},
});
I have one major question about this snippet (and other snippets from the docs that have the same "flaw" in my eyes), but I feel like I'm missing some piece.
Suppose I run a first query with offset=0 and limit=10. The server will return 10 results, which are stored in the cache after passing through the merge function.
Afterwards, I run the query with offset=5 and limit=10. Based on the approach described in the docs and the above code snippet, my understanding is that I will get only the items from 5 through 10, instead of items 5 through 15, because Apollo will see that the existing variable is present in read (with existing holding the initial 10 items) and will slice the available 5 items for me.
My question is: what am I missing? How will Apollo know to fetch new data from the server? How will new data arrive in the cache after the initial query? Keep in mind keyArgs is set to [], so the results will always be merged into a single item in the cache.
Apollo will not slice anything automatically. You have to define a merge function that keeps the data in the correct order in the cache. One approach would be to have an array with empty slots for data not yet fetched, and place incoming data in their respective index. For instance if you fetch items 30-40 out of a total of 100 your array would have 30 empty slots then your items then 60 empty slots. If you subsequently fetch items 70-80 those will be placed in their respective indexes and so on.
Your read function is where the decision on whether a network request is necessary or not will be made. If you find all the data in existing you will return them and no request to the server will be made. If any items are missing then you need to return undefined which will trigger a network request, then your merge function will be triggered once data is fetched, and finally your read function will run again only this time the data will be in the cache and it will be able to return them.
This approach is for the cache-first caching policy which is the default.
The logic for returning undefined from your read function will be implemented by you. There is no apollo magic under the hood.
If you use the cache-and-network policy, then your read doesn't need to return undefined when data is missing, since a network request will be made regardless of what is in the cache.
I'm working on an audit log which saves sessions in RavenDB. Initially, the website for querying the audit logs was responsive enough but as the amount of logged data has increased, the search page became unusable (it times out before returning using default settings - regardless of the query used). Right now we have about 45mil sessions in the table that gets queried but steady state is expected to be around 150mil documents.
The problem is that with this much live data, playing around to test things has become impractical. I hope someone can give me some ideas about which areas would be the most productive to investigate.
The index looks like this:
public AuditSessions_WithSearchParameters()
{
Map = sessions => from session in sessions
select new Result
{
ApplicationName = session.ApplicationName,
SessionId = session.SessionId,
StartedUtc = session.StartedUtc,
User_Cpr = session.User.Cpr,
User_CprPersonId = session.User.CprPersonId,
User_ApplicationUserId = session.User.ApplicationUserId
};
Store(r => r.ApplicationName, FieldStorage.Yes);
Store(r => r.StartedUtc, FieldStorage.Yes);
Store(r => r.User_Cpr, FieldStorage.Yes);
Store(r => r.User_CprPersonId, FieldStorage.Yes);
Store(r => r.User_ApplicationUserId, FieldStorage.Yes);
}
The essence of the query is this bit:
// Query input parameters
var fromDateUtc = fromDate.ToUniversalTime();
var toDateUtc = toDate.ToUniversalTime();
sessionQuery = sessionQuery
.Where(s =>
s.ApplicationName == applicationName &&
s.StartedUtc >= fromDateUtc &&
s.StartedUtc <= toDateUtc
);
var totalItems = Count(sessionQuery);
var sessionData =
sessionQuery
.OrderByDescending(s => s.StartedUtc)
.Skip((page - 1) * PageSize)
.Take(PageSize)
.ProjectFromIndexFieldsInto<AuditSessions_WithSearchParameters.ResultWithAuditSession>()
.Select(s => new
{
s.SessionId,
s.SessionGroupId,
s.ApplicationName,
s.StartedUtc,
s.Type,
s.ResourceUri,
s.User,
s.ImpersonatingUser
})
.ToList();
First, to determine the number of pages of results, I count the number of results in my query using this method:
private static int Count<T>(IRavenQueryable<T> results)
{
RavenQueryStatistics stats;
results.Statistics(out stats).Take(0).ToArray();
return stats.TotalResults;
}
This turns out to be very expensive in itself, so optimizations are relevant both here and in the rest of the query.
The query time is not related to the amount of result items in any relevant way. If I use a different value for the applicationName parameter than any of the results, it is just as slow.
One area of improvement could be to use sequential IDs for the sessions. For reasons not relevant to this post, I found it most practical to use guid based ids. I'm not sure if I can easily change IDs of the existing values (with this much data) and I would prefer not to drop the data (but might if the expected impact is large enough). I understand that sequential ids result in better behaving b-trees for the indexes, but I have no idea how significant the impact is.
Another approach could be to include a timestamp in the id and query for documents with ids starting with the string matching enough of the time to filter the result. An example id could be AuditSessions/2017-12-31-24-31-42/bc835d6c-2fba-4591-af92-7aab96339d84. This also requires me to update or drop all the existing data. This of course also has the benefits of mostly sequential ids.
A third approach could be to move old data into a different collection over time, in recognition of the fact that you would most often look at the most recent data. This requires a background job and support for querying across collection time boundaries. It also has the issue that the collection with the old sessions is still slow if you need to access it.
I'm hoping there is something simpler than these solutions, such as modifying the query or the indexed fields in a way that avoids a lot of work.
At a glance, it is probably related to the range query on the StartedUtc.
I'm assuming that you are using exact numbers, so you have a LOT of distinct values there.
If you can, you can dramatically reduce the cost by changing the index to index at second or minute granularity (which is usually what you are querying on), and then use Ticks, which allow us to use a numeric range query.
StartedUtcTicks = new DateTime(session.StartedUtc.Year, session.StartedUtc.Month, session.StartedUtc.Day, session.StartedUtc.Hour, session.StartedUtc.Minute, session.StartedUtc.Second).Ticks,
And then query by the date ticks.
I have a service that pulls statistics for a sales region. The service computes the stats for ALL regions and then caches that collection, then returns only the region requested.
public object Any(RegionTotals request)
{
string cacheKey = "urn:RegionTotals";
//make sure master list is in the cache...
base.Request.ToOptimizedResultUsingCache<RegionTotals>(
base.Cache, cacheKey, CacheExpiryTime.DailyLoad(), () =>
{
return RegionTotalsFactory.GetObject();
});
//then retrieve them. This is all teams
RegionTotals tots = base.Cache.Get<RegionTotals>(cacheKey);
//remove all except requested
tots.Records.RemoveAll(o => o.RegionID != request.RegionID);
return tots;
}
What I'm finding is that when I use a MemoryCacheClient (as part of a StaticAppHost that I use for Unit Tests), the line tots.Records.RemoveAll(...) actually affects the object in the cache. This means that I get the cached object, delete rows, and then the cache no longer contains all regions. Therefore, subsequent calls to this service for any other region return no records. If I use my normal Cache, of course the Cache.Get() makes a new copy of the object in the cache, and removing records from that object doesn't affect the cache.
This is because an In Memory cache doesn't add any serialization overhead and just stores your object instances in memory. Whereas when you use any of the other Caching Providers your values are serialized first then sent to the remote Caching Provider then when it's retrieved it's deserialized back so it's never reusing the same object instances.
If you plan on mutating cached values you'll need to clone the instances before mutating them, if you don't want to manually implement ICloneable you can serialize and deserialize them with:
var clone = TypeSerializer.Clone(obj);
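The hazard and the fix are easy to reproduce in miniature. A sketch in plain JavaScript for illustration (the in-memory Map plays the role of MemoryCacheClient, and the JSON round-trip plays the role of TypeSerializer.Clone):

```javascript
// An in-memory cache stores object references, so mutating a retrieved
// value mutates the cached value too.
const cache = new Map();
cache.set("urn:RegionTotals", { records: [1, 2, 3] });

const tots = cache.get("urn:RegionTotals");
tots.records.pop(); // mutates the object still held by the cache!
cache.get("urn:RegionTotals").records.length; // 2 — cache is corrupted

// Cloning before mutating (here via a serialize/deserialize round-trip,
// analogous to TypeSerializer.Clone) leaves the cached value intact.
cache.set("urn:RegionTotals", { records: [1, 2, 3] });
const clone = JSON.parse(JSON.stringify(cache.get("urn:RegionTotals")));
clone.records.pop();
cache.get("urn:RegionTotals").records.length; // still 3
```

A remote caching provider gets this isolation for free because every Get deserializes a fresh instance; with an in-memory cache you must clone explicitly.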
In Meteor I want to work at the document level with a Mongo database, and according to what I've read, what I have to watch out for is expensive publications. So today my question is:
How would I go about publishing documents with relations? Would I follow the relational type of query, where we find assignment details by assignment id, like this:
Meteor.publish('someName', function () {
  var empId = "dj4nfhd56k7bhb3b732fd73fb";
  var assignmentIds = Assignment.find({ employee_id: empId })
    .map(function (a) { return a._id; });
  return AssignmentDetails.find({ assignment_id: { $in: assignmentIds } });
});
or should we rather take an approach like this where we skip the filtering step in the publish and instead publish every assignment_detail and handle that filter on the client:
Meteor.publish('someName', function () {
var empId = "dj4nfhd56k7bhb3b732fd73fb";
var assignmentData = Assignment.find({ employee_id: empId });
var detailData = AssignmentDetails.find({ employee_id: empId });
return [ assignmentData, detailData];
});
I guess this is a question of whether the amount of data being searched through on the server should be larger, or whether the amount of data being transferred to the client should be.
Which of these would be most cost effective for the server?
It's a matter of opinion, but if possible I would strongly recommend attaching employee_id to docs in AssignmentDetails, as you have in the second example. You're correct in suggesting that publications are expensive, but much more so if the publication function is more complex than necessary, and you can reduce your pub function to one line if you have employee_id in AssignmentDetails (even where there are many employee_ids for each assignment) by just searching on that. You don't even need to return that field to the client (you can specify the fields to return in your find), so the only incurred overhead would be in database storage (which is v. cheap) and adding it to inserted/updated AssignmentDetails docs (which would be imperceptible). The actual amount of data transferred would be the same as in the first case.
The alternative of just publishing everything might be fine for a small collection, but it really depends on the number of assignments, and it's not going to be at all scalable this way. You need to send the entire collection to the client every time a client connects, which is expensive and time-consuming at both ends if it's more than a MB or so, and there isn't really any way round that overhead when you're talking about a dynamic (i.e. frequently-changing) collection, which I think you are (whereas for largely static collections you can do things with localStorage and poll-and-diff).