Windows Azure Paging Large Datasets Solution - windows

I'm using Windows Azure Table Storage to store millions of entities, however I'm trying to figure out the best solution that easily allows for two things:
1) a search on an entity, will retrieve that entity and at least (pageSize) number of entities either side of that entity
2) if there are more entities beyond (pageSize) number of entities either side of that entity, then page next or page previous links are shown, this will continue until either the start or end is reached.
3) the order is reverse chronological order
I've decided that the PartitionKey will be the Title provided by the user as each container is unique in the system. The RowKey is Steve Marx's lexiographical algorithm:
http://blog.smarx.com/posts/using-numbers-as-keys-in-windows-azure
which when converted to javascript instead of c# looks like this:
pad(new Date(100000000 * 86400000).getTime() - new Date().getTime(), 19) + "_" + uuid()
uuid() is a javascript function that returns a guid and pad adds zeros up to 19 chars in length. So records in the system look something like this:
PK RK
TEST 0008638662595845431_ecf134e4-b10d-47e8-91f2-4de9c4d64388
TEST 0008638662595845432_ae7bb505-8594-43bc-80b7-6bd34bb9541b
TEST 0008638662595845433_d527d215-03a5-4e46-8a54-10027b8e23f8
TEST 0008638662595845434_a2ebc3f4-67fe-43e2-becd-eaa41a4132e2
This pattern allows for every new entity inserted to be at the top of the list which satisfies point number 3 above.
With a nice way of adding new records in the system I thought then I would create a mechanism that looks at the first half of the RowKey i.e. 0008638662595845431_ part and does a greater than or less than comparison depending on which direction of the already found item. In other words to get the row immediately before 0008638662595845431 I would do a query like so:
var tableService = azure.createTableService();
var minPossibleDateTimeNumber = pad(new Date(-100000000*86400000).getTime() - new Date().getTime(), 19);
tableService.getTable('testTable', function (error) {
if (error === null) {
var query = azure.TableQuery
.select()
.from('testTable')
.where('PartitionKey eq ?', 'TEST')
.and('RowKey gt ?', minPossibleDateTimeNumber + '_')
.and('RowKey lt ?', '0008638662595845431_')
.and('Deleted eq ?', 'false');
If the results returned are greater than 1000 and azure gives me a continuation token, then I thought I would remember the last items RowKey i.e. the number part 0008638662595845431. So now the next query will have the remembered value as the starting value etc.
I am using Windows Azure Node.Js SDK and language is javascript.
Can anybody see gotcha's or problems with this approach?

I do not see how this can work effectively and efficiently, especially to get the rows for a previous page.
To be efficient, the prefix of your “key” needs to be a serially incrementing or decrementing value, instead of being based on a timestamp. A timestamp generated value would have duplicates as well as holes, making mapping page size to row count at best inefficient and at worst difficult to determine.
Also, this potential algorithm is dependent on a single partition key, destroying table scalability.
The challenge here would be to have a method of generating a serially incremented key. One solution is to use a SQL database and performing an atomic update on a single row, such that an incrementing or decrementing value is produced in sequence. Something like UPDATE … SET X = X + 1 and return X. Maybe using a stored procedure.
So the key could be a zero left padded serially generated number. Split such that say the first N digits of the number is the partition key and remaining M digits are the row key.
For example
PKey RKey
00001 10321
00001 10322
….
00954 98912
Now, since the rows are in sequence it is possible to write a query with the exact key range for the page size.
Caveat. There is a small risk of a failure occurring between generating a serial key and writing to table storage. In which case, there may be holes in the table. However, your paging algorithm should be able to detect and work around such instances quite easily by specify a page size slightly larger than necessary or by retrying with an adjusted range.

Related

Sum value flow power automate

I have a table on azure following :
enter image description here
Can anyone help me to make a sum perday of number of users with flow power automate :
enter image description here
Thanks in advance
Ok, this is a monster of an answer but it works so follow closely.
Refer to the images for what to do.
The basic concept is, loop through all entities and fill an array with the distinct row keys. The way we determine if it's distinct or not is by adding the row key to an array IF it hasn't been added previously.
From there, we will loop through that distinct list and using an inner loop, we will sum each NumberOfUsers column IF the Inner Row Key matches the Outer Row Key that is being processed.
At the end of the outer loop, add an object to an array. That object has two fields, "RowKey" and "NumberOfUsers". The "NumberOfUsers" field contains the summation for that given RowKey.
From here, you have the distinct count.
If I've mis-used any fields (i.e. the use of RowKey) then change it up as need be.
This is just logic, you just need to apply it to the scenario. I think this is best done in an Azure Function because it'll run faster and be a lot less to maintain but if you want to avoid that and use PowerAutomate, this works.
Flow
Data / Table
Result

Is it indexing Or tagging?

I have two classes claim and index. i have a field in my claim class called topic which is a string. I m trying to index the topic column not using database index column features. But it should by coding the following method.
Suppose i have claim 1, for claim 1 topic field ("i love muffins muffins") i ll do the folowing treatment
#1. Create an empty Dictionary with "word"=>occurrences
#2. Create a List of the stopwords exemple stopwords = ("For","This".....etc )
#3. Create List of the delimiters exemple delimiter_chars = ",.;:!?"
#4. Split the Text(topic field) into words delimited by whitespace.
#5. Remove unwanted delimiter characters adjoining words.
#6. Remove stopwords.
#7. Remove Duplicate
#8. now i create multiple index object (word="love",occurences = 1,looked = 0,reference on claim 1),(word="muffins",occurences = 2,looked = 0,reference on claim 1),
now whenever i look the word muffins for exemple looked will increase by one and i will move the record up in my database. So my question is the following is this method good ? is it better than database index features ? is there someways to improve things ?
What I think you are looking for is something called a B-Tree. In your case, you would use a 26 (or 54 if you need case sensitivity) branch node in the tree. This will make finding objects very fast. I think the time is nlogn or something. In the node, you would have a pointer to the actual data in an array, list, file, or something else.
However, unless you are willing to put the time in to code something specific for your application, you might be better off using a database such as Oracle, Microsoft SQL Server, or MySQL because these are professionally developed and profiled to get the maximum performance possible.

How could I quickly look-up items in a List of loaded entities

I have built an MVC 5 application, using EF 6 to query the database. One page show a cross table of two dimensions: substances against properties of these substances. It is rendered as an html table.Many cells do not have a value. This is what it looks like:
sub 1 sub 2 sub 3
prop A 1.0
prop B 1.5 X
prop C 0.6 Y
The cell values are actually more complex, including tool tips, footnotes, etc.
I implemented the generation of the html table, by the following steps:
create a list of unique properties;
create a list of unique substances;
loop through the properties;
render a row for each;
loop through the substances;
See if there is a value for the combination of property and substances;
render the cell's value or an empty one.
Using the ANTS performance profiler, I found out that step 6 has a huge performance issue with increasing numbers of substances and properties, the hit count exploding to hundreds of millions, with a few hundred substances and a few tens of properties (the largest selection the user can make). The execution time is many minutes. It seems to scale N(substances)^2 * N(properties)^2.
The code looks like:
Value currentValue =
values.Where(val => val.substance.Id == currentSubstanceId
&& val.property.Id == currentPropertyId).SingleOrDefault();
where values is a List and Value is an entity, which I read from to render the cells. values had been pre-loaded from the database and no queries are shown by the SQL Server Profiler.
Since not all cells have a value, I thought it best to loop through the row and columns and see if there is a value. I cannot just loop through the list of values.
What could I try to improve this? I thought about:
Create some sort of C# object, using the substance.Id and property.Id as a compound key and fill it from the List object. Which would be fastest?
Create some Linq query which returns an object which already contains the empty cells, like (substance cross join properties) left join values. I could do this in SQL easily, but could this be done with Linq? Could the object which stores the result have the Value as a member field, so I can still use it to render the cells?
Stop pre-loading and just run a database query for the value of each combination, possibly benefiting from database indexes.
I am considering restricting the number of substances and properties the user may select, but I would rather not do that.
Addtional info
As requested by C.Zonnenberg, some more info about the query.
The query to fill the list of values is basically as follows:
I create an IQueryable to which I add filters for requested substances and properties. I then include the substances, property and value details, found in related entities. I then execute query.ToList(). The actual SQL query, as seen by the SQL Profiler looks complex, involving SubstanceId IN () and PropertyId IN (), but it takes far less then a second to execute.
It returns a list of proxies, like: {System.Data.Entity.DynamicProxies.SubstancePropertyValue_078F758A4FF9831024D2690C4B546F07240FAC82A1E9D95D3826A834DCD91D1E}
I think your best bet is your first option. But to do that efficiently I would also modify the source data (values) and turn it into a dictionary, so you have a structure that's optimized for indexed lookup:
var dict = values.ToDictionary(e =>
Tuple.Create(e.substance.id, e.propertyid),
e => e.Value);
Then for each cell:
Value currentValue ;
dict.TryGetValue(Tuple.Create(currentSubstanceId, currentPropertyId),
out currentValue );
Further, you may benefit from parallelization by fetching the cell values in a Parallel.ForEach looping through all substances, for instance.

Removing duplicates from a huge file

We have a huge chunk of data and we want to perform a few operations on them. Removing duplicates is one of the main operations.
Ex.
a,me,123,2631272164
yrw,wq,1237,123712,126128361
yrw,dsfswq,1323237,12xcvcx3712,1sd26128361
These are three entries in a file and we want to remove duplicates on the basis of 1st column. So, 3rd row should be deleted. Each row may have different number of columns but the column we are interested into, will always be present.
In memory operation doesn't look feasible.
Another option is to store the data in database and removing duplicates from there but it's again not a trivial task.
What design should I follow to dump data into database and removing duplicates?
I am assuming that people must have faced such issues and solved it.
How do we usually solve this problem?
PS: Please consider this as a real life problem rather than interview question ;)
If the number of keys is also infeasible to load into memory, you'll have to do a Stable(order preserving) External Merge Sort to sort the data and then a linear scan to do duplicate removal. Or you could modify the external merge sort to provide duplicate elimination when merging sorted runs.
I guess since this isn't an interview question or efficiency/elegance seems to not be an issue(?). Write a hack python script that creates 1 table with the first field as the primary key. Parse this file and just insert the records into the database, wrap the insert into a try except statement. Then preform a select * on the table, parse the data and write it back to a file line by line.
If you go down the database route, you can load the csv into a database and use 'duplicate key update'
using mysql:-
Create a table with rows to match your data (you may be able to get away with just 2 rows - id and data)
dump the data using something along the lines of
LOAD DATA LOCAL infile "rs.txt" REPLACE INTO TABLE data_table FIELDS TERMINATED BY ',';
You should then be able to dump out the data back into csv format without duplicates.
If the number of unique keys aren't extremely high, you could simply just do this;
(Pseudo code since you're not mentioning language)
Set keySet;
while(not end_of_input_file)
read line from input file
if first column is not in keySet
add first column to keySet
write line to output file
end while
If the input is sorted or can be sorted, then one could do this which only needs to store one value in memory:
r = read_row()
if r is None:
os.exit()
last = r[0]
write_row(r)
while True:
r = read_row()
if r is None:
os.exit()
if r[0] != last:
write_row(r)
last = r[0]
Otherwise:
What I'd do is keep a set of the first column values that I have already seen and drop the row if it is in that set.
S = set()
while True:
r = read_row()
if r is None:
os.exit()
if r[0] not in S:
write_row(r)
S.add(r[0])
This will stream over the input using only memory proportional to the size of the set of values from the first column.
If you need to preserve order in your original data, it MAY be sensible to create new data that is a tuple of position and data, then sort on the data you want to de-dup. Once you've sorted by data, de-duplication is (essentially) a linear scan. After that, you can re-create the original order by sorting on the position-part of the tuple, then strip it off.
Say you have the following data: a, c, a, b
With a pos/data tuple, sorted by data, we end up with: 0/a, 2/a, 3/b, 1/c
We can then de-duplicate, trivially being able to choose either the first or last entry to keep (we can also, with a bit more memory consumption, keep another) and get: 0/a, 3/b, 1/c.
We then sort by position and strip that: a, c, b
This would involve three linear scans over the data set and two sorting steps.

What would be the best algorithm to find an ID that is not used from a table that has the capacity to hold a million rows

To elaborate ..
a) A table (BIGTABLE) has a capacity to hold a million rows with a primary Key as the ID. (random and unique)
b) What algorithm can be used to arrive at an ID that has not been used so far. This number will be used to insert another row into table BIGTABLE.
Updated the question with more details..
C) This table already has about 100 K rows and the primary key is not an set as identity.
d) Currently, a random number is generated as the primary key and a row inserted into this table, if the insert fails another random number is generated. the problem is sometimes it goes into a loop and the random numbers generated are pretty random, but unfortunately, They already exist in the table. so if we re try the random number generation number after some time it works.
e) The sybase rand() function is used to generate the random number.
Hope this addition to the question helps clarify some points.
The question is of course: why do you want a random ID?
One case where I encountered a similar requirement, was for client IDs of a webapp: the client identifies himself with his client ID (stored in a cookie), so it has to be hard to brute force guess another client's ID (because that would allow hijacking his data).
The solution I went with, was to combine a sequential int32 with a random int32 to obtain an int64 that I used as the client ID. In PostgreSQL:
CREATE FUNCTION lift(integer, integer) returns bigint AS $$
SELECT ($1::bigint << 31) + $2
$$ LANGUAGE SQL;
CREATE FUNCTION random_pos_int() RETURNS integer AS $$
select floor((lift(1,0) - 1)*random())::integer
$$ LANGUAGE sql;
ALTER TABLE client ALTER COLUMN id SET DEFAULT
lift((nextval('client_id_seq'::regclass))::integer, random_pos_int());
The generated IDs are 'half' random, while the other 'half' guarantees you cannot obtain the same ID twice:
select lift(1, random_pos_int()); => 3108167398
select lift(2, random_pos_int()); => 4673906795
select lift(3, random_pos_int()); => 7414644984
...
Why is the unique ID Random? Why not use IDENTITY?
How was the ID chosen for the existing rows.
The simplest thing to do is probably (Select Max(ID) from BIGTABLE) and then make sure your new "Random" ID is larger than that...
EDIT: Based on the added information I'd suggest that you're screwed.
If it's an option: Copy the table, then redefine it and use an Identity Column.
If, as another answer speculated, you do need a truly random Identifier: make your PK two fields. An Identity Field and then a random number.
If you simply can't change the tables structure checking to see if the id exists before trying the insert is probably your only recourse.
There isn't really a good algorithm for this. You can use this basic construct to find an unused id:
int id;
do {
id = generateRandomId();
} while (doesIdAlreadyExist(id));
doSomethingWithNewId(id);
Your best bet is to make your key space big enough that the probability of collisions is extremely low, then don't worry about it. As mentioned, GUIDs will do this for you. Or, you can use a pure random number as long as it has enough bits.
This page has the formula for calculating the collision probability.
A bit outside of the box.
Why not pre-generate your random numbers ahead of time? That way, when you insert a new row into bigtable, the check has already been made. That would make inserts into bigtable a constant time operation.
You will have to perform the checks eventually, but that could be offloaded to a second process that doesn’t involve the sensitive process of inserting into bigtable.
Or go generate a few billion random numbers, and delete the duplicates, then you won't have to worry for quite some time.
Make the key field UNIQUE and IDENTITY and you wont have to worry about it.
If this is something you'll need to do often you will probably want to maintain a live (non-db) data structure to help you quickly answer this question. A 10-way tree would be good. When the app starts it populates the tree by reading the keys from the db, and then keeps it in sync with the various inserts and deletes made in the db. So long as your app is the only one updating the db the tree can be consulted very quickly when verifying that the next large random key is not already in use.
Pick a random number, check if it already exists, if so then keep trying until you hit one that doesn't.
Edit: Or
better yet, skip the check and just try to insert the row with different IDs until it works.
First question: Is this a planned database or a already functional one. If it already has data inside then the answer by bmdhacks is correct. If it is a planned database here is the second question:
Does your primary key really need to be random? If the answer is yes then use a function to create a random id from with a known seed and a counter to know how many Ids have been created. Each Id created will increment the counter.
If you keep the seed secret (i.e., have the seed called and declared private) then no one else should be able to predict the next ID.
If ID is purely random, there is no algorithm to find an unused ID in a similarly random fashion without brute forcing. However, as long as the bit-depth of your random unique id is reasonably large (say 64 bits), you're pretty safe from collisions with only a million rows. If it collides on insert, just try again.
depending on your database you might have the option of either using a sequenser (oracle) or a autoincrement (mysql, ms sql, etc). Or last resort do a select max(id) + 1 as new id - just be carefull of concurrent requests so you don't end up with the same max-id twice - wrap it in a lock with the upcomming insert statement
I've seen this done so many times before via brute force, using random number generators, and it's always a bad idea. Generating a random number outside of the db and attempting to see if it exists will put a lot strain on your app and database. And it could lead to 2 processes picking the same id.
Your best option is to use MySQL's autoincrement ability. Other databases have similar functionality. You are guaranteed a unique id and won't have issues with concurrency.
It is probably a bad idea to scan every value in that table every time looking for a unique value. I think the way to do this would be to have a value in another table, lock on that table, read the value, calculate the value of the next id, write the value of the next id, release the lock. You can then use the id you read with the confidence your current process is the only one holding that unique value. Not sure how well it scales.
Alternatively use a GUID for your ids, since each newly generated GUID is supposed to be unique.
Is it a requirement that the new ID also be random? If so, the best answer is just to loop over (randomize, test for existence) until you find one that doesn't exist.
If the data just happens to be random, but that isn't a strong constraint, you can just use SELECT MAX(idcolumn), increment in a way appropriate to the data, and use that as the primary key for your next record.
You need to do this atomically, so either lock the table or use some other concurrency control appropriate to your DB configuration and schema. Stored procs, table locks, row locks, SELECT...FOR UPDATE, whatever.
Note that in either approach you may need to handle failed transactions. You may theoretically get duplicate key issues in the first (though that's unlikely if your key space is sparsely populated), and you are likely to get deadlocks on some DBs with approaches like SELECT...FOR UPDATE. So be sure to check and restart the transaction on error.
First check if Max(ID) + 1 is not taken and use that.
If Max(ID) + 1 exceeds the maximum then select an ordered chunk at the top and start looping backwards looking for a hole. Repeat the chunks until you run out of numbers (in which case throw a big error).
if the "hole" is found then save the ID in another table and you can use that as the starting point for the next case to save looping.
Skipping the reasoning of the task itself, the only algorithm that
will give you an ID not in the table
that will be used to insert a new line in the table
will result in a table still having random unique IDs
is generating a random number and then checking if it's already used
The best algorithm in that case is to generate a random number and do a select to see if it exists, or just try to add it if your database errs out sanely. Depending on the range of your key, vs, how many records there are, this could be a small amount of time. It also has the ability to spike and isn't consistent at all.
Would it be possible to run some queries on the BigTable and see if there are any ranges that could be exploited? ie. between 100,000 and 234,000 there are no ID's yet, so we could add ID's there?
Why not append your random number creator with the current date in seconds. This way the only way to have an identical ID is if two users are created at the same second and are given the same random number by your generator.

Resources