WP7 Sterling very slow storing lots of data - windows-phone-7

I'm trying to store 46,000 objects in Sterling and it's taking 3 minutes.
Yes I know it's a lot but this is data provided by the customer and could end up being a lot more.
I'm guessing each time I save a new object it is looking up the key to see if the object has already been stored.
Is there any way to bypass this and tell Sterling to just insert?
Any other ideas?

Without knowing anything about your data structure it's a bit difficult to recommend ways in which you could improve performance, however:
The fewer indexes you create for your data tables the fewer indexes there are to create when your data is persisted. You should look carefully at which indexes you need for your data read scenarios.
The more data relationships there are, the more metadata there is to create at write time. You may be able to simplify the data structures and combine classes.
Sheer volume of data sounds like your biggest problem. I've experienced similar problems before when trying to persist large volumes of GPS data. The problem there was that I was trying to write a lot of relatively small pieces of data, all related to a single record in another table. I resolved this by consolidating the GPS data into a single string and persisting it as a field on the main record. This traded a lot of read/write time for a much smaller amount of time spent rehydrating the data when it was actually needed.
I would definitely recommend reaching out to Jeremy and the Sterling team via the CodePlex site if none of the above helps.

Have you considered keeping most of the data server-side, and presenting the client with only a window into that data, something like 20 or 50 rows at a time?
EDIT: since the answer is no, I'd disable the table's indexes while the bulk operation is running and rebuild them afterwards, or use SQL bulk copy.
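For the bulk-copy route, a minimal sketch of the disable-then-rebuild pattern on SQL Server might look like this (the table, index and file names are purely hypothetical):
-- Disable the nonclustered indexes so the bulk load doesn't maintain them row by row.
ALTER INDEX IX_Items_CustomerKey ON dbo.Items DISABLE;
-- Load the customer's file in one set-based operation.
BULK INSERT dbo.Items
FROM 'C:\import\items.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);
-- Rebuild the indexes once, after all the rows are in.
ALTER INDEX IX_Items_CustomerKey ON dbo.Items REBUILD;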

Related

Mixing column and row oriented databases?

I am currently trying to improve the performance of a web application. The goal of the application is to provide (real-time) analytics. We have a database model that is similar to a star schema: a few fact tables and many dimension tables. The database runs on MySQL with the MyISAM engine.
The fact tables can easily grow into the upper millions of rows, and some dimension tables can also reach the millions.
Now the point is, select queries can get awfully slow when the dimension tables are joined onto the fact tables and aggregations are done on top. The first thing that comes to mind is: why not precalculate the data? That isn't possible because users are allowed to apply several freely customizable filters.
So what I need is an all-in-one system suitable for every purpose ;) Sadly, it hasn't been invented yet, so I came up with the idea of combining two existing systems: mixing a row-oriented and a column-oriented database (e.g. InfiniDB or Infobright). I would keep the MySQL/MyISAM solution (for fast inserts and row-based queries), add a column-oriented database (for fast aggregation operations on a few columns), and fill it periodically (nightly) via a cron job. The problem would arise when current data (it must be real time) is queried: I might then need to get data from both databases, which can complicate things.
First tests with InfiniDB showed really good performance when aggregating a few columns, so I really think this could help speed up the application.
So the question is: is this a good idea? Has anybody already done this? Maybe there are better ways to do it.
I have no experience with column-oriented databases yet, and I'm also not sure what their schema should look like. First tests showed good performance with the same star-schema-like structure, but also with a single big flat table.
I hope this question fits on SO.
Greenplum, which is a proprietary (but mostly free-as-in-beer) extension to PostgreSQL, supports both column-oriented and row-oriented tables with highly customizable compression. Further, you can mix settings within the same table if you expect that some parts will experience heavy transactional load while others won't. E.g., you could have the most recent year be row-oriented and uncompressed, the prior year column-oriented and quicklz-compressed, and all historical years column-oriented and bz2-compressed.
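To make that concrete, a minimal sketch of those storage options in Greenplum DDL might look like this (table, column and option values are illustrative; check the Greenplum docs for the exact syntax on your version):
-- Historical facts: append-only, column-oriented, quicklz-compressed.
CREATE TABLE fact_sales_hist (
    sale_date   date,
    customer_id integer,
    amount      numeric
)
WITH (appendonly = true, orientation = column, compresstype = quicklz)
DISTRIBUTED BY (customer_id);
-- Current, write-heavy data can stay in an ordinary row-oriented heap table.
CREATE TABLE fact_sales_current (LIKE fact_sales_hist)
DISTRIBUTED BY (customer_id);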
Greenplum is free for use on individual servers, but if you need to scale out with its MPP features (which are its primary selling point) it does cost significant amounts of money, as they're targeting large enterprise customers.
(Disclaimer: I've dealt with Greenplum professionally, but only in the context of evaluating their software for purchase.)
As for the issue of how to set up the schema, it's hard to say much without knowing the particulars of your data, but in general having compressed column-oriented tables should make all of your intuitions about schema design go out the window.
In particular, normalization is almost never worth the effort, and you can sometimes get big gains in performance by denormalizing to borderline-comical levels of redundancy. If the data never hits disk in an uncompressed state, you might just not care that you're repeating each customer's name 40,000 times. Infobright's compression algorithms are designed specifically for this sort of application, and it's not uncommon at all to end up with 40-to-1 ratios between the logical and physical sizes of your tables.

SQL query to search faster or using hash table

If I am looking for a record in the database, is writing a sql query to search the database directly faster OR is reading the entire data from the database into a hashtable and then searching in O(1) time faster?
This question is for experienced programmers who have faced such issues in the past.
If you know the primary key of the row, or the column you are searching on is indexed, then doing the retrieval using SQL will be much faster, especially if your table does not fit into memory.
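Purely as an illustration (hypothetical table and column names), the indexed route is just:
-- Let the database do the lookup instead of loading every row into application memory.
CREATE INDEX idx_customers_email ON customers (email);
SELECT id, name
FROM customers
WHERE email = 'someone@example.com';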
Making a direct SQL query to the database will obviously be much faster than first reading all the records into a hash table and then searching that. Not only do you avoid the time spent loading every record into the hash table before you can search it, you also save the large amount of memory the hash table would consume.
I have run into this kind of situation before. Hope this helps!
A SQL Server database will generally be faster and more practical than a hash table, for one important reason: the hash table approach has to read all of the data from secondary storage and load it into memory before you can search it. It's easy to see what happens next: with a large volume of data the system slows down and the records become difficult to manipulate and retrieve.
A DBMS, by contrast, is a much more convenient environment. If you are only querying a few thousand records you may not even need to create an index (it depends on your needs), and it also works well from a remote machine in a three-tier application, taking care of concerns such as row counts and I/O speed for you.
If the SQL table is not indexed, you'd have to benchmark to find your answer. Since there are lots of factors, such as the row count, I/O speed and network speed (if the database is on a remote machine), it is hard to give a blanket answer to the question.
On the other hand, indexing the table is the better choice. Just leave the DBMS's job to the DBMS.

Wanted: DB for fast read operations to be accessed from ruby apps

Basically it's a financial database, with both daily and intraday data (date,symbol,open,high,low,close,vol,openinterest) -- very simple structure. Updates are just once a day. A typical query would be: date and close price of MSFT for all dates in DB. I was thinking that there's got to be something out there that's been optimized for lots of reads and not many writes, as opposed to a general-purpose RDBMS like MySQL. I searched rubyforge.org, and I didn't see anything that specifically addressed this (as far as I could tell).
MS SQL Server can be optimized for this kind of read-heavy workload with the fairly simple:
ALTER DATABASE myDatabase
SET READ_COMMITTED_SNAPSHOT ON
SQL Server will automatically cache your data in memory if it is being used heavily for reads.
You can always use a RAM disk for your MySQL installation if your database footprint is small enough. One way to make your tables small enough to fit is to create them with the ARCHIVE storage engine. While ARCHIVE tables are very compact and compressed, they can only be appended to or read from, not updated. (http://dev.mysql.com/tech-resources/articles/storage-engine.html)
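For example, moving a year of historical data onto the ARCHIVE engine is a one-liner (the table name is illustrative):
-- Compact, compressed storage; rows can be appended and read but not updated.
ALTER TABLE daily_prices_2008 ENGINE = ARCHIVE;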
Generally a properly indexed and well organized MySQL table is really fast, especially when using MyISAM, and even more so when loaded from memory. The key is to denormalize the data as heavily as you can, optimizing for your particular read scenarios.
For example, having a stock_id, date, price tuple is going to be fairly slow to sort and retrieve. If you have, instead, stock_id and a column with some serialized data, the retrieval time will be very quick.
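A rough sketch of that layout in MySQL, with invented names and an invented serialization format:
-- One row per symbol and year; the whole series lives in a single column.
CREATE TABLE daily_prices (
    symbol     VARCHAR(10) NOT NULL,
    price_year SMALLINT    NOT NULL,
    ohlc_csv   MEDIUMTEXT  NOT NULL,  -- e.g. "2010-01-04,open,high,low,close,volume;..."
    PRIMARY KEY (symbol, price_year)
) ENGINE = MyISAM;
-- One indexed read returns the entire series for a symbol and year.
SELECT ohlc_csv FROM daily_prices WHERE symbol = 'MSFT' AND price_year = 2010;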
Another solution that is likely faster is to push all the data into an alternative DBMS like Tokyo Cabinet or something similar, especially if your data fits neatly into a key/value store.
Look at MySQL, but run the database from memory instead of disk. Depends on the size of your dataset and your budget, but you could then update memory from disk once a day, and have a very, very fast read time afterwards.
The best-known (to me at least!) time series database is Fame but it's expensive and I strongly doubt that there's anything like, say, an ActiveRecord implementation for it. Unless it's changed a lot in the 10 or so years since I last touched it, it isn't SQL-friendly at all.
With a fairly tightly-focused application, you can take a more flexible view of your data. For example, consider what is the information that you're actually looking to store? Is it the atomic price/hi/lo/close/vol/whatever, or is it more appropriately a time series of such values? If you always want to view the series, store a series per row, not a value.
Throwing a few ideas out here...
How might it look if you stored a year or a month of a single value for a single stock in one row? Maybe as an XML string, or JSON or something more terse of your own devising. Compressed CSV, perhaps? That ought to fit a month's values into a 255-character column. (Use something like Huffman coding to do the encoding, perhaps - a single dictionary ought to work for all instances of such similar data).
You can still hold a horizontal view as well: with the extremely low update rate you'll have (should only be data fixes, I'd guess) you can probably stand to build that stuff.
There's an obvious downside to this: you'll have a bunch of extra work to do.
I don't have any personal experience, but MongoDB claims to offer relational-style flexibility with key-value performance.
As mentioned elsewhere, a key-value database might be worth looking at: Tokyo Cabinet, CouchDB or one of the others; again, perhaps with a concatenated value for the time series.

Normalize or Denormalize in high traffic websites

What are the best practices for database design and normalization for high traffic websites like stackoverflow?
Should one use a normalized database for record keeping, a denormalized technique, or a combination of both?
Is it sensible to design a normalized database as the main database for record keeping to reduce redundancy and at the same time maintain another denormalized form of the database for fast searching?
or
Should the main database be denormalized but with normalized views at the application level for fast database operations?
or some other approach?
The performance hit of joining is frequently overestimated. Database products like Oracle are built to join very efficiently. Joins are often regarded as performing badly when the real culprit is a poor data model or a poor indexing strategy. People also forget that denormalised databases perform very badly when it comes to inserting or updating data.
The key thing to bear in mind is the type of application you're building. Most of the famous websites are not like regular enterprise applications. That's why Google, Facebook, etc don't use relational databases. There's been a lot of discussion of this topic recently, which I have blogged about.
So if you're building a website which is primarily about delivering shedloads of semi-structured content you probably don't want to be using a relational database, denormalised or otherwise. But if you're building a highly transactional website (such as an online bank) you need a design which guarantees data security and integrity, and does so well. That means a relational database in at least third normal form.
Denormalizing the db to reduce the number of joins needed for intense queries is one of many different ways of scaling. Having to do fewer joins means less heavy lifting by the db, and disk is cheap.
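As a hedged example of that kind of denormalization (MySQL syntax; table and column names invented for illustration), you might copy a frequently joined value onto the table that is read most often:
-- Store the author's display name redundantly so listing pages skip the users join.
ALTER TABLE posts ADD COLUMN author_display_name VARCHAR(100);
UPDATE posts p
JOIN users u ON u.id = p.author_id
SET p.author_display_name = u.display_name;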
That said, for ridiculous amounts of traffic, good relational DB performance can be hard to achieve. That is why many bigger sites use key-value stores (e.g. memcached) and other caching mechanisms.
The Art of Capacity Planning is pretty good.
You can listen to a discussion on this very topic by the creators of Stack Overflow on their podcast at:
http://itc.conversationsnetwork.org/shows/detail3993.html
First: define for yourself what high-traffic means:
50,000 page views per day?
500,000 page views per day?
5,000,000 page views per day?
More?
Then calculate this down to a probable peak rate of page views per minute and per second. For example, 5,000,000 page views per day averages out to roughly 58 per second, and real peaks are usually several times that average.
After that think about the data you want to query per page-view. Is the data cacheable? How dynamic is the data, how big is the data?
Analyze your individual requirements, program some code, do some load-testing, optimize. In most cases, before you need to scale out the database servers you need to scale out the web-servers.
A relational database can be, if fully optimized, amazingly fast, even when joining tables!
A relational database could also be hit only seldom when used as a back-end, to populate a cache or to fill some denormalized data tables. I would not make denormalization the default approach.
(You mentioned search; look into e.g. Lucene or something similar if you need full-text search.)
The best best-practice answer is definitely: It depends ;-)
For a project I'm working on, we've gone the denormalized-table route, as we expect our major tables to have a high ratio of writes to reads (instead of all users hitting the same tables, we've denormalized them and set each "user set" to use a particular shard). You may find it worth reading http://highscalability.com/ for examples of how the "big sites" cope with the volume; Stack Overflow was recently featured.
Neither matters if you aren't caching properly.

schema-less data warehouse and reporting

We have a system that generates many events as the result of a phone call/web request/SMS/email etc. Each of these events needs to be stored and must be available for reporting (for MI/BI etc.), and each event has many variables and does not fit any one specific schema.
The structure of the event document is a key-value pair list (cdr=1&name=Paul&duration=123&postcode=l21). Currently we have a SQL Server system that uses dynamically generated sparse columns to store our (flat) documents, and we have reports that run against that data. For many different reasons I am looking at other solutions.
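For context, the current approach looks roughly like this sketch (SQL Server sparse columns; the column set is just the example keys above, not the real schema):
-- One physical column per possible key; SPARSE keeps the many NULLs nearly free.
CREATE TABLE events (
    event_id bigint IDENTITY(1,1) PRIMARY KEY,
    cdr      int           SPARSE NULL,
    name     nvarchar(100) SPARSE NULL,
    duration int           SPARSE NULL,
    postcode nvarchar(10)  SPARSE NULL
);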
I am looking for suggestions for a system (open or closed) that allows us to push these events in (regardless of the schema) and provides reporting and analytics on top of them.
I have seen Pentaho and Jasper, but most of them seem to connect to an existing system to get the data out of it and then report on it. I really just want to be able to push a document in and have it available to be reported on.
As much as I love CouchDB, I am looking for a system that allows schema-less submission of data and reporting on top of it (much like Pentaho, Jasper, SQL Server Reporting/Analysis Services etc.).
I don't think there is any DBMS that will do what you want and allow an off-the-shelf reporting tool to be used. Low-latency analytic systems are not quick and easy to build. Low-latency on unstructured data is quite ambitious.
You are going to have to persist the data in some sort of database, though.
I think you may have to take a closer look at your problem domain. Are you trying to run low-latency analytical reports, or an operational report that prompts some action within the business when certain events occur? For low-latency systems you need to be quite ruthless about what constitutes operational reporting and what constitutes analytics.
Edit: Discourage the 'potentially both' mindset unless the business is prepared to pay. Investment banks and hedge funds spend big bucks and purchase supercomputers to do 'real-time analytics'. It's not a trivial undertaking, and it's even less trivial when you try to build such a system for high uptime.
Even on apps like premium-rate SMS services and .com applications the business often backs down when you do a realistic scope and cost analysis of the problem. I can't say this enough. Be really, really ruthless about 'realtime' requirements.
If the business really, really need realtime analytics then you can make hybrid OLAP architectures where you have a marching lead partition on the fact table. This is an architecture where the fact table or cube is fully indexed for historical data but has a small leading partition that is not indexed and thus relatively quick to insert data into.
Analytic queries will table scan the relatively small leading data partition and use more efficient methods on the other partitions. This gives you low latency data and the ability to run efficient analytic queries over the historical data.
Run a process nightly that rolls over to a new leading partition and consolidates/indexes the previous lead partition.
This works well where you have items such as bitmap indexes (on databases) or materialised aggregations (on cubes) that are expensive on inserts. The lead partition is relatively small and cheap to table scan but efficient to trickle insert into. The roll-over process incrementally consolidates this lead partition into the indexed historical data which allows it to be queried efficiently for reports.
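A minimal sketch of that idea using two plain tables and a view, rather than native partitioning (all names hypothetical):
-- Indexed history: efficient for analytic queries.
CREATE TABLE fact_events_hist (
    event_time datetime NOT NULL,
    caller_id  int      NOT NULL,
    duration_s int      NOT NULL
);
CREATE INDEX ix_hist_time ON fact_events_hist (event_time);
-- Unindexed lead table: cheap to trickle-insert into, small enough to table scan.
CREATE TABLE fact_events_lead (
    event_time datetime NOT NULL,
    caller_id  int      NOT NULL,
    duration_s int      NOT NULL
);
-- Queries see both partitions through one view.
CREATE VIEW fact_events AS
    SELECT * FROM fact_events_hist
    UNION ALL
    SELECT * FROM fact_events_lead;
-- Nightly rollover: consolidate the lead rows into the indexed history.
INSERT INTO fact_events_hist SELECT * FROM fact_events_lead;
TRUNCATE TABLE fact_events_lead;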
Edit 2: The common fields might be candidates to set up as dimensions on a fact table (e.g. caller, time). The less common fields are (presumably) coding. For an efficient schema you could move the optional coding into one or more 'junk' dimensions.
Briefly, a junk dimension is one that represents every existing combination of two or more codes. A row on the dimension table doesn't relate to a single system entity; each row corresponds to a distinct combination of coding that occurs in the raw data.
In order to have any analytic value you are still going to have to organise the data so that the columns in the junk dimension contain something consistently meaningful. This goes back to some requirements work to make sure that the mappings from the source data make sense. You can deal with items that are not always recorded by using a placeholder value such as a zero-length string (''), which is probably better than nulls.
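A toy junk dimension for the example above might look like this (codes invented for illustration):
-- One row per distinct combination of the optional coding fields; the fact table
-- carries a single junk_key instead of the individual flags.
CREATE TABLE dim_junk_flags (
    junk_key     int PRIMARY KEY,
    channel      varchar(10),  -- 'phone', 'web', 'sms', 'email'
    has_postcode char(1),      -- 'Y' / 'N', '' when not recorded
    callback     char(1)       -- 'Y' / 'N', '' when not recorded
);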
Now I think I see the underlying requirements. This is an online or phone survey application with custom surveys. The way to deal with this requirement is to fob the analytics off onto the client. No online tool will let you turn around schema changes in 20 minutes.
I've seen this type of requirement before and it boils down to the client wanting to do some stats on a particular survey. If you can give them a CSV based on the fields (i.e. with named header columns) in their particular survey, they can import it into Excel and pivot it from there.
This should be fairly easy to implement from a configurable online survey system as you should be able to read the survey configuration. The client will be happy that they can play with their numbers in Excel as they don't have to get their head around a third party tool. Any competent salescritter should be able to spin this to the client as a good thing. You can use a spiel along the lines of 'And you can use familiar tools like Excel to analyse your numbers'. (or SAS if they're that way inclined)
Wrap the exporter in a web page so they can download it themselves and get up-to-date data.
Note that the wheels will come off if you have data volumes over 65,535 respondents per survey, as this won't fit onto a spreadsheet tab. Excel 2007 increases this limit to 1,048,575. However, surveys with this volume of response will probably be in the minority. One possible workaround is to provide a means to get random samples of the data that are small enough to work with in Excel.
Edit: I don't think there are other solutions that are sufficiently flexible for this type of application. You've described a holy grail of survey statistics.
I still think that the basic strategy is to give them a data dump. You can pre-package it to some extent by using OLE automation to construct a pivot table and deliver something partially digested. The API for pivot tables in Excel is a bit hairy, but this is certainly quite feasible. I have written VBA code that programmatically creates pivot tables in the past, so I can say from personal experience that this is feasible to do.
The problem becomes a bit more complex if you want to compute and report distributions of (say) response times, as you have to construct the displays. You can programmatically construct pivot charts if necessary, but automating report construction through Excel in this way will be a fair bit of work.
You might get some mileage from R (www.r-project.org), as you can construct a framework that lets you import data and generate bespoke reports with a bit of R code. This is not an end-user tool, but your client base sounds like they want canned reports anyway.
