We have 1 million datasets and each dataset is around 180 MB, so the total size of our data is around 180 TB. Each dataset is a plain DEL file with only three columns; the first two columns are the row key and the last one is the value of the row. For example, the first column is A, the second is B, and the third is C. The value of A is the dataset number, so A is fixed within one dataset and ranges from 1 to 1 million. B is the position number and can range from 1 to 3 million.
What we are planning to do is this: given any set of non-overlapping ranges of B, such as 1-1000, 10000-13000, 16030-17000, and so on, we calculate the sum of the values of each dataset over all these ranges and return the top 200 dataset numbers (A) within seconds.
Does anyone with big data expertise have any idea how many servers we will need to handle this case? My boss believes 10 servers (16 cores each) can do it with a budget of $50,000. Do you think that is feasible?
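To make the workload concrete, here is a minimal single-machine sketch in Python of the computation being asked about (the in-memory layout and function names are illustrative assumptions, not part of the question): build a prefix-sum array over B for each dataset once, answer any set of ranges in constant time per range, and keep the top 200 datasets with a heap.

```python
import heapq

def prefix_sums(values_by_b, max_b=3_000_000):
    """For one dataset: values_by_b maps position B -> value C."""
    prefix = [0] * (max_b + 1)
    for b in range(1, max_b + 1):
        prefix[b] = prefix[b - 1] + values_by_b.get(b, 0)
    return prefix

def range_sum(prefix, ranges):
    # Each range is an inclusive (lo, hi) pair of B positions.
    return sum(prefix[hi] - prefix[lo - 1] for lo, hi in ranges)

def top_200(datasets, ranges):
    """datasets: iterable of (dataset_id, prefix_array) pairs."""
    scores = ((range_sum(prefix, ranges), ds_id) for ds_id, prefix in datasets)
    return heapq.nlargest(200, scores)
```

One prefix array per dataset costs roughly 24 MB if stored as packed 8-byte integers (3 million positions), so the sizing question largely comes down to how much of this precomputed state the cluster can hold or stream per query.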
I think that services such as Microsoft Azure can be your friend in this case, and that your budget will go further using "pay as you use" services. You can decide how many servers / instances you would like to use to crunch your data.
One slight problem might be the way your data is currently formatted. I would certainly look at Azure Table Storage and first work on getting your data into a service like that. Once that is complete, you have a more queryable and reliable data store. From there you can use your language of choice to interact with that data. Using Table Storage, you can create partition keys.
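As a hedged illustration of that partition-key idea (the table name, connection string, and use of the azure-data-tables Python client are assumptions added here, not part of the answer), each row could use the dataset number A as the PartitionKey and a zero-padded position B as the RowKey, so one dataset lives in one partition and range scans stay inside it:

```python
# Assumes: pip install azure-data-tables; the connection string is a placeholder.
from azure.data.tables import TableClient

conn_str = "<storage-account-connection-string>"
table = TableClient.from_connection_string(conn_str, table_name="datasets")

def upload_row(a, b, c):
    # PartitionKey = dataset number A, RowKey = zero-padded position B.
    table.create_entity({
        "PartitionKey": str(a),
        "RowKey": f"{b:07d}",
        "Value": c,
    })

def rows_in_range(a, lo, hi):
    # Range scans within one partition use PartitionKey and RowKey comparisons.
    flt = (f"PartitionKey eq '{a}' and RowKey ge '{lo:07d}' "
           f"and RowKey le '{hi:07d}'")
    return table.query_entities(flt)
```

Zero-padding the RowKey matters because RowKeys compare as strings, so '10' would otherwise sort before '9'.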
Once you have decided on the partitions you would like to use, you can create a service that is supplied a partition, or more likely a partition range, to process. You will be able to adjust the number and size of your instances and the hardware that drives them. With something like this in place, you can determine on average how long it takes one instance to process x records; perhaps write some logs capturing the performance.
Once you have your logs, it will be simple to determine with reasonable accuracy how long the whole process would take. You can then start adding more instances to your service and work through the data at a faster pace.
Table Storage was also designed to work with big datasets, so going through the documentation you will find many key features that you could use.
There are honestly many ways in which this problem could be solved; this is simply one option that I have used in the past, and it worked for me at the time.
If this turns out to be a viable option for you, make sure to place your data and services in the same data centre. Since I assume your files have some form of sequence, you could also persist placeholders storing your sum values for future use; should your data grow, you could simply add the new data and run your services again to update the system.
I would not go on this journey without making sure that you can persist your sum values in some way or another; otherwise, should you need those values again in the future, you will have to start from the beginning.
I managed to find one quick write-up about the services mentioned above working with big data; perhaps it can help you further: http://www.troyhunt.com/2013/12/working-with-154-million-records-on.html
In each of the following examples, we need to choose the best data structure(s). The options are: Array, Linked List, Stack, Queue, Tree, Graph, Set, Hash Table. This is not homework; however, I am really curious about data structures and would like answers to these questions so that I can understand how each structure works.
You have to store social network “feeds”. You do not know the size, and things may need to be dynamically added.
You need to store undo/redo operations in a word processor.
You need to evaluate an expression (i.e., parse).
You need to store the friendship information on a social networking site. I.e., who is friends with who.
You need to store an image (1000 by 1000 pixels) as a bitmap.
To implement a printer spooler so that jobs can be printed in the order of their arrival.
To implement back functionality in an internet browser.
To store the possible moves in a chess game.
To store a set of fixed key words which are referenced very frequently.
To store the customer order information in a drive-in burger place. (Customers keep on coming and they have to get their correct food at the payment/food collection window.)
To store the genealogy information of biological species.
Hash table (uniquely identifies each feed while allowing additional feeds to be added (assuming dynamic resizing))
Linked List (doubly-linked: from one node, you can go backwards/forwards one by one)
Tree (integral to compilers/automata theory; rules determine when to branch and how many branches to have. Look up parse trees.)
Graph (each person is a point, and connections/friendships are an edge)
Array (2-dimensional, 1000x1000, storing color values)
Queue (like a queue/line of people waiting to get through a checkpoint)
Stack (you can add to the stack with each site visited, and pop off as necessary to go back, as long as you don't care about going forward. If you care about going forward, this is the same scenario as the word processor, so a linked list.)
Tree (can follow any game move by move, down from the root to the leaf. Note that this tree is HUGE)
Hash table (if you want to use the keywords as keys and get all things related to them, I would suggest a hash table with linked lists as the keys' corresponding values. I might be misunderstanding this scenario; the description confuses me a little as to how the keywords are intended to be used.)
Queue or Hash table (if this is a drive-thru, assuming people aren't cutting in front of one another, it's like the printer question. If customers are placing orders ahead of time and can arrive in any order, a hash table would be much better, with an order number or customer name as the key and the order details as the value.)
Tree (look up phylogenic tree)
If you would like to know more about how each data structure works, here is one of many helpful sites that discusses them in detail.
You have to store social network “feeds”. You do not know the size, and things may need to be dynamically added. – linked list or hash table
You need to store undo/redo operations in a word processor. – stack (use 2 stacks to include the redo operation; see the sketch after this list)
You need to evaluate an expression (i.e., parse). – stack or tree
You need to store the friendship information on a social networking site. I.e., who is friends with who. – graph
You need to store an image (1000 by 1000 pixels) as a bitmap. – ArrayList (Java) or 2D array
To implement a printer spooler so that jobs can be printed in the order of their arrival. – queue
To implement back functionality in an internet browser. – stack
To store the possible moves in a chess game. – tree
To store a set of fixed keywords which are referenced very frequently. – dynamic programming or hash table
To store the customer order information in a drive-in burger place. (Customers keep on coming and they have to get their correct food at the payment/food collection window.) – queue / min-priority queue (if the order is simple and takes less time)
To store the genealogy information of biological species. – tree
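As a minimal sketch of the two-stack undo/redo idea mentioned in both answers (the class and method names are illustrative, not from either answer):

```python
class UndoRedo:
    """Two stacks: one for operations that can be undone, one for redo."""

    def __init__(self):
        self._undo = []  # most recent operation on top
        self._redo = []

    def do(self, operation):
        self._undo.append(operation)
        self._redo.clear()  # a new edit invalidates the redo history

    def undo(self):
        if self._undo:
            op = self._undo.pop()
            self._redo.append(op)
            return op  # the caller reverses this operation

    def redo(self):
        if self._redo:
            op = self._redo.pop()
            self._undo.append(op)
            return op  # the caller re-applies this operation
```

The browser back button works the same way: the back stack plays the role of undo, and if forward navigation matters, a second stack plays the role of redo.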
I look after a system which uploads flat files generated by ABAP. We have a large file (500,000 records) generated from the HR module in SAP every day, which produces a record for every person for the next year. A person gets a record for a given day if they are rostered on that day or have planned leave on that day.
This job takes over 8 hours to run and it is starting to become time critical. I am not an ABAP programmer, but I was concerned when discussing this with the programmers, as they kept mentioning 'loops'.
Looking at the source, it's just a bunch of single-row selects inside nested loop after nested loop. Not only that, it has loads of SELECT *.
I suggested to the programmers that they use SQL more heavily, but they insist the SAP-approved way is to use loops instead of SQL and to use the provided SAP functions (e.g. to look up the work schedule rule), and that using SQL would be slower.
Being a database programmer I never use loops (cursors) because they are far slower than SQL, and cursors are usually a giveaway that a procedural programmer has been let loose on the database.
I just can't believe that changing an existing program to use SQL more heavily than loops will slow it down. Does anyone have any insight? I can provide more info if needed.
Looking at Google, I'm guessing I'll get people from both sides saying their approach is better.
I've read the question and I stopped when I read this:
Looking at the source, it's just a bunch of single-row selects inside nested loop after nested loop. Not only that, it has loads of SELECT *.
Without knowing more about the issue, this looks like overkill, because every iteration of the loop executes a call to the database. Maybe it was done this way because the selected dataset is too big; however, it is possible to load chunks of data, process them, and repeat, or to make one big JOIN and operate over that data. This is a little tricky, but trust me, it does the job.
In SAP you must use these kinds of techniques when situations like this occur. Nothing is more efficient than handling datasets in memory; for this I recommend sorted and/or hashed internal tables and BINARY SEARCH.
On the other hand, using a JOIN does not necessarily improve performance; it depends on knowledge and use of the indexes and foreign keys in the tables. For example, if you join to a table just to get a description, I think it is better to load that data into an internal table and get the description from it with a BINARY SEARCH.
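The answer is about ABAP internal tables, but the underlying pattern is language-agnostic. Here is a rough Python/sqlite3 sketch of the same idea (the database, table, and column names are made up for illustration), contrasting a SELECT inside a loop with one bulk read into an in-memory map:

```python
import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical database

def slow_way(person_ids):
    # Anti-pattern: one database round trip per person.
    out = {}
    for pid in person_ids:
        row = conn.execute(
            "SELECT schedule_rule FROM work_schedules WHERE person_id = ?",
            (pid,),
        ).fetchone()
        out[pid] = row[0] if row else None
    return out

def fast_way(person_ids):
    # One bulk read, then constant-time lookups in memory; the same idea as
    # filling a sorted or hashed internal table in ABAP and reading from it
    # instead of issuing a SELECT on every loop pass.
    rules = dict(
        conn.execute("SELECT person_id, schedule_rule FROM work_schedules")
    )
    return {pid: rules.get(pid) for pid in person_ids}
```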
I can't tell you an exact formula; it depends on the case. Most of the time you have to tweak the code, debug and test, and make use of transactions ST05 and SE30 to check performance, then repeat the process. Experience with these issues in SAP gives you a clear view of these patterns.
My best advice is to make a copy of that program and correct it according to your experience. The code you describe can definitely be improved. What can you lose?
Hope it helps.
Sounds like the import as it stands is looping over single records and importing them into a DB one at a time. It's highly likely that there's a lot of redundancy there. It's a pattern I've seen many times and the general solution we've adopted is to import data in batches...
A SQL Server stored procedure can accept 'table' typed parameters, which on the client/C# side of the database connection are simple lists of some data structure corresponding to the table structure.
A stored procedure can then receive and process multiple rows of your csv file in one call, therefore any joins you need to do are being done on sets of input data which is how relational databases are designed to be used. This is especially beneficial if you're joining out to commonly used data or have lots of foreign keys (which are essentially invoking a join in order to validate the keys you're trying to insert).
We've found that the SQL Server CPU and IO load for a given amount of import data is much reduced by using this approach. It does however require consultation with DBAs and some tuning of indexes to get it to work well.
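The answer above describes SQL Server table-valued parameters called from C#; purely as an illustration of the batching idea (not of table-valued parameters themselves), here is a Python/pyodbc sketch with a placeholder connection string and a made-up staging table:

```python
import pyodbc

# The driver, server, and credentials below are placeholders, not real settings.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=hr;UID=user;PWD=secret")
cursor = conn.cursor()
cursor.fast_executemany = True  # send parameters in batches, not row at a time

def import_batch(rows):
    # rows: list of (person_id, work_date, schedule_rule) tuples from the flat file.
    cursor.executemany(
        "INSERT INTO roster_staging (person_id, work_date, schedule_rule) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    # Joins and validation can then run once, set-based, against roster_staging
    # (for example inside a stored procedure) instead of once per record.
```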
You are correct.
Without knowing the code, in most cases it is much faster to use views or joins instead of nested loops. (There are exceptions, but they are very rare.)
You can define views in SE11 or SE80, and they usually heavily reduce the communication overhead between the ABAP server and the database server.
Often there are ready-made views from SAP for common cases.
Edit:
You can check where your performance goes: http://scn.sap.com/community/abap/testing-and-troubleshooting/blog/2007/11/13/the-abap-runtime-trace-se30--quick-and-easy
Badly written parts that are rarely used don't matter.
With the statistics you know where it hurts and where your optimization effort pays off.
I am evaluating a number of different NoSQL databases to store time-series JSON data. ElasticSearch has been very interesting due to the query engine; I just don't know how well it is suited to storing time-series data.
The data is composed of various metrics and stats collected at various intervals from devices. Each piece of data is a JSON object. I expect to collect around 12 GB/day, but only need to keep the data in ES for 180 days.
Would ElasticSearch be a good fit for this data versus MongoDB or HBase?
You can read up on an ElasticSearch time-series use-case example here.
But I think columnar databases are a better fit for your requirements.
My understanding is that ElasticSearch works best when your queries return a small subset of results, and it caches such parameters to be used later. If the same parameters are used in queries again, it can combine these cached results, hence returning results really fast. But with time-series data you generally need to aggregate data, which means traversing a lot of rows and columns together. Such behaviour is quite structured and easy to model, in which case there does not seem to be a reason why ElasticSearch should perform better than columnar databases. On the other hand, it may provide ease of use, less tuning, etc., all of which may make it preferable.
Columnar databases generally provide a more efficient data structure for time-series data. If your query structures are known well in advance, then you can use Cassandra. Beware that if your queries do not use the primary key, Cassandra will not be performant. You may need to create different tables with the same data for different queries, as its read speed depends on the way it writes to disk. You need to learn its intricacies; a time-series example is here.
Another columnar database that you can try is the columnar extension provided for PostgreSQL. Considering that your maximum database size will be about 180 days × 12 GB/day ≈ 2.16 TB, this method should work perfectly, and may actually be your best option. You can also expect significant size compression of about 3x. You can learn more about it here.
Using time-based indices (for instance one index per day), together with the index-template feature and an alias to query all indices at once, this could be a good match; a small sketch of the daily-index idea follows below. Still, there are many factors you have to take into account, such as:
- Type of queries
- Structure of the documents and the query requirements over that structure
- Amount of reads versus writes
- Availability, backups, monitoring
- etc.
This is not an easy question to answer with a yes or no; I am afraid you have to do more research yourself before you can really say it is the best tool for the job.
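As a small sketch of the daily-index-plus-alias idea (this assumes the elasticsearch-py 8.x client, a local cluster, and made-up index and field names; the index template itself is not shown):

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def daily_index(ts: datetime) -> str:
    # One index per day, e.g. "metrics-2014.06.17"; an index template matching
    # "metrics-*" would apply the mappings and settings to each new index.
    return "metrics-" + ts.strftime("%Y.%m.%d")

now = datetime.now(timezone.utc)
doc = {"device": "dev-42", "cpu": 0.73, "@timestamp": now.isoformat()}
es.index(index=daily_index(now), document=doc)

# One alias covering every daily index, so queries hit them all at once.
es.indices.put_alias(index="metrics-*", name="metrics-all")
es.search(index="metrics-all",
          query={"range": {"@timestamp": {"gte": "now-7d"}}})
```

Retention then becomes cheap: keeping 180 days of data means deleting whole indices older than 180 days rather than deleting individual documents.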
I found the below in a question bank and I'm looking for some help with it.
For each of the following situations, select the best data structure and justify your selection.
The data structures should be selected from the following possibilities: unordered list, ordered array, heap, hash table, binary search tree.
(a) (4 points) Your structure needs to store a potentially very large number of records, with the data being added as it arrives. You need to be able to retrieve a record by its primary key, and these keys are random with respect to the order in which the data arrives. Records also may be deleted at random times, and all modifications to the data need to be completed just after they are submitted by the users. You have no idea how large the dataset could be, but the data structure implementation needs to be ready in a few weeks. While you are designing the program, the actual programming is going to be done by a co-op student.
For the answer, I thought a BST would be the best choice.
Since the size is not clear, a hash table is not a good choice.
Since deletion is required, a heap is not acceptable either.
Is my reasoning correct?
(b) (4 points) You are managing data for the inventory of a large warehouse store. New items (with new product keys) are added and deleted from the inventory system every week, but this is done while stores are closed for 12 consecutive hours.
Quantities of items are changed frequently: incremented as they are stocked, and decremented as they are sold. Stocking and selling items requires the item to be retrieved from the system using its product key.
It is also important that the system be robust, well-tested, and have predictable behaviour. Delays in retrieving an item are not acceptable, since it could cause problems for the sales staff. The system will potentially be used for a long time, though largely it is only the front end that is likely to be modified.
For this part I thought heapsort, but I have no idea how to justify my answer.
Could you please help me?
(a) needs fast insertion and deletion, and you need retrieval based on a key. Thus I'd go with a hash table or a binary search tree. However, since the size is not known in advance and there's that deadline constraint, I'd say the binary search tree is the best alternative.
(b) You have enough time to reorganize the data after insertions and deletions, but retrieval by product key must be fast and predictable. An ordered array with binary search should do the trick.
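A minimal sketch of that part (b) idea (the keys and fields are illustrative): keep the inventory sorted by product key and retrieve with binary search, which is predictable and worst-case logarithmic, and rebuild the arrays during the nightly 12-hour window when products are added or removed.

```python
import bisect

# Parallel, sorted structures: product keys for binary search, and records
# aligned with them by index.
product_keys = [1001, 1005, 1010, 1042]
records = [{"qty": 12}, {"qty": 3}, {"qty": 40}, {"qty": 7}]

def find(product_key):
    i = bisect.bisect_left(product_keys, product_key)
    if i < len(product_keys) and product_keys[i] == product_key:
        return records[i]
    return None  # not stocked

def sell(product_key, n=1):
    rec = find(product_key)
    if rec is not None:
        rec["qty"] -= n  # quantities change in place; keys never move
```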
As of now I have done effort estimation based on experience and recently using function points.
I am now exploring Use Case Points (UCP) and read this article: http://www.codeproject.com/KB/architecture/usecasep.aspx. I then checked various other articles on UCP, but I am not able to find out exactly how it works and whether it is correct.
For example, I have a login functionality where the user provides a user id and password, and I check against a table in the database to allow or deny login. I define a User actor and Login as a use case.
As per UCP, I categorize the Login use case as Simple and the GUI interface as Complex. As per the UCP factor table I get 5 and 3, so the total is 15. After applying the technical factor and environmental factor adjustments it becomes 7. If I take the productivity factor as 20, then I get 140 hours. But I know it will take at most 30 hours, including documentation and testing effort.
Am I doing something wrong in defining the use case here? UCP says that if the interface is a GUI then it is complex, but here the GUI is easy enough, so should I downgrade that factor? Also, the factor for Simple is 5; should I define another level such as Very Simple? But then am I not complicating the matter?
Ironically, the prototypical two-box logon form is much more complicated than a two-box CRUD form, because the logon form needs to be secure and the CRUD form only needs to save to a database table (and read, update and delete).
A logon form needs to decide where to redirect to, how to cryptographically secure an authentication token, whether and how to cache roles, and whether and how to deal with dictionary attacks.
I don't know what this converts to in UCP points; I just know that the logon screen in my app has consumed much more time than a form with a similar number of buttons and boxes.
Last time I was encouraged to count function points, it was a farce, because no one had the time to set up a "function points court" to get rulings on hard-to-measure things, especially ones that didn't fall neatly into the model that function point counting assumes.
Here's an article talking about Use Case Points, via Normalized Use Case. I think the one factor overlooked in your approach is productivity, which is supposed to be based on past projects. 20 seems to be the average; HOWEVER, if you are VERY productive (there is a known 10-to-1 ratio between moderate and good programmers), the productivity could be 5, bringing the UCP estimate close to what you think it should be. I would suggest looking at past projects, calculating the UCP, getting the total hours, and determining what your productivity really is. Productivity, being a key factor, needs to be calculated for individuals and teams to be used effectively in the estimation.
Part of the issue may be how you're counting transactions. According to the author of UCP, transactions are a "round trip" from the user to the system back to the user; a transaction is finished when the system awaits a new input stimulus. In this case, unless the system is responding...a logon is probably just 1 transaction unless there are several round trips to and from the system.
Check out this link for more info...
http://www.ibm.com/developerworks/rational/library/edge/09/mar09/collaris_dekker/index.html
First, note that in a previous work Ribu stated that the effort for 1 UCP ranges from 15 to 30 hours (see http://thoughtoogle-en.blogspot.com/2011/08/software-quotation.html for some details).
Second, it is clear that this kind of estimation, like Function Points, is more accurate when there are many use cases, not just one. You are not considering, for example, project startup, project management, creation of environments, etc., which are all packed into those 20 hours.
I think there is something wrong in your computation: "I get 5 and 3 so the total is 15". UAW and UUCW must be added, not multiplied.
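To make that correction concrete, here is an illustrative recalculation; the TCF and ECF values are assumptions picked only to show the shape of the formula, not measured factors:

```python
# Unadjusted Use Case Points: the actor weight and use-case weight are ADDED.
uaw = 3      # one complex (GUI) actor, per the question's factor table
uucw = 5     # one simple use case
uucp = uaw + uucw                 # = 8, not 5 * 3 = 15

# Adjustment factors: hypothetical values for illustration only.
tcf = 0.8    # technical complexity factor (assumed)
ecf = 0.7    # environmental complexity factor (assumed)
ucp = uucp * tcf * ecf            # about 4.5 adjusted UCP

productivity = 20                 # hours per UCP, as in the question
print(ucp * productivity)         # about 90 hours, already well below 140
```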