I have a large dataset of 1.2 crore (12 million) rows, and sorting it with the usual SAS PROC SORT takes around 30 minutes. Is there a faster algorithm or option in SAS?
Without more details on how you use the sorted dataset and what fields and lengths make it up, here are a few things you can try:
Use the TAGSORT option in PROC SORT. It sorts only the BY variables plus the observation numbers (the "tags") and retrieves the full records afterwards, which helps when the dataset is wide.
Create an index instead of sorting. If you are just going to do some BY-group processing, an index works just as well and will be faster.
If you are sorting in order to do a merge, consider using either SQL joins (which may not need to sort as much data) or hash tables (which can be used to merge and don't require sorted data).
Compress the output dataset (if you aren't already) and/or the input dataset. This will reduce the I/O.
But to answer your question: there is no faster sort procedure in SAS than PROC SORT. According to the PDF below, "The SAS® sort routine is of order O(N log N), which is as fast as a comparison sort can be."
If you are working at a site that has SyncSort licensed, it can speed sorting up, but where it is licensed it is usually enabled by default.
http://www2.sas.com/proceedings/sugi26/p121-26.pdf
If the reason you need to sort your data set is to merge it with another data set, you might look at doing your merge/lookup using a HASH object. Then you might not need to sort it.
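Conceptually, a hash-object merge works like the sketch below (plain Python standing in for a SAS data step; the table contents and variable names are invented). The smaller table is loaded into an in-memory hash keyed on the join variable, and the big table simply streams past it; neither input needs to be sorted.

    # Small lookup table: id -> name (plays the role of the hash object's dataset)
    lookup = {"A001": "Widget", "A002": "Gadget"}

    # Large table streamed row by row: (id, price)
    big_table = [("A002", 19.99), ("A003", 5.00), ("A001", 7.50)]

    # Inner-join semantics: keep a row only if its id is found in the hash
    merged = [(row_id, price, lookup[row_id])
              for row_id, price in big_table
              if row_id in lookup]

    print(merged)  # [('A002', 19.99, 'Gadget'), ('A001', 7.5, 'Widget')]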
I have two Spark dataframes, DFa and DFb, with the same schema: ('country', 'id', 'price', 'name').
DFa has around 610 million rows; DFb has around 3,000 million (3 billion) rows.
Now I want to find all rows from DFa and DFb that have the same id, where id looks like "A6195A55-ACB4-48DD-9E57-5EAF6A056C80".
It's a SQL inner join, but when I run it through Spark SQL, a task gets killed because its container uses too much memory and hits a Java heap space error. My cluster has limited resources, so tuning the YARN and Spark configuration is not feasible.
Is there any other solution for this? A non-Spark solution is also acceptable if the runtime is reasonable.
More generally, can anyone suggest algorithms and solutions for finding the common elements of two very large datasets?
First compute 64-bit hashes of your ids. Comparisons are a lot faster on the hashes than on the string ids.
My basic idea is:
Build a hash table from DFa.
As you compute the hashes for DFb, do a lookup in the table. If there's nothing there, drop the entry (no match). If you get a hit, compare the actual IDs to make sure it isn't a false positive.
The complexity is O(N). Not knowing how many overlaps you expect, this is the best you can do, since in the worst case everything matches and you have to output it all.
The naive implementation would use about 6 GB of RAM for the table (assuming 80% occupancy and a flat hash table), but you can do better.
Since we already have the hash, we only need to know whether an entry exists, so a single bit per slot is enough to mark that. This reduces memory usage by a lot (64x less per entry, though you need to lower the occupancy). However, this is not a common data structure, so you'd have to implement it yourself.
But there's something even better, something even more compact: a Bloom filter. It introduces some additional false positives, but we had to double-check against the actual IDs anyway because we didn't fully trust the hash, so it's not a big downside. The best part is that you should be able to find existing libraries for it.
So everything together it looks like this:
Compute hashes from DFa and build a Bloom filter.
Compute hashes from DFb and check them against the Bloom filter. If you get a match, look up the ID in DFa to make sure it's a real match, and add it to the result.
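A minimal sketch of that pipeline in plain Python, with a hand-rolled Bloom filter so it doesn't depend on any particular library. df_a_ids and df_b_ids are stand-in (re-iterable) sequences of id strings, and the sizing constants (about 10 bits and 7 hash positions per id, roughly a 1% false-positive rate and ~760 MB of bits for 610 million ids) are illustrative:

    import hashlib

    class BloomFilter:
        # k bit positions per item, carved out of a single SHA-256 digest
        def __init__(self, size_bits, num_hashes):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8 + 1)

        def _positions(self, item):
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.k):
                yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos >> 3] |= 1 << (pos & 7)

        def might_contain(self, item):
            return all(self.bits[pos >> 3] & (1 << (pos & 7))
                       for pos in self._positions(item))

    # Step 1: build the filter from DFa's ids.
    bloom = BloomFilter(size_bits=10 * 610_000_000, num_hashes=7)
    for rid in df_a_ids:
        bloom.add(rid)

    # Step 2: stream DFb's ids; the filter cheaply rejects almost all non-matches.
    candidates = {rid for rid in df_b_ids if bloom.might_contain(rid)}

    # Step 3: one more pass over DFa removes the filter's false positives.
    matches = [rid for rid in df_a_ids if rid in candidates]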
This is a typical use case in any big-data environment. You can use a map-side join, where the smaller table is cached and broadcast to all the executors.
You can read more about broadcast joins here:
Broadcast-Joins
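In PySpark that looks roughly like the sketch below. Neither full dataframe in the question is really small enough to broadcast as-is; the pattern applies when one side, or a slim projection of it such as the distinct ids, fits in each executor's memory (dfa and dfb are stand-ins for DFa and DFb):

    from pyspark.sql.functions import broadcast

    # Broadcast only a slim projection (the distinct ids) of the smaller side;
    # every executor gets a full copy, so the big side needs no shuffle to join.
    small_ids = dfa.select("id").distinct()

    result = dfb.join(broadcast(small_ids), on="id", how="inner")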
In which cases can using a hash table improve performance, and when can it not? And in which cases is a hash table not applicable?
In which cases can using a hash table improve performance, and when can it not?
If you have reason to care, implement using hash tables and whatever else you're considering, put your actual data through, and measure which performs better.
That said, if the hash table has the operations you need (i.e. you're not expecting to iterate it in sorted order or compare it quickly to another hash table) and has millions or more (billions, trillions...) of elements, it will probably be your best choice. But a lot depends on the hash table implementation (especially the choice of closed vs. open hashing), object size, hash function quality and calculation cost, comparison cost, and the oddities of your computer's memory performance at different cache levels. In short: too many things to make even an educated guess a better choice than measuring, when it matters.
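For instance, a throwaway benchmark along these lines (plain Python, with synthetic keys standing in for your real data) answers the question for your workload far better than theory:

    import bisect
    import random
    import timeit

    keys = random.sample(range(10**8), 10**6)          # a million distinct keys
    probes = [random.randrange(10**8) for _ in range(10**4)]

    table = set(keys)                                  # hash table
    sorted_keys = sorted(keys)                         # array for binary search

    def probe_hash():
        return sum(1 for p in probes if p in table)

    def probe_binary_search():
        hits = 0
        for p in probes:
            i = bisect.bisect_left(sorted_keys, p)
            hits += i < len(sorted_keys) and sorted_keys[i] == p
        return hits

    print("hash table:   ", timeit.timeit(probe_hash, number=100))
    print("binary search:", timeit.timeit(probe_binary_search, number=100))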
And in which cases is a hash table not applicable?
Mainly when:
The input can't be hashed (e.g. you're given binary blobs and don't know which bits in there are significant, but you do have an int cmp(const T&, const T&) function you could use for a std::map), or
the available/possible hash functions are very collision prone, or
you want to avoid worst-case performance hits for:
handling lots of hash-colliding elements (perhaps "engineered" by someone trying to crash or slow down your software)
resizing the hash table: unless presized to be large enough (which can be wasteful and slow when excessive memory is used), most implementations will periodically outgrow the array they're using, allocate a bigger one, and copy the contents across. This can make the specific insertions that trigger the rehashing much slower than the normal O(1) behaviour, even though the average is still O(1); if you need more consistent behaviour in all cases, something like a balanced binary tree may serve better
your access patterns are quite specialised (e.g. frequently operating on elements with keys that are "nearby" in some specific sort order), such that cache efficiency is better for other storage models that keep them nearby in memory (e.g. bucket sorted elements), even if you're not exactly relying on the sort order for e.g. iteration
We use hash tables to get O(1) access time. Imagine a dictionary. When you are looking for a word, e.g. "happy", you jump straight to 'H'. Here the hash function is determined by the starting letter. Then you look for "happy" within the H bucket (actually the H bucket, then the HA bucket, then the HAP bucket, and so on).
It doesn't make sense to use Hash Tables when your data is ordered or needs ordering like sorted numbers. (Alphabets are ordered ABCD....XYZ but it wouldn't matter if you switched A and Z, provided you know it is switched in your dictionary.)
I have x (millions of) positive integers, whose values can be as large as allowed (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup-intensive program?
So far I have thought of using an AVL binary tree or a hash table, where the integer is the key to the mapped data (a name). However, I am not so sure whether I can handle such large keys, and in such large quantity, with a hash table (wouldn't that create a load factor above 0.8, in addition to being prone to collisions?).
Could I get some advice on which data structure might be suitable for my situation?
The choice of structure depends heavily on how much memory you have available. I'm assuming based on the description that you need lookup but not to loop over them, find nearest, or other similar operations.
Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!
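A minimal sketch of that layout in Python (in practice you'd write this in a systems language to actually get the cache benefits; the class name and bucket count are invented for illustration):

    class BucketedHashTable:
        # Collisions land in per-bucket parallel arrays of keys and values;
        # lookups do a linear scan within a single bucket.
        def __init__(self, num_buckets=1 << 16):
            self.num_buckets = num_buckets
            self.bucket_keys = [[] for _ in range(num_buckets)]
            self.bucket_vals = [[] for _ in range(num_buckets)]

        def put(self, key, value):
            b = hash(key) % self.num_buckets
            keys = self.bucket_keys[b]
            for i, k in enumerate(keys):          # overwrite if already present
                if k == key:
                    self.bucket_vals[b][i] = value
                    return
            keys.append(key)
            self.bucket_vals[b].append(value)

        def get(self, key, default=None):
            b = hash(key) % self.num_buckets
            for i, k in enumerate(self.bucket_keys[b]):   # linear scan in bucket
                if k == key:
                    return self.bucket_vals[b][i]
            return default

    table = BucketedHashTable()
    table.put(2_147_000_123, "alice")
    print(table.get(2_147_000_123))   # alice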
AVL trees are nice for data sets that are read-intensive but not read-only AND require ordered enumeration, find-nearest and similar operations, but they're an annoying amount of work to implement correctly. You may get better performance with a B-tree because of CPU cache behavior, though, especially a cache-oblivious B-tree algorithm.
Have you looked into B-trees? The efficiency runs between log_m(n) and log_(m/2)(n), so if you choose m to be around 8-10 you should be able to keep your search depth below 10 (e.g. with m = 10 and n = 10^7, the depth is about log_10(10^7) = 7).
A bit vector, with the bit at each index set if the number is present. You can tweak it to store the number of occurrences of each number instead. There is a nice column about bit vectors in Bentley's Programming Pearls.
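A bit-vector sketch in Python, sized for positive signed 32-bit integers (2^31 bits = 256 MB, allocated up front regardless of how many numbers you actually store):

    MAX_VALUE = 2**31                    # covers positive signed 32-bit ints
    bits = bytearray(MAX_VALUE // 8)     # one bit per possible value, 256 MB

    def add(n):
        bits[n >> 3] |= 1 << (n & 7)

    def contains(n):
        return bool(bits[n >> 3] & (1 << (n & 7)))

    add(2_147_483_647)
    print(contains(2_147_483_647))   # True
    print(contains(42))              # False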
If memory isn't an issue, a hash map is probably your best bet. Lookups in a hash map are O(1) on average, meaning that as you scale up the number of items, the time it takes to find a value stays the same.
A map where the key is the int, and the value is the name.
Do try hash tables first. There are some variants that can tolerate being very dense without significant slowdown (like Brent's variation).
If you only need to store the 32-bit integers and not any associated record, use a set and not a map, like hash_set in most C++ libraries. It would use only 4-byte records plus some constant overhead and a little slack to avoid being 100% full. In the worst case, to handle 'millions' of numbers you'd need a few tens of megabytes. Big, but nothing unmanageable.
If you need it to be much tighter, just store them sorted in a plain array and use binary search to fetch them. It will be O(log n) instead of O(1), but for 'millions' of records it's still just twentysomething steps to get any one of them. In C you have bsearch(), which is as fast as it can get.
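The same approach in Python uses the bisect module (the analogue of C's bsearch) on a list that is sorted once up front:

    import bisect

    ids = sorted([907, 15, 2_147_483_647, 3])   # millions of ints in practice

    def lookup(n):
        i = bisect.bisect_left(ids, n)
        return i < len(ids) and ids[i] == n

    print(lookup(907))   # True
    print(lookup(908))   # False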
Edit: I just saw that your question mentions some 'mapped data (a name)'. Are those names unique? Do they also have to be in memory? If so, they would definitely dominate the memory requirements. Even so, if the names are typical English words, most would be 10 bytes or less, keeping the total size in the tens of megabytes; maybe up to a hundred megs, still very manageable.
Sometimes interviewers ask how to sort a million/billion 32-bit integers (e.g. here and here). I guess they expect the candidates to compare an O(N log N) comparison sort with radix sort. For a million integers the O(N log N) sort is probably better, but for a billion they are probably about the same. Does that make sense?
If you get a question like this, they are not looking for the answer. What they are trying to do is see how you think through a problem. Do you jump right in, or do you ask questions about the project requirements?
One question you had better ask is, "How optimal a solution does the problem require?" Maybe a bubble sort of records stored in a file is good enough, but you have to ask. Ask what happens if the input changes to 64-bit numbers: should the sort process be easy to update? Ask how long the programmer has to develop the program.
Those types of questions show me that the candidate is wise enough to see there is more to the problem than just sorting numbers.
I expect they're looking for you to expand on the difference between internal sorting and external sorting. Apparently people don't read Knuth nowadays.
As aaaa bbbb said, it depends on the situation, and you would ask questions about the project requirements. For example, if they want to sort the ages of a company's employees, a counting sort works and the data fits in memory. But when the data are totally random, you would probably use an external sort: divide the source data into range-partitioned files (File1 holds 0-1M, File2 holds 1M+1 to 2M, etc.), sort every single file, and lastly merge them into a new file.
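A minimal sketch of that range-partitioned external sort in Python, one integer per line of text. The file names and 1M bucket width are made up, and it assumes each bucket fits in memory and the bucket count stays under the OS open-file limit:

    import os

    BUCKET_WIDTH = 1_000_000

    def external_sort(input_path, output_path, tmp_dir="tmp_buckets"):
        os.makedirs(tmp_dir, exist_ok=True)
        handles = {}
        # Pass 1: partition numbers into per-range bucket files.
        with open(input_path) as f:
            for line in f:
                n = int(line)
                b = n // BUCKET_WIDTH
                if b not in handles:
                    handles[b] = open(os.path.join(tmp_dir, f"bucket_{b}.txt"), "w")
                handles[b].write(f"{n}\n")
        for h in handles.values():
            h.close()
        # Pass 2: sort each bucket in memory, then append in bucket order.
        with open(output_path, "w") as out:
            for b in sorted(handles):
                with open(os.path.join(tmp_dir, f"bucket_{b}.txt")) as f:
                    for n in sorted(int(line) for line in f):
                        out.write(f"{n}\n")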
Use a bit map. You need 512 MB (2^32 bits) to represent the whole 32-bit integer range. For every integer in the given array, just set the corresponding bit. Then simply scan your bit map from left to right to get your integer array sorted.
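A sketch of that bitmap sort in Python. max_value is a parameter so the toy call below doesn't allocate the full 512 MB; pass max_value=2**32 for real 32-bit input, and note it assumes the inputs are unique:

    def bitmap_sort(numbers, max_value=2**32):
        bits = bytearray(max_value // 8)        # one bit per possible value
        for n in numbers:
            bits[n >> 3] |= 1 << (n & 7)        # set the corresponding bit
        result = []
        for byte_index, byte in enumerate(bits):
            if byte:                             # skip empty bytes quickly
                for bit in range(8):
                    if byte & (1 << bit):
                        result.append((byte_index << 3) | bit)
        return result

    print(bitmap_sort([5, 1, 4], max_value=8))   # [1, 4, 5]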
It depends on the data structure they're stored in. Radix sort beats N-log-N sort on fairly small problem sizes if the input is in a linked list, because it doesn't need to allocate any scratch memory, and if you can afford to allocate a scratch buffer the size of the input at the beginning of the sort, the same is true for arrays. It's really only the wrong choice (for integer keys) when you have very limited additional storage space and your input is in an array.
I would expect the crossover point to be well below a million regardless.
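For reference, a least-significant-digit radix sort over non-negative 32-bit integers looks like this sketch: four counting-sort passes, one per byte, with a scratch buffer the size of the input:

    def radix_sort(a):
        for shift in (0, 8, 16, 24):             # one pass per byte, LSD first
            counts = [0] * 256
            for n in a:
                counts[(n >> shift) & 0xFF] += 1
            total = 0                            # exclusive prefix sums ->
            for i in range(256):                 # start offset per byte value
                counts[i], total = total, total + counts[i]
            scratch = [0] * len(a)
            for n in a:                          # stable scatter
                b = (n >> shift) & 0xFF
                scratch[counts[b]] = n
                counts[b] += 1
            a = scratch
        return a

    print(radix_sort([70, 5, 2**20, 7]))   # [5, 7, 70, 1048576]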
When I attempt to do something like
SELECT Max(ObjectId) FROM Objects;
I see in the explain plan that this is performed by doing a sort. Now, sorting (which I guess would be of complexity O(n log n)) must be much costlier than just scanning each row and remembering the max value (which could be done in O(n)).
Am I missing something here? Is oracle really performing a sort or does the explain-plan just use the description "sort" for describing a simple scan of all values in the ObjectId column? If oracle really does perform a "real sort", is there a good reason for doing this that I am missing?
Thanks in advance!
Since you have not posted details about your table Objects, we will have to guess. My guess is that you have an index on ObjectId. In that case you will see an INDEX FULL SCAN (MIN/MAX) step in the explain plan, meaning that the data will be retrieved directly from the index. The keys in an index are ordered, so reading the first or last key gives you the MIN/MAX.
This is an O(log n) operation (since it depends upon the depth of the index).
Update:
If you don't have an index on ObjectId you will see a SORT AGGREGATE step in the explain plan. This doesn't mean that the entire set will be sorted. In fact the data will be aggregated as it is read. This will likely involve a single comparison for each row, giving you a total O(n) cost.
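To make the distinction concrete, here is the same idea in plain Python (an analogy only, not Oracle internals): SORT AGGREGATE behaves like a running maximum over a single pass, while an index behaves like an already-sorted array whose last key is the answer.

    rows = [42, 7, 99, 13]            # the unindexed ObjectId column

    # "SORT AGGREGATE": one comparison per row as the data is read; O(n),
    # and nothing ever gets sorted.
    running_max = None
    for v in rows:
        if running_max is None or v > running_max:
            running_max = v

    # "INDEX FULL SCAN (MIN/MAX)": the index keeps keys ordered, so MAX sits
    # at the end; reaching it costs one descent of the B-tree, i.e. O(log n).
    index = sorted(rows)              # stand-in for the B-tree index
    print(index[-1] == running_max == 99)   # True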
Also, on a related note: when Oracle really does have to sort, it uses ordinary comparison sorting, which is O(n log n); the point here is that MAX() never needs a full sort in the first place.