I am trying to sort my directory using the command prompt.
The command below sorts the directory listing and displays it on screen, but it does not change the order in the actual directory:
C:\>dir C:\Users\ap\Desktop\pdf /o:d
I want to sort in the actual directory.
The NTFS file system stores filenames in alphabetical order, which keeps filename lookups fairly fast no matter how many files are in the folder. You CANNOT change the order in which files are stored in a folder.
Here's a brief mathematical explanation of why this is so. Also see https://en.wikipedia.org/wiki/Binary_search_algorithm for more detail.
https://www.quora.com/What-is-the-fastest-algorithm-for-searching-in-ordered-lists-and-unordered-lists
For an ordered list
1. We can go for binary search.
2. An ordered list allows us to probe the middle element and discard half of the remaining list at each step.
3. The time complexity will be O(log n) for n inputs.
4. (Note that the log is base 2.)
For an unordered list
1. In this type of list we have to go for a linear search algorithm.
2. Binary search on an unordered list will not work.
3. The time complexity will be O(n) for n inputs.
To give you an idea
A sorted database of 4 billion records will take a maximum of 32 accesses using binary search. The same database, if unsorted, will take an average of 2 billion and a maximum of 4 billion accesses.
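To make this concrete, here is a minimal Python sketch of both searches (my own illustration; binary_search assumes its input list is already sorted):

def binary_search(sorted_list, key):
    # Each iteration halves the remaining range, so at most
    # about log2(n) comparisons are needed.
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_list[mid] == key:
            return mid
        elif sorted_list[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def linear_search(unordered_list, key):
    # Worst case examines every element: O(n).
    for i, value in enumerate(unordered_list):
        if value == key:
            return i
    return -1

print(binary_search([2, 3, 5, 7, 11, 13], 11))  # 4
print(linear_search([7, 2, 13, 5, 11, 3], 11))  # 4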
Related
The array mentioned in the question is as follows:
[1,1,...,1,1,-1,-1,...,-1,-1]
How can I quickly find the index of the 1 closest to the -1s?
Note: both 1s and -1s are always present, and there are many of each.
For example, for an array like this:
[1,1,1,1,1,-1,-1,-1]
the result should be 4.
The fastest way I can think of is binary search, is there a faster way?
With the current representation of the data, binary search is the fastest way I can think of. Of course, you can cache and reuse the results in constant time, since the answer is always the same.
On the other hand if you change the representation of the array to some simple numbers you can find the next element in constant time. Since the data can always be mapped to a binary value, you can reduce the whole array to 2 numbers. The length of the first partition and the length of the second partition. Or the length of the whole array and the partitioning point. This way you can easily change the length of both partitions in constant time and have access to the next element of the second partition in constant time.
Of course, changing the representation of the array itself is a logarithmic process since you need to find the partitioning point.
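As an illustration of both points, here is a minimal sketch (my own, under the question's guarantee that all 1s precede all -1s and both values are present): a binary search finds the partitioning point once, after which two integers describe the whole array:

def last_one_index(arr):
    # Binary search for the first -1; the predicate "arr[i] == -1"
    # is monotone because all 1s come before all -1s.
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] == -1:
            hi = mid
        else:
            lo = mid + 1
    return lo - 1              # index of the 1 closest to the -1 block

arr = [1, 1, 1, 1, 1, -1, -1, -1]
print(last_one_index(arr))     # 4

# Alternative representation: two integers are enough.
ones = last_one_index(arr) + 1   # length of the 1-partition
total = len(arr)                 # total length
# The first element of the -1 partition now sits at index `ones`,
# available in O(1) without touching the array again.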
By a simple information-theoretic argument, you can't be faster than log(n) using only comparisons: there are n possible outcomes, and you need to collect at least log(n) bits of information to distinguish them.
If you have extra information about the statistical distribution of the values, then maybe you can exploit it. But this is to be discussed on a case-by-case basis.
Introduction: I want to replace about 280'000 images of math formulas on the Encyclopedia of Mathematics by their corresponding TEX code. To do this, I have classified all of these images (or better: their URLs) into a list of 100'000 lists.
Each "sublist" contains strings of urls such that each url in that sublist links to the same image. The list looks something like [["https://www.encyclopediaofmath.org/legacyimages/a/a130/a130010/a1300105.png", "https://www.encyclopediaofmath.org/legacyimages/a/a010/a010080/a01008021.png", ...], ["https://www.encyclopediaofmath.org/legacyimages/w/w130/w130080/w1300801.png", "https://www.encyclopediaofmath.org/legacyimages/w/w130/w130080/w130080211.png"], ...].
For each sublist, I have determined (or am still in the process of determining) the corresponding TEX code for one image of that sublist. Since the images within each sublist are identical, this gives me the TEX code for every image URL in the whole list.
Now I want to replace the images in each article (such as this one) by the known TEX code. This means I have to look up the image URLs of each article in this list of sublists.
My question: Do you know of any better data structures than a list of lists for the above task?
Example code:
dups = [[i, i+1] for i in range(100000)]
for i in range(10000):
    for j in range(100000):
        if i in dups[j]:
            print(f"Found number {i} in {j}-th list")
            break
In the above example, dups is a simplified version of my list of lists (using numbers instead of strings). As you can see, the above program takes some time to finish. I would like to improve dups so that this type of lookup can be done faster.
Remark 1: The above code essentially makes 1 + 2 + 3 + ... + n comparisons if dups has a length of n. This leads to n * (n+1)/2 comparisons. Since n = 100'000 in my case, this is already a lot of comparisons.
Remark 2: An obvious improvement would be to transform each sublist into a Python set and to consider a list of sets. However, most of my sublists contain less than 3 images, so I doubt that this would greatly improve runtime.
Remark 3: Note that I can hardly control the order of the "incoming" images (basically I have to follow the article structure) and that I cannot construct a full order inside the list of lists (since I cannot break the sublists apart). I have thus not figured out a way to implement binary search.
While it might introduce some data redundancy, I would propose using a binary search tree. Your list of lists is a good idea for indexing, but it has one significant issue: the runtime.
Your metric for the tree could simply be an alphabetical comparison of the links (a < z, aa > z, etc.). That gives you binary search at the cost of some redundant data. If we do the math: you have 280,000 images, so the average search in a BST takes about log2(280,000) ≈ 18 steps. That roughly three images share the same TEX code really does not matter given the improvement in speed; just store the code three times. Treat it like a key-value pair: in your BST the key is the link, and the corresponding value is stored with it (which you can take from your list of lists). The value of the pair could also be the index of the sublist the link belongs to. My general suggestion, though, would be to ignore the sublists while searching and use them again once you're done with the lookup.
A tree would look something like this:
                      (link, code/index)
                     /                  \
    (link, code/index)                  (link, code/index)
        /       \                           /        \
      etc.      etc.                      etc.       etc.
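A minimal sketch of that suggestion (node layout and function names are my own, not part of the answer): an unbalanced BST keyed on the URL string, with the TeX code or sublist index as the value:

class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def bst_insert(root, key, value):
    # Standard unbalanced insert; expected depth ~log2(n) for random keys.
    if root is None:
        return Node(key, value)
    if key < root.key:
        root.left = bst_insert(root.left, key, value)
    elif key > root.key:
        root.right = bst_insert(root.right, key, value)
    return root

def bst_search(root, key):
    # Alphabetical comparison of the links drives the descent.
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

With 280,000 keys, the expected depth of a reasonably balanced tree is around log2(280,000) ≈ 18, matching the estimate above.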
If you want to or have to stick with your sublist idea, then my only suggestion is to build a dictionary from your lists; dictionary lookups take O(1) time on average.
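A sketch of that dictionary idea (variable names like url_groups are illustrative): build the mapping once, and each article lookup becomes an average O(1) hash lookup instead of a scan over the sublists:

url_groups = [   # your list of sublists, abbreviated
    ["https://www.encyclopediaofmath.org/legacyimages/a/a130/a130010/a1300105.png",
     "https://www.encyclopediaofmath.org/legacyimages/a/a010/a010080/a01008021.png"],
    ["https://www.encyclopediaofmath.org/legacyimages/w/w130/w130080/w1300801.png"],
]

url_to_group = {}
for group_index, group in enumerate(url_groups):
    for url in group:
        url_to_group[url] = group_index   # or map directly to the TeX code

# Lookup while walking an article:
idx = url_to_group.get("https://www.encyclopediaofmath.org/legacyimages/a/a010/a010080/a01008021.png")
print(idx)   # 0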
Though, if possible, I would implement this in a language with pointers, or in such a way that the TeX code for every link is the same shared object, to save space.
The answers are (1) and (5), but I am not sure why. Could someone please explain this to me, and why the other answers are incorrect? How can I understand how things like binary/linear search behave on different data structures?
Thank you
I am hoping you already know about binary search.
(1) True
Explanation
To perform binary search we have to get to the middle of the sorted list. In a linked list, reaching the middle means traversing half the list starting from the head, whereas in an array we can jump straight to the middle index if we know the length. So the linked list spends O(n/2) on a step an array does in O(1); therefore a linked list is not an efficient way to implement binary search.
(2) False
Same explanation as above.
(3) False
Explanation
As explained in point 1, a linked list cannot be used efficiently for binary search, but an array can.
(4) False
Explanation
Binary search's worst-case time is O(log n), since we don't need to traverse the whole list. In the first iteration, if the key is less than the middle value we discard the second half of the list; we then repeat on the remaining half. With every iteration we discard the part of the list we don't have to examine, so it clearly takes less than O(n).
(5) True
Explanation
If the element is found in O(1) time, that means only one iteration of the loop ran. In the first iteration we always compare against the middle element of the list, so the search takes O(1) time only if the middle element happens to be the key.
In short, binary search is an elimination based searching technique that can be applied when the elements are sorted. The idea is to eliminate half the keys from consideration by keeping the keys in sorted order. If the search key is not equal to the middle element, one of the two sets of keys to the left and to the right of the middle element can be eliminated from further consideration.
Now coming to your specific question:
(1) True
The basic binary search requires that the mid-point can be found in O(1) time, which is not possible in a linked list and can be far more expensive if the size of the list is unknown.
(2) True
(3) False
(4) False
In binary search, the mid-point calculation should be done in O(1) time, which is only possible with arrays, since array indices are known. Secondly, binary search can only be applied to arrays which are in sorted order.
(5) False
The answer by Vaibhav Khandelwal explained it nicely, but I wanted to add a variation of the array to which binary search can still be applied. If the given array is sorted but rotated by X positions and contains duplicates, for example,
3 5 6 7 1 2 3 3 3
then a modified binary search still applies to it, but in the worst case we need to go through the list linearly to find the required element, which is O(n).
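Here is a sketch of such a search (my own illustration): when the left, middle and right values are all equal we cannot tell which half is sorted, so the range shrinks by only one position per step, which is exactly what degrades the worst case to O(n):

def search_rotated_with_duplicates(nums, target):
    # Binary search on a sorted-then-rotated array that may contain
    # duplicates, e.g. [3, 5, 6, 7, 1, 2, 3, 3, 3].
    lo, hi = 0, len(nums) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if nums[mid] == target:
            return True
        if nums[lo] == nums[mid] == nums[hi]:
            # Cannot decide which half is sorted; shrink by one.
            lo += 1
            hi -= 1
        elif nums[lo] <= nums[mid]:                 # left half is sorted
            if nums[lo] <= target < nums[mid]:
                hi = mid - 1
            else:
                lo = mid + 1
        else:                                       # right half is sorted
            if nums[mid] < target <= nums[hi]:
                lo = mid + 1
            else:
                hi = mid - 1
    return False

print(search_rotated_with_duplicates([3, 5, 6, 7, 1, 2, 3, 3, 3], 1))  # True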
True
If the element is found on the first attempt, i.e. it is situated at the mid-point, then it is found in O(1) time.
MidPointOfArray = (LeftSideOfArray + RightSideOfArray) / 2
The best way to understand binary search is to think of exam papers sorted by last name. To find a particular student's paper, the teacher searches within that student's section of the alphabet and rules out the papers that are not alphabetically close to the student's name.
For example, if the name is Alex Bob, the teacher starts her search directly at "B", pulls out all the papers with surnames starting with "B", then repeats the process on the next letter, "o", and so on until she finds the paper (or determines it is not there).
Given a text file in the format below, where each line is a list of up to 50 names, write a program that produces a list of pairs of names which appear together in at least fifty different lists.
Tyra,Miranda,Naomi,Adriana,Kate,Elle,Heidi
Daniela,Miranda,Irina,Alessandra,Gisele,Adriana
In the above sample, Miranda and Adriana appear together twice, but every other pair appears only once. It should return "Miranda,Adriana\n". An approximate solution may be returned, containing pairs which appear together at least 50 times with high probability.
I was thinking of the following solution:
Generate a Map<Pair, Integer> pairToCountMap after reading through the file.
Iterate through the map, and print those with counts >= 50.
Is there a better way to do this? The file could be very large, and I'm not sure what is meant by the approximate solution. Any links or resources would be much appreciated.
First let's assume that names are limited in length, so operations on them are constant time.
Your approach should be acceptable if the map fits in memory. If you have N lines with m names each, your solution should take O(N*m*m) time to complete.
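A sketch of that in-memory counting approach in Python (the file name names.txt is a placeholder):

from collections import Counter
from itertools import combinations

pair_counts = Counter()
with open("names.txt") as f:
    for line in f:
        names = sorted(set(line.strip().split(",")))
        # Every unordered pair on this line, counted once per line.
        pair_counts.update(combinations(names, 2))

for (a, b), count in pair_counts.items():
    if count >= 50:
        print(f"{a},{b}")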
If that data set doesn't fit in memory, you can write the pairs to a file, sort that file using a merge sort, then scan through to count pairs. The running time of this is O(N*m*log(N*m)), but due to details about speed of disk access will run much faster in practice.
If you have a distributed cluster, then you could use a MapReduce. It would run very similarly to the last solution.
As for the statistics approach, my guess is that they mean running through the file to find the frequency of each name and the number of lines with different numbers of names in them. If we assume each line is a random assortment of names, statistics let us estimate how many intersections to expect between any pair of common names. This is roughly linear in the length of the file.
For each name, you can obtain the list of line numbers where it appears (use a hash table keyed by name), then for every pair of names compute the size of the intersection of the corresponding line-number lists (for two increasing sequences this takes linear time).
Say the length of a name is bounded by a constant. Then if you have N distinct names and M lines, building the index is O(MN) and the final stage is O(N^2 M).
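A rough sketch of this second approach (again with a placeholder file name), using the usual two-pointer walk for the intersection of two increasing sequences:

from collections import defaultdict
from itertools import combinations

lines_of = defaultdict(list)              # name -> increasing list of line numbers
with open("names.txt") as f:
    for line_no, line in enumerate(f):
        for name in set(line.strip().split(",")):
            lines_of[name].append(line_no)

def intersection_size(a, b):
    # Size of the intersection of two increasing sequences, linear time.
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

for name_a, name_b in combinations(lines_of, 2):
    if intersection_size(lines_of[name_a], lines_of[name_b]) >= 50:
        print(f"{name_a},{name_b}")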
Specifically: given two large files with 64-bit integers, produce a file with the integers that are present in both files, and estimate the time complexity of your algorithm.
How would you solve this?
I changed my mind; I actually like @Ryan's radix sort idea, except I would adapt it a bit for this specific problem.
Let's assume there are so many numbers that they do not fit in memory, but we have all the disk we want. (Not unreasonable given how the question was phrased.)
Call the input files A and B.
So, create 512 new files; call them file A_0 through A_255 and B_0 through B_255. File A_0 gets all of the numbers from file A whose high byte is 0. File A_1 gets all of the numbers from file A whose high byte is 1. File B_37 gets all the numbers from file B whose high byte is 37. And so on.
Now all possible duplicates are in (A_0, B_0), (A_1, B_1), etc., and those pairs can be analyzed independently (and, if necessary, recursively). And all disk accesses are reasonably linear, which should be fairly efficient. (If not, adjust the number of bits you use for the buckets...)
This is still O(n log n), but it does not require holding everything in memory at any time. (Here, the constant factor in the radix sort is log(2^64) or thereabouts, so it is not really linear unless you have a lot more than 2^64 numbers. Unlikely even for the largest disks.)
[edit, to elaborate]
The whole point of this approach is that you do not actually have to sort the two lists. That is, with this algorithm, at no time can you actually enumerate the elements of either list in order.
Once you have the files A_0, B_0, A_1, B_1, ..., A_255, B_255, you simply observe that no numbers in A_0 can be the same as any number in B_1, B_2, ..., B_255. So you start with A_0 and B_0, find the numbers common to those files, append them to the output, then delete A_0 and B_0. Then you do the same for A_1 and B_1, A_2 and B_2, etc.
To find the common numbers between A_0 and B_0, you just recurse... Create file A_0_0 containing all elements of A_0 with second byte equal to zero. Create file A_0_1 containing all elements of A_0 with second byte equal to 1. And so forth. Once all elements of A_0 and B_0 have been bucketed into A_0_0 through A_0_255 and B_0_0 through B_0_255, you can delete A_0 and B_0 themselves because you do not need them anymore.
Then you recurse on A_0_0 and B_0_0 to find common elements, deleting them as soon as they are bucketed... And so on.
When you finally get down to buckets that only have one element (possibly repeated), you can immediately decide whether to append that element to the output file.
At no time does this algorithm consume more than 2+epsilon times the original space required to hold the two files, where epsilon is less than half a percent. (Proof left as an exercise for the reader.)
I honestly believe this is the most efficient algorithm among all of these answers if the files are too large to fit in memory. (As a simple optimization, you can fall back to the std::set solution if and when the "buckets" get small enough.)
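A sketch of one bucketing pass (assuming, purely for illustration, that each file stores fixed-width 8-byte big-endian integers; the question does not specify the on-disk format):

def bucket_by_byte(path, byte_index):
    # Split the file into 256 bucket files according to the byte at
    # position byte_index (0 = most significant byte).
    names = [f"{path}.bucket{b:03d}" for b in range(256)]
    buckets = [open(name, "wb") for name in names]
    with open(path, "rb") as f:
        while True:
            chunk = f.read(8)
            if len(chunk) < 8:
                break
            buckets[chunk[byte_index]].write(chunk)
    for b in buckets:
        b.close()
    return names

# Pair up A's and B's buckets; recurse on the next byte, or switch to an
# in-memory set intersection once a pair of buckets is small enough:
# a_buckets = bucket_by_byte("A", 0)
# b_buckets = bucket_by_byte("B", 0)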
You could do a radix sort, then iterate over the sorted results keeping the matches. Radix sort is O(DN), where D is the number of digits in the numbers. The largest 64-bit number is 19 digits long, so the sort for 64-bit integers with a radix of 10 will run in about 19N, or O(N), and the search runs in O(N). Thus this would run in O(N) time, where N is the number of integers in both files.
Assuming the files are too large to fit into memory, use an external least-significant-digit (LSD) radix sort on each of the files, then iterate through both files to find the intersection:
external LSD sort on base N (N=10 or N=100 if the digits are in a string format, N=16/32/64 if in binary format):
Create N temporary files (0 - N-1). Iterate through the input file. For each integer, find the rightmost digit in base N, and append that integer to the temporary file corresponding to that digit.
Then create a new set of N temporary files, iterate through the previous set of temporary files, find the 2nd-to-the-rightmost digit in base N (prepending 0s where necessary), and append that integer to the new temporary file corresponding to that digit. (and delete the previous set of temporary files)
Repeat until all the digits have been covered. The last set of temporary files contains the integers in sorted order. (Merge if you like into one file, otherwise treat the temporary files as one list.)
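A sketch of this external sort in Python, assuming for illustration fixed-width 8-byte little-endian integers and base 256, i.e. one byte per pass (the answer leaves the exact format and base open):

import os
import shutil
import struct
import tempfile

def external_lsd_radix_sort(in_path, out_path):
    src = in_path
    for byte_pos in range(8):                    # least significant byte first
        # One temporary bucket file per possible byte value.
        buckets = [tempfile.NamedTemporaryFile(delete=False) for _ in range(256)]
        with open(src, "rb") as f:
            while True:
                chunk = f.read(8)
                if len(chunk) < 8:
                    break
                value = struct.unpack("<Q", chunk)[0]
                buckets[(value >> (8 * byte_pos)) & 0xFF].write(chunk)
        # Concatenate the buckets in order; the result feeds the next pass.
        nxt = tempfile.NamedTemporaryFile(delete=False)
        for b in buckets:
            b.close()
            with open(b.name, "rb") as bf:
                shutil.copyfileobj(bf, nxt)
            os.remove(b.name)
        nxt.close()
        if src != in_path:
            os.remove(src)
        src = nxt.name
    shutil.move(src, out_path)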
Finding the intersection:
Iterate through the sorted integers of both files with a pair of iterators, one pointing to the current integer in each file. At each step, if the two numbers match, append the value to an output list and advance both iterators. Otherwise, throw away the smaller number and advance its iterator. Stop when either iterator reaches the end.
(This outputs duplicates where there are input duplicates. If you want to remove duplicates, then the "advance the iterator" step should advance the iterator until the next larger number appears or the file ends.)
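A sketch of that merge step, assuming for illustration that each sorted file ends up as text with one integer per line; as noted above, input duplicates will produce output duplicates:

def intersect_sorted_files(path_a, path_b, out_path):
    with open(path_a) as fa, open(path_b) as fb, open(out_path, "w") as out:
        a, b = fa.readline(), fb.readline()
        while a and b:
            x, y = int(a), int(b)
            if x == y:
                out.write(f"{x}\n")          # match: advance both iterators
                a, b = fa.readline(), fb.readline()
            elif x < y:
                a = fa.readline()            # throw away the smaller number
            else:
                b = fb.readline()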
Read integers from both files into two sets (this will take O(N*logN) time), then iterate over the two sets and write common elements to the output file (this will take O(N) time). Complexity summary - O(N*logN).
Note: The iteration part will perform faster if we store integers into vectors and then sort them, but here we will use much more memory if there are many duplicates of integers inside the files.
UPD: You can also store in memory only the distinct integers from one of the files:
Read the values from the smaller file into a set. Then read the values from the second file one by one. For each number x, check its presence in the set in O(logN). If it is there, print it and remove it from the set to avoid printing it twice. The complexity remains O(N*logN), but you only use the memory necessary to store the distinct integers of the smaller file.
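A sketch of this UPD variant (one integer per line is assumed as the file format; note that Python's built-in set is hash-based, so each membership check is expected O(1) rather than the O(log N) of a tree-based set):

def common_integers(small_path, large_path, out_path):
    with open(small_path) as f:
        remaining = {int(line) for line in f}   # distinct values of the smaller file
    with open(large_path) as f, open(out_path, "w") as out:
        for line in f:
            x = int(line)
            if x in remaining:
                out.write(f"{x}\n")
                remaining.discard(x)            # avoid printing the value twice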