Removing duplicate strings with limited memory [closed] - algorithm

Say we have a list of strings and we can't load the entire list into memory, but we can load parts of the list from a file. What would be the best way to remove the duplicates?

One approach would be to use an external sort to sort the file, and then remove the duplicates with a single iteration over the sorted list. This approach requires very little extra space and O(n log n) accesses to the disk.
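To make that concrete, here is a rough Python sketch of the external-sort route, assuming one string per line; the chunk size and temporary file names are illustrative choices, and heapq.merge performs the k-way merge of the sorted chunk files:

import heapq
import itertools
from contextlib import ExitStack

CHUNK_LINES = 100_000  # sized so that one chunk fits comfortably in memory

def external_sort_dedupe(in_path, out_path):
    # Phase 1: sort chunks that fit in memory and write each to disk.
    chunk_paths = []
    with open(in_path) as f:
        while True:
            chunk = list(itertools.islice(f, CHUNK_LINES))
            if not chunk:
                break
            chunk.sort()
            path = "chunk_{}.tmp".format(len(chunk_paths))
            with open(path, "w") as out:
                out.writelines(chunk)
            chunk_paths.append(path)
    # Phase 2: k-way merge of the sorted chunks; duplicates arrive
    # adjacent, so a single streaming pass drops them.
    with ExitStack() as stack, open(out_path, "w") as out:
        files = [stack.enter_context(open(p)) for p in chunk_paths]
        prev = None
        for line in heapq.merge(*files):
            if line != prev:
                out.write(line)
            prev = line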
Another approach is based on hashing: compute a hashcode for each string, and load a sublist that contains all strings whose hashcode falls in a specific range. It is guaranteed that if x is loaded and it has a duplicate, the duplicate will be loaded into the same bucket as well.
This requires O(n*#buckets) accesses to the disk, but might require more memory. You can invoke the procedure recursively (with different hash functions) if needed.
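A minimal sketch of this bucketing idea in Python, again assuming one string per line; the bucket count is an assumption, tuned so that each bucket file fits in memory:

import os

NUM_BUCKETS = 64  # tune so each bucket file fits in memory

def dedupe_by_hash(in_path, out_path):
    # Pass 1: scatter lines into bucket files by hash range.
    buckets = [open("bucket_{}.tmp".format(i), "w") for i in range(NUM_BUCKETS)]
    with open(in_path) as f:
        for line in f:
            buckets[hash(line) % NUM_BUCKETS].write(line)
    for b in buckets:
        b.close()
    # Pass 2: duplicates always land in the same bucket, so each
    # bucket can be deduplicated independently in memory.
    with open(out_path, "w") as out:
        for i in range(NUM_BUCKETS):
            path = "bucket_{}.tmp".format(i)
            with open(path) as b:
                out.writelines(set(b))
            os.remove(path)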

My solution would be to do a merge sort, which works well with external memory. After sorting, searching for duplicates is as easy as comparing two adjacent elements at a time.
Example:
0: cat
1: dog
2: bird
3: cat
4: elephant
5: cat
After merge sort:
0: bird
1: cat
2: cat
3: cat
4: dog
5: elephant
Then simply compare 0 & 1 -> no duplicates, so move forward.
1 & 2 -> duplicate, remove 1 (this could be as simple as filling it with an empty string to skip over later)
compare 2 & 3 -> remove 2
etc.
The reason for removing the first element of each duplicate pair (1, then 2) rather than the second is that it allows for a more efficient comparison -- you never have to worry about skipping indices that have already been removed.
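In Python, the comparison pass over the sorted data from the example might look like this sketch:

words = ["bird", "cat", "cat", "cat", "dog", "elephant"]  # already merge-sorted

# Compare each element with its successor and blank out the first of
# every duplicate pair, so the scan never has to skip removed slots.
for i in range(len(words) - 1):
    if words[i] == words[i + 1]:
        words[i] = ""  # "remove" by filling with an empty string

print([w for w in words if w])  # ['bird', 'cat', 'dog', 'elephant']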

Related

How To Empty a Dynamic Array [closed]

I need to reuse a dynamic array many times because I believe this gives better performance.
Hence, I don't want to create a new dynamic array every time I need one.
I want to ask: can it lead to bugs or inefficiency if I use the same array for several operations, then clear it and reuse it? And how can I correct my procedure so that it does what I need?
My code:
procedure Empty(var local_array: TArray<Integer>);
var
  i: Integer;
begin
  // Integer elements cannot be set to nil; reset them to 0 instead
  for i := 0 to High(local_array) do
    local_array[i] := 0;
  // SetLength releases the storage; the array must be passed by var
  // for this to affect the caller. Omit this line to keep the
  // allocation around for reuse (see the answer below).
  SetLength(local_array, 0);
end;
If you want to reuse your array, don't mess with its size. Changing the size of an array, or more specifically increasing it, is what can lead to the need for data reallocation.
What is array data reallocation?
In Delphi, all arrays need to be stored in a contiguous memory block. This means that if you try to increase the size of your array and there is already some data after the memory block currently assigned to it, the whole array needs to be moved to another memory location where there is enough space to store the new array size in one contiguous block.
So instead of resizing your array, leave its size alone and just set the array items to some default value. Yes, this means that such an array will still occupy its allocated memory. But that is the goal of reusing the array: you avoid the overhead of allocating and deallocating its memory.
If you go this way, don't forget to store your own count of used items, since the array's length may be larger than the number of items actually in use.

Searching for similar columns in a huge csv file [closed]

I have a huge CSV file which has 5000 columns and 5,000,000 rows. I know that some columns in this file are exactly the same. I want to identify such columns. Please note that I cannot fit this huge file into memory, and runtime is also important.
Exactly the same?
I suppose you can verify it with hash functions.
step 1 - Load the 5,000 values of the first row and calculate 5,000 hash values; exclude the columns whose hash matches no other column.
step 2 - Load the next row (only for the surviving columns) and calculate the hash of the concatenation of the preceding hash with the loaded value; again exclude the columns whose running hash matches no other column.
Following steps: exactly as step 2 - load and concatenate/hash, excluding columns without matches.
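As a sketch only, here is one way the rolling per-column hash could look in Python; the use of hashlib and the file layout are my assumptions, and a final verification pass would still be needed to rule out hash collisions:

import csv
import hashlib
from collections import defaultdict

def prune(digests, candidates):
    # Drop any column whose running hash matches no other candidate.
    counts = defaultdict(int)
    for c in candidates:
        counts[digests[c]] += 1
    candidates -= {c for c in candidates if counts[digests[c]] == 1}

def identical_column_groups(path):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        row = next(reader)
        # Seed one running hash per column from the first row.
        digests = [hashlib.sha256(v.encode()).digest() for v in row]
        candidates = set(range(len(row)))
        prune(digests, candidates)
        for row in reader:
            if len(candidates) < 2:
                break
            for c in candidates:
                # Hash the previous digest concatenated with the new value.
                digests[c] = hashlib.sha256(digests[c] + row[c].encode()).digest()
            prune(digests, candidates)
    # Surviving columns with equal running hashes are (probably) identical.
    groups = defaultdict(list)
    for c in candidates:
        groups[digests[c]].append(c)
    return [g for g in groups.values() if len(g) > 1]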

Garbage bin sorting algorithm [closed]

I'm having a real-world problem at work and I was hoping to solve it with Python, but I can't find the right algorithm.
Say I have trashcans with fixed-size openings and I need to throw away a number of specific-sized trash bags; each bag must fit. What algorithm can I use to throw away the trash using the minimum number of trashcans?
This sounds like it could be a bin packing or knapsack problem. You should know that both are NP-hard, so no polynomial-time algorithm is known that always finds an optimal solution. However, there are a number of heuristic algorithms that are easy to implement and can guarantee "near" optimal solutions.
The most well known are First Fit and Best Fit. Both of these algorithms are guaranteed to pack items into bins such that the number of bins used is not greater than 1.7x the number of bins used in an optimal solution. You can read more about these algorithms in the literature on bin packing.
A very simple example using First Fit in Python might look like this:
import random

bin_capacity = 1.0
bins = [[]]
bin_stats = [0.0]  # used to keep track of how full each bin is
jobs = [random.uniform(0.0, 1.0) for i in range(100)]

for job in jobs:  # iterate through jobs
    job_assigned = False
    for i in range(len(bin_stats)):
        if job + bin_stats[i] <= bin_capacity:
            # the job fits into this bin, so assign it there
            bins[i].append(job)
            bin_stats[i] += job
            job_assigned = True
            break
    if not job_assigned:
        # the job doesn't fit into any open bin, so open a new one
        bins.append([job])
        bin_stats.append(job)

for i in range(len(bins)):
    print("Bin {} is {:.2f}% full and contains the following jobs".format(i, bin_stats[i] * 100))
    for job in bins[i]:
        print("\t", job)

how to write algorithm to insert multiple elements in queue [closed]

How can we write an algorithm to add multiple elements, say 5 elements {1,2,3,4,5}, to a queue?
I searched a lot but only found an algorithm to insert a single item; I don't know how to run a loop to insert multiple elements.
The algorithm to insert one item which I found is:
1. Start
2. Check if the queue is full: if (rear == N-1), print "Queue is Full" and exit; otherwise go to step 3
3. Increment the rear: ++rear;
4. Add the item at the 'rear' position: Q[rear] = item;
5. Exit
0. Start
1. Initialize an index variable i to 0
2. Check if the number of elements inserted so far equals M (where M is the number of elements to insert). If it does, go to step 7
3. Check if the queue is full: if (rear == N-1), print "Queue is Full" and exit
4. Increment the rear: ++rear;
5. Add the item at the 'rear' position: Q[rear] = items[i];
6. Increment the index variable and go to step 2
7. Exit
Alternatively, you could check whether the queue has space for M elements before the loop. Steps 1 to 6 can be implemented using a for loop (of course, any other loop would do the trick).
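Here is a small Python sketch of the whole procedure, with N, Q, and rear named after the pseudocode above; the up-front capacity check is the alternative just mentioned:

N = 10           # queue capacity
Q = [None] * N   # queue storage
rear = -1        # index of the last inserted element

def enqueue_many(items):
    global rear
    # Up-front check: fail early if all M items cannot fit.
    if rear + len(items) >= N:
        print("Queue is Full")
        return
    for item in items:
        rear += 1
        Q[rear] = item

enqueue_many([1, 2, 3, 4, 5])
print(Q[:rear + 1])  # [1, 2, 3, 4, 5]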

Algorithm that works like Unix Sort [closed]

For example, our input file in.txt:
naturalistic 10
coppering 20
artless 30
after command: sort in.txt
artless 30
coppering 20
naturalistic 10
after command: sort -n -k 2 in.txt
naturalistic 10
coppering 20
artless 30
My question: how can I keep the lines stable while sorting according to a column? I want the whole line to stay intact while its position in the output changes. What algorithm or piece of code is useful? Is this about the file reading or the sorting facility?
Standard UNIX sort doesn't document which algorithm it uses. It may even choose a different algorithm depending on such things as the size of the input or the sort options.
The Wikipedia page on sorting algorithms lists many sorting algorithms you can choose from.
If you want a stable sort, there are plenty of options (the comparison table on the same Wikipedia page lists which ones are stable), but in fact any sorting algorithm can be made stable by tagging each data item with its original position in the input and breaking ties in the key comparison function according to that position.
Other than that, it's not exactly clear what you're asking. In your question you demonstrate the use of sort with and without -n and -k options, but it's not clear why this should influence the actual choice of sort algorithm...
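As a quick illustration of that tagging trick in Python, using the sample lines from the question:

lines = ["naturalistic 10", "coppering 20", "artless 30"]

# Tag each line with its original position; ties on the numeric key
# are broken by that position, which makes any sort behave stably.
tagged = list(enumerate(lines))
tagged.sort(key=lambda t: (int(t[1].split()[1]), t[0]))

for _, line in tagged:
    print(line)  # naturalistic 10 / coppering 20 / artless 30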
I would just create a hash table of the strings, with the number as key and the string as value (I'm assuming they are unique); then for the plain sort command I'd sort based on values, and for -n -k 2 I'd sort based on keys.
The POSIX standard does not dictate which algorithm to use, so different Unix flavours may use different algorithms. GNU sort uses merge sort: http://en.wikipedia.org/wiki/Merge_sort
