For example, our input file in.txt:
naturalistic 10
coppering 20
artless 30
after command: sort in.txt
artless 30
coppering 20
naturalistic 10
after command: sort -n -k 2 in.txt
naturalistic 10
coppering 20
artless 30
My question: how can I keep each line intact while sorting by a particular column?
I want the whole line to stay together while its position in the output changes.
What algorithm or piece of code is useful here? Is this about how the file is read, or about the sorting facility itself?
Standard UNIX sort doesn't document which algorithm it uses. It may even choose a different algorithm depending on such things as the size of the input or the sort options.
The Wikipedia page on sorting algorithms lists many sorting algorithms you can choose from.
If you want a stable sort, there are plenty of options (the comparison table on the same Wikipedia page lists which ones are stable). In fact, any sorting algorithm can be made stable by tagging each data item with its original position in the input and breaking ties in the key comparison function according to that position.
Other than that, it's not exactly clear what you're asking. In your question you demonstrate the use of sort with and without -n and -k options, but it's not clear why this should influence the actual choice of sort algorithm...
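For illustration, here is a minimal Python sketch of that tagging trick (Python's own sort is already stable, so the explicit tag is only there to show the general technique):

# Tag each line with its original position, sort on (key, position), then
# drop the tag. This makes any sort behave stably, even a non-stable one.
def stable_sort_by_column(lines, column, numeric=False):
    def sort_key(tagged):
        line, original_index = tagged
        field = line.split()[column]
        value = float(field) if numeric else field
        return (value, original_index)      # original position breaks ties

    tagged = [(line, i) for i, line in enumerate(lines)]
    tagged.sort(key=sort_key)
    return [line for line, _ in tagged]

lines = ["naturalistic 10", "coppering 20", "artless 30"]
print(stable_sort_by_column(lines, column=1, numeric=True))
# ['naturalistic 10', 'coppering 20', 'artless 30']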
I would just create a hash table of the strings, with the number as key and the string as value (I'm assuming they are unique). For the plain sort command I'd sort based on the values, and for -n -k 2 I'd sort based on the keys.
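A minimal Python sketch of that idea, assuming the numbers really are unique:

# Hash table with the number as key and the word as value.
records = {10: "naturalistic", 20: "coppering", 30: "artless"}

# equivalent of plain `sort`: order the lines by the word (the value)
for number, word in sorted(records.items(), key=lambda kv: kv[1]):
    print(word, number)

# equivalent of `sort -n -k 2`: order the lines by the number (the key)
for number, word in sorted(records.items()):
    print(word, number)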
The POSIX standard does not dictate which algorithm to use, so different Unix flavours may use different algorithms. GNU sort uses merge sort: http://en.wikipedia.org/wiki/Merge_sort
I'm looking for an algorithm that classifies differently formatted 10-digit (mostly) integer keys. The training data set looks like this:
+------------+----------------+
| key | classification |
+------------+----------------+
| 1000012355 | US |
| 1000045331 | US |
| 0000123101 | DE |
| 0003453202 | DE |
| 000K213411 | ES |
| 000K243221 | ES |
+------------+----------------+
The keys originate from different systems and are created in different ways. A large training data set is available. While I assume that some part of each key is random, the structure is not.
Any help will be appreciated.
Before building models, training, and predicting, it's better to analyze the problem first. You assumed that some part of those keys is random while the structure is not; you need to explore the data set to test that hypothesis and, based on the distribution of the data, decide which model to use.
Convert each string to a vector by treating each character as a categorical feature and applying one-hot encoding; you will get a sparse, high-dimensional matrix. After this step you can compute statistics, analyze, and model the training data.
Then analyze the data. One simple and effective method is visual analysis. For high-dimensional data you can use Andrews curves, parallel coordinates, and so on. You can also use dimensionality-reduction methods such as PCA or ICA and then visualize the low-dimensional data.
Depending on the visualization results, you can choose your model. If, given the feature distribution, the different categories separate easily, you can use almost any classification algorithm, such as LR, SVM, or even clustering; for a multi-class problem you can use OVO or OVR. If the visualization is poor and the distinction between classes is not obvious, you may need to do some feature engineering, or try tree models and ensemble learning methods.
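A rough sketch of that workflow, assuming pandas, scikit-learn and matplotlib are available (the six keys are just the sample from the question):

# Rough sketch: one-hot encode each character position, reduce with PCA,
# and plot the keys in 2D to see whether the classes separate.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "key": ["1000012355", "1000045331", "0000123101",
            "0003453202", "000K213411", "000K243221"],
    "classification": ["US", "US", "DE", "DE", "ES", "ES"],
})

# one column per character position
chars = df["key"].apply(list).apply(pd.Series)

encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(chars)            # sparse, high-dimensional matrix

points = PCA(n_components=2).fit_transform(X.toarray())
for label in df["classification"].unique():
    mask = (df["classification"] == label).to_numpy()
    plt.scatter(points[mask, 0], points[mask, 1], label=label)
plt.legend()
plt.show()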
You could do a one-hot encoding of each character, and concatenate these.
That is, say you have 20 possible characters that each of these 10 characters in the key can take on. You could then convert each character to a 20-length vector of zeros, with a one in the position corresponding to the particular character. You would then have an overall feature vector of length 10 * 20 = 200. You could then feed this into any classification algorithm as inputs, with the target outputs being the possible countries.
If this is truly deterministic, and the keys can be separated, a decision tree might find the perfect solution. Or even logistic regression? If there is some 'fuzziness' then something like Random Forest might work better.
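A rough sketch of such a pipeline, assuming scikit-learn is available; with only the six sample keys the predictions are not meaningful, this just shows the plumbing:

# Per-character one-hot features feeding a decision tree, as suggested above.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

keys = ["1000012355", "1000045331", "0000123101",
        "0003453202", "000K213411", "000K243221"]
labels = ["US", "US", "DE", "DE", "ES", "ES"]

# split each 10-character key into 10 categorical features
X = [list(k) for k in keys]

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),   # unseen characters become all-zero columns
    DecisionTreeClassifier(),
)
model.fit(X, labels)

# with this tiny sample the output only demonstrates the plumbing
print(model.predict([list("000K999999"), list("0009999999")]))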
I have a huge CSV file which has 5,000 columns and 5,000,000 rows. I know that some columns in this file are exactly the same, and I want to identify them. Please note that I cannot fit this huge file into memory, and runtime is also important.
Exactly the same?
I suppose you can verify it with hash functions.
Step 1 - load the 5,000 values of the first row and compute 5,000 hash values; discard the columns whose hash does not match the hash of at least one other column.
Step 2 - load the next row (only for the surviving columns) and, for each column, compute the hash of the concatenation of the previous hash with the newly loaded value; again discard the columns whose hash has no match.
Following steps: exactly as step 2 - load, concatenate and hash, discarding the columns left without matches.
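A rough Python sketch of that running-hash idea, assuming a plain comma-separated file with a consistent number of columns per row:

import csv
import hashlib
import itertools
from collections import defaultdict

def duplicate_column_groups(path):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        first_row = next(reader)
        hashes = [hashlib.sha1() for _ in first_row]   # one running hash per column
        candidates = set(range(len(first_row)))
        groups = {}

        for row in itertools.chain([first_row], reader):
            for c in candidates:
                hashes[c].update(row[c].encode())
                hashes[c].update(b"\x00")              # field separator
            # keep only columns whose running hash still matches another column's
            groups = defaultdict(list)
            for c in candidates:
                groups[hashes[c].digest()].append(c)
            candidates = {c for cols in groups.values() if len(cols) > 1 for c in cols}
            if not candidates:
                return []

    # surviving groups of columns are identical (up to hash collisions)
    return [cols for cols in groups.values() if len(cols) > 1]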
Say we have a list of strings and we can't load the entire list into memory, but we can load parts of the list from the file. What would be the best way to remove the duplicate strings in this case?
One approach would be to use external sort to sort the file, and then remove the duplicates with a single iteration over the sorted list. This approach requires very little extra space and O(n log n) accesses to the disk.
Another approach is based on hashing: compute a hashcode for each string and load the sublist of all strings whose hashcode falls in a specific range. It is guaranteed that if x is loaded and has a duplicate, the duplicate is loaded into the same bucket as well.
This requires O(n * #buckets) accesses to the disk, but might require more memory. You can invoke the procedure recursively (with different hash functions) if needed.
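A minimal Python sketch of the hashing approach, assuming one string per line and that each individual bucket file fits in memory (output order is not preserved):

import os

def dedupe_by_hash_buckets(in_path, out_path, n_buckets=256):
    # pass 1: scatter lines into bucket files by hash; duplicates of a
    # string always end up in the same bucket
    buckets = [open(f"bucket_{i}.tmp", "w") for i in range(n_buckets)]
    with open(in_path) as f:
        for line in f:
            buckets[hash(line) % n_buckets].write(line)
    for b in buckets:
        b.close()

    # pass 2: dedupe each bucket independently in memory
    with open(out_path, "w") as out:
        for i in range(n_buckets):
            path = f"bucket_{i}.tmp"
            with open(path) as b:
                out.writelines(set(b))        # set() drops duplicate lines
            os.remove(path)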
My solution would be to do a merge sort, which allows for external memory usage. After sorting, searching for duplicates would be as easy as only ever comparing two elements.
Example:
0: cat
1: dog
2: bird
3: cat
4: elephant
5: cat
Merge sort
0: bird
1: cat
2: cat
3: cat
4: dog
5: elephant
Then simply compare 0 & 1 -> no duplicates, so move forward.
1 & 2 -> duplicate, remove 1 (this could be as simple as filling it with an empty string to skip over later)
compare 2 & 3 -> remove 2
etc.
The reason for removing the earlier index each time (1, then 2) rather than the later one (2, then 3) is that it allows for a more efficient comparison: you never have to worry about skipping indices that have already been removed.
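A small Python sketch of that post-sort scan (the external merge sort itself is assumed to have already produced the sorted sequence):

# After sorting, duplicates are adjacent, so one pass comparing
# neighbours removes them.
def dedupe_sorted(items):
    result = []
    for item in items:
        if not result or item != result[-1]:   # keep only the first of each run
            result.append(item)
    return result

print(dedupe_sorted(["bird", "cat", "cat", "cat", "dog", "elephant"]))
# ['bird', 'cat', 'dog', 'elephant']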
I am reading Programming Pearls by Jon Bentley (reference).
Here the author discusses various sorting algorithms, such as merge sort and multipass sort.
Questions:
How does the merge sort algorithm work when it reads the input file only once, uses work files, and writes the output file only once?
How does the author show that the 40-pass (i.e. multipass) sort algorithm works by writing to the output file only once and using no work files?
Can someone explain the above with a simple example, like having enough memory to store 3 digits and 10 digits to sort, e.g. 9, 0, 8, 6, 5, 4, 1, 2, 3, 7?
This is from Chapter 1 of Jon Bentley's Programming Pearls, 2nd Edn (1999), which is an excellent book. The equivalent example from the first edition is slightly different; the multipass algorithm only made 27 passes over the data (and there was less memory available).
The sort described by Jon Bentley has special setup constraints.
File contains at most 10 million records.
Each record is a 7 digit number.
There is no other data associated with the records.
There is only 1 MiB of memory available when the sort must be done.
Question 1
The single read of the input file slurps as many lines from the input as will fit in memory, sorts that data, and writes it out to a work file. Rinse and repeat until there is no more data in the input file.
Then, complete the process by reading the work files and merging the sorted contents into a single output file. In extreme cases, it might be necessary to create new, bigger work files because the program can't read all the work files at once. If that happens, you arrange for the final pass to have the maximum number of inputs that can be handled, and have the intermediate passes merge appropriate numbers of files.
This is a general purpose algorithm.
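A minimal Python sketch of that general-purpose scheme, assuming each chunk fits in memory and that all work files can be merged in one go:

import heapq
import itertools
import tempfile

def external_sort(in_path, out_path, chunk_size=100_000):
    work_files = []
    with open(in_path) as f:
        while True:
            # read as many lines as fit in one in-memory chunk
            chunk = [line.rstrip("\n") + "\n" for line in itertools.islice(f, chunk_size)]
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.TemporaryFile(mode="w+")   # one sorted work file per chunk
            tmp.writelines(chunk)
            tmp.seek(0)
            work_files.append(tmp)

    # merge the sorted work files lazily into the single output file
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*work_files))
    for tmp in work_files:
        tmp.close()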
Question 2
This is where the peculiar properties of the data are exploited. Since the numbers are unique and limited in range, the algorithm can read the file the first time, extracting numbers from the first fortieth of the range, sorting and writing those; then it extracts the second fortieth of the range, then the third, ..., then the last fortieth.
This is a special-purpose algorithm, exploiting the nature of the numbers.
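A rough Python sketch of the multipass idea, assuming each fortieth of the range fits in memory (the 7-digit formatting follows Bentley's setup):

# On each pass, read the whole input, keep only the numbers that fall
# into the current slice of the range, sort that slice in memory, and
# append it to the output file. No work files are used.
def multipass_sort(in_path, out_path, passes=40, max_value=10_000_000):
    slice_size = (max_value + passes - 1) // passes
    with open(out_path, "w") as out:
        for p in range(passes):
            low, high = p * slice_size, (p + 1) * slice_size
            with open(in_path) as f:
                batch = [int(line) for line in f if low <= int(line) < high]
            batch.sort()
            out.writelines(f"{n:07d}\n" for n in batch)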
Sometimes interviewers ask how to sort a million/billion 32-bit integers (e.g. here and here). I guess they expect the candidates to compare an O(N log N) sort with radix sort. For a million integers an O(N log N) sort is probably better, but for a billion they are probably about the same. Does that make sense?
If you get a question like this, they are not looking for the answer. What they are trying to do is see how you think through a problem. Do you jump right in, or do you ask questions about the project requirements?
One question you had better ask is, "How optimal a solution does the problem require?" Maybe a bubble sort of records stored in a file is good enough, but you have to ask. Ask what happens if the input changes to 64-bit numbers: should the sort process be easy to update? Ask how long the programmer has to develop the program.
Those types of questions show me that the candidate is wise enough to see there is more to the problem than just sorting numbers.
I expect they're looking for you to expand on the difference between internal sorting and external sorting. Apparently people don't read Knuth nowadays.
As aaaa bbbb said, it depends on the situation. You would ask questions about the project requirements. For example, if they want to count the ages of the employees, you could probably use a counting sort and sort the data in memory. But when the data are totally random, you probably need an external sort. For example, you can split the data of the source file into different files, each covering a unique range (File1 holds 0 to 1M, File2 holds 1M+1 to 2M, etc.), then sort every single file, and finally merge them into a new file.
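A tiny counting-sort sketch for the "ages of employees" case, where the key range is small and known:

def counting_sort_ages(ages, max_age=130):
    counts = [0] * (max_age + 1)
    for age in ages:
        counts[age] += 1          # one counting pass over the data
    result = []
    for age, count in enumerate(counts):
        result.extend([age] * count)
    return result

print(counting_sort_ages([33, 21, 64, 21, 45]))   # [21, 21, 33, 45, 64]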
Use a bit map. You need about 512 MB to represent the whole 32-bit integer range. For every integer in the given array, just set the corresponding bit. Then simply scan your bit map from left to right and get your integer array sorted. (Note that this only works if the integers are distinct, or if dropping duplicates is acceptable.)
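A minimal Python sketch of the bitmap approach, assuming the integers are distinct unsigned 32-bit values (the bitmap itself is 512 MB):

def bitmap_sort(numbers):
    bitmap = bytearray(2 ** 32 // 8)             # one bit per possible value
    for n in numbers:
        bitmap[n >> 3] |= 1 << (n & 7)           # set bit n

    result = []
    for byte_index, byte in enumerate(bitmap):
        if byte:                                  # skip empty bytes quickly
            for bit in range(8):
                if byte & (1 << bit):
                    result.append(byte_index * 8 + bit)
    return result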
It depends on the data structure they're stored in. Radix sort beats N-log-N sort on fairly small problem sizes if the input is in a linked list, because it doesn't need to allocate any scratch memory, and if you can afford to allocate a scratch buffer the size of the input at the beginning of the sort, the same is true for arrays. It's really only the wrong choice (for integer keys) when you have very limited additional storage space and your input is in an array.
I would expect the crossover point to be well below a million regardless.
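For reference, a rough sketch of an LSD radix sort on unsigned 32-bit integers, processing one byte per pass (4 passes total):

def radix_sort_u32(numbers):
    for shift in range(0, 32, 8):
        buckets = [[] for _ in range(256)]
        for n in numbers:
            buckets[(n >> shift) & 0xFF].append(n)   # stable bucketing by the current byte
        numbers = [n for bucket in buckets for n in bucket]
    return numbers

print(radix_sort_u32([3_000_000_000, 7, 42, 1_000_000]))
# [7, 42, 1000000, 3000000000]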