I have a huge CSV file which has 5,000 columns and 5,000,000 rows. I know that some columns in this file are exactly the same, and I want to identify them. Please note that I cannot load this huge file into memory, and runtime is also important.
Exactly the same?
I suppose you can verify it with hash functions.
Step 1 - Load the 5,000 values of the first row and calculate 5,000 hash values; exclude the columns whose hash matches no other column's hash.
Step 2 - Load the next row's values (for the surviving columns only) and, for each column, calculate the hash of the concatenation of the preceding hash with the loaded value; again exclude the columns whose hash matches no other.
Following steps: exactly as step 2 - load and concatenate/hash, excluding columns without matches; only one running hash per column ever needs to stay in memory.
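A minimal sketch of that scheme in Python (the file name and the use of Python's built-in hash are my assumptions; columns whose final hashes are equal should still be confirmed with one direct comparison pass, since hash collisions are possible):

import csv
from collections import defaultdict

def duplicate_column_groups(path):
    # Stream the file row by row, keeping only one running hash per column.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        hashes, candidates, groups = None, None, {}
        for row in reader:
            if hashes is None:
                hashes = [0] * len(row)
                candidates = set(range(len(row)))
            # Fold this row's value into each surviving column's hash.
            for c in candidates:
                hashes[c] = hash((hashes[c], row[c]))
            # Keep only columns whose hash still matches another column's.
            groups = defaultdict(list)
            for c in candidates:
                groups[hashes[c]].append(c)
            candidates = {c for g in groups.values() if len(g) > 1 for c in g}
            if not candidates:  # no duplicates possible, stop early
                return []
    return [g for g in groups.values() if len(g) > 1]

print(duplicate_column_groups("huge.csv"))  # e.g. [[2, 17], [40, 41, 99]]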
I need to reuse a dynamic array many times, as I consider that better for performance.
Hence, I don't want to create a new dynamic array every time I need one.
I want to ask whether it can lead to bugs or inefficiency if I use the same array for several instructions, then clear it and reuse it. And how can I correct my procedure so that it does what I need?
My code:
procedure Empty(var local_array: TArray<Integer>);
var
  i: Integer;
begin
  // Integer elements cannot be set to nil; reset them to 0 instead
  for i := 0 to High(local_array) do
    local_array[i] := 0;
  // The array must be passed as a var parameter of a dynamic-array type
  // such as TArray<Integer>, otherwise SetLength only affects a local copy
  SetLength(local_array, 0);
end;
If you want to reuse your array, don't mess with its size. Changing the size of an array, or more specifically increasing it, is what can lead to data reallocation.
What is array data reallocation?
In Delphi, all arrays must be stored in a contiguous memory block. This means that if you try to increase the size of your array and there is already some data after the memory block currently assigned to it, the whole array has to be moved to another memory location with enough free space to hold the new array size in one contiguous block.
So instead of resizing your array, leave its size alone and just set the array items back to some default value. Yes, this means the array will keep occupying its allocated memory, but that is the whole point of reusing it: you avoid the overhead of repeatedly allocating and deallocating its memory.
If you go this way, don't forget to store your own count of used items, since the array's length may be larger than the number of items actually in use.
I built a machine learning algorithm to predict the value Y'. For this, I used the log value of Y for data scaling.
Since I get the predicted Y' and the actual Y on the log scale, I have to convert both back with the exponential function.
BUT there is huge distortion for values above exp(7) ≈ 1096 (i.e. where ln(Y) > 7), and it produces a lot of MSE (error).
How can I avoid this huge distortion? (Generally, I need to get values over 1,000.)
Thanks!!
"For this, I used Log value of Y for data scaling."
That is not really scaling; the log is taken to make the target variable's distribution closer to normal.
If your MSE grows as the real target value grows, it means the model simply can't fit the big values well enough. Usually this can be solved by cleaning the data (removing outliers), or by picking another ML model.
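As a minimal sketch of the workflow in question (scikit-learn, the model choice and the synthetic data are my assumptions), training on log1p(Y) and converting back with expm1 shows how the original-scale MSE becomes dominated by the largest targets:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(2000, 3))
y = np.exp(X.sum(axis=1) / 3) * rng.lognormal(0, 0.1, size=2000)  # heavy right tail

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0)
model.fit(X_tr, np.log1p(y_tr))       # fit on the log scale
pred = np.expm1(model.predict(X_te))  # convert back with the exponential

# The squared error of the few largest targets dwarfs everything else.
print("MSE:", mean_squared_error(y_te, pred))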
UPDATE
You can run KFold and, for each fold, calculate the MSE/MAE between predicted and real values. Then take the cases with big errors and look at which parameters/features they have, as in the sketch below.
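A sketch of that fold-by-fold error inspection (reusing X, y and model from the snippet above):

from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

kf = KFold(n_splits=5, shuffle=True, random_state=0)
worst_rows = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], np.log1p(y[train_idx]))
    pred = np.expm1(model.predict(X[test_idx]))
    print("fold MAE:", mean_absolute_error(y[test_idx], pred))
    # Remember the ten worst-predicted rows of this fold for inspection.
    errors = np.abs(pred - y[test_idx])
    worst_rows.extend(test_idx[np.argsort(errors)[-10:]])
# Now inspect X[worst_rows] to see which feature values these cases share.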
You can eliminate the cases with big errors, but that is usually dangerous.
In general, a bad fit on big values means you did not remove outliers from your original dataset. Plot histograms and scatter plots and make sure you don't have any.
Check categorical variables: maybe some categories are rare (<= 5% of the data). If so, group them.
Or you may need to create two models: one for small values, one for big ones.
Say we have a list of strings and we can't load the entire list into memory, but we can load parts of the list from a file. What would be the best way to remove the duplicates?
One approach would be to use an external sort to sort the file, and then remove the duplicates with a single pass over the sorted list. This approach requires very little extra space and O(n log n) accesses to the disk.
Another approach is based on hashing: compute a hashcode for each string, and load a sublist that contains all strings whose hashcode falls in a specific range. It is guaranteed that if x is loaded and it has a duplicate, the duplicate will be loaded into the same bucket as well.
This requires O(n * #buckets) accesses to the disk, but might require more memory. You can invoke the procedure recursively (with different hash functions) if needed; a sketch of the bucket scheme follows.
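A rough sketch in Python (the file names and bucket count are my assumptions; each bucket must fit in memory, and since Python salts its string hash per process, both passes must run in the same process):

import os

def external_dedup(path, out_path, n_buckets=64):
    # Pass 1: scatter every line into a bucket file chosen by its hash.
    buckets = [open("bucket_%d.tmp" % i, "w") for i in range(n_buckets)]
    with open(path) as f:
        for line in f:
            buckets[hash(line) % n_buckets].write(line)
    for b in buckets:
        b.close()
    # Pass 2: duplicates always share a bucket, so dedup each one in memory.
    with open(out_path, "w") as out:
        for i in range(n_buckets):
            name = "bucket_%d.tmp" % i
            with open(name) as b:
                out.writelines(set(b))  # set() drops duplicate lines
            os.remove(name)

external_dedup("strings.txt", "strings_unique.txt")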
My solution would be to do a merge sort, which allows for external memory usage. After sorting, searching for duplicates is as easy as only ever comparing two neighbouring elements.
Example:
0: cat
1: dog
2: bird
3: cat
4: elephant
5: cat
Merge sort
0: bird
1: cat
2: cat
3: cat
4: dog
5: elephant
Then simply compare 0 & 1 -> no duplicate, so move forward.
1 & 2 -> duplicate, remove 1 (this could be as simple as overwriting it with an empty string to skip over later)
compare 2 & 3 -> remove 2
etc.
The reason for removing the first element of each matching pair (1, then 2) rather than the second is that it allows for a more efficient comparison: the survivor is always the element used in the next comparison, so you never have to skip over indices that have already been removed.
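In code, that single forward pass over the sorted data looks like this (a small Python sketch; in the external setting it would stream over the sorted file rather than a list):

def dedup_sorted(items):
    previous = object()  # sentinel that compares unequal to everything
    for item in items:
        if item != previous:  # only neighbours ever need comparing
            yield item
        previous = item

print(list(dedup_sorted(["bird", "cat", "cat", "cat", "dog", "elephant"])))
# ['bird', 'cat', 'dog', 'elephant']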
For example, our input file in.txt:
naturalistic 10
coppering 20
artless 30
after command: sort in.txt
artless 30
coppering 20
naturalistic 10
after command: sort -n -k 2 in.txt
naturalistic 10
coppering 20
artless 30
My question: how can I keep each line intact while sorting by a given column? I want the whole line to stay together while its position in the file changes. What algorithm or piece of code is useful here? Is this a matter of file reading or of the sorting facility?
Standard UNIX sort doesn't document which algorithm it uses. It may even choose a different algorithm depending on such things as the size of the input or the sort options.
The Wikipedia page on sorting algorithms lists many sorting algorithms you can choose from.
If you want a stable sort, there are plenty of options (the comparison table on the same Wikipedia page lists which ones are stable). In fact, any sorting algorithm can be made stable by tagging each data item with its original position in the input and breaking ties in the key-comparison function according to that position.
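GNU sort exposes exactly this behaviour through its -s/--stable option, which disables the last-resort whole-line comparison, e.g. sort -s -n -k 2 in.txt. The position-tagging trick itself looks like this as a short Python sketch (Python's built-in sort is already stable, so the tag is purely illustrative here):

lines = ["naturalistic 10", "coppering 20", "artless 30", "aardvark 20"]

# Decorate each line with its original position, sort on (key, position),
# then undecorate; lines with equal keys keep their original order.
tagged = list(enumerate(lines))
tagged.sort(key=lambda t: (int(t[1].split()[1]), t[0]))
print([line for _, line in tagged])
# ['naturalistic 10', 'coppering 20', 'aardvark 20', 'artless 30']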
Other than that, it's not exactly clear what you're asking: you demonstrate the use of sort with and without the -n and -k options, but why should that influence the actual choice of sorting algorithm...
I would just create a hash table of the strings, with the number as key and the string as value (I'm assuming they are unique); then for the plain sort command I'd sort based on the values, and for -n -k 2 I'd sort based on the keys.
The POSIX standard does not dictate which algorithm to use, so different Unix flavours may use different algorithms. GNU sort uses merge sort: http://en.wikipedia.org/wiki/Merge_sort
This question already has answers here: Algebra equation parser for java (5 answers)
My client wants to save an equation formula in a database (Oracle). In the formula they want to use abbreviations of the variable names (a field in the table containing the variables) as a descriptive way to see what the formula uses to calculate the result, but they also want to be able to calculate the result of the formula once all the variables have values.
This means if they change the formula later, the result has to reflect those changes. They have short and long formulas. e.g.
C=(A+B)/100
D=(E+F)/100
G=(3*C)+(4*D)/7
Do you know any reference to something similar to this?
I'm using JSP and Oracle, as stated before.
You are on your own here: Oracle will not help you much with parsing equations. For simple cases, you can iterate over the variables and their values using the SQL REPLACE function and see if that is good enough for you.
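The question targets JSP and Oracle, but the substitute-then-evaluate idea is easy to sketch; here it is in Python with the question's own formulas (the sample input values, the naive substring check, and the use of eval are all assumptions of the sketch; a real system should use a proper expression parser):

formulas = {
    "C": "(A+B)/100",
    "D": "(E+F)/100",
    "G": "(3*C)+(4*D)/7",
}
values = {"A": 40.0, "B": 60.0, "E": 10.0, "F": 90.0}  # assumed sample inputs

def evaluate(name):
    # Recursively resolve variables that are themselves stored formulas.
    scope = dict(values)
    for var in formulas:
        if var != name and var in formulas[name]:  # naive substring match
            scope[var] = evaluate(var)
    return eval(formulas[name], {"__builtins__": {}}, scope)

print(evaluate("G"))  # recomputed from the stored text each time, so later
                      # edits to the formula are reflected automatically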