Filtering indexes from dataset in python

Filtering indexes from dataset in python - numpy-ndarray

I have a dataset with shape (2000, 74, 64). How can I select indexes 2, 4, 12, 14, 22, 24 along the 2nd dimension (the one of size 74)? How can I do it using a for loop?

import numpy as np

a = np.array(...)  # your array, shape (2000, 74, 64)
b = np.zeros((2000, 6, 64))  # new array shape
j = 0  # next free position along the second axis of b
for i in range(a.shape[1]):
    if i in {2, 4, 12, 14, 22, 24}:
        b[:, j, :] = a[:, i, :]  # copy the selected slice
        j += 1
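For comparison, the same selection can be written without a loop by passing a list of indexes for the second axis. This is only a sketch: the array below is a small stand-in (20 rows instead of 2000), and the names `idx` and `b` are illustrative.

```python
import numpy as np

# Stand-in data: same layout as the question, but only 20 rows to keep it light.
a = np.arange(20 * 74 * 64).reshape(20, 74, 64)

idx = [2, 4, 12, 14, 22, 24]  # positions to keep along the second axis
b = a[:, idx, :]              # integer-array indexing copies the selected slices

print(b.shape)  # (20, 6, 64)
```

`np.take(a, idx, axis=1)` should give the same result if you prefer a function call over indexing syntax.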

Related

StatsBase.sample() can't draw without replacement if FrequencyWeights() are provided

I'm trying to sample without replacement using StatsBase.sample() in Julia. Because I have my data in the following form I can use my counts as FrequencyWeights():
using StatsBase
data = ["red", "blue", "green"]
counts = [2000, 2000, 1]
balls = StatsBase.sample(data, FrequencyWeights(counts), 1000)
One problem with this is that StatsBase.sample() implicitly sets replace=true so this is possible:
countmap(balls)
Dict("blue" => 478,
"green" => 2, # <= two green balls?
"red" => 520)
Explicitly setting replace=false throws an error.
balls = StatsBase.sample(data, FrequencyWeights(counts), 1000, replace=false)
Cannot draw 3 samples from 1000 samples without replacement.
error(::String)@error.jl:33
var"#sample!#174"(::Bool, ::Bool, ::typeof(StatsBase.sample!), ::Random._GLOBAL_RNG, ::Vector{String}, ::StatsBase.FrequencyWeights{Int64, Int64, Vector{Int64}}, ::Vector{String})@sampling.jl:858
#sample#175@sampling.jl:871[inlined]
#sample#176@sampling.jl:874[inlined]
top-level scope@Local: 2[inlined]
Is my only solution here to reformat my data into a wide form like this? That seems very inefficient, as my actual data set has a lot of counts:
wide_data = [fill("red", 2000)..., fill("blue", 2000)..., "green"]
sample(wide_data, 1000, replace=false)

You could use something like this:
function mysample(data::AbstractVector, counts::AbstractVector, n::Integer)
    @assert n <= sum(counts)
    @assert firstindex(data) == 1
    @assert firstindex(counts) == 1
    res = similar(data, n)
    fw = FrequencyWeights(copy(counts))
    for i in 1:n
        # Draw an index with probability proportional to the remaining counts
        j = sample(axes(data, 1), fw)
        res[i] = data[j]
        # Remove the drawn item from the pool
        fw.sum -= 1
        fw.values[j] -= 1
    end
    return res
end
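For readers more at home in Python, here is a rough sketch of the same idea: draw one item at a time with weights equal to the remaining counts, then decrement the count of the item drawn. The function name `sample_without_replacement` and its parameters are made up for illustration and are not part of StatsBase.

```python
import random
from collections import Counter

def sample_without_replacement(data, counts, n, rng=random):
    """Draw n items; item data[j] can be drawn at most counts[j] times."""
    if n > sum(counts):
        raise ValueError("cannot draw more items than the pool contains")
    remaining = list(counts)  # work on a copy so the caller's counts are untouched
    result = []
    for _ in range(n):
        # Pick an index with probability proportional to what is left of each item.
        j = rng.choices(range(len(data)), weights=remaining)[0]
        result.append(data[j])
        remaining[j] -= 1  # one fewer of that item in the pool
    return result

balls = sample_without_replacement(["red", "blue", "green"], [2000, 2000, 1], 1000,
                                   rng=random.Random(0))
print(Counter(balls))
```

On Python 3.9 or later, `random.sample(data, 1000, counts=[2000, 2000, 1])` should cover the same use case without a hand-written loop.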

Dynamic Programming: max sum through a 2 column list, left column skips a row

To preface, my English is not quite perfect so apologies for any mistakes.
The problem goes as follows:
Given a list of two choices in each column, determine the maximum sum possible from choosing only one of the two options in the column. The twist is: if the bottom of the two values is chosen, the next column is entirely skipped.
Example:
5
9 3 5 7 3
5 8 1 4 5
If you were to choose 5 initially from [9, 5], the column [3, 8] would be skipped. Whereas, if 9 was chosen, the next column would NOT be skipped and you could choose from [3, 8] (if 8 was chosen, the next column would be skipped and if 3 was chosen, it would not be, etc).

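When you attempt to solve this using DP, the most important aspect of the problem is to define the right states of the DP.
Define D[i][j] = maximum sum up to column i if you choose element j in column i, with j in [0, 1] (0 = top, 1 = bottom). Column i can only be chosen from if the previous pick was the top of column i-1 or the bottom of column i-2.
L = [
    [9, 3, 5, 7, 3],
    [5, 8, 1, 4, 5]
]
def find_max_sum(L):
    n = len(L[0])
    D = [[0, 0] for _ in range(n)]
    for i in range(n):
        # Best sum of any valid sequence of picks that leaves column i available
        reach = 0
        if i >= 1:
            reach = D[i-1][0]
        if i >= 2:
            reach = max(reach, D[i-2][1])
        D[i][0] = L[0][i] + reach
        D[i][1] = L[1][i] + reach
    # The walk ends in the last column, or at the bottom of the second-to-last one
    return max(D[-1] + ([D[-2][1]] if n > 1 else []))
print(find_max_sum(L))  # 29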

Let N be the number of columns; then we can store our choices in an array a (2 x N).
Let's define a function f(i, j) that gives the maximum possible sum from the first j+1 columns (0, 1, ..., j), where in the j-th column we pick either the first option (i = 0) or the second option (i = 1).
So the answer to this problem is the maximum of f(i, j) over i in [0, 1] and j in [0, 1, ..., N-1].
For each column j we need to handle 2 cases:
1- choosing the first option ---> f(0,j)
2- choosing the second option ---> f(1,j)
To calculate the values for column j (f(0,j) and f(1,j)) we need to consider these cases for the previous columns:
1- we picked the first option from column j-1
2- we did not pick any option from column j-1 (meaning we picked the second option from column j-2)
Note: we cannot have picked the second option from column j-1, because by the definition of f we are picking an option from column j.
Here is the code written in C++:
#include <algorithm>
#include <iostream>
using namespace std;

int main() {
    int OO = 1e6; // a big value
    // reading the input
    int a[2][100];
    int N; cin >> N;
    for (int i = 0; i < N; i++) cin >> a[0][i];
    for (int i = 0; i < N; i++) cin >> a[1][i];
    // initialization
    int f[2][100];
    f[0][0] = a[0][0];
    f[1][0] = a[1][0];
    int result = max(f[0][0], f[1][0]);
    for (int i = 1; i < N; i++) {
        int val1 = f[0][i - 1]; // first option picked in column i-1
        int val2 = -OO;         // second option picked in column i-2, if it exists
        if (i - 2 >= 0)
            val2 = f[1][i - 2];
        f[0][i] = max(val1, val2) + a[0][i];
        f[1][i] = max(val1, val2) + a[1][i];
        result = max(result, max(f[0][i], f[1][i]));
    }
    cout << result << endl;
}
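As a cross-check, the skip rule can also be written from the other end: the best sum starting at column i is the larger of "take the top and continue at i+1" and "take the bottom and continue at i+2". The sketch below assumes the two rows come as separate lists; `max_sum`, `top` and `bottom` are illustrative names, not taken from the answers above.

```python
def max_sum(top, bottom):
    """Best total when picking the bottom value skips the following column."""
    best_next = 0        # best sum starting at column i + 1
    best_after_next = 0  # best sum starting at column i + 2
    for t, b in zip(reversed(top), reversed(bottom)):
        current = max(t + best_next, b + best_after_next)
        best_next, best_after_next = current, best_next
    return best_next

print(max_sum([9, 3, 5, 7, 3], [5, 8, 1, 4, 5]))  # 29
```

It needs only two running values instead of a table, and it gives 29 on the example from the question (9, 8, 7, 5).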

Optimizing the algorithm to run under 4 seconds for a quite large number of operations

I have the following code, which solves one of the practice challenges from HackerRank.
There are up to 10^7 values to be created in a list, and each should then be incremented according to up to 10^5 queries; with console read time included, I need to finish within 4 seconds. The input (with queries) is laid out as follows.
The first line contains two numbers: the first (n) is the number of values in the list, and the second (m) is the number of queries that follow. Each line below is a query with 3 numbers: the first (a) and second (b) are indexes (starting from 1), and the third (k) is the value to be added to every list element from index a to index b. Finally, the maximum value in the list should be printed to the console.
private fun readLn() = readLine()!! // string line
private fun readStrings() = readLn().split(" ") // list of strings
private fun readInts() = readStrings().map { it.toInt() } // list of ints

fun main() {
    val (n, m) = readInts()
    val list = MutableList(n) { 0L }
    repeat(m) {
        val queries = readStrings()
        val a = queries[0].toInt() - 1
        val b = queries[1].toInt() - 1
        val k = queries[2].toLong()
        for (i in a..b) {
            list[i] += k
        }
    }
    println(list.max())
}
Currently it seems well optimized to me, but it still can't do all the operations within 4 seconds.
Any help would be appreciated. Thanks in advance!
Edit - After the answer provided by @Photon, I've modified the code, but even with that algorithm the time limit is still exceeded for the same test cases.
Here is the modified code -
private fun readLn() = readLine()!! // string line
private fun readStrings() = readLn().split(" ") // list of strings
private fun readInts() = readStrings().map { it.toInt() } // list of ints

fun main() {
    val (n, m) = readInts()
    val list = MutableList(n + 2) { 0L }
    repeat(m) {
        val queries = readStrings()
        val a = queries[0].toInt()
        val b = queries[1].toInt()
        val k = queries[2].toLong()
        list[a] += k
        list[b + 1] -= k
    }
    for (i in 1..n + 1) {
        list[i] = list[i - 1] + list[i]
    }
    println(list.max())
}

Brute force is simply too slow no matter how much you optimize it. Here's a simple array trick to solve this in O(N + Q) time:
First we have an array of zeroes of size N+2: A = [0, 0, 0, 0, ..., 0]
For a query L R K, instead of increasing all numbers in the interval, we increase A[L] by K and decrease A[R+1] by K.
Then, after all queries, we replace A[i] with A[i] + A[i-1] for all i in [1, N], in increasing order of i (a running sum).
This gives the same result as doing all queries.
It might be confusing, so here's an example:
N=5, so our initial array is: A = [0, 0, 0, 0, 0, 0, 0]
Let's say we have a query: 1 3 3
updated array: A = [0, 3, 0, 0, -3, 0, 0]
Let's say we have another query: 2 5 10
updated array: A = [0, 3, 10, 0, -3, 0, -10]
Now, after all queries, we add A[i-1] to A[i] for all i in [1, 5]
updated array: A = [0, 3, 13, 13, 10, 10, -10]
Notice that positions 1 to 5 are the same as after doing all queries by brute force (the last slot is only a spare cell for the -K of queries ending at N).
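The trick above can be sketched in a few lines of Python. The function name `max_after_range_updates` and the tuple layout of `queries` are assumptions made for this example; indexes are 1-based, as in the problem statement.

```python
from itertools import accumulate

def max_after_range_updates(n, queries):
    """Return the maximum value after applying every (a, b, k) range increment."""
    diff = [0] * (n + 2)  # slot 0 is unused, slot n + 1 absorbs the final -k
    for a, b, k in queries:
        diff[a] += k      # values start being k larger from index a
        diff[b + 1] -= k  # and stop being k larger after index b
    # The running sum over slots 1..n rebuilds the real values; take their maximum.
    return max(accumulate(diff[1:n + 1]))

print(max_after_range_updates(5, [(1, 3, 3), (2, 5, 10)]))  # 13
```

Each query costs O(1) and the final pass costs O(N), which matches the O(N + Q) bound stated above.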

Algorithm for the largest subarray of distinct values in linear time

I'm trying to come up with a fast algorithm for, given any array of length n, obtaining the largest subarray of distinct values.
For example, the largest subarray of distinct values of
[1, 4, 3, 2, 4, 2, 8, 1, 9]
would be
[4, 2, 8, 1, 9]
This is my current solution, I think it runs in O(n^2). This is because check_dups runs in linear time, and it is called every time j or i increments.
arr = [0,...,n]
i = 0
j = 1
i_best = i
j_best = j
while i < n-1 and j < n:
    if check_dups(arr, i, j): //determines if there's duplicates in the subarray i,j in linear time
        i += 1
    else:
        if j - i > j_best - i_best:
            i_best = i
            j_best = j
        j += 1
return subarray(arr, i_best, j_best)
Does anyone have a better solution, in linear time?
Please note this is pseudocode and I'm not looking for an answer that relies on specific existing functions of a defined language (such as arr.contains()).
Thanks!

Consider the problem of finding the largest distinct-valued subarray ending at a particular index j. Conceptually this is straightforward: starting at arr[j], you go backwards and include all elements until you find a duplicate.
Let's use this intuition to solve this problem for all j from 0 up to length(arr). We need to know, at any point in the iteration, how far back we can go before we find a duplicate. That is, we need to know the least i such that subarray(arr, i, j) contains distinct values. (I'm assuming subarray treats the indices as inclusive.)
If we knew i at some point in the iteration (say, when j = k), can we quickly update i when j = k+1? Indeed, if we knew when was the last occurrence of arr[k+1], then we can update i := max(i, lastOccurrence(arr[k+1]) + 1). We can compute lastOccurrence in O(1) time with a HashMap.
Pseudocode:
arr = ... (from input)
map = empty HashMap
i = 0
i_best = 0
j_best = 0
for j from 0 to length(arr) - 1 inclusive:
    if map contains-key arr[j]:
        i = max(i, map[arr[j]] + 1)
    map[arr[j]] = j
    if j - i > j_best - i_best:
        i_best = i
        j_best = j
return subarray(arr, i_best, j_best)
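A direct Python rendering of the pseudocode, using a dict as the hash map, might look like the sketch below. The name `longest_distinct_subarray` is illustrative, and the function returns the slice itself rather than calling a `subarray` helper.

```python
def longest_distinct_subarray(arr):
    """Return the longest contiguous run of pairwise distinct values."""
    last_seen = {}  # value -> index of its most recent occurrence
    i = i_best = j_best = 0
    for j, value in enumerate(arr):
        if value in last_seen:
            # The window may not reach back to the previous occurrence of value.
            i = max(i, last_seen[value] + 1)
        last_seen[value] = j
        if j - i > j_best - i_best:
            i_best, j_best = i, j
    return arr[i_best:j_best + 1]

print(longest_distinct_subarray([1, 4, 3, 2, 4, 2, 8, 1, 9]))  # [4, 2, 8, 1, 9]
```

Every element is visited once and each dict operation is O(1) on average, so the whole scan is expected linear time.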

We can adapt pkpnd's algorithm to use an array rather than a hash map, for an O(n log n) solution, or potentially O(n) if your data allows for an O(n) stable sort. You'd need a stable sorting function that also provides the original indexes of the elements.
1 4 3 2 4 2 8 1 9
0 1 2 3 4 5 6 7 8 (indexes)
Sorted:
1 1 2 2 3 4 4 8 9
0 7 3 5 2 1 4 6 8 (indexes)
--- ---   ---
Now, instead of a hash map, build a new array by iterating over the sorted array and recording, for each element, the index of its previous occurrence (or -1 if there is none); the runs of duplicates in the sorted array give this directly. The final array would look like:
 1  4  3  2  4  2  8  1  9
-1 -1 -1 -1  1  3 -1  0 -1 (previous occurrence)
We're now ready to run pkpnd's algorithm with a slight modification:
arr = ... (from input)
map = previous occurrence array
i = 0
i_best = 0
j_best = 0
for j from 0 to length(arr) - 1 inclusive:
    if map[j] >= 0:
        i = max(i, map[j] + 1)
    if j - i > j_best - i_best:
        i_best = i
        j_best = j
return subarray(arr, i_best, j_best)
JavaScript code:
function f(arr, map){
  let i = 0
  let i_best = 0
  let j_best = 0
  for (let j=0; j<arr.length; j++){
    if (map[j] >= 0)
      i = Math.max(i, map[j] + 1)
    if (j - i > j_best - i_best){
      i_best = i
      j_best = j
    }
  }
  return [i_best, j_best]
}
let arr = [ 1, 4, 3, 2, 4, 2, 8, 1, 9]
let map = [-1,-1,-1,-1, 1, 3,-1, 0,-1]
console.log(f(arr, map))
arr = [ 1, 2, 2, 2, 2, 2, 1]
map = [-1,-1, 1, 2, 3, 4, 0]
console.log(f(arr, map))
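The previous-occurrence array is hard-coded in the snippet above. One possible way to build it with a stable sort is sketched here in Python; `previous_occurrences` is an illustrative name, and the sketch relies on Python's `sorted` being stable, so equal values keep their original index order.

```python
def previous_occurrences(arr):
    """For each position, the index of the previous equal value, or -1 if none."""
    # Indexes ordered by value; ties stay in increasing index order (stable sort).
    order = sorted(range(len(arr)), key=lambda k: arr[k])
    prev = [-1] * len(arr)
    for earlier, later in zip(order, order[1:]):
        if arr[earlier] == arr[later]:
            prev[later] = earlier  # neighbours in a run of duplicates
    return prev

print(previous_occurrences([1, 4, 3, 2, 4, 2, 8, 1, 9]))
# [-1, -1, -1, -1, 1, 3, -1, 0, -1]
```

The sort dominates the cost, which is where the O(n log n) bound mentioned above comes from.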

We can use a hash table (Dictionary in C#):
public int[] FindSubarrayWithDistinctEntities(int[] arr)
{
    // Maps each value to the index of its most recent occurrence
    Dictionary<int, int> dic = new Dictionary<int, int>();
    int start = 0;      // start of the current window of distinct values
    int bestStart = 0;  // start of the longest window found so far
    int bestLength = 0; // length of the longest window found so far
    for (int i = 0; i < arr.Length; i++)
    {
        // If arr[i] already occurs inside the window, move the window start past it
        if (dic.TryGetValue(arr[i], out int last) && last >= start)
        {
            start = last + 1;
        }
        dic[arr[i]] = i;
        if (i - start + 1 > bestLength)
        {
            bestLength = i - start + 1;
            bestStart = start;
        }
    }
    // Requires System.Linq for Skip/Take/ToArray
    return arr.Skip(bestStart).Take(bestLength).ToArray();
}

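Add every number to a HashSet if it isn't already in it; when a number is already present, first remove elements from the start of the current subarray until it no longer is. A HashSet's insert and search are both O(1) on average, and each element is added and removed at most once, so the final result will be O(n).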
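A sketch of the hash-set idea as a two-pointer window, in Python. It assumes that when a duplicate shows up, the left edge of the window advances and drops elements from the set until the duplicate is gone; the name `longest_distinct_window` is made up for this example.

```python
def longest_distinct_window(arr):
    """Longest contiguous run of distinct values, tracked with a set."""
    seen = set()  # values currently inside the window arr[left:right + 1]
    left = best_left = best_len = 0
    for right, value in enumerate(arr):
        while value in seen:
            # Shrink from the left until value is no longer in the window.
            seen.remove(arr[left])
            left += 1
        seen.add(value)
        if right - left + 1 > best_len:
            best_left, best_len = left, right - left + 1
    return arr[best_left:best_left + best_len]

print(longest_distinct_window([1, 4, 3, 2, 4, 2, 8, 1, 9]))  # [4, 2, 8, 1, 9]
```

The inner loop looks quadratic, but every element enters and leaves the set at most once, so the total work stays linear on average.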

How to go about a d-smooth sequence algorithm

I'm really struggling to design an algorithm to find d, the smallest value such that adding or subtracting at most d to each element makes a given sequence strictly increasing.
For example, say seq[] = [2,4,8,3,1,12].
Given that sequence, the algorithm should return 5 as d, because you can add or subtract at most 5 to each element such that the sequence is strictly increasing.
I've tried several approaches and can't seem to get a solid technique down.
I've tried looping through the sequence and checking if seq[i] < seq[i+1]. If not, it checks if d > 0; if it is, it tries to add/subtract it from seq[i+1]. Otherwise it calculates d by taking the difference seq[i-1] - seq[i].
I can't get it to be stable, though, and it's like I keep adding if statements that are more "special cases" for unique input sequences. People have suggested using a binary search approach, but I can't make sense of applying it to this problem.
Any tips and suggestions are greatly appreciated. Thanks!
Here's my code in progress - using Python - v4
def ComputeMaxDelta3(seq):
# Create a copy to speed up comparison on modified values
aItems = seq[1:] #copies sequence elements from 1 (ignores seq[0])
# Will store the fix values for every item
# this should allocate 'length' times the 0 value
fixes = [0] * len(aItems)
print("fixes>>",fixes)
# Loop until no more fixes get applied
bNeedFix = True
while(bNeedFix):
# Hope will have no fix this turn
bNeedFix = False
# loop all subsequent item pairs (i should run from 0 to length - 2)
for i in range(0,len(aItems)-1):
# Left item
item1 = aItems[i]
# right item
item2 = aItems[i+1]
# Compute delta between left and right item
# We remember that (right >= left + 1
nDelta = item2 - (item1 + 1)
if(nDelta < 0):
# Fix the right item
fixes[i+1] -= nDelta
aItems[i+1] -= nDelta
# Need another loop
bNeedFix = True
# Compute the fix size (rounded up)
# max(s) should be int and the division should produce an int
nFix = int((max(fixes)+1)/2)
print("current nFix:",nFix)
# Balance all fixes
for i in range(len(aItems)):
fixes[i] -= nFix
print("final Fixes:",fixes)
print("d:",nFix)
print("original sequence:",seq[1:])
print("result sequence:",aItems)
return
Here's whats displayed:
Working with: [6, 2, 4, 8, 3, 1, 12]
[0]= 6 So the following numbers are the sequence:
aItems = [2, 4, 8, 3, 1, 12]
fixes>> [0, 0, 0, 0, 0, 0]
current nFix: 6
final Fixes: [-6, -6, -6, 0, 3, -6]
d: 1
original sequence: [2, 4, 8, 3, 1, 12]
result sequence: [2, 4, 8, 9, 10, 12]
d SHOULD be: 5
done!
~Note~
I start at 1 rather than 0 due to the first element being a key

As anticipated, here is (or should be) the Python version of my initial solution:
def ComputeMaxDelta(aItems):
    # Create a copy to speed up comparison on modified values
    aItems = aItems[:]
    # Will store the fix values for every item
    # this should allocate 'length' times the 0 value
    fixes = [0] * len(aItems)
    # Loop until no more fixes get applied
    bNeedFix = True
    while(bNeedFix):
        # Hope will have no fix this turn
        bNeedFix = False
        # loop all subsequent item pairs (i should run from 0 to length - 2)
        for i in range(0,len(aItems)-1):
            # Left item
            item1 = aItems[i]
            # right item
            item2 = aItems[i+1]
            # Compute delta between left and right item
            # We remember that (right >= left + 1)
            nDelta = item2 - (item1 + 1)
            if(nDelta < 0):
                # Fix the right item
                fixes[i+1] -= nDelta
                aItems[i+1] -= nDelta
                # Need another loop
                bNeedFix = True
    # Compute the fix size (rounded up)
    # // is integer division, so nFix stays an int
    nFix = (max(fixes)+1)//2 # corrected from **(max(s)+1)/2**
    # Balance all fixes
    for i in range(len(aItems)):
        fixes[i] -= nFix
    print("d:",nFix) # corrected from **print("d:",nDelta)**
    print("s:",fixes)
    return
I took your Python code and fixed it so that it operates exactly like my C# solution.
I don't know Python, but by looking up some references on the web I should have found the points where your port was failing.
If you compare your Python version with mine, you should find the following differences:
1. You saved a reference to aItems in s and used it as my fixes, but fixes was meant to start as all 0.
2. You didn't clone aItems over itself, so every alteration to its items was reflected outside of the method.
3. Your for loop was starting at index 1, whereas mine started at 0 (the very first element).
4. After the check for nDelta you subtracted nDelta from both s and aItems, but as I stated at points 1 and 2 they were pointing to the same items.
5. The ceil instruction was unneeded, because integer division (the // operator in Python 3) of two integers produces an integer, as with / in C#.
Please remember that I fixed the Python code based only on online documentation, because I don't code in that language, so I'm not 100% sure about some of the syntax (my main doubt is about the fixes declaration).
Regards,
Daniele.

Here is my solution:
public static int ComputeMaxDelta(int[] aItems, out int[] fixes)
{
    // Create a copy to speed up comparison on modified values
    aItems = (int[])aItems.Clone();
    // Will store the fix values for every item
    fixes = new int[aItems.Length];
    // Loop until no more fixes get applied
    var bNeedFix = true;
    while (bNeedFix)
    {
        // Hope will have no fix this turn
        bNeedFix = false;
        // loop all subsequent item pairs
        for (int ixItem = 0; ixItem < aItems.Length - 1; ixItem++)
        {
            // Left item
            var item1 = aItems[ixItem];
            // right item
            var item2 = aItems[ixItem + 1];
            // Compute delta between left and right item
            // We remember that (right >= left + 1)
            var nDelta = item2 - (item1 + 1);
            if (nDelta < 0)
            {
                // Fix the right item
                fixes[ixItem + 1] -= nDelta;
                aItems[ixItem + 1] -= nDelta;
                // Need another loop
                bNeedFix = true;
            }
        }
    }
    // Compute the fix size (rounded up)
    var nFix = (fixes.Max() + 1) / 2;
    // Balance all fixes
    for (int ixItem = 0; ixItem < aItems.Length; ixItem++)
        fixes[ixItem] -= nFix;
    return nFix;
}
The function returns the maximum computed fix gap.
As a bonus, the parameter fixes will receive the fixes for every item. These are the deltas to apply to each source value to be sure that they end up in ascending order; some fixes can be reduced, but an additional analysis loop would be required to achieve that optimization.
The following is code to test the algorithm. If you set a breakpoint at the end of the loop, you'll be able to check the result for the sequence you provided in your example.
var random = new Random((int)Stopwatch.GetTimestamp());
for (int ixLoop = -1; ixLoop < 100; ixLoop++)
{
    int nCount;
    int[] aItems;
    // special case as the provided sample sequence
    if (ixLoop == -1)
    {
        aItems = new[] { 2, 4, 8, 3, 1, 12 };
        nCount = aItems.Length;
    }
    else
    {
        // Generates a random amount of items based on my screen's width
        nCount = 4 + random.Next(21);
        aItems = new int[nCount];
        for (int ixItem = 0; ixItem < nCount; ixItem++)
        {
            // Keep the generated numbers below 30 for easier human analysis
            aItems[ixItem] = random.Next(30);
        }
    }
    Console.WriteLine("***");
    Console.WriteLine(" # " + GetText(Enumerable.Range(0, nCount).ToArray()));
    Console.WriteLine(" " + GetText(aItems));
    int[] aFixes;
    var nFix = ComputeMaxDelta(aItems, out aFixes);
    // Computes the new values, that will be always in ascending order
    var aNew = new int[aItems.Length];
    for (int ixItem = 0; ixItem < aItems.Length; ixItem++)
    {
        aNew[ixItem] = aItems[ixItem] + aFixes[ixItem];
    }
    Console.WriteLine(" = " + nFix.ToString());
    Console.WriteLine(" ! " + GetText(aFixes));
    Console.WriteLine(" > " + GetText(aNew));
}
Regards,
Daniele.
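The question mentions a binary search approach without showing how it would apply. One possible reading, sketched in Python under the assumption that the values are integers: for a candidate d, walk left to right and give each element the smallest allowed value that is still larger than the previous one; if that exceeds x + d, the candidate is too small. Since a larger d is never worse, the smallest feasible d can be found by bisection. This is a different technique from the fix-and-balance solution above, and the names `is_feasible` and `smallest_d` are made up for this sketch.

```python
def is_feasible(seq, d):
    """Can every x move by at most d so that the result is strictly increasing?"""
    prev = None
    for x in seq:
        # Smallest value allowed for x that still exceeds the previous one.
        value = x - d if prev is None else max(x - d, prev + 1)
        if value > x + d:
            return False
        prev = value
    return True

def smallest_d(seq):
    """Binary search for the smallest d accepted by is_feasible."""
    if not seq:
        return 0
    lo, hi = 0, max(seq) - min(seq) + len(seq)  # hi is always large enough
    while lo < hi:
        mid = (lo + hi) // 2
        if is_feasible(seq, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

print(smallest_d([2, 4, 8, 3, 1, 12]))  # 5
```

Each feasibility check is a single pass, so the total cost is O(n log R), where R is the size of the search range.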
