Subsetting Data with GREP - bash

I have a very large text file (16GB) that I want to subset as fast as possible.
Here is a sample of the data involved
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
2 M 15 1
2 0 Q 0 17143989 4219157,1841361,853923,1720163,1912374,1755325,4454730 65548702,4975721 197782,39086 54375043,4396765 31589696,3091097 6876504,851594 3374640,455375 13274885,1354902 31585771,3091016 61234218,4723345 31583582,3091014
2 27 C 0 31589696
The first number on every line is a sessionID, and any line with an 'M' denotes the start of a session (data is grouped by session). The number following the M is a Day and the number after that is a userID; a user can have multiple sessions.
I want to extract all lines related to a specific user which for each session include all of the lines up until the next 'M' line is encountered (can be any number of lines). As a second task I also want to extract all session lines related to a specific day.
For example with the above data, to extract the records for userid '0' the output would be:
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
To extract the records for day 7 the output would be:
1 M 7 0
1 0 Q 0 17143989
I believe there is a much simpler and more elegant solution than what I have achieved so far, and it would be great to get some feedback and suggestions. Thank you.
What I have tried
I tried to use pcregrep -M to apply this pattern directly (matching data between two M's) but struggled to get it working across line breaks. I still suspect this may be the fastest option, so any guidance on whether this may be possible would be great.
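For reference, this is roughly the kind of multiline pattern I have been attempting (untested sketch; $user_id is a shell variable, and long sessions would probably need a larger --buffer-size):
# match an M line for the user, then every following line until the next M line
pcregrep -M "^\d+\tM\t\d+\t${user_id}\n(?:(?!\d+\tM\t).*\n?)*" trainSample.txt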
The next part is quite scattered and it is not necessary to read on if you already have an idea for a better solution!
Failing the above, I split the problem into two parts:
Part 1: Isolating all 'M' lines to obtain a list of sessions which belong to that user/day
The grep method is fast (but I then need to figure out how to use this data):
time grep -c "M\t.*\t$user_id" trainSample.txt >> sessions.txt
The awk method to create an array is slow:
time myarr=($(awk -v uid="$user_id" '$2=="M" && $4==uid {print $1}' trainSample.txt))
Part 2: Extracting all lines belonging to a session on the list created in part 1
Continuing from the awk method, I ran grep once for each session ID, but this is WAY too slow (it would take days to complete on 16GB):
for i in "${!myarr[#]}";
do
grep "^${myarr[$i]}\t" trainSample.txt >> sessions.txt
echo -ne "Session $i\r"
done
Instead of running grep once per session ID as above, putting them all into one grep command is MUCH faster (I ran it with 8 sessionIDs in a [1|2|3|..|8] format and it took the same time as a single ID did, i.e. 8x faster). However, I then need to figure out how to build that pattern dynamically, as sketched below.
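One way to build that pattern dynamically might be to join the array into a single alternation (untested sketch; assumes myarr holds the session IDs collected in Part 1):
pattern=$(IFS='|'; echo "${myarr[*]}")            # e.g. 1|2|3|...|8
grep -E "^(${pattern})"$'\t' trainSample.txt >> sessions.txt
If the list of session IDs gets very long, writing one anchored pattern per line to a file and using grep -E -f patterns.txt may scale better than a single huge alternation.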
Update
I have actually established a working solution which only takes seconds to complete, but it is somewhat messy and inflexible bash code which I have yet to extend to the second case (isolating by days).

I want to extract all lines related to a specific user which for each session include all of the lines up until the next 'M' line is encountered (can be any number of lines).
$ awk '$2=="M"{p=$4==0}p' file
0 M 4 0
0 0 Q 0 10047345 3080290,4098689 50504886,4217515 9848058,1084315 50534229,4217515 50591618,4217515 26242582,2597528 34623075,3279130 68893581,5149883 50628761,4217517 32262001,3142702 35443881,3339757
0 108 C 0 50628761
0 1080 C 0 50628761
1 M 7 0
1 0 Q 0 17143989
As a second task I also want to extract all session lines related to a specific day.
$ awk '$2=="M"{p=$3==7}p' file
1 M 7 0
1 0 Q 0 17143989
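The same approach can presumably be parameterised by passing the user or day in with -v (untested variant of the answer above):
awk -v uid="$user_id" '$2=="M"{p=($4==uid)}p' file
awk -v day="$day"     '$2=="M"{p=($3==day)}p' file
Each M line turns the print flag p on or off for the block that follows, so only the sessions belonging to the requested user or day are printed.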

Related

Set-up animation time in paraview from one column

I have a .txt file containing columns of data. At each row, I have 5 columns, containing object id, time, x position, y position, z position.
Example:
ID Time X Y Z
0 0 0 0 0
1 0 1 0 0
3 0 0 1 0
0 1 0 0 0
1 1 0 1 0
I don't know how to set up ParaView to read the column Time as the Time in the animation.
I know it's possible with multiple files (one for each time step), but is it possible with one single file containing all the time steps as shown above?
Thanks for any advice.
Not supported in ParaView (latest release 5.6.0) by default, but it should be quite easy to do with a live programmable source.
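Roughly speaking, the usual pattern is to declare the time steps in the source's RequestInformation Script and then, in the main Script, read only the rows whose Time column matches the requested time. An untested sketch (the file path, column order and the vtkPolyData output type are assumptions):
RequestInformation Script:
executive = self.GetExecutive()
outInfo = executive.GetOutputInformation(0)
timesteps = [0.0, 1.0]   # in practice, collect the unique values of the Time column
outInfo.Remove(executive.TIME_STEPS())
for t in timesteps:
    outInfo.Append(executive.TIME_STEPS(), t)
outInfo.Remove(executive.TIME_RANGE())
outInfo.Append(executive.TIME_RANGE(), timesteps[0])
outInfo.Append(executive.TIME_RANGE(), timesteps[-1])
Script (with the output type set to vtkPolyData):
import numpy as np
from paraview import vtk
executive = self.GetExecutive()
t = executive.GetOutputInformation(0).Get(executive.UPDATE_TIME_STEP())
data = np.loadtxt('/path/to/points.txt', skiprows=1)   # columns: id, time, x, y, z
points = vtk.vtkPoints()
for row in data[data[:, 1] == t]:
    points.InsertNextPoint(row[2], row[3], row[4])
output = self.GetPolyDataOutput()
output.SetPoints(points)
# to actually see the points you may also need to add one vertex cell per point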

Creating co-occurrence matrix in SAS

All, thanks to the amazing help and camaraderie at Stack Exchange, I can now build and do further analysis using the co-occurrence matrix R code that was discussed in my original thread: Creating Co-Occurrence Matrix.
I am now dealing with a massive data set that could only be processed on a server, and I am using SAS Studio to analyse it and thus, I will have to do the co-occurrence analysis using SAS. I would really appreciate any help from SAS experts out there, as my SAS programming techniques are limited. I am trying to do it in the SAS Studio environment.
So, essentially - I have a massive SAS .sav file of households and items, and I want to see a matrix of the number of households where items appear together. Taking the same example from my earlier thread, essentially I have a table containing the following:
HHID Items Quant
HH1 A 3
HH1 B 1
HH1 C 1
HH2 E 3
HH2 B 1
HH3 B 1
HH3 C 4
HH4 D 1
HH4 E 1
HH4 A 1
HH5 F 5
HH5 B 3
HH5 C 2
HH5 D 1, etc.
The output needed is something like this:
A B C D E F
A 0 1 1 0 1 1
B 1 0 3 1 1 0
C 1 3 0 1 0 0
D 1 1 1 0 1 1
E 1 1 0 1 0 0
F 0 1 1 1 0 0
I see that there is a macro out there that does market basket analysis already, and although the output is not in this format, I can work with it as well. It's just too bad that the website doesn't exist anymore, so any help is much appreciated.
Thank you.
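For concreteness, I imagine the counting step could be something like a self-join on HHID, though I have not tested this (assuming the table is called have):
proc sql;
   create table pairs as
   select a.Items as item1,
          b.Items as item2,
          count(distinct a.HHID) as n_households
   from have as a
        inner join have as b
        on a.HHID = b.HHID and a.Items ne b.Items
   group by a.Items, b.Items;
quit;
I guess the resulting pairwise table could then be pivoted into the square layout shown above with PROC TRANSPOSE, but I am not sure of the details.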

Connected component labeling in matrix

I'm trying to do the following
Given the following matrix (where 1's are empty cells and 0's are obstacles):
0 0 1 1
1 0 0 0
1 0 1 1
1 1 0 0
I want it to become like this:
0 0 1 1
2 0 0 0
2 0 2 2
2 2 0 0
What I need to do is to label all connected components (free spaces).
What I have already tried is to write a function called isConnected() which takes the indices of two cells and checks whether there is a connected path between them. By repeating this function on every pair of empty cells in the matrix I can label all connected spaces, but this algorithm has a bad time complexity (n^2 * n^2 * O(isConnected())), so I would prefer to use something else.
I hope these pictures will explain better what I'm trying to accomplish:
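For reference, I gather the standard alternative is a single pass that starts a flood fill (BFS/DFS) whenever it reaches an unlabeled empty cell. A rough, untested Python sketch of that idea (names are my own):
from collections import deque

def label_components(grid):
    """Label connected free cells (1s) in place; obstacles (0s) stay 0."""
    rows, cols = len(grid), len(grid[0])
    next_label = 2                      # 0 = obstacle, 1 = still unlabeled
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1:
                continue
            # breadth-first flood fill from this unlabeled free cell
            queue = deque([(r, c)])
            grid[r][c] = next_label
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx] == 1:
                        grid[ny][nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return grid
This sketch starts labels at 2 so they cannot collide with the unlabeled value 1, and uses 4-neighbour connectivity; my example above appears to also treat diagonal neighbours as connected, in which case the four diagonal offsets just need to be added to the offsets tuple.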

Re-sort a vector after a small number of elements have been modified

If we have a vector of size N that was previously sorted, and replace up to M elements with arbitrary values (where M is much smaller than N), is there an easy way to re-sort them at lower cost (i.e. generate a sorting network of reduced depth) than a full sort?
For example if N=10 and M=2 the input might be
10 20 30 40 999 60 70 80 90 -1
Note: the indices of the modified elements are not known (until we compare them with the surrounding elements.)
Here is an example where I know the solution because the input size is small and I was able to find it with a brute-force search:
if N = 5 and M is 1, these would be valid inputs (each group of five digits is one input vector):
0 0 0 0 0   0 0 1 0 0   0 1 0 0 0   0 1 1 1 0   1 0 0 1 1   1 1 1 1 0
0 0 0 0 1   0 0 1 0 1   0 1 0 0 1   0 1 1 1 1   1 0 1 1 1   1 1 1 1 1
0 0 0 1 0   0 0 1 1 0   0 1 0 1 1   1 0 0 0 0   1 1 0 1 1
0 0 0 1 1   0 0 1 1 1   0 1 1 0 1   1 0 0 0 1   1 1 1 0 1
For example the input may be 0 1 1 0 1 if the previously sorted vector was 0 1 1 1 1 and the 4th element was modified, but there is no way to form 0 1 0 1 0 as a valid input, because it differs in at least 2 elements from any sorted vector.
This would be a valid sorting network for re-sorting these inputs:
>--*---*-----*-------->
   |   |     |
>--*---|-----|-*---*-->
       |     | |   |
>--*---|-*---*-|---*-->
   |   | |     |
>--*---*-|-----*---*-->
         |         |
>--------*---------*-->
We do not care that this network fails to sort some invalid inputs (e.g. 0 1 0 1 0.)
And this network has depth 4, a saving of 1 compared with the general case (a depth of 5 is generally necessary to sort a 5-element vector).
Unfortunately the brute-force approach is not feasible for larger input sizes.
Is there a known method for constructing a network to re-sort a larger vector?
My N values will be in the order of a few hundred, with M not much more than √N.
Ok, I'm posting this as an answer since the comment restriction on length drives me nuts :)
You should try this out:
implement a simple sequential sort working on local memory (insertion sort or something similar). If you don't know how - I can help with that.
have only a single work-item perform the sorting on the chunk of N elements
calculate the maximum size of local memory per work-group (call clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE) and derive the maximum number of work-items per work-group,
because with this approach your number of work-items will most likely be limited by the amount of local memory.
This will probably work rather well I suspect, because:
a simple sort may be perfectly fine, especially since the array is already sorted to a large degree
parallelizing for such a small number of items is not worth the trouble (using local memory however is!)
since you're processing billions of such small arrays, you will achieve a great occupancy even if only single work-items process such arrays
Let me know if you have problems with my ideas.
EDIT 1:
I just realized I used a technique that may be confusing to others:
My proposal for using local memory is not for synchronization or using multiple work items for a single input vector/array. I simply use it to get a low read/write memory latency. Since we use rather large chunks of memory I fear that using private memory may cause swapping to slow global memory without us realizing it. This also means you have to allocate local memory for each work-item. Each work-item will access its own chunk of local memory and use it for sorting (exclusively).
I'm not sure how good this idea is, but I've read that using too much private memory may cause swapping to global memory and the only way to notice is by looking at the performance (not sure if I'm right about this).
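To make that concrete, here is a rough, untested kernel sketch of what I mean: one work-item per vector, each sorting its own slice of local memory with an insertion sort (the scratch buffer would be allocated on the host as local_work_size * n floats):
__kernel void resort_chunks(__global float *data, __local float *scratch, const int n)
{
    const int gid = get_global_id(0);
    __local float *chunk = scratch + get_local_id(0) * n;

    /* copy this work-item's vector into its own slice of local memory */
    for (int i = 0; i < n; ++i)
        chunk[i] = data[gid * n + i];

    /* insertion sort -- cheap here because the vector is already nearly sorted */
    for (int i = 1; i < n; ++i) {
        float key = chunk[i];
        int j = i - 1;
        while (j >= 0 && chunk[j] > key) {
            chunk[j + 1] = chunk[j];
            --j;
        }
        chunk[j + 1] = key;
    }

    /* write the sorted vector back to global memory */
    for (int i = 0; i < n; ++i)
        data[gid * n + i] = chunk[i];
}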
Here is an algorithm which should yield very good sorting networks. Probably not the absolute best network for all input sizes, but hopefully good enough for practical purposes.
1. store (or have available) pre-computed networks for n < 16
2. sort the largest 2^k elements with an optimal network, e.g. bitonic sort for the largest power of 2 less than or equal to n
3. for the remaining elements, repeat #2 until m < 16, where m is the number of unsorted elements
4. use a known optimal network from #1 to sort any remaining elements
5. merge sort the smallest and second-smallest sub-lists using a merge sorting network
6. repeat #5 until only one sorted list remains
All of these steps can be done artificially, and the comparisons stored into a master network instead of acting on the data.
It is worth pointing out that the (bitonic) networks from #2 can be run in parallel, and the smaller ones will finish first. This is good, because as they finish, the networks from #5-6 can begin to execute.
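To illustrate what I mean by storing the comparisons into a master network rather than acting on the data, here is a rough, untested Python sketch that records the comparators of a bitonic network (step #2 for a power-of-two block) and then replays an arbitrary comparator list:
def bitonic_network(n):
    """Comparator list (lo, hi) for a bitonic sorting network on n = 2^k wires."""
    comparators = []
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # comparator direction depends on which bitonic half wire i is in
                    if i & k == 0:
                        comparators.append((i, partner))   # smaller value ends up at i
                    else:
                        comparators.append((partner, i))   # smaller value ends up at partner
            j //= 2
        k *= 2
    return comparators

def apply_network(comparators, values):
    """Run a recorded comparator network over a list of values."""
    values = list(values)
    for lo, hi in comparators:
        if values[lo] > values[hi]:
            values[lo], values[hi] = values[hi], values[lo]
    return values

print(apply_network(bitonic_network(8), [10, 20, 30, 40, 999, 60, 70, -1]))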

Form a Matrix From a Large Text File Quickly

Hi, I am struggling to read data from a file quickly enough (I left it running for 4 hrs and then it crashed); there must be a simpler way.
The text file looks similar like this:
From To
1 5
3 2
2 1
4 3
From this I want to form a matrix so that there is a 1 at the corresponding [m,n] position.
The current code is:
function [z] = reed(A)
[m,n] = size(A);
i = 1;
while (i <= n)
    z(A(1,i),A(2,i)) = 1;
    i = i + 1;
end
Which output the following matrix, z:
z =
0 0 0 0 1
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
My actual file has 280,000,000 links to and from, and this code is too slow for a file of that size. Does anybody know a much faster way to do this in MATLAB?
thanks
You can do something along the lines of the following:
>> A = zeros(4,5);
>> B = importdata('testcase.txt');
>> A(sub2ind(size(A),B.data(:,1),B.data(:,2))) = 1;
My test case, 'testcase.txt' contains your sample data:
From To
1 5
3 2
2 1
4 3
The result would be:
>> A
A =
0 0 0 0 1
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
EDIT - 1
After taking a look at your data, it seems that even if you modify this code appropriately, you may not have enough memory to execute it as the matrix A would become too large.
As such, you can use sparse matrices to achieve the same as given below:
>> B = importdata('web-Stanford.txt');
>> A = sparse(B.data(:,1),B.data(:,2),1,max(max(B.data)),max(max(B.data)));
This would be the approach I'd recommend as your A matrix will have a size of [281903,281903] which would usually be too large to handle due to memory constraints. A sparse matrix on the other hand, maintains only those matrix entries which are non-zero, thus saving on a lot of space. In most cases, you can use sparse matrices more-or-less as you use normal matrices.
More information about the sparse command is given here.
EDIT - 2
I'm not sure why it isn't working for you. Here's a screenshot of how I did it in case that helps:
EDIT - 3
It seems that you're getting a double matrix in B while I'm getting a struct. I'm not sure why this is happening; I can only speculate that you deleted the header lines from the input file before you used importdata.
Basically it's just that my B.data is the same as your B. As such, you should be able to use the following instead:
>> A = sparse(B(:,1),B(:,2),1,max(max(B)),max(max(B)));
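If importdata itself becomes a bottleneck on the full 280,000,000-line file, a textscan-based read might also be worth trying (untested sketch, assuming the two-column format with one header line):
>> fid = fopen('web-Stanford.txt');
>> C = textscan(fid, '%f %f', 'HeaderLines', 1);   % C{1} = from, C{2} = to
>> fclose(fid);
>> n = max(max(C{1}), max(C{2}));
>> A = sparse(C{1}, C{2}, 1, n, n);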
