Count matches across files [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 days ago.
I am stuck on an awk-related problem: counting matching occurrences. I have a file containing a list of IDs (numbers and/or characters) as well as another file containing a different ID, with a second column holding a collection of IDs from the first file:
File 1:
| ID1 |
| --- |
| 1 |
| 2 |
| 5 |
| 7 |
File 2:
| ID2 | ID1_collection |
| -------- | -------- |
| 1 | 1,2,3 |
| 2 | 1 |
| 3 | 4 |
| 4 | |
| 5 | 5 |
| 6 | |
The column with the collection doesn't have to be filled or match any of the IDs present in the first file. The goal is a file that looks like this:
| ID2 | ID1_collection | count |
| -------- | -------- | -------- |
| 1 | 1,2,3 | 2 |
| 2 | 1 | 1 |
| 3 | 4 | 0 |
| 4 | | 0 |
| 5 | 5 | 1 |
| 6 | | 0 |
However, I cannot work out the logic for an awk script that goes through the whole ID column of file 1 and counts how many of those IDs are present in each collection.
I thought I could create an array containing all ID1 values and split each ID1_collection string at the separator used inside the column (the global separator is "|") to look for exact matches. But I cannot figure out a) how efficient this would be (I guess not very) and b) how to write the syntax in a reasonable fashion...
Any help would be appreciated.

An approach using awk
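% awk 'NR == FNR { if (FNR > 1) ids[$1]; next }   # file1: remember each ID1, skipping the header
FNR == 1 { print $0, "count"; next }              # file2 header: append the new column name
{ n = split($2, a, ","); y = 0                    # split the collection on commas
  for (j = 1; j <= n; j++) if (a[j] in ids) y++   # count the members that are known IDs
  print $0, y }' file1 file2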
ID2 ID1_collection count
1 1,2,3 2
2 1 1
3 4 0
4 "" 0
5 5 1
6 "" 0
Data
% cat file1 file2
ID1
1
2
5
7
ID2 ID1_collection
1 1,2,3
2 1
3 4
4 ""
5 5
6 ""
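For comparison, the same counting logic can be sketched in Python. This is only an illustration, not part of the awk answer: the function name `count_matches` is made up here, and the sample data is copied from the question. It uses a set for the ID1 values so each membership test does not need an inner loop.

```python
def count_matches(known_ids, collection):
    """Count how many comma-separated entries of `collection` are in `known_ids`."""
    if not collection:
        return 0  # an empty collection matches nothing
    return sum(1 for item in collection.split(",") if item.strip() in known_ids)


# Sample data copied from the question (file 1 and file 2).
ids = {"1", "2", "5", "7"}
rows = [("1", "1,2,3"), ("2", "1"), ("3", "4"), ("4", ""), ("5", "5"), ("6", "")]

# Append the count as a third column, mirroring the desired output table.
counts = [(id2, coll, count_matches(ids, coll)) for id2, coll in rows]
for row in counts:
    print(*row)
```

The set lookup plays the same role as the `in` test on an awk array: the cost per row depends on the length of the collection, not on the number of IDs in file 1.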

Related

Concatenate two or more rows from result into single result on CI activerecord

I have a situation like this: I want to get values from the database (the values are comma-delimited) from more than one row, based on the month and year that I choose. For more detail, see below.
My Schedule.sql :
+---+------------+-------------------------------------+
|ID |Activ_date | Do_skill |
+---+------------+-------------------------------------+
| 1 | 2020-10-01 | Accountant,Medical,Photograph |
| 2 | 2020-11-01 | Medical,Photograph,Doctor,Freelancer|
| 3 | 2020-12-01 | EO,Teach,Scientist |
| 4 | 2021-01-01 | Engineering, Freelancer |
+---+------------+-------------------------------------+
My skillqmount.sql :
+----+------------+------------+-------+
|ID |Date_skill |Skill |Price |
+----+------------+------------+-------+
| 1 | 2020-10-02 | Accountant | $ 5 |
| 2 | 2020-10-03 | Medical | $ 7 |
| 3 | 2020-10-11 | Photograph | $ 5 |
| 4 | 2020-10-12 | Doctor | $ 9 |
| 5 | 2020-10-01 | Freelancer | $ 7 |
| 6 | 2020-10-04 | EO | $ 4 |
| 7 | 2020-10-05 | Teach | $ 4 |
| 8 | 2020-11-02 | Accountant | $ 5 |
| 9 | 2020-11-03 | Medical | $ 7 |
| 10 | 2020-11-11 | Photograph | $ 5 |
| 11 | 2020-11-12 | Doctor | $ 9 |
| 12 | 2020-11-01 | Freelancer | $ 7 |
+----+------------+------------+-------+
On my website I want to do a calculation with those two tables. For example, I want to see the total amount for the dates from 2020-10-01 until 2020-11-01:
Output example
+----+-----------+-----------+---------+
|No  |Date Start |Date End   |T.Amount |
+----+-----------+-----------+---------+
|1   |2020-10-01 |2020-11-01 |$ 45     | <= this amount came from $5+$7+$5+$7+$5+$9+$7
+----+-----------+-----------+---------+
Note :
Date Start : Input->post("A")
Date End : Input->post("B")
T.Amount : Total Amount based input A and B (on date)
I tried this code to get it :
<?php
$startd = $this->input->post('A');
$endd= $this->input->post('B');
$chck = $this->db->select('Do_skill')
->where('Activ_date >=',$startd)
->where('Activ_date <',$endd)
->get('Schedule')
->row('Do_skill');
$dcek = $this->Check_model->comma_separated_to_array($chck);
$t_amount = $this->db->select_sum('price')
->where('Date_skill >=',$startd)
->where('Date_skill <',$endd)
->where_in('Skill',$dcek)
->get('skillqmount')
->row('price');
echo $t_amount; ?>
Check_model :
public function comma_separated_to_array($chck, $separator = ',')
{
//Explode on comma
$vals = explode($separator, $chck);
$count = count($vals);
$val = array();
//Trim whitespace
for($i=0;$i<=$count-1;$i++) {
$val[] .= $vals[$i];
}
return $val;
}
My problem is that the result in $t_amount is not $45. I think there is a mistake somewhere in my code above. Any advice would be much appreciated. Thank you.

Your first query only returns one row of data, because ->row() fetches a single row.
I think you can do something like this for the first query:
$query1 = $this->db->query("SELECT Do_skill FROM Schedule WHERE Activ_date >= ? AND Activ_date < ?", array($startd, $endd));
$check = $query1->result_array();
$array = [];
// collect the skills from every matching row
foreach($check as $ck){
$dats = explode(',',$ck['Do_skill']);
foreach($dats as $dat){
// trim, because some values have a space after the comma
array_push($array,trim($dat));
}
}
and you can use the array to do your next query :)

The array $dcek has the values
Accountant,Medical,Photograph
The query from Codeigniter is
SELECT SUM(`price`) AS `price` FROM `skillqmount`
WHERE `Date_skill` >= '2020-10-01' AND
`Date_skill` < '2020-11-01' AND
`Skill` IN('Accountant', 'Medical', 'Photograph')
which returns 17 - this matches the first three entries in your data.
Your first query will only ever give one row, even if the date range would match multiple rows.
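To make the effect of reading only one row concrete, here is a small Python sketch of the intended calculation, using the two tables from the question as in-memory lists. The function name `total_amount` and the tuple layout are assumptions made for this illustration; it is not CodeIgniter code. It collects the skills from every Schedule row in the date range and then sums the matching prices.

```python
def total_amount(schedule, prices, start, end):
    """Sum prices dated in [start, end) whose skill is listed in any schedule row in that range."""
    skills = set()
    for activ_date, do_skill in schedule:
        if start <= activ_date < end:  # ISO dates compare correctly as strings
            skills.update(s.strip() for s in do_skill.split(","))
    return sum(price for date_skill, skill, price in prices
               if start <= date_skill < end and skill in skills)


# Rows copied from the question: (Activ_date, Do_skill)
schedule = [
    ("2020-10-01", "Accountant,Medical,Photograph"),
    ("2020-11-01", "Medical,Photograph,Doctor,Freelancer"),
    ("2020-12-01", "EO,Teach,Scientist"),
    ("2021-01-01", "Engineering, Freelancer"),
]
# Rows copied from the question: (Date_skill, Skill, Price)
prices = [
    ("2020-10-02", "Accountant", 5), ("2020-10-03", "Medical", 7),
    ("2020-10-11", "Photograph", 5), ("2020-10-12", "Doctor", 9),
    ("2020-10-01", "Freelancer", 7), ("2020-10-04", "EO", 4),
    ("2020-10-05", "Teach", 4), ("2020-11-02", "Accountant", 5),
    ("2020-11-03", "Medical", 7), ("2020-11-11", "Photograph", 5),
    ("2020-11-12", "Doctor", 9), ("2020-11-01", "Freelancer", 7),
]

print(total_amount(schedule, prices, "2020-10-01", "2020-11-01"))
```

With the end date treated as exclusive, only the first Schedule row falls in the range, which reproduces the 17 mentioned above. Widening the range so that a second Schedule row is included changes both the skill list and the total.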

bash looping and extracting of the fragment of txt file

I am analysing a large number of dlg text files located in the workdir. Each file contains a table (its position within the log varies) in the following format:
File 1:
CLUSTERING HISTOGRAM
____________________
________________________________________________________________________________
| | | | |
Clus | Lowest | Run | Mean | Num | Histogram
-ter | Binding | | Binding | in |
Rank | Energy | | Energy | Clus| 5 10 15 20 25 30 35
_____|___________|_____|___________|_____|____:____|____:____|____:____|____:___
1 | -5.78 | 11 | -5.78 | 1 |#
2 | -5.53 | 13 | -5.53 | 1 |#
3 | -5.47 | 17 | -5.44 | 2 |##
4 | -5.43 | 20 | -5.43 | 1 |#
5 | -5.26 | 19 | -5.26 | 1 |#
6 | -5.24 | 3 | -5.24 | 1 |#
7 | -5.19 | 4 | -5.19 | 1 |#
8 | -5.14 | 16 | -5.14 | 1 |#
9 | -5.11 | 9 | -5.11 | 1 |#
10 | -5.07 | 1 | -5.07 | 1 |#
11 | -5.05 | 14 | -5.05 | 1 |#
12 | -4.99 | 12 | -4.99 | 1 |#
13 | -4.95 | 8 | -4.95 | 1 |#
14 | -4.93 | 2 | -4.93 | 1 |#
15 | -4.90 | 10 | -4.90 | 1 |#
16 | -4.83 | 15 | -4.83 | 1 |#
17 | -4.82 | 6 | -4.82 | 1 |#
18 | -4.43 | 5 | -4.43 | 1 |#
19 | -4.26 | 7 | -4.26 | 1 |#
_____|___________|_____|___________|_____|______________________________________
The aim is to loop over all the dlg files and take the single line from the table that corresponds to the widest cluster (the one with the largest number of hashes in the Histogram column). In the example table above, this is the third line.
3 | -5.47 | 17 | -5.44 | 2 |##
Then I need to add this line to final_log.txt together with the name of the log file (which should be placed before the line). So in the end I should have something in the following format (for 3 different log files):
"Name_of_the_file_1": 3 | -5.47 | 17 | -5.44 | 2 |##
"Name_of_the_file_2": 1 | -5.99 | 13 | -5.98 | 16 |################
"Name_of_the_file_3": 2 | -4.78 | 19 | -4.44 | 3 |###
A possible model of my BASH workflow would be:
#!/bin/bash
# loop over every dlg file in the working directory
for f in "$PWD"/*.dlg
do
file_name2=$(basename "$f")
file_name="${file_name2%.dlg}"
echo "Processing of $f..."
# take a name of the file and save it in the log
echo "$file_name" >> "$PWD"/final_results.log
# search of the beginning of the table inside of each file and save it after its name
grep 'CLUSTERING HISTOGRAM' "$f" >> "$PWD"/final_results.log
# check whether it works
gedit "$PWD"/final_results.log
done
Here I need to replace the combination of echo and grep so that only the selected part of the table is taken.

You can use this one, which should be fast enough. Extra lines in your files, besides the tables, should not be a problem.
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'
grep fetches all the histogram lines which are then sorted in reverse order by last field, that means lines with most # on the top, and finally awk removes the duplicates. Note that when grep is parsing more than one file, it has -H by default to print the filenames at the beginning of the line, so if you test it for one file, use grep -H.
Result should be like this:
file1.dlg: 3 | -5.47 | 17 | -5.44 | 2 |##########
file2.dlg: 3 | -5.47 | 17 | -5.44 | 2 |####
file3.dlg: 3 | -5.47 | 17 | -5.44 | 2 |#######
Here is a modification to get the first appearance in case of several equal maximum lines in a file:
grep "#$" *.dlg | sort -k11 | tac | awk '!seen[$1]++'
We replaced the reversed parameter in sort, with the 'tac' command which is reversing the file stream, so now for any equal lines, initial order is preserved.
Second solution
Here using only awk:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) print i ":" row[i]}' *.dlg
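Update: if you execute it from a different directory and want to keep only the basename of every file (removing the path prefix), copy the key before stripping it, so that the lookup in row still uses the full name. Supply the dlg files as arguments in the same way as above:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) {name=i; sub(".*/","",name); print name ":" row[i]}}'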

Probably makes more sense as an Awk script.
This picks the first line with the widest histogram in the case of a tie within an input file.
#!/bin/bash
awk 'FNR == 1 { if(sel) print sel; sel = ""; max = 0 }
FNR < 9 { next }
length($10) > max { max = length($10); sel = FILENAME ":" $0 }
END { if (sel) print sel }' ./"$prot"/*.dlg
This assumes the histograms are always the tenth field; if your input format is even messier than the lump you show, maybe adapt to taste.
In some more detail, the first line triggers on the first line of each input file. If we have collected a previous line (meaning this is not the first input file), print that, and start over. Otherwise, initialize for the first input file. Set sel to nothing and max to zero.
The second line skips lines 1-8 which contain the header.
The third line checks if the current line's histogram is longer than max. If it is, update max to this histogram's length, and remember the current line in sel.
The last line is spillover for when we have processed all files. We never printed the sel from the last file, so print that too, if it's set.
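The selection rule described above (keep the first line with the widest histogram) can also be sketched in Python. This is an illustrative equivalent rather than a translation of the awk script: the function name `widest_cluster_line` is made up, and instead of relying on the field position it counts the run of `#` characters at the end of each line.

```python
def widest_cluster_line(lines):
    """Return the first line with the longest trailing run of '#', or None if there is none."""
    best, best_width = None, 0
    for line in lines:
        stripped = line.rstrip()
        if not stripped.endswith("#"):
            continue  # not a histogram row (header, ruler, or unrelated log text)
        width = len(stripped) - len(stripped.rstrip("#"))
        if width > best_width:  # strict '>' keeps the first line on a tie
            best, best_width = stripped, width
    return best


# A few lines in the layout shown in the question.
sample = [
    "Rank |   Energy  |     |   Energy  | Clus|    5    10   15",
    "_____|___________|_____|___________|_____|____:____|____:__",
    "   1 |     -5.78 |  11 |     -5.78 |   1 |#",
    "   2 |     -5.53 |  13 |     -5.53 |   1 |#",
    "   3 |     -5.47 |  17 |     -5.44 |   2 |##",
    "   4 |     -5.43 |  20 |     -5.43 |   1 |#",
]
print(widest_cluster_line(sample))
```

Calling the function once per file and prefixing the result with the file name would give the `name: line` layout requested in the question.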
If you mean to say we should find the lines between CLUSTERING HISTOGRAM and the end of the table, we should probably have more information about what the surrounding lines look like. Maybe something like this, though;
awk '/CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0 }
!looking { next }
looking > 1 && $1 != looking { looking = 0; nextfile }
$1 == looking { if (length($10) > max) { max = length($10); sel = FILENAME ":" $0 }; looking++ }
END { if (sel) print sel }' ./"$prot"/*.dlg
This sets looking to 1 when we see CLUSTERING HISTOGRAM, then counts up to the first line where looking is no longer increasing.

I would suggest processing using awk:
for i in $FILES
do
echo -n \""$i\": "
awk 'BEGIN {
output="";
outputlength=0
}
/(^ *[0-9]+)/ { # process only lines that start with a number
if (length(substr($10, 2)) > outputlength) { # if line has more hashes, store it
output=$0;
outputlength=length(substr($10, 2))
}
}
END {
print output # output the resulting line
}' "$i"
done

MDX - filter empty outside of selected range

The cube is populated with data divided along a time dimension (Period), where each member represents a month.
Following query:
select non empty {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].ALLMEMBERS} on rows
from MyCube
returns:
+--------+----+---+--------+
| Period | a | b | c |
+--------+----+---+--------+
| 2 | 3 | 2 | (null) |
| 3 | 5 | 3 | 1 |
| 5 | 23 | 2 | 2 |
+--------+----+---+--------+
Removing non empty
select {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].ALLMEMBERS} on rows
from MyCube
Renders:
+--------+--------+--------+--------+
| Period | a | b | c |
+--------+--------+--------+--------+
| 1 | (null) | (null) | (null) |
| 2 | 3 | 2 | (null) |
| 3 | 5 | 3 | 1 |
| 4 | (null) | (null) | (null) |
| 5 | 23 | 2 | 2 |
| 6 | (null) | (null) | (null) |
+--------+--------+--------+--------+
What I would like to get is all records from period 2 to period 5: the first occurrence of a value in measure "a" denotes the start of the range, and the last occurrence denotes the end of the range.
This works, but I need the range to be calculated dynamically at runtime by MDX:
select non empty {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].&[2] :[Period].[Period].&[5]} on rows
from MyCube
desired output:
+--------+--------+--------+--------+
| Period | a | b | c |
+--------+--------+--------+--------+
| 2 | 3 | 2 | (null) |
| 3 | 5 | 3 | 1 |
| 4 | (null) | (null) | (null) |
| 5 | 23 | 2 | 2 |
+--------+--------+--------+--------+
I tried looking for first/last values but could not compose them into the query properly. Has anyone had this issue before? It should be pretty common, since I want a continuous financial report that does not skip months where nothing is going on. Thanks.

Maybe try playing with NonEmpty / Tail function in a WITH clause:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SET [Last] AS
{TAIL(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First].ITEM(0).ITEM(0)
:[Last].ITEM(0).ITEM(0) on rows
FROM MyCube;
To debug a custom set and see which members it returns, you can do something like this:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First] on rows
FROM MyCube;
Judging by your comment about Children, I think this is also an alternative, adding an extra [Period] level:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].[Period].MEMBERS
, [Measures].[a]))}
SET [Last] AS
{TAIL(NONEMPTY([Period].[Period].[Period].MEMBERS
, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First].ITEM(0).ITEM(0)
:[Last].ITEM(0).ITEM(0) on rows
FROM MyCube;
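The idea behind the HEAD/TAIL sets, taking the first and last member that has a value for measure "a" and returning everything in between, can be illustrated outside MDX with a short Python sketch. The function name `trim_empty_edges` is invented for this example, and `None` stands in for (null); the data is the measure "a" column from the question.

```python
def trim_empty_edges(periods, values):
    """Keep the periods from the first to the last non-empty value, inclusive."""
    non_empty = [i for i, v in enumerate(values) if v is not None]
    if not non_empty:
        return []  # no data at all, so there is no range to report
    first, last = non_empty[0], non_empty[-1]
    return periods[first:last + 1]  # empty periods inside the range are kept


# Measure "a" per period, as in the question's second result table.
periods = [1, 2, 3, 4, 5, 6]
a = [None, 3, 5, None, 23, None]

print(trim_empty_edges(periods, a))
```

Period 4 stays in the result even though it is empty, which is the difference from simply applying non empty to the rows.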

Vbscript basic functions

I am new to programming and computer science. HTML is all I know and I have been facing problems with vbscript.
This program (my first in vbscript) was given by my teacher. But I really do not understand anything. I referred to my book but in vain.
I am not even sure if this is the right SE to post the question.
Please help.

What you have there is a loop with another nested loop, both of which print some text to the screen (document.write("...")).
The outer loop
For i = 1 To 5 Step 1
...
Next
iterates from 1 to 5 in steps of 1 (which is redundant, since 1 is the default step size, so you could just omit the Step 1). If you printed the value of i inside the loop
For i = 1 To 5 Step 1
document.Write(i & "<br>")
Next
You'd get the following output:
1
2
3
4
5
In your code sample you just print <br>, though, so each cycle of the outer loop just prints a line break.
In addition to printing line breaks in the outer loop you also have a nested loop, which for each cycle of the outer loop iterates from 1 to the current value of i, again in steps of 1.
For j = 1 To i Step 1
...
Next
So in the first cycle of the outer loop (i=1) the inner loop iterates from 1 to 1, in the second cycle of the outer loop (i=2) it iterates from 1 to 2, and so on.
For i = 1 To 5 Step 1
document.Write(i & "<br>")
For j = 1 To i Step 1
document.Write("*")
Next
Next
Since the inner loop prints an asterisk with each cycle you get i asterisks per line before the inner loop ends, the outer loop then goes into the next cycle and prints a line break, thus ending the current output line.
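The same nested-loop structure can be mirrored in Python to check the output by running it. This is a sketch under an assumption: the question's own VBScript is not shown, so the loop body here follows the description above (a line break per outer cycle, one asterisk per inner cycle), and the function name `triangle` is made up.

```python
def triangle(rows=5):
    """Build the text the described loops would write, as one string."""
    out = []
    for i in range(1, rows + 1):       # outer loop: For i = 1 To rows
        out.append("<br>")             # each outer cycle starts with a line break
        for j in range(1, i + 1):      # inner loop: For j = 1 To i
            out.append("*")            # one asterisk per inner cycle
    return "".join(out)


print(triangle(3))
```

Because the inner loop runs i times in outer cycle i, each rendered line is one asterisk longer than the previous one.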
A good (although somewhat tedious) way to get an understanding of how the loops work is to note the current value of each variable as well as the current output line in a table on a sheet of paper, e.g. like this:
code line | instruction | i | j | output line
----------+------------------------+-------+-------+------------
1 | For i = 1 To 5 Step 1 | 1 | Empty |
2 | document.Write("<br>") | 1 | Empty | <br>
3 | For j = 1 To i Step 1 | 1 | 1 |
4 | document.Write("*") | 1 | 1 | *
5 | Next | 1 | 1 | *
6 | Next | 1 | 1 | *
1 | For i = 1 To 5 Step 1 | 2 | 1 | *
2 | document.Write("<br>") | 2 | 1 | *<br>
3 | For j = 1 To i Step 1 | 2 | 1 |
4 | document.Write("*") | 2 | 1 | *
5 | Next | 2 | 1 | *
3 | For j = 1 To i Step 1 | 2 | 2 | *
4 | document.Write("*") | 2 | 2 | **
5 | Next | 2 | 2 | **
6 | Next | 2 | 2 | **
1 | For i = 1 To 5 Step 1 | 3 | 2 | **
2 | document.Write("<br>") | 3 | 2 | **<br>
3 | For j = 1 To i Step 1 | 3 | 1 |
4 | document.Write("*") | 3 | 1 | *
... | ... | ... | ... | ...

Sum of the grouped distinct values

This is a bit hard to explain in words ... I'm trying to calculate a sum of grouped distinct values in a matrix. Let's say I have the following data returned by a SQL query:
------------------------------------------------
| Group | ParentID | ChildID | ParentProdCount |
| A | 1 | 1 | 2 |
| A | 1 | 2 | 2 |
| A | 1 | 3 | 2 |
| A | 1 | 4 | 2 |
| A | 2 | 5 | 3 |
| A | 2 | 6 | 3 |
| A | 2 | 7 | 3 |
| A | 2 | 8 | 3 |
| B | 3 | 9 | 1 |
| B | 3 | 10 | 1 |
| B | 3 | 11 | 1 |
------------------------------------------------
There's some other data in the query, but it's irrelevant. ParentProdCount is specific to the ParentID.
Now, I have a matrix in the MS Report Designer in which I'm trying to calculate a sum for ParentProdCount (grouped by "Group"). If I just add the expression
=Sum(Fields!ParentProdCount.Value)
I get a result 20 for Group A and 3 for Group B, which is incorrect. The correct values should be 5 for group A and 1 for group B. This wouldn't happen if there wasn't ChildID involved, but I have to use some other child-specific data in the same matrix.
I tried to nest FIRST() and SUM() aggregate functions but apparently it's not possible to have nested aggregation functions, even when they have scopes defined.
I'm pretty sure there is some way to calculate the grouped distinct sum without needing to create another SQL query. Anyone got an idea how to do that?

OK, I got this sorted out by adding a ROW_NUMBER() function to my SQL query:
SELECT Group, ParentID, ROW_NUMBER() OVER (PARTITION BY ParentID ORDER BY ChildID ASC) AS Position, ChildID, ParentProdCount FROM Table
and then I replaced the SSRS SUM function with
=SUM(IIF(Fields!Position.Value = 1, Fields!ParentProdCount.Value, 0))
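The effect of summing only the row at Position 1 of each parent can be checked with a small Python sketch over the sample rows from the question. The function name `sum_distinct_parent_counts` is made up for this illustration; it counts each ParentID once per group, which is what the Position = 1 filter achieves.

```python
def sum_distinct_parent_counts(rows):
    """Sum ParentProdCount per group, counting each ParentID only once."""
    seen = set()
    totals = {}
    for group, parent_id, child_id, count in rows:
        totals.setdefault(group, 0)
        if (group, parent_id) not in seen:  # first row of this parent, like Position = 1
            seen.add((group, parent_id))
            totals[group] += count
    return totals


# Rows copied from the question: (Group, ParentID, ChildID, ParentProdCount)
rows = [
    ("A", 1, 1, 2), ("A", 1, 2, 2), ("A", 1, 3, 2), ("A", 1, 4, 2),
    ("A", 2, 5, 3), ("A", 2, 6, 3), ("A", 2, 7, 3), ("A", 2, 8, 3),
    ("B", 3, 9, 1), ("B", 3, 10, 1), ("B", 3, 11, 1),
]

print(sum_distinct_parent_counts(rows))
```

This gives 5 for group A and 1 for group B, the values the question names as correct, instead of the 20 and 3 produced by a plain sum.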

Put a grouping on ParentID and use a summation over that group, e.g.:
if the group over ParentID is named "ParentIDGroup"
then
column sum of ParentProdCount = SUM(Fields!ParentProdCount.Value,"ParentIDGroup")
