Count uniques across multiple columns in gzipped files with duplicates - shell

I'm working with fairly large gzipped TSV files, each of which has only 3 columns. I would like to count the number of unique occurrences of a particular regex (which appears in column 3) across all files.
How do I make sure the count number in the output removes any duplicates based on values contained in column 1?
I tried both of these, but I'm not sure either is correct:
zgrep -c ",80447," AU_AAID_201812*.tsv.gz | uniq -c
zgrep -c ",80447," AU_AAID_201812*.tsv.gz
I want to get the unique count number so that if:
Column 1/Row 1 = "xyz123" and Column 3/Row 1 = ",80447,"
Column 1/Row 2 = "xyz123" and Column 3/Row 2 = ",80447,"
Then my output would still be "1".

Use cut to extract just column 1 and column 3, sort -u to remove duplicates, and wc -l to get the count:
zgrep ',80447,' AU_AAID_201812*.tsv.gz | cut -d, -f1,3 | sort -u | wc -l
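Note that the pipeline above counts unique (column 1, column 3) pairs. If you instead want to count distinct column 1 values among the matching rows, a single awk pass also works and avoids sorting the whole stream. A minimal sketch, assuming the columns are tab-separated as the .tsv name suggests (adjust -F if your delimiter differs):

# print the number of distinct column-1 values on rows whose column 3 matches ,80447,
zcat AU_AAID_201812*.tsv.gz |
awk -F'\t' '$3 ~ /,80447,/ && !seen[$1]++ { n++ } END { print n+0 }'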

Related

How to sort lines based on specific part of their value?

When I run the following command:
command list -r machine-a-.* | sort -nr
It gives me the following result:
machine-a-9
machine-a-8
machine-a-72
machine-a-71
machine-a-70
I wish to sort these lines based on the number at the end, in descending order.
(Clearly, sort -nr doesn't work as expected.)
You just need the -t and -k options in the sort.
command list -r machine-a-.* | sort -t '-' -k 3 -nr
-t is the separator used to separate the fields.
By giving it the value '-', sort will see the given text as:
Field 1 Field 2 Field 3
machine a 9
machine a 8
machine a 72
machine a 71
machine a 70
-k specifies the field which will be used for comparison.
By giving it the value 3, sort will sort the lines by comparing the values from Field 3.
Namely, these strings will be compared:
9
8
72
71
70
-n makes sort treat the fields for comparison as numbers instead of strings.
-r makes sort order the lines in reverse (descending) order.
Therefore, by sorting the numbers from Field 3 in reverse order, this will be the output:
machine-a-72
machine-a-71
machine-a-70
machine-a-9
machine-a-8
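A small refinement: -k 3 starts the key at field 3 and runs to the end of the line. That is harmless here because field 3 is the last one, but bounding the key is a safer habit when more fields follow:

command list -r machine-a-.* | sort -t '-' -k 3,3nr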
Here is an example of input to sort:
$ cat 1.txt
machine-a-9
machine-a-8
machine-a-72
machine-a-71
machine-a-70
Here is our short program:
$ cat 1.txt | ( IFS=-; while read A B C ; do echo $C $A-$B-$C; done ) | sort -rn | cut -d' ' -f 2
Here is its output:
machine-a-72
machine-a-71
machine-a-70
machine-a-9
machine-a-8
Explanation:
$ cat 1.txt \ (put contents of file into pipe input)
| ( \ (group some commands)
IFS=-; (set field separator to "-" for read command)
while read A B C ; (read fields in 3 variables A B and C every line)
do echo $C $A-$B-$C; (create output with $C at the beginning)
done
) \ (end of group)
| sort -rn \ (reverse number sorting)
| cut -d' ' -f 2 (cut off the first field, which is no longer needed)
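The same decorate-sort-undecorate idea can be written without a shell loop; a sketch using awk to prepend the sort key instead of the while/read group:

awk -F- '{ print $3, $0 }' 1.txt | sort -rn | cut -d' ' -f2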

Sorting using -k

I tried this solution on my list and I can't get what I want after sorting. I have this list:
m_2_mdot_3_a_1.dat ro= 303112.12
m_1_mdot_2_a_0.dat ro= 300.10
m_2_mdot_1_a_3.dat ro= 221.33
m_3_mdot_1_a_1.dat ro= 22021.87
I used sort -k 2 -n >name.txt
I would like to get the list ordered from the lowest ro to the highest ro. What did I do wrong?
I get a sort, but either by the names in column 1 or by the last value in an order like: 1000, 100001, 1000.2 ... It seems to compare only the leading digits or something.
cat test.txt | tr . , | sort -k3 -g | tr , .
The following link gives a good answer: Sort scientific and float.
In brief:
you need the -g option to sort on decimal numbers;
the -k option counts fields starting from 1, not 0;
and in some locales, sort uses , as the decimal separator instead of .
However, be careful if your name.txt contains , characters
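Instead of rewriting the data with tr, you can also force a locale in which . is the decimal separator, so that -g parses the numbers directly; a sketch assuming the list is in name.txt:

LC_ALL=C sort -k3,3g name.txt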
Since there's a space or a tab between ro= and the numeric value, you need to sort on the 3rd column instead of the 2nd. So your command will become:
cat input.txt | sort -k 3 -n

Use Bash scripting to select columns and rows with specific name

I'm working with a very large text file (4GB) and I want to make a smaller file with only the data I need in it. It is a tab-delimited file with row and column headers. I basically want to select a subset of the data that has a given column and/or row name.
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
I'm planning to have a file with a list of the columns I want.
colname_1 colname_3
I'm a newbie to bash scripting and I really don't know how to do this. I saw other examples, but they all knew what column number they wanted in advance, and I don't. Sorry if this is a repeat question; I tried to search.
I would want the result to be
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
Bash works best as "glue" between standard command-line utilities. You can write loops which read each line in a massive file, but it's painfully slow because bash is not optimized for speed. So let's see how to use a few standard utilities -- grep, tr, cut and paste -- to achieve this goal.
For simplicity, let's put the desired column headings into a file, one per line. (You can always convert a tab-separated line of column headings to this format; we're going to do just that with the data file's column headings. But one thing at a time.)
$ printf '%s\n' colname_{1,3} > columns
$ cat columns
colname_1
colname_3
An important feature of the printf command-line utility is that it repeats its format until it runs out of arguments.
Now, we want to know which column in the data file each of these column headings corresponds to. We could try to write this as a loop in awk or even in bash, but if we convert the header line of the data file into a file with one header per line, we can use grep to tell us, by using the -n option (which prefixes the output with the line number of the match).
Since the column headers are tab-separated, we can turn them into separate lines just by converting tabs to newlines using tr:
$ head -n1 giga.dat | tr '\t' '\n'

colname_1
colname_2
colname_3
colname_4
Note the blank line at the beginning. That's important, because colname_1 actually corresponds to column 2, since the row headers are in column 1.
So let's look up the column names. Here, we will use several grep options:
-F The pattern argument consists of several patterns, one per line, which are interpreted as ordinary strings instead of regexes.
-x The pattern must match the complete line.
-n The output should be prefixed by the line number of the match.
If we have GNU grep, we could also use -f columns to read the patterns from the file named columns. Or if we're using bash, we could use the bashism "$(<columns)" to insert the contents of the file as a single argument to grep. But for now, we'll stay POSIX-compliant:
$ head -n1 giga.dat | tr '\t' '\n' | grep -Fxn "$(cat columns)"
2:colname_1
4:colname_3
OK, that's pretty close. We just need to get rid of everything other than the line numbers, comma-separate them, and put a 1 at the beginning (so the row labels in column 1 are kept).
$ { echo 1
> grep -Fxn "$(<columns)" < <(head -n1 giga.dat | tr '\t' '\n')
> } | cut -f1 -d: | paste -sd,
1,2,4
cut -f1 Select field 1. The argument could be a comma-separated list, as in cut -f1,2,4.
cut -d: Use : instead of tab as a field separator ("delimiter")
paste -s Concatenate the lines of a single file instead of corresponding lines of several files
paste -d, Use a comma instead of tab as a field separator.
So now we have the argument we need to pass to cut in order to select the desired columns:
$ cut -f"$({ echo 1
> head -n1 giga.dat | tr '\t' '\n' | grep -Fxn -f columns
> } | cut -f1 -d: | paste -sd,)" giga.dat
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
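For comparison, the lookup and the selection can also be done in a single awk pass. This is a sketch, not part of the original answer, assuming the same giga.dat and columns files with tab-separated fields:

awk -F'\t' '
    NR == FNR { want[$1]; next }            # first file: remember the wanted column names
    FNR == 1 { for (i = 1; i <= NF; i++)    # header line: map names to field positions
                   if ($i in want) keep[++m] = i }
    { out = $1                              # field 1 is the row label (empty on the header line)
      for (j = 1; j <= m; j++) out = out "\t" $keep[j]
      print out }
' columns giga.dat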
You can also do this by keeping track of the array indexes of the columns that match the column names in your column-list file. Once you have mapped each requested column name to its index in the data file's header, you simply read the data file (beginning at the second line) and output the row label plus the data at those indexes.
There are probably several ways to approach this, and the following assumes the data in each column does not contain any whitespace. The use of arrays presumes bash (or another advanced shell supporting arrays), not POSIX shell.
The script takes two file names as input. The first is your original data file. The second is your column list file. An approach could be:
#!/bin/bash
declare -a cols ## array holding original columns from original data file
declare -a csel ## array holding columns to select (from file 2)
declare -a cpos ## array holding array indexes of matching columns
cols=( $(head -n 1 "$1") ) ## fill cols from 1st line of data file
csel=( $(< "$2") ) ## read select columns from file 2
## fill column position array
for ((i = 0; i < ${#csel[@]}; i++)); do
for ((j = 0; j < ${#cols[@]}; j++)); do
[ "${csel[i]}" = "${cols[j]}" ] && cpos+=( $j )
done
done
printf " "
for ((i = 0; i < ${#csel[@]}; i++)); do ## output header row
printf " %s" "${csel[i]}"
done
printf "\n" ## output newline
unset cols ## unset cols to reuse in reading lines below
while read -r line; do ## read each data line in data file
cols=( $line ) ## separate into cols array
printf "%s" "${cols[0]}" ## output row label
for ((j = 0; j < ${#cpos[@]}; j++)); do
[ "$j" -eq "0" ] && { ## handle format for first column
printf "%5s" "${cols[$((${cpos[j]}+1))]}"
continue
} ## output remaining columns
printf "%13s" "${cols[$((${cpos[j]}+1))]}"
done
printf "\n"
done < <( tail -n+2 "$1" )
Using your example data as follows:
Data File
$ cat dat/col+data.txt
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
Column Select File
$ cat dat/col.txt
colname_1 colname_3
Example Use/Output
$ bash colnum.sh dat/col+data.txt dat/col.txt
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
Give it a try and let me know if you have any questions. Note, bash isn't known for its blinding speed handling large files, but as long as the column list isn't horrendously long, the script should be reasonably fast.

Join in unix when field is numeric in a huge file

So I have two files, File A and File B. File A is huge (>60 GB): it has 16 fields per line (a mix of numeric and string values), is separated by "|", and has over 600,000,000 lines. Field 3 in this file is the ID, and it is a numeric field of varying length (e.g., someone's ID can be 1, and someone else's can be 100).
File B just has a bunch of IDs (~1,000,000), and I want to extract all the rows from File A that have an ID that is in File B. I started doing this on Linux with the following code:
sort -k3,3 -t'|' FileA.txt > FileASorted.txt
sort -k1,1 -t'|' FileB.txt > FileBSorted.txt
join -1 3 -2 1 -t'|' FileASorted.txt FileBSorted.txt > merged.txt
The problem I have is that merged.txt is empty (when I know for a fact there are at least 10 matches)... I have googled this and it seems like the issue is that the join field (the ID) is numeric. Some people propose padding the field with zeros, but 1) I'm not entirely sure how to do this, and 2) it seems very slow/time-inefficient.
Any other ideas out there? Or help on how to add the zero-padding only to the relevant field?
I would first sort file.b using the unique flag (-u):
sort -u file.b > sortedfile.b
Then loop through sortedfile.b, grepping file.a for each ID. In zsh I would do:
foreach C (`cat sortedfile.b`)
    grep $C file.a > /dev/null
    if [ $? -eq 0 ]; then
        echo $C >> res.txt
    fi
end
The output from grep is redirected to /dev/null, and we test whether there was a match ($? -eq 0); if so, we append (>>) that ID to res.txt.
A single > would overwrite the file. I'm a bit rusty at zsh now, so there might be a typo. You may be using bash, whose loop syntax is slightly different.
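Be warned that grepping a 60 GB file once per ID (about a million times) will take a very long time. Also, the original join failure is often just a collation mismatch: sorting both files and running join under the same locale (for example, prefixing all three commands with LC_ALL=C) is worth trying first. Alternatively, the whole extraction can be done in one pass over File A with awk, holding the File B IDs in memory; a sketch using the filenames from the question:

awk -F'|' '
    NR == FNR { ids[$1]; next }   # first file: load every ID from File B
    $3 in ids                     # second file: print File A rows whose field 3 is a known ID
' FileB.txt FileA.txt > merged.txt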

multiple field and numeric sort

List of files:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-16
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndwr-threads-1
sysbench-size-256M-mode-rndwr-threads-16
sysbench-size-256M-mode-rndwr-threads-4
sysbench-size-256M-mode-rndwr-threads-8
sysbench-size-256M-mode-seqrd-threads-1
sysbench-size-256M-mode-seqrd-threads-16
sysbench-size-256M-mode-seqrd-threads-4
sysbench-size-256M-mode-seqrd-threads-8
sysbench-size-256M-mode-seqwr-threads-1
sysbench-size-256M-mode-seqwr-threads-16
sysbench-size-256M-mode-seqwr-threads-4
sysbench-size-256M-mode-seqwr-threads-8
I would like to sort them by mode (rndrd, rndwr etc.) and then number:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndrw-threads-16
....
I've tried the following loop, but it sorts by number alone, whereas I need the sequence 1,4,8,16 within each mode:
$ for f in $(ls -1A); do echo $f; done | sort -t '-' -k 7n
EDIT:
Please note that a plain numeric sort (-n) orders everything by number (1,1,1,1,4,4,4,4...), but I need the sequence 1,4,8,16,1,4,8,16...
Sort by more columns:
sort -t- -k5,5 -k7n
The primary sort is on the 5th column only (not the rest of the line; that's why it's 5,5), with a secondary numeric sort on the number in the 7th column.
The for loop is completely unnecessary, as is the -1 argument to ls when piping its output. This yields:
ls -A | sort -t- -k 5,5 -k 7,7n
where the first key begins and ends at column 5 and the second key begins and ends at column 7 and is numeric.
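To see why the first key is bounded as 5,5: with -k 5 alone the key would extend from field 5 to the end of the line, the thread numbers would be compared as text, and the numeric key on field 7 would never get a chance to break ties. For example (run in the directory holding the files above):

ls -A | sort -t- -k5            # within each mode: 1, 16, 4, 8 (text comparison)
ls -A | sort -t- -k5,5 -k7,7n   # within each mode: 1, 4, 8, 16 (numeric tie-break)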
