Assign a sequential number to the 1st column of data, restarting from 1 at each blank line, using awk and/or sed - bash

I have a big data file consisting of blocks of xy data; the blocks are separated by a blank line. I want to change every x to a sequential number, restarting from 1 at each new block. The number of rows within each block can differ.
input:
165168 14653
5131655 51365
155615 1356

13651625 13651
12 51
55165 51656

64 64
651456 546546
desired output:
1 14653
2 51365
3 1356

1 13651
2 51
3 51656

1 64
2 546546

I would use:
$ awk '!NF{i=0; print; next} {print ++i, $2}' file
1 14653
2 51365
3 1356

1 13651
2 51
3 51656

1 64
2 546546
Explanation
It is a matter of keeping a counter i and resetting it appropriately.
!NF{i=0; print; next} if there are no fields, that is, if the line is empty, print an empty line and reset the counter.
{print ++i, $2} otherwise, increment the counter and print it together with the 2nd field.

Maybe even
awk '!NF { n=NR } NF { $1=NR-n } 1' file
So on an empty line we set n to the current line number; on nonempty lines we change the first field to the current line number minus n. The trailing 1 prints every line.
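Both one-liners can be sanity-checked against a throwaway copy of the sample input (/tmp/xy is an assumed scratch path):

```shell
# recreate the sample input; blocks are separated by single blank lines
printf '165168 14653\n5131655 51365\n155615 1356\n\n13651625 13651\n12 51\n55165 51656\n\n64 64\n651456 546546\n' > /tmp/xy

awk '!NF{i=0; print; next} {print ++i, $2}' /tmp/xy   # counter-based version
awk '!NF { n=NR } NF { $1=NR-n } 1' /tmp/xy           # line-number-based version
```

Both print the renumbered blocks shown in the desired output, blank separators included.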

Related

Cross-referencing strings from two files by line number and collecting them into a third file

I have two files that I wish to coordinate into a single file for plotting an xy-graph.
File1 contains a different x-value on each line, followed by a series of y-values on the same line. File2 contains the specific y-value that I need from File1 for each point x.
In reality, I have 50,000 lines and 50-100 columns, but here is a simplified example.
File1 appears like this:
1 15 2 3 1
2 18 4 6 5
3 19 7 8 9
4 23 10 2 11
5 25 18 17 16
column 1 is the line number.
column 2 is my x-value, sorted in ascending order.
columns 3-5 are my y-values. They aren't unique; a y on one line could match a y on a different line.
File2 appears like this:
3
5
2
18
The y on each line in File2 corresponds to a number matching one of the y's in File1 from the same line (for the first few hundred lines). After the first few hundred lines, they may not always have a match. Therefore, File2 has fewer lines than File1. I would like to either ignore these rows or fill them with a 0.
Goal
The output, File3, should consist of:
15 3
18 5
19 0
23 2
25 18
or the line with
19 0
removed, whichever works for the script. If neither option is possible, then I would also be okay with just matching the y-values line-by-line until there is not a match, and then stopping there.
Attempts
I initially routed File2 into an array:
a=( $(grep -e '14,12|:*' File0 | cut -b 9-17) )
but then I noticed similar questions (1, 2) on Stackexchange used a second file, hence I routed the above grep command into File2.
These questions are slightly different, since I require specific columns from File1, but I thought I could at least use them as a starting point. The solutions to these questions:
1)
grep -Fwf File2 File1
of course reproduces the entire matching line from File1, and I'm not sure how to proceed from there; or
2)
awk 'FNR==NR {arr[$1];next} $1 in arr' File2 File1
fails entirely for me, with no error message except the general awk help response.
Is this possible to do? Thank you.
awk 'NR==FNR { arr[NR] = $1; next } {
    for (i = 3; i <= NF; ++i) {
        if ($i == arr[n]) {
            print $2, $i
            n++
            next
        }
    }
    print $2, 0
}' n=1 file2 file1
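With the question's sample data saved to scratch files (the /tmp paths here are assumptions), the run reproduces the desired File3:

```shell
printf '1 15 2 3 1\n2 18 4 6 5\n3 19 7 8 9\n4 23 10 2 11\n5 25 18 17 16\n' > /tmp/file1
printf '3\n5\n2\n18\n' > /tmp/file2

awk 'NR==FNR { arr[NR] = $1; next } {
    for (i = 3; i <= NF; ++i) {
        if ($i == arr[n]) { print $2, $i; n++; next }
    }
    print $2, 0
}' n=1 /tmp/file2 /tmp/file1
# 15 3
# 18 5
# 19 0
# 23 2
# 25 18
```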
Another awk, which prints the first match only:
$ awk 'NR==FNR {a[$1]; next}
{f2=$2; $1=$2="";
for(k in a) if($0 FS ~ FS k FS) {print f2,k; next}}' file2 file1
15 2
18 5
23 2
25 18
The fields are padded with FS to eliminate substring matches. Note the order of the files: file2 must be given first.
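A contrived two-liner shows the effect of the padding: without it, a key 1 would falsely match inside the field 18:

```shell
echo '  18 17 16' | awk '{ print ($0 ~ 1 ? "naive: match" : "naive: no match") }'
# → naive: match (the "1" inside 18 is hit)

echo '  18 17 16' | awk '{ print ($0 FS ~ FS 1 FS ? "padded: match" : "padded: no match") }'
# → padded: no match (" 1 " is not a substring)
```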

How to use awk to count the number of a specified digit in a certain column

How can I count the matched values in a certain column?
I have a file (wm.csv). I ran this command to get the targeted values from certain columns: tail -n +497 wm.csv | awk -F"," '$2=="2" {print $3" "$4}'
It gives the following output data, which is what I want:
hit 2
hit 2
hit 2
hit 2
miss
hit 2
hit 2
hit 2
hit 2
hit 2
hit 2
hit 2
incorrect 1
hit 2
hit 2
hit 2
I want to count the number of "2"s in the second column in order to do simple math, namely the count in that column divided by the total number of rows. Specifically, in this case, it would look like: 14 (fourteen "2"s in the second column) / 16 (total number of rows).
Here is the command I tried, but it does not work:
tail -n +497 wm.csv | awk -F"," '$2=="2" {count=0;} { if ($4 == "2") count+=1 } {print $3,$4,$count }'
thanks
Taking the posted data as the input file:
$ awk '$2==2{c++} END{print NR,c,c/NR}' file
16 14 0.875
awk '($0 ~ "hit 2"){count += 1} END{print count, FNR, count/FNR}' sample.csv
14 16 0.875
I use ~ to test whether the whole line ($0) matches "hit 2"; if it does, the counter is incremented by 1. FNR is the number of records read from the current file, which here is the total line count.
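For comparison, the same ratio can be computed with grep and wc on the extracted data; /tmp/sample.txt and its shortened contents are assumptions for illustration:

```shell
printf 'hit 2\nhit 2\nmiss\nincorrect 1\nhit 2\n' > /tmp/sample.txt
total=$(( $(wc -l < /tmp/sample.txt) ))   # total number of rows
hits=$(grep -c ' 2$' /tmp/sample.txt)     # rows whose second column is 2
awk -v h="$hits" -v t="$total" 'BEGIN { print h, t, h/t }'
# → 3 5 0.6
```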

Subset a file by row and column numbers

We want to subset a text file by rows and columns, where the row and column numbers are read from files, excluding the header (row 1) and row names (column 1).
inputFile.txt Tab delimited text file
header 62 9 3 54 6 1
25 1 2 3 4 5 6
96 1 1 1 1 0 1
72 3 3 3 3 3 3
18 0 1 0 1 1 0
82 1 0 0 0 0 1
77 1 0 1 0 1 1
15 7 7 7 7 7 7
82 0 0 1 1 1 0
37 0 1 0 0 1 0
18 0 1 0 0 1 0
53 0 0 1 0 0 0
57 1 1 1 1 1 1
subsetCols.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 500K columns, and need to subset ~10K.
1,4,6
subsetRows.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 20K rows, and need to subset about ~300.
1,3,7
Current solution using cut and awk loop (Related post: Select rows using awk):
# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt
# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput
Output file: result.txt
1 4 6
3 3 3
7 7 7
Question:
This solution works fine for small files, but for bigger files (50K rows and 200K columns) it takes too long: 15 minutes plus, and still running. I think cutting the columns works fine; selecting the rows is the slow part.
Any better way?
Real input files info:
# $fileInput:
# Rows = 20127
# Cols = 533633
# Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers
More information about the file: it contains GWAS genotype data. Every row represents a sample (individual) and every column represents a SNP. For further region-based analysis we need to subset samples (rows) and SNPs (columns) to make the data more manageable (smaller) as input for other statistical software such as R.
System:
$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux
Update: the solution provided below by @JamesBrown was mixing the order of columns on my system, as I am using a different version of awk; my version is GNU Awk 3.1.7.
Even though in If programming languages were countries, which country would each language represent? they say that...
Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.
... whenever you see yourself piping sed, cut, grep, awk, etc, stop and say to yourself: awk can make it alone!
So in this case it is a matter of extracting the rows and columns (tweaking them to exclude the header and first column) and then just buffering the output to finally print it.
awk -v cols="1 4 6" -v rows="1 3 7" '
BEGIN{
    split(cols,c); for (i in c) col[c[i]]    # extract cols to print
    split(rows,r); for (i in r) row[r[i]]    # extract rows to print
}
(NR-1 in row){
    for (i=2;i<=NF;i++)
        ((i-1) in col) && (line=(line ? line OFS $i : $i))   # pick columns
    print line; line=""                                      # print them
}' file
With your sample file:
$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) ((i-1) in col) && (line=(line ? line OFS $i : $i)); print line; line=""}' file
1 4 6
3 3 3
7 7 7
With your sample file, and inputs as variables, split on comma:
awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c, /,/); for (i in c) col[c[i]]; split(rows,r, /,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) ((i-1) in col) && (line=(line ? line OFS $i : $i)); print line; line=""}' $fileInput
I am quite sure this will be way faster. You can for example check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk over grep and others.
One for GNU awk version 4.0 or later, as the column ordering relies on for and PROCINFO["sorted_in"]. The row and column numbers are read from files:
$ awk '
BEGIN {
    PROCINFO["sorted_in"]="#ind_num_asc"
}
FILENAME==ARGV[1] {                      # process rows file
    n=split($0,t,",")
    for(i=1;i<=n;i++) r[t[i]]
}
FILENAME==ARGV[2] {                      # process cols file
    m=split($0,t,",")
    for(i=1;i<=m;i++) c[t[i]]
}
FILENAME==ARGV[3] && ((FNR-1) in r) {    # process data file
    for(i in c)
        printf "%s%s", $(i+1), (++j%m?OFS:ORS)
}' subsetRows.txt subsetCols.txt inputFile.txt
1 4 6
3 3 3
7 7 7
Some performance gain could probably come from moving the ARGV[3] processing block to the top, before blocks 1 and 2, and adding a next to its end.
Not to take anything away from the two excellent answers above: just because this problem involves a large data set, I am posting a combination of the two answers to speed up the processing.
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
    n = split(cols, c, /,/)
    split(rows, r, /,/)
    for (i in r)
        row[r[i]]
}
(NR-1) in row {
    for (i=1; i<=n; i++)
        printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
PS: This should work with older awk version or non-gnu awk as well.
To refine @anubhava's solution, we can get rid of searching over 10k values for each row to see whether we are on the right row, by taking advantage of the fact that the input is already sorted:
awk -v cols="$(<subsetCols.txt)" -v rows="$(<subsetRows.txt)" '
BEGIN {
    n = split(cols, c, /,/)
    split(rows, r, /,/)
    j = 1
}
(NR-1) == r[j] {
    j++
    for (i=1; i<=n; i++)
        printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
}' inputFile.txt
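Since subsetRows.txt is sorted, the scan can also stop once the last requested row has been printed. A sketch with an early exit, shown with literal row/column lists for brevity:

```shell
awk -v cols="1,4,6" -v rows="1,3,7" '
BEGIN {
    n = split(cols, c, /,/)
    m = split(rows, r, /,/)
    j = 1
}
(NR-1) == r[j] {
    for (i=1; i<=n; i++)
        printf "%s%s", $(c[i]+1), (i<n?OFS:ORS)
    if (++j > m) exit    # all requested rows printed; stop reading
}' inputFile.txt
```

On a 31 GB file this avoids scanning everything past the last wanted row.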
Python has a csv module. You read a row into a list, print the desired columns to stdout, rinse, wash, repeat.
This should slice columns 20,000 to 30,000.
import csv

with open('foo.txt') as f:
    # the input file is tab-delimited
    gwas = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in gwas:
        print(row[20001:30001])

Creating a mapping count

I have this data with two columns
Id Users
123 2
123 1
234 5
234 6
34 3
I want to create this count mapping from the given data like this
123 3
234 11
34 3
How can I do it in bash?
You can use associative arrays, something like:
declare -A newmap
newmap["123"]=2
newmap["123"]=$(( ${newmap["123"]} + 1))
Obviously you have to iterate through your input, check whether the entry exists and add to it, or else initialize it.
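A minimal sketch of that loop (Bash 4+), assuming the data, including the Id Users header, is in data.txt:

```shell
declare -A newmap
while read -r id users; do
    [ "$id" = "Id" ] && continue                  # skip the header line
    newmap[$id]=$(( ${newmap[$id]:-0} + users ))  # initialize to 0 on first sight
done < data.txt
for id in "${!newmap[@]}"; do
    echo "$id ${newmap[$id]}"
done
```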
It will be easier with awk.
Solution 1: Does not expect the file to be sorted. Stores the entire file in memory.
awk '{a[$1]+=$2}END{for(x in a) print x,a[x]}' file
34 3
234 11
123 3
What we are doing here is using the first column as the key and adding the second column as the value. In the END block we iterate over our array and print each key-value pair.
If you have the Id Users line in your input file and want to exclude it from the output, then add NR>1 condition by saying:
awk 'NR>1{a[$1]+=$2}END{for(x in a) print x,a[x]}' file
NR>1 tells awk to skip the first line. NR contains the line number, so we instruct awk to start building our array from the second line onwards.
Solution 2: Expects the file to be sorted. Does not store the file in memory.
awk '$1!=prev && NR>1{print prev,sum; sum=0}{prev=$1; sum+=$2}END{print prev,sum}' file
123 3
234 11
34 3
If you have the Id Users line in your input file and want to exclude it from the output, then add NR>1 condition by saying:
awk '$1!=prev && NR>2{print prev, sum; sum=0}NR>1{prev=$1; sum+=$2}END{print prev, sum}' file
123 3
234 11
34 3
A Bash (4.0+) solution:
declare -Ai count
while read -r a b ; do
    count[$a]+=b
done < "$infile"
for idx in "${!count[@]}"; do
    echo "${idx} ${count[$idx]}"
done
For a sorted output the last line should read
done | sort -n

Divide the first entry by the last entry of each row in a file using awk

I have a file with varying row lengths:
120 2 3 4 5 9 0.003
220 2 3 4 0.004
320 2 3 5 6 7 8 8 0.009
I want the output to consist of a single column with entries like:
120/0.003
220/0.004
320/0.009
That is, I want to divide the first column by the last column of each row.
How can I achieve this using awk?
To show the output of the division operation:
$ awk '{ printf "%s/%s=", $1, $NF; print ($1/$NF) }' infile
120/0.003=40000
220/0.004=55000
320/0.009=35555.6
awk will split its input based on the value of FS, which by default is any sequence of whitespace. That means you can get at the first and last column by referring to $1 and $NF. NF is the number of fields in the current line or record.
So to tell awk to print the first and last column do something like this:
awk '{ print $1 "/" $NF }' infile
Output:
120/0.003
220/0.004
320/0.009
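If you want the quotient with a fixed number of decimals rather than awk's default OFMT rounding, a printf format is a sketch:

```shell
printf '120 2 3 4 5 9 0.003\n220 2 3 4 0.004\n320 2 3 5 6 7 8 8 0.009\n' |
awk '{ printf "%s/%s=%.2f\n", $1, $NF, $1/$NF }'
# 120/0.003=40000.00
# 220/0.004=55000.00
# 320/0.009=35555.56
```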
