Comparing few colums of a file with columns of another file

Comparing few colums of a file with columns of another file - shell

I have two data files 1.txt and 2.txt
1.txt contains valid lines.
For example.
1 2 1 2
1 3 1 3
In 2.txt i have an extra coloum, but if you ignore that, I have a few valid lines, and few invalid lines. There could be multiple occurrences of the same line in 2.txt
For example:
1 2 1 2 1.9
1 3 1 3 3.4
1 3 1 3 3.4
2 3 2 3 5.6
2 3 2 3 5.6
The second and third lines are the same and valid.
The fourth and fifth lines are also the same but invalid.
I want to write a shell script which compares these two files and outputs two files, valid.txt and invalid.txt which look like these...
valid.txt :
1 2 1 2 1
1 3 1 3 2
and invalid.txt :
2 3 2 3 2
The last extra column of valid.txt and invalid.txt contains the number of times the line has been repeated in 2.txt.

this awk script works for the example data:
awk 'NR==FNR{sub(/ *$/,"");a[$0]++;next}
{sub(/ [^ ]*$/,"")
if($0 in a)
v[$0]++
else
n[$0]++
}
END{
for(x in v)print x,v[x] > "valid.txt"
for(x in n) print x,n[x] >"inv.txt"
}' file1 file2
output:
kent$ head inv.txt valid.txt
==> inv.txt <==
2 3 2 3 2
==> valid.txt <==
1 3 1 3 2
1 2 1 2 1

Related

How to merge three lines at a time

I have a .txt file with 9 lines:
1 2 3 4
1 2 3 5
1 2 3 6
1 2 3 4
1 2 3 5
1 2 3 6
1 2 3 4
1 2 3 5
1 2 3 6
I want to put the first 3 lines into one line, and the next three lines, and again the last three lines:
1 2 3 4 1 2 3 5 1 2 3 6
1 2 3 4 1 2 3 5 1 2 3 6
1 2 3 4 1 2 3 5 1 2 3 6
however it only gives me one consecutive line
I tried
cat old.txt | tr -d '\n' > new.txt

You can use paste to merge together lines.
paste -d " " - - - < input.txt
The -d " " uses a space to delimit between the lines being joined. Each - reads from stdin (and we're redirecting your input file to stdin). If you wanted to join more lines, just increase the number of - etc.

awk: print first column, then some values, and then all other columns

I want to print the first column, then a couple of columns with fixed values, like this command would do:
awk '{print $1,"1","2","1"}'
and then print all columns except the first after that...
I know this command prints all but the first column:
awk '{$1=""; print $0}'
But that gets rid of the first column.
In other words, this:
3 5 2 2
3 5 2 2
3 5 2 2
3 5 2 2
Needs to become this:
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2
Any ideas?

use a loop to iterate through rest of the columns like this:
awk '{print $1,"1","2","1";for(i=2;i<=NF;i++) print $i}'
As an example:
$echo "3 5 2 2" | awk 'BEGIN{ORS=""}{print $1,"1","2","1";for(i=2;i<=NF;i++) print $i}'
3 1 2 1 5 2 2
$
Edit1 :
$ echo "3 5 2 2" | awk 'BEGIN{ORS="\n";OFS="\n"}{print $1,"1","2","1 ";for(i=2;i<=NF;i++) print $i" "}'
3
1
2
1
5
2
2
$
Edit2:
$ echo "3 5 2 2" | awk '{print $1,"1","2","1";for(i=2;i<=NF;i++) print $i}'
3 1 2 1
5
2
2
$
Edit3:
$ echo "3 5 2 2
3 5 2 2
3 5 2 2
3 5 2 2" | awk '{printf("%s %s ", $1,"1 2 1");for(i=2;i<=NF;i++) printf("%s ", $i); printf "\n"}'
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2

You are almost there, you just need to store the first column in a temporary variable:
{
head=$1; # Store $1 in head, used later in printf
$1=""; # Empty $1, so that $0 will not contain first column
printf "%s 1 2 1%s\n", head, $0
}
And a full script:
echo "3 5 2 2" | awk '{head=$1;$1="";printf "%s 1 2 1%s\n", head, $0}'

Another solution with awk:
awk '{sub(/.*/, "1 2 1 "$2, $2)}1' File
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2
3 1 2 1 5 2 2
Substitute the 2nd field with "1 2 1" followed by 2nd field itself.

You can do this using sed by replacing the first space by the string you want.
sed 's/ / 1 2 1 /' file
(OR)
With awk by replacing the first field($1):
awk '{$1=$1 " 1 2 1"}1' file
(I prefer the sed solution since it has less characters).

awk command to merge the content of the same file

I have an input file with the following content
1 1
2 1
3 289
4 1
5 2
0 Clear
1 Warning
2 Indeterminate
3 Minor
4 Major
5 Critical
I want to merge the first type of lines with the messages by the first column and obtain
1 1 Warning
2 1 Indeterminate
3 289 Minor
4 1 Major
5 2 Critical

Just use awk:
awk '$1 in a { print $1, a[$1], $2; next } { a[$1] = $2 }' file
Output:
1 1 Warning
2 1 Indeterminate
3 289 Minor
4 1 Major
5 2 Critical

Using join/sed, sed creates different views of the file for each part and join joins on the common field:
join <(sed '/^[0-9]* [0-9]* *$/!d' input) <(sed '/^[0-9]* [0-9]* *$/d' input)
Gives:
1 1 Warning
2 1 Indeterminate
3 289 Minor
4 1 Major
5 2 Critical

You can do this with Awk:
awk 'BEGIN{n=0}NR>6{n=1}n==0{a[$1]=$2}n==1{print $1,a[$1],$2}' file
or another way:
awk 'NR<=5{a[$1]=$2}$2~/[a-zA-z]+/ && $1>0 && $1<=5{print $1,a[$1],$2}' file

Unix: Count occurrences of similar entries in first column, sum the second column

I have a file with two columns of data, I would like to count the occurrence of similarities in the first column. When two similar entries in the first column are matched, I would like to also sum the value of the second column of the two matched entries.
Example list:
2013-11-13-03 1
2013-11-13-06 1
2013-11-13-13 2
2013-11-13-13 1
2013-11-13-15 1
2013-11-13-15 1
2013-11-13-15 1
2013-11-13-17 1
2013-11-13-23 1
2013-11-14-01 1
2013-11-14-04 6
2013-11-14-07 1
2013-11-14-08 1
2013-11-14-09 1
2013-11-14-09 1
I would like the output to read similar to the following
2013-11-13-03 1 1
2013-11-13-06 1 1
2013-11-13-13 2 3
2013-11-13-15 3 3
2013-11-13-17 1 1
2013-11-13-23 1 1
2013-11-14-01 1 1
2013-11-14-04 1 6
2013-11-14-07 1 1
2013-11-14-08 1 1
2013-11-14-09 2 2
Column 1 is the matched columns from the earlier example column 1, column 2 is the count of matches of column 1 from the earlier example (1 if no other matches), column 3 is the sum of column 2 from the matched column 1 entries from the earlier example. Anyone have any tips on completing this using awk or a mixture of uniq and awk?

Here's a quickie with awk and sort:
awk '
{
counts[$1]++; # Increment count of lines.
totals[$1] += $2; # Accumulate sum of second column.
}
END {
# Iterate over all first-column values.
for (x in counts) {
print x, counts[x], totals[x];
}
}
' file.txt | sort
You can skip the sort if you don't care about the order of output lines.

Here a pure Bash solution
$ cat t
2013-11-13-03 1
2013-11-13-06 1
2013-11-13-13 2
2013-11-13-13 1
2013-11-13-15 1
2013-11-13-15 1
2013-11-13-15 1
2013-11-13-17 1
2013-11-13-23 1
2013-11-14-01 1
2013-11-14-04 6
2013-11-14-07 1
2013-11-14-08 1
2013-11-14-09 1
2013-11-14-09 1
$ declare -A SUM CNT
$ while read ts vl; do (( SUM[$ts]=+$vl )) ; (( CNT[$ts]++ )); done < t
$ for i in "${!CNT[#]}"; do echo "$i ${CNT[$i]} ${SUM[$i]} "; done | sort
2013-11-13-03 1 1
2013-11-13-06 1 1
2013-11-13-13 2 3
2013-11-13-15 3 3
2013-11-13-17 1 1
2013-11-13-23 1 1
2013-11-14-01 1 1
2013-11-14-04 1 6
2013-11-14-07 1 1
2013-11-14-08 1 1
2013-11-14-09 2 2

Paste side by side multiple files by numerical order

I have many files in a directory with similar file names like file1, file2, file3, file4, file5, ..... , file1000. They are of the same dimension, and each one of them has 5 columns and 2000 lines. I want to paste them all together side by side in a numerical order into one large file, so the final large file should have 5000 columns and 2000 lines.
I tried
for x in $(seq 1 1000); do
paste `echo -n "file$x "` > largefile
done
Instead of writing all file names in the command line, is there a way I can paste those files in a numerical order (file1, file2, file3, file4, file5, ..., file10, file11, ..., file1000)?
for example:
file1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
...
file2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
....
file 3
3 3 3 3 3
3 3 3 3 3
3 3 3 3 3
....
paste file1 file2 file3 .... file 1000 > largefile
largefile
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
....
Thanks.

If your current shell is bash: paste -d " " file{1..1000}

you need rename the files with leading zeroes, like
paste <(ls -1 file* | sort -te -k2.1n) <(seq -f "file%04g" 1000) | xargs -n2 echo mv
The above is for "dry run" - Remove the echo if you satisfied...
or you can use e.g. perl
ls file* | perl -nlE 'm/file(\d+)/; rename $_, sprintf("file%04d", $1);'
and after you can
paste file*

With zsh:
setopt extendedglob
paste -d ' ' file<->(n)
<x-y> is to match positive decimal integer numbers from x to y. x and/or y can be omitted so <-> is any positive decimal integer number. It could also be written [0-9]## (## being the zsh equivalent of regex +).
The (n) is the globbing qualifiers. The n globbing qualifier turns on numeric sorting which sorts on all sequences of decimal digits appearing in the file names.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Comparing few colums of a file with columns of another file - shell

Related

How to merge three lines at a time

awk: print first column, then some values, and then all other columns

awk command to merge the content of the same file

Unix: Count occurrences of similar entries in first column, sum the second column

Paste side by side multiple files by numerical order

Categories

Resources