finding common rows in files based on one column - shell

I have 15 files like
file1.csv
a,cg2,0,0,0,21,0
a,cq1,10,0,0,0,0
a,cm2,0,19,0,0,0
...
a,ad10,0,0,0,37,0
file2.csv
d,cm1,0,3,0,0,0
d,cs2,0,32,0,0,0
d,cg2,0,0,9,0,0
...
d,az2,0,0,0,21,0
.
.
.
.
file15.csv
s,sd1,0,23,0,0,0
s,cw1,0,0,7,0,0
s,c23,0,0,90,0,0
...
s,cg2,0,45,0,0,0
Each file has a different number of lines. I want to compare the second field across all 15 files and extract the values that are common to the second field of every file.
In the case above, the output is:
cg2
(assuming it appears in the second field of all 15 files)
I am fairly new to Unix and shell scripting; please help.

Do you want the full lines from each of the fifteen files where field 2 appears in all fifteen files? Or do you only want a list of the field 2 values that appear in all fifteen files?
The former:
a,cg2,0,0,0,21,0
d,cg2,0,0,9,0,0
. . .
s,cg2,0,45,0,0,0
. . .
The latter:
cg2
. . .
If the latter, then this should work
awk -F, '{arr[$2]++; if (FILENAME != prevfile) {c++; prevfile = FILENAME}} END {for (i in arr) {if (arr[i] >= c) {print i}}}' file*.csv
Broken out on multiple lines:
awk -F, '{
arr[$2]++;
if (FILENAME != prevfile) {
c++;
prevfile = FILENAME
}
}
END {
for (i in arr) {
if (arr[i] >= c) {
print i
}
}
}' file*.csv
Explanation:
increment the count of the number of times a field 2 value occurs
if the filename changes, increment the count of files (the first file changes from a null string to its filename and the count increments from 0 to 1)
save the current filename
once all the counting is done, iterate over the array by its keys
if the count contained in the array is greater than or equal to the number of files, the field 2 value is taken to appear in all the files (checking for >= instead of == keeps a value that appears more than once in a single file; note that a value repeated often enough in only some of the files also passes this test)
so print the key (which is a field 2 value)
a glob is used to get all the files, but you could list them explicitly
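The counting approach above can over-count when a value repeats inside one file. A minimal sketch of a variant that counts each value at most once per file, assuming comma-separated input as in the question; common_field2 is a hypothetical helper name, not part of the answer:

```shell
# common_field2 FILE... : print field-2 values that occur in every FILE.
# (common_field2 is a hypothetical name used only for this sketch.)
common_field2() {
    awk -F, '
        FNR == 1 { nfiles++ }                   # a new input file starts (empty files are not counted)
        !seen[FILENAME, $2]++ { count[$2]++ }   # count each value at most once per file
        END {
            for (v in count)
                if (count[v] == nfiles) print v
        }
    ' "$@" | sort
}
```

It would be called as common_field2 file*.csv. Because every value contributes at most 1 per file, the exact comparison with the number of files is safe here.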
Edit:
Here's a way to print the full matching lines using a two-pass technique. It's a modification of the version above. Make sure to list the files twice.
awk -F, '
FILENAME == first && flag {
exit
}
! first {
first = FILENAME
}
FILENAME != first {
flag = 1
}
{
arr[$2]++;
if (FILENAME != prevfile) {
c++;
prevfile = FILENAME
}
}
END {
# print the matching lines
do {
if (arr[$2] >= c) {
print;
}
} while ((getline) > 0);
# print the list of words
for (i in arr) {
if (arr[i] >= c) {
print i
}
}
}' file*.csv file*.csv
It depends on the first file in the first group being the same name as the first file in the second group. Using globbing similar to what I've shown will take care of that requirement.
It prints the matching lines (not grouped, though), then it prints the list of words. If you want only one or the other, comment out or remove the loop that you don't want (do/while or for).
If you print only the full lines, you can pipe the output to:
sort -t , -k2,2
to have them grouped.
Piping only the list of words to:
sort
will put them in the same order for easier comparison.

Fun problem.
One way to do it, entirely in Bash, is as follows.
One thing you will need to invoke is join -t ',' -1 2 -2 2 file1 file2 to join on the second column of two files. Before you can join, though, you must sort on the second column.
Do successive joins in a for-loop, because join takes only two files as arguments.
ADDENDUM
Here is a little transcript showing successive joins. You can adapt it fairly easily, I think.
$ cat 1.csv
a,b,c,d
e,f,g,h
i,j,k,l
$ cat 2.csv
7,5,4,3
3,b,s,e
2,f,5,5
$ cat 3.csv
4,5,6,7
0,0,0,0
1,b,4,4
$ join -t ',' -1 2 -2 2 1.csv 2.csv | cut -f 1 -d ',' > temp
$ cat temp
b
f
$ join -t ',' -2 2 temp 3.csv | cut -f 1 -d ','
b
The first join (on the first two files) produces the joined value in the first column of the result. So as you join to file3, file4, file5, etc., you will be using the first column of the result you are generating, which is why you only need the -2 option. To keep things very efficient, always cut out all but the first column each time you do the join.
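The successive joins described above can be wrapped in a loop. This is only a sketch under a few assumptions: the files are comma-separated, only the list of common keys is wanted, and common_keys is a hypothetical helper name. It cuts out the second column before joining, so each join works on sorted single-column input:

```shell
# common_keys FILE... : print the field-2 values shared by all FILEs,
# using repeated two-way joins. (common_keys is a hypothetical name.)
common_keys() {
    acc=$(mktemp) || return 1
    nxt=$(mktemp) || return 1
    # Start with the sorted, de-duplicated keys of the first file.
    cut -d, -f2 "$1" | LC_ALL=C sort -u > "$acc"
    shift
    for f in "$@"; do
        # Keep only the keys that are also present in the next file.
        cut -d, -f2 "$f" | LC_ALL=C sort -u | LC_ALL=C join -t, "$acc" - > "$nxt"
        mv "$nxt" "$acc"
    done
    cat "$acc"
    rm -f "$acc"
}
```

sort and join are run under the same locale (LC_ALL=C) so that they agree on the ordering join expects.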

Related

How to average the values of different files and save them in a new file

I have about 140 files with data which I would like to process with a script.
The files have two types of names:
sys-time-4-16-80-15-1-1.txt
known-ratio-4-16-80-15-1-1.txt
where the two last numbers vary. The penultimate number takes 1, 50, 100, 150,...,300, and the last number ranges from 1,2,3,4,5...,10. A sample of these files are in this link.
I would like to write a new file with 3 columns as follows:
A 1st column with the penultimate number of the file, i.e., 1,25,50...
A 2nd column with the mean value of the second column in each sys-time-.. file.
A 3rd column with the mean value of the second column in each known-ratio-.. file.
The result might have a row for each pair of averaged 2nd columns of sys and known files:
1 mean-sys-1 mean-know-1
1 mean-sys-2 mean-know-2
.
.
1 mean-sys-10 mean-know-10
50 mean-sys-1 mean-know-1
50 mean-sys-2 mean-know-2
.
.
50 mean-sys-10 mean-know-10
100 mean-sys-1 mean-know-1
100 mean-sys-2 mean-know-2
.
.
100 mean-sys-10 mean-know-10
....
....
300 mean-sys-10 mean-know-10
where each row corresponds with the sys and known files with the same two last numbers.
Besides, I would like to copy in the first column the penultimate number of the files.
I know how to compute the mean value of the second column of a file with awk:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' sys-time-4-16-80-15-1-5.txt
but I do not know how to iterate on all the files and build a result file with the three columns as above.
Here's a shell script that uses GNU datamash to compute the averages (Though you can easily swap out to awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
knownratio=$(printf '%s\n' "$systime" | sed -e 's/sys-time/known-ratio/')
echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
datamash -W mean 2 < "$systime" >> "$sysmeans"
datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per column, and after populating them with the data from each pair of files, one pair per line of each, uses paste to combine them all and print the result to standard output.
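The answer above mentions that datamash can be swapped for awk. A minimal sketch of that variant, assuming the file naming from the question and that every sys-time file has a matching known-ratio file; mean2 and summarize are hypothetical helper names:

```shell
# mean2 FILE : print the mean of column 2 of FILE (nothing if FILE is empty).
mean2() {
    awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$1"
}

# summarize : for each sys-time file in the current directory, print one
# tab-separated row: penultimate number, sys-time mean, known-ratio mean.
summarize() {
    for systime in sys-time-*.txt; do
        [ -e "$systime" ] || continue        # glob matched nothing
        knownratio="known-ratio-${systime#sys-time-}"
        num=$(printf '%s\n' "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt$/\1/')
        printf '%s\t%s\t%s\n' "$num" "$(mean2 "$systime")" "$(mean2 "$knownratio")"
    done
}
```

Running summarize > result.txt in the data directory would write the three columns without any temporary files, at the cost of starting awk twice per pair.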
I've used GNU Awk for easy, per-file operations. This is untested; please let me know how it runs. You might want to look into printf() for pretty-printed output.
mapfile -t Files < <(find . -type f -name "*-4-16-80-15-*" | sort -t- -k7,7 -k8,8) #1
gawk '
BEGINFILE {n=split(FILENAME, f, "-"); type=f[1]; sub(/.*\//, "", type); a[type]=0; c=0} #2
{a[type] = ($2 + a[type] * c++) / c} #3
ENDFILE {if(type=="sys") print f[n], a["sys"], a["known"]} #4
' "${Files[@]}"
#1: Create a Bash array with matching files sorted by the last two "keys". We will feed this array to Awk later. Notice how we alternate between "sys" and "known" files in this sample:
./known-ratio-4-16-80-15-2-150
./sys-time-4-16-80-15-2-150
./known-ratio-4-16-80-15-3-1
./sys-time-4-16-80-15-3-1
./known-ratio-4-16-80-15-3-50
./sys-time-4-16-80-15-3-50
#2: At the beginning of every file, clear any existing average value and save the type as either "sys" or "known".
#3: On every line, calculate the Cumulative Moving Average.
#4: At the end of every file, check the file type. If we just handled a "sys" file, print the last part of the filename followed by our averages.

Add the values of each line of only one column from many files with 2 columns - shell scripting

I have many files that have this structure that have two columns of numbers. And I want to add each line value of the second column, for all of my files, so I'll end up with only one file. Can anyone help? Hope the question was clear enough. Thanks.
The following is based on the information OP provided in his comments here above:
We have multiple files and we have to sum the second column of each of these files. As far as we know we could have hundreds or thousands of different files
The first column in each file seems not important and I'm going to assume (based on OP sample data) we have the same (first) column in each input file
The basic idea is to start with an empty summary (file tot), paste each file in turn next to tot, and sum columns 2 and 4 (if present) into the second column of the new tot file.
In other words...
$ touch tot ; for f in * ; do paste tot "$f" | awk '{ if ( NF > 3 ) { print $1, $2+$4 } else { print $1, $2 } }' > tmp ; mv tmp tot ; done
I did test it with 8 different files and seems to work as expected.
Of course for f in * has to be changed in order to capture ALL and ONLY the files we want to sum.
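The same line-by-line sum can also be done in a single awk pass over all the files, without the tot and tmp files. A minimal sketch, assuming (as above) that the first column is the same in every file; sum_col2 is a hypothetical helper name:

```shell
# sum_col2 FILE... : add up column 2 line by line across all FILEs,
# keeping column 1 from the first file that has that line.
# (sum_col2 is a hypothetical name used only for this sketch.)
sum_col2() {
    awk '
        FNR > max { max = FNR }            # length of the longest file seen so far
        !(FNR in key) { key[FNR] = $1 }    # remember the first column once per line number
        { total[FNR] += $2 }               # accumulate column 2 for this line number
        END { for (i = 1; i <= max; i++) print key[i], total[i] }
    ' "$@"
}
```

It would be run as sum_col2 data*.txt > tot, with the glob adjusted to match only the files to be summed.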
Assuming what you want is the sum of all values of the second column of each file, it looks like a simple enough job for awk:
cat files | awk '{ sum += $2 } END { print sum }'

Print lines indexed by a second file

I have two files:
File with strings (new line terminated)
File with integers (one per line)
I would like to print the lines from the first file indexed by the lines in the second file. My current solution is to do this
while read index
do
sed -n ${index}p $file1
done < $file2
It essentially reads the index file line by line and runs sed to print that specific line. The problem is that it is slow for large index files (thousands and ten thousands of lines).
Is it possible to do this faster? I suspect awk can be useful here.
I searched SO as best I could but could only find people trying to print line ranges instead of indexing by a second file.
UPDATE
The index is generally not shuffled. It is expected for the lines to appear in the order defined by indices in the index file.
EXAMPLE
File 1:
this is line 1
this is line 2
this is line 3
this is line 4
File 2:
3
2
The expected output is:
this is line 3
this is line 2
If I understand you correctly, then
awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile
should work, under the assumption that the index is sorted in ascending order or you want lines to be printed in their order in the data file regardless of the way the index is ordered. This works as follows:
NR == FNR { # while processing the first file
selected[$1] = 1 # remember if an index was seen
next # and do nothing else
}
selected[FNR] # after that, select (print) the selected lines.
If the index is not sorted and the lines should be printed in the order in which they appear in the index:
NR == FNR { # processing the index:
++counter
idx[$0] = counter # remember that and at which position you saw
next # the index
}
FNR in idx { # when processing the data file:
lines[idx[FNR]] = $0 # remember selected lines by the position of
} # the index
END { # and at the end: print them in that order.
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This can be inlined as well (with semicolons after ++counter and idx[$0] = counter), but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile. With an index file
1
4
3
and a data file
line1
line2
line3
line4
this will print
line1
line4
line3
The remaining caveat is that this assumes that the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file and remember the lines for each position. That is:
NR == FNR {
++counter
idx[$0] = idx[$0] " " counter # remember a list here
next
}
FNR in idx {
split(idx[FNR], pos) # split that list
for(p in pos) {
lines[pos[p]] = $0 # and remember the line for
# all positions in them.
}
}
END {
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This, finally, is the functional equivalent of the code in the question. How complicated you have to go for your use case is something you'll have to decide.
This awk script does what you want:
$ cat lines
1
3
5
$ cat strings
string 1
string 2
string 3
string 4
string 5
$ awk 'NR==FNR{a[$0];next}FNR in a' lines strings
string 1
string 3
string 5
The first block only runs for the first file, where the line number for the current file FNR is equal to the total line number NR. It sets a key in the array a for each line number that should be printed. next skips the rest of the instructions. For the file containing the strings, if the line number is in the array, the default action is performed (so the line is printed).
Use nl to number the lines in your strings file, then use join to merge the two:
~ $ cat index
1
3
5
~ $ cat strings
a
b
c
d
e
~ $ join index <(nl strings)
1 a
3 c
5 e
If you want the inverse (show lines that are NOT in your index):
$ join -v 2 index <(nl strings)
2 b
4 d
Mind also the comment by @glennjackman: if your files are not lexically sorted, then you need to sort them before passing in:
$ join <(sort index) <(nl strings | sort -b)
In order to complete the answers that use awk, here's a solution in Python that you can use from your bash script:
cat << EOF | python3
lines = []
with open("$file2") as f:
    for line in f:
        lines.append(int(line))
i = 0
with open("$file1") as f:
    for line in f:
        i += 1
        if i in lines:
            print(line, end="")
EOF
The only advantage here is that Python is way more easy to understand than awk :).

Find nth row using AWK and assign them to a variable

Okay, I have two files: one is baseline and the other is a generated report. I have to validate a specific string in both the files match, it is not just a single word see example below:
.
.
name os ksd
56633223223
some text..................
some text..................
My search criteria here is to find unique number such as "56633223223" and retrieve above 1 line and below 3 lines, i can do that on both the basefile and the report, and then compare if they match. In whole i need shell script for this.
Since the strings above and below are unique but the line count varies, I had put it in a file called "actlist":
56633223223 1 5
56633223224 1 6
56633223225 1 3
.
.
Now from below "Rcount" I get how many iterations to be performed, and in each iteration i have to get ith row and see if the word count is 3, if it is then take those values into variable form and use something like this
I'm stuck at the below, which command to be used. I'm thinking of using AWK but if there is anything better please advise. Here's some pseudo-code showing what I'm trying to do:
xxxxx=/root/xxx/xxxxxxx
Rcount=`wc -l $xxxxx | awk -F " " '{print $1}'`
i=1
while ((i <= Rcount))
do
record=_________________'(Awk command to retrieve ith(1st) record (of $xxxx),
wcount=_________________'(Awk command to count the number of words in $record)
(( i=i+1 ))
done
Note: record, wcount values are later printed to a log file.
Sounds like you're looking for something like this:
#!/bin/bash
while read -r word1 word2 word3 junk; do
if [[ -n "$word1" && -n "$word2" && -n "$word3" && -z "$junk" ]]; then
echo "all good"
else
echo "error"
fi
done < /root/shravan/actlist
This will go through each line of your input file, assigning the three columns to word1, word2 and word3. The -n tests that read hasn't assigned an empty value to each variable. The -z checks that there are only three columns, so $junk is empty.
I PROMISE you you are going about this all wrong. To find words in file1 and search for those words in file2 and file3 is just:
awk '
NR==FNR{ for (i=1;i<=NF;i++) words[$i]; next }
{ for (word in words) if ($0 ~ word) print FILENAME, word }
' file1 file2 file3
or similar (assuming a simple grep -f file1 file2 file3 isn't adequate). It DOES NOT involve shell loops to call awk to pull out strings to save in shell variables to pass to other shell commands, etc, etc.
So far all you're doing is asking us to help you implement part of what you think is the solution to your problem, but we're struggling to do that because what you're asking for doesn't make sense as part of any kind of reasonable solution to what it sounds like your problem is so it's hard to suggest anything sensible.
If you tell us what you are trying to do AS A WHOLE with sample input and expected output for your whole process then we can help you.
We don't seem to be getting anywhere so let's try a stab at the kind of solution I think you might want and then take it from there.
Look at these 2 files "old" and "new" side by side (line numbers added by the cat -n):
$ paste old new | cat -n
1 a b
2 b 56633223223
3 56633223223 c
4 c d
5 d h
6 e 56633223225
7 f i
8 g Z
9 h k
10 56633223225 l
11 i
12 j
13 k
14 l
Now let's take this "actlist":
$ cat actlist
56633223223 1 2
56633223225 1 3
and run this awk command on all 3 of the above files (yes, I know it could be briefer, more efficient, etc. but favoring simplicity and clarity for now):
$ cat tst.awk
ARGIND==1 {
numPre[$1] = $2
numSuc[$1] = $3
}
ARGIND==2 {
oldLine[FNR] = $0
if ($0 in numPre) {
oldHitFnr[$0] = FNR
}
}
ARGIND==3 {
newLine[FNR] = $0
if ($0 in numPre) {
newHitFnr[$0] = FNR
}
}
END {
for (str in numPre) {
if ( str in oldHitFnr ) {
if ( str in newHitFnr ) {
for (i=-numPre[str]; i<=numSuc[str]; i++) {
oldFnr = oldHitFnr[str] + i
newFnr = newHitFnr[str] + i
if (oldLine[oldFnr] != newLine[newFnr]) {
print str, "mismatch at old line", oldFnr, "new line", newFnr
print "\t" oldLine[oldFnr], "vs", newLine[newFnr]
}
}
}
else {
print str, "is present in old file but not new file"
}
}
else if (str in newHitFnr) {
print str, "is present in new file but not old file"
}
}
}
.
$ awk -f tst.awk actlist old new
56633223225 mismatch at old line 12 new line 8
j vs Z
It's outputting that result because the 2nd line after 56633223225 is j in file "old" but Z in file "new" and the file "actlist" said the 2 files had to be common from one line before until 3 lines after that pattern.
Is that what you're trying to do? The above uses GNU awk for ARGIND but the workaround is trivial for other awks.
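For awks without ARGIND, one possible workaround is to keep a file counter by hand. A minimal sketch, assuming none of the input files is empty (an empty file never reaches FNR == 1, so it would not advance the counter); fileno_demo and the variable argind are hypothetical names:

```shell
# fileno_demo FILE... : print "<file index> <line>" for every input line,
# tracking the index in a plain variable instead of gawk's ARGIND.
fileno_demo() {
    awk '
        FNR == 1 { argind++ }    # FNR restarts at 1 whenever a new file begins
        { print argind, $0 }
    ' "$@"
}
```

In tst.awk the same idea would mean putting the FNR == 1 rule first and testing argind==1, argind==2 and argind==3 in place of the ARGIND conditions.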
Use the below code:
awk '{if (NF == 3) { word1=$1; word2=$2; word3=$3; print "Words are:" word1, word2, word3} else {print "Line", NR, "is having", NF, "Words" }}' filename.txt
I have given the solution as per the requirement.
awk '{ # awk starts from here and read a file line by line
if (NF == 3) # It will check if current line is having 3 fields. NF represents number of fields in current line
{ word1=$1; # If current line is having exact 3 fields then 1st field will be assigned to word1 variable
word2=$2; # 2nd field will be assigned to word2 variable
word3=$3; # 3rd field will be assigned to word3 variable
print word1, word2, word3} # It will print all 3 fields
}' filename.txt >> output.txt # These 3 fields will be redirected to a file which can be used for further processing.
This is as per the requirement, but there are many other ways of doing this but it was asked using awk.

Detect if a series of numbers is sequential in bash/awk

So I have a series of scripts that generate intermediary text files along the way as a means of storing information across different scripts. Essentially the scripts detect rows within data that have been approved by the user for removal. The line numbers that are to be removed from the source file are stored in a file.
For example, say I have a source data file like this:
a1,b1,c1,d1
a2,b2,c2,d2
a3,b3,c3,d3
a4,b4,c4,d4
a5,b5,c5,d5
a6,b6,c6,d6
a7,b7,c7,d7
And the intermediary file would contain something like this:
1 3 4 5 6
Which would result, when the script is run, in an output data file as follows:
a2,b2,c2,d2
a7,b7,c7,d7
This all works fine, there is nothing to fix in this code. The problem is, when I'm dealing with actual data files sometimes there are literally thousands of numbers stored in the intermediary file for removal. This means I can't use a loop, because it will take a massive amount of time, and my current method of using sed gets overloaded with an error: too many arguments. Many of the line numbers are consecutive, so here's where I get to my question:
Is there a way in bash or awk to detect whether a series of space-separated numbers are consecutive?
I can sort out everything beyond that, I'm just stumped on how I could do this in one/a series of step(s). My plan, if I can detect consecutive values, is to change the intermediary file from:
1 3 4 5 6
To:
1 3-6
And then I'll be able to write code that will run on each range of values in a more manageable way.
If possible I'd like to avoid looping through each value and checking individually whether or not it's one step above the previous value, since I'm dealing with tens of thousands of numbers in a list.
If this is not possible in bash/awk, is there another way to accomplish this task to reduce the overall number of arguments passed to my script and greatly reduce the chances of encountering an error for too many arguments?
What about this?
BEGIN {
getline < "intermediate.txt"
split($0, skippedlines, " ")
skipindex = 1
}
{
if (skippedlines[skipindex] == NR)
++skipindex;
else
print
}
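The script above walks the list in order, so it relies on the line numbers being sorted ascending and on all of them sitting on the first line of intermediate.txt. A minimal sketch without those two assumptions, which keeps the numbers in an awk array instead; drop_lines is a hypothetical helper name, and the number file must not be empty (otherwise the NR == FNR test also matches the source file):

```shell
# drop_lines IDS SOURCE : print SOURCE without the lines whose numbers
# are listed (whitespace-separated, any order) in IDS.
drop_lines() {
    awk '
        NR == FNR { for (i = 1; i <= NF; i++) del[$i]; next }   # collect the numbers to delete
        !(FNR in del)                                           # print every other line
    ' "$1" "$2"
}
```

It would be called as drop_lines intermediate.txt source_file > output_file. No per-number command-line arguments are involved, so the "too many arguments" limit does not apply.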
Use cat, join, and cut:
Files infile and ids:
a1,b1,c1,d1 1
a2,b2,c2,d2 3
a3,b3,c3,d3 4
a4,b4,c4,d4 5
a5,b5,c5,d5 6
a6,b6,c6,d6
a7,b7,c7,d7
Removal of selected lines:
$ join -v 2 ids <(cat -n infile) | cut -f 2 -d ' '
a2,b2,c2,d2
a7,b7,c7,d7
What's going on:
First, the initial file receives an id on each line, with cat -n infile;
then, the resulting file is joined on the first column with the file holding the ids;
only non-matching lines from second file are printed -- join -v 2;
the first column, with the ids, is removed;
and, it's a neat shell one-liner (:
In case your file with ids is written as a single line, you can still make use of the above one-liner, simply adding a translation on the file with ids, as follows:
$ join -v 2 <(tr ' ' '\n' < ids) <(cat -n infile) | cut -f 2 -d ' '
@jmihalicza's answer nicely uses awk to solve the whole problem of selecting the lines from source file that match those in the intermediate file. For completeness, the following awk program reduces the list of individual line numbers to ranges, where possible, which I think answers the original question:
{ for (j = 1; j <= NF; j++) {
lin[i++] = $j;
}
}
END {
start = lin[0];
j = 1;
while (j <= i) {
end = start
while (lin[j] == (lin[j-1]+1)) {
end = lin[j++];
}
if ((end+0) > (start+0)) {
printf "%d-%d ",start,end
} else {
printf "%d ",start
}
start = lin[j++];
}
}
Given this script, which I've called merge.awk and a file testlin.txt as follows:
1 3 4 5 6 9 10 11 13 15
... we can do this:
$ awk -f merge.awk <testlin.txt
1 3-6 9-11 13 15
This might work for you (GNU sed):
sed -r 's/\S+/&d/g;s/\s+/\n/g' intermediate_file | sed -f - source_file
Change the intermediate file into a sed script.
