Compare multiple TSV files for matching columns - bash

I have 4466 .tsv files with this structure:
[screenshot of the file structure]
I want to compare the 4466 files to see how many IDs (the first column) match.
I have only found bash commands, like "comm", that compare two files. Could you tell me how I could do that?
Thank you

I read your question as:
Amongst all TSV files, which column IDs are found in every file?
If that's true, we want the intersection of all the sets of column IDs from all files. We can use the join command to get the intersection of any two files, and we can use the algebraic properties of intersection to effectively join all the files.
Consider the intersection of ID for these three files:
file1.tsv    file2.tsv    file3.tsv
---------    ---------    ---------
ID           ID           ID
1            1            2
2            3            3
3
"3" is the only ID shared between all three. We can only join two files together at a time, so we need some way to effectively get, join (join file1.tsv file2.tsv) file3.tsv. Fortunately for us intersections are idempotent and associative, so we can apply join iteratively in a loop over all the files, like so:
# "Prime" the common file
cp file1.tsv common.tsv
for TSV in file*.tsv; do
join "$TSV" common.tsv > myTmp
mv myTmp common.tsv
echo "After joining $TSV, common IDs are:"
cat common.tsv
done
When I run that it prints the following:
After joining file1.tsv, common IDs are:
ID
1
2
3
After joining file2.tsv, common IDs are:
ID
1
3
After joining file3.tsv, common IDs are:
ID
3
The first iteration joins file1 with itself (because we primed common with file1); this is where we need intersection to be idempotent.
The second iteration joins in file2, cutting out ID "2"
The third iteration joins in file3, cutting ID down to just "3"
Technically, join considers the string "ID" to be one of the things to evaluate... it doesn't know what a header line is, or what an ID is... it just knows to look in some number of fields for common values. In that example we didn't specify a field, so it defaulted to the first field, and it always found "ID" and it always found "3".
For your files, we need to tell join to:
separate on a tab character, with -t <TAB-CHAR>
only output the join field (which, by default, is the first field), with -o 0
Here's my full implementation:
#!/bin/sh
TAB="$(printf '\t')"

# myJoin joins tsvX with the previously-joined common on
# the first field of both files, saving the first field
# of the joined output back into common
myJoin() {
    tsvX="$1"
    join -t "$TAB" -o 0 common.tsv "$tsvX" > myTmp.tsv
    mv myTmp.tsv common.tsv
}

# "Prime" common
cp input1.tsv common.tsv

for TSV in input*.tsv; do
    myJoin "$TSV"
done

echo "The common IDs are:"
tail -n +2 common.tsv    # skip the "ID" header line
For an explanation of why "$(printf '\t')" is used, check out the following links on POSIX compliance:
https://www.shellcheck.net/wiki/SC3003
https://unix.stackexchange.com/a/468048/366399
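One caveat worth adding: join expects both of its inputs to be sorted on the join field, and unsorted input can trigger warnings or missed matches. If your TSV files aren't already sorted on the first column, a minimal sketch (sorted/ is a hypothetical working directory) is to sort working copies first and run the join loop over those:

#!/bin/sh
TAB="$(printf '\t')"
mkdir -p sorted
# Sort every input on its first field so join sees ordered data.
for TSV in input*.tsv; do
    sort -t "$TAB" -k 1,1 "$TSV" > "sorted/$TSV"
done
# ...then prime common.tsv from sorted/ and run the join loop over those copies.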

The question sounds quite vague. So, assuming that you want to extract IDs that all 4466 files have in common, i.e. IDs such that each of them occurs at least once in all of the *.tsv files, you can do this (e.g.) in pure Bash using associative arrays and calculating “set intersections” on them.
#!/bin/bash

# removes all IDs from array $1 that do not occur in array $2.
intersect_ids() {
    local -n acc="$1"
    local -rn operand="$2"
    local id
    for id in "${!acc[@]}"; do
        ((operand["$id"])) || unset "acc['${id}']"
    done
}

# prints IDs that occur in all files called *.tsv in directory $1.
get_ids_intersection() (
    shopt -s nullglob
    local -ar files=("${1}/"*.tsv)
    local -Ai common_ids next_ids
    local file id _
    if ((${#files[@]})); then
        while read -r id _; do ((++common_ids["$id"])); done < "${files[0]}"
        for file in "${files[@]:1}"; do
            while read -r id _; do ((++next_ids["$id"])); done < "$file"
            intersect_ids common_ids next_ids
            next_ids=()
        done
    fi
    for id in "${!common_ids[@]}"; do printf '%s\n' "$id"; done
)

get_ids_intersection /directory/where/tsv/files/are
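A couple of portability notes: the namerefs (local -n, local -rn) used by intersect_ids require bash 4.3 or newer, and associative arrays require bash 4.0 or newer. As a quick sanity check on a pair of hypothetical throwaway files:

mkdir -p /tmp/tsv-demo
printf 'ID\tname\n1\tx\n2\ty\n3\tz\n' > /tmp/tsv-demo/a.tsv
printf 'ID\tname\n1\tx\n3\tz\n' > /tmp/tsv-demo/b.tsv
get_ids_intersection /tmp/tsv-demo
# prints 1, 3 and the literal header "ID" (in arbitrary order),
# since "ID" also occurs in the first column of every file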

Related

Rename list of files based off of a list and a directory of files

So, I have a master file - countryCode.tsv, which goes like this,
01 united_states
02 canada
etc.
I have another list of country files, which go like this,
united_states.txt
Wyoming
Florida
etc.
canada.txt
some
blah
shit
etc.
and, I have a list of files that are named like this,
01_1
01_2
02_1
02_2
etc.
The first part of the filename corresponds to the country code in the first list, and the second part corresponds to the line number in the country file.
For example,
01_02 would contain the info related to Florida (United States).
Now, here comes my question:
how do I rename these numerically named files to the country_state format? For example,
01_02 becomes united_states_florida
The way I would do this is to first read all of the countries into an associative array, then I would iterate over that array looking for '.txt' files for each country. When I find one, read each line in turn and look for a file that matches the country code and the line number from that file. If found, rename it.
Here is some sample code:
#!/bin/bash

declare -A countries    # countries is an associative array.

while read code country; do
    if [ ${#code} -ne 0 ]; then    # Ignore blank lines.
        countries[${code}]=${country}
    fi
done < countryCodes.txt    # countryCodes.txt is STDIN for the while
                           # loop, which is passed on to the read command.

for code in "${!countries[@]}"; do    # Iterate over the array indices.
    counter=0
    country=${countries[${code}]}
    if [ -r "${country}.txt" ]; then    # In case country file does not exist.
        while read state; do
            ((counter++))
            if [ -f "${code}_${counter}" ]; then
                mv "${code}_${counter}" "${country}_${state}"
            fi
        done < "${country}.txt"
    fi
done
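One hedged aside: the question shows both 01_1 and 01_02 style names. The loop above builds unpadded names (01_1, 01_2, ...); if the real files zero-pad the line number, you could format the counter before testing for the file, e.g.:

printf -v padded '%02d' "${counter}"    # "2" -> "02" (assumed two-digit padding)
if [ -f "${code}_${padded}" ]; then
    mv "${code}_${padded}" "${country}_${state}"
fi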

How can I create array of lines in this case?

Given a file where each line can contain more than one word, with a single space between any two words, for example:
a a a a
b b b b
c c
d d
a a a a
How can I create an array so that cell number i holds line number i, but WITHOUT DUPLICATES BETWEEN THE ELEMENTS IN THE ARRAY!
According to the file above, we would need to create this array:
Array[0]="a a a a" , Array[1]="b b b b" , Array[2]="c c" , Array[3]="d d".
(The name of the file is passed to the script as an argument.)
I know how to create an array that will contain all the lines, something like this:
Array=()
while read line; do
    Array=("${Array[@]}" "${line}")
done < "$1"
But how can I pass the sorted (and uniq'd) output of the file to the while read loop?
You should be able to use done < <(sort "$1" | uniq) in place of done < $1.
The <() syntax (process substitution) creates a file-like object from a subshell that executes a separate set of commands.
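Put together, a minimal sketch of the whole script; note that sort "$1" | uniq returns the lines in sorted order rather than in order of first appearance, which happens to coincide for the sample file above:

#!/bin/bash
Array=()
while read -r line; do
    Array=("${Array[@]}" "${line}")
done < <(sort "$1" | uniq)

printf '%s\n' "${Array[@]}"    # one deduplicated line per array cell

On bash 4+, mapfile -t Array < <(sort -u "$1") collapses the loop into a single line.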

Bash script to efficiently return two file names that both contain a string found in a list

I'm trying to find duplicates of a string ID across files. Each of these IDs are unique and should be used in only one file. I am trying to verify that each ID is only used once, and the script should tell me the ID which is duplicated and in which files.
This is an example of the set.csv file
"Read-only",,"T","ID6776","3.1.1","Text","?"
"Read-only",,"T","ID4294","3.1.1.1","Text","?"
"Read-only","ID","T","ID7294","a )","Text","?"
"Read-only","ID","F","ID8641","b )","Text","?"
"Read-only","ID","F","ID8642","c )","Text","?"
"Read-only","ID","T","ID9209","d )","Text","?"
"Read-only","ID","F","ID3759","3.1.1.2","Text","?"
"Read-only",,"F","ID2156","3.1.1.3","
This is the very inefficient code I wrote
for ID in $(grep 'ID\"\,\"[TF]' set.csv | cut -c 23-31); do
    for FILE1 in *.txt; do
        for FILE2 in *.txt; do
            if [[ $FILE1 -nt $FILE2 && `grep -E '$ID' $FILE1 $FILE2` ]]; then
                echo $ID + $FILE1 + $FILE2
            fi
        done
    done
done
Essentially I'm only interested in ID#s that are identified as "ID" in the CSV which would be 7294, 8641, 8642, 9209, 3759 but not the others. If File1 and File2 both contain the same ID from this set then it would print out the duplicated ID and each file that it is found in.
There might be thousands of IDs, and files so my exponential approach isn't at all preferred. If Bash isn't up to it I'll move to sets, hashmaps and a logarithmic searching algorithm in another language... but if the shell can do it I'd like to know how.
Thanks!
Edit: Bonus would be to find which IDs from the set .csv aren't used at all. A pseudo code for another language might be create a set for all the IDs in the csv, then make another set and add to it IDs found in the files, then compare the sets. Can bash accomplish something like this?
A linear option would be to use awk to store discovered identifiers with their corresponding filename, then report when an identifier is found again. Assuming the *.txt files use the same quoted CSV layout as set.csv:
awk -F, '$2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\"") {
    id = substr($4, 4, 4)
    if (ids[id]) {
        print id " is in " ids[id] " and " FILENAME;
    } else {
        ids[id] = FILENAME;
    }
}' *.txt
The awk script looks through every *.txt file; it splits the fields based on commas (-F,). If field 2 is "ID" and field 3 is "T" or "F", then it extracts the numeric ID from field 4. If that ID has been seen before, it reports the previous file and the current filename; otherwise, it saves the id with an association to the current filename.
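For the bonus question (IDs from set.csv that are never used in any file), a hedged sketch along the same lines: load the "ID"-flagged identifiers from set.csv into an awk array first, mark each one off as the *.txt files are scanned, and print whatever is left at the end. This assumes the *.txt files contain the four-digit ID strings verbatim:

awk -F, '
    # First file (set.csv): collect the identifiers flagged as "ID".
    FILENAME == "set.csv" && $2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\"") {
        wanted[substr($4, 4, 4)] = 1
        next
    }
    # Remaining files: mark every wanted ID we encounter as used.
    FILENAME != "set.csv" {
        for (id in wanted) if (!(id in used) && index($0, id)) used[id] = 1
    }
    END {
        for (id in wanted) if (!(id in used)) print id " is never used"
    }
' set.csv *.txt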

reformatting text file from rows to column

I have multiple files in a directory that I need to reformat, putting the output in one file. The file structure is:
========================================================
Daily KPIs - DATE: 24/04/2013
========================================================
--------------------------------------------------------
Number of des = 5270
--------------------------------------------------------
Number of users = 210
--------------------------------------------------------
Number of active = 520
--------------------------------------------------------
Total non = 713
--------------------------------------------------------
========================================================
I need the output format to be:
Date,Numberofdes,Numberofusers,Numberofactive,Totalnon
24042013,5270,210,520,713
The directory has around 1500 files with the same format, and I'm using CentOS 7.
Thanks
First we need a method to join the elements of an array into a string (cf. Join elements of an array?):
function join_array()
{
    local IFS=$1
    shift
    echo "$*"
}
Then we can cycle over each of the files and convert each one into a comma-separated list (assuming that the original files have names ending in .txt).
for f in *.txt
do
    sed -n 's/[^:=]\+[:=] *\(.*\)/\1/p' < "$f" | {
        mapfile -t fields
        join_array , "${fields[@]}"
    }
done
Here, the sed command looks inside each input file for lines that:
begin with a substring that contains neither a : nor a = character (the [^:=]\+ part);
then follow a : or a = and an arbitrary number of spaces (the [:=] * part);
finally, end with an arbitrary substring (the \(.*\) part).
The last substring is then captured and printed instead of the original string. Any other line in the input files is discarded.
After that, the output of sed is read by mapfile into the indexed array variable fields (the -t ensures that trailing newlines from each line read are discarded) and finally the lines are joined thanks to our previously-defined join_array method.
The reason why we need to wrap mapfile inside a subshell is explained here: readarray (or pipe) issue.
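Tying it together, a sketch that writes everything into a single CSV with the header from the question (all_kpis.csv is a hypothetical output name). Since the sed capture keeps the date as 24/04/2013, the slashes are stripped with tr to match the requested 24042013:

#!/bin/bash
function join_array()
{
    local IFS=$1
    shift
    echo "$*"
}

echo 'Date,Numberofdes,Numberofusers,Numberofactive,Totalnon' > all_kpis.csv
for f in *.txt
do
    sed -n 's/[^:=]\+[:=] *\(.*\)/\1/p' < "$f" | tr -d '/' | {
        mapfile -t fields
        join_array , "${fields[@]}"
    }
done >> all_kpis.csv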

Element matching within the different lists

As usual :) I have two sets of input files with the same names but different extensions.
Using bash I made a simple script which creates 2 lists of identical element names while looping over the 2 sets of files from 2 dirs, using only the names without extensions as the elements of those lists:
#!/bin/bash

workdir=/data2/Gleb/TEST/claire+7d4+md/water_analysis/MD1
traj_all=${workdir}/tr_all
top_all=${workdir}/top_all

# make 2 lists for both file types
Trajectories=('')
Topologies=('')

# looping over 1st input files
echo "Trr has been found in ${traj_all}:"
for tr in ${traj_all}/*; do # ????
    tr_n_full=$(basename "${tr}")
    tr_n="${tr_n_full%.*}"
    Trajectories=("${Trajectories[@]}" "${tr_n}")
done
# sort elements within ${Trajectories[@]} list!! >> HERE I NEED HELP!

# looping over 2nd files
echo "Top has been found in ${top_all}:"
for top in ${top_all}/*; do # ????
    top_n_full=$(basename "${top}")
    top_n="${top_n_full%.*}"
    Topologies=("${Topologies[@]}" "${top_n}")
done
# sort elements within ${Topologies[@]} list!! >> HERE I NEED HELP!

# make input.in file for some program - matching of elements from both lists >> HERE I NEED HELP!
for i in $(seq 1 ${#Topologies[@]}); do
    printf "parm $top_all/${Topologies[i]}.top \ntrajin $traj_all/${Trajectories[i]}.mdcrd\nwatershell ${Area} ${output}/watershell_${Topologies[i]}_${Area}.dat\n" >> output.in
done
I'd be thankful if someone could provide me with a good way to improve this script:
1) I need to sort the elements in both lists in the same way after the last elements have been added to each of them;
2) On the LAST step of the script, I need to add a test that creates the final output.in file only if the elements matched by the printf are the same in both lists (in principle, in this case they always should be the same!).
Thanks for the help,
Gleb
Here's a simpler way to create the arrays:
# Create an empty array
Trajectories=()
for tr in "${traj_all}"/*; do
    # Remove the string of directories
    tr_base=${tr##*/}
    # Append the name without extension to the array
    Trajectories+=("${tr_base%.*}")
done
In bash, this will normally result in a sorted list, because the expansion of * in the glob is sorted. But you can sort it with sort; it is simplest if you are certain there are no newlines in the filenames:
mapfile -t sorted_traj < <(printf %s\\n "${Trajectories[@]}" | sort)
To compare two sorted arrays, you could use join:
# convenience function; could have helped above, too.
lines() { printf %s\\n "$@"; }

# Some examples:
# 1. compare a with b and print the lines which are only in a
join -v 1 -t '' <(lines "${a[@]}") <(lines "${b[@]}")
# 2. create c as an array with the lines which are only in b
mapfile -t c < <( join -v 2 -t '' <(lines "${a[@]}") <(lines "${b[@]}") )
If you create both difference lists, then the two arrays are equal if both lists are empty. If you are expecting both arrays to be the same, though, and if this is time-critical (probably not), you could do a simple precheck:
if [[ "${a[*]}" = "${b[*]}" ]]; then
# the arrays are the same
else
# the arrays differ; do some more work to see how.
fi
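Putting the pieces together for the two requests in the question, a sketch using the names from the question (${Area} and ${output} are assumed to be set elsewhere): sort both lists the same way, and only write output.in when they match:

# Sort both name lists the same way.
mapfile -t sorted_traj < <(printf %s\\n "${Trajectories[@]}" | sort)
mapfile -t sorted_top < <(printf %s\\n "${Topologies[@]}" | sort)

# Only generate output.in when the two sorted lists match exactly.
if [[ "${sorted_traj[*]}" = "${sorted_top[*]}" ]]; then
    for i in "${!sorted_top[@]}"; do
        printf 'parm %s\ntrajin %s\nwatershell %s %s\n' \
            "$top_all/${sorted_top[i]}.top" \
            "$traj_all/${sorted_traj[i]}.mdcrd" \
            "${Area}" "${output}/watershell_${sorted_top[i]}_${Area}.dat"
    done > output.in
else
    echo "Topology and trajectory lists differ; output.in not written." >&2
fi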
