Shell script - copy lines from file by key - bash

I have two input files such that:
file1
123
456
789
file2
123|foo
456|bar
999|baz
I need to copy the lines from file2 whose keys are in file1, so the end result is:
file3
123|foo
456|bar
Right now, I'm using a shell script that loops through the key file and uses grep for each one:
grep "^${keys[$keyindex]}|" $datafile >&4
But as you can imagine, this is extremely slow. The key file (file1) has approximately 400,000 keys and the data file (file2) has about 750,000 rows. Is there a better way to do this?

You can try using join:
join -t'|' file1.txt file2.txt > file3.txt
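Note that join expects both inputs to be already sorted on the join field. If your real files aren't sorted like the samples above, a sketch using bash process substitution to sort them on the fly (adjust the file names to yours):
# join needs both inputs sorted on the key field; sort them on the fly
join -t'|' <(sort file1.txt) <(sort -t'|' -k1,1 file2.txt) > file3.txt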

I would use something like Python, which would process it pretty fast if you used an optimized data type like set. Not sure of your exact requirements, so you would need to adjust accordingly.
#!/usr/bin/python
# Create a set to store all of the keys in file1 (membership tests on a set are fast)
Set1 = set()
for line in open('file1', 'r'):
    Set1.add(line.strip())

# Open a file to write to
file4 = open('file4', 'w')

# Loop over file2, and only write out the lines whose key is found in Set1
for line in open('file2', 'r'):
    if '|' not in line:
        continue
    key = line.split('|', 1)[0]
    if key in Set1:
        file4.write(line)  # keep the whole line, e.g. "123|foo"
file4.close()

join is the best solution, if sorting is OK. An awk solution:
awk -F \| '
FILENAME==ARGV[1] {key[$1];next}
$1 in key
' file1 file2
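If you'd rather stay with grep, a single invocation with a pattern file avoids forking grep once per key; a sketch, assuming the keys contain no regex metacharacters (keys.pat is just an illustrative scratch-file name, and it's worth benchmarking against the join and awk approaches above):
# Turn each key "123" into the anchored pattern "^123|" once, then scan file2 in a single pass
sed 's/.*/^&|/' file1 > keys.pat
grep -f keys.pat file2 > file3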

Related

Merge unsorted lines from two files based on similar part

I am wondering whether it is possible to merge information from two files based on a shared part. file1 contains sequence IDs (from after the BLAST step), and file2 contains taxonomic names corresponding to the first two numbers in each sequence name.
file 1:
>301-89_IDNAGNDJ_171582
>301-88_ALPEKDJF_119660
>301-88_ALPEKDJF_112039
...
file2:
301-89--sample1
301-88--sample2
...
output:
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
The files are unsorted, and file1 contains many lines whose first two numbers match the first two numbers of a single line in file2. I am looking for some tips/help on how to do this. Is it possible, and which command or language should I use?
(mawk/nawk/gawk -e/-ce/-Pe) '
FNR == !_ {
_ = ! ( ___=match(FS=FNR==NR ? "[-][-]" : "[>_]", "[>-]"))
$_ = $_
} FNR == NR { __[$!_]="--"$NF; next } sub("$", __[$___])' file2.txt file1.txt
———————————————————————————
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_112039--sample2
>301-88_ALPEKDJF_119660--sample2
Using awk
$ awk -F"[_-]" 'BEGIN{OFS="-"}NR==FNR{a[$2]=$4;next}{print $0,a[$2]}' file2 OFS="--" file1
>301-89_IDNAGNDJ_171582--sample1
>301-88_ALPEKDJF_119660--sample2
>301-88_ALPEKDJF_112039--sample2
I am wondering whether it is possible to merge information from two files based on a shared part
Yes ...
The files are unsorted
... but only if they're sorted.
It's easier if we transform them so the delimiters are consistent, and then format it back together later:
sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 produces
301-88 ALPEKDJF_112039
301-88 ALPEKDJF_119660
301-89 IDNAGNDJ_171582
...
which we can just pipe through sort -k1
sed 's/--/ /' file2 produces
301-89 sample1
301-88 sample2
...
which we can sort the same way
join sorted1 sorted2 (with the sorted results of the previous steps) produces
301-88 ALPEKDJF_112039 sample2
301-88 ALPEKDJF_119660 sample2
301-89 IDNAGNDJ_171582 sample1
...
and finally we can format those 3 fields as you originally wanted (restoring the leading > that the first sed stripped) by piping through
sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/'
If it's reasonable to sort them on the fly, we can just do that using process substitution:
$ join \
<( sed 's/>\([0-9]*-[0-9]*\)_\(.*\)$/\1 \2/' file1 | sort -k1 ) \
<( sed 's/--/ /' file2 | sort -k1 ) \
| sed 's/\(.*\) \(.*\) \(.*\)$/>\1_\2--\3/'
>301-88_ALPEKDJF_112039--sample2
>301-88_ALPEKDJF_119660--sample2
>301-89_IDNAGNDJ_171582--sample1
...
If it's not reasonable to sort the files - on the fly or otherwise - you're going to end up building a hash in memory, like the awk answer is doing. Give them both a try and see which is faster.
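For reference, a minimal sketch of that in-memory hash approach (my assumptions: the key is always the part between the leading > and the first underscore, and file2 fits comfortably in memory):
awk '
  NR == FNR { split($0, f, "--"); sample[f[1]] = f[2]; next }   # file2: key -> sample name
  {
    key = substr($0, 2)         # drop the leading ">"
    sub(/_.*/, "", key)         # keep only the first two numbers, e.g. 301-89
    print $0 "--" sample[key]
  }
' file2 file1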

Extract lines from file2 that exist in file1 using a loop

I am very new at shell scripting and I am having some trouble with the following task:
I want to extract the lines from file1 that also have a match in file2, and write those lines to a new file3. I am only allowed to use loops for this (I know it works with a basic grep command, but I need to find a way with a loop).
File1
John 5 red books
Ashley 4 yellow music
Susan 8 green films
File2
John
Susan
Desired output for file3 would be:
John 5 red books
Susan 8 green films
The desired output has to be found using bash script and a loop. I have tried the following loop, but I am missing some lines in the results by using this:
while read line
do
grep "${line}" $file1
done < $file2 >> file3.txt
If anyone has any thoughts on how to improve my script or any new ideas (again using loops) it would be greatly appreciated. Thank you!
Looping here is a good educational exercise but it isn't ideal for this in the real world.
Technically, this AWK solution works and uses a loop, but I'm guessing it's not what your instructor is looking for:
awk 'NR == FNR { find[$1]=1; next } find[$1]' File2 File1 >File3
I've swapped the order of the files so the file with the data (File1) is loaded after the file listing what we want (File2).
This starts with a condition that ensures we're on the first file AWK reads (NR is the "number of records" (lines) seen so far across all inputs and FNR is the current file's number of records, so since this clause requires them to be the same value, it can only fire on the first input file). It sets a hash (a data structure with key/value pairs, a.k.a. an associative array or dictionary) whose key is the value of the first column ($1) on the line so we can extract it later, then next skips the later stanza for that input line.
When the code loops through the next file (File1), the first clause does not fire and instead the first column of input is looked up in the find hash. If it is present, its value is 1 and that evaluates to true, so we print the value. (A clause with no action implies { print })
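The same one-liner, spread out with comments (the comments are mine, added for readability):
awk '
  NR == FNR {        # true only while reading the first file given (File2)
    find[$1] = 1     # remember each wanted name as a key in the hash
    next             # do not fall through to the second clause
  }
  find[$1]           # File1: a bare condition; if true, the default action prints the line
' File2 File1 >File3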
See Toby Speight's answer below for a pure-bash solution using only builtins; it uses loops and hashes. You'll likely find that solution is slower on larger data sets.
Since you're using Bash, you could create an associative array from File2, and use that to check membership. Something like (untested):
# Read the wanted names from File2, one per line
mapfile -t names <File2
declare -A n
for i in "${names[@]}"
do n["$i"]="$i"
done
while read -r name rest
do [ "${n[$name]}" ] && echo "$name $rest"
done <File1 >file3
Awk solution:
awk 'NR==FNR{ arr[$0]="";next } { for (i in arr) { if (i == $1 ) { print $0 } } }' file2 file1
First we create an array whose keys are the lines of file2. We then use it to check the first space-delimited field of each line of file1 and print the line if there is a match.
With awk:
$ awk 'NR==FNR{ a[$1];next } $1 in a' file2 file1
With grep:
$ grep -F -f file2 file1
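Note that grep -F -f file2 file1 matches the names anywhere on the line, so a name like John would also match Johnson. If that matters, the -w word-match option (supported by GNU and BSD grep, though not required by POSIX) restricts matches to whole words:
$ grep -w -F -f file2 file1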

Split one file into multiple files based on pattern with awk

I have a binary file with the following format:
file
04550525023506346054(....)64645634636346346344363468badcafe268664363463463463463463463464647(....)474017497417428badcafe34376362623626(....)262
and I need to split it in multiple files (using awk) that look like this:
file1
045505250235063460546464563463634634634436346
file2
8badcafe26866436346346346346346346346464747401749741742
file3
8badcafe34376362623626262
I have found on stackoverflow the following line:
cat file |
awk -v RS="\x8b\xad\xca\xfe" 'NR > 1 { print RS $0 > "file" (NR-1); close("file" (NR-1)) }'
and it works for all the files but the first.
Indeed, the file I called file1 is not created, because its content does not start with the eye-catcher 8badcafe.
How can I fix the previous command line in order to have the output I need?
Thanks!
try:
awk '{gsub(/8badcafe/,"\n&");num=split($0, a,"\n");for(i=1;i<=num;i++){print a[i] > "file"++e}}' Input_file
This substitutes every occurrence of the string "8badcafe" with a newline followed by the string itself, then splits the current line into an array named a using newline as the separator, and finally loops over all of a's elements, printing each one to file1, file2, ..., named with "file" plus an increasing counter variable named e.
Output files as follows:
cat file1
045505250235063460546464563463634634634436346
cat file2
8badcafe26866436346346346346346346346464747401749741742
cat file3
8badcafe34376362623626262
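One caveat: the one-liner above redirects each chunk to a new file without ever closing it, so an input with many eye-catchers can hit the open-file limit (especially with mawk). A sketch with an explicit close(), untested against real binary data:
awk '{
  gsub(/8badcafe/, "\n&")
  num = split($0, a, "\n")
  for (i = 1; i <= num; i++) {
    print a[i] > ("file" ++e)
    close("file" e)             # each output file only receives one chunk, so close it right away
  }
}' Input_file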

Match and merge lines based on the first column

I have 2 files:
File1
123:dataset1:dataset932
534940023023:dataset:dataset039302
49930:dataset9203:dataset2003
File2
49930:399402:3949304:293000232:30203993
123:49030:1204:9300:293920
534940023023:49993029:3949203:49293904:29399
and I would like to create
Desired result:
49930:399402:3949304:293000232:30203993:dataset9203:dataset2003
534940023023:49993029:3949203:49293904:29399:dataset:dataset039302
etc
where the result contains one line for each pair of input lines that have identical first column (with : as the column separator).
The join command is your friend here. You'll likely need to sort the inputs (either pre-sort the files, or use a process substitution if available - e.g. with bash).
Something like:
join -t ':' <(sort file2) <(sort file1) >file3
When you do not want to sort the files, you can loop with grep:
while IFS=: read -r key others; do
    echo "${key}:${others}:$(grep "^${key}:" file1 | cut -d: -f2-)"
done < file2
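If file1 is large, running grep over it once for every line of file2 gets expensive; for comparison, a single-pass awk sketch of the same idea (assuming the keys never contain a colon):
awk -F':' '
  NR == FNR { rest[$1] = substr($0, length($1) + 2); next }   # file1: key -> trailing columns
  $1 in rest { print $0 ":" rest[$1] }                        # file2: append file1 columns
' file1 file2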

concatenating multiple files

I have multiple files, and in each file is the following:
>HM001
ATGCT...
>HM002
ATGTC...
>HM003
ATGCC...
That is, each file contains one gene sequence for species HM001 to HM050. I would like to concatenate all these files, so I have a single file that contains the genome for species HM001 to HM050:
>HM001
ATGCT...ATGAA...ATGTT
>HM002
ATGTC...ATGCT...ATGCT
>HM003
ATGCC...ATGC...ATGAT
The ellipses are not actually required in the final file. I suppose cat should be used, but I'm not sure how. Any ideas would be appreciated.
Data parsing and formatting will be a lot easier with awk. Try this:
awk -v RS=">" 'FNR>1{a[$1]=a[$1]?a[$1] FS $2:$2}END{for(x in a) print RS x ORS a[x]}' f1 f2 f3
For files like:
==> f1 <==
>HM001
ATGCT...
>HM002
ATGTC...
>HM003
ATGCC...
==> f2 <==
>HM001
ATGDD...
>HM002
ATGDD...
>HM003
ATGDD...
==> f3 <==
>HM001
ATGEE...
>HM002
ATGEE...
>HM003
ATGEE...
awk -v RS=">" 'FNR>1{a[$1]=a[$1]?a[$1] FS $2:$2}END{for(x in a) print RS x ORS a[x]}' f1 f2 f3
>HM001
ATGCT... ATGDD... ATGEE...
>HM002
ATGTC... ATGDD... ATGEE...
>HM003
ATGCC... ATGDD... ATGEE...
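The same program, spread out with comments (the comments are mine, added for readability; note that for-in traversal order is unspecified, so the species may print in a different order than shown above):
awk -v RS=">" '
  FNR > 1 {                         # skip the empty record before the first ">"
    # $1 is the species header (e.g. HM001) and $2 the sequence;
    # append this file's sequence to whatever has been collected so far
    a[$1] = a[$1] ? a[$1] FS $2 : $2
  }
  END {
    for (x in a)
      print RS x ORS a[x]           # re-emit ">" + species, then the joined sequences
  }
' f1 f2 f3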
Might I suggest converting your group of files into a CSV? It's almost exactly what you're suggesting, and is easily incorporated into just about any application for processing (e.g., Excel, R, python).
Up front, I'll assume that all species and gene sequences are simply alpha-numeric, with no spaces or quote-like characters. I'm also assuming access to sed, sort, and uniq, which are all standard on *nix and MacOSX, and easily accessible on Windows via msys or cygwin, to name two.
First, generate an array of file names and species. I'm assuming the files are named file1, file2, etc. Just adjust the first line accordingly; it's just a glob, not an executed command.
FILES=(file*)
SPECIES=($(sed -ne 's/^>//gp' file* | sort | uniq))
This gives us one line per species, sorted, with no repeats. This ensures that our columns are independent and the set is complete.
Next, create a CSV header row with named columns, dumping it into a CSV file named csvfile:
echo -n "\"Species\"" > csvfile
for fn in "${FILES[@]}" ; do echo -n ",\"${fn}\"" ; done >> csvfile
echo >> csvfile
Now iterate through each gene sequence and extract it from all files:
for sp in "${SPECIES[@]}" ; do
    echo -n "\"${sp}\""
    for fn in "${FILES[@]}"; do
        ANS=$(sed -ne '/>'${sp}'/,/^/ { /^[^>]/p }' "${fn}")
        echo -n ",\"${ANS}\""
    done
    echo
done >> csvfile
This works but is inefficient for larger data sets (i.e., large numbers of files and/or species). Better implementations (e.g., python, ruby, perl, even R) would read each file once, forming an internally-maintained matrix, dictionary, or associative array, and write out the CSV in one chunk.
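For what it's worth, a single-pass awk sketch of that idea (same assumptions as above: files named file1, file2, ..., exactly one sequence line after each > header, and no quote characters in the data):
awk '
  FNR == 1 { files[++nf] = FILENAME }            # remember the column order
  /^>/     { sp = substr($0, 2); species[sp]; next }
  { cell[sp, FILENAME] = $0 }                    # sequence line for the current species/file
  END {
    printf "\"Species\""
    for (i = 1; i <= nf; i++) printf ",\"%s\"", files[i]
    print ""
    for (sp in species) {                        # note: unsorted traversal order
      printf "\"%s\"", sp
      for (i = 1; i <= nf; i++) printf ",\"%s\"", cell[sp, files[i]]
      print ""
    }
  }
' file*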
What about appending them using echo, along these lines?
find . -type f -exec bash -c 'echo "append this" >> "$0"' {} \;
Source: https://stackoverflow.com/a/15604608/1662973
I would do it using "type", but that is MSDOS. The above should work for you.
The simplest way I can think of is to use cat. For example (assuming you're on a *nix-type system):
cat file1 file2 file3 > outfile
Another awk implementation:
awk '
{key=$0; getline; value[key] = value[key] $0}
END {for (key in value) {print key; print value[key]}}
' file ...
Now, this will probably not output the keys in sorted order: array keys are inherently unsorted. To ensure sorted output, use gawk and its asorti() function:
awk '
{ key = $0; getline; val[key] = val[key] $0 }
END {
    n = asorti(val, keys)
    for (i = 1; i <= n; i++) { print keys[i]; print val[keys[i]] }
}
' file ...
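With gawk 4.0 or later (an assumption about the gawk version available), another option is to set the array traversal order once instead of calling asorti():
awk '
BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }   # gawk-only: iterate keys in ascending string order
{ key = $0; getline; val[key] = val[key] $0 }
END { for (key in val) { print key; print val[key] } }
' file ...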
