Appending Text From A File To Text In Another File - bash

I have two separate text files, one with 4 letter words and one with 4 digit numbers, all on individual lines. The words in the on file correspond to the numbers on the same line in the other file. For example:
CATS
RATS
HATS
matches up with
2287
7287
4287
What I would like is to append the numbers to the end of their matching word, so it looks like this:
CATS2287
RATS7287
HATS4287
so far what I have is this:
for i in $(cat numbers); do
sed 's/$/'$i'/' words;
done
but the problem is a) that doesn't print/echo out to a new file and b) it loops through each word every time the first loop comes to a new number so in the end, all the words are paired up with the last number in the number file. Thanks in advance for the help.

paste -d "" /path/to/letters /path/to/numbers
Proof of Concept
$ paste -d "" CATS NUMS
CATS2287
RATS7287
HATS4287

You can use the excellent little paste(1) utility:
$ cat a
CATS
RATS
HATS
$ cat b
2287
7287
4287
$ paste -d "" a b
CATS2287
RATS7287
HATS4287
$
-d specifies a list of delimiters; I gave it a blank list: no delimiters, no delimiters.

Hmm, my version of paste with -d"" just results in numbers, the words get overwritten (GNU paste 8.10 on cygwin). my input files have no carriage returns.
paste words numbers | tr -d '\t'
Also, just with shell builtins
exec 3<words
exec 4<numbers
while read -u3 word; do
read -u4 num
echo $word$num
done
exec 3<&-
exec 4<&-

On Mac OS X:
paste -d "\0" <(echo abc) <(echo def)

there are a few ways to do that
Paste:
paste -d "" file1 file2
awk
awk '{ getline f<"file2" ; print $0f}' file1
Bash:
exec 6<"file2"
while read -r line
do
read -r S <&6
printf "${line}${S}\n"
done <"file1"
exec >&6-
Ruby(1.9+)
ruby -ne 'BEGIN{f=File.open("file1")};print $_.chomp+f.readline;END{f.close}' file

Related

Speed up bash for loop which contains multiple sed commands

my bash for loop looks like:
for i in read_* ; do
cut -f1 $i | sponge $i
sed -i '1 s/^/>/g' $i
sed -i '3 s/^/>ref\n/g' $i
sed -i '4d' $i
sed -i '1h;2H;1,2d;4G' $i
mv $i $i.fasta
done
Are there any methods of speeding up this process, perhaps using GNU parallel?
EDIT: Added input and expected output.
Input:
sampleid 97 stuff 2086 42 213M = 3322 1431
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
Hopeful output:
>ref
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
>sampleid
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
I used the sed -i '1h;2H;1,2d;4G' $i command to swap lines 2 and 4.
If I read it right, this should create the same result, though it would probably help a LOT if I could see what your input and expected output look like...
awk '{$0=$1}
FNR==1{hd=">"$0; next}
FNR==2{hd=hd"\n"$0;next}
FNR==3{print ">ref\n"$0 > FILENAME".fasta"}
FNR==4{next}
FNR==5{print hd"\n"$0 > FILENAME".fasta"}
' read_*
My input files:
$: cat read_x
foo x
bar x
baz x
last x
curiosity x
$: cat read_y
FOO y
BAR y
BAZ y
LAST y
CURIOSITY y
and the resulting output files:
$: cat read_x.fasta
>ref
baz
>foo
bar
curiosity
$: cat read_y.fasta
>ref
BAZ
>FOO
BAR
CURIOSITY
This runs in one pass with no loop aside from awk's usual internals, and leaves the originals in place so you can check it first. If all is good, all that's left is to remove the originals. For that, I would use extended globbing.
$: shopt -s extglob; rm read_!(*.fasta)
That will clean up the original inputs but not the new outputs.
Same results, three commands, no loops.
I am, or course, making some assumptions about what you are meaning to do that might not be accurate. To get this format in a single sed call -
$: sed -e 's/[[:space:]].*//' -e '1{s/^/>/;h;d}' -e '2{H;s/.*/>ref/}' -e '4x' read_x
>ref
baz
>foo
bar
curiosity
but that's not the same commands you used, so maybe I'm misreading it.
To use this to in-place edit multiple files at a time (instead of calling it in a loop on each file), use -si so that the line numbers apply to each file rather than the stream of records they collectively produce.
DON'T use -is, though you could use -i -s.
$: sed -s -i -e 's/[[:space:]].*//' -e '1{s/^/>/;h;d}' -e '2{H;s/.*/>ref/}' -e '4x' read_*
This still leaves you with the issue of renaming each, but xargs makes that pretty easy in the given example.
printf "%s\n" read_* | xargs -I# mv # #.fasta
addendum
Using the file you gave in the OP, assuming every file is the same general structure and exactly 4 lines -
$: cat file_0 # I made files 0 through 7, but with same data
sampleid 97 stuff 2086 42 213M = 3322 1431
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
$: sed -Esi '1{s/^([^[:space:]]+).*/>\1/;h;s/.*/>ref/}; 3x;' file_?
$: cat file_0 # used a diff on each, worked on all at once
>ref
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
>sampleid
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
Breakout:
-Esi Extended pattern matching, separate file linecounts, in-place edits
1{...}; Collectively do these commands, in order, only on every line 1
s/^([^[:space:]]+).*/>\1/ add leading > but strip everything after any whitespace
h store the resulting >\1 line in the hold buffer
s/.*/>ref/ then replace the whole line with a literal >ref
`3x' swap line 3 with the value in the hold buffer from line 1
file_? I used a glob to supply the appropriate list of files all at once.
Doing same with awk:
$: awk 'FNR==1{id=">"$1; print ">ref" >FILENAME".fasta"; next} FNR==3{print id > FILENAME".fasta"; next} {print $0 > FILENAME".fasta"}' file_?
Then you can do file management as above with the xargs/mv for the sed or the shopt/rm for the awk - or we could add a little organizational work in awk if you like. Consider this:
awk 'BEGIN { system(" mkdir -p done ") }
FNR==1 { id=">"$1; print ">ref" > FILENAME".fasta"; next } # skip printing original
FNR==3 { print id > FILENAME".fasta"; next } # skip printing original
{ print $0 > FILENAME".fasta" } # every line NOT skipped
FNR==4 { close(FILENAME); close(FILENAME".fasta");
system("mv " FILENAME " done/")
}' file_?
Then if there are any problems, it's easy to delete the fasta's, move the originals back, adjust the code, and try again. If everything is ok, it's fast and easy to rm -fr done, yes?
Note that I really only added the mkdir inside a system call in the awk to show that you can, and to keep from having to manually do it separately if you have to run a few iterations or move it all into a wrapper script, etc.
The code in the question runs multiple subprocesses (cut, sponge, sed four times, and mv) for each file that is processed. Running subprocesses is relatively slow, so you can speed up the code significantly by reducing the number of them.
This Shellcheck-clean code is one way to do it:
#! /bin/bash -p
old_files=()
for f in read_* ; do
readarray -t lines <"$f"
printf '>ref\n%s\n>%s\n%s\n' \
"${lines[3]}" "${lines[0]%%[[:space:]]*}" "${lines[1]}" >"$f.fasta"
old_files+=( "$f" )
done
rm -- "${old_files[#]}"
This runs no subprocesses when processing individual files. It just reads the lines of the old file into an array using the built-in readarray command and writes to the new file using the built-in printf.
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of the %% in ${lines[0]%%[[:space:]]*}.
To avoid running rm for each file, the code keeps a list of files to be deleted and removes all of them at the end. If you try the code, consider commenting the rm line until you are very confident that the rest of the code is doing what you want.

How to get from a file only the character with reputed value

I need to extract from the file the words that contain certain letters in a certain amount.
I apologize if this question has been resolved in the past, I just did not find anything that fits what I am looking for.
File:
wab 12aaabbb abababx ab ttttt baaabb zabcabc
baab baaabb cbaab ab ccabab zzz
For example
1. If I chose the letters a and the number is 1 the output should be:
wab
ab
ab
//only the words that contains a and the char appear in the word 1 time
2. If I chose the letters a,b and the number is 3, the output should be:
12aaabbb
abababx
baaabb
//only the word contains a,b, and both chars appear in the word 3 times
3. If I chose the letters a,b,c and the number 2, the output should be:
ccabab
zabcabc
//only the words that contains a,b,c and the chars appear in the word 3 times
Is it possible to find 2 letters in the same script?
I was able to find in a single letter but I get only the words where the letters appear in sequence and I do not want to find only these words, that's what I did:
egrep '([a])\1{N-1}' file
And another problem I can not get only the specific words, I get all file and the letter I am looking for "a" in red.
I tried using -w but it does not display anything.
::: EDIT :::
try to edit what you did to a for
i=$1
fileName=$2
letters=${#: 3}
tr -s '[:space:]' '\n' < $fileName* |
for letter in $letters; do
grep -E "^[^$letter]*($letter[^$letter]*){$i}$"
done | uniq
There are various ways to split input so that grep sees a single word per line. tr is most common. For example:
tr -s '[:space:]' '\n' file | ...
We can build a function to find a specific number of a particular letter:
NofL(){
num=$1
letter=$2
regex="^[^$letter]*($letter[^$letter]*){$num}$"
grep -E "$regex"
}
Then:
# letter=a number=1
tr -s '[:space:]' '\n' file | NofL 1 a
# letters=a,b number=3
tr -s '[:space:]' '\n' file | NofL 3 a | NofL 3 b
# letters=a,b,c number=2
tr -s '[:space:]' '\n' file | NofL 2 a | NofL 2 b | NofL 2 c
Regexes are not really suited for that job as there are more efficient ways, but it is possible using repeated matching. We first select all words, from those we select words with n as, and from those we select words with n bs and so on.
Example for n=3 and a, b:
grep -Eo '[[:alnum:]]+' |
grep -Ex '[^a]*a[^a]*a[^a]*a[^a]*' |
grep -Ex '[^b]*b[^b]*b[^b]*b[^b]*'
To auto-generate such a command from an input like 3 a b, you need to dynamically create a pipeline, which is possible, but also a hassle:
exactly_n_times_char() {
(( $# >= 2 )) || { cat; return; }
local n="$1" char="$2" regex
regex="[^$char]*($char[^$char]*){$n}"
shift 2
grep -Ex "$regex" | exactly_n_times_char "$n" "$#"
}
grep -Eo '[[:alnum:]]+' file.txt | exactly_n_times_char 3 a b
With PCREs (requires GNU grep or pcregrep) the check can be done in a single regex:
exactly_n_times_char() {
local n="$1" regex=""
shift
for char; do # could be done without a loop using sed on $*
regex+="(?=[^$char\\W]*($char[^$char\\W]*){$n})"
done
regex+='\w+'
grep -Pow "$regex"
}
exactly_n_times_char 3 a b < file.txt
If a matching word appears multiple times (like baaabb in your example) it is printed multiple times too. You can filter out duplicates by piping through sort -u but that will change the order.
A method using sed and bash would be:
#!/bin/bash
file=$1
n=$2
chars=$3
for ((i = 0; i < ${#chars}; ++i)); do
c=${chars:i:1}
args+=(-e)
args+=("/^\([^$c]*[$c]\)\{$n\}[^$c]*\$/!d")
done
sed "${args[#]}" <(tr -s '[:blank:]' '\n' < "$file")
Notice that filename, count, and characters are parameterized. Use it as
./script filename 2 abc
which should print out
zabcabc
ccabab
given the file content in the question.
An implementation in pure bash, without calling an external program, could be:
#!/bin/bash
readonly file=$1
readonly n=$2
readonly chars=$3
while read -ra words; do
for word in "${words[#]}"; do
for ((i = 0; i < ${#chars}; ++i)); do
c=${word//[^${chars:i:1}]}
(( ${#c} == n )) || continue 2
done
printf '%s\n' "$word"
done
done < "$file"
You can match a string containing exactly N occurrences of character X with the (POSIX-extended) regexp [^X]*(X[^X]*){N}. To do this for multiple characters you could chain them, and the traditional way to process one 'word' at a time, simplistically defined as a sequence of non-whitespace chars, is like this
<infile tr -s ' \t\n' ' ' | grep -Ex '[^a]*(a[^a]*){3}' | \grep -Ex '[^b]*(b[^b]*){3}'
# may need to add \r on Windows-ish systems or for Windows-derived data
If you get colorized output from egrep and grep and maybe some other utilities it's usually because in a GNU-ish environment you -- often via a profile that was automatically provided and you didn't look at or modify -- set aliases to turn them into e.g. egrep --color=auto or possibly/rarely =always; using \grep or command grep or the pathname such as /usr/bin/grep disables the alias, or you could just un-set it/them. Another possibility is you may have envvar(s) set in which case you need to remove or suppress it/them, or explicitly say --color=never, or (somewhat hackily) pipe the output through ... | cat which has the effect of making [e]grep's stdout a pipe not a tty and thus turning off =auto.
However, GNU awk (not necessarily others) can also do this more directly:
<infile awk -vRS='[ \t\n]+' -F '' '{delete f;for(i=1;i<=NF;i++)f[$i]++}
f["a"]==3&&f["b"]==3'
or to parameterize the criteria:
<infile awk -vRS='[ \t\n]+' -F '' 'BEGIN{split("ab",w,//);n=3}
{delete f;for(i=1;i<=NF;i++)f[$i]++;s=1;for(t in w)if(f[w[t]]!=occur)s=0} s'
perl can do pretty much everything awk can do, and so can some other general-purpose tools, but I leave those as exercises.

Having issues with shell script concept

Hi, I have a file with following contents
> 1234 alphabet /vag/one/arun
> 1454 bigdata /home/two/ogra
> 5684 apple /vinay/three/dire
but i want the output to be like
> 1234 alphabet one
> 1454 bigdata two
> 5684 apple three
awk '{
split($NF,ar,"/");
$NF=ar[3]
for (i=1;i<=NF;i++) {
printf "%s ",$i
}
printf "\n"
}' filename
Take the last field delimited by space and split it into the array ar based on "/", the last field equal to the third element in ar and then loop through the fields printing them.
Cut each input line into pieces, throw the parts you don't need into a wastebasket, and reassemble what is left. For instance:
grep -Eo '(^|/[^/]+/|/[^/]+$|[^/]+)' <INPUTFILE| grep -Fv /|xargs -L 2 -d '\n' echo >OUTPUTFILE
You can do it simply by controlling IFS (the Internal Field Separator) which when it includes the '/' character will cause filed splitting to occur allowing you to read a/b/c into separate variables. Then it's just a matter of printing the variables you want, e.g. with your original contents in file,
$ while IFS="${IFS}/" read -r n l a b c; do echo "$n $l $b"; done < file
1234 alphabet one
1454 bigdata two
5684 apple three

UNIX : Deleting all the lines containing string & number

I have a TXT files with lines of about 1 Million.
#Test.txt
zs272
zs273
zs277
zs278
zs282
zs285
zs288
zs289
zs298
zs300
zs7
zsa
zsag
zsani179yukkie
zsani182zaide
zsaqgiw
zsb86581
zsbguepqtkcn
zscazn
zscfhlsv
zscgxadrwijl
zsclions111yuen
zscwqtk
zscz
zsder
zsdfdgdgg
I wanted to delete the line which has the numbers and keeping only strings.
I tried,
grep -v '^[1-9]' Test.txt > 1_Test.txt
Couldn't get the desired result.
Expected output:
#1_Test.txt
zsa
zsag
zsbguepqtkcn
zscazn
zscfhlsv
zscgxadrwijl
zscwqtk
zscz
zsder
zsdfdgdgg
sed '/[0-9]/d' file
If you want to edit your file "in place" use sed's option -i.
awk '!/[0-9]/' file
With bash:
while read -r line; do [[ ! $line =~ [0-9] ]] && printf "%s\n" "$line"; done < file
Just remove the start of the line anchor ^.
^[1-9] regex only matches the numbers 1-9 which exists at the start.
grep -v '[1-9]' Test.txt > 1_Test.txt
to work for all digits including 0.
grep -v '[0-9]' Test.txt > 1_Test.txt
A solution in AWK!
awk '!/[0-9]+/{print}' file

Append index variable at the end of each line

I am trying to append an index variable at the end of each line of a file that I have. However I dont want to lose the escape characters that I have in the textFile and thus cannot echo into the file again.
Here's what I tried:
while read p; do
tempCom+=$p
tempCom+=$indexVar
echo $tempCom >> otherFile.txt
tempCom=""
done < result.txt
What I am after:
Read:
"asdasdasdasdasdasd\ asdasd/asda"
"qweqweqweqweqweqwe\ qweqwe/qweq"
Output:
"asdasdasdasdasdasd\ asdasd/asda" 1
"qweqweqweqweqweqwe\ qweqwe/qweq" 2
Note that indexVar is an index that is stored elsewhere and does not necessarily correspond to the line that its being appended to.
Your problem is very likely a quoting problem. Observe the IFS= and the -r option in the read statement too.
while IFS= read -r p
tempCom+=$p$indexVar
printf '%s\n' "$tempCom" >> otherFile.txt # Observe the quotes
tempCom=
done < result.txt
If you just want to append the line number to the end why not use awk?
awk '{print $0, "\t", NR}' < file.txt
EDIT 1: It sounds like you want to use paste then (assuming you want to just join line by line)
paste file1.txt file2.txt > fileresults.txt
EDIT 2: You can use sed then:
sed "s|$|${indexVar}|" input
Use the -r option of the read command, so that the backslashes are preserved.
while read -r p; do

Resources