Combine multiple files into single file in unix shell scripting - shell

I want to combine the data of, say, 3 files that have the same columns and data types into a single file, which I can then use for further processing.
Currently I have to process the files one after the other, so I am looking for a solution I can write in a script to combine all the files into one single file.
For ex:
File 1:
mike,sweden,2015
tom,USA,1522
raj,india,455
File 2:
a,xyz,155
b,pqr,3215
c,lmn,3252
Expected combined file 3:
mike,sweden,2015
tom,USA,1522
raj,india,455
a,xyz,155
b,pqr,3215
c,lmn,3252
Kindly help me with this.

Answer to the original form of the question:
As @Lars states in a comment on the question, it looks like a simple concatenation of the input files is desired, which is precisely what cat is for (and even named for):
cat file1 file2 > file3
To fulfill the requirements you added later:
#!/bin/sh
# Concatenate the input files and sort them with duplicates removed
# and save to output file.
cat "$1" "$2" | sort -u > "$3"
Note, however, that you can combine the concatenation and sorting into a single step, as demonstrated by Jean-Baptiste Yunès's answer:
# Sort the input files directly with duplicates removed and save to output file.
sort -u "$1" "$2" > "$3"
Note that using sort is the simplest way to eliminate duplicates.
If you don't want sorting, you'll have to use a different, more complex approach, e.g. with awk:
#!/bin/sh
# Process the combined input and only
# output the first occurrence in a set of duplicates to the output file.
awk '!seen[$0]++' "$1" "$2" > "$3"
!seen[$0]++ is a common awk idiom to only print the first in a set of duplicates:
seen is an associative array that is filled with each input line ($0) as the key (index), with each element created on demand.
This implies that all lines from a set of duplicates (even if not adjacent) refer to the same array element.
In a numerical context, awk's variable values and array elements are implicitly 0, so when a given input line is seen for the first time and the post-increment (++) is applied, the resulting value of the element is 1.
Whenever a duplicate of that line is later encountered, the value of the array element is incremented.
The net effect is that for any given input line !seen[$0]++ returns true if the input line is seen for the first time, and false for each of its duplicates, if any. Note that ++, being a post-increment, increments the element only after its current value has been used in the expression.
! negates the value of seen[$0]: a value of 0 (which is false in a Boolean context) yields true, and any nonzero value (encountered for duplicates) yields false.
!seen[$0]++ is an instance of a so-called pattern in awk - a condition evaluated against the input line that determines whether the associated action (a block of code) should be processed. Here, there is no action, in which case awk implicitly simply prints the input line, if !seen[$0]++ indicates true.
The overall effect is: Lines are printed in input order, but for lines with duplicates only the first instance is printed, effectively eliminating duplicates.
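As a quick illustration of the idiom, here is a toy run on made-up input (not the question's data):
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
This prints a, b, c, one per line: only the first occurrence of each line, in input order.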
Note that this approach can be problematic with large input files with few duplicates, because most of the data must then be held in memory.

A script like:
#!/bin/sh
sort "$1" "$2" | uniq > "$3"
should do the trick. sort will sort the concatenation of the two files (the first two arguments of the script) and pass the result to uniq, which removes adjacent identical lines and writes the result to the third file (the third argument of the script).
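Usage would look something like this, assuming the script above is saved as combine.sh (an illustrative name):
sh combine.sh file1 file2 file3
file3 is then created (or overwritten) with the sorted, deduplicated contents of file1 and file2.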

If your files follow the same naming convention (say file1, file2, file3, ..., fileN), then you can use this to combine them all.
cat file* > combined_file
Edit: A script to do the same, assuming you are passing the file names as parameters:
#!/bin/sh
cat "$1" "$2" "$3" | uniq > combined_file
Now you can display combined_file if you want. Or access it directly.
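If the number of input files varies, a small sketch using "$@" (which expands to all arguments passed to the script) avoids hard-coding $1 $2 $3:
#!/bin/sh
# Concatenate every file given on the command line into one output file.
cat "$@" > combined_file
The uniq step is dropped here; add | sort -u before the redirection if you also want duplicates removed.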

Related

How do I split a CSV file based on the column data?

I'm given an alphabet.csv file with columns A B C D. In column B, the data entered is either 1 or 2.
I'm tasked to write a shellscript that reads data out of the CSV file and split it into two files with names B_1.csv and B_2.csv. Each of these files must only contain data rows belonging to the respective data in column B.
i.e:
B_1.csv should contain all the columns A B C D, but in column B, only 1 shows up.
B_2.csv should contain all the columns A B C D, but in column B, only 2 shows up.
Here is what I have so far:
file=alphabet.csv
if grep -c '1', file
then cat >> B_1.csv
else cat >> B_2.csv
fi
exit
However, this gives me the following error:
command not found
I'm a bit lost. I'm looking through guides and I quite understand what they mean, but I'm not sure how I could do this with "sed" or "awk" or other tools.
awk is well suited to this task:
awk '$2 == 1 {print > "B_1.csv"} $2 == 2 {print > "B_2.csv"}' FS=, alphabet.csv
This can be easily generalized to more possible values in column 2:
awk '{print > ("B_" $2 ".csv")}' FS=, alphabet.csv
cat without a file name argument will write standard input to standard output. You probably don't have anything meaningful on standard input (unless you are manually typing in the file one line at a time). Similarly, your grep command would count the number of occurrences of 1, on all lines of the file (and basically always succeed; so not a very good command to put in a condition).
Probably you mean you want to write the current line to a different file, depending on what it contains. Awk makes this really easy:
awk -F ',' 'BEGIN { OFS = FS } { print >($2 == "1" ? "B_1.csv" : "B_2.csv") }' alphabet.csv
In brief, this splits the current line on commas, examines the second field, and writes to one file or the other depending on what it contains. (Awk in general reads one line at a time, and applies the current script to each line in turn.) The compact but slightly obscure notation a ? b : c checks the truth value of a, and returns b if it is true, otherwise c (this "ternary operator" exists in many languages, including C).
The assignment OFS = FS makes sure the output is comma-separated, too (the default is to read and print whitespace-separated fields).
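If the ternary feels too terse, here is an equivalent sketch with an explicit if/else (same behavior, just more verbose):
awk -F ',' 'BEGIN { OFS = FS }
{
  if ($2 == "1") print > "B_1.csv"
  else print > "B_2.csv"
}' alphabet.csv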
If you wanted to do this in pure shell script, it would look something like
while IFS=, read -r one two three four; do
  case $two in
    1) echo "$one,$two,$three,$four" >>B_1.csv;;
    2) echo "$one,$two,$three,$four" >>B_2.csv;;
  esac
done <alphabet.csv
But really, use Awk. An important task when learning shell scripting is also to learn when the shell is not the most adequate tool for the job; you generally tend to learn sed and Awk (or these days a modern scripting language like Python) as you go.
Both of these are slightly brittle with real-world CSV files, where a comma may not always be a column delimiter. (Commonly you have double-quoted fields which may contain literal commas which are not acting as delimiters, but CSV is not properly standardized.)

How do I parse a csv file to find the "fails" in the file which is on column 2 and find the average of column 7

grep "false" $1 | cut -d ',' -f2,7
This is as far as I got. With this I can get all the false errors and their response times, but I am having a hard time finding the average of all the response times combined.
It's not fully clear what you're trying to do, but if you're looking for the arithmetic mean of all second comma-delimited fields ("columns") where the seventh field is false, then here's an answer using awk:
awk -F ',' '$7 == "false" { f++; sum += $2 } END { print sum / f }' "$@"
This sets the field separator to , and then processes only lines whose seventh (comma-delimited) field is exactly false (also consider tolower($7) == "false"), incrementing a counter (f) and adding the second column to a sum variable. After running through all lines of all input files, the script prints the arithmetic mean by dividing the sum by the number of rows it keyed on. The trailing "$@" passes each argument of your shell script to this awk command as an input file.
A note on fields: awk fields are one-indexed, and $0 has a special meaning: it is the whole line; $1 is the first field, and so on. awk is pretty flexible, so you can also do things like $i to refer to the field represented by a variable i, including things like $(NF-1) to refer to the field before the last field of the line.
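A quick, made-up illustration of those field references:
echo 'a,b,c,d' | awk -F ',' '{ print NF, $1, $NF, $(NF-1) }'
This prints 4 a d c: the field count, the first field, the last field, and the one before it.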
Non-delimiting commas:
If your data might have quoted values with commas in them, or escaped commas, the field calculation in awk (or in cut) won't work. A proper CSV parser (requiring a more complete language than bash plus extras like awk, sed, or cut) would be preferable to making your own. Alternatively, if you control the format, you can consider a different delimiter such as Tab or the dedicated ASCII Record Separator character (RS, a.k.a. U+001E, Information Separator Two, which you can enter in bash as $'\x1e' and in awk (and most other languages) as "\x1e").
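For example, here is a toy sketch of the same averaging idea over records delimited by that character, using entirely made-up data and the bash $'\x1e' form mentioned above:
printf 'alpha\x1efalse\x1e10\nbeta\x1etrue\x1e20\ngamma\x1efalse\x1e30\n' |
  awk -F $'\x1e' '$2 == "false" { f++; sum += $3 } END { print sum / f }'
This prints 20, the mean of the third field over the two records whose second field is false.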

Remove duplicated entries in a table based on first column (which consists of two values sep by colon)

I need to sort and remove duplicated entries in my large table (space separated), based on the values in the first column (which denote chr:position).
Initial data looks like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
Output should look like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864
I've tried following this post and variations, but the adapted code did not work:
sort -t' ' -u -k1,1 -k2,2 input > output
Result:
1:10020 rs775809821
Can anyone advise?
Thanks!
It's quite easy to do with awk. Split the line on either a space or : as the field separator and group the lines by the word after the colon:
awk -F'[: ]' '!unique[$2]++' file
The -F'[: ]' defines the field separators used to split the individual words on the line, and the part !unique[$2]++ builds a hash map keyed on the value of $2. The count is incremented every time a value is seen in $2, so that on a later occurrence the negation ! prevents the line from being printed again.
Defining the field separator as a regex with the -F flag might not be supported by all awk versions. In a POSIX-compliant way, you could do
awk '{ split($0,a,"[: ]"); val=a[2]; } !unique[val]++ ' file
The part above assumes you want to deduplicate the file based on the word after the :; to deduplicate based on the whole first column only, just do
awk '!unique[$1]++' file
Since your input data is pretty simple, the command is going to be very easy.
sort file.txt | uniq -w7
This just sorts the file and runs uniq comparing only the first 7 characters. The data in the first 7 characters here is numeric; if letters appear, add -i to the command for a case-insensitive comparison.
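Another possibility, closer to the original sort attempt: with -u, sort compares only the key fields you specify, so restricting the key to the first field alone deduplicates on chr:position (assuming the columns really are separated by a single space):
sort -t' ' -u -k1,1 input > output
Note that which of the duplicated lines is kept is not guaranteed to be the first one from the input; the awk approaches above preserve input order.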

Remove multiple sequences from fasta file

I have a text file of character sequences where each entry consists of two lines: a header, and the sequence itself on the following line. The structure of the file is as follows:
>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
In another file I have a list of headers of sequences that I would like to remove, like this:
>header1
>header5
>header12
[...]
>header145
The idea is to remove these sequences from the first file, i.e. each of these headers plus the following line. I did it using sed like this:
while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt
It works but takes quite long, since sed rereads the whole file for every header, and the file is quite big. Any idea on how I could speed up this process?
The question as asked is easy to answer, but the answer will not help you when you handle generic fasta files. Fasta files have a sequence header followed by one or multiple lines which can be concatenated to represent the sequence. The Fasta file format roughly obeys the following rules:
The description line (defline) or header/identifier line, which begins with a greater-than character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character is ignored (including spaces, tabs, asterisks, etc.).
The sequence can span multiple lines.
A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.
Most of the presented methods will fail on a multi-fasta file with multi-line sequences.
The following will work always:
awk '(NR==FNR) { toRemove[$1]; next }
/^>/ { p=1; for(h in toRemove) if ($0 ~ h) p=0 }
p' headers.txt file.fasta
This is very similar to the answers of EdMorton and Anubahuva but the difference here is that the file headers.txt could contain only a part of the header.
$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001.
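A tiny demonstration of the skip counter on made-up input (the pattern /x/ stands in for the $0 in a test):
printf 'x\nskip-me\nkeep1\nkeep2\n' | awk '/x/{c=2} !(c&&c--)'
This prints keep1 and keep2: the matching line and the one line after it are skipped.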
Alternatively:
$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
f is whether or not the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.
The first script relies on you knowing how many lines each record is long while the 2nd one relies on every record starting with >. If you know both then which one you use is a style choice.
You may use this awk:
awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt
Create a script with the delete commands from the second file:
sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed
Then apply that file to the first
sed -f commands.sed firstFile.txt
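For the sample header list above, commands.sed would contain lines like these (note that the ,+1 address offset is a GNU sed extension):
/>header1/,+1d
/>header5/,+1d
/>header12/,+1d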
This awk might work for you:
awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1
One option is to create a long sed expression:
sedcmd=
while read line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed $sedcmd first_file.txt
This will only read the file once. Note that I added the ^ and $ to the sed pattern (so >header1 doesn't match >header123...)
Using a file (as @daniu suggests) might be better if you have thousands of headers to remove, as you risk hitting the maximum command-line length with this method.
Try GNU sed:
sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f - first_file.txt
Prepend the time command to both approaches to compare their speed, i.e. time while read line; do ... and time sed .... In my test this runs in less than half the time of the OP's approach.
This can easily be done with bbtools. The seqs2remove.txt file should be one header per line exactly as they appear in the large.fasta file.
filterbyname.sh in=large.fasta out=kept.fasta names=seqs2remove.txt

How to find integer values and compare them then transfer the main files?

I have some output files (5000 files) of .log which are the results of QM computations. Inside each file there are two special lines that indicate the number of electrons and orbitals, like the example below (with the exact spacing as in the output files):
Number of electrons = 9
Number of orbitals = 13
I thought about a script (bash or Fortran) as a solution to this problem, which would grep these two lines (at the same time), get the corresponding integer values (9 and 13, for instance), compare them to find the difference between the two values, and finally list them in a new text file with the corresponding filenames.
I would really appreciate any help given.
I am posting an attempt in GNU Awk, and have tested it only there.
#!/bin/bash
for file in *.log
do
    awk -F'=[[:blank:]]*' '/Number of/{printf "%s%s",$2,(NR%2?" ":RS)}' "$file" | awk 'function abs(v) {return v < 0 ? -v : v} {print abs($1-$2)}' >> output_"$file"
done
The reason I split the AWK logic in two was to reduce the complexity of doing it in a single huge command. The first part extracts the numbers from your log file in a columnar format and the second computes the absolute value of their difference.
I will break down the AWK logic:
-F'=[[:blank:]]*' is a multi-character delimiter, matching = followed by any number of [[:blank:]] whitespace characters.
'/Number of/{printf "%s%s",$2,(NR%2?" ":RS)}' matches the Number of lines and prints the values in a columnar fashion, i.e. as 9 13 for your sample file.
The second part is self-explanatory. I have written a function to get the absolute value of the difference between the two returned values and print it.
Each result is saved in a file named output_ followed by the original file name, for you to process further.
Run the script from your command line as bash script.sh, where script.sh is the file containing the above lines.
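As a quick sanity check on the sample values from the question, the two lines can be fed straight into the same pipeline (the file loop is omitted here):
printf 'Number of electrons = 9\nNumber of orbitals = 13\n' |
  awk -F'=[[:blank:]]*' '/Number of/{printf "%s%s",$2,(NR%2?" ":RS)}' |
  awk 'function abs(v) {return v < 0 ? -v : v} {print abs($1-$2)}'
This prints 4, the absolute difference between 9 and 13.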
Update:
In case you are interested in negative values too, i.e. without the absolute function, change the awk statement to
awk -F'=[[:blank:]]*' '/Number of/{printf "%s%s",$2,(NR%2?" ":RS)}' "$file" | awk '{print ($1-$2)}' >> output_"$file"
Bad way to do it (but it will work):
while read -r file
do
    first=$(awk -F= '/^Number/ {print $2}' "$file" | head -1)
    second=$(awk -F= '/^Number/ {print $2}' "$file" | tail -1)
    if [ "$first" -gt "$second" ]
    then
        echo $(( first - second ))
    else
        echo $(( second - first ))
    fi > "$file"_answer
done < list_of_files
This method picks up the values (in the awk one-liners) and compares them.
It then subtracts them to give you one value, which it saves in a file called "$file"_answer, i.e. the initial file name with _answer appended.
You may need to tweak this code to fit your purposes exactly.
