Merge two .txt files together and reorder - bash

I have two .txt files, each 42 lines long (the 42nd line is just blank).
One file is called date.txt, and each line has a format like:
2017-03-16 10:45:32.175 UTC
The second file is called version, and each line has a format like:
1.2.3.10
Is there a way to merge the two files together so that the date is appended to the version number (separated by a space)? So the 1st line of each file is merged together, then the second line, the third line, and so on.
So it would look like:
1.2.3.10 2017-03-16 10:45:32.175 UTC
After that, is it possible to reorder the new file by the date and time? (Going from the oldest date to the latest/current one).
The end file should still be 42 lines long.
Thanks!

Use paste:
paste -d' ' file1.txt file2.txt > result.txt
-d is used to set the delimiter (put the options before the file operands for portability).
You can then sort by the second and third columns (the date and the time) using sort:
sort -k2,3 result.txt > sorted.txt
-k is used to select the columns to sort by.
Note that this doesn't parse the date and time; it compares them as plain strings, which happens to coincide with chronological order for a year-month-day format like yours.
In one pipeline:
paste -d' ' file1.txt file2.txt | sort -k2,3 > result.txt
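With the question's own file names, a minimal runnable sketch (the two-line sample data here stands in for the real 42-line files):

```shell
# Stand-in sample data for the two files from the question.
printf '1.2.3.10\n1.2.3.11\n' > version
printf '2017-03-16 10:45:32.175 UTC\n2017-03-15 09:00:00.000 UTC\n' > date.txt

# version comes first so the version number leads each merged line;
# the year-month-day dates sort chronologically even as plain strings.
paste -d' ' version date.txt | sort -k2,3 > result.txt
```

paste and sort neither add nor drop lines, so the output keeps the input's line count; note the trailing blank line of the real files will sort to the top.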


Reduce Unix Script execution time for while loop

Have a reference file "names.txt" with data as below:
Tom
Jerry
Mickey
Note: there are 20k lines in the file "names.txt"
There is another delimited file with multiple lines for every key from the reference file "names.txt" as below:
Name~~Id~~Marks~~Column4~~Column5
Note: there are about 30 columns in the delimited file.
The delimited file looks something like:
Tom~~123~~50~~C4~~C5
Tom~~111~~45~~C4~~C5
Tom~~321~~33~~C4~~C5
.
.
Jerry~~222~~13~~C4~~C5
Jerry~~888~~98~~C4~~C5
.
.
Need to extract rows from the delimited file for every key from the file "names.txt" having the highest value in the "Marks" column.
So, there will be one row in the output file for every key from the file "names.txt".
Below is the code snippet (unix shell) that I am using; it works correctly but takes around 2 hours to execute.
function getData
{
name=$1
grep "${name}" "${delimited_file}" | awk -F"~~" '{if($1==name1 && $3>max){op=$0; max=$3}}END{print op}' max=0 name1="${name}" >> output.txt
}

while read -r line; do
getData "${line// /}"
done < names.txt
Is there any way to parallelize this and reduce the execution time? I can only use shell scripting.
Rule of thumb for optimizing bash scripts:
The size of the input shouldn't affect how often a program has to run.
Your script is slow because bash has to run the function 20k times, which involves starting grep and awk. Just starting programs takes a hefty amount of time. Therefore, try an approach where the number of program starts is constant.
Here is an approach:
Process the second file, such that for every name only the line with the maximal mark remains.
Can be done with sort and awk, or sort and uniq -f + Schwartzian transform.
Then keep only those lines whose names appear in names.txt.
Easy with grep -f
sort -t'~' -k1,1 -k5,5nr file2 |
awk -F'~~' '$1!=last{print;last=$1}' |
grep -f <(sed 's/.*/^&~~/' names.txt)
The sed part turns the names into regexes that ensure only the first field is matched; this assumes the names contain no regex metacharacters such as . and *.
Depending on the relation between the first and second file it might be faster to swap those two steps. The result will be the same.
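A runnable miniature of the pipeline, with the sample names from the question and the grep patterns written to a temporary file (so it also works in shells without <( ) process substitution):

```shell
# Tiny stand-ins for names.txt (20k lines) and the ~~-delimited file.
printf 'Tom\nJerry\n' > names.txt
cat > file2 <<'EOF'
Tom~~123~~50~~C4~~C5
Tom~~111~~45~~C4~~C5
Jerry~~222~~13~~C4~~C5
Jerry~~888~~98~~C4~~C5
Mickey~~999~~77~~C4~~C5
EOF

# Turn each name into an anchored pattern that matches only the first field.
sed 's/.*/^&~~/' names.txt > patterns.txt

# Sort by name, then by marks descending; keep the first (= highest) line
# per name; finally drop the names that are not listed in names.txt.
sort -t'~' -k1,1 -k5,5nr file2 |
awk -F'~~' '$1!=last{print;last=$1}' |
grep -f patterns.txt > output.txt
```

Note that with -t'~' each double delimiter produces an empty field, which is why Marks ends up as sort field 5 rather than 3.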

Use grep only on specific columns in many files?

Basically, I have one file with patterns and I want every line to be searched in all text files in a certain directory. I also only want exact matches. The many files are zipped.
However, I have one more condition. I need the first two columns of a line in the pattern file to match the first two columns of a line in any given text file that is searched. If they match, the output I want is the pattern (the entire line) followed by the names of all the text files a match was found in, together with their entire matched lines (not just the first two columns).
An output such as:
pattern1
file23:"text from entire line in file 23 here"
file37:"text from entire line in file 37 here"
file156:"text from entire line in file 156 here"
pattern2
file12:"text from entire line in file 12 here"
file67:"text from entire line in file 67 here"
file200:"text from entire line in file 200 here"
I know that grep can take an input file, but the problem is that it takes every pattern in the pattern file and searches for them in a given text file before moving onto the next file, which makes the above output more difficult. So I thought it would be better to loop through each line in a file, print the line, and then search for the line in the many files, seeing if the first two columns match.
I thought about this:
cat pattern_file.txt | while read line
do
echo $line >> output.txt
zgrep -w -l $line many_files/*txt >> output.txt
done
But with this code, it doesn't search by the first two columns only. Is there a way to specify the first two columns for both the pattern line and for the lines that grep searches through?
What is the best way to do this? Would something other than grep, like awk, be better to use? There were other questions like this, but none that used columns for both the search pattern and the searched file.
Few lines from pattern file:
1 5390182 . A C 40.0 PASS DP=21164;EFF=missense_variant(MODERATE|MISSENSE|Aag/Cag|p.Lys22Gln/c.64A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390200 . G T 40.0 PASS DP=21237;EFF=missense_variant(MODERATE|MISSENSE|Gcc/Tcc|p.Ala28Ser/c.82G>T|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390228 . A C 40.0 PASS DP=21317;EFF=missense_variant(MODERATE|MISSENSE|gAa/gCa|p.Glu37Ala/c.110A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
Few lines from a file in searched files:
1 10699576 . G A 36 PASS DP=4 GT:GQ:DP 1|1:36:4
1 10699790 . T C 40 PASS DP=6 GT:GQ:DP 1|1:40:6
1 10699808 . G A 40 PASS DP=7 GT:GQ:DP 1|1:40:7
They both in reality are much larger.
It sounds like this might be what you want:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile anyfile
If it's not then update your question to provide a clear, simple statement of your requirements and concise, testable sample input and expected output that demonstrates your problem and that we could test a potential solution against.
If anyfile is actually a zip file then you'd do something like:
zcat anyfile | awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile -
Replace zcat with whatever command you use to produce text from your zip file if that's not what you use.
Per the question in the comments, if both input files are compressed and your shell supports it (e.g. bash) you could do:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' <(zcat patternfile) <(zcat anyfile)
otherwise just uncompress patternfile to a tmp file first and use that in the awk command.
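A small self-contained check of the idiom, with hypothetical sample files (only the first two whitespace-separated columns decide a match):

```shell
# Hypothetical pattern and data files; only ($1,$2) are compared.
printf '1 5390182 . A C\n1 5390200 . G T\n' > patternfile
printf '1 5390182 . G A 36 PASS\n1 9999999 . T C 40 PASS\n' > anyfile

# First pass (NR==FNR): store the (col1,col2) pairs from patternfile.
# Second pass: print anyfile lines whose (col1,col2) pair was stored.
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile anyfile
```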
Use read to parse the pattern file's columns and add an anchor to the zgrep pattern:
while read -r column1 column2 rest_of_the_line
do
echo "$column1 $column2 $rest_of_the_line"
zgrep -w "^$column1[[:space:]]\{1,\}$column2" many_files/*txt
done < pattern_file.txt >> output.txt
Dropping -l makes zgrep print each match as filename:line (the file name is prefixed when several files are searched), which is the output format you described, and [[:space:]]\{1,\} requires at least one whitespace character between the two columns so adjacent fields cannot run together.
read is able to parse a line into the multiple variables passed as parameters, the last of which gets the rest of the line. It splits fields around the characters of the $IFS Internal Field Separator (by default tab, space and newline; it can be overridden for the read command alone by using while IFS='...' read ...).
Using -r avoids unwanted escape processing and makes the parsing more reliable, and while ... do ... done < file performs a bit better since it avoids a useless use of cat. Since the output of all the commands inside the while is redirected, I also put the redirection on the while rather than on each individual command.
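A runnable miniature of the loop; it uses plain grep on uncompressed stand-in files so the sketch is self-contained (swap zgrep back in for the real gzipped files):

```shell
# Hypothetical miniature of the pattern file and the searched directory.
mkdir -p many_files
printf '1 5390182 . A C 40.0 PASS x\n1 9999999 . T G 40 PASS y\n' > many_files/sample.txt
printf '1 5390182 rest of pattern\n' > pattern_file.txt

while read -r column1 column2 rest_of_the_line
do
echo "$column1 $column2 $rest_of_the_line"
# -H forces the filename: prefix even when the glob matches a single file.
grep -wH "^$column1[[:space:]]\{1,\}$column2" many_files/*txt
done < pattern_file.txt > output.txt
```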

Remove duplicated entries in a table based on first column (which consists of two values sep by colon)

I need to sort and remove duplicated entries in my large table (space separated), based on values on the first column (which denote chr:position).
Initial data looks like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
Output should look like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864
I've tried following this post and variations, but the adapted code did not work:
sort -t' ' -u -k1,1 -k2,2 input > output
Result:
1:10020 rs775809821
Can anyone advise?
Thanks!
It's quite easy with awk. Split each line on either a space or a colon as the field separator, and group the lines by the word after the colon:
awk -F'[: ]' '!unique[$2]++' file
The -F'[: ]' defines the field separator used to split the words on each line, and !unique[$2]++ builds a hash table keyed on the value of $2 (the position). The count is incremented every time a value is seen, so on any later occurrence the negation ! makes the condition false and the line is not printed again.
Defining the regex with -F flag might not be supported on all awk versions. In a POSIX compliant way, you could do
awk '{ split($0,a,"[: ]"); val=a[2]; } !unique[val]++ ' file
The part above assumes you want to unique the file based on the word after :, but for completely based on the first column only just do
awk '!unique[$1]++' file
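A quick check of the first-column variant on the question's sample rows:

```shell
# Keep only the first line seen for each chr:position key in column 1.
cat > input.txt <<'EOF'
1:10020 rs775809821
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
EOF

awk '!unique[$1]++' input.txt > output.txt
```

The duplicated 1:10051 key keeps its first entry (rs1052373574) and drops the second.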
Since your input data is pretty simple, the command is going to be very easy:
sort file.txt | uniq -w7
This sorts the file and runs uniq comparing only the first 7 characters. It relies on the chr:position key always being exactly 7 characters wide, as in your sample. The first 7 characters here are digits and a colon; if letters ever appear, add -i to ignore case.
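A sketch of that approach on the sample data (note that uniq -w is a GNU extension):

```shell
# Sort so duplicate keys become adjacent, then compare only the first
# 7 characters ("1:10051") when deciding what counts as a duplicate.
cat > snps.txt <<'EOF'
1:10051 rs1326880612
1:10020 rs775809821
1:10051 rs1052373574
EOF

sort snps.txt | uniq -w7 > deduped.txt
```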

How to sort list of names by date and get last in bash

I'm retrieving the list of artifacts from arifactory:
https://artifactory.example.com/artifactory/api/storage/TEST-REPO/com/1.0.1-SNAPSHOT/app-dist-1.0.1-20161206.093508-1-bin.zip
https://artifactory.example.com/artifactory/api/storage/TEST-REPO/com/1.0.1-SNAPSHOT/app-dist-1.0.1-20161206.125420-2-bin.zip
...
How can I compare this list by date and take the latest?
I would do it as follows:
sort -t- -nk6 list.txt | tail -1
Separate on the character '-' and sort numerically by the 6th field (which holds the timestamp, e.g. 20161206.093508, for the URLs shown), then print only the last line of the output.
Count the hyphens in your own URLs before choosing the field number: each '-' earlier in the path (as in TEST-REPO and 1.0.1-SNAPSHOT) shifts the timestamp into a later field, and check the result in case your file contains a date format I haven't taken into account.
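A self-contained check with the two URLs from the question; for these particular paths the timestamp lands in the 6th hyphen-separated field:

```shell
# The two sample artifact URLs.
cat > list.txt <<'EOF'
https://artifactory.example.com/artifactory/api/storage/TEST-REPO/com/1.0.1-SNAPSHOT/app-dist-1.0.1-20161206.125420-2-bin.zip
https://artifactory.example.com/artifactory/api/storage/TEST-REPO/com/1.0.1-SNAPSHOT/app-dist-1.0.1-20161206.093508-1-bin.zip
EOF

# Numeric sort on the 20161206.HHMMSS timestamp field, newest last.
sort -t- -nk6 list.txt | tail -1
```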

merge output in separate columns (comma separated)

I have two outputs of a file (fetched using the cut command).
One is
700316307503
700315522410
700317709443
and second is
ab
bc
cd
Both outputs have same number of rows.
I need to merge them or arrange them in a new file as following (comma separated new file)
700316307503,ab
700315522410,bc
700317709443,cd
One way using paste:
paste -d"," file1 file2 >file3
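A runnable sketch with the sample values from the question:

```shell
# The two cut outputs, saved as files with matching line counts.
printf '700316307503\n700315522410\n700317709443\n' > file1
printf 'ab\nbc\ncd\n' > file2

# Join line N of file1 with line N of file2, comma separated.
paste -d"," file1 file2 > file3
```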
