Isolate certain parts of a text file with shell script - bash

//unit-translator
#head
<
shell: /bin/bash;
>
#stuffs
<
[~]: ~;
[binary's]: /bin/bash;
[run-as-root]: sudo;
>
#commands
<
make-directory:mkdir;
move-to-directory:cd;
url-download-current-dirrectory:wget;
extract-here-tar:tar;
copy:cp;
remove-directory-+files:rm -R;
enter-root:su;
>
I want to isolate everything after "#commands", between the two "<" and ">" markers, as a string. How do I go about this?
I made the whole file a string:
translator=$(<config.txt)
I want to isolate everything in the commands section and store it in the variable "translatorcommands".
From that point I plan to split out each line, and each command pair, something like this:
IFS=';' read -a translatorcommandlines <<< "$translatorcommands"
IFS=':' read -a translatorcommand <<< "$translatorcommandlines"
I'm so clueless, please help me!

If you mean to extract all the lines after #commands that sit between < and >, you can go with this command:
sed '0,/^#commands/d' config.txt | sed '/>/q' | grep "^\w"
which deletes everything up to and including the #commands line, prints lines up to the first >, and keeps only those starting with a word character.
My output for your file is:
make-directory:mkdir;
move-to-directory:cd;
url-download-current-dirrectory:wget;
extract-here-tar:tar;
copy:cp;
remove-directory-+files:rm -R;
enter-root:su;
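If you then want that text in the variables from your question, here is a sketch (it assumes GNU sed for the 0,/regex/ address and bash 4+ for readarray; the variable and file names are the ones you used):
translatorcommands=$(sed '0,/^#commands/d' config.txt | sed '/>/q' | grep "^\w")
# one array element per "name:command;" line
readarray -t translatorcommandlines <<< "$translatorcommands"
# split a single line such as "make-directory:mkdir;" into its two halves
IFS=':' read -r name cmd <<< "${translatorcommandlines[0]%;}"
echo "$name -> $cmd"    # prints: make-directory -> mkdir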

The general-purpose text-processing tool for UNIX is awk. You don't show in your question what you want your output to be, so I don't know exactly what you're after, but hopefully this is enough for you to figure it out from here:
$ cat tst.awk
BEGIN { RS=">"; FS="\n" }
{ gsub(/^.*<[[:blank:]]*\n|\n[[:blank:]]*$/,"") }
NF {
    for (i=1; i<=NF; i++) {
        print "record", NR, "field", i, "= [" $i "]"
    }
    print "----"
}
$ awk -f tst.awk file
record 1 field 1 = []
record 1 field 2 = [shell: /bin/bash;]
record 1 field 3 = []
----
record 2 field 1 = []
record 2 field 2 = [[~]: ~;]
record 2 field 3 = [[binary's]: /bin/bash;]
record 2 field 4 = [[run-as-root]: sudo;]
record 2 field 5 = []
record 2 field 6 = []
----
record 3 field 1 = []
record 3 field 2 = [make-directory:mkdir;]
record 3 field 3 = [move-to-directory:cd;]
record 3 field 4 = [url-download-current-dirrectory:wget;]
record 3 field 5 = [extract-here-tar:tar;]
record 3 field 6 = [copy:cp;]
record 3 field 7 = [remove-directory-+files:rm -R;]
record 3 field 8 = [enter-root:su;]
record 3 field 9 = []
----
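If you only want the #commands block, a pattern in front of an action narrows the output to that record; a sketch building on the script above (commands.awk is just an illustrative name, and the trailing-semicolon test skips any blank or marker fields left over by the cleanup):
$ cat commands.awk
BEGIN { RS=">"; FS="\n" }
/#commands/ {
    gsub(/^.*<[[:blank:]]*\n|\n[[:blank:]]*$/,"")
    for (i=1; i<=NF; i++) if ($i ~ /;[[:blank:]]*$/) print $i    # keep only the name:command; lines
}
$ awk -f commands.awk file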

Related

Reading a log file into 2D/1D array in bash script

I have a file log.txt as below
1 0.694003 5.326995 7.500997 6.263974 0.633941 36.556128
2 2.221990 4.422010 4.652992 5.964420 0.660997 51.874905
3 4.376005 7.440002 6.260000 6.238917 0.728308 10.927455
4 1.914000 5.451991 0.668012 6.355688 0.634081 106.733134
5 2.530005 0.000000 8.084005 3.916278 0.687023 2252.538670
6 1.997993 1.406001 7.977006 3.923551 0.517551 37.611894
7 0.971998 1.823007 8.804005 4.110159 0.567905 905.995133
8 0.480005 3.109009 8.711002 4.060954 0.508963 553.712280
9 1.015001 3.996992 7.781004 3.547329 0.396635 16.883011
I want to read the 6th column of this file into an array myArray so that it gives the below:
echo ${myArray[9]} = 0.396635
Thank you.
Here's a way to do this (bash 4+), assuming log.txt's first column starts at 1 and doesn't skip any numbers.
readarray -t myArray < <(tr -s ' ' < log.txt | cut -d' ' -f6)
echo ${myArray[8]}
tr -s ' ' collapses the whitespace, for easier manipulation
cut -d' ' -f6 selects the 6th space separated column
<(...) turns the subcommand into a temporary file
readarray reads lines from the file into the variable myArray
Note that bash arrays are zero-indexed, so the value from row 9 lands at index 8; that's why I've selected [8] instead of [9].
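As an aside, the tr and cut steps can be collapsed into a single awk call under the same assumptions, if that reads more naturally to you:
readarray -t myArray < <(awk '{print $6}' log.txt)
echo "${myArray[8]}"    # 0.396635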
Assumptions:
first column of file is an integer
first column of file may not be sequential
OP wants (needs?) the array index to match the value in the first column
Sample data file:
$ cat log.txt
3 4.376005 7.440002 6.260000 6.238917 0.728308 10.927455
5 2.530005 0.000000 8.084005 3.916278 0.687023 2252.538670
7 0.971998 1.823007 8.804005 4.110159 0.567905 905.995133
9 1.015001 3.996992 7.781004 3.547329 0.396635 16.883011
23 0.480005 3.109009 8.711002 4.060954 0.508963 553.712280
One idea using awk (to parse the input file):
$ awk '{print $1,$6}' log.txt
3 0.728308
5 0.687023
7 0.567905
9 0.396635
23 0.508963
We can then feed this into a while loop to build the array:
unset myArray
while read -r ndx value
do
    myArray["${ndx}"]="${value}"
done < <(awk '{print $1,$6}' log.txt)
Verify contents of array:
$ typeset -p myArray
declare -a myArray=([3]="0.728308" [5]="0.687023" [7]="0.567905" [9]="0.396635" [23]="0.508963")
$ for ndx in "${!myArray[@]}"
do
    echo "index = ${ndx} ; value = ${myArray[${ndx}]}"
done
index = 3 ; value = 0.728308
index = 5 ; value = 0.687023
index = 7 ; value = 0.567905
index = 9 ; value = 0.396635
index = 23 ; value = 0.508963
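If the first column could ever hold something other than an integer, the same loop works with an associative array instead; a minimal sketch (bash 4+ only):
declare -A myArray
while read -r ndx value
do
    myArray["${ndx}"]="${value}"
done < <(awk '{print $1,$6}' log.txt)
typeset -p myArray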
Another approach, using just bash 4+ builtins (if that is acceptable):
#!/usr/bin/env bash
mapfile -t rows < log.txt
read -ra column <<< "${rows[8]}"
echo "${column[5]}"

Mean and standard deviation of multiple files, or of every three lines in a single file, in Linux

My operating system is Windows 10; I use Bash on Windows for executing Linux commands. I have a file with 96 lines, and I have multiple files that each cover three lines of this file. I want to write the mean and standard deviation of each group into a single output file, line by line.
Single file
1 31.31
2 32.24
3 32.11
4 20.97
5 20.93
6 20.91
7 22.58
8 22.46
9 22.52
10 20.71
11 20.25
12 20.51
File 1
1 31.31
2 32.24
3 32.11
File 2
4 20.97
5 20.93
6 20.91
File 3
7 22.58
8 22.46
9 22.52
First of all, I tried to split the file in verbose mode into multiple files with
grep -i 'Sample' Sample3.txt | awk '{print $5, $6}' | sed 's/\,/\./g' >> Sample4.txt | split -l3 Sample4.txt --verbose
Can tcsh commands like foreach, and awk, be used for bash scripting? Can we do this with a single text file, or do we have to split that single file into multiple files?
For example, the output could be:
output.txt
mean stand.D.
31.88667 0.50362 ----- first three rows mean and sd
20.93667 0.030 ----- second three rows mean and sd
22.52 0.06 ----- third three rows mean and sd
etc etc etc
How about using this awk script?
BEGIN {
    avg = 0; j = 0
    fname = "file_output.txt"
    printf "mean\t stand.D\n" > fname
}
{
    avg = avg + $2
    values[j] = $2
    j = j + 1
    if (NR % 3 == 0) {
        printf "%f\t", avg/3 > fname
        sd = 0
        for (k = 0; k < 3; k++) {
            sd = sd + (values[k] - avg/3) * (values[k] - avg/3)
        }
        printf "%f\n", sqrt(sd/3) > fname
        avg = 0; j = 0
    }
}
Output:
mean stand.D
31.8867 0.411204
20.9367 0.0249444
22.52 0.0489898
20.49 0.188326
"Bash script" (foo.sh):
#!/bin/bash
# data.txt is Single File
awk -F " " -f script.awk data.txt

Retrieve entire columns to a new file if they match a list from another file

I have a huge file and I need to retrieve specific columns from file1, which is ~ 200000 rows and ~ 1000 columns, if they match the list in file2. (I prefer Bash over R.)
For example, my dummy data files are as follows:
file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
and file2
sample
s4
s3
s7
s8
My desired output is
gene s3 s4
a 1 2
b 2 3
c 1 1
d 2 2
Likewise, I have 3 different file2 lists, and I have to pick different samples from the same file1 into a new file each time.
I would be very grateful if you could provide me with your valuable suggestions.
P.S.: I am a biologist; I have very little coding experience.
Regards
Ateeq
$ cat file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
$ cat file2
gene
s4
s3
s8
s7
$ cat a
awk '
NR == FNR {
    columns[ NR ] = $0
    printf "%s\t", $0
    next
}
FNR == 1 {
    print ""
    split( $0, headers )
    for (x = 1 ; x <= length(headers) ; x++ )
    {
        aheaders[ headers[x] ] = x
    }
    next
}
{
    for ( x = 1 ; x <= length( columns ) ; x++ )
    {
        if (length( aheaders[ columns[x] ] ) == 0 )
            printf "N/A\t"
        else
            printf "%s\t" , $aheaders[ columns[x] ]
    }
    print ""
}
' $*
$ ./a file2 file1 | column -t
gene s4 s3 s8 s7
a 2 1 N/A N/A
b 3 2 N/A N/A
c 1 1 N/A N/A
d 2 2 N/A N/A
The above should get you on your way. It's an extremely optimistic program and no negative testing was performed.
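Since there are three different file2 lists, the same script can simply be run once per list; the list and output file names below are only placeholders:
$ ./a list1.txt file1 | column -t > subset1.txt
$ ./a list2.txt file1 | column -t > subset2.txt
$ ./a list3.txt file1 | column -t > subset3.txt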
Awk is a tool that applies a set of commands to every line of every file that matches an expression. In general, the awk script has the form:
<pattern> <command>
There are three such pairs above. Each needs a little explanation:
NR == FNR {
    columns[ NR ] = $0
    printf "%s\t", $0
    next
}
NR == FNR is an awk'ism. NR is the record number and FNR is the record number in the current file. NR is always increasing but FNR resets to 1 when awk parses the next file. NR==FNR is an idiom that is only true while parsing the first file.
I've designed the awk program to read the columns file first (you are calling this file2). File2 has a list of columns to output. As you can see, we are storing each line in the first file (file2) into an array called columns. We are also printing the columns out as we read them. In order to avoid newlines after each column name (since we want all the column headers to be on the same line), we use printf which doesn't output a newline (as opposed to print which does).
The 'next' at the end of the stanza tells awk to read the next line in the file without processing any of the other stanzas. After all, we just want to read the first file.
In summary, the first stanza remembers the column names (and order) and prints them out on a single line (without a newline).
The second "stanza":
FNR == 1 {
    print ""
    split( $0, headers )
    for (x = 1 ; x <= length(headers) ; x++ )
    {
        aheaders[ headers[x] ] = x
    }
    next
}
FNR==1 will match on the first line of any file. Due to the next in the previous stanza, we'll only hit this stanza when we are on the first line of the second file (file1). The first print "" statement adds the newline that was missing from the first stanza. Now the line with the column headers is complete.
The split command takes the first parameter, $0, the current line, and splits it according to whitespace. We know the current line is the first line and has the column headers in it. The split command writes to an array named in the second parameter, headers. Now headers[1] = "gene", headers[2] = "s1", headers[3] = "s2", and so on.
We're going to need to map the column names to the column numbers. The next bit of code takes each header value and creates an aheaders entry. aheaders is an associative array that maps column header names to the column number.
aheaders["gene"] = 1
aheaders["s1"] = 2
aheaders["s2"] = 3
aheaders["s3"] = 4
aheaders["s4"] = 5
aheaders["s5"] = 6
When we're done making the aheaders array, the next command tells awk to skip to the next line of the input. From this point on, only the third stanza is going to have a true condition.
{
    for ( x = 1 ; x <= length( columns ) ; x++ )
    {
        if (length( aheaders[ columns[x] ] ) == 0 )
            printf "N/A\t"
        else
            printf "%s\t" , $aheaders[ columns[x] ]
    }
    print ""
}
The third stanza has no explicit pattern. Awk treats a missing pattern as always true, so this last stanza is executed for every line of the second file.
At this point, we want to print the columns that are specified in the columns array. We walk through each element of the array in order. The first time through the loop, columns[1] = "gene". This gives us:
printf "%s\t" , $aheaders[ "gene" ]
And since aheaders["gene"] = 1 this gives us:
printf "%s\t" , $1
And awk understands $1 to be the first field (or column) in the input line. Thus the first column is passed to printf which outputs the value with a tab (\t) appended.
The loop then executes another time with x=2 and columns[2]="s4". This results in the following print executing:
printf "%s\t" , $5
This prints the fifth column followed by a tab. The next iteration:
columns[3] = "s3"
aheaders["s3"] = 4
Which results in:
printf "%s\t" , $4
That is, the fourth field is output.
On the next iteration we hit a failure situation:
columns[4] = "s8"
aheaders["s8"] = ""
Since "s8" never appeared in file1's header line, aheaders["s8"] is empty, so the length( aheaders[ columns[x] ] ) == 0 test is true and we just print out a placeholder - something to tell the operator their input may be invalid:
printf "N/A\t"
The same is output when we process the last columns[x] value "s7".
Now, since there are no more entries in columns, the loop exits and we hit the final print:
print ""
The empty string is provided to print because print by itself defaults to print $0 - the entire line.
At this point, awk reads the next line out of file1 and hits the third stanza again (only that one). Thus awk continues until the second file is completely read.

How do I delete all lines in a concatenated text file that match the header WITHOUT deleting the header? [bash] [duplicate]

This question already has answers here:
Is there way to delete duplicate header in a file in Unix?
(2 answers)
How to delete the first column ( which is in fact row names) from a data file in linux?
(5 answers)
Closed 4 years ago.
My apologies if this question already exists out there. I have a concatenated text file that looks like this:
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
1 1 764484 783034 1:764484:783034:clu_2500_NA 0.66666024153854 -0.194766358934969
2 1 764484 787307 1:764484:787307:clu_2500_NA -0.602342191830433 0.24773430748199
3 1 880180 880422 1:880180:880422:clu_2501_NA -0.211378452591182 2.02508282380949
4 1 880180 880437 1:880180:880437:clu_2501_NA 0.231916912049866 -2.20305649485074
5 1 889462 891303 1:889462:891303:clu_2502_NA -2.3215482460681 0.849095194607155
6 1 889903 891303 1:889903:891303:clu_2502_NA 2.13353943689806 -0.920181808417383
7 1 899547 899729 1:899547:899729:clu_2503_NA 0.990822909478346 0.758143648905368
8 1 899560 899729 1:899560:899729:clu_2503_NA -0.938514081703866 -0.543217522714283
9 1 986217 986412 1:986217:986412:clu_2504_NA -0.851041440248378 0.682551011244202
The first line, #Chr start end ID GTEX-Q2AG GTEX-NPJ8, is the header, and because I concatenated several similar files, it occurs multiple times throughout the file. I would like to delete every instance of the header occurring in the text without deleting the first header.
BONUS I actually need help with this too and would like to avoid posting another Stack Overflow question. The first column of my data was generated by R and represents row numbers. I want them all gone without deleting #Chr. There are too many columns and it's a problem.
This problem is different from the ones recommended to me because of the additional issue above, and also because you don't necessarily have to use a regex to solve it.
The following AWK script removes all lines that are exactly the same as the first one.
awk '{ if($0 != header) { print; } if(header == "") { header=$0; } }' inputfile > outputfile
It will print the first line because the initial value of header is an empty string. Then it will store the first line in header, because header is still empty at that point.
After this it will print only lines that are not equal to the first one already stored in header. The second if will always be false once the header has been saved.
Note: If the file starts with empty lines these empty lines will be removed.
To remove the first number column you can use
sed 's/^[0-9][0-9]*[ \t]*//' inputfile > outputfile
You can combine both commands to a pipe
awk '{ if($0 != header) { print; } if(header == "") { header=$0; } }' inputfile | sed 's/^[0-9][0-9]*[ \t]*//' > outputfile
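The two steps can also be folded into one awk pass if you prefer a single command; this sketch is equivalent to the pipe above for data like yours (the first header is kept as-is, and a leading integer column is stripped from every other non-duplicate line):
awk 'NR == 1 { header = $0; print; next }
     $0 != header { sub(/^[0-9]+[ \t]+/, ""); print }' inputfile > outputfile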
Maybe this is helpful:
delete all headers
delete first column
add first header
cat foo.txt
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
1 1 764484 783034 1:764484:783034:clu
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
2 1 764484 783034 1:764484:783034:clu
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
3 1 764484 783034 1:764484:783034:clu
sed '/#Chr start end ID GTEX-Q2AG GTEX-NPJ8/d' foo.txt | awk '{$1 = ""; print $0 }' | sed '1i #Chr start end ID GTEX-Q2AG GTEX-NPJ8'
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
1 764484 783034 1:764484:783034:clu
1 764484 783034 1:764484:783034:clu
1 764484 783034 1:764484:783034:clu
Using sed
sed '2,${/HEADER/d}' input.txt > output.txt
Command explained:
Starting at line 2: 2,
Search for any line matching 'HEADER': /HEADER/
Delete it: d
I would do
awk 'NR == 1 {header = $0; print} $0 != header' file

Output the line number when there is a matching value, for each column

Say I've got a file.txt
Position name1 name2 name3
2 A G F
4 G S D
5 L K P
7 G A A
8 O L K
9 E A G
and I need to get the output:
name1 name2 name3
2 2 7
4 7 9
7 9
It outputs each name, and the position numbers where there is an A or G.
In file.txt, the name1 column has an A in position 2, G's in positions 4 and 7... therefore in the output file: 2,4,7 is listed under name1
...and so on
Strategy I've devised so far (not very efficient): reading each column one at a time, and outputting the position number when a match occurs. Then I'd get the result for each column and cbind them together using R.
I'm fairly certain there's a better way using awk or bash... ideas appreciated.
$ cat tst.awk
NR==1 {
    for (nameNr=2; nameNr<=NF; nameNr++) {
        printf "%5s%s", $nameNr, (nameNr<NF?OFS:ORS)
    }
    next
}
{
    for (nameNr=2; nameNr<=NF; nameNr++) {
        if ($nameNr ~ /^[AG]$/) {
            hits[nameNr,++numHits[nameNr]] = $1
            maxHits = (numHits[nameNr] > maxHits ? numHits[nameNr] : maxHits)
        }
    }
}
END {
    for (hitNr=1; hitNr<=maxHits; hitNr++) {
        for (nameNr=2; nameNr<=NF; nameNr++) {
            printf "%5s%s", hits[nameNr,hitNr], (nameNr<NF?OFS:ORS)
        }
    }
}
$ awk -f tst.awk file
name1 name2 name3
2 2 7
4 7 9
7 9
Save the below script:
#!/bin/bash
gawk '{if( NR == 1 ) {print $2 >>"name1"; print $3 >>"name2"; print $4>>"name3";}}
{if($2=="A" || $2=="G"){print $1 >> "name1"}}
{if($3=="A" || $3=="G"){print $1 >> "name2"}}
{if($4=="A" || $4=="G"){print $1 >> "name3"}}
END{system("paste name*;rm name*")}' $1
as finder. Make finder executable (using chmod) and then do:
./finder file.txt
Note: I have used three temporary files, name1, name2 and name3. You could change the file names at your convenience; they are deleted at the end.
Edit: Removed the BEGIN part of the gawk.
