What I want to do is simply add a column of numbers to a huge file:
xxx xxxxx xxxx
xxx xxxxx xxxx
xxx xxxxx xxxx
xxx xxxxx xxxx
xxx xxxxx xxxx
To get the following output:
xxx 1 xxxxx xxxx
xxx xxxxx xxxx
xxx 2 xxxxx xxxx
xxx xxxxx xxxx
xxx 3 xxxxx xxxx
I tried something with awk '{print NR % 2==1 ...}' but it doesn't work.
Any suggestion?
Many thanks in advance
You're on the right track:
awk 'NR%2 { $1 = $1" "++i}; 1;' file.txt
NR%2 evaluates to true for odd-numbered lines. The assignment replaces the first field with its original value, a space, and a counter that is pre-incremented, so it starts at 1. The 1; always evaluates to true and applies the default action (print) to every line. The longer-but-clearer equivalent is NR%2 { $1 = $1" "++i }; { print }.
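For a quick sanity check with simpler sample fields (any POSIX awk should behave the same way):

$ printf 'a b c\nd e f\ng h i\n' | awk 'NR%2 { $1 = $1" "++i }; 1'
a 1 b c
d e f
g 2 h i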
An alternative in Perl:
perl -lane 'if ($. % 2 == 1){$n++; print "$F[0] $n @F[1..$#F]"} else{print}' file.txt
produces the output:
xxx 1 xxxxx xxxx
xxx xxxxx xxxx
xxx 2 xxxxx xxxx
xxx xxxxx xxxx
xxx 3 xxxxx xxxx
Explanation:
-n loops over every line of the input file, puts the line in the $_ variable, and does not automatically print every line
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – splits input lines into the @F array
-e execute the perl code
$. is the line number
@F is the array of words in each line, indexed starting with 0
$#F is the index of the last element of @F
@F[1..$#F] is an array slice of elements 1 through the last
Hi, I have a situation similar to "Grep group of lines", but slightly different.
I have a file in the format of:
> xxxx AB=AAA NNN xxxx CD=DDD xxxxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=AAA JJJ xxxx CD=EEE xxxxx
xxx
xxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=AAA NNN xxxx CD=FFF xxxxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=GGG xxxxx
xxx
xxx
xxx
xxx
xxx
xxx
(each item starting with > does not necessarily contain the same number of xxx lines; xxx is a string of all capital letters; the only cue that an item's record is complete is that the next line starts with >)
Firstly, I want to grep all items with AB=EEE FFF, to get a resultant file like below:
>xxxx AB=EEE FFF xxxx CD=GGG xxxxx
xxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=TTT xxxxx
xxx
xxx
xxx
xxx
>xxxx AB=EEE FFF xxxx CD=EEE xxxxx
xxx
xxx
xxx
xxx
xxx
xxx
Then, I have a csv file with a list of CD items, and I want to grep all items with CD=xxx, where xxx is a line in the csv file.
A sample of an item is:
>sp|P01023|A2MG_HUMAN Alpha-2-macroglobulin OS=Homo sapiens OX=9606 GN=A2M PE=1 SV=3
MGKNKLLHPSLVLLLLVLLPTDASVSGKPQYMVLVPSLLHTETTEKGCVLLSYLNETVTV
SASLESVRGNRSLFTDLEAENDVLHCVAFAVPKSSSNEEVMFLTVQVKGPTQEFKKRTTV
MVKNEDSLVFVQTDKSIYKPGQTVKFR
AB in my example refers to OS here, and CD in my example refers to GN (so it's a single string containing capital letters and/or numbers).
My csv file looks like (with ~1000 lines):
A2M
AIF1
Thanks a lot!
Your question doesn't have much in the way of testable sample data but something like this might be a starting point:
awk -v s1='AB=EEE FFF' -v s2='CD' -v out='out.dat' '
/^>/ {
if ( ok = index($0,s1) )
for ( i=1; i<=NF; i++ )
if ( index($i, s2"=")==1 )
print substr( $i, index($i,"=")+1 )
}
ok { print >out }
' in.dat |
grep -Fx -f - in.csv >out.csv
use awk to process in.dat:
look for lines starting with > and, if found:
set the flag ok based on the presence/absence of the desired string s1 (the flag remains set until it is re-tested at the next > line)
if s1 is present, search for a field that starts with the string s2 followed by =
if found, write the part after the = to stdout
(for efficiency, one could break out of the for loop here)
if the ok flag is set, copy the line to out.dat
awk's stdout is piped into grep.
use grep to search for the fixed strings listed on awk's stdout that match an entire line of in.csv, and save the results to out.csv
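If you only need the first step (pulling out whole sections whose header line contains AB=EEE FFF), the same flag technique works on its own; a minimal sketch, reusing the s1 variable and the in.dat placeholder from above:

awk -v s1='AB=EEE FFF' '
/^>/ { ok = index($0, s1) }   # re-test the flag at every header line
ok                            # print header and body lines while the flag is set
' in.dat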
The title of my question is very similar to other posts, but I haven't found anything covering my specific example. I have to read in a text file as "$1", then put the values into an array line by line. Example:
myscript.sh /path/to/file
My question is would this approach work?
1 #!/bin/bash
2 file="$1"
3 readarray array < file
Would this code treat "path/to/file" as "$1" and then place that path into the variable "file"? And if that part works correctly, I believe line 3 should properly put the lines into an array, correct?
This is the contents of the text file:
$ head short-rockyou.txt
290729 123456
79076 12345
76789 123456789
59462 password
49952 iloveyou
33291 princess
21725 1234567
20901 rockyou
20553 12345678
16648 abc123
.
.
.
I hope this is enough information to help.
Very close. :)
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: Bash 4.0 needed" >&2; exit 1;; esac
file="$1"
readarray -t array <"$file"
declare -p array >&2 # print the array to stderr for demonstration/proof-of-concept
Note the use of the -t argument to readarray (to discard the trailing newline from each line), and the use of "$file" rather than just file.
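A demonstration run might look like this, assuming the script above is saved as myscript.sh (the exact declare -p formatting varies slightly across bash versions):

$ printf '290729 123456\n79076 12345\n' > /tmp/demo.txt
$ ./myscript.sh /tmp/demo.txt
declare -a array=([0]="290729 123456" [1]="79076 12345")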
I use the following for placing the lines of a file in an array:
IFS=$'\r\n' GLOBIGNORE='*' command eval 'array=($(<filename))'
This gets all the lines into the array, and you can work with them later.
Edit: Explanations on the procedure above:
IFS=$'\r\n': stands for "internal field separator". It is used by the shell to determine how to do word splitting, i. e. how to recognize word boundaries.
GLOBIGNORE='*': From the bash's manual page: A colon-separated list of patterns defining the set of filenames to be ignored by pathname expansion. If a filename matched by a pathname expansion pattern also matches one of the patterns in GLOBIGNORE, it is removed from the list of matches.
command eval: The addition of command eval allows for the expression to be kept in the present execution environment
array=...: Simply the definition.
There are several threads on Stack Overflow and Stack Exchange with more details on this:
https://unix.stackexchange.com/questions/184863/what-is-the-meaning-of-ifs-n-in-bash-scripting
https://unix.stackexchange.com/questions/105465/how-does-globignore-work
Read lines from a file into a Bash array
Then I just loop around the array like this:
for (( b = 0; b < ${#array[@]}; b++ )); do
#Do Somethng
done
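For comparison, the readarray-based answer above gives the same loop without eval; a minimal sketch (bash 4+):

readarray -t array < "$1"
for (( b = 0; b < ${#array[@]}; b++ )); do
    echo "${array[b]}"   # do something with each line
done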
This could be a matter of opinion. Please wait for more comments.
Edit: Use case with empty lines and globs
After the comments yesterday, I finally had time to test the suggestions (empty lines, lines with globs).
In both cases the array works fine in conjunction with awk. In the following example I attempt to print only column 2 into a new text file:
IFS=$'\r\n' GLOBIGNORE='*' command eval 'array=($(<'$1'))'
for (( b = 0; b < ${#array[@]}; b++ )); do
echo "${array[b]}" | awk -F "/| " '{print $2}' >> column2.txt
done
Starting with the following text file:
290729 123456
79076 12345
76789 123456789
59462 password
49952 iloveyou
33291 princess
21725 1234567
20901 rockyou
20553 12345678
16648 abc123
20901 rockyou
20553 12345678
16648 abc123
/*/*/*/*/*/*
20901 rockyou
20553 12345678
16648 abc123
There are clearly empty lines and globs in the input.
The result of the execution is the following:
123456
12345
123456789
password
iloveyou
princess
1234567
rockyou
12345678
abc123
rockyou
12345678
abc123
*
rockyou
12345678
abc123
Clear evidence that the array is working as expected.
Execution example:
adama@galactica:~$ ./processing.sh test.txt
adama@galactica:~$ cat column2.txt
123456
12345
123456789
password
iloveyou
princess
1234567
rockyou
12345678
abc123
rockyou
12345678
abc123
*
rockyou
12345678
abc123
Should we wish to remove empty lines (as it doesn't make sense to me to have them in the output), we can do it in awk by changing the following line:
echo "${array[b]}" | awk -F "/| " '{print $2}' >> column2.txt
adding /./
echo "${array[b]}" | awk -F "/| " '/./ {print $2}' >> column2.txt
End Result:
123456
12345
123456789
password
iloveyou
princess
1234567
rockyou
12345678
abc123
rockyou
12345678
abc123
*
rockyou
12345678
abc123
Should you wish to apply it to the whole file (not column by column), you can take a look at the following thread:
AWK remove blank lines
Edit: Security concern on rm:
I actually went ahead and placed $(rm -rf ~) in the test file to test what would happen on a virtual machine:
Test.txt contents now:
290729 123456
79076 12345
76789 123456789
59462 password
49952 iloveyou
33291 princess
21725 1234567
20901 rockyou
20553 12345678
16648 abc123
$(rm -rf ~)
20901 rockyou
20553 12345678
16648 abc123
/*/*/*/*/*/*
20901 rockyou
20553 12345678
16648 abc123
Execution:
adama@galactica:~$ ./processing.sh test.txt
adama@galactica:~$ ll
total 28
drwxr-xr-x 3 adama adama 4096 dic 1 22:41 ./
drwxr-xr-x 3 root root 4096 dic 1 19:27 ../
drwx------ 2 adama adama 4096 dic 1 22:38 .cache/
-rw-rw-r-- 1 adama adama 144 dic 1 22:41 column2.txt
-rwxr-xr-x 1 adama adama 182 dic 1 22:41 processing.sh*
-rw-r--r-- 1 adama adama 286 dic 1 22:39 test.txt
-rw------- 1 adama adama 1545 dic 1 22:39 .viminfo
adama@galactica:~$ cat column2.txt
123456
12345
123456789
password
iloveyou
princess
1234567
rockyou
12345678
abc123
-rf
rockyou
12345678
abc123
*
rockyou
12345678
abc123
No effect on the system.
Note: I am using Ubuntu 18.04 x64 LTS in a VM. Best not to test the security issue as root.
Edit: set -f necessity:
adama@galactica:~$ ./processing.sh a
adama@galactica:~$ cat column2.txt
[a]
adama@galactica:~$
It works perfectly without set -f.
BR
I want to catch the rows containing "Will_Liu>" in massive_data.txt when n < m or n==0 or m==0; a portion of the prototype is shown below.
cat massive_data.txt
Will_Liu> set Name.* xxx
============================================
Id Name Para status
============================================
1 name-1 xxxxx OK
2 name-2 xxxxx OK
3 name-3 xxxxx Not_OK
. ... .... OK
. ... .... OK
m name-m .... Not_OK
============================================
Total: m name attempted, n name set OK
In the above output, "m" and "n" are variables. If n < m or n==0 or m==0, print the rows containing "Will_Liu>";
if n==m and both of them are != 0, just skip and ignore that section.
So far I have only been able to use "grep" and "sed" to grab the key parts, like this:
cat test.txt
Will_Liu> set Name_group1 xxx
============================================
Id Name Para status
============================================
1 name-1 xxxxx OK
2 name-2 xxxxx OK
3 name-3 xxxxx Not_OK
============================================
Total: 3 name attempted, 2 name set OK
Will_Liu> set Name_group2 yyy
============================================
Id Name Para status
============================================
1 name-4 xxxxx OK
2 name-5 xxxxx Not_OK
3 name-6 xxxxx Not_OK
============================================
Total: 3 name attempted, 1 name set OK
I can use the "sed" and "grep" commands like this:
sed -n "/Total: 3 name attempted,/p" test.txt
Total: 3 name attempted, 2 name set OK
Total: 3 name attempted, 1 name set OK
grep -B 9 "Total: 3 name attempted" test.txt | sed -n '/Will_Liu>/p'
Will_Liu> set Name_group1 xxx
Will_Liu> set Name_group2 yyy
In the grep command, the 9 is 3+6: the 3 is the number of data lines, and the 6 is based on the format of the structure, so it's a fixed value.
So how can I introduce two variables to define "m" and "n" and improve my code to get the expected result from massive_data.txt?
My expected output:
Will_Liu> set Name1 xxx
Will_Liu> set Name2 yyy
Will_Liu> set Name3 zzz
. . .
. . .
. . .
In general, any previous line you want to print matches another pattern. In these cases it is better to store the last candidate to be printed and, when you reach your condition, decide what to do with it. For example:
awk '/^Will_Liu/{
last_will=$0
}
/^Total/{
m=$2; n=$5
if (m>n || (m==0 && n==0))
print last_will
}' file
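Run against the test.txt sample above, this should print both header lines, since 3 > 2 for the first group and 3 > 1 for the second:

$ awk '/^Will_Liu/{ last_will=$0 } /^Total/{ m=$2; n=$5; if (m>n || (m==0 && n==0)) print last_will }' test.txt
Will_Liu> set Name_group1 xxx
Will_Liu> set Name_group2 yyy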
In cases where you really don't have a pattern to select the last candidate to print, and you have to compute the line number to print from data on the matched line, you could pass over the file twice, use tac to reverse the input, or keep the lines in an array or some similar structure. These approaches can be inefficient. For example, storing all lines, which is not recommended for your case:
awk '{ line[NR]=$0 }
/^Total/{
m=$2; n=$5
if (m>n || (m==0 && n==0))
print line[NR-(m+5)]
}' file
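Run against the same test.txt, this should print the same two Will_Liu> lines, but it holds every line of the file in memory, which is why the flag-based version above is preferable for a truly massive file.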
I have a file.txt which has the following columns
id chr pos alleleA alleleB
1 01 1234 CT T
2 02 5678 G A
3 03 8901 T C
4 04 12345 C G
5 05 567890 T A
I am looking for a way of creating a new column so that it looks like : chr:pos:alleleA:alleleB
The problem is that alleleA and alleleB should be sorted based on:
1. whichever of the two values has more letters comes first, followed by the other
2. if both have the same length, alphabetical order
In this example, it would look like this:
id chr pos alleleA alleleB newID
1 01 1234 CT T chr1:1234:CT:T
2 02 5678 G A chr2:5678:A:G
3 03 8901 T C chr3:8901:C:T
4 04 12345 C G chr4:12345:C:G
5 05 567890 T A chr5:567890:A:T
I appreciate any help and suggestion. Thanks.
EDIT
Up to now I can modify the chr column so that it looks like "chr:1"...
The alleleA and alleleB columns should be combined so that if either column contains more than one letter, that value comes first in the newID column. If both columns contain only one letter, the letters are arranged alphabetically in newID.
gawk solution:
awk 'function custom_sort(i1,v1,i2,v2){ # custom function to compare 2 crucial fields
l1=length(v1); l2=length(v2); # getting length of both fields
if (l1 == l2) {
return (v1 > v2)? 1:-1 # compare characters if field lengths are equal
} else {
return l2 - l1 # otherwise - compare by length (descending)
}
} NR==1 { $0=$0 FS "newID" } # add new column
NR>1 { a[1]=$4; a[2]=$5; asort(a,b,"custom_sort"); # sort the last 2 columns using function `custom_sort`
$(NF+1) = sprintf("chr%s:%s:%s:%s",$1,$3,b[1],b[2])
}1' file.txt | column -t
The output:
id chr pos alleleA alleleB newID
1 01 1234 CT T chr1:1234:CT:T
2 02 5678 G A chr2:5678:A:G
3 03 8901 T C chr3:8901:C:T
4 04 12345 C G chr4:12345:C:G
5 05 567890 T A chr5:567890:A:T
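To sanity-check the comparator on its own, a minimal run (this assumes a gawk new enough for asort to accept a comparison-function name; I believe that's gawk 4.0+):

gawk 'function custom_sort(i1,v1,i2,v2){
    l1=length(v1); l2=length(v2)                 # compare by length (descending),
    return (l1==l2) ? ((v1>v2)?1:-1) : l2-l1     # then by string value
}
BEGIN{ a[1]="T"; a[2]="CT"; asort(a,b,"custom_sort"); print b[1], b[2] }'

which should print "CT T" (the longer allele sorts first).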
Perl to the rescue:
perl -lane '
if (1 == $.) { print "$_ newID" }
else { print "$_ ", join ":", "chr" . ($F[1] =~ s/^0//r),
$F[2],
sort { length $b <=> length $a
or $a cmp $b
} @F[3,4];
}' -- input.txt
-l removes newlines from input and adds them to print
-n reads the input line by line
-a splits each input line on whitespace into the @F array
$. is the input line number, the condition just prints the header for the first line
s/^0// removes the initial zero from $F[1] (i.e. column 2)
/r returns the result of the substitution
the lengths of the last two columns are compared first; if they are the same, string comparison is used.
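The comparator can also be checked in isolation; for example, with a few sample alleles:

$ perl -e 'print join(" ", sort { length $b <=> length $a or $a cmp $b } qw(T CT A G)), "\n"'
CT A G T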
I have 2 files:
File_1.txt:
John
Mary
Harry
Bill
File_2.txt:
My name is ID, and I am on line NR of file 1.
I want to create four files that look like this:
Output_file_1.txt:
My name is John, and I am on line 1 of file 1.
Output_file_2.txt:
My name is Mary, and I am on line 2 of file 1.
Output_file_3.txt:
My name is Harry, and I am on line 3 of file 1.
Output_file_4.txt:
My name is Bill, and I am on line 4 of file 1.
Normally I would use the following sed command to do this:
for q in John Mary Harry Bill
do
sed 's/ID/'${q}'/g' File_2.txt > Output_file.txt
done
But that would only replace ID with the name, and not include the line number from File_1.txt. Unfortunately, my bash skills don't go much further than that... Any tips or suggestions for a command that uses both file 1 and file 2? I do need to include file 1, because the real files are much larger than in this example, but I'm thinking I can figure out the rest of the code if I know how to do it with this hopefully simpler example... Many thanks in advance!
How about:
n=1
while read q
do
sed -e 's/ID/'${q}'/g' -e "s/NR/$n/" File_2.txt > Output_file_${n}.txt
((n++))
done < File_1.txt
See the Advanced Bash Scripting Guide on redirecting input to code blocks, and maybe the section on double parentheses for further reading.
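A quick check of the result, assuming File_1.txt and File_2.txt exactly as shown in the question:

$ cat Output_file_2.txt
My name is Mary, and I am on line 2 of file 1.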
How about awk, instead?
[ghoti@pc ~]$ cat file1
John
Mary
[ghoti@pc ~]$ cat file2
Harry
Bill
[ghoti@pc ~]$ cat merge.txt
My name is %s, and I am on the line %s of file '%s'.
[ghoti@pc ~]$ cat doit.awk
#!/usr/bin/awk -f
BEGIN {
while (getline line < "merge.txt") {
fmt = fmt line "\n";
}
}
{
file="Output_File_" NR ".txt";
printf(fmt, $1, FNR, FILENAME) > file;
}
[ghoti@pc ~]$ ./doit.awk file1 file2
[ghoti@pc ~]$ grep . Output_File*txt
Output_File_1.txt:My name is John, and I am on the line 1 of file 'file1'.
Output_File_2.txt:My name is Mary, and I am on the line 2 of file 'file1'.
Output_File_3.txt:My name is Harry, and I am on the line 1 of file 'file2'.
Output_File_4.txt:My name is Bill, and I am on the line 2 of file 'file2'.
[ghoti@pc ~]$
If you really want your filenames to be numbered, we can do that too.
What's going on here?
The awk script BEGINs by reading in your merge.txt file and appending it to the variable "fmt", line by line (separated by newlines). This makes fmt a printf-compatible format string.
Then, for every line in your input files (specified on the command line), an output file is selected (NR is the current record count spanning all files). The printf() function replaces each %s in the fmt variable with one of its arguments. Output is redirected to the appropriate file.
The grep just shows you all the files' contents with their filenames.
This might work for you:
sed '=' File_1.txt |
sed '1{x;s/^/'"$(<File_2.txt)"'/;x};N;s/\n/ /;G;s/^\(\S*\) \(\S*\)\n\(.*\)ID\(.*\)NR\(.*\)/echo "\3\2\4\1\5" >Output_file_\1.txt/' |
bash
TXR:
$ txr merge.txr
My name is John, and I am on the line 1 of file1.
My name is Mary, and I am on the line 2 of file1.
My name is Harry, and I am on the line 3 of file1.
My name is Bill, and I am on the line 4 of file1.
merge.txr:
@(bind count @(range 1))
@(load "file2.txt")
@(next "file1.txt")
@(collect)
@name
@(template name @(pop count) "file1")
@(end)
file2.txt:
@(define template (ID NR FILE))
@(output)
My name is @ID, and I am on the line @NR of @FILE.
@(end)
@(end)
Read the names into an array.
Get the array length.
Iterate over the array.
Test preparation:
echo "John
Mary
Harry
Bill
" > names
Names and numbers:
name=($(<names))
max=$(($(echo ${#name[*]})-1))
for i in $(seq 0 $max) ; do echo $i":"${name[i]}; done
with template:
for i in $(seq 0 $max) ; do echo "My name is ID, and I am on the line NR of file 1." | sed "s/ID/${name[i]}/g;s/NR/$((i+1))/g"; done
My name is John, and I am on the line 1 of file 1.
My name is Mary, and I am on the line 2 of file 1.
My name is Harry, and I am on the line 3 of file 1.
My name is Bill, and I am on the line 4 of file 1.
A little modification is needed in your script. That's it.
pearl.306> cat temp.sh
#!/bin/ksh
count=1
cat file1|while read line
do
sed -e "s/ID/${line}/g" -e "s/NR/${count}/g" File_2.txt > Output_file_${count}.txt
count=$(($count+1))
done
pearl.307>
pearl.303> temp.sh
pearl.304> ls -l Out*
-rw-rw-r-- 1 nobody nobody 59 Mar 29 18:54 Output_file_1.txt
-rw-rw-r-- 1 nobody nobody 58 Mar 29 18:54 Output_file_2.txt
-rw-rw-r-- 1 nobody nobody 58 Mar 29 18:54 Output_file_3.txt
-rw-rw-r-- 1 nobody nobody 58 Mar 29 18:54 Output_file_4.txt
-rw-rw-r-- 1 nobody nobody 58 Mar 29 18:54 Output_file_5.txt
pearl.305> cat Out*
My name is linenumber11, and I am on the line 1 of file 1.
My name is linenumber2, and I am on the line 2 of file 1.
My name is linenumber1, and I am on the line 3 of file 1.
My name is linenumber4, and I am on the line 4 of file 1.
My name is linenumber6, and I am on the line 5 of file 1.
pearl.306>