get Nth line in file after parsing another file - bash

I have a large file that looks like this:
foo:43:sdfasd:daasf
bar:51:werrwr:asdfa
qux:34:werdfs:asdfa
foo:234:dfasdf:dasf
qux:345:dsfasd:erwe
...............
Here the 1st column (foo, bar, qux, etc.) holds file names and the 2nd column (43, 51, 34, etc.) holds line numbers. I want to print the Nth line (specified by the 2nd column) of each file (specified in the 1st column).
How can I automate this in a Unix shell?
This file is actually generated while compiling, and I want to print the warning line in the code.
-Thanks,

while IFS=: read name line rest
do
head -n $line $name | tail -1
done < input.txt

while IFS=: read file line message; do
echo "$file:$line - $message:"
sed -n "${line}p" "$file"
done <yourfilehere

awk 'NR==4 {print}' yourfilename
or
cat yourfilename | awk 'NR==4 {print}'
The above will print the 4th line of your file. You can change the number as per your requirement.
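If the line number has to vary per file, the same awk idea can be dropped into a read loop like the ones above; a minimal sketch (input.txt as in the earlier answers):
while IFS=: read -r file line rest; do
    awk -v n="$line" 'NR == n { print; exit }' "$file"   # exit stops reading once line n is printed
done < input.txt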

Just in awk, but probably worse performance than the answers by @kev or @MarkReed.
However, it does process each file just once. Requires GNU awk (for asort).
gawk -F: '
BEGIN {OFS=FS}
{
    files[$1] = 1
    lines[$1] = lines[$1] " " $2
    msgs[$1, $2] = $3
}
END {
    for (file in files) {
        split(lines[file], l, " ")
        n = asort(l)
        count = 0
        for (i=1; i<=n; i++) {
            while (++count <= l[i])
                getline line < file
            print file, l[i], msgs[file, l[i]]
            print line
        }
        close(file)
    }
}
'
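Note that the one-liner above is shown without an input file: the compiler listing is read as the main input (pass it as the final argument or pipe it in), while the source files named in column 1 are opened by getline inside the END block. A possible invocation, with warnings.txt as a placeholder name for the listing:
gawk -F: '...program above...' warnings.txt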

This might work for you:
sed 's/^\([^:]*\):\([^:]*\).*/sed -n "\2p" \1/' file |
sort -k4,4 |
sed ':a;$!N;s/^\(.*\)\(".*\)\n.*"\(.*\)\2/\1;\3\2/;ta;P;D' |
sh

sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt
qux 34
-n: no output by default
-r: extended regular expressions (simplifies using the parens)
on line 3, do {...;p} (print at the end)
s: substitute the whole line (e.g. qux:34:werdfs:asdfa) with just the name and the line number (qux 34)
So to work with the values:
fnUln=$(sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt)
fn=$(echo ${fnUln/ */})
ln=$(echo ${fnUln/* /})
sed -n "${ln}p" "$fn"

Related

Write specific columns of a file into another file; who can give me a more concise solution?

I have a troublesome problem: writing a specific column of one file into another file. In more detail, I have file1 as shown below, and I need to write its first column (excluding the header row) to file2 as a single line, with the values separated by the '|' sign. I have a solution with sed and awk, but it is missing the last step of inserting the result at the top of file2, and given how powerful awk and sed are, I believe there should be a more concise solution. Who can offer me a more concise script?
sed '1d;s/ .//' ./file1 | awk '{printf "%s|", $1; }' | awk '{if (NR != 0) {print substr($1, 1, length($1) - 1)}}'
file1:
col_name data_type comment
aaa string null
bbb int null
ccc int null
file2:
xxx ccc(whatever is this)
The result of file2 should be this :
aaa|bbb|ccc
xxx ccc(whatever is this)
Assuming there's no whitespace in the column-1 data, here are some options, in increasing length:
sed -i "1i$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')" file2
or
ed file2 <<END
1i
$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')
.
wq
END
or
{ awk 'NR > 1 {print $1}' file1 | paste -sd '|'; cat file2; } | sponge file2
or
mapfile -t lines < <(tail -n +2 file1)
col1=( "${lines[#]%%[[:blank:]]*}" )
new=$(IFS='|'; echo "${col1[*]}"; cat file2)
echo "$new" > file2
This might work for you (GNU sed):
sed -z 's/[^\n]*\n//;s/\(\S*\).*/\1/mg;y/\n/|/;s/|$/\n/;r file2' file1
Process file1 "wholemeal" by using the -z command line option.
Remove the first line.
Remove all columns other than the first.
Replace newlines by |'s
Replace the last | by a newline.
Append file2.
Alternative using just command line utils:
tail +2 file1 | cut -d' ' -f1 | paste -s -d'|' | cat - file2
Tail file1 from line 2 onwards.
Using the results from the tail command, isolate the first column using a space as the column delimiter.
Using the results from the cut command, serialize all lines into one, delimited by |'s.
Using the results from the paste command, append file2 using the cat command.
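For example, with the sample file1 and file2 from the question (and assuming single-space-separated columns), the pipeline should produce roughly this; tail -n +2 is used here because some tail implementations no longer accept the obsolete +2 form:
$ tail -n +2 file1 | cut -d' ' -f1 | paste -s -d'|' | cat - file2
aaa|bbb|ccc
xxx ccc(whatever is this)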
I'm learning awk at the moment.
awk 'BEGIN{a=""} {if(NR>1) a = a $1 "|"} END{a=substr(a, 1, length(a)-1); print a}' file1
Edit: Here's another version that uses an array:
awk 'NR > 1 {a[++n]=$1} END{for(i=1; i<=n; ++i){if(i>1) printf("|"); printf("%s", a[i])} printf("\n")}' file1
Here is a simple Awk script to merge the files as per your spec.
awk '# From the first file, merge all lines except the first
NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
# We are in the second file; add a newline after data from first file
FNR == 1 { printf "\n" }
# Simply print all lines from file2
1' file1 file2
The NR==FNR condition is true while we are reading the first input file: the overall line number NR is equal to the line number within the current file, FNR. The final 1 is a common idiom for printing every input line that makes it this far into the script (the next in the first block prevents lines from the first file from reaching it).
For conciseness, you can remove the comments.
awk 'NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
FNR == 1 { printf "\n" } 1' file1 file2
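With the sample file1 and file2 from the question, this should print:
aaa|bbb|ccc
xxx ccc(whatever is this)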
Generally speaking, Awk can do everything sed can do, so piping sed into Awk (or vice versa) is nearly always a useless use of sed.

Counting palindromes in a text file

Having followed this thread, BASH Finding palindromes in a .txt file, I can't figure out what I am doing wrong with my script.
#!/bin/bash
search() {
tr -d '[[:punct:][:digit:]#]' \
| sed -E -e '/^(.)\1+$/d' \
| tr -s '[[:space:]]' \
| tr '[[:space:]]' '\n'
}
search "$1"
paste <(search <"$1") <(search < "$1" | rev) \
| awk '$1 == $2 && (length($1) >=3) { print $1 }' \
| sort | uniq -c
All I'm getting from this script is the output of the whole text file. I want to output only palindromes of length >= 3 and count them, such as
425 did
120 non
etc. My text file is called sample.txt, and every time I run the script with cat sample.txt | source palindrome I get the message 'bash: : No such file or directory'.
Using awk and sed
awk 'function palindrome(str) {len=length(str); for(k=1; k<=len/2+len%2; k++) { if(substr(str,k,1)!=substr(str,len+1-k,1)) return 0 } return 1 } {for(i=1; i<=NF; i++) {if(length($i)>=3){ gsub(/[^a-zA-Z]/,"",$i); if(length($i)>=3) {$i=tolower($i); if(palindrome($i)) arr[$i]++ }} } } END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d'
Tested on a 1.2 GB file; execution time was ~4m 40s (i5-6440HQ @ 2.60GHz / 4 cores / 16GB).
Explanation :
awk '
function palindrome(str)        # Function to check Palindrome
{
    len=length(str);
    for(k=1; k<=len/2+len%2; k++)
    {
        if(substr(str,k,1)!=substr(str,len+1-k,1))
            return 0
    }
    return 1
}
{
    for(i=1; i<=NF; i++)                 # For each field in a record
    {
        if(length($i)>=3)                # if length>=3
        {
            gsub(/[^a-zA-Z]/,"",$i);     # remove non-alpha characters from it
            if(length($i)>=3)            # Check length again after removal
            {
                $i=tolower($i);          # Convert to lowercase
                if(palindrome($i))       # Check if it's a palindrome
                    arr[$i]++            # and store it in an array
            }
        }
    }
}
END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d'
sed -E '/^[0-9]+ (.)\1+$/d' : From the final result, remove strings that are composed of just one repeated character, like AAA, BBB, etc.
Old Answer (Before EDIT)
You can try below steps if you want to :
Step 1 : Pre-processing
Remove all unnecessary chars and store the result in temp file
tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp
tr -dc 'a-zA-Z\n\t ' This will remove all except letters,\n,\t, space
tr ' ' '\n' This will convert space to \n to separate each word in newlines
Step-2: Processing
grep -wof temp <(rev temp) | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }'
grep -wof temp <(rev temp) This will give you all palindromes
-w : Select only those lines containing matches that form whole words.
For example : level won't match with levelAAA
-o : Print only the matched group
-f : To use each string in temp file as pattern to search in <(rev temp)
sed -E -e '/^(.)\1+$/d': This will remove words formed of same letters like AAA, BBBBB
awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }' : This will filter words having length>=3 and counts their frequency and finally prints the result
Example :
Input File :
$ cat file
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
Output:
$ tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp
$ grep -wof temp <(rev temp) | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }'
3 dad
3 kayak
3 bob
Just a quick Perl alternative:
perl -0nE 'for( /(\w{3,})/g ){ $a{$_}++ if $_ eq reverse($_)}
END {say "$_ $a{$_}" for keys %a}'
in Perl, $_ should be read as "it".
for( /(\w{3,})/g ) ... for all relevant words (may need some work to reject false positives like "12a21")
if $_ eq reverse($_) ... if it is a palindrome
END {say "$_ $a{$_}" for...} ... print each "it" and its count
\thanks{sokowi,batMan}
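Assuming the text is in sample.txt (the file name from the question), the one-liner just takes the file as an argument:
perl -0nE '...one-liner above...' sample.txt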
Running the Script
The script expects that the file is given as an argument. The script does not read stdin.
Remove the line search "$1" in the middle of the script. It is not part of the linked answer.
Make the script executable using chmod u+x path/to/palindrome.
Call the script using path/to/palindrome path/to/sample.txt. If all the files are in the current working directory, then the command is
./palindrome sample.txt
Alternative Script
Sometimes the linked script works and sometimes it doesn't. I haven't found out why. However, I wrote an alternative script which does the same and is also a bit cleaner:
#! /bin/bash
grep -Po '\w{3,}' "$1" | grep -Evw '(.)\1*' | sort > tmp-words
grep -Fwf <(rev tmp-words) tmp-words | uniq -c
rm tmp-words
Save the script, make it executable, and call it with a file as its first argument.

How can I use bash to split only some elements of a text file?

I'm trying to figure out how to make a .txt file (myGeneFile.txt) of IDs and genes that looks like this:
Probe Set ID Gene Symbol
1007_s_at DDR1 /// MIR4640
1053_at RFC2
117_at HSPA6
121_at PAX8
1255_g_at GUCA1A
1294_at MIR5193 /// UBA7
into this:
DDR1
MIR4640
RFC2
HSPA6
PAX8
GUCA1A
MIR5193
UBA7
First I tried doing this:
cat myGeneFile.txt | tail -n +2 | awk '{split($2,a,"///"); print a[1] "\t" a[2] "\t" a[3] "\t" a[4] "\t" a[5];}' > test.txt
(i.e., I removed the top (header) line of the file, tried splitting the second field on the delimiter ///, and then printed any genes that might appear)
Then, I tried doing this:
cat myGeneFile.txt | tail -n +2 | awk '{print $2}' | grep -o -E '\w+' > test.txt
(literally listing out all of the words in the second column)
I got the same output in both cases - a long list of just the first gene in each row (so, e.g., MIR4640 and UBA7 were missing).
Any ideas?
EDIT: Thanks @CodeGnome for your help. I ended up using that code and modifying it, because I discovered that my file had between 1 and 30 different gene names on each row. So I used:
awk 'NR == 1 {next}
{
sub("///", "")
print $2 }
{ for (i=3; i<=30; i++)
if ($i) {print $i}
}' myGeneFile.txt > test2.txt
@GlenJackson also had a solution that worked really well:
awk 'NR>1 {for (i=2; i<=NF; i++) if ($i != "///") print $i}' file
My awk take:
awk 'NR>1 {for (i=2; i<=NF; i++) if ($i != "///") print $i}' file
or sed
sed '
1d # delete the header
s/[[:blank:]]\+/ /g # squeeze whitespace
s/^[^ ]\+ // # remove the 1st word
s| ///||g # delete all "///" words
s/ /\n/g # replace spaces with newlines
' file
Use Conditional Print Statements Inside an AWK Action
The following gives the desired output by removing unwanted characters with sub(), and then using multiple print statements to create the line breaks. The second print statement is conditional, and only triggers when the third field isn't empty; this avoids creating extraneous empty lines in the output.
$ awk 'NR == 1 {next}
{
sub("///", "")
print $2
if ($3) {print $3}
}' myGeneFile.txt
DDR1
MIR4640
RFC2
HSPA6
PAX8
GUCA1A
MIR5193
UBA7
This will work:
tail -n+2 tmp | sed -E 's/ +/ /' | cut -d' ' -f2- | sed 's_ */// *_\n_'
Here's what is happening:
tail -n+2 Strip off the header
sed -E 's/ +/ /' Condense the whitespace
cut -d' ' -f2- Use cut to select all fields but the first, using a single space as the delimiter
sed 's_ */// *_\n_' Convert all /// (and any surrounding whitespace) to a newline
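For instance, with the question's data saved as tmp (the file name used above), this should give one gene per line; note that the final sed relies on GNU sed treating \n in the replacement as a newline:
$ tail -n+2 tmp | sed -E 's/ +/ /' | cut -d' ' -f2- | sed 's_ */// *_\n_'
DDR1
MIR4640
RFC2
HSPA6
PAX8
GUCA1A
MIR5193
UBA7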
You don't need the initial cat; it's usually better to simply pass the input file as an argument to the first command. If you want the file name in a place that is easy to change, this option avoids the additional process (and I find it easier to change the file if it's at the end):
(tail -n+2 | sed -E 's/ +/ /' | cut -d' ' -f2- | sed 's_ */// *_\n_') < tmp
Given the existing input and the modified requirement (from the comment on Morgen's answer) the following should do what you want (for any number of gene columns).
awk 'NR > 1 {
    p=0
    for (i = 2; i <= NF; i++) {
        if ($i == "///") {
            p=1
            continue
        }
        printf "%s%s\n", p?"n":"", $i
    }
}' input.txt
Your criteria for selecting which strings to output are not entirely clear, but here's another command that at least produces your expected output:
tail -n +2 myGeneFile.txt | grep -oE '\<[A-Z][A-Z0-9]*\>'
It basically just 1) skips the first line and 2) finds all other words (delimited by non-word characters and/or start/end of line) that consist entirely of uppercase letters or digits, with the first being a letter.

How to print variable value always as last column in CSV file

I have a list of CSV files, and I have to append a variable value (determined dynamically; it will change) as the last column of each CSV file.
Here is the code:
addProgramtypeID () {
for csv in $1
do
file_name="$csv"
echo $file_name
f=`echo $file_name | cut -d '_' -f3 | cut -d '.' -f1`
echo $f
k=`grep -i $f Program_type.csv | cut -d ',' -f3`
echo $k
awk '{ print $0 "," "'"$k"'" }' "$csv" > tempfile && mv tempfile "$csv"
done
}
addProgramtypeID "T_H_EDCGO.csv"
As of now the variable value k is being printed in the 1st column of the CSV file, and it is also removing the first 2 characters of the first column. My requirement is that the variable value should always appear as the last column in the CSV file.
input :
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
Now suppose $k=2.
output:
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,2
123,3,334,234,3,2
545,2,444,456,5,2
Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
Assuming there is nothing nasty in your CSV file, you can use awk as follows:
for csv_file in $ALL_MY_FILES
do
cat "$csv_file" | awk 'BEGIN{FS=","}; {print($(NF))}'
done
Or even just
cat $ALL_MY_FILES | awk 'BEGIN{FS=","}; {print($(NF))}'
Both of these will print the last column of all the CSV files. The results from each CSV are just appended together (is that really what you want?).
The difficulties are on the awk side: it is completely unaware of things like quoted strings
or extra whitespace. My recommendation is to try the line above, see what goes wrong (if anything), and then start tweaking.
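If the goal is to append a value rather than just print the existing last column, a minimal tweak along these lines should work (k stands for the value looked up in the question's function, csv_file for the loop variable above):
awk -v k="$k" 'BEGIN{FS=OFS=","} {print $0, k}' "$csv_file" > tempfile && mv tempfile "$csv_file"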
It looks like what you want is just:
$ cat tst.sh
addProgramtypeID () {
csv="$1"
awk -v csv="$csv" '
BEGIN{ FS=OFS=","; split(csv,csvA,/[_.]/); f=csvA[3] }
NR==FNR { if ($0 ~ f) { k = $3 }; next }
{ print $0, k }
' Program_type.csv "$csv" > tempfile && mv tempfile "$csv"
}
addProgramtypeID "T_H_EDC.csv"
$ cat Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
$ ./tst.sh
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,1
123,3,334,234,3,1
545,2,444,456,5,1
but it's hard to tell since your posted sample input could not produce your posted desired output so I had to make some up.
if ($0 ~ f) should probably just be if ($1 == f), I just copied what your original grep f <file> logic would do.
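That is, the NR==FNR line would then read (a sketch of the suggested change):
NR==FNR { if ($1 == f) { k = $3 }; next }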

How to replace all but last matching in a file using bash?

Assuming bash, and a configuration file like:
param-a=aaaaaa
param-b=bbbbbb
param-foo=first occurence <-- Replace
param-c=cccccc
# param-foo=first commented foo <-- Commented: don't replace
param-d=dddddd
param-e=eeeeee
param-foo=second occurence <-- Replace
param-foo=third occurence <-- Last active: don't replace
param-x=xxxxxx1
param-f=ffffff
# param-foo=second commented foo <-- Commented: don't replace
param-x=xxxxxx2
In this file there can be multiple commented or uncommented lines for param-foo.
How can you comment out all the uncommented param-foo lines except the very last active one,
resulting in:
param-a=aaaaaa
param-b=bbbbbb
# param-foo=first occurence <-- Replaced
param-c=cccccc
# param-foo=first commented foo <-- Left
param-d=dddddd
param-e=eeeeee
# param-foo=second occurence <-- Replaced
param-foo=third occurence <-- Left
param-x=xxxxxx1
param-f=ffffff
# param-foo=second commented foo <-- Left
param-x=xxxxxx2
Two parts of the question:
1. How to do it with only one known repeating param?
(only param-foo in the example above)
2. How to do it for all repeating active params at once?
(param-foo + param-x in the example above)
Attention: in this case I don't know the names of the repeating params in advance!
Thanks
If awk is acceptable, this will do it for param-foo and param-x:
awk -F= -v p='param-foo param-x' 'BEGIN {
ARGV[ARGC++] = ARGV[ARGC - 1]
n = split(p, t, OFS)
for (i = 0; ++i <= n;) _p[t[i]]
}
NR == FNR {
$1 in _p && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
$0 = "# " $0
}1' infile
You may use a single parameter: p=param-x or add more parameters separated by spaces: p='param-1 param-2 ... param-n'.
Edit: I'm assuming the real input file looks like this:
param-a=aaaaaa
param-b=bbbbbb
param-foo=first occurence
param-c=cccccc
# param-foo=commented foo
param-d=dddddd
param-e=eeeeee
param-foo=second occurence
param-foo=third occurence
param-x=xxxxxx1
param-f=ffffff
param-x=xxxxxx2
Let me know if it's different.
Second edit: providing a solution for mawk users:
awk -F= -v p='param-foo param-x' 'BEGIN {
n = split(p, t, OFS)
for (i = 0; ++i <= n;) _p[t[i]]
}
NR == FNR {
$1 in _p && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
$0 = "# " $0
}1' infile infile
Adding solution for the latest requirement:
awk -F= 'NR == FNR {
if (NF && !/^#/)
_p[$1]++ && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
FNR != nr[$1] && $0 = "# " $0
}1' infile infile
I have not fully tested the script, but it worked on the first example:
#!/bin/bash
input_file=/path/to/your/input/file
last_occurence=`nl $input_file | grep 'param-foo' | grep -v '#' | tail -1 | awk -F" " '{print $1}'`
sed -i '/#/!s/param-foo/# param-foo/g' $input_file
sed -i "${last_occurence}s/# param-foo/param-foo/" $input_file
It's very straightforward logic. First we get the line number of the last occurrence of param-foo that is not commented.
The first sed comments out every param-foo that is not commented.
The second sed uses the saved line number of the last uncommented occurrence of param-foo and removes the # character again. You can easily wrap this in a function and use it inside a loop, providing a list of parameters instead of only one, as sketched below.
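For part 2 of the question, a rough sketch of such a loop, reusing the commands above (comment_all_but_last is a hypothetical name, and the simple grep matching of the original is kept as-is):
comment_all_but_last() {
    local param=$1 file=$2
    local last
    # line number of the last uncommented occurrence (nl -ba counts blank lines too,
    # so the numbers match sed's line addressing)
    last=$(nl -ba "$file" | grep "$param" | grep -v '#' | tail -1 | awk '{print $1}')
    [ -n "$last" ] || return 0                        # nothing to do for this param
    sed -i "/#/!s/$param/# $param/g" "$file"          # comment out every uncommented occurrence
    sed -i "${last}s/# $param/$param/" "$file"        # uncomment the last one again
}

for p in param-foo param-x; do
    comment_all_but_last "$p" /path/to/your/input/file
done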
A bit slow for long files, but should work for all the parameters:
grep -v ^# $file |
cut -f1 -d= |
sort -u |
sed 's/^/grep -n . '$file' |
tac |
grep -m1 :/;s/$/= /' |
bash |
sed -r 's%([0-9]+):(.*)=(.*)%\1!s/^\2=/# \2=/%' |
sed -f- $file
This might work:
param="param-foo"
tac input_file |sed '/#/!{/'"$param"'/{x;/./{x;s/'"$param"'/# &/;t};x;h;}}'|tac >output_file
For multiple params:
cp input_file{,.backup}
params=(param-{foo,bar,baz})
tac input_file >backwards_file
for param in "${params[#]}"; do
sed -i '/#/!{/'"$param"'/{x;/./{x;s/'"$param"'/# &/;t};x;h;}}' backwards_file
done
tac backwards_file >output_file
Turn input_file backwards, prepend a comment # to all but the first occurrence of $param, then turn the file back around.
EDIT:
To extract the params from the file use this piece of code:
params=($(sed -rn '/^#/d;/^$/!s/^\s*([^=]*).*/\1/gp' input_file | sort | uniq))
