How to append new line \n in a loop - bash

I want to add a new line character '\n' at the end of each line in my file.
Here is my code:
while read -r line
do
    echo "$line" | awk -F'\t' '{
        print($1);
        for (i=2; i<=NF; i++){
            split($i,arr,":")
            print(arr[1])
        };
    }' | tr '\n' '\t' | tr '|' ' ' | tr '/' ' ' | {END OF LINE, WANNA ADD NEW LINE}
    >> genotype_processed.txt
done < file_in
Also, is there any way that I can combine the 3 tr commands into one? They just look too redundant.
Many thanks!
EDIT:
The input looks like this:
id123 0|1:a:b:c 0/0:i:j:k ...
id456 1/1:j:f:z 1|0:.:j:v ...
...
The desired output:
id123 0 1 0 0 ...
id456 1 1 1 0 ...
...

You could open the output file only once, at the end of the while loop, and then do whatever you want to output within the loop.
while read -r line
do
    echo "$line" | awk -F'\t' '{
        print($1);
        for (i=2; i<=NF; i++){
            split($i,arr,":")
            print(arr[1])
        };
    }' | tr '\n' '\t' | tr '|' ' ' | tr '/' ' '
    # This will be the new "newline"
    echo
done < file_in >genotype_processed.txt
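As a side note on the question about merging the three tr calls: tr translates character sets positionally, so the three invocations could likely be collapsed into one (a sketch, untested):
... | tr '\n|/' '\t  '    # SET2 is a tab followed by two spaces: newline->tab, '|'->space, '/'->space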
And by the way, your loop could be improved, I think, by using a single command to do the replacements. A sed would probably be a good choice.
Please provide more examples of your input and expected output.
EDIT
After your input/output description, I think you could improve this part a lot.
You do while read line; do echo "$line" | awk '...'; done <input which is basically what you would get by doing a single awk '...' input.
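For illustration, a single-awk version of that loop might look something like the sketch below (untested; it assumes tab-separated input fields, as in the original awk -F'\t', and it appends the newline the question asks for):
awk -F'\t' '{
    # print the id, then the part of each field before the first ":",
    # with "|" and "/" replaced by spaces, all on one tab-separated line
    printf "%s", $1
    for (i = 2; i <= NF; i++) {
        split($i, arr, ":")
        gsub(/\|/, " ", arr[1])
        gsub(/\//, " ", arr[1])
        printf "\t%s", arr[1]
    }
    print ""
}' file_in > genotype_processed.txt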
I don't get exactly what you want to achieve, and I think you have misunderstood a few things, but if my guess is right, then this is what you want.
sed -r 's/:[^[:blank:]]+//g; s/[|/]/ /g' input
Here, I first remove everything that follows the first : in each column, and then I replace the characters | or / with a space.
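For example, with the sample rows from the question saved as input (the trailing ... just stands for further fields), this produces:
$ sed -r 's/:[^[:blank:]]+//g; s/[|/]/ /g' input
id123 0 1 0 0 ...
id456 1 1 1 0 ...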
Does that meet your needs?

How about:
while read -r line
do
    newline="\n"
    echo -e $line$newline >> genotype_processed.txt
done < file_in
Or use echo -e "$line"$newline if you need to retain the original formatting of $line.
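As a side note (not part of the original answer), printf avoids the portability questions around echo -e and backslash escapes:
while read -r line
do
    # print the line followed by an extra blank line
    printf '%s\n\n' "$line" >> genotype_processed.txt
done < file_in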

Maybe you're overcomplicating this. (XY Problem?)
$: sed 's,[|/], ,g; s/:[^ ]*//g;' file_in > genotype_processed.txt
I replaced all | and / characters with spaces, and deleted any : followed by any number of non-space characters.
I used a single truncating output redirection since I did it all in one step. If there was already content in the output file that you wanted to keep, then go back to the append, as shown below.
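That is, keeping everything else the same (a minimal variation of the command above):
sed 's,[|/], ,g; s/:[^ ]*//g;' file_in >> genotype_processed.txt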

Related

Extracting unique columns from a file into a comma separated list with a particular order

I have a .csv file with these values
product,0 0,no way
brand,0 0 0,detergent
product,0 0 1,sugar
negative,0 0 1, sight
positive, 0 0 1, salt
and I want to make a file with the unique values of the first column, comma-separated and in sorted order, except that "negative" always goes at the end.
So I want
["brand","positive","product","negative"]
I was not able to automate this process so what I did was
awk -F ',' '{print $1}' file.csv | sort | uniq -c > file2.txt
awk '{if(NR>1) printf ", ";printf("\"%s\"",$0)} END {print ""}' file2.txt > file3.txt
I get "brand","negative","positive","product"
Then I manually move "negative" to the end and also append [ and ] to front and back to get
["brand","positive","product","negative"]
Is there a way to make it more efficient and automate the process?
Another solution, with easy-to-understand steps:
$ awk -F, '{print ($1=="negative"?1:0) "\t\"" $1 "\""}' file | # mark negatives
sort | cut -f2 | uniq | # sort, cut, uniq
paste -sd, | sed 's/^/[/;s/$/]/' # serialize, add brackets
["brand","positive","product","negative"]
Here is a single gnu awk command to make it work:
awk -F, '{
    # prefix "negative" with "~" so that it sorts after the alphabetic keys
    a[$1] = ($1 == "negative" ? "~" : "") $1
}
END {
    n = asort(a)            # asort is a GNU awk extension
    printf "["
    for (i = 1; i <= n; i++) {
        sub(/^~/, "", a[i]) # strip the sorting prefix again
        printf "\"%s\"%s", a[i], (i < n ? ", " : "]\n")
    }
}' file.csv
["brand", "positive", "product", "negative"]
There are lots of ways to approach this. Do you really want the result as what looks like a JSON array, with square brackets and quotation marks around the column names? If so, then jq is probably a good tool to use to generate it. Something like this will do it all as a single jq program:
jq -csR '[split("\n")|
map(select(length>0))[]|
split(",")[0]]|
sort_by(if .=="negative" then "zzzz" else . end)' file.csv
Which outputs this:
["brand","positive","product","negative"]
If you just want the headings separated by commas in a line without the other punctuation, suitable for heading up a CSV file, you can use more traditional text-manipulation commands:
cut -d, -f1 file.csv |
sed 's/negative/zzz&/' |
sort -u |
sed 's/zzz//' |
paste -d, -s -
Or you can slightly modify the jq command by adding the -r flag and another pipe at the end:
jq -csrR '[split("\n")|
map(select(length>0))[]|
split(",")[0]]|
sort_by(if .=="negative" then "zzzz" else . end)|
join(",")' file.csv
Either of which outputs this:
brand,positive,product,negative
Using a Perl one-liner:
$ cat unique.txt
product,0 0,no way
brand,0 0 0,detergent
product,0 0 1,sugar
negative,0 0 1, sight
positive, 0 0 1, salt
$ perl -F, -lane ' { $x=$F[0];$x=~s/^(negative)/z\1/g;$rating{$x}++ } END {$q="\x22";$y=join("$q,$q",sort keys %rating) ; $y=~s/${q}z/$q/g; print "[$q$y$q]" }' unique.txt
["brand","positive","product","negative"]
$
This worked for me:
cut -d, -f1 file.csv | sort -u | sed "/^negative/d" | tr '\n' ',' | sed -e 's/^/["/' -e 's/,/","/g' -e 's/$/negative"]/'

Including empty lines using pattern

My problem is the following: I have a text file with no empty lines. Now I would like to insert empty lines according to a pattern file, where 1 means print the next line of the file as it is and 0 means insert an empty line. My text file is:
apple
banana
orange
milk
bread
The pattern file is:
1
1
0
1
0
1
1
The desired output, correspondingly:
apple
banana

orange

milk
bread
What I tried is:
for i in $(cat pattern file);
do
awk -v var=$i '{if var==1 {print $0} else {printf "\n" }}' file;
done.
But it prints all the lines of the file first, and only after that does it move on to the next $i.
Thanks for any pointers.
Read the pattern file into an array, then use that array when processing the text file.
awk 'NR==FNR { newlines[NR] = $0; next}
{ print $0 (newlines[FNR] ? "" : "\n") }' patternfile textfile
This version allows multiple 0s between 1s.
Self-documented code:
awk '# for file 1 only
NR==FNR {
    # load an array with 0 and 1 (inverted, because a non-existing element defaults to 0)
    n[NR]=!$1
    # skip to the next line (do not run the rest of the script for this line)
    next
}
# at each line (of file 2, thanks to the next in the previous block)
{
    # loop while the (next, due to a++) element of the array is 1
    for(a++;n[a]==1;a++){
        # print an empty line
        printf("\n")
    }
    # print the original line
    print
}' pattern YourFile
The inversion of the value is needed to avoid printing endless newlines for the last line when the pattern file contains fewer entries than the data file has lines.
Multiple 0s require a loop plus a test.
The pattern file and the data file going out of sync is a problem when using the array index directly (unless the array instead stores how many newlines to insert before each line, which would be another way of doing it).
This is a bit of a hack, but I present it as an alternative to your traditionally awk-y solutions:
paste -d, file.txt <(cat pattern | tr '\n' ' ' | sed 's,1 0,10,g' | tr ' ' '\n' | tr -d '1') | tr '0' '\n' | tr -d ','
The output looks like this:
apple
banana

orange

milk
bread
Inverse of Barmar's, read the text into an array and then print as you process the pattern:
$ awk 'NR==FNR {fruit[NR]=$0; next} {print $0?fruit[++i]:""}' fruit.txt pattern.txt
apple
banana

orange

milk
bread
For an answer using only bash:
i=0; mapfile file < file
for p in $(<pattern); do
((p)) && printf "%s" "${file[i++]}" || echo
done

Shell command for inserting a newline every nth element of a huge line of comma separated strings

I have a one line csv containing a lot of elements. Now I want to insert a newline after every n-th element in a bash/shell script.
Bonus: I'd like to prepend a line with descriptors and use the count of descriptors as 'n'.
Example:
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221","94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713", (...)
into
"id","lon","lat"
"4908041eee3d4bf98e606140b21ebc89.16","7.38974601030349731","45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16","7.38845318555831909","45.31425320325949713"
(...)
Edit: I made a first attempt, but then the comma delimiters are missing:
(...) | xargs --delimiter=',' -n3
"4908041eee3d4bf98e606140b21ebc89.16" "7.38974601030349731" "45.31298584267982221"
"94ff11ce7eb54642b0768dde313e8b25.16" "7.38845318555831909" "45.31425320325949713"
Trying to replace the " " with ",":
(...) | xargs --delimiter=',' -n3 -i echo ${{}//" "/","}
-bash: ${{}//\": bad substitution
I would go with Perl for that!
Let's assume this outputs something like your file:
printf "1,2,3,4,5,6,7,8,9,10"
1,2,3,4,5,6,7,8,9,10
Then you could use this if you wanted every 4th comma replaced:
printf "1,2,3,4,5,6,7,8,9,10" | perl -pe 's{,}{++$n % 4 ? $& : "\n"}ge'
1,2,3,4
5,6,7,8
9,10
cat data.txt | xargs -n 3 -d, | sed 's/ /,/g'
Here n=3, and the input file is called data.txt.
Note: What distinguishes this solution is that it derives the number of output columns from the number of columns in the header line.
Assuming that the fields in your CSV input have no embedded , instances (in which case you'd need a proper CSV parser), try awk:
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
    print header
    # gsub returns the number of substitutions made, i.e. the number of commas in the header
    colCount = 1 + gsub(",", ",", header)
}
{
    ORS = NR % colCount == 0 ? "\n" : ","
    print
}
' file.csv
Note that if the input file ends with a newline (as is typical), you'll get an extra newline trailing the output.
With GNU Awk or Mawk (but not BSD/OSX Awk, which only supports literal, single-character RS values), you can fix this as follows:
awk -v RS='[,\n]' -v header='"id","lon","lat"' '
BEGIN {
    print header
    colCount = 1 + gsub(",", ",", header)
}
{
    ORS = NR % colCount == 0 ? "\n" : ","
    print
}
' file.csv
BSD/OSX Awk workaround: stick with -v RS=, and replace file.csv with <(tr -d '\n' < file.csv) in order to remove all newlines from the input first.
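Spelled out, that workaround might look like this (a sketch based on the description above, assuming bash for the process substitution):
awk -v RS=, -v header='"id","lon","lat"' '
BEGIN {
    print header
    colCount = 1 + gsub(",", ",", header)
}
{
    ORS = NR % colCount == 0 ? "\n" : ","
    print
}
' <(tr -d '\n' < file.csv)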
Assuming your input file is named input:
echo id,lon,lat; awk '{ORS=NR%3?",":"\n"}1' RS=, input

'Read' command stripping '\n' string

I want to extract data from a file which looks like this :
BK20120802130531:/home/michael/Scripts/usb_backup.sh
BK20120802130531:/home/michael/Scripts/yad_0.17.1.1-1_i386.deb
BK20120802130731:/home/michael/Scripts/gbk.sh
BK20120802130131:/home/michael/Scripts/alt-notify-send.sh
BK20120802130131:/home/michael/Scripts/bk.bak
BK20120802130131:/home/michael/Scripts/bk.sh
BK20120802130131:/home/michael/Scripts/demande_password.sh
The idea is to show on the screen (without creating a temporary file or modifying the original file) the following:
alt-notify-send.sh
/home/michael/Scripts
bk.bak
/home/michael/Scripts
bk.sh
/home/michael/Scripts
demande_password.sh
/home/michael/Scripts
gbk.sh
/home/michael/Scripts
usb_backup.sh
/home/michael/Scripts
yad_0.17.1.1-1_i386.deb
/home/michael/Scripts
To sum up :
Strip the characters before ':'
Put the filenames before their corresponding directory
Sort the filenames in alphabetical order
Put a line break between each filename and its corresponding directory
I succeeded in doing all this, but there is still an ugly thing in my code concerning point #4:
cut -f 2 -d ':' $big_file | \
sort -u | \
while read file ; do
echo "$(basename "$file")zipzapzupzop$(dirname "$file")" # <-- ugly thing #1
done | \
sort -dfb | \
while read line ; do
echo $line
done | \
sed 's/zipzapzupzop/\n/' # <-- ugly thing #2
At the beginning, I had written :
echo "$(basename "$file")\n$(dirname "$file")"
in place of ugly thing#1, in order to be able to do
echo -e "$line"
in the second while loop. However, the read command strips the '\n' string each time, so that I obtain
alt-notify-send.shn/home/michael/Scripts
bk.bakn/home/michael/Scripts
bk.shn/home/michael/Scripts
demande_password.shn/home/michael/Scripts
gbk.shn/home/michael/Scripts
usb_backup.shn/home/michael/Scripts
yad_0.17.1.1-1_i386.debn/home/michael/Scripts
I tried to protect the '\' character with another '\', but the result is the same.
man read
is of no help either. So, is there a proper way to do this?
read is a shell builtin, and man read may be giving you the docs for the (mostly unrelated) syscall.
read -r will prevent read from processing \ sequences.
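A quick illustration of the difference (an aside, not from the original answer):
$ printf 'a\\nb\n' | { read line; echo "$line"; }
anb
$ printf 'a\\nb\n' | { read -r line; echo "$line"; }
a\nb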
The whole thing could have been done with a single awk script though (GNU awk, since it uses asort):
awk '
{
    start = index($0, ":") + 1   # position just after the ":"
    end = match($0, "[^/]*$")    # position where the filename starts
    out[NR] = substr($0, end) "\n" substr($0, start, end - start - 1)
}
END {
    asort(out)                   # GNU awk extension
    for (i = 1; i <= NR; i++)
        print out[i]
}'
If you don't need to handle spaces in filenames, you can do this:
cat $bigfile | sed 's/.*://' | while read file; do
echo "$(basename $file) $(dirname $file)"
done | sort | awk '{print $1"\n"$2}'
You can do it with the following pipeline (should be on one line, I've split it and added comments for readability):
| sed -e 's/^[^:]*://' # Remove from start of line to first ':'
-e 's?/\([^/]*$\)? \1?' # Replace final '/' with a space
| sort -k2 # Sort on column 2 (filename)
| awk '{print $2"\n"$1}' # Reverse fields
See the following transcript:
echo 'BK20120802130531:/home/michael/Scripts/usb_backup.sh
BK20120802130531:/home/michael/Scripts/yad_0.17.1.1-1_i386.deb
BK20120802130731:/home/michael/Scripts/gbk.sh
BK20120802130131:/home/michael/Scripts/alt-notify-send.sh
BK20120802130131:/home/michael/Scripts/bk.bak
BK20120802130131:/home/michael/Scripts/bk.sh
BK20120802130131:/home/michael/Scripts/demande_password.sh' |
sed -e 's/^[^:]*://' -e 's?/\([^/]*$\)? \1?' |
sort -k2 |
awk '{print $2"\n"$1}'
alt-notify-send.sh
/home/michael/Scripts
bk.bak
/home/michael/Scripts
bk.sh
/home/michael/Scripts
demande_password.sh
/home/michael/Scripts
gbk.sh
/home/michael/Scripts
usb_backup.sh
/home/michael/Scripts
yad_0.17.1.1-1_i386.deb
/home/michael/Scripts
Just keep in mind that sort may not work as expected with lines containing spaces.
Assuming you do not have hash characters (#) in your filenames, you could use this coreutils pipeline:
cut -d: -f2- infile \
| sed -r 's,(.*)/([^/]*)$,\2#\1,' \
| sort -t'#' \
| tr '#' '\n'
cut removes the first part.
sed splits the path, swaps filename and directory and delimits them with a #.
sort sorts the #-delimited text.
tr finally replaces the # with a newline.
If you know the number of path elements, you can use the simpler version:
cut -d: -f2- infile \
| sort -t/ -k4,4 \
| sed -r 's,(.*)/([^/]*)$,\2\n\1,'

Finding a pattern in a file

I have a txt file of 500 rows and one column.
The column in each row looks somewhat like this (as an example I am pasting two rows):
chr22:49367820-49368570_NR_021492_LOC100144603,chr22:49368010-49368760_NM_005198_CHKB,chr22:49368010-49368760_NM_152247_CPT1B,chr22:49368010-49368760_NM_152253_CHKB
chr22:49367820-49368570_NR_021492_LOC100144603,chr22:49368010-49368760_NM_005198_CHKB
What I want to extract from each row is the values starting with NM_ or NR_
like
row 1 has NR_021492 NM_005198 NM_152247 NM_152253
row 2 has NR_021492 NM_005198
...
in a tab-delimited file.
any suggestions for a bash command line?
Try:
sed -r -e 's/chr[0-9]+:[^_]*_(N[RM])_([0-9]+)_[^,_]+([, ]|$)/\1_\2'$'\t''/g;s/'$'\t''$//g'
Presuming GNU sed.
So
sed -r -e 's/chr[0-9]+:[^_]*_(N[RM])_([0-9]+)_[^,_]+([, ]|$)/\1_\2'$'\t''/g;s/'$'\t''$//g' your_file > tab_delimited_file
EDIT: Updated to not leave a trailing tab character on each row.
EDIT 2: Updated again to work for any chr-then-number sequence.
grep "NM" yourfiname | cut -d_ -f3 | sed 's/[/\d]*/NM_/'
grep "NR" yourfiname | cut -d_ -f3 | sed 's/[/\d]*/NR_/'
cat file|sed s/$.*!(NR)//;
Use a regular expression to remove everything before the NR
awk -F '[,:_-]' '{
    for (i=1; i<NF; i++)
        if ($i == "NR" || $i == "NM")
            printf("%s_%s ", $i, $(i+1))
    print ""
}'
This will also work, but will print each match on its own line: egrep -o 'N[RM]_[0-9]+'
