How to manipulate text to one side of a delimiter while preserving text on the opposite side - bash

I am trying to translate some documents in which every line is of the form:
name1:text to be translated
name2:text to be translated
I am using translate-shell to perform the translations:
trans -b :es -input ~/path/to/file
The desired output would be:
name1:texto a traducir
name2:texto a traducir
But instead I am getting this output:
nombre1:texto a traducir
nombre2:texto a traducir
If I had to guess, I would say the answer probably lies in separating the fields with awk, but I'm having difficulty understanding the man pages well enough to figure out how to do it properly. Right now I'm doing this:
awk -F: '/:/ { print $1 ": " $2 }' ~/path/to/file
to separate the fields and then attempting to work with each field separately. But I am confused by awk's pattern-action statements. Can I run another command from within awk? So far all my attempts to do so have resulted in syntax errors.

Here is a recipe involving cut and paste:
cut the names and texts into two separate files:
cut -d: -f1 yourfile > names.txt
cut -d: -f2- yourfile > text.txt
translate text.txt using whatever workflow you are using at the moment
combine the old names.txt with the translated text:
paste -d: names.txt yourtranslated_text
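Putting the three steps together (a sketch: it assumes trans reads the input file via -input, as in the question, and writes one translated line per input line to stdout):
cut -d: -f1 yourfile > names.txt
cut -d: -f2- yourfile > text.txt
trans -b :es -input text.txt > translated.txt
paste -d: names.txt translated.txt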

I think @LarsFischer has the best answer so far, but just in case you have some reason to need awk, and assuming you can pass individual strings to "trans" and the text to be translated cannot contain newlines, this is how you'd do it:
awk '
{
    # Copy the line, then split it on the first ":".
    name = text = $0
    sub(/:.*/,"",name)       # keep only what is before the first ":"
    sub(/[^:]+:/,"",text)    # keep only what is after the first ":"

    # Build and run the trans command on the text half only
    # (the "-b :es" arguments are taken from the question).
    cmd = "trans -b :es \"" text "\""
    if ( (cmd | getline rslt) > 0 ) {
        print name ":" rslt
    }
    close(cmd)
}
' file

Well, I can't get translate-shell to work, but maybe something like this:
awk -v dq='"' -F: '{printf "%s:", $1; gsub(/^.*:/,""); system("trans -b :es " dq $0 dq)}' test.in

Another alternative is to paste the original and translated files together and then cut the needed fields, that is
paste -d: original translation | cut -d: -f1,4
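With the question's sample data, each pasted line would look like this (assuming the translated file mirrors the original line for line):
name1:text to be translated:nombre1:texto a traducir
so field 1 is the untranslated name and field 4 is the translated text. If the translated text can itself contain colons, use -f1,4- to keep everything from field 4 onward.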

Getting last X fields from a specific line in a CSV file using bash

I'm trying to get, as a bash variable, the list of users which are in my CSV file. The problem is that the number of users is random and can be from 1 to 5.
Example CSV file:
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
I would like to get something like
list_of_users="$(cat file.csv | grep 'record2_data2' | <something>)"
echo $list_of_users
user1,user2,user3,user4
I'm trying this:
cat file.csv | grep "record2_data2" | awk -F, -v OFS=',' '{print $4,$5,$6,$7,$8 }' | sed 's/"//g'
My result is:
user2,user3,user4,,
Question:
How to remove all "," from the end of my result? Sometimes it is just one, but sometimes it can be user1,,,,
Can I do it in a better way? The users always start after the 3rd column in my file.
This will do what your code seems to be trying to do (print the users for a given string record2_data2 which only exists in the 2nd field):
$ awk -F',' '{gsub(/"/,"")} $2=="record2_data2"{sub(/([^,]*,){3}/,""); print}' file.csv
user1,user2,user3,user4
but I don't see how that's related to your question subject of Getting last X fields from a specific line in a CSV file using bash, so I don't know if it's what you really want or not.
Better to use a bash array, and join it into a CSV string when needed:
#!/usr/bin/env bash
readarray -t listofusers < <(cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u)
IFS=,
printf "%s\n" "${listofusers[*]}"
cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u is the important bit - it first only prints out the fourth and following fields of the CSV input file, removes quotes, turns commas into newlines, and then sorts the resulting usernames, removing duplicates. That output is then read into an array with the readarray builtin, and you can manipulate it and the individual elements however you need.
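Run against the sample file from the question, the pipeline alone yields one unique user per line:
$ cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u
user1
user2
user3
user4
The IFS/printf step then joins the array elements back into user1,user2,user3,user4.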
A GNU sed solution. Let the file.csv content be
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
then
sed -n -e 's/"//g' -e '/record2_data/ s/[^,]*,[^,]*,[^,]*,// p' file.csv
gives output
user1,user2,user3,user4
Explanation: -n turns off automatic printing. The expressions work as follows: the 1st substitutes every " globally with the empty string, i.e. deletes them; the 2nd, for lines containing record2_data, substitutes (s) everything up to and including the 3rd , with the empty string, i.e. deletes it, and prints (p) the changed line.
(tested in GNU sed 4.2.2)
awk -F',' '
/record2_data2/{
    for (i=4; i<=NF; i++) o = sprintf("%s%s,", o, $i)
    gsub(/"|,$/, "", o)
    print o
}' file.csv
user1,user2,user3,user4
This might work for you (GNU sed):
sed -E '/record2_data/!d;s/"([^"]*)"(,)?/\1\2/4g;s///g' file
Delete all records except for that containing record2_data.
Remove double quotes from the fourth field onward.
Remove any double quoted fields.
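The same command can be written with one expression per step, which maps directly onto the explanation above (behavior is identical; the empty regex in the last expression reuses the previous one):
sed -E -e '/record2_data/!d' -e 's/"([^"]*)"(,)?/\1\2/4g' -e 's///g' file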

awk match by variable with dot in it

I have a script that will iterate over a file containing domains (google.com, youtube.com, etc). The purpose of the script is to check how many times each domain is included in the 12th column of a tab-separated value file.
while read domain; do
awk -F '\t' '$12 == '$domain'' data.txt | wc -l
done < domains.txt
However, awk seems to be interpreting the dots in the domains as a special character. The following error message is shown:
awk: syntax error at source line 1
context is
$12 ~ >>> google. <<< com
awk: bailing out at source line 1
I am a beginner in bash so any help would be greatly appreciated!
When you write:
domain='google.com'
awk -F '\t' '$12 == '$domain'' data.txt
the $domain is outside of any quotes:
awk -F '\t' '$12 == '$domain'' data.txt
            ^       ^       ^^
            start   end     start end
and so it is exposed to the shell for interpretation first, and THEN it becomes part of the body of the awk script before awk sees it. So what awk sees is:
awk -F '\t' '$12 == google.com' data.txt
and google.com is not a valid symbol name (e.g. a variable or function), nor a string, nor a number. What you MEANT to do was:
awk -F '\t' '$12 == "'"$domain"'"' data.txt
so the shell would see "$domain" instead of just $domain (see https://mywiki.wooledge.org/Quotes for why that's important) and awk would finally see:
awk -F '\t' '$12 == "google.com"' data.txt
which is fine, as now "google.com" is a string, not a symbol. BUT you should never allow shell variables to expand to become part of an awk script, as there are other caveats, so what you should really have done is:
awk -F '\t' -v dom="$domain" '$12 == dom' data.txt
See How do I use shell variables in an awk script? for more information.
By the way, even after fixing the above problem do not do this:
while read domain; do
awk -F '\t' -v dom="$domain" '$12 == dom' data.txt | wc -l
done < domains.txt
as it'll be immensely slow and contains insidious bugs (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice). Do something like this instead (untested):
awk -F'\t' '
NR==FNR {            # 1st file (domains.txt): remember each domain
    cnt[$1] = 0
    next
}
$12 in cnt {         # 2nd file (data.txt): count matches on field 12
    cnt[$12]++
}
END {
    for ( dom in cnt ) {
        print dom, cnt[dom]
    }
}
' domains.txt data.txt
That will be far more efficient, robust, and portable than calling awk inside a shell read loop.
See What are NR and FNR and what does "NR==FNR" imply? for how that awk script works. Get the book Effective AWK Programming, 5th Edition, by Arnold Robbins to learn awk.
awk -F '\t' '$12 == '$domain'' data.txt | wc -l
The single quotes are building an awk program. They are not something visible to awk. So awk sees this:
$12 == google.com
Since there aren't any quotes around google.com, that is a syntax error. You just need to add quotation marks.
awk -F '\t' '$12 == "'"$domain"'"' data.txt
The quotes jammed together like that are a little confusing, but it's just this:
'....' stuff to send to awk. Single quotes are for the shell.
'..."...' a double quote inside the awk program for awk to see
'...'"..." stuff in double quotes _outside_ the awk program for the shell
We can combine those like this:
'..."'"$var"'"...'
That's a bunch of literal awk code ending in a double quote, followed by the expansion of the shell parameter var, which is double-quoted as usual in the shell for safety, followed by more literal awk code starting with a double quote. So the end result is a string passed to awk that includes the value of var inside double quotes.
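For example, with a hypothetical var, you can use echo to show exactly what the shell hands to awk:
$ var="google.com"
$ echo '$12 == "'"$var"'"'
$12 == "google.com"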
But you don't have to be so fancy or confusing since awk provides the -v option to set variables from the shell:
awk -v domain="$domain" '$12 == domain' data.txt
Since the domain is not quoted inside the awk code, it is interpreted as the name of a variable. (Periods are not legal in variable names, which is why you got a syntax error with your domains; if you hadn't, though, awk would have treated them as empty and been looking for lines whose twelfth field was likewise blank.)
Use a combination of cut to print the 12th column of the TAB-delimited file, sort and uniq to count the items:
cut -f12 data.txt | sort | uniq -c
This gives a count of how many input lines have each domain (such as "google.com") in $12.
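If you only care about the domains listed in domains.txt, one way is to filter first (a sketch; -F treats the patterns as fixed strings and -x requires whole-line matches):
cut -f12 data.txt | grep -Fxf domains.txt | sort | uniq -c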
{m,g}awk -v __="${domain}" '
BEGIN { _*=\
( _ ="\t[^\t]*")*gsub(".",(_)_,_)*sub(".","",_)*\
gsub("[.:&=/-]","[&]",__)*sub("[[][^[]+$",__"\t?",_)*(\
FS=_ } { _+=NF } END { print _-NR }'

Comparing 2 files with a for loop in bash

I am trying to compare the values in 2 files. For each row in Summits3.txt I want to define the value in Column 1 as "Chr" and then find the rows in generef.txt which have my value for "Chr" in column 2.
Then I would like to output some info about that row from generef.txt to out.txt and then repeat until the end.
I am using the following script:
#!/bin/bash
IFS=$'\n'
for i in $(cat Summits3.txt)
do
Chr=$(echo "$i" | awk '{print $1}')
awk -v var="$Chr" '{
if ($2==""'${Chr}'"")
print $2, $3
}' generef.txt > out.txt
done
it "works" but its only comparing values from the last line of Summits3.txt. It seems like it not looping through the awk bit.
Anyway please help if you can!
I think you might be looking for something like this:
awk 'FNR == NR {a[$1]; next} $2 in a {print $2, $3}' Summits3.txt generef.txt > out.txt
Basically you read column one of the first file into an array (the array index is your Chr value and the value is the empty string), then for the second file you print only the rows where the second column is in the index set of the array. FNR is the row number in the file that is currently being processed; NR is the row number across all processed rows so far. This is a general look-up command I use for pulling out genes or variants from one file that are present in the other.
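A tiny worked example with made-up file contents, just to show the mechanics:
$ cat Summits3.txt
chr1 100
chr3 250
$ cat generef.txt
geneA chr1 1000
geneB chr2 2000
geneC chr3 3000
$ awk 'FNR == NR {a[$1]; next} $2 in a {print $2, $3}' Summits3.txt generef.txt
chr1 1000
chr3 3000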
In your code above, it should be appending to out.txt (>> out.txt), but then you have to make sure to reset out.txt before each run.
Besides using external scripts inside a loop (which is expensive), the first thing we see is that you redirect your output to a file from inside the loop. The output file is recreated each time, so please change it to append (>>), or better, move the redirection outside the loop.
When you want to use a loop, try this
while read -r Chr other; do
    cut -d" " -f2,3 generef.txt | grep -E "^${Chr} "
done < Summits3.txt > out.txt
When you want to avoid the loop (needed for large input files), awk or some combined command can be used. A first attempt would be:
grep -f <(cut -d" " -f1 Summits3.txt) <(cut -d" " -f2,3 generef.txt)
but this can fail: you only want matches of the complete field Chr, starting at the first position and ending at a space (I assume that is the field separator). So anchor each pattern:
grep -f <(cut -d" " -f1 Summits3.txt| sed 's/.*/^& /') <(cut -d" " -f2,3 generef.txt)
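The sed step just turns each Chr value (chr1 and chr3 here are made-up) into an anchored pattern, one per line:
$ printf 'chr1\nchr3\n' | sed 's/.*/^& /'
^chr1 
^chr3 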

bash shell how to cut the first column out of a file

So I have a file named 'file' with these contents:
a 1 z
b 2 y
c 3 x
how can I cut the first column and put it in its own file?
I know how to do the rest using the space as a delimiter like this:
cut -f1 -d ' ' file > filecolumn1
but I'm not sure how to cut just the first column since there isn't any character in the front that I can use as a delimiter.
The delimiter doesn't have to be before the column; it's between the columns. So use the same delimiter, and specify field 1.
cut -f1 -d ' ' file > filecolumn1
Barmar's got a good option. Another option is awk:
awk '{print $1}' file > output.txt
If you have a different delimiter, you can use the -F switch to provide it. For example, if your data looked like this:
a,1,2
b,2,3
c,3,4
you can use awk's -F switch in this manner:
awk -F',' '{print $1}' file > output.txt
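With that comma-separated sample, the command prints the first column:
$ awk -F',' '{print $1}' file
a
b
c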

What is the optimal way to extract values between braces in bash/awk?

I have the output in this format:
Infosome - infotwo: (29333) - data-info-ids: (33389, 94934)
I want to extract the last two numbers in the last pair of braces. Sometimes there is only a single number in the last pair of braces.
This is the code I used.
echo "Infosome - infotwo: (29333) - data-info-ids: (33389, 94934)" | \
tr "," " " | tr "(" " " | tr ")" " " | awk -F: '{print $4}'
Is there a cleaner way to extract the values, or a more optimal one?
Try this:
awk -F '[()]' '{print $(NF-1)}' input | tr -d ,
It's kind of a refactoring of your command.
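The reason for $(NF-1) rather than $NF: the closing parenthesis at the end of the line produces a trailing empty field, so the numbers sit in the next-to-last field. A quick check with the sample line from the question:
$ echo "Infosome - infotwo: (29333) - data-info-ids: (33389, 94934)" | awk -F '[()]' '{print NF}'
5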
awk -F\( '{gsub("[,)]", " ", $NF); print $NF}' input
will give
33389 94934
I am a bit unclear about the meaning of "optimal"/"professional" in this problem's context, but this uses only one command/tool; not sure if that qualifies.
Or building on @kev's approach (but not needing tr to eliminate the comma):
awk -F'[(,)]' '{print $4, $5}' input
outputs:
33389 94934
This can also be done in pure bash. Assuming the text always looks like the sample in the question, the following should work:
$ text="Infosome - infotwo: (29333) - data-info-ids: (33389, 94934)"
$ result="${text/*(}"
$ echo ${result//[,)]}
33389 94934
This uses shell "parameter expansion" (which you can search for in bash's man page) to strip the string in much the same way you did using tr. Strictly speaking, the quotes in the second line are not necessary, but they help with StackOverflow syntax highlighting. :-)
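For reference, the intermediate values look like this (assuming extglob is off; with shopt -s extglob the unescaped *( can be read as an extended glob, so ${text/*\(} is the safer spelling):
$ result="${text/*(}"
$ echo "$result"
33389, 94934)
$ echo ${result//[,)]}
33389 94934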
You could alternately make this a little more flexible by looking for the actual field you're interested in. If you're using GNU awk, you can specify RS with multiple characters:
$ gawk -vRS=" - " -vFS=": *" '
    { f[$1]=$2; }
    END {
        print f["data-info-ids"];
        # Or you could strip the non-numeric characters to get just numbers.
        #print gensub(/[^0-9 ]/,"","g",f["data-info-ids"]);
    }' <<<"$text"
I prefer this way, because it actually interprets the input data for what it is: structured text representing some sort of array.
