Extract occurrences of character in file line by line - bash

I have a large file of a bilingual lexicon with lines formatted as:
abatement: disminucion; mitigacion; moderacion; rebaja; deduccion; supresion; anulacion
I'm trying to find out which line has the most translated words, so I'm looking for the line with the most occurrences of ;, and then I want to echo the English word.
I've managed to get something close but it uses sed to trim off data, meaning I can't match the English word back to the line.
Any ideas?

awk -F'[:;]' '{if(NF>n){n=NF;w=$1}}END{print w}' filename

Treating : and ; as field separators, the line with the most ;s will have the most fields.
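For example, the sample line has one : and six ;s, i.e. seven separators, so awk sees eight fields:
$ echo 'abatement: disminucion; mitigacion; moderacion; rebaja; deduccion; supresion; anulacion' | awk -F'[:;]' '{print NF}'
8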
Or in pure bash:
max=0
while IFS=';' read -ra fields; do
    n=${#fields[@]}                # number of ;-separated fields on this line
    if (( n > max )); then
        max=$n
        english=${fields[0]%%:*}   # strip everything from the colon on
    fi
done < file.txt
echo "$english"

Related

Convert multi-line csv to single line using Linux tools

I have a .csv file that contains double quoted multi-line fields. I need to convert the multi-line cell to a single line. It doesn't show in the sample data but I do not know which fields might be multi-line so any solution will need to check every field. I do know how many columns I'll have. The first line will also need to be skipped. I don't know how much data there is, so performance isn't a consideration.
I need something that I can run from a bash script on Linux. Preferably using tools such as awk or sed and not actual programming languages.
The data will be processed further with Logstash but it doesn't handle double quoted multi-line fields hence the need to do some pre-processing.
I tried something like this and it kind of works on one row but fails on multiple rows.
sed -e :0 -e '/,.*,.*,.*,.*,/b' -e N -e '1n;N;N;N;s/\n/ /g' -e b0 file.csv
CSV example
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
The output I want is
First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
Jane,Doe,Country City Street,67890
etc.
First my apologies for getting here 7 months late...
I came across a problem similar to yours today, with multiple multi-line fields. I was glad to find your question, but my case has the added complexity that, with more than one field affected, quotes might open, close and open again on the same line... Anyway, after a lot of reading and combining answers from different posts, I came up with something like this:
First I count the quotes in a line; to do that, I strip everything but quotes and then use wc:
quotes=$(echo "$line" | tr -cd '"' | wc -c) # counts the quotes
For a single multi-line field, knowing whether the quotes are 1 or 2 is enough. In a more generic scenario like mine, I have to know whether the number of quotes is odd or even to know whether the line completes the record or expects more information.
To check for even or odd you can use the modulo operator (%); in general:
even % 2 = 0
odd % 2 = 1
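For instance, in bash arithmetic:
$ echo $(( 4 % 2 )) $(( 3 % 2 ))
0 1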
For the first line:
Odd means that the line expects more information on the next line.
Even means the line is complete.
For the subsequent lines, I have to know the status of the previous one. For instance, in your sample text:
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
You can say line 1 (John,Doe,"Country) has 1 quote (odd), which means the status of the record is incomplete, or open.
When you go to line 2, there is no quote (even). Nevertheless, this does not mean the record is complete; you have to consider the previous status. So for the lines following the first one it will be:
Odd means that record status toggles (incomplete to complete).
Even means that record status remains as the previous line.
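Tracing the sample record line by line, carrying the status forward:
John,Doe,"Country -> 1 quote (odd) -> incomplete (open)
City -> 0 quotes (even) -> incomplete (unchanged)
Street",12345 -> 1 quote (odd) -> complete (toggled)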
What I did was loop line by line while carrying the status of the last line over to the next one:
incomplete=0
while IFS= read -r line; do
    quotes=$(echo "$line" | tr -cd '"' | wc -c)  # count the quotes
    incomplete=$(( (quotes + incomplete) % 2 ))  # odd or even decides the status
    if [ "$incomplete" -eq 1 ]; then
        echo -n "$line " >> new.csv   # if the line is incomplete, join it with the next
    else
        echo "$line" >> new.csv       # if the line completes the record, finish it
    fi
done < file.csv
Once this is executed on a file in your format, it generates a new.csv like this:
First name,Last name,Address,ZIP
John,Doe,"Country City Street",12345
I like one-liners as much as anyone; I wrote the script above for the sake of clarity. You can, arguably, write it in one line like:
i=0; while read -r l; do i=$(( ($(echo "$l" | tr -cd '"' | wc -c) + i) % 2 )); [[ $i = 1 ]] && echo -n "$l " || echo "$l"; done < file.csv > new.csv
I would appreciate it if you could go back to your example and see if this works for your case (which you most likely already solved). Hopefully this can still help someone else down the road...
Recovering the multi-line fields
Every need is different; in my case I wanted the records on one line to further process the csv and add some bash-extracted data, but I also wanted to restore the csv as it was afterwards. To accomplish that, instead of joining the lines with a space I used a code, hopefully unique, that I could later search and replace:
i=0; while read -r l; do i=$(( ($(echo "$l" | tr -cd '"' | wc -c) + i) % 2 )); [[ $i = 1 ]] && echo -n "$l ~newline~ " || echo "$l"; done < file.csv > new.csv
The code is ~newline~; it is totally arbitrary, of course.
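With the sample data, the joined record in new.csv then looks like:
John,Doe,"Country ~newline~ City ~newline~ Street",12345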
Then, after doing my processing, I took the csv text file and replaced the coded newlines with real newlines:
sed -i 's/ ~newline~ /\n/g' new.csv
References:
Ternary operator: https://stackoverflow.com/a/3953666/6316852
Count char occurrences: https://stackoverflow.com/a/41119233/6316852
Other peculiar cases: https://www.linuxquestions.org/questions/programming-9/complex-bash-string-substitution-of-csv-file-with-multiline-data-937179/
TL;DR
Run this:
i=0; while read -r l; do i=$(( ($(echo "$l" | tr -cd '"' | wc -c) + i) % 2 )); [[ $i = 1 ]] && echo -n "$l " || echo "$l"; done < file.csv > new.csv
... and collect results in new.csv
I hope it helps!
If Perl is an option, please try the following:
perl -e '
while (<>) {
    $str .= $_;
}
while ($str =~ /("(("")|[^"])*")|((^|(?<=,))[^,]*((?=,)|$))/g) {
    if (($el = $&) =~ /^".*"$/s) {
        $el =~ s/^"//s; $el =~ s/"$//s;   # strip the enclosing quotes
        $el =~ s/""/"/g;                  # unescape doubled quotes
        $el =~ s/\s+(?!$)/ /g;            # squeeze inner whitespace (incl. newlines) to one space
    }
    push(@ary, $el);
}
foreach (@ary) {
    print /\n$/ ? "$_" : "$_,";
}' sample.csv
sample.csv:
First name,Last name,Address,ZIP
John,Doe,"Country
City
Street",12345
John,Doe,"Country
City
Street",67890
Result:
First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
John,Doe,Country City Street,67890
This might work for you (GNU sed):
sed ':a;s/[^,]\+/&/4;tb;N;ba;:b;s/\n\+/ /g;s/"//g' file
Test each line to see that it contains the correct number of fields (in the example that is 4). If there are not enough fields, append the next line and repeat the test. Otherwise, replace the newline(s) with spaces and finally remove the double quotes.
N.B. This may be fraught with problems such as ,'s between "'s and quoted "'s.
Try cat -v file.csv. When the file was made with Excel, you might have some luck: when the newlines in a field are a simple \n and the newline at the end of each record is \r\n (which will look like ^M), parsing is simple.
# delete all newlines and replace the ^M with a new newline.
tr -d "\n" < file.csv| tr "\r" "\n"
# Above two steps with one command
tr "\n\r" " \n" < file.csv
When you want a space between the joined lines, you need an additional step.
tr "\n\r" " \n" < file.csv | sed '2,$ s/^ //'
EDIT: @sjaak commented that this didn't work in his case.
When your broken lines also have ^M, you can still be a lucky (wo-)man.
When your broken field is always the first field in double quotes and you have GNU sed 4.2.2, you can join 2 lines whenever the first one contains exactly one double quote.
sed -rz ':a;s/(\n|^)([^"]*)"([^"]*)\n/\1\2"\3 /;ta' file.csv
Explanation:
-z don't use \n as the line separator
:a label for repeating the step after a successful replacement
(\n|^) match right after a newline, or at the very start of the input
([^"]*) a substring without a "
ta go back to label a and repeat
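Applied to the sample file.csv, the substitutions repeat until the quoted field closes:
$ sed -rz ':a;s/(\n|^)([^"]*)"([^"]*)\n/\1\2"\3 /;ta' file.csv
First name,Last name,Address,ZIP
John,Doe,"Country City Street",12345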
awk pattern matching works here.
An answer in one line:
awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile
If you'd like to drop the quotes, you could use:
awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile | sed 's/"//gw NewFile'
but I prefer to keep them.
To explain the code:
/Pattern/ : find Pattern in the current line.
ORS : the output record separator.
$0 : the whole of the current line.
's/OldPattern/NewPattern/' : substitutes the first OldPattern with NewPattern.
/g : does the previous action for all occurrences of OldPattern.
/w : writes the result to NewFile.
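On the question's sample, this yields:
$ awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile
First name,Last name,Address,ZIP
John,Doe,"Country City Street",12345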

sed removing # and ; comments from files up to a certain keyword

I have files from which comments and whitespace need to be removed, up to a certain keyword. The line number varies. Is it possible to limit multiple continued sed substitutions based on a keyword?
This removes all comments and empty lines from the whole file:
sed -i -e 's/#.*$//' -e 's/;.*$//' -e '/^$/d' file
For example, something like this:
# string1
# string2
some string
; string3
; string4
####
<Keyword_Keep_this_line_and_comments_white_space_after_this>
# More comments that need to be here
; etc.
sed -i '1,/keyword/{/^[#;]/d;/^$/d;}' file
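The range 1,/Keyword/ limits the deletions to everything from line 1 through the first line matching the regex; the rest of the file is untouched. Previewing without -i on the sample above (note the capital K, so the regex actually matches the sample's keyword line):
$ sed '1,/Keyword/{/^[#;]/d;/^$/d;}' file
some string
<Keyword_Keep_this_line_and_comments_white_space_after_this>
# More comments that need to be here
; etc.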
I would suggest using awk and setting a flag when you reach your keyword:
awk '/Keyword/ { stop = 1 } stop || !/^[[:blank:]]*([;#]|$)/' file
Set stop to true when the line contains Keyword. Do the default action (print the line) when stop is true or when the line doesn't match the regex. The regex matches lines whose first non-blank character is a semicolon or hash, or blank lines. It's slightly different to your condition but I think it does what you want.
The command prints to standard output so you should redirect to a new file and then overwrite the original to achieve an "in-place edit":
awk '...' input > tmp && mv tmp input
Use grep -n keyword to get the number of the line that contains the keyword.
Use sed -i -e '1,N s/#..., where N is the line number that contains the keyword, to only remove comments on lines 1 to N.
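A sketch combining the two steps, assuming GNU sed and that the keyword line itself contains no # or ;:
n=$(grep -n -m1 'Keyword' file | cut -d: -f1)   # line number of the first match
sed -i "1,${n}{s/#.*//;s/;.*//;/^\$/d}" file    # apply the substitutions only on lines 1..n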

UNIX Search in specific column for user specified code and output entire line

I'm working on a program that searches a medication list and returns a report as requested by the user. So I am trying to search this list for a code that the user inputs and then return the relevant information.
EX. (medcode) (dosage)
commA6314 ifosfamide 30
home5341209 urokinase 6314
When I search the file I only want it to return the line if it finds a match in columns 6-12 (6314 for the first line), but at the moment it will return both lines since the second line also contains 6314. All of the answers I saw used text processing utilities like awk, sed or perl, and one of the conditions of the program is not to use any of these utilities.
The programs expected output:
Enter medication code?
6314
See Generic name g/G or Dose d/D?
g
ifosfamide
What i am getting currently:
Enter medication code?
6314
See Generic name g/G or Dose d/D?
g
ifosfamide
urokinase
so it is also displaying information about the second medication, because 6314 is also contained in the columns for the dosage.
Using bash
To match 6314 but only if it starts in column 6 using just bash, try:
$ while read -r line; do [[ "$line" =~ ^.{5}6314 ]] && echo "$line"; done <infile
commA6314 ifosfamide 30
This reads lines from the file one-by-one. The line is echoed to output only if it matches the regex ^.{5}6314 which requires that 6314 appear starting at the sixth character from the start of the line.
To print just the second word on the line, but only if the first word contains your number starting at position six:
$ while read -r code name extra; do [[ "$code" =~ ^.{5}6314 ]] && echo "$name"; done <infile
ifosfamide
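To tie this back to the interactive program in the question, a minimal sketch with no external utilities (the prompts follow the question; infile and the field layout are assumptions):
read -rp "Enter medication code? " code
read -rp "See Generic name g/G or Dose d/D? " choice
while read -r med name dose; do
    if [[ "$med" =~ ^.{5}"$code" ]]; then
        # the quoted $code is matched literally; the unquoted ^.{5} stays a regex
        if [[ "$choice" == [gG] ]]; then echo "$name"; else echo "$dose"; fi
    fi
done < infile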
Using grep
To match 6314 but only if it starts in column 6, try:
$ grep -E '^.{5}6314' infile
commA6314 ifosfamide 30
Here, ^ specifies the beginning of a line and .{5} matches any five characters. Thus ^.{5}6314 matches 6314 but only if it starts as the sixth character on the line.
Using awk
$ awk '"6314" == substr($0, 6, 4)' infile
commA6314 ifosfamide 30
Here, substr($0, 6, 4) selects four characters from the line starting at the sixth. If this equals 6314, then the line is printed.
Using sed
$ sed -En '/^.{5}6314/p' infile
commA6314 ifosfamide 30
-n tells sed not to print unless we explicitly ask it to. /^.{5}6314/p tells sed to print any line that, starting at the sixth character, matches 6314.
Try this using just bash:
while read -r line; do
[[ ${line%% *} == *6314* ]] && echo "$line"
done < input_file
It searches only in the medication column.
Explanation:
${line%% *}
is a bash parameter expansion; it keeps only the first 'word', i.e. everything before the first space.
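For example:
$ line='commA6314 ifosfamide 30'
$ echo "${line%% *}"
commA6314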

Replace some lines in fasta file with appended text using while loop and if/else statement

I am working with a fasta file and need to add line-specific text to each of the headers. So for example if my file is:
>TER1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
I want a while loop that will read through each line; for those with a > at the start, I want to append |population: plus the first three characters after the >. So line one would be:
>TER1|population:TER
etc.
I can't figure out how to make this work. Here is my best attempt so far:
filename="testfasta.fa"
while read -r line
do
    if [[ "$line" == ">"* ]]; then
        id=$(cut -c2-4 <<<"$line")
        printf $line"|population:"$id"\n" >> outfile
    else
        printf $line"\n" >> outfile
    fi
done < "$filename"
This produces a file with the original headers and each following line on a single line.
Can someone tell me where I'm going wrong? My if and else loop aren't working at all!
Thanks!
You could use a while loop if you really want,
but sed would be simpler:
sed -e 's/^>\(...\).*/&|population:\1/' "$filename"
That is, for lines starting with > (pattern: ^>),
capture the next 3 characters (with \(...\)),
and match the rest of the line (.*),
replace with the line as it was (&),
and the fixed string |population:,
and finally the captured 3 characters (\1).
This will produce for your input:
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Or you can use this awk, also producing the same output:
awk '{sub(/^>.*/, $0 "|population:" substr($0, 2, 3))}1' "$filename"
You can do this quickly in awk:
awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' infile.txt > outfile.txt
$ awk '$1~/^>/{$1=$1"|population:"substr($1,2,3)}{}1' testfile
>TER1|population:TER
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>TER2|population:TER
AGCATGCTAGCTAGACGACTCGATCGCATGCTC
>URC1|population:URC
AGCATGCTAGCTAGTCGACTCGATCGCATGCTC
>URC2|population:URC
AGCATGCTACCTAGTCGACTCGATCGCATGCTC
>UCR3|population:UCR
AGCATGCTAGCTAGTCGACTCGATGGCATGCTC
Here awk will:
Test if the record starts with a >. The $1 looks at the first field, but $0 for the entire record would work just as well in this case. The ~ performs a regex test, and ^> means "starts with >", making the test: ($1~/^>/)
If so, it sets the first field to the output you are looking for (using substr() to get the bits of the string you want): {$1=$1"|population:"substr($1,2,3)}
Finally it prints out the entire record (with the changes if applicable): the trailing 1 is shorthand for {print $0}, i.e. print the entire record (the empty {} before it does nothing).

how to convert a text file into comma separated values

I have a text file text.txt that contains:
aaa/bbb/ccc/ddd/eee
119
fff/ggg/hhh/iii/jjj
20
Now how do I convert this output into 2 columns and store it in another text file:
file count
aaa/bbb/ccc/ddd/eee 119
fff/ggg/hhh/iii/jjj 20
I want to do this using a shell script.
This should work
sed 'N;s/\n/ /' fileName
The N command above is an example of sed's multiline capability: N appends the next line to the pattern space, so the two lines are held together, separated by \n. The substitution is then applied to
first_line\nsecond_line.
Here N is followed by a substitution that replaces the \n with a space. As a result the output becomes
first_line second_line
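With the question's input:
$ sed 'N;s/\n/ /' text.txt
aaa/bbb/ccc/ddd/eee 119
fff/ggg/hhh/iii/jjj 20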
In pure bash:
( echo "file,count"
while read -r line        # read line by line
do
    echo -n "$line,"      # print one line, with a comma and without a newline
    read -r line          # get the next line
    echo "$line"          # print that line as the second column
done < "inputFilename" ) > "outputFilename"   # redirect to the output file
should do the trick (assuming you want actual comma separated values).
