Substitute special character - shell

I have a special character in my .txt file.
I want to substitute that special character ý with |
and rename the file to .mnt from .txt.
Here is my code: it renames the file to .mnt, but does not substitute the special character:
#!/bin/sh
for i in `ls *.txt 2>/dev/null`;
do
filename=`echo "$i" | cut -d'.' -f1`
sed -i 's/\ý/\|/g' $i
mv $i ${filename}.mnt
done
How to do that?
Example:
BEGIN_RUN_SQLýDELETE FROM PRC_DEAL_TRIG WHERE DEAL_ID = '1:2:1212'

You have multiple problems in your code. Don't use ls in scripts and quote your variables. You should probably use $(command substitution) rather than the legacy `command substitution` syntax.
If your task is to replace ý in the file's contents -- not in its name -- sed -i is not wrong, but superfluous; just write the updated contents to the new location and delete the old file.
#!/bin/sh
for i in *.txt
do
filename=$(echo "$i" | cut -d'.' -f1)
sed 's/ý/|/g' "$i" >"${filename}.mnt" && rm "$i"
done
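As a small refinement, the shell can strip the trailing .txt itself, with no echo or cut process at all:
filename=${i%.txt}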
If your system is configured for UTF-8, the character ý is representable with either the byte sequence
\xc3 \xbd (the precomposed character U+00FD) or the decomposed sequence \x79 \xcc \x81 (U+0079 + U+0301). You might find that the file contains one representation while your terminal prefers another; the only way to really be sure is to examine the hex bytes in the file and on your terminal. It is also entirely possible that your terminal is not capable of displaying the contents of the file exactly. Try
bash$ printf 'ý' | xxd
00000000: c3bd
bash$ head -c 16 file | xxd
00000000: 4245 4749 4e5f 5255 4e5f 5351 4cff 4445 BEGIN_RUN_SQL.DE
If (as here) you find that they are different (the latter outputs the single byte \xff between "BEGIN_RUN_SQL" and "DE") then the trivial approach won't work. Your sed may or may not support passing in literal hex sequences to say exactly what to substitute; or perhaps try e.g. Perl if not.
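For example, GNU sed understands hex escapes in the pattern, and Perl always does; a minimal sketch for the \xff byte found above:
sed -i 's/\xff/|/g' file.txt      # GNU sed only; \xff is the raw byte from the dump
perl -i -pe 's/\xff/|/g' file.txt # fallback if your sed lacks hex escapes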

Related

How to use lines in a file as keyword for grep?

I've searched lots of questions on here and other sites, and people have suggested things that should fix my problem, but I think there's something wrong with my code that I just don't recognize.
I have 24 .fasta files from NGS sequencing that are 150bp long. There are approximately 1M reads in each file. The reads are from targeted sequencing where we electroporated vectors with cDNA for genes of interest and a unique barcode sequence. I need to look through the sequencing files for the presence or absence of the barcode sequence which corresponds to a specific gene.
I have a .txt list of the barcode sequences that I want to pass to grep to look for the barcodes in the .fasta files. I've tried so many variations of this command. I can give grep each barcode individually, but that's time consuming; I know it's possible to give it the list of barcode sequences, search each .fasta for each of the barcodes, and record how many times each barcode is found in each file.
Here's my code where I give it each barcode individually:
# Barcode 33
mkdir --mode 755 $dir/BC33
FILES="*.fasta"
for f in $FILES; do
cat "$f" | tr -d "\n" | tr ">" "\n" | grep 'TATTAGAGTTTGAGAATAAGTAGT' > $dir/BC33/"$f"
done
I tried to adapt it so that I don't have to feed every barcode sequence in individually:
dir="/home/lozzib/AG_Barcode_Seq/"
cd $dir
FILES="*.fasta"
for f in $FILES; do
cat "$f" | tr -d "\n" | tr ">" "\n" | grep -c -f BarcodeScreenSeq.txt | sort > $dir/Results/"$f"
echo "Finished $f"
done
But it is not searching for the barcode sequences. With this iteration it just returns new, empty files in the /Results directory. I also tried a nested loop, where I tried to make the barcode sequence a variable that changes like $FILES, but that just gave me a new file with the names of my .fasta files:
dir="/home/lozzib/AG_Barcode_Seq/"
cd $dir
FILES="*.fasta"
for f in $FILES; do
for b in `cat /home/lozzib/AG_Barcode_Seq/BarcodeScreenSeq.txt`; do
cat "$f" | grep -c "$b" | sort > $dir/"$f"_Barcode
done ;
done
I want an output .txt file that has:
<barcode sequence>: <# of times that bc was found>
for each .fasta file, because I want to put all the samples together to make one large Excel sheet which shows each barcode and how many times it was found in each sample.
Please help, I've tried everything I can think of.
EDIT
Here is what the BarcodeScreenSeq.txt file would look like. It's just a txt file where each line is a barcode sequence:
head BarcodeScreenSeq.txt
TATTATGAGAAAGTTGAATAGTAG
ATGAAAGTTAGAGTTTATGATAAG
AATAGATAAGATTGATTGTGTTTG
TGTTAAATGTATGTAGTAATTGAG
ATAGATTTAAGTGAAGAGAGTTAT
GAATGTTTGTAAATGTATAGATAG
AAATTGTGAAAGATTGTTTGTGTA
TGTAAGTGAAATAGTGAGTTATTT
GAATTGTATAAAGTATTAGATGTG
AGTGAGATTATGAGTATTGATTTA
EDIT
lozzib@gliaserver:~/AG_Barcode_Seq$ file BarcodeScreenSeq.txt
BarcodeScreenSeq.txt: ASCII text, with CRLF line terminators
Windows Line Endings
Your BarcodeScreenSeq.txt has Windows line endings. Each line ends with the special characters \r\n. Linux tools such as grep expect Linux line endings (\n alone) and interpret your file ...
TATTATG\r\n
ATGAAAG\r\n
...
to look for the patterns TATTATG\r, ATGAAAG\r, ... (note the \r at the end). Because of the \r there is no match.
Either: Convert your file once by running dos2unix BarcodeScreenSeq.txt or sed -i 's/\r//g' BarcodeScreenSeq.txt. This will change your file.
Or: replace every BarcodeScreenSeq.txt in the following scripts by <(tr -d '\r' < BarcodeScreenSeq.txt). This won't change the file, but creates more overhead as the file is converted over and over again.
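For example, the grep from your second attempt becomes the following (this fixes only the line-ending problem; the counting issue is addressed next, and file.fasta stands in for each input file):
grep -c -f <(tr -d '\r' < BarcodeScreenSeq.txt) file.fasta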
Command
grep -c has only one counter. If you pass multiple search patterns at once (for instance using -f BarcodeScreenSeq.txt) you still get only one number for all patterns together.
To count the occurrences of each pattern individually you can use the following trick:
for file in *.fasta; do
grep -oFf BarcodeScreenSeq.txt "$file" |
sort | uniq -c |
awk '{print $2 ": " $1 }' > "Results/$file"
done
grep -o will print each match as a single line.
sort | uniq -c will count how often each line occurs.
awk is only there to change the format from #matches pattern to pattern: #matches.
Benefit: The command should be fairly fast.
Drawback: Patterns from BarcodeScreenSeq.txt that are not found in $file won't be listed at all. Your result will leave out lines of the form pattern: 0.
If you really need the lines of the form pattern: 0 you could use another trick:
for file in *.fasta; do
grep -oFf BarcodeScreenSeq.txt "$file" |
cat - BarcodeScreenSeq.txt |
sort | uniq -c |
awk '{print $2 ": " ($1 - 1) }' > "Results/$file"
done
cat - BarcodeScreenSeq.txt will insert the content of BarcodeScreenSeq.txt at the end of grep's output such that #matches is one bigger than it should be. The number is corrected by awk.
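With the sample barcodes above, the resulting file would then look something like this (the counts are invented for illustration):
TATTATGAGAAAGTTGAATAGTAG: 17
ATGAAAGTTAGAGTTTATGATAAG: 0
AATAGATAAGATTGATTGTGTTTG: 3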
You can read a text file one line at a time and process each line separately using a redirect. Print the barcode together with its count, and redirect once, after the inner loop, so that each line does not overwrite the previous one:
for f in *.fasta; do
while read -r seq; do
printf '%s: %s\n' "${seq}" "$(grep -c "${seq}" "${f}")"
done < /home/lozzib/AG_Barcode_Seq/BarcodeScreenSeq.txt > "${dir}"/"${f}"_Barcode
done

Bash: how to get the complete substring of a match in a string?

I have a TXT file, which is shipped from a Windows machine and is encoded in ISO-8859-1. My Qt application is supposed to read this file but QString supports only UTF-8 (I want to avoid working with QByteArray). I've been struggling to find a way to do that in Qt, so I decided to write a small script that does the conversion for me. I have no problem writing it for exactly my case, but I would like to make it more general - for all ISO-8859 encodings.
So far I have the following:
#!/usr/bin/env bash
output=$(file -i $1)
# If the output contains any sort of ISO-8859 substring
if echo "$output" | grep -qi "ISO-8859"; then
# Retrieve actual encoding
encoding=...
# run iconv to convert
iconv -f $encoding $1 -t UTF-8 -o $1
else
echo "Text file not encoded in ISO-8859"
fi
The part that I'm struggling with is how to get the complete substring that has been successfully matched in the grep command.
Let's say I have the file stations.txt and it's encoded in ISO-8859-15. In this case
$~: ./fixEncodingToUtf8 stations.txt
stations.txt: text/plain; charset=iso-8859-15
will be the output in the terminal. Internally the grep finds the iso-8859 (since I use the -i flag it processes the input in a case-insensitive way). At this point the script needs to "extract" the whole substring namely not just iso-8859 but iso-8859-15 and store it inside the encoding variable to use it later with iconv (which is case insensitive (phew!) when it comes to the name of the encodings).
NOTE: The script above can be extended even further by simply retrieving the value that follows charset and using it for the encoding. However this has one huge flaw - what if the input file has an encoding that has a larger character set than UTF-8 (simple example: UTF-16 and UTF-32)?
Or using bash features like below
$ str="stations.txt: text/plain; charset=iso-8859-15"
$ echo "${str#*=}"
iso-8859-15
To save in variable
$ myvar="${str#*=}"
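This is plain POSIX parameter expansion, and the related forms are worth knowing (a quick sketch with the same str):
$ echo "${str##*=}"  # longest prefix through the last '='  -> iso-8859-15
$ echo "${str%%=*}"  # longest suffix from the first '='    -> stations.txt: text/plain; charset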
You can use cut or awk to get at this:
awk:
encoding=$(echo "$output" | awk -F"=" '{print $2}')
cut:
encoding=$(echo "$output" | cut -d"=" -f2)
I think you could just feed this over to your iconv command directly and reduce your script to:
iconv -f "$(file -i "$1" | cut -d"=" -f2)" -t UTF-8 "$1"
Well, in this case all the parsing is rather pointless, because file can print just the encoding on its own…
$ file --brief --mime-encoding "$1"
iso-8859-15
file manual
-b, --brief
Do not prepend filenames to output lines (brief mode).
...
--mime-type, --mime-encoding
Like -i, but print only the specified element(s).
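Putting that together, the whole script collapses to something like this (a sketch; iconv cannot safely write to its own input file, so the result goes through a temporary file):
#!/usr/bin/env bash
encoding=$(file --brief --mime-encoding "$1")
case $encoding in
iso-8859-*) iconv -f "$encoding" -t UTF-8 -o "$1.tmp" "$1" && mv "$1.tmp" "$1" ;;
*) echo "Text file not encoded in ISO-8859" ;;
esac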

"filename too long" bash mv command old files

#! /bin/sh -
cd /PHOTAN || exit
fn=$(ls -t | tail -n -30)
mv -f -- "${fn}" /old
all I want to do is keep the most recent 30 files... but I can't get past the mv
"File name too long" problem
please help
The notation "${fn}" passes all the file names as a single argument string, with newlines embedded in it. Just for once, assuming you don't have to worry about file names with spaces in them, you need:
mv -f -- ${fn} /old
If you have file names with spaces in them, then you've got problems starting with parsing the output of the ls command.
But what if you do have to worry about spaces in your filenames?
Then, as I stated, you have major problems, starting with the issues of parsing the output of ls.
$ echo > 'a b'
$ echo > ' c d '
$
Two nice file names with spaces in them. They cause merry hell. I'm about to assume you're on Linux or something similar enough. You need to use bash arrays, the stat command, printf, sort -z, sed -z. Or you should simply outlaw filenames with spaces; it is probably easier.
names=( * )
The array names contains each file name as a separate array element, leading and trailing and embedded blanks all handled correctly.
names=( * )
for file in "${names[@]}"
do printf "%s\0" "$(stat -c '%Y' "$file") $file"
done |
sort -nzr |
sed -nze '1,30s/^[0-9][0-9]* //p' |
tr '\0' '\n'
The for loop evaluates the modification time of each file separately, and combines the modification time, a space, and the file name into a single string followed by a null byte to mark the end of the string. The sort command sorts the 'lines' numerically, assuming the lines are terminated by null bytes because of the -z option, and places the most recent file names first. The sed command prints the first 30 'lines' (file names) only; the tr command replaces null bytes with newlines (but in doing so, loses the identity of file name boundaries).
The code works even with file names containing newlines, but only on systems where sed and sort support the (non-standard) -z option to process null-terminated input 'lines' — that means systems using GNU sed and sort (even BSD sed as found on Mac OS X does not, though the Mac OS X sort is GNU sort and does support -z).
Ugh! The shell was designed for spaces to appear between and not within file names.
As noted by BroSlow in a comment, if you assume 'no newlines in filenames', then the code can be simpler and more nearly portable — but it is still tricky. Note that tail -n +31 skips the 30 newest files, so everything older gets moved:
ls -t |
tail -n +31 |
{
list=()
while IFS='' read -r file
do list+=( "$file" )
done
mv -f -- "${list[@]}" /old
}
The IFS='' is needed so that leading and trailing spaces in filenames are preserved (and tabs, too).
I note in passing that the Korn shell would not require the braces but Bash does.
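In Bash 4.2 and later there is a third option: shopt -s lastpipe makes the final element of a pipeline run in the current shell (in scripts, where job control is off), so the array survives without the braces. A sketch, assuming such a Bash:
#!/bin/bash
shopt -s lastpipe
list=()
ls -t | tail -n +31 | while IFS='' read -r file; do list+=( "$file" ); done
mv -f -- "${list[@]}" /old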

Bash sha1 with hex input

I found this solution to build hash-values:
echo -n wicked | shasum | awk '{print $1}'
But this works only with string input. I don't know how to handle input as hex, for example if I want to build the sha1 value of a sha1 value.
upd: I just found out there is a -b option for shasum, but it produces wrong output. Does it expect bytes with reversed endianness?
upd2: for example: I do the following input:
echo -n 9e38cc8bf3cb7c147302f3e620528002e9dcae82 | shasum -b | awk '{print $1}'
The output is bed846bb1621d915d08eb1df257c2274953b1ad9, but according to the hash calculator the output should be 9d371d148d9c13050057105296c32a1368821717
upd3: the -b option seems not to work at all. There is no difference whether I apply this parameter or not; I get the same result.
upd4: the whole script looks as follows. It doesn't work because the null byte gets removed as I either assign or concatenate.
password="wicked"
scrumble="4d~k|OS7T%YqMkR;pA6("
stage1_hash=$(echo -n $password| shasum | awk '{print $1}')
stage2_hash=$(echo $(echo -n $stage1_hash | xxd -r -p | shasum | awk '{print $1}') | xxd -r -p)
token=$(./xor.sh $(echo -n $scrumble$(echo 9d371d148d9c13050057105296c32a1368821717 | xxd -r -p) | shasum | awk '{print $1}') $stage1_hash)
echo $token
You can use xxd -r -p to convert hexadecimal to binary:
echo -n 9e38cc8bf3cb7c147302f3e620528002e9dcae82 | xxd -r -p | shasum -b | awk '{print $1}'
Note that the output of this is 9d371d148d9c13050057105296c32a1368821717, which matches the value from the hash calculator; your earlier bed846… result came from hashing the 40 ASCII characters of the hex string rather than the 20 raw bytes they represent.
UPDATE: I'm not sure exactly what the entire script is supposed to do, but I can point out several problems with it:
Shell variables, command arguments, and C strings in general cannot contain null bytes. There are also situations where trailing linefeeds get trimmed, and IIRC some early versions of bash couldn't handle delete characters (hex 7F)... Basically, don't try to store binary data (as in stage2_hash) or pass it as arguments (as in ./xor.sh) in the shell. Pipes, on the other hand, can pass raw binary just fine. So store it in hex, then convert to binary with xxd -r -p and pipe it directly to its destination.
When you expand a shell variable ($password) or use a command substitution ($(somecommand)) without wrapping it in double-quotes, the shell does some additional parsing on it (things like turning spaces into word breaks, expanding wildcards to lists of matching filenames, etc). This is almost never what you want, so always wrap things like variable references in double-quotes.
Don't use echo for anything nontrivial and expect it to behave consistently. Depending on which version of echo you have and/or what the password is, echo -n "$password" might print the password without a linefeed after it, or might print it with "-n " before it and a linefeed after, might do something with any backslash sequences in the password, or (if the password starts with "-") interpret the password itself as more options to the echo command. Use printf "%s" "$password" instead.
Don't use echo $(somecommand) (or even printf "%s" "$(somecommand)"). The echo and $() are mostly canceling each other here, but creating opportunities for problems in between. Just use the command directly.
Clean those up, and if it doesn't work after the cleanup try posting a separate question.
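Applying those points, the first two stages might look like this (a sketch; digests stay in hex inside variables, and the xor.sh step is left out):
password="wicked"
stage1_hex=$(printf '%s' "$password" | shasum | awk '{print $1}')
stage2_hex=$(printf '%s' "$stage1_hex" | xxd -r -p | shasum | awk '{print $1}')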
The openssl command may help you; see HMAC-SHA1 in bash.
like:
echo -n wicked | openssl dgst -sha1
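It reads raw bytes on stdin just like shasum, so the same xxd trick applies (the -r flag makes openssl print coreutils-style output):
printf '%s' 9e38cc8bf3cb7c147302f3e620528002e9dcae82 | xxd -r -p | openssl dgst -sha1 -r | awk '{print $1}'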

Bash variables not acting as expected

I have a bash script which parses a file line by line, extracts the date using a cut command and then makes a folder using that date. However, it seems like my variables are not being populated properly. Do I have a syntax issue? Any help or direction to external resources is very appreciated.
#!/bin/bash
ls | grep .mp3 | cut -d '.' -f 1 > filestobemoved
cat filestobemoved | while read line
do
varYear= $line | cut -d '_' -f 3
varMonth= $line | cut -d '_' -f 4
varDay= $line | cut -d '_' -f 5
echo $varMonth
mkdir $varMonth'_'$varDay'_'$varYear
cp ./$line'.mp3' ./$varMonth'_'$varDay'_'$varYear/$line'.mp3'
done
You have many errors and non-recommended practices in your code. Try the following:
for f in *.mp3; do
f=${f%%.*}
IFS=_ read _ _ varYear varMonth varDay <<< "$f"
echo $varMonth
mkdir -p "${varMonth}_${varDay}_${varYear}"
cp "$f.mp3" "${varMonth}_${varDay}_${varYear}/$f.mp3"
done
The actual error is that you need to use command substitution. For example, instead of
varYear= $line | cut -d '_' -f 3
you need to use
varYear=$(cut -d '_' -f 3 <<< "$line")
A secondary error there is that $foo | some_command on its own line does not mean that the contents of $foo gets piped to the next command as input, but is rather executed as a command, and the output of the command is passed to the next one.
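You can see that failure mode directly (the file name here is illustrative): the assignment gets an empty value, and the shell then tries to run the contents of $line as a command:
$ line=show_ep_2019_05_17
$ varYear= $line | cut -d '_' -f 3
bash: show_ep_2019_05_17: command not found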
Some best practices and tips to take into account:
Use a portable shebang line - #!/usr/bin/env bash (disclaimer: That's my answer).
Don't parse ls output.
Avoid useless uses of cat.
Use More Quotes™
Don't use files for temporary storage if you can use pipes. It is literally orders of magnitude faster, and generally makes for simpler code if you want to do it properly.
If you have to use files for temporary storage, put them in the directory created by mktemp -d. Preferably add a trap to remove the temporary directory cleanly.
There's no need for a var prefix in variables.
grep searches for basic regular expressions by default, so .mp3 matches any single character followed by the literal string mp3. If you want to search for a dot, you need to either use grep -F to search for literal strings or escape the regular expression as \.mp3.
You generally want to use read -r (defined by POSIX) to treat backslashes in the input literally.
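For example, without -r the shell quietly eats backslashes from the input:
$ printf '%s\n' 'one\two' | { read var; echo "$var"; }
onetwo
$ printf '%s\n' 'one\two' | { read -r var; echo "$var"; }
one\two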
