How to convert a complex bash command to python sh - bash

I am trying to convert the following bash command into a format that can be run by python's sh package.
cat "Some_File.txt" | tr [:space:] '\n' | grep -v "^\s*$" | sort | uniq -c | sort -bnr
I have very little experience with sh and some experience with python. I understand how to use sh with simple bash commands but I am having trouble figuring out how to convert this complex command into sh format due in part to the piping. Any ideas?

I think you'd be better off using the shell.
from sh import sort, grep, tr, uniq
sort("-bnr",
uniq("-c",
sort(grep("-v",
"^\s*$",
tr("[:space:]",
"\n",
"Some_File.txt")))))

Related

How to properly pass filenames with spaces with $* from xargs to sed via sh?

Disclaimer: this happens on macOS (Big Sur); more info about the context below.
I have to write (almost did) a script which will replace images URLs in big text (xml) files by their Base64-encoded value.
The script should run the same way with single filenames or patterns, or both, e.g.:
./replace-encode single.xml
./replace-encode pattern*.xml
./replace-encode single.xml pattern*.xml
./replace-encode folder/*.xml
Note: it should properly handle files\ with\ spaces.xml
So I ended up with this script:
#!/bin/bash
#needed for `ls` command
IFS=$'\n'
ls -1 $* | xargs -I % sed -nr 's/.*>(https?:\/\/[^<]+)<.*/\1/p' % | xargs -tI % sh -c 'sed -i "" "s#%#`curl -s % | base64`#" $0' "$*"
What it does: ls all files, pipe the list to xargs then search all URLs surrounded by anchors (hence the > and < in the search expr. - also had to use sed because grep is limited on macOS), then pipe again to a sh script which runs the sed search & replace, where the remplacement is the big Base64 string.
This works perfectly fine... but only for fileswithoutspaces.xml
I tried to play with $0 vs $1, $* vs $#, w/ or w/o " but to no avail.
I don't understand exactly how does the variable substitution (is it how it's called? - not a native English speaker, and above all, not a script-writer at all!!! just a Java dev. all day long...) work between xargs, sh or even bash with arguments like filenames.
The xargs -t is here to let me check out how the substitution works, and that's how I noticed that using a pattern worked but I have to let the " around the last $*, otherwise only the 1st file is searched & replaced; output is like:
user#host % ./replace-encode pattern*.xml
sh -c sed -i "" "s#https://www.some.com/public/123456.jpg#`curl -s https://www.some.com/public/123456.jpg | base64`#" $0 pattern_123.xml
pattern_456.xml
Both pattern_123.xml and pattern_456.xml are handled here; w/ $* instead of "$*" in the end of the command, only pattern_123.xml is handled.
So is there a simple way to "fix" this?
Thank you.
Note: macOS commands have some limitations (I know) but as this script is intended to non-technical users, I can't ask them to install (or have the IT team installed on their behalf) some alternate GNU-versions installed e.g. pcregrep or 'ggrep' like I've read many times...
Also: I don't intend to change from xargs to for loops or so because, 1/ don't have the time, 2/ might want to optimize the 2nd step where some URLs might be duplicate or so.
There's no reason for your software to use ls or xargs, and certainly not $*.
./replace-encode single.xml
./replace-encode pattern*.xml
./replace-encode single.xml pattern*.xml
./replace-encode folder/*.xml
...will all work fine with:
#!/usr/bin/env bash
while IFS= read -r line; do
replacement=$(curl -s "$line" | base64)
in="$line" out="$replacement" perl -pi -e 's/\Q$ENV{"in"}/$ENV{"out"}/g' "$#"
done < <(sed -nr 's/.*>(https?:\/\/[^<]+)<.*/\1/p' "$#" | sort | uniq)
Finally ended up with this single-line script:
sed -nr 's/.*>(https?:\/\/[^<]+)<.*/\1/p' "$#" | xargs -I% sh -c 'sed -i "" "s#%#`curl -s % | base64`#" "$#"' _ "$#"
which does properly support filenames with or without spaces.

Asterisk in bash variable

I've a file that contains info that I'm retrieving such way
Command
cat 2018_02_15_09_01_08_result.tsv | grep -o [A-Z]\\*[0-9]*:[0-9]* | sort | uniq | sed -e 's/^/HLA-/' |tr '\n' ',' | sed '$ s/.$//'
Output
HLA-A*30:02,HLA-B*18:01,HLA-C*05:01
But I'm trying to save this in variable, the asterisk and a letter disappears, I've tried several ways, adding/removing commas etc and I'm yet not able to print it properly.
hla=`cat 2018_02_15_09_01_08_result.tsv | grep -o [A-Z]\\*[0-9]*:[0-9]* | sort | uniq | sed -e 's/^/HLA-/' |tr '\n' ',' | sed '$ s/.$//'`
echo $hla
HLA-05:01,HLA-18:01,HLA-30:02
echo "$hla"
HLA-05:01,HLA-18:01,HLA-30:02
There are multiple errors here, most of which will be aptly diagnosed by http://shellcheck.net/ without any human intervention.
You really should single-quote your regular expressions unless you specifically require the shell to perform wildcard expansion and whitespace tokenization on the regex before executing the command.
The obsolescent `command` in backticks introduces some unfortunate additional shell handling on the string inside the backticks. The solution since the 1990s is to prefer the $(command) syntax for command substitution, which does not exhibit this problem.
The cat is useless; grep knows full well how to read a file.
Try this refactored code:
hla=$(grep -o '[A-Z]*[0-9]*:[0-9]*' 2018_02_15_09_01_08_result.tsv |
sort -u | sed -e 's/^/HLA-/' |tr '\n' ',' | sed '$ s/.$//')
echo "$hla"
The double quotes around the variable interpolation in the echo are necessary and useful; notice also the line wraps for legibility and the use of sort -u in preference over sort | uniq (and generally try to reduce the number of processes -- once I understand what the sed | tr | sed does I can probably propose a simplification for that, too). Perhaps the simplest fix would be to refactor all of this into a single Awk script, but without access to the input, it's hard to tell you in more detail what that might look like.
(Also, are you really sure you need to capture the value to a variable? Often variable=value; echo "$variable" is just an obscure and inefficient way to say echo "value". And variable=$(command); echo "$variable" is better written simply command and capturing the command's standard output just so you can print it to standard output is a pure waste of cycles, unless you are planning to do something more with that variable's value.)
I've solved it by saving the output of the command with a redirection:
cat 2018_02_15_09_01_08_result.tsv |
grep -o [A-Z]\\*[0-9]*:[0-9]* |
sort | uniq |
sed -e 's/^/HLA-/' |tr '\n' ',' | sed '$ s/.$//' > out_file
hla=`cat out_file`
echo $hla
which gets me the expected HLA-A*30:02,HLA-B*18:01,HLA-C*05:01. Not the ideal solution, but it works.

how to read a value from filename and insert/replace it in the file?

I have to run many python script which differ just with one parameter. I name them as runv1.py, runv2.py, runv20.py. I have the original script, say runv1.py. Then I make all copies that I need by
cat runv1.py | tee runv{2..20..1}.py
So I have runv1.py,.., runv20.py. But still the parameter v=1 in all of them.
Q: how can I also replace v parameter to read it from the file name? so e.g in runv4.py then v=4. I would like to know if there is any one-line shell command or combination of commands. Thank you!
PS: direct editing each file is not a proper solution when there are too many files.
Below for loop will serve your purpose I think
for i in `ls | grep "runv[0-9][0-9]*.py"`
do
l=`echo $i | tr -d [a-z.]`
sed -i 's/v/'"$l"'/g' runv$l.py
done
Below command was to pass the parameter to script extracted from the filename itself
ls | grep "runv[0-9][0-9]*.py" | tr -d [a-z.] | awk '{print "./runv"$0".py "$0}' | xargs sh
in the end instead of sh you can use python or bash or ksh.

Executing grep from Ruby

I want to run the following grep from within a Ruby script:
grep "word1" file1.json | grep -c "word2"
This will find the number of lines in the file with both words appearing. I could do this with Ruby regex, but it seems that Unix grep is much faster. So my question is how do I run this command within a script and return the result back to a Ruby variable?
I'd love to hear alternative solutions that don't use grep, if they are as fast or even faster.
Use backticks to execute shell commands:
result = `grep "word1" file1.json | grep -c "word2"`

bash shell script for mac to generate word list from a file?

Is there a shell script that runs on a mac to generate a word list from a text file, listing the unique words? Even better if it could sort by frequency....
sorry forgot to mention, yeah i prefer a bash one as i'm using mac now...
oh, my file is in french... (basically i'm reading a novel and learning french, so i try to generate a word list help myself). hope this is not a problem?
If I understood you correctly, you need something like that:
cat <filename> | sed -e 's/ /\n/g' | sort | uniq -c
This command will do
cat file.txt | tr "\"' " '\n' | sort -u
Here sort -u will not work on Macintosh machines. In that case use sort | uniq -c instead. (Thanks to Hank Gay)
cat file.txt | tr "\"' " '\n' | sort | uniq -c
Just answer my question to dot down the final version i'm using:
tr -cs "[:alpha:]" "\n" < FileIn.txt | sort | uniq -c | awk '{print $2","$1}' >> FileOut.csv
some notes:
tr can be used directly to do replacement.
since i'm interested creating a word list for my french vocabulary, i used [:alpha:]
awk is used to insert a comma, so that the output is a csv file, easier for me to upload...
thanks again for everyone helping me.
sorry i didn't put it clearly at the beginning that i'm using a mac and expect a bash script.
cheers.

Resources