Optimising my script code for GNU parallel - bash

I have a script which successfully queries an API, but it is very slow. It will take around 16 hours to get all the resources. I looked at how I could optimise it, and I thought that using GNU parallel (installed on macOS via Brew, version 20180522) would do the trick. But even with 90 jobs (the API endpoint authorizes 100 connections max), my script is not faster, and I'm not sure why.
I call my script like so:
bash script.sh | parallel -j90
The script is the following:
#!/bin/bash
# This script downloads the list of French MPs who contributed to a specific amendment.
# The script is initialised with a file containing a list of API URLs, each pointing to a resource describing an amendment
# The main function loops over 3 actions:
# 1. assign to $sign the API url that points to the list of amendment authors
# 2. run the functions auteur and cosignataires and save them in their respective variables
# 3. merge the variable contents and append them as a new line into a csv file
main(){
    local file="${1}"
    local line
    local sign
    local auteur_clean
    local cosign_clean
    while read line
    do
        sign="${line}/signataires"
        auteur_clean=$(auteur $sign)
        cosign_clean=$(cosignataires $sign)
        echo "${auteur_clean}","${cosign_clean}" >> signataires_15.csv
    done < "${file}"
}
# The auteur function takes the $sign variable as an input and
# 1. filters the json returned by the API to get only the author's ID
# 2. use the ID stored in $auteur to query the full author resource and capture the key info, which is then assigned to $auteur_nom
# 3. echo a cleaned version of the info stored in $auteur_nom
auteur(){
    local url="${1}"
    local auteur
    local auteur_nom
    auteur=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="auteur") | .id') \
        && auteur_nom=$(curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" \
            | jq -r --arg url "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" '$url, .amendement.id, .acteur.id, (.acteur.prenom + " " + .acteur.nom)') \
        && echo "${auteur_nom}" | tr '\n' ',' | sed 's/,$//'
}
# The cosignataires function takes the $sign variable as an input and
# 1. filters the json returned by the API to produce a space-separated list of co-authors
# 2. iterates over the list of co-authors to get their names and surnames, and assigns the resulting list to $cosign_nom
# 3. echo a semi-colon separated list of the co-author names
cosignataires(){
    local url="${1}"
    local cosign
    local cosign_nom
    local i
    cosign=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="cosignataire") | .id' | tr '\n' ' ') \
        && cosign_nom=$(for i in ${cosign}; do curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${i}" | jq -r '(.acteur.prenom + " " + .acteur.nom)'; done) \
        && echo "${cosign_nom}" | tr '\n' ';' | sed 's/;$//'
}
main "url_amendements_15.txt"
and the content of url_amendements_15.txt looks like so:
https://www.parlapi.fr/rest/an/amendements/AMANR5L15SEA717460BTC0174P0D1N7
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N90
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N134
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N187
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N161

Your script loops over a list of URLs and queries them sequentially. You need to break it up so each API query is done separately, that way parallel will have commands it can execute in parallel.
Change the script so it takes a single URL as an argument, and get rid of the while loop in main:
main() {
    local url=$1
    local sign
    local auteur_clean
    local cosign_clean

    sign=$url/signataires
    auteur_clean=$(auteur "$sign")
    cosign_clean=$(cosignataires "$sign")
    echo "$auteur_clean,$cosign_clean" >> signataires_15.csv
}
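For this to work, the last line of the script presumably needs to change as well, so the URL handed over by parallel actually reaches main (a small assumption that the rest of the script stays as it is):
# was: main "url_amendements_15.txt"
main "$1"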
Then pass url_amendements_15.txt to parallel, giving it the list of URLs that can be processed in parallel:
parallel -j90 script.sh < url_amendements_15.txt
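Note that parallel has to be able to find and run script.sh. A minimal sketch of the invocation, assuming the shebang is fixed to #!/bin/bash and the script sits in the current directory:
chmod +x script.sh
parallel -j90 ./script.sh < url_amendements_15.txt
# or, without marking it executable:
parallel -j90 bash script.sh < url_amendements_15.txt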

Related

Bash script to add double quotes in .CSV comma delimited file

I need to add double quotes to the csv file. My sample data is like this:
378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29 06:10:01",200890899011202395,0899,0279395,356058102052972,89148000005117597971,67756296
I have tried some code available online with awk and sed, but the results have errors: the first digit of the number is being trimmed (for example, '378478' is displayed as only '78478'), and it is also adding double quotes around already existing double quotes! Nothing seems to be working perfectly. Please guide me!
"78478","COMPLETED","Tracfone","","",""2020/03/29 09:39:22"","","2787","","356074101197544","89148000005748235454","75176540"
"78328","COMPLETED",""Total Wireless"",""Unlimited Talk"," Text"," & Data (First 25GB High Speed"," then unlimited 2GB)"","50",""2020/03/29 06:10:01"","200890899011202395","0899","0279395","356058102052972","89148000005117597971","67756296"
"78329","COMPLETED",""Cricket Wireless"",""Unlimited Talk"," Text"," & 4G LTE Data w/ 15GB Hotspot"","60",""2020/03/29""
This is the code I am using:
awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' file # or
sed -E 's/([^,]*) , (.*)/"\1" , "\2"/' file
My full code is below. My intention was to first convert all .xlsx files to .csv, then add double quotes to the same csv and save it in the same file. I know the $file.csv part is wrong, hence I need some help:
find "$Src_Dir" -type f -iname "*.xlsx" -print>path/temp
cat path/temp | while IFS="" read -r -d $'\0' file;
do
echo $file
ssconvert "${file}" --export-type=Gnumeric_stf:stf_csv
awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' $file > $file.csv
done
If you want to handle anything other than the simplest CSV files, you should probably move away from sed and awk. There are much better tools available.
For example, if you sudo apt install csvtool (or equivalent) on your favourite distro, you can use its call-per-line functionality to process each line in the input file. See the following script for an example:
#!/bin/bash

function quotify {
    # Start empty line, process every field.
    line=""
    while [[ $# -ne 0 ]] ; do
        # Append comma for all but first field, then quoted field.
        [[ -n "${line}" ]] && line="${line},"
        line="${line}\"$1\""
        shift
    done

    # Output the fully quoted line.
    echo "${line}"
}

# Needed to call functions. Also, ensure link: /bin/sh -> /bin/bash.
export -f quotify

# Pretty-print input and output.
echo "Input file:"
sed 's/^/ /' inputFile.csv

echo "Output file:"
csvtool call quotify inputFile.csv | sed 's/^/ /'
Note the quotify function which is called for each line in the CSV file, with the arguments set to each field within that line (sans quotes, whether the original fields had quotes or not).
It basically constructs a string of all the fields in the line, with quotes around them, then writes that to standard output, as shown below in the output from that script:
Input file:
378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29"
Output file:
"378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
"378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29"
Even though using a separate tool is probably the easiest way to go, if you absolutely cannot install other packages, then you're going to have to code up something in a package you already have. The following bash script is a good place to start, as it uses no other tools to achieve its goal.
At the moment, it's tied to a very specific set of rules, as follows:
White space matters. Anything between the commas is considered part of the field. This especially matters when detecting a quoted field: it must have the quote as the first character, with no abc, "d,e,f",ghi stuff, since the "d,e,f" won't be handled correctly.
Quoted fields are allowed to contain commas, and "" sequences within them are turned into ".
It's probably not a good idea to supply ill-formatted CSV files :-)
But, with that in mind, here we go. I'll offer a brief textual description of each section but hopefully the comments in the code will be enough to figure out what's going on.
First, a function for finding the position of some string within another string, useful for working out the field bounds:
function findPos {
    haystack="$1"
    needle="$2"

    # Remove everything past the needle.
    prefix="${haystack%%${needle}*}"

    # If nothing was removed, it wasn't found, so supply massive number.
    # Otherwise, it was found at the length of the string with removed stuff.
    position=999999
    [[ ${#prefix} -ne ${#haystack} ]] && position=${#prefix}
    echo ${position}
}
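For instance (a quick hedged check of the behaviour, not part of the original script), with the function sourced:
findPos "abc,def" ","      # prints 3, the zero-based position of the first comma
findPos "abcdef" ","       # prints 999999, i.e. not found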
Then we can use that in the function that works out the length of the next field. This basically just looks for the next comma for unquoted fields, and does special handling for quoted fields by building up the field from segments (it has to handle quotes within quotes and commas):
function getNextFieldLen {
    line="$1"

    # Empty line means all work done.
    [[ -z "${line}" ]] && echo -1 && return

    # Handle unquoted first, this is easy.
    [[ "${line:0:1}" != '"' ]] && { echo $(findPos "${line}" ","); return; }

    # Now handle quoted. Loop over all segments where a segment is defined as
    # the text up to the next <"">, assuming it's before the next <",>.
    field=""
    nextQuoteComma=$(findPos "${line}" '",')
    nextDoubleQuote=$(findPos "${line}" '""')
    while [[ ${nextDoubleQuote} -lt ${nextQuoteComma} ]]; do
        # Append segment to the field and go back for next segment.
        field="${field}${line:0:${nextDoubleQuote}}\"\""
        line="${line:${nextDoubleQuote}}"
        line="${line:2}"
        nextQuoteComma=$(findPos "${line}" '",')
        nextDoubleQuote=$(findPos "${line}" '""')
    done

    # Add final segment (up to the comma) and output entire field.
    field="${field}${line:0:${nextQuoteComma}}\""
    echo "${#field}"
}
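As a rough illustration (again a hedged check rather than part of the script, and remembering that the caller below appends a trailing comma to each line before asking for field lengths):
getNextFieldLen 'abc,def,'          # prints 3, the length of the unquoted field abc
getNextFieldLen '"a,""b""",c,'      # prints 9, the length of the quoted field "a,""b"""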
Finally, there's the top-level function which will quotify whatever comes in via standard input:
function quotifyStdIn {
    # Process file line by line.
    while read -r line; do
        # Start with empty output line and non-comma separator.
        outLine="" ; sep=""

        # Place terminator to make processing easier, start field loop.
        line="${line},"
        fieldLen=$(getNextFieldLen "${line}")
        while [[ ${fieldLen} -ge 0 ]]; do
            # Get field and quotify if needed, adjust line (remove field and comma).
            field="${line:0:${fieldLen}}"
            [[ "${field:0:1}" = '"' ]] || field="\"${field}\""
            line="${line:$((fieldLen+1))}"
            #line="${line:${fieldLen}}"
            #line="${line:1}"

            # Append to output line and prepare for next field.
            outLine="${outLine}${sep}${field}"; sep=","
            fieldLen=$(getNextFieldLen "${line}")
        done

        # Output built line.
        echo "${outLine}"
    done
}
And, on the off-chance you want to read directly from a file (though providing a file name that's empty or "-" will use standard input so you can probably just use the file-based function for everything):
function quotifyFile {
    file="$1"

    # Empty file or "-" means standard input, otherwise take input from real file.
    [[ ${#file} -eq 0 ]] && { quotifyStdIn; return; }
    [[ "${file}" = "-" ]] && { quotifyStdIn; return; }
    quotifyStdIn < "${file}"
}
And, finally, because every program that's not a "Hello, world" one deserves some form of test harness, this is what you can use to test the various capabilities:
(
    echo 'paxdiablo,was here'
    echo 'and,"then, strangely,",he,was,not'
    echo '50,"My name is ""Pax"", and yours is ""Bob""",42'
    echo '17,"""Love"" is grand",19'
) > harness.csv

echo "Before:"
sed "s/^/ /" harness.csv

echo "After:"
quotifyFile harness.csv | sed "s/^/ /"

rm -rf harness.csv
And, since a test harness is of little use unless you run the tests, here's the results of the first run:
Before:
paxdiablo,was here
and,"then, strangely,",he,was,not
50,"My name is ""Pax"", and yours is ""Bob""",42
17,"""Love"" is grand",19
After:
"paxdiablo","was here"
"and","then, strangely,","he","was","not"
"50","My name is ""Pax"", and yours is ""Bob""","42"
"17","""Love"" is grand","19"
Hopefully, that will be enough to get you going in the absence of being able to install packages. Of course, if one of the packages you can't install is bash itself, then you have problems that I can't help you with :-)
Your starting CSV is not a good CSV: the 2 rows have a different number of columns.
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 378478 | COMPLETED | Tracfone | - | - | 2020/03/29 09:39:22 | - | 2787 | - | 356074101197544 | 89148000005748235454 | 75176540 |
| 378328 | COMPLETED | Total Wireless | Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB) | 50 | 2020/03/29 | - | - | - | - | - | - |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
Using Miller (https://github.com/johnkerl/miller) you could run
mlr --csv --quote-all -N unsparsify input >output
to have
"378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
"378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29","","","","","",""
You can use it by downloading the executable from https://github.com/johnkerl/miller/releases/tag/v5.7.0

How to properly remove new lines from JSON parsed data in bash

I want to parse JSON data using jq (as described here) and delete any newline characters from the resulting string.
I've already tried to use tr, but this approach also removes all the white space between the parsed values.
My code:
IP=$(curl -s https://ipinfo.io/ip) # Get ip address
curl -s https://ipinfo.io/${IP}/geo | jq -r '.ip, .city, .country' | tr -d '\n' # parse only a few values from the JSON data and remove newlines.
What I get with the code above is the following string:
XXX.XXX.XXX.XXXCity_NameCountry_Name
but I want something like this:
XXX.XXX.XXX.XXX City_Name Country_Name
You could craft a single string from the three pieces of data so that it appears on a single line (a single result per input):
IP=$(curl -s https://ipinfo.io/ip)
curl -s https://ipinfo.io/${IP}/geo | jq -r '.ip + " " + .city + " " + .country'
> myIp myCity myCountryCode
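Another way to get the same single-line string (a hedged alternative sketch using jq's join, which concatenates the array elements with the given separator):
IP=$(curl -s https://ipinfo.io/ip)
curl -s https://ipinfo.io/${IP}/geo | jq -r '[.ip, .city, .country] | join(" ")'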
Another option for a different but similar output format would be to use the @csv output format, where you would want to output an array of cells for each input:
IP=$(curl -s https://ipinfo.io/ip)
curl -s https://ipinfo.io/${IP}/geo | jq -r '[.ip, .city, .country] | @csv'
> "myIp","myCity","myCountryCode"
This result could be easily worked on from any spreadsheet software.

Shell Script not appending, instead it's erasing contents

My goal is to curl my newly created API with a list of usernames from a .txt file, receive the API response, save it to a .json file, and then create a .csv at the end (to make it easier to read).
This is my script:
echo "$result" | jq 'del(.id, .Time, .username)' | jq '{andres:[.[]]}' > newresult
Input: sudo bash script.sh usernames.txt
Usernames.txt:
test1
test2
test3
test4
Result:
"id","username"
4,"test4"
Desired Result:
"id","username"
1,"test1"
2,"test2"
3,"test3"
4,"test4"
It creates the files as required, and even saves the result. However, it only outputs one result. I can open the CSV/JSON as it's running and see it querying the different usernames, but when it starts another query, rather than appending it all to the same file, it deletes newresult, result.json and results.csv and creates new ones, meaning in the end I only end up with the result for one username rather than a list of, say, five. Can someone tell me what I'm doing wrong?
Thanks!
Use >> to append to file. Try:
: >results.csv
for ligne in $(seq 1 "$nbrlignes");
do
    ...
    jq -r '
        ["id", "username"] as $headers
        | $headers, (.andres[][] | [.[$headers[]]]) | @csv
    ' < result.json >> results.csv
done
By using > you overwrite the file each time the loop runs.
Also your script looks like it should be largely rewritten and simplified.
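A minimal illustration of the difference (hypothetical file and values, purely to show the behaviour):
echo "1,test1" > results.csv    # > truncates results.csv before writing this line
echo "2,test2" >> results.csv   # >> appends, so the file now contains both lines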

Bash array size not reflecting actual size when used with local builtin command

I have a log file ala.txt looking like this:
dummy FromEndPoint = PW | dummy | ToEndPoint = LALA | dummy
dummy FromEndPoint = PW | dummy | ToEndPoint = PAPA | dummy
dummy FromEndPoint = WF | dummy | ToEndPoint = LALA | dummy
dummy FromEndPoint = WF | dummy | ToEndPoint = KAKA | dummy
I used sed to generate an array containing every combination of FromEndPoint and ToEndPoint. Then I want to iterate through it.
function main {
    file="./ala.txt"
    local a=`sed 's/^.*FromEndPoint = \([a-zA-Z\-]*\).*ToEndPoint = \([a-zA-Z\-]*\).*$/\1;\2/' $file | sort -u`
    echo ${#a[@]} # prints 1
    for connectivity in ${a[@]}; do
        echo "conn: $connectivity" # iterates 4 times
        #conn: PW;LALA
        #conn: PW;PAPA
        #conn: WF;KAKA
        #conn: WF;LALA
    done
}
Why does echo ${#a[@]} print 1 if there are 4 elements in the array? How can I get its real size?
Bash used: 4.4.12(1)-release
Don't use variables to store multi-line content, use arrays!
In a bash shell you could use the process substitution feature to make the command output appear as file content for you to parse into an array. In bash versions 4 and above, the builtins readarray and mapfile can read the multi-line output, given a delimiter, without needing an explicit read loop.
#!/usr/bin/env bash

array=()
while read -r unique; do
    array+=( "$unique" )
done < <(sed 's/^.*FromEndPoint = \([a-zA-Z\-]*\).*ToEndPoint = \([a-zA-Z\-]*\).*$/\1;\2/' file | sort -u)
Now printing the length of the array as echo "${#array[@]}" should print the length as expected.
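Since readarray/mapfile were mentioned, here is a minimal sketch of the same thing without the explicit read loop (bash 4+ assumed, same sed command and input file):
mapfile -t array < <(sed 's/^.*FromEndPoint = \([a-zA-Z\-]*\).*ToEndPoint = \([a-zA-Z\-]*\).*$/\1;\2/' file | sort -u)
echo "${#array[@]}"   # 4, one element per unique FromEndPoint;ToEndPoint pair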
And always separate the declaration of local variables from the assignment of values to them. See: Why does local sweep the return code of a command?
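A short sketch of that pattern inside a function (some_command is a hypothetical placeholder):
local a               # declare first; writing local a=$(some_command) would mask the command's exit status
a=$(some_command)     # assign separately so $? still reflects some_command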

How do I print multiple variables on separate lines into a file using shell scripting

I need to extract strings from one CSV file into a properties file using shell.
project.csv - this is the file which contains data like below, and it may contain any number of lines:
PN549,projects.pn549
SaturnTV_SW,projects.saturntv_sw
I need to collect each string, "pn549" and "saturntv_sw", into a properties file:
properties
[projects]
pn549_pt=pn549
saturntv_sw_pt=saturntv_sw
Below is the code I wrote to fetch the string and print it:
cat "project.csv" | while IFS='' read -r line; do
Display_Name="$(echo "$line" | cut -d ',' -f 1 | tr -d '"')"
project_name="$(echo "$TEMP_Name" | cut -d '.' -f 2)"
echo "$project_name"
echo "$project_name"_pt="$project_name" > /opt/properties
How do I print multiple lines like in the example above (properties)?
I have got my answer; I simply redirected the output.
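A minimal sketch of what that can look like (it assumes project.csv as shown above and keeps the /opt/properties path from the question; the whole block is redirected once so the lines end up together in the file):
{
    echo "[projects]"
    while IFS=',' read -r display_name resource; do
        project_name="${resource#projects.}"             # projects.pn549 -> pn549
        printf '%s_pt=%s\n' "$project_name" "$project_name"
    done < project.csv
} > /opt/properties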
