Shell Script not appending, instead it's erasing contents - bash

My goal is to curl my newly created API with a list of usernames from a .txt file, receive the API response, save it to a .json file, and finally create a .csv (to make it easier to read).
This is the relevant part of my script:
echo "$result" | jq 'del(.id, .Time, .username)' | jq '{andres:[.[]]}' > newresult
Input: sudo bash script.sh usernames.txt
Usernames.txt:
test1
test2
test3
test4
Result:
"id","username"
4,"test4"
Desired Result:
"id","username"
1,"test1"
2,"test2"
3,"test3"
4,"test4"
It creates the files as required, and even saves the result. However, it only outputs one result. I can open the CSV/JSON while it's running and see that it is querying the different usernames, but when it starts another query, rather than appending everything to the same file, it deletes newresult, result.json and results.csv and creates new ones. In the end I only get the result for one username rather than a list of, say, 5. Can someone tell me what I'm doing wrong?
Thanks!

Use >> to append to a file instead. Try:
: >results.csv
for ligne in $(seq 1 "$nbrlignes");
do
    ...
    jq -r '
        ["id", "username"] as $headers
        | $headers, (.andres[][] | [.[$headers[]]]) | @csv
    ' < result.json >> results.csv
done
By using > you truncate and overwrite the file each time the loop runs; >> opens the file in append mode instead.
Also, your script looks like it could be largely rewritten and simplified.
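A minimal illustration of the difference (demo.txt is just a throwaway name):
printf 'one\n' > demo.txt      # demo.txt contains: one
printf 'two\n' > demo.txt      # > truncates first, so demo.txt now contains only: two
printf 'three\n' >> demo.txt   # >> appends, so demo.txt now contains: two, three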


How to sort files in paste command with 500 files csv

My question is similar to How to sort files in paste command?, which has already been solved.
I have 500 csv files (daily rainfall data) in a folder with naming convention chirps_yyyymmdd.csv. Each file has only 1 column (rainfall value) with 100,000 rows, and no header. I want to merge all the csv files into a single csv in chronological order.
When I tried this script ls -v file_*.csv | xargs paste -d, with only 100 csv files, it worked. But when I tried it with 500 csv files, I got this error: paste: chirps_19890911.csv: Too many open files
How can I handle the above error?
As a quick fix, I could split the csv files into two folders and run the above script on each. But the problem is that I have 100 folders, each containing 500 csv files.
Thanks
Sample data and expected result: https://www.dropbox.com/s/ndofxuunc1sm292/data.zip?dl=0
You can do it with gawk like this...
Simply read all the files in, one after the other and save them into an array. The array is indexed by two numbers, firstly the line number in the current file (FNR) and secondly the column, which I increment each time we encounter a new file in the BEGINFILE block.
Then, at the end, print out the entire array:
gawk 'BEGINFILE{ ++col }                # New file, increment column number
{ X[FNR SUBSEP col]=$0; rows=FNR }      # Save datum into array X, indexed by current record number and col
END { for(r=1;r<=rows;r++){
          comma=","
          for(c=1;c<=col;c++){
              if(c==col)comma=""
              printf("%s%s",X[r SUBSEP c],comma)
          }
          printf("\n")
      }
}' chirps*
SUBSEP is awk's built-in subscript separator, an unlikely character that keeps the two indices distinct so that, for example, row 11 column 1 cannot collide with row 1 column 11. I am using gawk because BEGINFILE is useful for incrementing the column number.
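As a quick illustration of how SUBSEP behaves (this is standard awk, not gawk-specific), an index written as X[r,c] is stored as X[r SUBSEP c]:
echo | awk '{ X[1,2]="hello"; print X[1 SUBSEP 2] }'    # prints: hello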
Save the gawk command above in your HOME directory as merge. Then start a Terminal and, just once, make it executable with the command:
chmod +x merge
Now change directory to where your chirps are with a command like:
cd subdirectory/where/chirps/are
Now you can run the script with:
$HOME/merge
The output will rush past on the screen. If you want it in a file, use:
$HOME/merge > merged.csv
First make one file without pasting, and then change that file into a one-liner with tr:
cat */chirps_*.csv | tr "\n" "," > long.csv
If the goal is a file with 100,000 lines and 500 columns then something like this should work:
paste -d, chirps_*.csv > chirps_500_merge.csv
Additional code can be used to sort the chirps_... input files into any desired order before pasting.
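Because the names embed the date as yyyymmdd, a plain lexical sort is already chronological; a sketch that makes the ordering explicit (assuming the file names contain no whitespace) could be:
paste -d, $(printf '%s\n' chirps_*.csv | sort) > chirps_500_merge.csv
Note that this still opens all 500 files in a single paste, so it is subject to the same open-file limit discussed in the next answer.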
The error comes from ulimit, from man ulimit:
-n or --file-descriptor-count The maximum number of open file descriptors
On my system ulimit -n returns 1024.
Happily we can paste the paste output, so we can chain it.
find . -type f -name 'file_*.csv' |
sort |
xargs -n$(ulimit -n) sh -c '
    tmp=$(mktemp);
    paste -d, "$@" >"$tmp";
    echo "$tmp"
' -- |
xargs sh -c '
    paste -d, "$@"
    rm "$@"
' --
Don't parse ls output.
Once we have moved from parsing ls output to find, we find all the files and sort them.
The first xargs takes up to 1024 files at a time, creates a temporary file, pastes its input files into that temporary file and prints the temporary file's name.
The second xargs does the same with the temporary files, but also removes all the temporaries.
As the count of files is 100*500 = 50,000, which is smaller than 1024*1024, we can get away with a single extra pass.
Tested against test data generated with:
seq 1 2000 |
xargs -P0 -n1 -t sh -c '
seq 1 1000 |
sed "s/^/ $RANDOM/" \
>"file_$(date --date="-${1}days" +%Y%m%d).csv"
' --
The problem is much like a foldl with a maximum chunk size that can be folded in one pass. Basically we want paste -d, <(paste -d, <(paste -d, <1024 files>) <1023 files>) <rest of files>, run kind-of recursively. With a little fun I came up with the following:
func() {
    paste -d, "$@"
}
files=()
tmpfilecreated=0
# read filenames from stdin
while IFS= read -r line; do
    files+=("$line")
    # if the limit of 1024 files is reached
    if ((${#files[@]} == 1024)); then
        tmp=$(mktemp)
        func "${files[@]}" >"$tmp"
        # remove the last tmp file
        if ((tmpfilecreated)); then
            rm "${files[0]}"
        fi
        tmpfilecreated=1
        # start with a fresh files list
        # containing only the tmp file
        files=("$tmp")
    fi
done
func "${files[@]}"
# remember to clear the tmp file!
if ((tmpfilecreated)); then
    rm "${files[0]}"
fi
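The loop reads file names from standard input, so the script could be fed like this (chunked_paste.sh and chirps_merged.csv are hypothetical names, used only for illustration):
find . -type f -name 'chirps_*.csv' | sort | bash chunked_paste.sh > chirps_merged.csv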
I guess readarray/mapfile could be faster, and result in a bit clearer code:
func() {
    paste -d, "$@"
}
tmp=()
tmpfilecreated=0
while readarray -t -n1023 files && ((${#files[@]})); do
    newtmp=$(mktemp)
    func "${tmp[@]}" "${files[@]}" >"$newtmp"
    # remove the previous tmp file
    if ((tmpfilecreated)); then
        rm "${tmp[0]}"
    fi
    tmpfilecreated=1
    tmp=("$newtmp")
done
# print the accumulated result and clean up
func "${tmp[@]}" "${files[@]}"
if ((tmpfilecreated)); then
    rm "${tmp[0]}"
fi
PS. "I want to merge all the csv files into a single csv in chronological order." Wouldn't that be just cat? Right now each column represents one day.
You can try this Perl one-liner. It will work for any number of files matching *.csv under a directory:
$ ls -1 *csv
file_1.csv
file_2.csv
file_3.csv
$ cat file_1.csv
1
2
3
$ cat file_2.csv
4
5
6
$ cat file_3.csv
7
8
9
$ perl -e ' BEGIN { while($f=glob("*.csv")) { $i=0;open($FH,"<$f"); while(<$FH>){ chomp;@t=@{$kv{$i}}; push(@t,$_);$kv{$i++}=[@t];}} print join(",",@{$kv{$_}})."\n" for(0..$i-1) } '
1,4,7
2,5,8
3,6,9
$

Optimising my script code for GNU parallel

I have a script which successfully queries an API, but it is very slow: it will take around 16 hours to get all the resources. I looked at how I could optimise it, and I thought that using GNU parallel (installed on macOS via Homebrew, version 20180522) would do the trick. But even with 90 jobs (the API endpoint authorises 100 connections max), my script is not faster. I'm not sure why.
I call my script like so:
bash script.sh | parallel -j90
The script is the following:
#!/bin/bash
# This script downloads the list of French MPs who contributed to a specific amendment.
# The script is initialised with a file containing a list of API URLs, each pointing to a resource describing an amendment
# The main function loops over 3 actions:
# 1. assign to $sign the API url that points to the list of amendment authors
# 2. run the functions auteur and cosignataires and save them in their respective variables
# 3. merge the variable contents and append them as a new line into a csv file
main(){
    local file="${1}"
    local line
    local sign
    local auteur_clean
    local cosign_clean
    while read line
    do
        sign="${line}/signataires"
        auteur_clean=$(auteur $sign)
        cosign_clean=$(cosignataires $sign)
        echo "${auteur_clean}","${cosign_clean}" >> signataires_15.csv
    done < "${file}"
}
# The auteur function takes the $sign variable as an input and
# 1. filters the json returned by the API to get only the author's ID
# 2. use the ID stored in $auteur to query the full author resource and capture the key info, which is then assigned to $auteur_nom
# 3. echo a cleaned version of the info stored in $auteur_nom
auteur(){
    local url="${1}"
    local auteur
    local auteur_nom
    auteur=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="auteur") | .id') \
        && auteur_nom=$(curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" \
            | jq -r --arg url "https://www.parlapi.fr/rest/an/acteurs_amendements/${auteur}" '$url, .amendement.id, .acteur.id, (.acteur.prenom + " " + .acteur.nom)') \
        && echo "${auteur_nom}" | tr '\n' ',' | sed 's/,$//'
}
# The cosignataires function takes the $sign variable as an input and
# 1. filter the json returned by the API to produce a space separated list of co-authors
# 2. iterates over list of coauthors to get their name and surname, and assign the resulting list to $cosign_nom
# 3. echo a semi-colon separated list of the co-author names
cosignataires(){
    local url="${1}"
    local cosign
    local cosign_nom
    local i
    cosign=$(curl -s "${url}" | jq '.signataires[] | select(.relation=="cosignataire") | .id' | tr '\n' ' ') \
        && cosign_nom=$(for i in ${cosign}; do curl -s "https://www.parlapi.fr/rest/an/acteurs_amendements/${i}" | jq -r '(.acteur.prenom + " " + .acteur.nom)'; done) \
        && echo "${cosign_nom}" | tr '\n' ';' | sed 's/,$//'
}
main "url_amendements_15.txt"
and the content of url_amendements_15.txt looks like so:
https://www.parlapi.fr/rest/an/amendements/AMANR5L15SEA717460BTC0174P0D1N7
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N90
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N134
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N187
https://www.parlapi.fr/rest/an/amendements/AMANR5L15PO59051B0490P0D1N161
Your script loops over a list of URLs and queries them sequentially. You need to break it up so that each API query is done separately; that way parallel will have commands it can execute in parallel.
Change the script so it takes a single URL, and get rid of the main while loop.
main() {
    local url=$1
    local sign
    local auteur_clean
    local cosign_clean
    sign=$url/signataires
    auteur_clean=$(auteur "$sign")
    cosign_clean=$(cosignataires "$sign")
    echo "$auteur_clean,$cosign_clean" >> signataires_15.csv
}
Then pass url_amendements_15.txt to parallel. Give it the list of URLs that can be processed in parallel.
parallel -j90 script.sh < url_amendements_15.txt
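The script's last line would presumably change as well, so that main receives the URL passed on the command line instead of the file name, e.g.:
main "$1"
With that in place, each invocation of script.sh handles exactly one URL, and parallel runs up to 90 of them concurrently.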

fetching files in a loop as an input to a script

I have a bunch of files and a script which I run on them. The script takes 2 files as input, and all the files follow this format: a.txt1 a.txt2
Now the script I use is called like this: foo.sh a.txt1 a.txt2
I have to run this script on 250 pairs (e.g. a1.txt1 a1.txt2 through a250.txt1 a250.txt2).
I am doing this manually by entering the file names. I was wondering whether there is any way to automate this process. All the pairs are in the same folder; is there any way to loop the process over all pairs?
I hope I made it clear.
Thank you.
To be specific, these are some sample file names:
T39_C.txt2
T39_D.txt1
T39_D.txt2
T40_A.txt1
T40_A.txt2
T40_B.txt1
T40_B.txt2
T40_C.txt1
T40_C.txt2
T40_D.txt1
T40_D.txt2
unmatched.txt1
unmatched.txt2
WT11_A.txt1
WT11_A.txt2
WT11_B.txt1
WT11_B.txt2
WT11_C.txt1
Assuming all files are in pairs (i.e., <something>.txt1 and <something>.txt2), then you can do something like this:
1. #!/bin/bash
2.
3. for txt1 in *.txt1; do
4. txt2="${txt1%1}2"
5. # work on $txt1 and $txt2
6. done
In line 3, we use a shell glob to grab all files ending with .txt1. Line 4, we use a substitution to remove the final 1 and replace it with a 2. And the real work is done in line 5.
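With the foo.sh call from the question filled in at line 5, the loop might look like this (a sketch that assumes foo.sh is on your PATH or in the current directory):
#!/bin/bash
for txt1 in *.txt1; do
    txt2="${txt1%1}2"
    foo.sh "$txt1" "$txt2"
done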
# FOR EACH FILE IN THE CURRENT DIRECTORY EXCEPT FOR THE FILES WITH .txt2
for i in `ls | sort | grep -v .txt2`
do
    # THE FIRST .txt1 file is $i
    first="$i"
    # THE SECOND IS THE SAME EXCEPT WITH .txt2 SO WE REPLACE THE STRING
    second=`echo "$i" | sed 's/.txt1/.txt2/g'`
    # WE MAKE THE ASSUMPTION FOO.SH WILL ERROR OUT IF NOT PASSED TWO PARAMETERS
    if !(bash foo.sh $first $second); then
    {
        echo "Problem running against $first $second"
    }
    else
    {
        echo "Ran against $first $second"
    }
    fi
done

Bash add to end of file (>>) if not duplicate line

Normally I use something like this for processes I run on my servers
./runEvilProcess.sh >> ./evilProcess.log
However, I'm currently using Doxygen and it produces lots of duplicate output.
Example output:
QGDict::hashAsciiKey: Invalid null key
QGDict::hashAsciiKey: Invalid null key
QGDict::hashAsciiKey: Invalid null key
So you end up with a very messy log
Is there a way I can add a line to the log file only if it wasn't the last line added?
A rough example (though I'm not sure how to do this in bash):
$previousLine = ""
$outputLine = getNextLine()
if($previousLine != $outputLine) {
$outputLine >> logfile.log
$previousLine = $outputLine
}
If the process returns duplicate lines in a row, pipe the output of your process through uniq:
$ ./t.sh
one
one
two
two
two
one
one
$ ./t.sh | uniq
one
two
one
If the logs are sent to the standard error stream, you'll need to redirect that too:
$ ./yourprog 2>&1 | uniq >> logfile
(This won't help if the duplicates come from multiple runs of the program - but then you can pipe your log file through uniq when reviewing it.)
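For a log that already contains such runs of duplicates, the same idea works when reading it back, for example:
uniq evilProcess.log | less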
Create a filter script (filter.sh):
while read -r line; do
    if [ "$last" != "$line" ]; then
        echo "$line"
        last=$line
    fi
done
and use it:
./runEvilProcess.sh | sh filter.sh >> evillog

unix ksh retrieve oracle query result

I'm working on a small piece of ksh code for a simple task.
I need to retrieve about 14 million lines from a table and then generate an xml file from this information. I don't do any processing on the information, only some "IF"s.
The problem is that writing the file takes about 30 minutes, and that is not acceptable for me.
This is a piece of the code:
......
query="select field1||','||field2||' from table1"
ctl_data=`sqlplus -L -s $ORA_CONNECT @$REQUEST`
for variable in ${ctl_data}
do
    var1=`echo ${variable} | awk -F, '{ print $1 }'`
    var2=`echo ${variable} | awk -F, '{ print $2 }'`
    ....... write into the file ......
done
To speed things up I am writing only 30 lines into the file, putting more data on each line, so that I only access the file 30 times.
It is still slow, so it is not the writing but the looping through the results that takes the time.
Does anyone have an idea how to improve it?
Rather than passing from Oracle to ksh, could you do it all in Oracle?
You can use the following to format your output as xml.
select xmlgen.getxml('select field1,field2 from table1') from dual;
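A sketch of running that from sqlplus and capturing the output straight into a file (the SET lines and the output.xml name are illustrative assumptions, not part of the original answer):
sqlplus -L -s $ORA_CONNECT <<'EOF' > output.xml
set heading off
set feedback off
set pagesize 0
set long 2000000000
select xmlgen.getxml('select field1,field2 from table1') from dual;
EOF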
You may be able to eliminate the calls to awk:
saveIFS="$IFS"
IFS=,
array=($variable)
IFS="$saveIFS"
var1=${array[0]} # or just use the array's elements in place of var1 and var2
var2=${array[1]}
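Applied to the loop from the question, that might look roughly like this (a sketch; the actual file-writing part is elided just as in the original):
for variable in ${ctl_data}
do
    saveIFS="$IFS"
    IFS=,
    array=($variable)
    IFS="$saveIFS"
    var1=${array[0]}
    var2=${array[1]}
    ....... write into the file ......
done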
You can reduce the number of calls to awk by using just one instance, e.g.:
query="select codtit||','||crsspt||' from table1"
.....
sqlplus -L -s $ORA_CONNECT #$REQUEST | awk -F"," 'BEGIN{
print "xml headers here..."
}
{
# generate xml here..
print "<tag1>variable 1 is "$1"</tag1>"
print "<tag2>variable 2 is "$2" and so on..</tag2>"
if ( some condition here is true ){
print "do something here"
}
}'
Redirect the above to a new file as necessary using > or >>.
I doubt that this is the most efficient way of dumping data to an xml file. You could try using Groovy for such a task. Take a look at the Groovy cookbook at http://groovy.codehaus.org/Convert+SQL+Result+To+XML
