Cannot link external data with TopoJSON - topojson

I am trying to link a shapefile europe.shp with a data.csv external file while producing a TopoJSON file. They both have the iso_a3 country code as common ID.
This is the head of data.csv:
iso_a3;anzahl_jets;typen1;typen2;typen3;text;pop
ALB;;;;;;3639453
AND;;;;;;83888
AUT;15;15 Eurofighter;;;;8210281
BEL;81;59 F-16;22 alte Saab;;;10414336
Converting europe.shp to europe.json alone works fine, all properties are preserved.
When using the below statement though, only the properties of europe.shp are preserved (iso_a3 and name_de).
topojson --id-property iso_a3 -o europe.json -p iso_a3,jets=+anzahl_jets,pop=+pop,name_de=name_de --simplify-proportion 0.25 --width 900 --height 600 --external-properties data.csv -- countries=europe.shp
What am I doing wrong?

topojson doesn't parse ;-delimited CSV files. You need to use , or \t (the latter preferably with a file extension .tsv).
See https://github.com/mbostock/topojson/blob/master/bin/topojson#L369-L374
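One way to get a file that topojson can read is to convert the delimiter before running the command. This is only a minimal sketch: it assumes that no field in data.csv contains a literal comma or a quoted semicolon (true for the sample rows shown in the question, not verified for the whole file). The function name and the output file name data_comma.csv are made up for illustration.

```shell
# Convert a ;-delimited file to a comma-delimited one on stdout.
# Assumption: no field contains a literal comma or a quoted semicolon.
semicolon_to_comma() {
  tr ';' ',' < "$1"
}

# Usage (file names are illustrative):
#   semicolon_to_comma data.csv > data_comma.csv
#   ...then pass --external-properties data_comma.csv to topojson.
```

If any field may contain a comma, a CSV-aware tool would be a safer choice than a plain character swap.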

Related

How to add sysdate from bcp

I have a .csv file with the following sample data format:
REFID|PARENTID|QTY|DESCRIPTION|DATE
AA01|1234|1|1st item|null
AA02|12345|2|2nd item|null
AA03|12345|3|3rd item|null
AA04|12345|4|4th item|null
To load the above file into a table I am using below BCP command:
/bcp $TABLE_NAME in $FILE_NAME -S $DB_SERVER -t "|" -F 1 -U $DB_USERNAME -d $DB_NAME
What I want to end up with is shown below (the current date in place of null, added during the bcp load):
AA01|1234|1|1st item|3/16/2020
AA02|12345|2|2nd item|3/16/2020
AA03|12345|3|3rd item|3/16/2020
AA04|12345|4|4th item|3/16/2020
Update: I was able to exclude the header with the -F option from @Jamie's answer, but I am still looking for help with inserting the date with bcp. I looked through some old Q&As, but no luck so far.
To exclude a single header record, you can use the -F option, which tells BCP which line of the file to start loading from. For your sample, -F2 should work fine. However, your command has other issues. See comments.
BCP cannot introduce new data, such as a date value, while copying data into your table. To accomplish this, I suggest either defining a default on the date column, or first loading the raw data into a table without the date column and then adding the date value as you see fit in later processing.
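Another option, not mentioned in the answer above, is to rewrite the file before bcp ever sees it. The following is a rough sketch under several assumptions: the pipe character never appears inside a field, the header is on line 1, and the date is always the last field. The function name stamp_date and the output file name are hypothetical. Note that date +%m/%d/%Y produces a zero-padded month (03/16/2020 rather than 3/16/2020); whether the target column accepts that string depends on the column type and server settings.

```shell
# Replace the last field of every data row with a date string.
# $1 = input file, $2 = date to write (optional, defaults to today).
# Assumptions: '|' never occurs inside a field; line 1 is the header.
stamp_date() {
  awk -F'|' -v OFS='|' -v d="${2:-$(date +%m/%d/%Y)}" \
    'NR > 1 && NF { $NF = d } { print }' "$1"
}

# Usage (file names are illustrative):
#   stamp_date data.csv > data_dated.csv
#   ...then load data_dated.csv with bcp, skipping the header with -F2.
```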

Faster way of Appending/combining thousands (42000) of netCDF files in NCO

I am having trouble properly combining thousands of netCDF files (42000+; 3 GB in size for this particular folder/variable). The main variable that I want to combine has a structure of (6, 127, 118), i.e. (time, lat, lon).
I am appending the files one by one since the list of files is too long.
I have tried:
for i in input_source/**/**/*.nc; do ncrcat -A -h append_output.nc $i append_output.nc ; done
but this method is really slow (on the order of kB/s, and it seems to get slower as more files are appended) and it also gives a warning:
ncrcat: WARNING Intra-file non-monotonicity. Record coordinate "forecast_period" does not monotonically increase between (input file file1.nc record indices: 17, 18) (output file file1.nc record indices 17, 18) record coordinate values 6.000000, 1.000000
The warning arises because the variable "forecast_period" simply repeats 1 to 6 n times, where n = 42000 files, i.e. [1,2,3,4,5,6,1,2,3,4,5,6,...].
Despite this warning, I can still open the file and ncrcat does what it is supposed to; it is just slow, at least with this particular method.
I have also tried adding in the option:
--no_tmp_fl
but this gives an error:
ERROR: nco__open() unable to open file "append_output.nc"
full error attached below
If it helps, I am using WSL with Ubuntu on Windows 10.
I am new to bash, and any comments would be much appreciated.
Either of these commands should work:
ncrcat --no_tmp_fl -h *.nc append_output.nc
or
ls input_source/**/**/*.nc | ncrcat --no_tmp_fl -h append_output.nc
Your original command is slow because it opens and closes the output file N times. These commands open it once, fill it up, then close it.
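With 42000+ files, expanding a glob on the ls command line can run into the shell's "Argument list too long" limit, and the ** pattern only recurses in bash when the globstar option is enabled. A possible variation, sketched here, builds the list with find and pipes it to ncrcat in the same stdin form as above. The helper name list_nc_files is hypothetical, and the sketch assumes that sorting the paths lexicographically gives the record order you want, which depends on how the files are named.

```shell
# Print every .nc file under directory $1, one per line, sorted by path.
# Assumption: lexicographic path order matches the desired record order.
list_nc_files() {
  find "$1" -type f -name '*.nc' | LC_ALL=C sort
}

# Usage, feeding the list to ncrcat on stdin:
#   list_nc_files input_source | ncrcat --no_tmp_fl -h append_output.nc
```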
I would use CDO for this task. Given the huge number of files it is recommended to first sort them on time (assuming you want to merge them along the time axis). After that, you can use
cdo cat *.nc outfile

Specific Column Dump from Parquet File using Parquet-tools.jar

I want to dump only a specific column to a text file using parquet-tools-1.8.1.jar, but I have not been able to do so. I am trying the command below. Please note that my column name contains forward slashes.
parquet-tools-1.8.1.jar dump --column 'dir1/log1/job12121' '/hdfs-path/to/parquet file with space.parquet' > /home/local/parquet/output.text
Run
hadoop jar parquet-tools-1.8.1.jar parquet.tools.Main dump --column 'dir1/log1/job12121' '/hdfs-path/to/parquet file with space.parquet' > /home/local/parquet/output.text
Please use the following:
hadoop jar parquet-tools-1.8.1.jar dump -c dir1 log1 job12121 -m /hdfs-path/to/parquet file with space.parquet >> /home/local/parquet/output.text
Note: no single quotes around the input arguments.

Issue in creating Vectors from text in Mahout

I'm using Mahout 0.9 (installed on HDP 2.2) for topic discovery (Latent Dirichlet Allocation algorithm). I have my text file stored in the directory inputraw and executed the following commands in order:
command #1:
mahout seqdirectory -i inputraw -o output-directory -c UTF-8
command #2:
mahout seq2sparse -i output-directory -o output-vector-str -wt tf -ng 3 --maxDFPercent 40 -ow -nv
command #3:
mahout rowid -i output-vector-str/tf-vectors/ -o output-vector-int
command #4:
mahout cvb -i output-vector-int/matrix -o output-topics -k 1 -mt output-tmp -x 10 -dict output-vector-str/dictionary.file-0
After executing the second command, it creates, as expected, a bunch of subfolders and files under output-vector-str (named df-count, dictionary.file-0, frequency.file-0, tf-vectors, tokenized-documents and wordcount). The sizes of these files all look OK considering the size of my input file; however, the file under `tf-vectors` is very small (in fact, only 118 bytes).
Since `tf-vectors` is the input to the third command, the third command also generates a small file. Does anyone know:
What is the reason for the file under the `tf-vectors` folder being that small? There must be something wrong.
Starting from the first command, all the generated files have a strange coding and are not human readable. Is this something expected?
Your answers are as follows:
what is the reason of the file under tf-vectors folder to be that small?
The vectors are small because you set maxDFPercent to only 40. This means that only terms with a document frequency (the percentage of documents in which the term occurs) of 40% or less are taken into consideration when generating the vectors.
Starting from the first command, all the generated files have a strange coding and are not human readable. Is this something expected?
Mahout has a command called mahout seqdumper, which comes to your rescue by dumping files from "sequential" format into human-readable format.
Good Luck!!

Read Native format bcp data file

With the Unix shell script, I am doing a bcp out from a table in Server1 using NATIVE format to a file - XXXX.bcpdat, then bcp in the file to a table of same structure in Server2.
The bcp command we have is
bcp "$dbname".."$tablename" out XXXX.bcpdat -n
bcp "$dbname".."$tablename" in XXXX.bcpdat -n -b10000
This bcp out and bcp in work as expected from and into the tables.
But I want to make an urgent change here:
I want to get the total number of rows (a row may have 120 or 30 or 40 records) in the bcp data file (XXXX.bcpdat).
But with the file in native format, I cannot tell the rows apart or see how they are separated. If I run head -10 XXXX.bcpdat or tail -10 XXXX.bcpdat, it prints everything in the file. "wc -l", "awk" and "cut" do not help me get the row count from the file. Nothing marks where a row ends, the way it does in a character-mode bcp load. It would be great if someone could help me, as soon as possible, find the total number of rows (not records) in the bcpdat file. Thanks a lot in advance.
