Input a dataset into Vowpalwabbit - vowpalwabbit

If I have data set in tsv format on my desktop, How can I give it as input into vowpal I tried vw -d/Desktop/Boston.tsv it cannot the read the file. I am new to vowpal and shell scripting. please help me with this

You should familiarize yourself with VW input format.
Then you can take a look on set of bash commands used by one of kagglers (xbsd). It's for kaggle's contest specific dataset which was in csv. I believe you can adapt it for tsv (it's the tab separated csv, isn't it?) by changing perl -wnlaF',' to perl -wnlaF'\t'. And you should adjust field indexes in $F[] for your specific dataset.
Then you can use it as vw -d Boston.vw or just vw Boston.vw

Related

PYSPARK - Reading, Converting and splitting a EBCDIC Mainframe file into DataFrame

We have an EBCDIC Mainframe format file which is already loaded into Hadoop HDFS Sytem. The File has the Corresponding COBOL structure as well. We have to Read this file from HDFS, Convert the file data into ASCII format and need to split the data into Dataframe based on its COBOL Structure. I've tried some options which didn't seem to work. Could anyone please suggest us some proven or working ways.
For python, take a look at the Copybook package (https://github.com/zalmane/copybook). It supports most features of Copybook includes REDEFINES and OCCURS as well as a wide variety of PIC formats.
pip install copybook
root = copybook.parse_file('sample.cbl')
For parsing into a PySpark dataframe, you can use a flattened list of fields and use a UDF to parse based on the offsets:
offset_list = root.to_flat_list()
disclaimer : I am the maintainer of https://github.com/zalmane/copybook
Find the COBOL Language Reference manual and research functions DISPLAY-OF and National-Of. The link : https://www.ibm.com/support/pages/how-convert-ebcdic-ascii-or-ascii-ebcdic-cobol-program.

read a .fit file on Linux

How could I read Garmin's .fit file on Linux. I'd like to use it for some data analysis but the file is a binary file.
I have visited http://garmin.kiesewetter.nl/ but the website does not seem to work.
Thanks
You can use GPSbabel to do this. It's a command-line tool, so you end up with something like:
gpsbabel -i garmin_fit -f {filename}.fit -o csv -F {output filename}.csv
and you'll get a text file with all the lat/long coordinates.
What's trickier is getting out other data, ie: if you want speed, time, or other information from the .fit file. You can easily get those into a .gpx, where they're in xml and human-readable, but I haven't yet found a single line solution for getting that data into a csv.
The company that created ANT made an SDK package available here:
https://www.thisisant.com/resources/fit
When unzipping this, there is a java/FitCSVTool.jar file. Then:
java -jar java/FitCSVTool.jar -b input.fit output.csv
I tested with a couple of files and it seems to work really well. Then of course the format of the csv can be a little bit complex.
For example, latitude and longitude are stored in semicircles, so it should be multiplied by 180/(2^31) to give GPS coordinates.
You need to convert the file to a .csv, the Garmin repair tool at http://garmin.kiesewetter.nl/ will do this for you. I've just loaded the site fine, try again it may have been temporarily down.
To add a little more detail:
"FIT or Flexible and Interoperable Data Transfer is a file format used for GPS tracks and routes. It is used by newer Garmin fitness GPS devices, including the Edge and Forerunner." From the OpenStreetMap Wiki http://wiki.openstreetmap.org/wiki/FIT
There are many tools to convert these files to other formats for different uses, which one you choose depends on the use. GPSBabel is another converer tool that may help. gpsbabel.org (I can't post two links yet :)
This page parses the file and lets you download it as tables. https://www.fitfileviewer.com/ The fun bit is converting the timestamps from numbers to readable timestamps Garmin .fit file timestamp

How to write a Custom Input Format

I am a newbie to Hadoop and I have a situation where only one line per 4 lines of the input text is relevant. Currently I am using the default TextInputFormat and a conditional logic to skip all the other three lines which is irrelevant.
How can I use a Custom Input Format to handle this. Since Am new to hadoop I don't know much about CustomInputFormat. Any help would be appreciated. Thanks !
I think you can use NLineInputFormat where you can specify how many line constructs one record. This could be easy & ready to use solution.
If you want to implement your own input format then it you would probably implement custom input format & record reader to specify what constructs your one record.
below is one of of the example
http://deep-developers.blogspot.in/2014/06/custom-input-split-and-custom.html

Is MATLAB performace better than bash scripts for text manipulation?

I have a program (for simulation of some physical systems) that gives very big (more than 1GB) text files as output. I have to extract the desired results (numbers) from this text file. Currently I've written bash scripts for this purpose that, for example, search the text files for some expressions and write the number after that expression in a separate file; e.g.:
grep $EXP | awk '{print $14}' > tmp
Unfortunately, these bash scripts are very time-consuming for large input text files. So I am considering to use another language for searching the text files. As there are many scripts that I have to rewrite, does writing the scripts in MATLAB give me a considarable speed-up?
As a side question, are there better options than MATLAB? (probably compiled languages like C?)

How do I plot some data (using xmgrace in the terminal) using dots, not lines, without explicitly changing it in the GUI?

i'm using xmgrace in the terminal, and want the data to be displayed directly as dots instead of lines. Achieving this in the GUI is simple, but I have to read in multiple files, and do not want to change it every time i start xmgrace. Can I add a command to the files that are read in? Or can I use an option in the terminal when I start xmgrace?
The correct way to set the appearance of a plot from the commandline is to use an existing parameter file, specified using the flag
-param settings.par
The parameter file can be stored beforehand, using the GUI to modify the appearance of an existing, similar plot. Modify the plot as you like, then save the appearance settings in a parameter file (convention is to use the extension .par) using Plot > Save Parameters.
A typical example command would then be
xmgrace -block data2.dat -bxy 1:4 -block data2.dat -bxy 1:6 -param settings.par
In my experience, calling the
-param
flag last thing in your command works best.
There really is no need to be manually text-editing your grace plot files (.agr) to achieve this.
xmgrace has a full and complex language for expressing the configuration of the look and feel for the graph. There are two ways to go about what you described. The simple way is to load the dataset into xmgrace, change everything to make it look the way you want, then save the dataset. You will see the dataset now has tons of lines describing the configuration "#g0 on" "# s0 linestyle 1" etc with your dataset at the end, terminated by a &.
To replicate that graph, spit out the saved header, insert your data, and the insert the trailing &. Feed the result into xmgrace and everything will be all set up. Once you get comfortable you can start doing dynamic substitutions to rename the graph or change the symbol or whatever. See /usr/share/grace/examples for examples of what grace can do (and the config files which generate that).
The more complex method is to load the dataset, save it immediately, change it to look the way you want, and then save it again under a different name. Run diff on the two files and you will get a set of changes. You might need at most a handful of other lines from the non-changing portion, but that is somewhat rare. This produces the minimal set of fixed headers you need to prepend to the dataset. It usually isn't worth the effort to reduce the prefix size.

Resources