Where are the files from a tap kept in Meltano (ETL)?

I have the following combination of tap/target in Meltano: tap-marketo and target-s3-parquet.
I want to extract data with tap-marketo from date A to date B in the past.
I saw that we can only define start_date and max_export_days.
I have tried to start with start_date A and stop the run once I reach B, but this does not work.
The loader only emits the state once its work is completely done, and the target was not called, so no load was performed.
I also saw that the export is being done:
{'run_id': '46ba5256-7019-48c7-890a-28746bb5272a', 'state_id': '2023-02-09T152428--tap-marketo--target-s3-parquet', 'stdio': 'stderr', 'cmd_type': 'extractor', 'name': 'tap-marketo', 'event': 'INFO GET: https://XXXXXXX/bulk/v1/activities/export/6636daf1-ad1e-41e1-b8d5-cdd31de5d4e0/file.json', 'level': 'info', 'timestamp': '2023-02-09T17:35:02.098016Z'}
But where do I find this file in my container?
I want to invoke the target separately, but I need to provide the --input.
# meltano invoke target-s3-parquet --help
Environment 'dev' is active
Usage: target-s3-parquet [OPTIONS]
Execute the Singer target.
Options:
  --input FILENAME          A path to read messages from instead of from
                            standard in.
  --config TEXT             Configuration file location or 'ENV' to use
                            environment variables.
  --format [json|markdown]  Specify output style for --about
  --about                   Display package metadata and settings.
  --version                 Display the package version.
  --help                    Show this message and exit.

To invoke the tap and target separately:
meltano invoke tap-marketo > ./outfile.singer.jsonl
cat ./outfile.singer.jsonl | meltano invoke target-s3-parquet
Which is equivalent to:
meltano invoke tap-marketo > ./outfile.singer.jsonl
meltano invoke target-s3-parquet --input=./outfile.singer.jsonl
In both of the above cases, you can retry just the second step.
However, if you invoke both together, using meltano run tap-marketo target-s3-parquet or similar, the intermediate file will not be stored on disk, and you would not be able to replay just the target-side processing.
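If the goal is to bound the extraction to a historical window from date A to date B, one option is to run the tap on its own with an overridden start_date and keep the Singer stream on disk for replay. This is only a sketch: it assumes tap-marketo picks up start_date through Meltano's usual <PLUGIN_NAME>_<SETTING_NAME> environment-variable convention, and the date below is a placeholder for date A.
# Sketch: override start_date for a one-off invocation and capture the stream,
# so the target step can be re-run separately (or the file trimmed to date B first).
TAP_MARKETO_START_DATE="2022-01-01T00:00:00Z" \
  meltano invoke tap-marketo > ./outfile.singer.jsonl
meltano invoke target-s3-parquet --input=./outfile.singer.jsonl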
Why these files aren't stored on disk by default
The stream of messages you'll see in the examples above will necessarily contain potentially secret or confidential data, and the volume contained within the stream can be extremely large, since it contains the records themselves as well as metadata used for coordinating between the tap and target. For this reason, this stream of messages from tap to target is not stored to disk during a normal sync operation.
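If what you actually need is to confirm how far the extraction got, you can inspect the state Meltano recorded for the pipeline. A sketch, assuming a Meltano version recent enough to ship the meltano state subcommands; the state ID is the one from the log excerpt in the question:
# List stored state IDs, then print the bookmark recorded for this pipeline.
meltano state list
meltano state get 2023-02-09T152428--tap-marketo--target-s3-parquet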

Related

Avoid mass e-mail notification in error analysis bash script

I am selecting error log details from a Docker container and decide within a shell script how and when to alert about the issue via Discord and/or email.
Because I am receiving the email alerts too often with the same information in the email body, I want to adjust the following two steps of my script:
Fatal error log selection:
FATS="$(docker logs --since 24h $NODENAME 2>&1 | grep 'FATAL' | grep -v 'INFO')"
Email sent in case FATS has some content:
swaks --from "$MAILFROM" --to "$MAILTO" --server "$MAILSERVER" --auth LOGIN --auth-user "$MAILUSER" --auth-password "$MAILPASS" --h-Subject "FATAL ERRORS FOUND" --body "$FATS" --silent "1"
How can I send the email only when FATS has different content than in the previous run of the script? I have thought about hashing its content and storing the hash in a text file; if the hash is the same as in the previous run, the email would be skipped.
Another option could be a local, temporary variable in the user's global bash profile, so that no file has to be stored on the file system (to avoid reads/writes).
How can I do that?
When you are writing a script for your monitoring, add functions for additional functionality, like:
logging all the alerts that have been sent
making sure you don't send more than one alert each hour
considering sending warnings only during working hours
escalating a message when it fails N times without intermediate success
possibly sending alerts to different receivers (different email addresses, or also to SMS or Teams)
making an interface for an operator, so they can look back to when something first went wrong
When you have control over which messages you send, it is easy to filter out duplicate messages (after changing --since); see the sketch below.
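For the duplicate-suppression part of the question, a minimal sketch using a hash stored in a small state file (the state-file path is an assumption; sha256sum can be swapped for any checksum utility available on the host):
# Sketch: only send the mail when the selected log lines differ from the last run.
STATE_FILE="/var/tmp/fatal_alert.sha256"   # assumed location for the previous hash
FATS="$(docker logs --since 24h "$NODENAME" 2>&1 | grep 'FATAL' | grep -v 'INFO')"
if [ -n "$FATS" ]; then
    NEW_HASH="$(printf '%s' "$FATS" | sha256sum | cut -d' ' -f1)"
    OLD_HASH="$(cat "$STATE_FILE" 2>/dev/null)"
    if [ "$NEW_HASH" != "$OLD_HASH" ]; then
        swaks --from "$MAILFROM" --to "$MAILTO" --server "$MAILSERVER" \
              --auth LOGIN --auth-user "$MAILUSER" --auth-password "$MAILPASS" \
              --h-Subject "FATAL ERRORS FOUND" --body "$FATS" --silent "1"
        printf '%s' "$NEW_HASH" > "$STATE_FILE"
    fi
fi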
I've chosen the proposal of #ralf-dreager and reduced the selection to 1d and 1h. Consequently, I've changed my monitoring script to go through either the results of 1d or just 1h, without having to select again and again each time. Huge performance improvement, and no need to store anything else in a variable or on the file system.
FATS="$(docker logs --since 1h $NODENAME 2>&1 | grep 'FATAL' | grep -v 'INFO')"

MapReduceIndexerTool output dir error "Cannot write parent of file"

I want to use Cloudera's MapReduceIndexerTool to understand how morphlines work. I created a basic morphline that just reads lines from the input file, and I tried to run the tool with this command:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphline.conf \
--output-dir hdfs:///hostname/dir/ \
--dry-run true
Hadoop is installed on the same machine where I run this command.
The error I'm getting is the following:
net.sourceforge.argparse4j.inf.ArgumentParserException: Cannot write parent of file: hdfs:/hostname/dir
at org.apache.solr.hadoop.PathArgumentType.verifyCanWriteParent(PathArgumentType.java:200)
The /dir directory has 777 permissions on it, so it is definitely allowed to write into it. I don't know what I should do to allow it to write into that output directory.
I'm new to HDFS and I don't know how I should approach this problem. Logs don't offer me any info about that.
What I have tried so far (with no result):
created a hierarchy of 2 directories (/dir/dir2) and put 777 permissions on both of them
changed the output-dir schema from hdfs:///... to hdfs://... because all the examples in the --help menu are built that way, but this leads to an invalid schema error
Thank you.
It states 'cannot write parent of file', and the parent in your case is /. Take a look at the source:
private void verifyCanWriteParent(ArgumentParser parser, Path file) throws ArgumentParserException, IOException {
  Path parent = file.getParent();
  if (parent == null || !fs.exists(parent) || !fs.getFileStatus(parent).getPermission().getUserAction().implies(FsAction.WRITE)) {
    throw new ArgumentParserException("Cannot write parent of file: " + file, parser);
  }
}
What is printed in the message is file, in your case hdfs:/hostname/dir, so file.getParent() will be /.
Additionally, you can test the permissions with the hadoop fs command; for example, you can try to create a zero-length file in the path:
hadoop fs -touchz /test-file
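If the touchz fails, you can also look directly at the permissions of the parent directory the tool complains about, which in this case is the HDFS root (a generic check, not specific to the indexer tool):
# -d lists the directory entry itself rather than its contents,
# so this shows the permissions and owner of / on HDFS.
hadoop fs -ls -d /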
I solved that problem after days of working on it.
The problem is with that line --output-dir hdfs:///hostname/dir/.
First of all, there are not 3 slashes at the beginning, as I had put in my repeated attempts to make this work; there are only 2 (as in any valid HDFS URI). I had actually put 3 slashes because otherwise the tool throws an invalid schema exception! You can easily see in the code that the schema check is done before the verifyCanWriteParent check.
I tried to get the hostname by simply running the hostname command on the CentOS machine that I was running the tool on. This was the main issue: when I analyzed the /etc/hosts file, I saw that there are 2 hostnames for the same local IP. I took the second one and it worked. (I also attached the port to the hostname, so the final format is: --output-dir hdfs://correct_hostname:8020/path/to/file/from/hdfs)
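A quick way to check for that kind of /etc/hosts ambiguity (plain Linux commands, not specific to this tool):
# Show which names are mapped to this machine's IP; the NameNode may be
# registered under a different name than what `hostname` prints.
hostname
hostname -i
grep "$(hostname -i)" /etc/hosts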
This error is very confusing because everywhere you look for the namenode hostname, you will see the same thing that the hostname command returns. Moreover, the errors are not structured in a way that lets you diagnose the problem and take a logical path to solving it.
Additional information regarding this tool and debugging it
If you want to see the actual code that runs behind it, check the Cloudera version that you are running and select the same branch on the official repository. The master branch is not up to date.
If you want to just run this tool to play with the morphline (using the --dry-run option) without connecting to Solr, you can't. You have to specify a ZooKeeper endpoint and a Solr collection, or a Solr config directory, which involves additional research. This is something that could be improved in this tool.
You don't need to run the tool with -u hdfs, it works with a regular user.

Store the log output into a file with cmdenv-output-file

I need to save the content of the module log shown by OMNeT++/Tkenv into a file. I added the following to omnetpp.ini:
cmdenv-express-mode = false
cmdenv-output-file = log.txt
but I have two problems:
1) after the simulation, I do not find log.txt if I have not created it myself;
2) and when I create it before launching the simulation, under ../omnetpp-4.6/log.txt, I find it empty.
I used EV << to display the content of the variables I use, and I need to resolve this problem in order to analyze the traffic. How can I do that, please?
You have to start your simulation in Cmdenv mode. To do that, go to Run | Run Configurations, select your configuration, then select Command line as the User interface. The log file is created in the simulation's directory by default.
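Alternatively, you can start the simulation under Cmdenv directly from a terminal, so the cmdenv-* options in omnetpp.ini take effect; the executable and configuration names below are placeholders for your project:
# Run with the command-line environment instead of Tkenv; cmdenv-output-file
# (log.txt) is then written in the working directory.
./mysimulation -u Cmdenv -c General omnetpp.ini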

Extract all Change Packages for all the files in a PTC project

I am trying to get a list of all the change packages used to update all the files in a PTC project. I used the following command:
si viewproject --recurse --fields=name,creationcpid,cpid,memberrev,indent --project=%Project% --hostname=%Host_name% --port=%port1% -Y
But I do not get all the CPs used, only the first one. I also tried the command:
si rlog --recurse --format="{membername},{memberrev},{revision},{cpid},{author}\n" --noHeaderFormat --project=%Project% --hostname=%Host_name% --port=%port1%
Using the following CLI command you will get all change packages used by the current user:
si viewcps
But viewcps accepts --filter=, where you can specify the project:
si viewcps --hostname=%Host_name% --port=%port1% --filter=project:%Project%
This command needs to be called recursively for each subproject, because it will only return change packages from the first level of the specified project; see the sketch after the usage output below.
Usage: si viewcps options... issue|issue:change package id...; options are:
--fields=field1[:width1],field2[:width2]... where fieldn can be any of: closeddate,cptype,creationdate,deployrequestid,deployrequeststate,deploytarget,description,id,issue,propagated,propagatedby,siserver,stage,stagingsystem,state,summary,user The fields to be displayed
--filter=user:name
issueid:issue
state[:closed|:open|:submitted|:accepted|:rejected|:discarded|:commitfailed]
closeddate:<date>
creationdate:<date>
membertype[:member|:subproject]
member:<expression>
project:<expression>
variant:<expression>
mainline
description:<expression>
summary:<expression>
typemodifier[:committed|:pending]
type[:add|:addfromarchive|:drop|:import|:exclusivelock|:nonexclusivelock|:renamefrom|:renameto|:movememberfrom|:movememberto|:update|:updatearchive|:updaterevision|:createsubproject|:addsubproject|:addsharedsubproject|:configuresubprojectfrom|:configuresubprojectto|:movesubprojectfrom|:movesubprojectto|:dropsubproject]
hasissue
pendingreviewby:name
acceptedby:name[;<date>]
rejectedby:name[;<date>]
cptype[:development|:propagation|:deploy|:staging|:resolution]
stagingsystem:<expression>
stage:<expression>
deploytarget:<expression>
deployrequeststate[|:cancelled|:cleanedup|:cleaningup|:cleanupfailed|:created|:deployed|:executed|:executing|:packageactionsfailed|:packagecontentfailed|:packagingactions|:packagingcontent|:prepared|:preparing|:queuedonsource|:queuedontarget|:readytodeploy|:readytotransfer|:rollbackfailed|:rolledback|:rollingback|:stopped|:transferfailed|:transferring]
deployrequestid:<expression> The filter used to select change packages
--height=value The height in pixels of the windows
--myReviews Show the change packages awaiting review by current user
--query=value The query used to select change packages
--width=value The width in pixels of the windows
-x value The x location in pixels of the window
-y value The y location in pixels of the window
-? Shows the usage for a command
--[no]batch Control batch mode (no user interaction in batch mode)
--cwd=value Act as if command executed in specified directory
-F value Read the selection from a specified file
--forceConfirm=[yes|no] Specify an answer to all confirmation questions
-g User interaction should happen via the GUI
--gui User interaction should happen via the GUI
--hostname=value Hostname of server
-N Responds to all confirmations with "no"
--no Responds to all confirmations with "no"
--password=value Credentials (e.g., password) to login with
--[no]persist Control persistence of CLI views
--port=value TCP/IP port number of server
--quiet Control status display
--selectionFile=value Read the selection from a specified file
--settingsUI=[gui|default] Control UI for command options
--status=[none|gui|default] Control status display
--usage Shows the usage for a command
--user=value Username to login to server with
-Y Responds to all confirmations with "yes"
--yes Responds to all confirmations with "yes"
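For the recursion mentioned above, a rough sketch in shell (PROJECT, HOSTNAME and PORT stand in for the %Project%, %Host_name% and %port1% placeholders used earlier; extracting subproject names by grepping for ".pj" is an assumption that may need adjusting to your si viewproject output):
# Sketch: collect change packages for the top project and for each subproject.
list_cps() {
    si viewcps --hostname="$HOSTNAME" --port="$PORT" --filter=project:"$1"
}
list_cps "$PROJECT"
si viewproject --recurse --fields=name --project="$PROJECT" \
    --hostname="$HOSTNAME" --port="$PORT" -Y |
    grep '\.pj$' |
    while read -r subproject; do
        list_cps "$subproject"
    done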

command to identify the clearcase ucm view which is out of sync from stream?

In ClearCase UCM, after a rebase or any stream configuration change made through one view, the other views require a "setcs -stream" (CLI), or the "Synchronize with stream" (GUI) button becomes enabled in the view properties window. How can I identify from the command line that my view is out of sync with the stream? What is the command to identify that my view is out of sync?
Thanks VonC.
The method below failed (the stream config spec and the view config spec are sometimes not in the same order).
cleartool catcs returns some UUIDs which we cannot compare with the foundation baselines.
I have achieved this by comparing the output of cleartool dump -l <streamname> with the output of cleartool catcs.
I have found another method.
The ucmutil command ucmutil lspvar -pvar SUM_CSPEC_ID <streamname> returns the config_spec_id of the stream, which can be compared with the output of cleartool catcs | grep -i identity. The config_spec_id is mentioned on the second line of the view config spec, as identity UCM.Stream oid:uuid#vobuuid:uuid config_spec_id.
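A sketch of that comparison (the stream selector format and the exact output of ucmutil lspvar are assumptions; check them on your installation, since ucmutil is an undocumented utility):
# Sketch: compare the stream's config_spec_id with the identity line recorded
# in the view's config spec; if the id does not appear there, the view is out of sync.
stream_cspec_id=$(ucmutil lspvar -pvar SUM_CSPEC_ID stream:myStream@/mypvob)
view_identity=$(cleartool catcs | grep -i identity)
# Note: lspvar may print a label next to the value; trim it if needed.
case "$view_identity" in
    *"$stream_cspec_id"*) echo "view is in sync with the stream" ;;
    *)                    echo "view appears OUT OF SYNC" ;;
esac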
Generally, the output returned by a cleartool ls within an out-of-sync view reflects that out-of-sync status.
You can also try, as documented in this technote, a cleartool checkout.
If it returns this error message:
cleartool: Error: Checkout is currently disabled for element "element_name".
Its config spec rule information is currently unavailable
due to either an aborted update or an update in progress.
... that means the view is out of sync.
Don't forget, in some cases, to do first a:
cleartool chstream -generate yourStream@\pvob
That is useful if a component has switched from non-writable to writable.
But if just foundation baselines have changed, then cleartool setcs -stream is enough.
If all my components are read-only in that stream and view (where I only do builds), how can I check in that case whether my view is out of sync?
One way would be to get:
all foundation baselines of the stream (with fmt_ccase)
cd /path/to/my/view
myStream=$(cleartool lsstream -cview)
myFoundationBaselines=$(cleartool descr -fmt "%[found_bls]CXp" $myStream)
compare those baselines with the ones of the view
(grep for each baseline found in the previous step in your config spec)
cd /path/to/your/view
cleartool catcs
In short, there is no direct native way: you need to script it.
You will see the names of the baselines currently used by the view by grepping a catcs output for -mkbranch.
Those are the baselines you need to compare with the foundation baselines.
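A rough sketch of that scripted comparison (shell; whether each foundation baseline name appears literally in the config spec depends on how the stream generated it, so treat the grep as an assumption to adjust):
# Sketch: flag the view as out of sync if any foundation baseline of its stream
# does not appear in the view's config spec.
cd /path/to/my/view || exit 1
stream=$(cleartool lsstream -fmt "%Xn" -cview)                  # e.g. stream:myStream@/mypvob
baselines=$(cleartool descr -fmt "%[found_bls]p\n" "$stream")   # foundation baselines
out_of_sync=0
for bl in $baselines; do
    if ! cleartool catcs | grep -q -- "$bl"; then
        echo "foundation baseline $bl not found in the config spec"
        out_of_sync=1
    fi
done
[ "$out_of_sync" -eq 1 ] && echo "view appears OUT OF SYNC with $stream"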

Resources