gpload utility: bulk data loading from a Hadoop source into Greenplum

We have a small Hadoop cluster and a small Greenplum cluster. We want to load Hadoop data into Greenplum using a gpload MERGE operation.
Please advise on the process.
Questions:
Do I need to install the gpload utility on the Hadoop-based Linux VMs?
Do I then have to schedule the MERGE-based gpload script at a regular interval?
Is it possible to ingest a Hadoop file by running gpload on the Greenplum VMs only?
Input gpload.yml:
VERSION: 1.0.0.1
DATABASE: test
USER: gpadmin
HOST: gpdbhostname
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - gpdbhostname
        PORT: 8080
        FILE:
          - /home/gpadmin/demo/input_table.txt
    - COLUMNS:
        - id: bigint
        - time: timestamp
    - FORMAT: text
    - DELIMITER: ';'
    - NULL_AS: ''
  OUTPUT:
    - TABLE: output_table
    - MODE: merge
    - MATCH_COLUMNS:
        - id
    - UPDATE_COLUMNS:
        - time
In this case, what should my gpload.yml look like if I want to load a source HDFS CSV file into a regular Greenplum table via a gpload MERGE script?
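gpload reads local files (it starts a gpfdist instance to serve them to the segments), so it cannot point at an hdfs:// URL directly. One common pattern is to first stage the HDFS file onto the host where gpload runs, e.g. with `hdfs dfs -get /data/input_table.csv /home/gpadmin/demo/`, and then reference the staged copy. A minimal sketch, assuming the same table and columns as above and a comma-delimited CSV (the HDFS path and staged filename are hypothetical):

```yaml
VERSION: 1.0.0.1
DATABASE: test
USER: gpadmin
HOST: gpdbhostname
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - gpdbhostname
        PORT: 8080
        FILE:
          - /home/gpadmin/demo/input_table.csv   # staged from HDFS beforehand
    - COLUMNS:
        - id: bigint
        - time: timestamp
    - FORMAT: csv
    - DELIMITER: ','
    - NULL_AS: ''
  OUTPUT:
    - TABLE: output_table
    - MODE: merge
    - MATCH_COLUMNS:
        - id
    - UPDATE_COLUMNS:
        - time
```

The staging step and the gpload run can then be chained in a cron job on the Greenplum host, which also addresses the scheduling question: gpload does not need to run on the Hadoop VMs.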

Related

How to create additional buckets on InfluxDB Docker initialization

I don't know how to approach my problem because I can't find similar cases to use as an example.
I want to set up InfluxDB with two buckets to store Telegraf data, but the setup only creates the init bucket.
These are the two InfluxDB services in my Docker Compose file:
influxdb:
  image: influxdb:latest
  volumes:
    - ./influxdbv2:/root/.influxdbv2
  environment:
    # Use these same configuration parameters in your telegraf configuration, mytelegraf.conf.
    - DOCKER_INFLUXDB_INIT_MODE=setup
    - DOCKER_INFLUXDB_INIT_USERNAME=User
    - DOCKER_INFLUXDB_INIT_PASSWORD=****
    - DOCKER_INFLUXDB_INIT_ORG=org
    - DOCKER_INFLUXDB_INIT_BUCKET=data
    - DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=****
  ports:
    - "8086:8086"
influxdb_cli:
  image: influxdb:latest
  links:
    - influxdb
  volumes:
    # Mount for influxdb data directory and configuration
    - ./influxdbv2:/root/.influxdbv2
  entrypoint: ["./entrypoint.sh"]
  restart: on-failure:10
  depends_on:
    - influxdb
When it initializes, the InfluxDB setup runs correctly, but the script does not run, and Telegraf returns 404 when trying to write to the buckets.
I ran into the same issue today, and as far as I am aware you cannot currently initialize two buckets with the DOCKER_INFLUXDB_INIT_BUCKET environment variable.
So I created a shell script called createSecondBucket.sh, which I found in another answer to this question. It uses the influx CLI to create a new bucket. The script looks like this:
#!/bin/sh
set -e
influx bucket create -n YOUR_BUCKET_NAME -o YOUR_ORG_NAME -r 0
Note that I had to change the line endings to unix (LF) to run the script without errors.
Inside my Dockerfile I added the following lines:
COPY ./createSecondBucket.sh /docker-entrypoint-initdb.d
RUN chmod +x /docker-entrypoint-initdb.d/createSecondBucket.sh
These lines cause the script to be executed when the container starts for the first time. I found this information on the MongoDB Docker Hub page, under the "Initializing a fresh instance" headline.
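As an alternative to building a custom image, the script can usually be mounted directly from the Compose file, which avoids the Dockerfile step entirely (a sketch; the host path ./createSecondBucket.sh is an assumption):

```yaml
influxdb:
  image: influxdb:latest
  volumes:
    - ./influxdbv2:/root/.influxdbv2
    # The official influxdb image runs scripts from this directory once,
    # after the DOCKER_INFLUXDB_INIT_MODE=setup step completes
    - ./createSecondBucket.sh:/docker-entrypoint-initdb.d/createSecondBucket.sh
```

Note that init scripts only run against a fresh data directory; if ./influxdbv2 already contains a setup, the script is skipped.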

How to use gpload utility?

I have YAML file below:
---
VERSION: 1.0.0.1
DATABASE: xxx
USER: xxx
HOST: xxx
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - 192.168.0.21
        PORT: 8081
        FILE:
          - /home/root/test_input.txt
    - COLUMNS:
        - age: int4
        - name: varchar
        - surname: varchar
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 2
    - LOG_ERRORS: true
  OUTPUT:
    - TABLE: sf_dfs.test_gpload
    - MODE: INSERT
  PRELOAD:
    - REUSE_TABLES: true
But I receive an error: error when connecting to gpfdist http://192.168.0.21:8081//home/root/test_input.txt, quit after 11 tries (seg0 slice1 192.168.0.23:6000 pid=2021845) encountered while running INSERT INTO.
Does anybody have experience with this program?
It looks like a port issue. If the database is up, rerun the job with a different port, and ensure that a firewall is not blocking that port.
A couple of questions:
Are you running gpload as root? root generally does not have access permissions to the database. It needs to be run as gpadmin or a superuser.
The input file is in /home/root. If you are running as gpadmin, can gpadmin access this file? Permissions on the file?
Finally, does the target table exist in the database (sf_dfs.test_gpload)? Was it created and distributed across all segments? The error would seem to indicate the table is not there.
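One way to narrow this down before rerunning gpload: the segment hosts must be able to fetch the file from the gpfdist port on the gpload host. You can verify that path manually (an illustrative sketch using the host, port, and file from the question):

```
# On the gpload host (192.168.0.21), serve the input directory manually
gpfdist -d /home/root -p 8081 -l /tmp/gpfdist.log &

# From a segment host (e.g. 192.168.0.23), try to fetch the file
curl http://192.168.0.21:8081/test_input.txt
```

If the curl fails, the problem is network/firewall reachability or file permissions rather than gpload itself.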

How to configure GCS as filebeat input

We are storing our audit logs in a GCS bucket. We would like to ingest them into Elasticsearch when required (not regularly) using Filebeat. I have checked the S3 option, which lets us use S3-like storage as input via providers.
I'm using the following configuration, but it is not writing any data. When I test the Filebeat configuration it passes, yet the input stops working.
Here is the warning from the logs:
WARN [aws-s3] awss3/config.go:54 neither queue_url nor bucket_arn were provided, input aws-s3 will stop
INFO [crawler] beater/crawler.go:141 Starting input (ID: 17738867761700079737)
INFO [crawler] beater/crawler.go:108 Loading and starting Inputs completed. Enabled inputs: 1
INFO [input.aws-s3] compat/compat.go:111 Input aws-s3 starting {"id": "F62D1E3EA5C30879"}
INFO [input.aws-s3] compat/compat.go:124 Input 'aws-s3' stopped {"id": "F62D1E3EA5C30879"}
I suspect my input configuration is wrong in some way. Please check the following and help me understand what's wrong:
filebeat.inputs:
- type: aws-s3
  non_aws_bucket_name: test-bucket
  number_of_workers: 5
  bucket_list_interval: 300s
  access_key_id: xxxxx
  secret_access_key: xxxxxxxx
  endpoint: https://storage.googleapis.com
output.elasticsearch:
  hosts: "https://es-test-xxx.aivencloud.com"
  username: "avnadmin"
  password: "xxxxx"
  indices:
    - index: 'restore-test'

Filebeat & test inputs

I'm working on a Filebeat solution and I'm having a problem setting up my configuration. Let me explain my setup:
I have an app that produces a CSV file containing data that I want to input into Elasticsearch using Filebeat.
I'm using Filebeat 5.6.4 running on a Windows machine.
Provided below is my filebeat.yml configuration:
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - C:\App\fitbit-daily-activites-heart-rate-*.log
output.elasticsearch:
  hosts: ["http://esldemo.com:9200"]
  index: "fitbit-daily-activites-heartrate-%{+yyyy.MM.dd}"
setup.template:
  name: "fitbit-daily-activites-heartrate"
  pattern: "fitbit-daily-activites-heartrate-*"
  fields: "fitbit-heartrate-fields.yml"
  overwrite: false
  settings:
    index.number_of_shards: 1
    index.number_of_replicas: 0
And my data looks like this:
0,2018-12-13 00:00:02.000,66.0,$
1,2018-12-13 00:00:07.000,66.0,$
2,2018-12-13 00:00:12.000,67.0,$
3,2018-12-13 00:00:17.000,67.0,$
4,2018-12-13 00:00:27.000,67.0,$
5,2018-12-13 00:00:37.000,66.0,$
6,2018-12-13 00:00:52.000,66.0,$
I'm trying to figure out why my configuration is not picking up my data and outputting it to ElasticSearch. Please help.
There are some differences in the way you configure Filebeat in versions 5.6.X and in the 6.X branch.
For 5.6.X you need to configure your input like this:
filebeat.prospectors:
- input_type: log
  paths:
    - 'C:/App/fitbit-daily-activites-heart-rate-*.log'
You also need to put your path between single quotes and use forward slashes.
Filebeat 5.6.X configuration
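Putting the pieces together, a complete filebeat.yml for 5.6.4 might look like the sketch below. It carries the output and template settings over from the question unchanged (the answer above only addresses the input section), with just the input rewritten in the 5.6.X prospector style:

```yaml
filebeat.prospectors:
- input_type: log
  paths:
    - 'C:/App/fitbit-daily-activites-heart-rate-*.log'

output.elasticsearch:
  hosts: ["http://esldemo.com:9200"]
  index: "fitbit-daily-activites-heartrate-%{+yyyy.MM.dd}"

setup.template:
  name: "fitbit-daily-activites-heartrate"
  pattern: "fitbit-daily-activites-heartrate-*"
  fields: "fitbit-heartrate-fields.yml"
  overwrite: false
  settings:
    index.number_of_shards: 1
    index.number_of_replicas: 0
```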

Filebeat is processing all the logs instead of the specified application logs

I have an app server where I have configured Filebeat (through Chef) to extract the logs and publish them to Logstash (a separate ELK server), and subsequently to Elasticsearch and Kibana.
I have configured Filebeat to process logs only from /opt/app_logs/*.log, but it seems to be reading logs from other locations too: in the /etc/filebeat configuration directory I have filebeat.full.yml and other automatically generated yml files, and they seem to contain all those other file locations. Due to the resulting huge volume of logs, the Logstash service runs out of memory within minutes. How can I stop the other yml files from being autogenerated?
I tried to remove this file and also tried to comment out all the /var/log paths from the prospectors, but then Filebeat itself does not start.
filebeat.yml file:
filebeat:
  prospectors: []
  registry_file: "/var/lib/filebeat/registry"
  config_dir: "/etc/filebeat"
output:
  logstash:
    hosts:
      - elk_host:5044
    index: logstash-filebeat
shipper:
  name: serverA
  tags:
    - A
logging:
  to_files: 'true'
  files:
    path: "/var/log/filebeat"
    name: filebeat_log
    rotateeverybytes: '10485760'
  level: info
prospectors:
  - paths:
      - "/opt/app_logs/*.log"
    encoding: plain
    input_type: log
    ignore_older: 24h
The main problem with your configuration is that, for Filebeat 1.2.3, you have the prospectors list defined twice, and the second one is not in the correct location.
The second problem is that you have defined the config_dir as /etc/filebeat. config_dir is used to specify an additional directory where to look for config files. It should never be set to /etc/filebeat because this is where the main config file should be located. See https://stackoverflow.com/a/39987501/503798 for usage information.
A third problem is that you have used string types in to_files and rotateeverybytes. They should be boolean and integer types respectively.
Here's how the config should look for Filebeat 1.x.
filebeat:
  registry_file: "/var/lib/filebeat/registry"
  config_dir: "/etc/filebeat/conf.d"
  prospectors:
    - paths:
        - "/opt/app_logs/*.log"
      encoding: plain
      input_type: log
      ignore_older: 24h
output:
  logstash:
    hosts:
      - elk_host:5044
    index: logstash-filebeat
shipper:
  name: serverA
  tags:
    - A
logging:
  to_files: true
  files:
    path: "/var/log/filebeat"
    name: filebeat_log
    rotateeverybytes: 10485760
  level: info
I highly recommend that you upgrade to Filebeat 5.x because it has better configuration validation using filebeat -configtest.
