Promtail stops sending logs, with [debug] log "new file does not match glob" - Loki

My Promtail service is running on Windows Server. The service starts correctly, logs reach Loki and I can see them in Grafana, but after working fine for a few minutes, Promtail stops sending logs and the only messages I see on the console are:
level=info ts=2022-09-01T21:43:40.914632Z caller=filetargetmanager.go:177 msg="received file watcher event" name=C:\data\test\logs\positions.yaml-new op=CRE
level=debug ts=2022-09-01T21:43:40.9156041Z caller=filetargetmanager.go:383 msg="new file does not match glob" filename=C:\data\test\logs\positions.yaml-new
Here's my promtail config:
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  grpc_server_max_concurrent_streams: 50
  log_level: debug

positions:
  filename: C:\data\test\logs\positions.yaml
  ignore_invalid_yaml: true

clients:
  - url: https://loki.test.io/loki/api/v1/push
    tenant_id: classic
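The debug message itself just means the file watcher saw a file (positions.yaml-new, the temporary file Promtail writes when it rewrites its positions) inside a watched directory, and that file matches no scrape glob. Because the positions file lives in C:\data\test\logs, any glob that watches that folder will keep generating these events. A minimal sketch of a layout that avoids this; the scrape_configs section is hypothetical, since the original one isn't shown:

positions:
  # kept outside the watched log directory
  filename: C:\data\test\positions.yaml

scrape_configs:
  - job_name: windows_app_logs          # hypothetical job name
    static_configs:
      - targets:
          - localhost
        labels:
          job: app                      # illustrative label
          __path__: C:\data\test\logs\*.log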

Related

"No logs found" in grafana

I installed Loki, Grafana, and Promtail, and all three are running. On http://localhost:9080/targets, Ready is True, but the logs are not displayed in Grafana; the Explore section shows "No logs found".
promtail-local-config.yaml:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          host: ward_workstation
          agent: promtail
          __path__: D:/LOGs/*log
loki-local-config.yaml:
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093
How can I solve this problem?
Perhaps you are using Loki on Windows?
In your Promtail varlogs job, the path "D:/LOGs/*log" is wrong: you cannot access a Windows file from your Docker container directly.
You should mount your Windows directory into your Docker container like this:
promtail:
  image: grafana/promtail:2.5.0
  volumes:
    - D:/LOGs:/var/log
  command: -config.file=/etc/promtail/config.yml
  networks:
    - loki
Then everything will be OK.
Note that the config inside your Promtail container must reference the container-side path; you can adjust both sides so that the mount and the scrape path match.
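A minimal sketch of the matching scrape config inside the container (label values are illustrative, taken from the question); the point is that __path__ refers to the mount target /var/log, not the Windows path D:/LOGs:

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log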
Here's some general advice on how to debug Loki, based on the question's title:
(1) Check the Promtail logs.
If you discover errors such as error sending batch, you need to fix your Promtail configuration:
level=warn ts=2022-10-12T16:26:20.667560426Z caller=client.go:369 component=client host=monitor:3100 msg="error sending batch, will retry" status=-1 error="Post \"http://loki:3100/loki/api/v1/push\": dial tcp: lookup *Loki* on 10.96.0.10:53: no such host"
(2) Open the Promtail config page, http://localhost:3101/config, and check whether Promtail has read your given configuration.
(3) Open the Promtail targets page, http://localhost:3101/targets, and check:
whether your service is listed as Ready;
whether the log file contains the expected contents and is readable by Promtail. If you're using Docker or Kubernetes, I would log into the Promtail container and try to read the log file manually.
To the questioner's specific problem:
The questioner said that the services are shown as Ready on the targets page, so I recommend checking (1) the Promtail configuration and (3b) access to the log files (as Frank suggested).
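For completeness, the part of the Promtail configuration that typically needs fixing for an "error sending batch" like the one above is the clients block; a minimal sketch, where the hostname is a placeholder for whatever address actually resolves and is reachable from the machine or pod running Promtail:

clients:
  # The "no such host" warning means this hostname could not be resolved
  # from where Promtail runs; use a resolvable DNS name or an IP address.
  - url: http://loki.example.internal:3100/loki/api/v1/push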

OpenTelemetry Export Traces to Elastic APM and Elastic OpenDistro

I am trying to instrument my Python app (Django-based) so it can push transaction traces to Elastic APM, which I can later view using Trace Analytics in OpenDistro Elasticsearch.
I have tried the following
Method 1:
pip install opentelemetry-exporter-otlp
Then, in the manage.py file, I added the following code to send traces directly to Elastic APM.
span_exporter = OTLPSpanExporter(
    endpoint="http://localhost:8200",
    insecure=True
)
When I run the code I get the following error:
Transient error StatusCode.UNAVAILABLE encountered while exporting span batch, retrying in 1s.
Transient error StatusCode.UNAVAILABLE encountered while exporting span batch, retrying in 2s.
Method 2:
I tried using the OpenTelemetry Collector in between, since Method 1 didn't work.
I configured my collector in the following way:
extensions:
  memory_ballast:
    size_mib: 512
  zpages:
    endpoint: 0.0.0.0:55679

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  memory_limiter:
    # 75% of maximum memory up to 4G
    limit_mib: 1536
    # 25% of limit up to 2G
    spike_limit_mib: 512
    check_interval: 5s

exporters:
  logging:
    logLevel: debug
  otlp/elastic:
    endpoint: "198.19.11.22:8200"
    insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [logging, otlp/elastic]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [logging]
  extensions: [memory_ballast, zpages]
And I configured my code to send traces to the collector like this:
span_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",
    insecure=True
)
Once I start the program, I get the following error in the collector logs -
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
go.opentelemetry.io/collector@v0.35.0/exporter/exporterhelper/queued_retry.go:304
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
go.opentelemetry.io/collector@v0.35.0/exporter/exporterhelper/traces.go:116
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
go.opentelemetry.io/collector@v0.35.0/exporter/exporterhelper/queued_retry.go:155
go.opentelemetry.io/collector/exporter/exporterhelper/internal.ConsumerFunc.Consume
go.opentelemetry.io/collector@v0.35.0/exporter/exporterhelper/internal/bounded_queue.go:103
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*BoundedQueue).StartConsumersWithFactory.func1
go.opentelemetry.io/collector@v0.35.0/exporter/exporterhelper/internal/bounded_queue.go:82
2022-01-05T17:36:55.349Z error exporterhelper/queued_retry.go:304 Exporting failed. No more retries left. Dropping data. {"kind": "exporter", "name": "otlp/elastic", "error": "max elapsed time expired failed to push trace data via OTLP exporter: rpc error: code = Unavailable desc = connection closed", "dropped_items": 1}
What am I possibly missing here?
NOTE: I am using the latest versions of the OpenTelemetry SDK and API, and the latest version of the Collector.
Okay, so the way to get traces working with the Open Distro version of Elasticsearch is to avoid using the APM server itself.
OpenDistro provides a tool called Data Prepper, which must be used to send data (traces) from the OTel Collector to OpenDistro Elasticsearch.
Here is the configuration I used for the OTel Collector to send data to Data Prepper:
... # other configurations like receivers, etc.

exporters:
  logging:
    logLevel: debug
  otlp/data-prepper:
    endpoint: "http://<DATA_PREPPER_HOST>:21890"
    tls:
      insecure: true

... # Other configurations like pipelines, etc.
And this is how I configured Data Prepper to receive data from the Collector and send it to Elasticsearch:
entry-pipeline:
  delay: "100"
  source:
    otel_trace_source:
      ssl: false
  sink:
    - pipeline:
        name: "raw-pipeline"

raw-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  prepper:
    - otel_trace_raw_prepper:
  sink:
    - elasticsearch:
        hosts: [ "http://<ELASTIC_HOST>:9200" ]
        trace_analytics_raw: true

How to make Promtail read new logs written to a log file that was already read?

I have a very simple test setup. Data flow is as follows:
sample.log -> Promtail -> Loki -> Grafana
I am using this log file from Microsoft: sample log file download link
My Promtail config is as follows:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: C:\Users\user\Desktop\tmp\positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: testing_logging_a_log_file
    static_configs:
      - targets:
          - localhost
        labels:
          job: testing_logging_a_log_file_labels_job_what_even_is_this
          host: testing_for_signs_of_life_probably_my_computer_name
          __path__: C:\Users\user\Desktop\sample.log
  - job_name: testing_logging_a_log_file_with_no_timestamp_test_2
    static_configs:
      - targets:
          - localhost
        labels:
          job: actor_v2
          host: ez_change
          __path__: C:\Users\user\Desktop\Actors_2.txt
Loki config:
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 168h

storage_config:
  boltdb:
    directory: C:\Users\user\Desktop\tmp\loki\index
  filesystem:
    directory: C:\Users\user\Desktop\tmp\loki\chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: True
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
The sample files are read properly the first time. I can query WARN logs with: {host="testing_for_signs_of_life_probably_my_computer_name"} |= "WARN"
The problem arises when I manually add a new log line to the sample.log file (to emulate log lines being written to the file):
2012-02-03 20:11:56 SampleClass3 [WARN] missing id 42334089511
This new line is not visible in Grafana. Is there any particular config I need to know about to make this happen?
It was a problem with the network. If you remove the Loki port mapping and don't configure any network, you can access it by putting http://loki:3100 in your Grafana panel.
Yes, it's weird: when I append a line to an existing log file, it can't be seen in Grafana Explore. But... try doing it again and append one more line; now the previous line shows up in Grafana.
It happens when you use Notepad; it works well with Notepad++.

Loki not alerting Alertmanager

I am new to Loki and have made an alert in Loki, but I don't see any notification in Alertmanager. Loki is working fine (collecting logs) and so is Alertmanager (getting alerts from other sources), but the alerts from Loki don't get pushed to Alertmanager.
Loki config:
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1h       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 30s    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h            # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

compactor:
  working_directory: /loki/boltdb-shipper-compactor
  shared_store: filesystem

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

ruler:
  storage:
    type: local
    local:
      directory: etc/loki/rules
  rule_path: /etc/loki/
  alertmanager_url: http://171.11.3.160:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
Docker-compose Loki:
loki:
  image: grafana/loki:2.0.0
  container_name: loki
  ports:
    - "3100:3100"
  volumes:
    - ./loki/etc/local-config.yaml:/etc/loki/local-config.yaml
    - ./loki/etc/rules/rules.yaml:/etc/loki/rules/rules.yaml
  command:
    - '--config.file=/etc/loki/local-config.yaml'
Loki rules:
groups:
  - name: rate-alerting
    rules:
      - alert: HighLogRate
        expr: |
          count_over_time(({job="grafana"})[1m]) >=0
        for: 1m
Does anybody know what the problem is?
I got it working at last. Below is my ruler config:
ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rulestorage
  rule_path: /etc/loki/rules
  alertmanager_url: http://alertmanager:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
  enable_alertmanager_v2: true
Created the directories below:
/etc/loki/rulestorage/fake
/etc/loki/rules/fake
Copied alert_rules.yaml under /etc/loki/rulestorage/fake.
Gave the loki user full permissions under /etc/loki/rulestorage/fake.
Boom, it works.
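In a docker-compose setup like the questioner's, that layout can be wired up with volume mounts; a minimal sketch (the host-side paths are hypothetical):

loki:
  image: grafana/loki:2.0.0
  volumes:
    - ./loki/etc/local-config.yaml:/etc/loki/local-config.yaml
    # "fake" is the tenant directory Loki uses when auth_enabled is false
    - ./loki/rules/alert_rules.yaml:/etc/loki/rulestorage/fake/alert_rules.yaml
  command:
    - '--config.file=/etc/loki/local-config.yaml'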
The config looks good, similar to mine. I would troubleshoot it with the following steps:
Exec into the Docker container and check that the rules file is not empty: cat /etc/loki/rules/rules.yaml
Check the logs of Loki. When the rules are loaded properly, logs like this will pop up:
level=info ts=2021-05-06T11:18:33.355446729Z caller=module_service.go:58 msg=initialising module=ruler
level=info ts=2021-05-06T11:18:33.355538059Z caller=ruler.go:400 msg="ruler up and running"
level=info ts=2021-05-06T11:18:33.356584674Z caller=mapper.go:139 msg="updating rule file" file=/data/loki/loki-stack-alerting-rules.yaml
At runtime, Loki also logs info messages about your rule (I will show you the one I am running, slightly shortened; notice status=200 and the non-empty total_bytes=...):
level=info
ts=...
caller=metrics.go:83
org_id=...
traceID=...
latency=fast
query="sum(rate({component=\"kube-apiserver\"} |~ \"stderr F E.*failed calling webhook \\\"webhook.openpolicyagent.org\\\". an error on the server.*has prevented the request from succeeding\"[1m])) > 1"
query_type=metric
range_type=instant
length=0s
step=0s
duration=9.028961ms
status=200
throughput=40MB
total_bytes=365kB
Then make sure you can access Alertmanager at http://171.11.3.160:9093 from the Loki container without any issues (there could be a networking problem, or you may have set up basic authentication, etc.).
If the rule you set up (which you can test from the Grafana Explore window) exceeds the threshold you set for 1 minute, the alert should show up in Alertmanager. It will most likely be ungrouped, as you didn't add any labels to it; see the sketch below.
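For reference, a sketch of the questioner's rule extended with labels and annotations so Alertmanager can group and route it (the severity value and summary text are illustrative):

groups:
  - name: rate-alerting
    rules:
      - alert: HighLogRate
        expr: |
          count_over_time({job="grafana"}[1m]) >= 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High log rate for job {{ $labels.job }}"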

fluentd to elasticsearch via kubernetes-ingress

I have configured Elasticsearch on a Kubernetes cluster. In the application's Kubernetes cluster, I have Fluentd configured using THIS Helm chart, with the following parameters:
spec:
  containers:
    - env:
        - name: FLUENTD_ARGS
          value: --no-supervisor -q
        - name: OUTPUT_HOST
          value: x.x.x.x
        - name: OUTPUT_PORT
          value: "80"
        - name: OUTPUT_PATH
          value: /elastic
        - name: LOGSTASH_PREFIX
          value: logstash
        - name: OUTPUT_SCHEME
          value: http
        - name: OUTPUT_SSL_VERIFY
          value: "false"
        - name: OUTPUT_SSL_VERSION
          value: TLSv1_2
        - name: OUTPUT_TYPE_NAME
          value: _doc
        - name: OUTPUT_BUFFER_CHUNK_LIMIT
          value: 2M
        - name: OUTPUT_BUFFER_QUEUE_LIMIT
          value: "8"
        - name: OUTPUT_LOG_LEVEL
          value: info
In the Elasticsearch cluster I have an nginx-ingress controller configured, and I want Fluentd to send logs to Elasticsearch via this nginx ingress. In "OUTPUT_HOST" I am using the nginx-ingress public IP. In "OUTPUT_PORT" I have used "80", as nginx is listening on port 80.
I am getting the following error in fluentd:
2019-11-06 07:16:46 +0000 [warn]: [elasticsearch] failed to flush the buffer.
retry_time=40 next_retry_seconds=2019-11-06 07:17:18 +0000 chunk="596a7f6afffad60f2b28a5e13f"
error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster (
{:host=>\"x.x.x.x\", :port=>80, :scheme=>\"http\", :path=>\"/elastic\"}): [405]
{\"error\":\"Incorrect HTTP method for uri [/] and method [POST], allowed: [HEAD, GET, DELETE]\",\"status\":405}"
Looking at the log, I can guess that it is treating "/elastic" as an index.
As mentioned HERE, I used the annotation "nginx.ingress.kubernetes.io/rewrite-target: /", but the problem persists.
After this, I changed nginx-ingress to listen for calls at "/" instead of "/elastic", and changed "OUTPUT_PATH" in the Fluentd config too.
The error I was getting earlier is gone, but I would still like to use "/elastic" instead of "/". I am not sure what nginx config I need to change to achieve this. Please help me here.
After this, I got a "request entity too large" error, which was resolved by adding "nginx.ingress.kubernetes.io/proxy-body-size: 100m" to the annotations. By default it is 1m in nginx, while Fluentd's default buffer chunk is 2M, so it was bound to fail.
Now I am getting errors like:
2019-11-06 10:01:08 +0000 [warn]: dump an error event:
error_class=Fluent::Plugin::ConcatFilter::TimeoutError error="Timeout flush: kernel:default" location=nil tag="kernel"
time=2019-11-06 10:01:08.267224927 +0000 record=
{
"transport"=>"kernel",
"syslog_facility"=>"0",
"syslog_identifier"=>"kernel",
"boot_id"=>"6e4ca7b1c1a11b74151a12979",
"machine_id"=>"89436ac666fa120304f2077f3bf2",
"priority"=>"6",
"hostname"=>"gke-dev--node-pool",
"message"=>"
cbr0: port 9(vethe75a241b) entered disabled statedevice vethe75a241b left promiscuous mode
cbr0: port 9(vethe75a241b) entered disabled state
IPv6: ADDRCONF(NETDEV_UP): veth630f6cb0: link is not ready
IPv6: ADDRCONF(NETDEV_CHANGE): veth630f6cb0: link becomes ready
cbr0: port 9(veth630f6cb0) entered blocking state
cbr0: port 9(veth630f6cb0) entered disabled statedevice veth630f6
cb0 entered promiscuous mode
cbr0: port 9(veth630f6cb0) entered blocking state
cbr0: port 9(veth630f6cb0) entered forwarding state",
"source_monotonic_timestamp"=>"61153348174"
}
Regarding the nginx config: here is the official documentation regarding the Rewrite annotation. You can adjust it to your needs alongside OUTPUT_PATH in the Fluentd config, as you already mentioned.
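A sketch of an Ingress that keeps /elastic as the external prefix but strips it before proxying to Elasticsearch, combining the rewrite-target and proxy-body-size annotations discussed in the question (the service name, port, and apiVersion are assumptions):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: elasticsearch
  annotations:
    # /$2 forwards only the part captured after /elastic
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/proxy-body-size: 100m
spec:
  rules:
    - http:
        paths:
          - path: /elastic(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: elasticsearch
                port:
                  number: 9200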
Regarding the error event: the timeout flush just indicates that a flush has happened. Use timeout_label to process the entries for which the flush has occurred; it's usually better to dispatch the message than to emit an error event.
