mmap: invalid argument #8114

Closed
zetaab opened this issue Oct 27, 2020 · 12 comments

@zetaab

zetaab commented Oct 27, 2020

What did you do? I am running the Prometheus Operator in a Kubernetes cluster. However, we see daily that somewhere in our clusters Prometheus starts to crash-loop. The reason is always the same: "mmap: invalid argument".

Example:

level=error ts=2020-10-27T07:35:48.369Z caller=main.go:798 err="opening storage failed: mmap files, file: /prometheus/chunks_head/000486: mmap: invalid argument"

What did you expect to see? I expect Prometheus not to crash this often.

What did you see instead? Under which circumstances?

Environment: Kubernetes

  • System information:

Running inside Kubernetes; the host OS is Debian Buster. The Docker image in use is quay.io/prometheus/prometheus:v2.22.0.

  • Prometheus version:

prometheus, version 2.22.0 (branch: HEAD, revision: 0a7fdd3)
build user: root@6321101b2c50
build date: 20201015-12:29:59
go version: go1.15.3
platform: linux/amd64

level=info ts=2020-10-27T07:40:56.356Z caller=main.go:353 msg="Starting Prometheus" version="(version=2.22.0, branch=HEAD, revision=0a7fdd3b76960808c3a91d92267c3d815c1bc354)"
level=info ts=2020-10-27T07:40:56.356Z caller=main.go:358 build_context="(go=go1.15.3, user=root@6321101b2c50, date=20201015-12:29:59)"
level=info ts=2020-10-27T07:40:56.356Z caller=main.go:359 host_details="(Linux 4.19.0-10-cloud-amd64 #1 SMP Debian 4.19.132-1 (2020-07-24) x86_64 prometheus-k8s-0 (none))"
level=info ts=2020-10-27T07:40:56.357Z caller=main.go:360 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-10-27T07:40:56.357Z caller=main.go:361 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-10-27T07:40:56.361Z caller=main.go:712 msg="Starting TSDB ..."
level=info ts=2020-10-27T07:40:56.361Z caller=web.go:516 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-10-27T07:40:56.362Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1602828000000 maxt=1603022400000 ulid=01EMY3MDASWV2NF64FEPK2JW5A
level=info ts=2020-10-27T07:40:56.363Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603022400000 maxt=1603216800000 ulid=01EN3X13TBHJWCCYBF0T3AW75Q
level=info ts=2020-10-27T07:40:56.363Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603216800000 maxt=1603281600000 ulid=01EN5TSPS9HDH7BXXDDCN5G2S7
level=info ts=2020-10-27T07:40:56.363Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603281600000 maxt=1603346400000 ulid=01EN7RK7M15AFDPGY7EPK2NK8F
level=info ts=2020-10-27T07:40:56.364Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603346400000 maxt=1603411200000 ulid=01EN9PCWS2GN4J19NTP1QTZM0E
level=info ts=2020-10-27T07:40:56.364Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603411200000 maxt=1603476000000 ulid=01ENBM68EWWPDYV9PCC5JSCH6E
level=info ts=2020-10-27T07:40:56.364Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603476000000 maxt=1603540800000 ulid=01ENDHZTC23A7PH45YZSFVHJNJ
level=info ts=2020-10-27T07:40:56.365Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603540800000 maxt=1603605600000 ulid=01ENFFS6EHJM2HGK3P512P52PY
level=info ts=2020-10-27T07:40:56.365Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603605600000 maxt=1603670400000 ulid=01ENHDJWJS2JFP9WT6TCHBBYNA
level=info ts=2020-10-27T07:40:56.365Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603670400000 maxt=1603735200000 ulid=01ENKBCFGNJSKRM07855BJ32DD
level=info ts=2020-10-27T07:40:56.365Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603756800000 maxt=1603764000000 ulid=01ENKZZ266YF8SS4QN3V377JW8
level=info ts=2020-10-27T07:40:56.365Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603735200000 maxt=1603756800000 ulid=01ENKZZ84YYFGZY7G8AB1WYHMM
level=info ts=2020-10-27T07:40:56.366Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603764000000 maxt=1603771200000 ulid=01ENM6TSDY7NNV4HF2TY22GCQD
level=info ts=2020-10-27T07:40:56.366Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603771200000 maxt=1603778400000 ulid=01ENMDPGNMGQ0039C7P1HYA0VX
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:585 msg="Stopping scrape discovery manager..."
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:599 msg="Stopping notify discovery manager..."
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:621 msg="Stopping scrape manager..."
level=info ts=2020-10-27T07:40:56.373Z caller=manager.go:924 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-10-27T07:40:56.374Z caller=manager.go:934 component="rule manager" msg="Rule manager stopped"
level=info ts=2020-10-27T07:40:56.374Z caller=notifier.go:601 component=notifier msg="Stopping notification manager..."
level=info ts=2020-10-27T07:40:56.374Z caller=main.go:789 msg="Notifier manager stopped"
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:615 msg="Scrape manager stopped"
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:595 msg="Notify discovery manager stopped"
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:581 msg="Scrape discovery manager stopped"
level=error ts=2020-10-27T07:40:56.375Z caller=main.go:798 err="opening storage failed: mmap files, file: /prometheus/chunks_head/000486: mmap: invalid argument"

We are running Chaos Monkey in our Kubernetes clusters, which means we test the high availability of our applications by regularly removing single nodes, twice a week. It looks like that can be one of the triggers, so it seems that Prometheus cannot handle such situations. Cloud-native applications should handle situations like this.

@roidelapluie
Member

Which storage are you using? Was the application running 2.22 before the restart?

@zetaab
Author

zetaab commented Oct 27, 2020

This Kubernetes cluster is running on OpenStack and we are using Cinder storage for the Prometheus volumes. The application was running 2.22 before the restart. I think the problem occurs when something happens to the virtual machine (Kubernetes node) where Prometheus is running.

@roidelapluie
Member

Probably /prometheus/chunks_head/000486 is an empty file. That should not happen with POSIX storage, as we create these files atomically.
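
For context on why an empty file produces this exact error: mmap(2) rejects a zero-length mapping with EINVAL, which Go reports as "invalid argument". A minimal sketch (standard library only, not Prometheus's actual mmap code) of what happens when a head chunk file is empty:

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Create an empty file, standing in for a truncated chunks_head segment.
	f, err := os.CreateTemp("", "chunk")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Mapping length 0 is rejected with EINVAL, i.e. "invalid argument".
	_, err = syscall.Mmap(int(f.Fd()), 0, int(fi.Size()), syscall.PROT_READ, syscall.MAP_SHARED)
	fmt.Println(err) // invalid argument
}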

@zetaab
Author

zetaab commented Oct 27, 2020

root@nodes-z1-1-edn-esptnl-telco-dev-k8s-local:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-7ba69015-5adc-4867-a25d-f93eb2074f06/globalmount/prometheus-db# ls -l chunks_head/000486
-rw-rw-r-- 1 debian 2000 0 Oct 27 07:00 chunks_head/000486

Yeah, it's an empty file. There are currently two ways we have been using to fix this issue: 1) delete all Prometheus data and start from scratch, or 2) delete this single chunks_head file. (In both cases we restart Prometheus afterwards.)

Could a solution be for Prometheus to handle these empty files automatically (e.g. just remove or ignore them)?
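
For anyone hitting the same crash loop, workaround 2 above can be narrowed down with a small helper that lists zero-byte files under chunks_head; a minimal sketch, assuming the data directory is mounted at /prometheus (adjust the path to your setup):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Zero-byte head chunk files are the ones that fail to mmap on startup.
	matches, err := filepath.Glob("/prometheus/chunks_head/*")
	if err != nil {
		panic(err)
	}
	for _, path := range matches {
		info, err := os.Stat(path)
		if err != nil {
			panic(err)
		}
		if info.Mode().IsRegular() && info.Size() == 0 {
			fmt.Println("empty head chunk file, candidate for deletion before restart:", path)
		}
	}
}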

@codesome
Member

Turns out the "atomic write" we did was not enough; a faulty disk combined with an abrupt shutdown can still cause this issue. The fix is already merged in master (#8061), which does a read repair. It might make sense to do a patch release v2.22.1 with it, @brancz WDYT? (Should we be patching older versions? This issue exists from v2.19.x onwards.)
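
For readers wondering what "read repair" means here: roughly, when Prometheus opens the m-mapped head chunk files at startup, files that are empty or too short to contain a valid header are dropped (or truncated) instead of aborting startup, and recent samples are then rebuilt from the WAL replay. The sketch below only illustrates that idea and is not the code from #8061:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// repairHeadChunkDir is an illustrative sketch of the repair-on-open idea:
// drop head chunk files that are too small to hold the 8-byte segment header
// (magic + version + padding) instead of failing with "mmap: invalid argument".
func repairHeadChunkDir(dir string) error {
	const headerSize = 8

	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() && info.Size() < headerSize {
			path := filepath.Join(dir, e.Name())
			fmt.Println("deleting corrupt head chunk file:", path)
			if err := os.Remove(path); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	if err := repairHeadChunkDir("/prometheus/chunks_head"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}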

@zetaab
Author

zetaab commented Oct 29, 2020

I am refreshing this issue many times per day, waiting for the fix :)

@codesome
Member

Should be fixed with #8061

@roidelapluie
Member

(Quoting @codesome:) It might make sense to do a patch release v2.22.1 with it, @brancz WDYT? (Should we be patching older versions? This issue exists from v2.19.x onwards.)

As 2.22 is small, I think we can fix it here; I don't feel the need for a 2.21 point release.

@zetaab
Author

zetaab commented Oct 29, 2020

@codesome Yeah, but we use prometheus-operator, so I would rather not compile a release myself.

@brancz
Member

brancz commented Oct 31, 2020

I'll put cutting a patch release on my list for next week.

@brian-brazil
Contributor

We might want to include #8104 as well, since the bug it fixes can cause scraping of Prometheus itself to fail.

@brian-brazil
Contributor

Fixed by #8061

wyb1 added a commit to wyb1/gardener that referenced this issue Nov 23, 2020
This should prevent the problem (mmap: invalid argument) outlined in this issue:
prometheus/prometheus#8114
prometheus locked this issue as resolved and limited conversation to collaborators on Nov 18, 2021.