mmap: invalid argument #8114

Closed
zetaab opened this issue Oct 27, 2020 · 12 comments

@zetaab

zetaab commented Oct 27, 2020

What did you do? I am running the Prometheus Operator in a Kubernetes cluster. However, we see daily that somewhere in our clusters Prometheus starts to crash-loop. The reason is always the same: "mmap: invalid argument".

Example:

level=error ts=2020-10-27T07:35:48.369Z caller=main.go:798 err="opening storage failed: mmap files, file: /prometheus/chunks_head/000486: mmap: invalid argument"

What did you expect to see? I expect Prometheus not to crash this often.

What did you see instead? Under which circumstances?

Environment: Kubernetes

  • System information:

Running inside Kubernetes; the host OS is Debian Buster. The Docker image in use is quay.io/prometheus/prometheus:v2.22.0.

  • Prometheus version:

prometheus, version 2.22.0 (branch: HEAD, revision: 0a7fdd3)
build user: root@6321101b2c50
build date: 20201015-12:29:59
go version: go1.15.3
platform: linux/amd64

level=info ts=2020-10-27T07:40:56.356Z caller=main.go:353 msg="Starting Prometheus" version="(version=2.22.0, branch=HEAD, revision=0a7fdd3b76960808c3a91d92267c3d815c1bc354)"
level=info ts=2020-10-27T07:40:56.356Z caller=main.go:358 build_context="(go=go1.15.3, user=root@6321101b2c50, date=20201015-12:29:59)"
level=info ts=2020-10-27T07:40:56.356Z caller=main.go:359 host_details="(Linux 4.19.0-10-cloud-amd64 #1 SMP Debian 4.19.132-1 (2020-07-24) x86_64 prometheus-k8s-0 (none))"
level=info ts=2020-10-27T07:40:56.357Z caller=main.go:360 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2020-10-27T07:40:56.357Z caller=main.go:361 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-10-27T07:40:56.361Z caller=main.go:712 msg="Starting TSDB ..."
level=info ts=2020-10-27T07:40:56.361Z caller=web.go:516 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-10-27T07:40:56.362Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1602828000000 maxt=1603022400000 ulid=01EMY3MDASWV2NF64FEPK2JW5A
level=info ts=2020-10-27T07:40:56.363Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603022400000 maxt=1603216800000 ulid=01EN3X13TBHJWCCYBF0T3AW75Q
level=info ts=2020-10-27T07:40:56.363Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603216800000 maxt=1603281600000 ulid=01EN5TSPS9HDH7BXXDDCN5G2S7
level=info ts=2020-10-27T07:40:56.363Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603281600000 maxt=1603346400000 ulid=01EN7RK7M15AFDPGY7EPK2NK8F
level=info ts=2020-10-27T07:40:56.364Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603346400000 maxt=1603411200000 ulid=01EN9PCWS2GN4J19NTP1QTZM0E
level=info ts=2020-10-27T07:40:56.364Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603411200000 maxt=1603476000000 ulid=01ENBM68EWWPDYV9PCC5JSCH6E
level=info ts=2020-10-27T07:40:56.364Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603476000000 maxt=1603540800000 ulid=01ENDHZTC23A7PH45YZSFVHJNJ
level=info ts=2020-10-27T07:40:56.365Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603540800000 maxt=1603605600000 ulid=01ENFFS6EHJM2HGK3P512P52PY
level=info ts=2020-10-27T07:40:56.365Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603605600000 maxt=1603670400000 ulid=01ENHDJWJS2JFP9WT6TCHBBYNA
level=info ts=2020-10-27T07:40:56.365Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603670400000 maxt=1603735200000 ulid=01ENKBCFGNJSKRM07855BJ32DD
level=info ts=2020-10-27T07:40:56.365Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603756800000 maxt=1603764000000 ulid=01ENKZZ266YF8SS4QN3V377JW8
level=info ts=2020-10-27T07:40:56.365Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603735200000 maxt=1603756800000 ulid=01ENKZZ84YYFGZY7G8AB1WYHMM
level=info ts=2020-10-27T07:40:56.366Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603764000000 maxt=1603771200000 ulid=01ENM6TSDY7NNV4HF2TY22GCQD
level=info ts=2020-10-27T07:40:56.366Z caller=repair.go:56 component=tsdb msg="Found healthy block" mint=1603771200000 maxt=1603778400000 ulid=01ENMDPGNMGQ0039C7P1HYA0VX
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:585 msg="Stopping scrape discovery manager..."
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:599 msg="Stopping notify discovery manager..."
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:621 msg="Stopping scrape manager..."
level=info ts=2020-10-27T07:40:56.373Z caller=manager.go:924 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-10-27T07:40:56.374Z caller=manager.go:934 component="rule manager" msg="Rule manager stopped"
level=info ts=2020-10-27T07:40:56.374Z caller=notifier.go:601 component=notifier msg="Stopping notification manager..."
level=info ts=2020-10-27T07:40:56.374Z caller=main.go:789 msg="Notifier manager stopped"
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:615 msg="Scrape manager stopped"
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:595 msg="Notify discovery manager stopped"
level=info ts=2020-10-27T07:40:56.373Z caller=main.go:581 msg="Scrape discovery manager stopped"
level=error ts=2020-10-27T07:40:56.375Z caller=main.go:798 err="opening storage failed: mmap files, file: /prometheus/chunks_head/000486: mmap: invalid argument"

We are running Chaos Monkey in our Kubernetes clusters, which means we test the high availability of our applications by regularly removing single nodes, twice a week. It looks like that can be one of the triggers, so it seems that Prometheus cannot handle such situations. Cloud-native applications should handle situations like this.

@roidelapluie
Member

Which storage are you using? Was the application running 2.22 before the restart?

@zetaab
Author

zetaab commented Oct 27, 2020

This Kubernetes cluster is running on OpenStack and we are using Cinder storage for the Prometheus volumes. The application was running 2.22 before the restart. I think the problem occurs when something happens to the virtual machine (Kubernetes node) where Prometheus is running.

@roidelapluie
Member

Probably /prometheus/chunks_head/000486 is an empty file. That should not happen with POSIX storage, as we create these files atomically.
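
For context on why an empty file produces this exact error: mmap(2) rejects a zero-length mapping with EINVAL, which Go reports as "invalid argument". A minimal sketch (standard library only, not Prometheus's actual mmap code) of what happens when a head chunk file is empty:

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Create an empty file, standing in for a truncated chunks_head segment.
	f, err := os.CreateTemp("", "chunk")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Mapping length 0 is rejected with EINVAL, i.e. "invalid argument".
	_, err = syscall.Mmap(int(f.Fd()), 0, int(fi.Size()), syscall.PROT_READ, syscall.MAP_SHARED)
	fmt.Println(err) // invalid argument
}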

@zetaab
Author

zetaab commented Oct 27, 2020

root@nodes-z1-1-edn-esptnl-telco-dev-k8s-local:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-7ba69015-5adc-4867-a25d-f93eb2074f06/globalmount/prometheus-db# ls -l chunks_head/000486
-rw-rw-r-- 1 debian 2000 0 Oct 27 07:00 chunks_head/000486

Yeah, it's an empty file. There are currently two ways we have been using to fix this issue: 1) delete all Prometheus data and start from scratch, or 2) delete this single chunks_head file. (In both cases we restart Prometheus afterwards.)

Could a solution be for Prometheus to handle these empty files automatically (e.g. just remove or ignore them)?
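
For anyone hitting the same crash loop, workaround 2 above can be narrowed down with a small helper that lists zero-byte files under chunks_head; a minimal sketch, assuming the data directory is mounted at /prometheus (adjust the path to your setup):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Zero-byte head chunk files are the ones that fail to mmap on startup.
	matches, err := filepath.Glob("/prometheus/chunks_head/*")
	if err != nil {
		panic(err)
	}
	for _, path := range matches {
		info, err := os.Stat(path)
		if err != nil {
			panic(err)
		}
		if info.Mode().IsRegular() && info.Size() == 0 {
			fmt.Println("empty head chunk file, candidate for deletion before restart:", path)
		}
	}
}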

@codesome
Member

Turns out the "atomic write" we did was not enough; a faulty disk combined with an abrupt shutdown can still cause this issue. The fix is already merged in master (#8061), which does a read repair. It might make sense to do a patch release v2.22.1 with it, @brancz WDYT? (Should we be patching older versions? This issue exists from v2.19.x onwards.)
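
For readers wondering what "read repair" means here: roughly, when Prometheus opens the m-mapped head chunk files at startup, files that are empty or too short to contain a valid header are dropped (or truncated) instead of aborting startup, and recent samples are then rebuilt from the WAL replay. The sketch below only illustrates that idea and is not the code from #8061:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// repairHeadChunkDir is an illustrative sketch of the repair-on-open idea:
// drop head chunk files that are too small to hold the 8-byte segment header
// (magic + version + padding) instead of failing with "mmap: invalid argument".
func repairHeadChunkDir(dir string) error {
	const headerSize = 8

	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() && info.Size() < headerSize {
			path := filepath.Join(dir, e.Name())
			fmt.Println("deleting corrupt head chunk file:", path)
			if err := os.Remove(path); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	if err := repairHeadChunkDir("/prometheus/chunks_head"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}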

@zetaab
Author

zetaab commented Oct 29, 2020

I am refreshing this issue many times per day, waiting for the fix :)

@codesome
Member

Should be fixed with #8061

@roidelapluie
Member

(Quoting @codesome:) It might make sense to do a patch release v2.22.1 with it, @brancz WDYT? (Should we be patching older versions? This issue exists from v2.19.x onwards.)

As 2.22 is small, I think we can fix it here; I don't feel the need for a 2.21 point release.

@zetaab
Author

zetaab commented Oct 29, 2020

@codesome Yeah, but we use prometheus-operator, so I would rather not compile a release myself.

@brancz
Member

brancz commented Oct 31, 2020

I'll put cutting a patch release on my list for next week.

@brian-brazil
Contributor

We might want to include #8104 as well, since the bug it fixes can cause scraping of Prometheus itself to fail.

@brian-brazil
Contributor

Fixed by #8061

wyb1 added a commit to wyb1/gardener that referenced this issue Nov 23, 2020
This should prevent the problem (mmap: invalid argument) outlined in this issue:
prometheus/prometheus#8114
prometheus locked this issue as resolved and limited conversation to collaborators on Nov 18, 2021.