June 25th, 2018 outage
Initial outage reported around 05:00 on the 25th: helpdesk/vta#176 (closed).
The logs were saturated with heartbeat checks from every storage node, and cephmon was marking OSDs down even though they were still up and running. Roughly 70% of the storage was marked down, and because the heartbeat traffic was saturating the network, everyone trying to connect to and use VTA was hitting timeouts.
The failures appear to start when the OSDs are heartbeating their peers, not when they are talking to cephmon.
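As a sanity check during this kind of flapping, it helps to compare what the cluster map says with what is actually running on the storage nodes. A minimal sketch (osd.22 and its host are just examples taken from the logs below):

```
# On a monitor: what does the cluster think is down?
ceph -s
ceph health detail | grep -i down
ceph osd tree | grep -w down

# On the storage node hosting one of the "down" OSDs: is the daemon actually running?
systemctl status ceph-osd@22
```

In this case the daemons were running while the monitors kept marking them down. The heartbeat failures in osd.22's log looked like this: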
```2018-06-25 11:32:20.389150 7fb677046700 -1 osd.22 52413 heartbeat_check: no reply from 10.30.6.23:6825 osd.81 since back 2018-06-25 11:31:49.437379 front 2018-06-25 11:32:20.366564 (cutoff 2018-06-25 11:32:00.389114)
2018-06-25 11:32:20.389158 7fb677046700 -1 osd.22 52413 heartbeat_check: no reply from 10.30.6.23:6819 osd.83 since back 2018-06-25 11:32:16.960584 front 2018-06-25 11:31:49.437379 (cutoff 2018-06-25 11:32:00.389114)
2018-06-25 11:32:20.389165 7fb677046700 -1 osd.22 52413 heartbeat_check: no reply from 10.30.6.30:6803 osd.115 since back 2018-06-25 11:32:20.366564 front 2018-06-25 11:31:35.776403 (cutoff 2018-06-25 11:32:00.389114)
2018-06-25 11:32:20.389174 7fb677046700 -1 osd.22 52413 heartbeat_check: no reply from 10.30.6.31:6823 osd.116 since back 2018-06-25 11:32:20.366564 front 2018-06-25 11:31:35.776403 (cutoff 2018-06-25 11:32:00.389114)
2018-06-25 11:32:20.389181 7fb677046700 -1 osd.22 52413 heartbeat_check: no reply from 10.30.6.23:6820 osd.121 since back 2018-06-25 11:31:49.437379 front 2018-06-25 11:32:16.960584 (cutoff 2018-06-25 11:32:00.389114)
2018-06-25 11:32:20.389189 7fb677046700 -1 osd.22 52413 heartbeat_check: no reply from 10.30.6.28:6835 osd.183 since back 2018-06-25 11:32:20.366564 front 2018-06-25 11:31:35.776403 (cutoff 2018-06-25 11:32:00.389114)
2018-06-25 11:32:20.389197 7fb677046700 -1 osd.22 52413 heartbeat_check: no reply from 10.30.6.21:6812 osd.187 since back 2018-06-25 11:32:20.366564 front 2018-06-25 11:31:49.437379 (cutoff 2018-06-25 11:32:00.389114)
2018-06-25 11:32:20.389205 7fb677046700 -1 osd.22 52413 heartbeat_check: no reply from 10.30.6.22:6802 osd.206 since back 2018-06-25 11:31:57.043634 front 2018-06-25 11:31:57.043634 (cutoff 2018-06-25 11:32:00.389114)
2018-06-25 11:32:20.389214 7fb677046700 -1 osd.22 52413 heartbeat_check: no reply from 10.30.6.22:6819 osd.271 since back 2018-06-25 11:32:16.960584 front 2018-06-25 11:31:57.043634 (cutoff 2018-06-25 11:32:00.389114)
2018-06-25 11:32:20.389223 7fb677046700 -1 osd.22 52413 heartbeat_check: no reply from 10.30.6.29:6811 osd.363 since back 2018-06-25 11:31:35.776403 front 2018-06-25 11:32:20.366564 (cutoff 2018-06-25 11:32:00.389114)
2018-06-25 11:32:20.389230 7fb677046700 -1 osd.22 52413 heartbeat_check: no reply from 10.30.6.29:6813 osd.370 since back 2018-06-25 11:31:35.776403 front 2018-06-25 11:32:20.366564 (cutoff 2018-06-25 11:32:00.389114)
2018-06-25 11:32:21.343545 7fb67b838700 0 -- 10.40.6.34:6810/5008167 >> 10.40.6.29:6814/5007388 conn(0x561757b4d800 :-1 s=STATE_OPEN pgs=2373 cs=1 l=0).fault initiating reconnect
2018-06-25 11:32:21.349505 7fb67b037700 0 -- 10.40.6.34:6810/5008167 >> 10.40.6.29:6804/5006630 conn(0x5617661e0000 :-1 s=STATE_OPEN pgs=2467 cs=1 l=0).fault initiating reconnect
2018-06-25 11:32:21.351155 7fb67b037700 0 -- 10.40.6.34:6810/5008167 >> 10.40.6.29:6804/5006630 conn(0x5617661e0000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=2467 cs=2 l=0).handle_connect_reply connect got RESETSESSION
2018-06-25 11:32:21.359033 7fb66d032700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.22 down, but it is still running
2018-06-25 11:32:21.359052 7fb66d032700 0 log_channel(cluster) log [DBG] : map e52416 wrongly marked me down at e52416
2018-06-25 11:32:21.359059 7fb66d032700 0 osd.22 52416 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
2018-06-25 11:32:21.359066 7fb66d032700 1 osd.22 52416 start_waiting_for_healthy```
Then, after it tries to remap its placement groups, it eventually errors out and shuts itself down:
```2018-06-25 11:32:21.406221 7fb666825700 1 osd.22 pg_epoch: 52416 pg[2.2f6( v 39962'243 (0'0,39962'243] local-lis/les=52369/52370 n=163 ec=1628/1628 lis/c 52369/52369 les/c/f 52370/52370/0 52416/52416/52369) [29] r=-1 lpr=52416 pi=[52369,52416)/1 luod=0'0 crt=39962'243 lcod 0'0 active] start_peering_interval up [29,22,228] -> [29], acting [29,22,228] -> [29], acting_primary 29 -> 29, up_primary 29 -> 29, role 1 -> -1, features acting 2305244844532236283 upacting 2305244844532236283
2018-06-25 11:32:21.406283 7fb666825700 1 osd.22 pg_epoch: 52416 pg[2.2f6( v 39962'243 (0'0,39962'243] local-lis/les=52369/52370 n=163 ec=1628/1628 lis/c 52369/52369 les/c/f 52370/52370/0 52416/52416/52369) [29] r=-1 lpr=52416 pi=[52369,52416)/1 crt=39962'243 lcod 0'0 unknown NOTIFY] state<Start>: transitioning to Stray
2018-06-25 11:32:21.406383 7fb66d032700 0 osd.22 52416 _committed_osd_maps shutdown OSD via async signal
2018-06-25 11:32:21.406531 7fb65e815700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
2018-06-25 11:32:21.406555 7fb65e815700 -1 received signal: Interrupt from PID: 0 task name: <unknown> UID: 0
2018-06-25 11:32:21.406559 7fb65e815700 -1 osd.22 52416 *** Got signal Interrupt ***
2018-06-25 11:32:21.406567 7fb65e815700 0 osd.22 52416 prepare_to_stop starting shutdown
2018-06-25 11:32:21.406570 7fb65e815700 -1 osd.22 52416 shutdown
2018-06-25 11:32:23.546805 7fb65e815700 1 bluestore(/var/lib/ceph/osd/ceph-22) umount
2018-06-25 11:32:23.676285 7fb65e815700 1 stupidalloc 0x0x561752b28070 shutdown
2018-06-25 11:32:23.678649 7fb65e815700 1 freelist shutdown
2018-06-25 11:32:23.678691 7fb65e815700 4 rocksdb: [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling all background work
2018-06-25 11:32:23.687361 7fb65e815700 4 rocksdb: [/build/ceph-12.2.5/src/rocksdb/db/db_impl.cc:343] Shutdown complete
2018-06-25 11:32:23.748882 7fb65e815700 1 bluefs umount
2018-06-25 11:32:23.748895 7fb65e815700 1 stupidalloc 0x0x561752b29960 shutdown
2018-06-25 11:32:23.748898 7fb65e815700 1 stupidalloc 0x0x561752b298f0 shutdown
2018-06-25 11:32:23.748991 7fb65e815700 1 bdev(0x561752775200 /dev/nvme3n1p5) close
2018-06-25 11:32:23.931091 7fb65e815700 1 bdev(0x561752775440 /var/lib/ceph/osd/ceph-22/block) close
2018-06-25 11:32:24.103045 7fb65e815700 1 bdev(0x561752774fc0 /var/lib/ceph/osd/ceph-22/block) close```
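The `marked down 6 > osd_max_markdown_count 5` line above is why the OSD shut itself down: the monitors had marked it down too many times within the markdown window, so it stopped on purpose. A sketch of inspecting those thresholds, and of the usual anti-flapping switch, assuming the default admin socket on the OSD's host:

```
# On the storage node hosting osd.22: what are its markdown thresholds?
ceph daemon osd.22 config get osd_max_markdown_count
ceph daemon osd.22 config get osd_max_markdown_period

# While debugging flapping, stop the monitors from marking OSDs down at all...
ceph osd set nodown
# ...and remember to clear the flag once the root cause is fixed
ceph osd unset nodown
```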
All of this pointed to a network issue, but I could not find anything wrong with the network. After some more digging I found that the ceph-mgr had exceeded its open file limit and was failing:
```os_release_parse - failed to open /etc/os-release: (24) Too many open files
-10> 2018-06-25 12:16:41.491341 7f3f537f9680 -1 distro_detect - /etc/os-release is required```
The ceph-mgr log around that time shows the daemon losing the active role, respawning, and the respawned process immediately running into the open file limit and asserting:
```2018-06-25 12:16:40.453733 7f02a74f6700 1 mgr send_beacon standby
2018-06-25 12:16:41.433024 7f02aa4fc700 1 mgr handle_mgr_map Activating!
2018-06-25 12:16:41.433473 7f02aa4fc700 1 mgr handle_mgr_map I am now activating
2018-06-25 12:16:41.434222 7fb4f2dd6700 -1 mgr handle_mgr_map I was active but no longer am
2018-06-25 12:16:41.434299 7fb4f2dd6700 1 mgr respawn e: 'ceph-mgr'
2018-06-25 12:16:41.434344 7fb4f2dd6700 1 mgr respawn 0: 'ceph-mgr'
2018-06-25 12:16:41.434386 7fb4f2dd6700 1 mgr respawn 1: '-i'
2018-06-25 12:16:41.434426 7fb4f2dd6700 1 mgr respawn 2: 'cephmon-0'
2018-06-25 12:16:41.434516 7fb4f2dd6700 1 mgr respawn respawning with exe /usr/bin/ceph-mgr
2018-06-25 12:16:41.434561 7fb4f2dd6700 1 mgr respawn exe_path /proc/self/exe
2018-06-25 12:16:41.470413 7f3f537f9680 0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 9087
2018-06-25 12:16:41.474413 7f3f537f9680 0 pidfile_write: ignore empty --pid-file
2018-06-25 12:16:41.482594 7f3f4db70700 -1 NetHandler create_socket couldn't create socket (24) Too many open files
2018-06-25 12:16:41.491235 7f3f537f9680 1 mgr send_beacon standby
2018-06-25 12:16:41.491321 7f3f537f9680 -1 os_release_parse - failed to open /etc/os-release: (24) Too many open files
2018-06-25 12:16:41.491341 7f3f537f9680 -1 distro_detect - /etc/os-release is required
2018-06-25 12:16:41.491344 7f3f537f9680 -1 distro_detect - can't detect distro
2018-06-25 12:16:41.491349 7f3f537f9680 -1 distro_detect - can't detect distro_description
2018-06-25 12:16:41.497327 7f3f537f9680 -1 /build/ceph-12.2.5/src/global/signal_handler.cc: In function 'SignalHandler::SignalHandler()' thread 7f3f537f9680 time 2018-06-25 12:16:41.491465
/build/ceph-12.2.5/src/global/signal_handler.cc: 235: FAILED assert(r == 0)```
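The pid checked below is that of the running ceph-mgr on the monitor host. A sketch of finding it (the systemd instance name is an assumption based on the hostname in the respawn arguments above):

```
# Find the ceph-mgr pid on the monitor host
pgrep -a ceph-mgr
# or via systemd
systemctl status ceph-mgr@cephmon-0 | grep 'Main PID'
```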
Limits for the ceph-mgr process (pid 9113):
```root@cephmon-0:~# cat /proc/9113/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             125601               125601               processes
Max open files            8192                 8192                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       125601               125601               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us```
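To confirm the daemon was actually at that ceiling rather than just configured low, the live file descriptor count can be compared against the soft limit. A minimal sketch, reusing the pid from above:

```
# Number of file descriptors ceph-mgr currently has open
ls /proc/9113/fd | wc -l

# The soft and hard "Max open files" limits it is running with
awk '/Max open files/ {print $4, $5}' /proc/9113/limits
```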
Commits d0be77e7 and 4432ba27 resolved the issue by raising the open file limit to 500000 on all 3 cephmons.
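The actual change lives in those commits; for reference, raising the limit for a systemd-managed ceph-mgr generally looks like the sketch below (the drop-in path and unit instance are assumptions, not necessarily what the commits do):

```
# Drop-in override raising the open file limit for the ceph-mgr unit
mkdir -p /etc/systemd/system/ceph-mgr@.service.d
cat > /etc/systemd/system/ceph-mgr@.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=500000
EOF

systemctl daemon-reload
systemctl restart ceph-mgr@cephmon-0
```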