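Health Checks and Monitoring Metrics

Monitoring Metrics and Thresholds

Core Services

The core services are backend-server, room-server, fusion-server, and socket-server.

The health of the core services is monitored mainly through Kubernetes (K8s) health checks, by configuring liveness and readiness probes.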

backend-server health check:

Endpoint: /api/v1/actuator/health/liveness

  • Sample response: { "status":"UP" }
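A monitoring script can evaluate this payload much as a probe would. The sketch below is only an illustration: the function name `is_backend_alive` is hypothetical, and it assumes the body is the JSON shown above, with any other status or an unparsable body counted as not alive.

```python
import json


def is_backend_alive(body: str) -> bool:
    """Return True only when the liveness payload reports status UP."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A body that is not JSON is treated as a failed check.
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


print(is_backend_alive('{ "status":"UP" }'))
```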




room-server and fusion-server health check:

Endpoint: /actuator/health

Sample response:

{
    "success":true,
    "code":200,
    "message":"SUCCESS",
    "data":{
        "status":"ok",
        "info":{
            "database":{  // MySQL status
                "status":"up"
            },
            "memory_rss":{  // total physical memory used by the room-server service
                "status":"up",
                "totalMem":63240.13671875
            },
            "memory_heap":{ // room-server heap memory usage
                "status":"up",
                "memoryUsageMem":286.9375
            },
            "redis":{     // Redis status
                "status":"up"
            }
        },
        "error":{

        },
        "details":{
            "database":{
                "status":"up"
            },
            "memory_rss":{  
                "status":"up",
                "totalMem":63240.13671875
            },
            "memory_heap":{
                "status":"up",
                "memoryUsageMem":286.9375
            },
            "redis":{
                "status":"up"
            }
        }
    }
}
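A caller that wants more than the HTTP status code can inspect the per-component entries. The following is a minimal sketch, assuming the response has the shape shown above (without the inline annotations); the function names are hypothetical, and treating the union of `info` and `error` as the component list is an assumption rather than documented behavior.

```python
def unhealthy_components(response: dict) -> list:
    """List the components whose status is anything other than "up"."""
    data = response.get("data") or {}
    # Assumption: failing components may appear under "error" instead of "info".
    components = {**(data.get("info") or {}), **(data.get("error") or {})}
    return sorted(
        name for name, detail in components.items()
        if detail.get("status") != "up"
    )


def is_healthy(response: dict) -> bool:
    """Healthy means success, overall status ok, and no failing component."""
    data = response.get("data") or {}
    return (
        response.get("success") is True
        and data.get("status") == "ok"
        and not unhealthy_components(response)
    )


# Trimmed version of the sample response above.
sample = {
    "success": True,
    "code": 200,
    "data": {
        "status": "ok",
        "info": {"database": {"status": "up"}, "redis": {"status": "up"}},
        "error": {},
    },
}
print(is_healthy(sample), unhealthy_components(sample))
```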



socket-server health check:

Endpoint: /socket/health

Sample response:

{
    "status":"ok",  // overall status is ok
    "info":{
        "dns":{
            "status":"up"
        },
        "memory":{
            "status":"up",
            "rss":{
                "rssRatio":0,
                "rss":173719552,
                "sysTotal":66312089600,
                "status":"up"
            },
            "heap":{
                "heapUsedRatio":0.89,
                "heapUsed":90661608,
                "heapTotal":102019072,
                "status":"up"
            }
        },
        "redis":{
            "status":"up",
            "redisStatus":"ready"
        },
        "server":{
            "status":"up",
            "name":"socket-server/10.250.178.176"
        }
    },
    "error":{

    },
    "details":{
        "dns":{
            "status":"up"
        },
        "memory":{
            "status":"up",
            "rss":{
                "rssRatio":0,
                "rss":173719552,
                "sysTotal":66312089600,
                "status":"up"
            },
            "heap":{
                "heapUsedRatio":0.89,
                "heapUsed":90661608,
                "heapTotal":102019072,
                "status":"up"
            }
        },
        "redis":{
            "status":"up",
            "redisStatus":"ready"
        },
        "server":{
            "status":"up",
            "name":"socket-server/10.250.178.176"
        }
    }
}
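The ratio fields in this sample appear to be the used value divided by the total, rounded to two decimal places. That reading is inferred from the numbers above and is not confirmed by the service itself; the sketch below simply recomputes the ratios under that assumption, with a hypothetical function name.

```python
def memory_ratios(health: dict) -> dict:
    """Recompute rssRatio and heapUsedRatio from the raw byte counts."""
    memory = health["info"]["memory"]
    rss = memory["rss"]
    heap = memory["heap"]
    return {
        # Assumption: both ratios are rounded to two decimal places.
        "rssRatio": round(rss["rss"] / rss["sysTotal"], 2),
        "heapUsedRatio": round(heap["heapUsed"] / heap["heapTotal"], 2),
    }


# Byte counts copied from the sample response above.
sample = {
    "info": {
        "memory": {
            "rss": {"rss": 173719552, "sysTotal": 66312089600},
            "heap": {"heapUsed": 90661608, "heapTotal": 102019072},
        }
    }
}
print(memory_ratios(sample))
```

With the sample values, the recomputed ratios agree with the reported `rssRatio` of 0 and `heapUsedRatio` of 0.89.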


imageproxy-server health check:

Endpoint: /metrics

Sample response (Prometheus exposition format):

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 4.78e-05
go_gc_duration_seconds{quantile="0.25"} 5.16e-05
go_gc_duration_seconds{quantile="0.5"} 5.37e-05
go_gc_duration_seconds{quantile="0.75"} 6.79e-05
go_gc_duration_seconds{quantile="1"} 0.0020545
go_gc_duration_seconds_sum 12.353425121
go_gc_duration_seconds_count 31153
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 11
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.16.15"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.2697e+07
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 5.2396205184e+10
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.845734e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 7.2429544e+07
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 4.574895445222022e-06
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 1.1059832e+08
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.2697e+07
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 3.271155712e+09
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.4819328e+07
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 183879
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 3.268911104e+09
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 3.28597504e+09
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.6762895881006994e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 7.2613423e+07
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 19200
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 32768
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 231064
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 1.327104e+06
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 1.7165856e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 2.266226e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 2.359296e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 2.359296e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 3.404404488e+09
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 21
# HELP http_request_duration_seconds Request response times
# TYPE http_request_duration_seconds summary
http_request_duration_seconds_sum 3235.3575712299976
http_request_duration_seconds_count 4591
# HELP imageproxy_remote_fetch_errors_total Total remote image fetch errors
# TYPE imageproxy_remote_fetch_errors_total counter
imageproxy_remote_fetch_errors_total 0
# HELP imageproxy_requests_served_from_cache_total Number of requests served from cache.
# TYPE imageproxy_requests_served_from_cache_total counter
imageproxy_requests_served_from_cache_total 0
# HELP imageproxy_transformation_duration_seconds Time taken for image transformations in seconds.
# TYPE imageproxy_transformation_duration_seconds summary
imageproxy_transformation_duration_seconds_sum 506.80838228699974
imageproxy_transformation_duration_seconds_count 1622
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1186.6
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 10
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.4159872e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.67240040416e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 4.071796736e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
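Summary metrics such as `http_request_duration_seconds` expose a `_sum` and a `_count`, so a mean duration can be derived by dividing one by the other. The sketch below is a deliberately small parser, not a full implementation of the exposition format: it assumes one sample per line with no trailing timestamp, and the function names are hypothetical.

```python
def parse_samples(text: str) -> dict:
    """Map each sample name (including any label set) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def mean_seconds(samples: dict, metric: str):
    """Return sum/count for a summary metric, or None when count is zero."""
    count = samples.get(metric + "_count", 0.0)
    if count == 0:
        return None
    return samples[metric + "_sum"] / count


# Lines copied from the sample output above.
text = """
# TYPE http_request_duration_seconds summary
http_request_duration_seconds_sum 3235.3575712299976
http_request_duration_seconds_count 4591
imageproxy_transformation_duration_seconds_sum 506.80838228699974
imageproxy_transformation_duration_seconds_count 1622
"""
samples = parse_samples(text)
print(mean_seconds(samples, "http_request_duration_seconds"))
print(mean_seconds(samples, "imageproxy_transformation_duration_seconds"))
```

For the sample above this works out to roughly 0.70 seconds per request and 0.31 seconds per image transformation, averaged over the lifetime of the process.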


Database

TPS monitoring

Formula: (Com_commit + Com_rollback) / Uptime.


QPS monitoring

Formula: Queries / Uptime.
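The TPS and QPS formulas above can be sketched as follows. The input is assumed to be a dictionary built from MySQL `SHOW GLOBAL STATUS` output, where values arrive as strings; the function names and the counter values are hypothetical. Because `Uptime` counts seconds since the server started, both results are lifetime averages; a current rate would need the difference between two samples taken some interval apart.

```python
def _uptime_seconds(status: dict) -> int:
    uptime = int(status["Uptime"])
    if uptime <= 0:
        raise ValueError("Uptime must be positive")
    return uptime


def average_tps(status: dict) -> float:
    """(Com_commit + Com_rollback) / Uptime"""
    transactions = int(status["Com_commit"]) + int(status["Com_rollback"])
    return transactions / _uptime_seconds(status)


def average_qps(status: dict) -> float:
    """Queries / Uptime"""
    return int(status["Queries"]) / _uptime_seconds(status)


# Hypothetical counter values, kept as strings the way MySQL returns them.
status = {
    "Com_commit": "9000",
    "Com_rollback": "1000",
    "Queries": "50000",
    "Uptime": "1000",
}
print(average_tps(status), average_qps(status))  # prints "10.0 50.0"
```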


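CPU usage monitoring

CPU usage of the service process (200% means two CPU cores are in use). Unit: percent.

Threshold: below 40%.


Memory usage monitoring

Cluster memory usage (as a percentage of the operating system's total memory). Unit: percent.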



Prometheus monitoring rules


web-server monitoring

Data preview: kube_pod_status_phase{namespace = "<application namespace>",pod_name =~ "web-server-.*",phase=~"Pending|Unknown|Failed"} > 0
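Prometheus regex matchers (`=~`) are anchored at both ends, so a pattern has to cover the entire label value. The sketch below uses Python's `re.fullmatch` as a stand-in for that anchoring (Prometheus uses RE2, but these simple patterns behave the same way); the pod name is a made-up example and the helper name is hypothetical.

```python
import re


def matches_label(pattern: str, value: str) -> bool:
    """Mimic a fully anchored regex matcher such as PromQL's =~."""
    return re.fullmatch(pattern, value) is not None


pod = "web-server-7d9f8b6c5-x2k4q"  # hypothetical generated pod name

# "-*" only means "zero or more hyphens", so the suffix is not covered.
print(matches_label("web-server-*", pod))
# ".*" covers any suffix after the prefix.
print(matches_label("web-server-.*", pod))
# Alternation selects the phases of interest.
print(matches_label("Pending|Unknown|Failed", "Pending"))
```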





fusion-server status monitoring:

Condition: the number of fusion-server pods in the Pending or Unknown phase is greater than 0.





Pod restart alert threshold:

Policy: a pod restarts more than 3 times within 5 minutes.
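The policy above can be sketched as a check over a cumulative restart counter. This is an illustration under stated assumptions: the samples are taken to be (unix timestamp, cumulative restart count) pairs sorted by time, for example scraped from a counter such as kube-state-metrics' `kube_pod_container_status_restarts_total`; the function name is hypothetical, and counter resets are not handled.

```python
def restarted_too_often(samples, window_seconds=300, max_restarts=3):
    """True when the counter grew by more than max_restarts within the
    window that ends at the most recent sample."""
    if not samples:
        return False
    latest_ts, latest_count = samples[-1]
    # Keep only the samples that fall inside the look-back window.
    in_window = [(ts, n) for ts, n in samples if latest_ts - ts <= window_seconds]
    baseline = in_window[0][1]
    return latest_count - baseline > max_restarts


# Four restarts within four minutes exceed the threshold of three.
print(restarted_too_often([(0, 0), (60, 1), (120, 2), (180, 3), (240, 4)]))
```

Exactly three restarts inside the window do not trigger the alert, since the policy says "more than 3".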