Monit issue


Stephen Knight <sknight@...>
 

Hi guys,

I have a BOSH release which includes the Dante proxy server. The release
builds, and the code compiles, installs, and runs on the hosts. However, I
have an issue with Monit.

When I initially run a deploy, the machines build and the correct ports end
up listening (i.e. Dante has started on port 1081), yet BOSH first reports
this error during the update:

"""
Failed updating job socks > socks/0: `socks/0' is not running after update
(00:01:36)
Failed updating job socks (00:01:36)

Error 400007: `socks/1' is not running after update
"""

When I log on to the stemcell and run "monit summary":

"""
-bash-4.2# monit summary
The Monit daemon 5.2.4 uptime: 1m

Process 'socksd' Execution failed
Process 'stunnel' not monitored
System 'system_61693bef-3e13-4be5-bbde-90154639f452' running
"""

However, even with these errors, if I run lsof I can see the application running:

"""
-bash-4.2# lsof -i:1081
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
sockd 11124 root 7u IPv4 24814 0t0 TCP
61693bef-3e13-4be5-bbde-90154639f452:pvuniwien (LISTEN)
"""

I can't figure out what the issue might be. One suspicious thing does appear
in the logs, though; this message keeps recurring:

/var/vcap/sys/log/socksd/*.err.log

"""
Oct 22 10:47:23 (1445510843.376364) sockd[11314]: error: serverinit():
failed to bind internal addresses: Address already in use
Oct 22 10:47:23 (1445510843.376375) sockd[11314]: alert: mother[1/1]:
shutting down
Oct 22 10:48:33 (1445510913.391384) sockd[11364]: warning: checkconfig():
setting the unprivileged uid to 0 is not recommended for security reasons
Oct 22 10:48:33 (1445510913.391490) sockd[11364]: warning: bindinternal():
bind of address 192.168.100.111.1081 (address #1/1) for server to listen on
failed: Address already in use
Oct 22 10:48:33 (1445510913.391501) sockd[11364]: error: serverinit():
failed to bind internal addresses: Address already in use
Oct 22 10:48:33 (1445510913.391516) sockd[11364]: alert: mother[1/1]:
shutting down
"""

The strange thing is that nothing else is configured to run on this port. I
checked ctl_setup.sh and the socks startup script; I suspect the job might be
flapping somehow, but Monit is not giving any debug info that would help me
resolve the issue. Hoping for some advice - I've tried both the latest Ubuntu
and CentOS stemcells.
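
For what it's worth, this is roughly how I've been comparing what Monit
expects against what is actually running on the failing VM (the socksd.pid
path is my assumption; substitute whatever the ctl script really writes):

"""
# What does the job's monit file tell Monit to watch?
cat /var/vcap/jobs/socksd/monit

# Is a pid file actually being written where I expect?
ls -l /var/vcap/sys/run/socksd/

# Does the pid on disk match the process listening on 1081?
ps -fp "$(cat /var/vcap/sys/run/socksd/socksd.pid)"

# Monit's own log, which has a little more detail than "monit summary":
tail -n 50 /var/vcap/monit/monit.log
"""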

- *bosh gem versions*

[root@bosh-cli-01 socksd-boshrelease]# bosh -v
BOSH 1.3104.0

- *bosh director info*

[root@bosh-cli-01 socksd-boshrelease]# bosh status
Config
             /root/.bosh_config

Director
  Name       my-bosh
  URL        https://192.168.100.205:25555
  Version    1.3104.0 (00000000)
  User       admin
  UUID       eef3d294-5790-41eb-81e2-296cf3883c07
  CPI        cpi
  dns        disabled
  compiled_package_cache disabled
  snapshots  disabled

Deployment
  Manifest   /root/bosh-workspace/deployments/socksd-boshrelease/socksd.yml

- *stemcell version(s)*

bosh-vsphere-esxi-ubuntu-trusty-go_agent / 3104
bosh-vsphere-esxi-centos-7-go_agent / 3104

- *deployment manifest*

<%
director_uuid = 'eef3d294-5790-41eb-81e2-296cf3883c07'
deployment_name = 'socksd'
%>
---
name: <%= deployment_name %>
director_uuid: <%= director_uuid %>

releases:
- name: socksd
  version: latest

compilation:
  workers: 2
  network: default
  reuse_compilation_vms: false
  cloud_properties:
    preemptible: true
    cpu: 2
    ram: 2_048
    disk: 10_240

update:
  canaries: 0
  canary_watch_time: 30000-60000
  update_watch_time: 30000-60000
  max_in_flight: 32
  serial: false

networks:
- name: default
  type: manual
  subnets:
  - range: 192.168.100.0/24
    reserved: [192.168.100.2 - 192.168.100.110]
    gateway: 192.168.100.1
    cloud_properties:
      name: VLAN100
      tags:
      - bosh
      - socksd
    dns:
    - 192.168.100.1
    - 8.8.8.8

resource_pools:
- name: default
  network: default
  stemcell:
    name: bosh-vsphere-esxi-centos-7-go_agent
    version: latest
  cloud_properties:
    cpu: 1
    ram: 1_024
    disk: 10_240

jobs:
- name: socks
  templates:
  - name: socksd
  - name: stunnel
  instances: 2
  resource_pool: default
  persistent_disk: 10240
  networks:
  - name: default
    default: [dns, gateway]

properties:
  vsphere:
    host: xxx
    user: xxx
    password: xxx
    datacenters:
    - name: LAB
      vm_folder: bosh
      template_folder: Templates
      disk_path: bosh_disks
      datastore_pattern: '\AISCSI_SSD\z'
      persistent_datastore_pattern: '\AISCSI_SSD\z'
      clusters:
      - LAB: {resource_pool: LAB}

  director:
    max_threads: 3
  hm:
    resurrector_enabled: true
    resurrector:
      minimum_down_jobs: 5
      percent_threshold: 0.2
      time_threshold: 600
  ntp:
  - 0.asia.pool.ntp.org
  - 1.asia.pool.ntp.org

I suspect the issue is deep in Monit somewhere, so any advice on diagnosing
this sort of thing with self-made releases would help a lot.

Stephen


Dr Nic Williams
 

If I recall correctly, one possibility is that your monit file is referencing a pid file in a different location from where your ctl script/app is dropping the pid file.

So Monit starts the app and you drop a pid file. Monit can't see the pid file it expects, so it starts your app again. Now you have two processes vying for the same port/shared resource.
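
The fix is just to make the two agree. A minimal sketch of what I mean, with the caveat that the job name, paths, and the sockd invocation below are illustrative guesses rather than anything from your release:

"""
# /var/vcap/jobs/socksd/monit -- the path Monit polls:
check process socksd
  with pidfile /var/vcap/sys/run/socksd/socksd.pid
  start program "/var/vcap/jobs/socksd/bin/ctl start"
  stop program "/var/vcap/jobs/socksd/bin/ctl stop"
  group vcap
"""

and the matching part of the ctl script:

"""
# /var/vcap/jobs/socksd/bin/ctl, start case: write the pid to the
# exact path the monit file declares.
RUN_DIR=/var/vcap/sys/run/socksd
PIDFILE=$RUN_DIR/socksd.pid

mkdir -p $RUN_DIR
/var/vcap/packages/dante/sbin/sockd -f /var/vcap/jobs/socksd/config/sockd.conf &
echo $! > $PIDFILE
# Caveat: if sockd daemonizes itself, $! is the wrong (parent) pid and
# Monit will show this exact symptom; keep the process in the foreground here.
"""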



Gwenn Etourneau
 

Maybe you could share the source code of the BOSH release somewhere, especially the monit file and the startup script.

My guess is the same as Dr Nic's: Monit checks a pid file which is different from the one your application/script is writing, so Monit tries to restart your program again and again.
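
For example (the file names here are hypothetical, just to show the pattern):

"""
# What the monit file polls:
check process socksd with pidfile /var/vcap/sys/run/socksd/socksd.pid

# What the ctl script actually writes; note sockd.pid vs socksd.pid:
echo $! > /var/vcap/sys/run/socksd/sockd.pid
"""

Monit never finds the pid file it polls for, decides the process is down, and runs the start program again; the second sockd then fails to bind port 1081, which matches the "Address already in use" errors in your log.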