Using Monitoring for Great Good

I was recently reminded of the importance of a good monitoring solution. For the main server in my home network I use Netdata, an open-source monitoring tool. I’ve been very happy with it from the beginning. It’s easy to install, configure, and maintain. It shows a ton of information. And it has a very handy feature:

A health monitoring system with built in notifications.

I’m using a notification system that uses Email and also a Telegram bot. The bot is especially handy, since I get the notifications pretty much instantly when an alarm is raised, and I receive notifications on the go. One day I received some puzzling alarms.

Screenshot of 3 Telegram messages. The first one (warning) says the ratio of dropped packets is 0.29%. The second one warns about the disk backlog being 2112 ms. The third is an error, and says that the ratio of dropped packets has gone to 11%.

The alarms in question

These three messages in succession were strange. Normally I get some packet loss and disk backlog on the system. I use it as a NAS, and this can happen when transferring large files. But the last message is really strange. 11% packet loss? That’s huge! So I started wondering what it was. I logged into the Netdata web interface. The CPU graph immediately caught my eye.

A netdata CPU graph. The graph has very obvious spikes that happen every 10-20 seconds.

Netdata CPU graph

What were these strange spikes? I thought the best Idea would be to just login to the server and check it locally. So I fired up htop (still the best task manager) and just waited for the spike. Unfortunately, since the spikes are very short, I wasn’t able to take a screenshot. But the culprit was soon found. It was Factorio!

Factorio is a survival/management game and I occasionally play it with some friends of mine, hence the server.

After I found out that Factorio was the culprit, I looked into it a bit further. With journalctl -u -f factorio.service I looked at the live log of the service.

31 04:01:15 stern systemd[1]: factorio.service: Scheduled restart job, restart counter is at 3474.
Dec 31 04:01:15 stern systemd[1]: Stopped Factorio game headless server.
Dec 31 04:01:15 stern systemd[1]: Started Factorio game headless server.
Dec 31 04:01:15 stern factorio[100820]:    0.000 2020-12-31 04:01:15; Factorio 1.0.0 (build 54889, linux64, headless)
Dec 31 04:01:15 stern factorio[100820]:    0.000 Operating system: Linux
Dec 31 04:01:15 stern factorio[100820]:    0.000 Program arguments: "/usr/bin/factorio" "--server-settings" "/etc/factorio/server-settings.json" "--use-server-whitelist" "--server-whitelist" "/etc/factorio/server-whitelist.json" "--server-adminlist" "/etc/factorio/server-adminlist.json" "--start-server-load-latest"
Dec 31 04:01:15 stern factorio[100820]:    0.000 Read data path: /usr/share/factorio
Dec 31 04:01:15 stern factorio[100820]:    0.000 Write data path: /var/lib/factorio/.factorio [82445/117495MB]
Dec 31 04:01:15 stern factorio[100820]:    0.000 Binaries path: /usr
Dec 31 04:01:15 stern factorio[100820]:    0.010 System info: [CPU: Intel(R) Celeron(R) J4105 CPU @ 1.50GHz, 4 cores, RAM: 7607 MB]
Dec 31 04:01:15 stern factorio[100820]:    0.010 Environment: DISPLAY=<unset> WAYLAND_DISPLAY=<unset> DESKTOP_SESSION=<unset> XDG_SESSION_DESKTOP=<unset> XDG_CURRENT_DESKTOP=<unset> __GL_FSAA_MODE=<unset> __GL_LOG_MAX_ANISO=<unset> __GL_SYNC_TO_VBLANK=<unset> __GL_SORT_FBCONFIGS=<unset> __GL_YIELD=<unset>
Dec 31 04:01:15 stern factorio[100820]:    0.010 Running in headless mode
Dec 31 04:01:15 stern factorio[100820]:    0.014 Loading mod core 0.0.0 (data.lua)
Dec 31 04:01:15 stern factorio[100820]:    0.122 Loading mod base 1.0.0 (data.lua)
Dec 31 04:01:16 stern factorio[100820]:    0.691 Loading mod base 1.0.0 (data-updates.lua)
Dec 31 04:01:16 stern factorio[100820]:    0.974 Checksum for core: 2630831588
Dec 31 04:01:16 stern factorio[100820]:    0.974 Checksum of base: 3509992273
Dec 31 04:01:17 stern factorio[100820]:    1.347 Prototype list checksum: 3301461508
Dec 31 04:01:17 stern factorio[100820]:    1.444 Info PlayerData.cpp:68: Local player-data.json available, timestamp 1609383665
Dec 31 04:01:17 stern factorio[100820]:    1.444 Info PlayerData.cpp:75: Cloud player-data.json unavailable
Dec 31 04:01:17 stern factorio[100820]:    1.447 Factorio initialised
Dec 31 04:01:17 stern factorio[100820]:    1.448 Info HttpSharedState.cpp:54: Downloading https://auth.factorio.com/api-login?api_version=4
Dec 31 04:01:17 stern factorio[100820]:    2.217 Info AuthServerConnector.cpp:41: Auth server returned error: {"data":{},"error":"login-insufficient-membership","message":"Your account doesn't own the game.  Please buy it to log in.  If you are playing through Steam, please check your Steam account is linked to your Factorio account"}
Dec 31 04:01:17 stern factorio[100820]:    2.217 Error CommandLineMultiplayer.cpp:345: Hosting multiplayer game failed: Your account doesn't own the game. Please buy it to log in.
Dec 31 04:01:17 stern factorio[100820]:    2.223 Info ServerMultiplayerManager.cpp:136: Quitting multiplayer connection.
Dec 31 04:01:17 stern factorio[100820]:    2.223 Info ServerMultiplayerManager.cpp:769: updateTick(4294967295) changing state from(Ready) to(Closed)
Dec 31 04:01:18 stern factorio[100820]:    2.288 Goodbye

Note the restart counter, which is currently at 3474! The log had millions of entries like this, and the service was restarting every few seconds. Apparently it was doing this for weeks/months now. The reason was that my Steam account wasn’t linked to my Factorio account anymore. After relinking, it worked again.

So, what did we learn from this experience. First of all, look at your systemd units, and how they are configured. The problem here was two-fold. First off, the AUR package for factorio-headless isn’t… ideal. It’s been out of date for years and does really weird things. The systemd unit it ships makes no sense in a lot of ways. IMHO, shipping a systemd unit for something like this makes no sense anyway. Anyone who sets up a Factorio server will likely have a specific way of configuring it. When I first installed the AUR package I had to change a lot to the systemd unit to make it work in my situation. I’ll include the unit below, maybe you can spot the mistake.

[Unit]
Description=Factorio game headless server
Documentation=http://www.factorio.com/
After=network.target

[Service]
User=factorio
Group=factorio
EnvironmentFile=/etc/conf.d/factorio
WorkingDirectory=/var/lib/factorio
ExecStart=/usr/bin/factorio --server-settings /etc/factorio/server-settings.json --use-server-whitelist --server-whitelist /etc/factorio/server-whitelist.json --server-adminlist /etc/factorio/server-adminlist.json --start-server-load-latest
TimeoutStopSec=30
KillSignal=SIGINT
RestartSec=10
Restart=on-failure

[Install]
WantedBy=multi-user.target

So, did you catch it? It’s line 15, Restart=on-failure. I’ve never been a big fan of autorestarting units, but for something like Factorio it makes no sense at all. It’s not a critical service. And if the service fails, I want to know why, and I don’t want it to just autorestart over and over. Which brings me to the second problem:

I didn’t notice. I just installed the service, configured it until it worked and didn’t give it much thought afterwards. This really goes to show the benefit of knowing how systemd works and writing your own units, instead of just copying other peoples work, who may or may not know what they are doing. Even as I am writing this article, I noticed another mistake in the service file.

The line After=network.target should be After=network-online.target. Here’s why.

To cite the page:

network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined. Its primary purpose is for ordering things properly at shutdown.”

network-online.target makes sure the network is actually up, and it is the one you want when you use a service that needs access to the internet (e.g. a Factorio server).

In this experience, I learned a few things:

  • Look at your systemd units (ideally configure them manually for things like this)
  • Monitoring tools are nice, Netdata is very nice
  • If your monitoring tools have alarms, configure them so you get them somewhere where you can actually see them

I have to add that I don’t know if the error was actually connected to Factorio, since the service had this problem for a few weeks/months. But the error is what made me curious, and made me take a closer look.

I hope you enjoyed this little piece. May it help you in your future endeavors!


Articles from blogs I read - Generated by openring