Syslog is your friend

Feb 25, 2026

I have a Proxmox cluster of miniPCs that I use to host... various things.

I love miniPCs. Using miniPCs in a homelab is not a new idea, and I've had sysadmin coworkers evangelizing the joy of Intel NUCs to me for ages. I've had a small pile of Lenovo ThinkCentre 1L PCs for a while.

I overhauled three of them a while back, slapping a 1 TB SATA SSD, a ~250 GB M.2 SSD, and a 2.5 GbE network card in each. It's not much, but it gets the job done for most of what I need. I'm hoping to bring up a bigger compute node soon, but for the time being they're what I've got.

I had an issue for several months after my overhaul where one of the nodes would occasionally hang. It'd drop off the network and wouldn't respond to anything when I attached a KVM. It seemed to happen randomly: it wasn't correlated with system load or uptime, and it didn't line up with any scheduled job. I ran memtest and made sure the RAM and everything else were seated properly.

I didn't have an offboard syslog collector running at the time, so I had to keep manually rebooting the node and standing in my freezing basement to do troubleshooting. After some googling I learned that journalctl, the tool for reading the systemd journal, can filter logs by boot (as long as the journal is being kept on disk across reboots)!

root@pve02:~# journalctl -b-9 | tail -n 10  
Nov 28 05:15:08 pve02 sshd[163247]: Received disconnect from 192.168.11.2 port 46112:11: disconnected by user  
Nov 28 05:15:08 pve02 sshd[163247]: Disconnected from user root 192.168.11.2 port 46112  
Nov 28 05:15:08 pve02 sshd[163247]: pam_unix(sshd:session): session closed for user root  
Nov 28 05:15:08 pve02 systemd[1]: session-216.scope: Deactivated successfully.  
Nov 28 05:15:08 pve02 systemd-logind[816]: Session 216 logged out. Waiting for processes to exit.  
Nov 28 05:15:08 pve02 systemd-logind[816]: Removed session 216.  
Nov 28 05:17:01 pve02 CRON[163837]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)  
Nov 28 05:17:01 pve02 CRON[163838]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)  
Nov 28 05:17:01 pve02 CRON[163837]: pam_unix(cron:session): session closed for user root  
Nov 28 05:19:34 pve02 smartd[812]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 65 to 66

That's the last ten lines of the now-minus-nine-th boot on the system. We see root disconnecting, probably normal cluster stuff (it runs operations as root), then a cronjob firing, a couple minutes go by, and then this SMART log about disk temp.

Nothing crazy incriminating here, no real smoking gun, though the disk temp is slightly high.
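Typing each boot offset by hand gets old, so the same check can be looped. This is a minimal sketch, not what I actually ran: `tail_boots` is a made-up helper name, and it assumes the journal is persisted (on most distros that means /var/log/journal exists), otherwise there are no older boots to look at. `journalctl --list-boots` shows which offsets are available.

```shell
# Print the last LINES lines of each of the previous MAX boots, oldest first.
# tail_boots is a hypothetical helper; MAX defaults to 9 and LINES to 4.
# Boot offsets that are not in the journal produce no output and are skipped.
tail_boots() {
  max="${1:-9}"
  lines="${2:-4}"
  i="$max"
  while [ "$i" -ge 1 ]; do
    # "-b -N" selects the Nth boot before the current one.
    if out=$(journalctl -b "-$i" --no-pager 2>/dev/null | tail -n "$lines") && [ -n "$out" ]; then
      printf '== boot -%s ==\n%s\n' "$i" "$out"
    fi
    i=$((i - 1))
  done
}
```

Running `tail_boots 9 4` as root would print a short block per boot, which makes it easier to eyeball what each one was doing right before it died.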

root@pve02:~# journalctl -b-8 | tail -n 4
Nov 29 16:45:22 pve02 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.  
Nov 29 16:45:22 pve02 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.  
Nov 29 16:45:22 pve02 systemd[1]: user-0.slice: Consumed 4.757s CPU time.  
Nov 29 16:48:52 pve02 smartd[832]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 68 to 69

root@pve02:~# journalctl -b-7 | tail -n 4
Nov 30 05:15:07 pve02 systemd-logind[835]: Removed session 160.  
Nov 30 05:17:01 pve02 CRON[71088]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)  
Nov 30 05:17:01 pve02 CRON[71089]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)  
Nov 30 05:17:01 pve02 CRON[71088]: pam_unix(cron:session): session closed for user root

root@pve02:~# journalctl -b-6 | tail -n 4 
Dec 01 01:30:18 pve02 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.  
Dec 01 01:30:18 pve02 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.  
Dec 01 01:30:18 pve02 systemd[1]: user-0.slice: Consumed 4.808s CPU time.  
Dec 01 01:34:41 pve02 smartd[839]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 69 to 68

root@pve02:~# journalctl -b-5 | tail -n 4  
Dec 28 12:45:31 pve02 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.  
Dec 28 12:45:31 pve02 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.  
Dec 28 12:45:31 pve02 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.  
Dec 28 12:45:31 pve02 systemd[1]: user-0.slice: Consumed 4.708s CPU time.

Huh. That last one looks like it could be a normal shutdown, but that's two more boots where the very last log line is about a high disk temperature. Moreover, this node physically sits between the other two on my rack. Maybe it's just overheating? ~70C is dangerously hot for an SSD (and the [SAT] tag means /dev/sda is a SATA device, not NVMe).
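To see the trend rather than just the final line, the smartd temperature messages can be pulled out of a boot's log. A rough sketch, assuming the lines look exactly like the excerpts above (classic syslog-style timestamp, new temperature as the last field); `smart_temps` is a hypothetical name.

```shell
# Read journal lines on stdin and print "Mon DD HH:MM:SS temp" for every
# smartd Temperature_Celsius change. Assumes the short syslog-style output
# format shown in the excerpts above; other journalctl output formats
# would need a different field layout.
smart_temps() {
  awk '/ smartd\[.*Temperature_Celsius changed from [0-9]+ to [0-9]+/ {
    print $1, $2, $3, $NF
  }'
}
```

Something like `journalctl -b -8 --no-pager | smart_temps` would then list every temperature change smartd logged during that boot, one per line.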

I grabbed a spare USB fan from a box somewhere, plugged it into the USB power output on my UPS, and pointed it at the offending node.

root@pve02:~# uptime
 22:05:05 up 45 days, 23:39,  1 user,  load average: 2.54, 2.31, 1.32

Well, that seems to have solved it!

Also, it turns out that's not what a normal shutdown looks like. This is:

root@pve02:~# journalctl -b-3 | tail -n 4
Dec 30 21:50:48 pve02 systemd-shutdown[1]: Syncing filesystems and block devices.  
Dec 30 21:50:48 pve02 systemd-shutdown[1]: Sending SIGTERM to remaining processes...  
Dec 30 21:50:48 pve02 systemd-journald[392]: Received SIGTERM from PID 1 (systemd-shutdow).  
Dec 30 21:50:48 pve02 systemd-journald[392]: Journal stopped

root@pve02:~# journalctl -b-2 | tail -n 4
Dec 30 21:58:09 pve02 systemd-shutdown[1]: Syncing filesystems and block devices.  
Dec 30 21:58:09 pve02 systemd-shutdown[1]: Sending SIGTERM to remaining processes...  
Dec 30 21:58:09 pve02 systemd-journald[375]: Received SIGTERM from PID 1 (systemd-shutdow).  
Dec 30 21:58:09 pve02 systemd-journald[375]: Journal stopped  
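
That difference is easy to check mechanically. A small sketch of the idea, under the assumption that a clean shutdown always ends with journald's "Journal stopped" line as it does above; it's a heuristic, and `shutdown_kind` is a name I made up for illustration.

```shell
# Classify one boot's log (read from stdin) by its final line.
# Heuristic, assumed from the excerpts above: a clean shutdown ends with
# systemd-journald logging "Journal stopped"; anything else is treated as
# a hang, crash, or power loss.
shutdown_kind() {
  last=$(tail -n 1)
  case "$last" in
    *"systemd-journald["*"Journal stopped"*) echo "clean" ;;
    *) echo "unclean" ;;
  esac
}

# Example use against real boots (not run here):
#   for i in 9 8 7 6 5 3 2; do
#     printf 'boot -%s: %s\n' "$i" "$(journalctl -b "-$i" --no-pager | shutdown_kind)"
#   done
```

By this rule the Dec 28 boot, which ends on a "Consumed ... CPU time." line, counts as unclean, same as the ones that end on a smartd temperature message.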