NethServer stops working overnight

NethServer Version: 7.4.1708

Hello,

I have a fresh NethServer install, and during the night the server hangs and stops responding.
I don’t have physical access to the server, so someone has to reboot it for me in the morning.
The server is a new Dell PowerEdge T130 with four 1TB drives configured in software RAID1 (boot) and RAID10 (the rest), and 8GB RAM.
The server has the Samba server and router roles (DHCP, DNS, firewall, MultiWAN).
I tried to configure WAN failover and Shorewall crashed. Even now it says “Check firewall rules The firewall is NOT running”. The last time I tried to reboot the server remotely, it hung.
I would like to know where I should start troubleshooting.

Istvan

I disabled the 2nd WAN connection and Shorewall started.

Check the logs to see if you can catch something before the server hangs.
Also check for a hardware cause, like RAM.

Which log should I check?

Particularly /var/log/messages, but other logs could give you a clue too… Also check that your log does not contain only events from the boot.
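To see what was logged just before a hang, one option is to pull out the last few lines written before each boot. The sketch below is only an illustration: it assumes that every boot shows up in the log as a kernel banner line containing "Linux version", and the function name `lines_before_boots` and the sample lines are made up for the example.

```python
import re

# Assumption: each boot is marked by the kernel banner, for example
# "... kernel: Linux version 3.10.0-...". Adjust the pattern if your log differs.
BOOT_MARKER = re.compile(r"kernel: (\[[\s\d.]+\] )?Linux version ")


def lines_before_boots(lines, context=5):
    """For every boot banner (except one on the very first line),
    return the `context` lines that were logged right before it."""
    chunks = []
    for i, line in enumerate(lines):
        if i > 0 and BOOT_MARKER.search(line):
            chunks.append(lines[max(0, i - context):i])
    return chunks


# Hypothetical sample in the same format as /var/log/messages.
sample = [
    "Feb  2 23:59:01 server systemd: Started Session 10 of user root.",
    "Feb  3 00:18:19 server kernel: IPv4: martian source 46.97.27.250 from 46.97.27.249, on dev p3p1",
    "Feb  3 08:02:11 server kernel: Linux version 3.10.0-693.17.1.el7.x86_64",
    "Feb  3 08:02:11 server kernel: Command line: ro",
]

for number, chunk in enumerate(lines_before_boots(sample, context=2), 1):
    print(f"--- last lines before boot #{number} ---")
    print("\n".join(chunk))
```

To run it on the real log, read the file into a list first, for example with `open("/var/log/messages", errors="replace").read().splitlines()` (this normally needs root). If it returns nothing, the log probably covers only the current boot, which is the case mentioned above.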

There are a lot of messages like this before the hang:
Feb 3 00:18:19 server kernel: IPv4: martian source 46.97.27.250 from 46.97.27.249, on dev p3p1
Feb 3 00:18:19 server kernel: ll header: 00000000: ff ff ff ff ff ff 00 6c bc ef 5e 2e 08 06 …l…^…

These are related to the 2nd WAN, which I disabled to be able to start Shorewall.
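For reference, the two kinds of log lines quoted above can be summarized with a short script. This is a minimal sketch, not a NethServer tool: the function names are invented for the example, and it assumes the "ll header" dump is a plain 14-byte Ethernet header (destination MAC, source MAC, EtherType).

```python
import re
from collections import Counter

# The kernel prints: martian source <destination IP> from <source IP>, on dev <interface>
MARTIAN = re.compile(r"martian source (\S+) from (\S+), on dev (\S+)")


def count_martians(lines):
    """Count martian messages per (destination IP, source IP, interface)."""
    counts = Counter()
    for line in lines:
        match = MARTIAN.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


def decode_ll_header(hex_bytes):
    """Split the first 14 bytes of an Ethernet header dump into
    destination MAC, source MAC and EtherType."""
    parts = hex_bytes.split()[:14]
    if len(parts) < 14:
        raise ValueError("need at least 14 header bytes")
    dst = ":".join(parts[0:6])
    src = ":".join(parts[6:12])
    ethertype = "0x" + parts[12] + parts[13]
    return dst, src, ethertype


log = [
    "Feb 3 00:18:19 server kernel: IPv4: martian source 46.97.27.250 from 46.97.27.249, on dev p3p1",
]
for (dst_ip, src_ip, dev), n in count_martians(log).items():
    print(f"{n}x {src_ip} -> {dst_ip} on {dev}")

print(decode_ll_header("ff ff ff ff ff ff 00 6c bc ef 5e 2e 08 06"))
```

For the header shown in the log, the destination is the broadcast address ff:ff:ff:ff:ff:ff and the EtherType is 0x0806, which is ARP. So these entries are ARP broadcasts arriving on p3p1, which fits the observation that they come from the 2nd WAN link.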

After I disabled the 2nd WAN this problem stopped.

This morning the server was not accessible again. This is my 2nd NethServer and there are way too many stability problems with it. I need to find out whether the stability problems are hardware related or NethServer related. It is not normal to have the server hang every week.

This morning I found this on the screen.


Any suggestions?

Istvan!

Could you give some more info about the setup? Is this your own WAN IP, for instance? What is p3p1’s role in the network? What is the server’s task?

There is an upstream bug open; I’m not sure it is the one affecting your server:
https://bugs.centos.org/view.php?id=13843#c31121

I think that @dnutan has found the right answer.
@adv, could you please reboot your server using kernel 3.10.0-514.26.2.el7.x86_64 just to confirm that the problem goes away?
This seems to be a regression coming from Red Hat; we are used to trusting them.

I meant to open a new topic for this issue, but the forum administrators are closing them. They say the issues are related to this topic.
The thing is that I’m facing a lot of stability issues with this NethServer installation. Actually, in the 15 years I have been using Linux, I never had stability issues.
I assume this must be a hardware issue, but I can’t prove it yet.

For this issue I disabled the auditd service, as described at this link:

Now I wait…

You are using CentOS 7; I don’t think that instructions for CentOS 6 are relevant to your problem.
Does your system match what is reported in the CentOS bugtracker?

Today the server was blocked again. The kernel version is: 3.10.0-693.17.1.el7.x86_64
Apparently the bug reported here persists in this version: https://bugs.centos.org/view.php?id=13843#c31121
That thread mentions that the bug is fixed in version 3.10.0-820.el7, but I can’t find this version.
Instead I found that kernel 3 is EOL (https://www.kernel.org/) but it can be upgraded to 4.4 or 4.15 (https://www.howtoforge.com/tutorial/how-to-upgrade-kernel-in-centos-7-server/).
So what do you think? Should I upgrade to 4.4 or 4.15? Or is it better to revert to 3.10.0-514.26.2.el7.x86_64?
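To check whether a given kernel is older than the release the bug report names as fixed, the version strings can be compared numerically. This is a simplified sketch and not the full RPM version comparison: it only looks at the leading numeric fields and stops at the first non-numeric part such as "el7". The function name `kernel_key` is made up for the example.

```python
import re


def kernel_key(version):
    """Turn '3.10.0-693.17.1.el7.x86_64' into (3, 10, 0, 693, 17, 1)."""
    key = []
    for token in re.split(r"[.-]", version):
        if not token.isdigit():
            break  # stop at the first non-numeric part, e.g. 'el7'
        key.append(int(token))
    return tuple(key)


running = "3.10.0-693.17.1.el7.x86_64"
fixed_in = "3.10.0-820.el7"

# Tuples compare field by field, so 693 < 820 decides the result here.
print(kernel_key(running) < kernel_key(fixed_in))  # True
```

By the same comparison, 3.10.0-514.26.2.el7.x86_64 sorts before every 3.10.0-693 build, so going back to it means moving to an older kernel.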

I just discovered I can’t revert to 3.10.0-514.26.2.el7.x86_64. The available versions are:
0 : CentOS Linux (3.10.0-693.17.1.el7.x86_64) 7 (Core)
1 : CentOS Linux (3.10.0-693.11.6.el7.x86_64) 7 (Core)
2 : CentOS Linux (3.10.0-693.el7.x86_64) 7 (Core)
3 : CentOS Linux (0-rescue-b6a299e6f2a74c57aa53e745d1fd3c69) 7 (Core)
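A listing like the one above can be checked mechanically for a specific kernel. The following is a small sketch under the assumption that every entry has the form "index : title"; `parse_menu` and `find_kernel` are hypothetical helper names, not part of GRUB or NethServer.

```python
import re

ENTRY = re.compile(r"^\s*(\d+)\s*:\s*(.+?)\s*$")


def parse_menu(lines):
    """Map menu index -> entry title for lines like '0 : CentOS Linux (...) 7 (Core)'."""
    entries = {}
    for line in lines:
        match = ENTRY.match(line)
        if match:
            entries[int(match.group(1))] = match.group(2)
    return entries


def find_kernel(entries, version):
    """Return the index of the entry whose title contains '(version)', or None."""
    needle = f"({version})"
    for index, title in sorted(entries.items()):
        if needle in title:
            return index
    return None


menu = parse_menu([
    "0 : CentOS Linux (3.10.0-693.17.1.el7.x86_64) 7 (Core)",
    "1 : CentOS Linux (3.10.0-693.11.6.el7.x86_64) 7 (Core)",
    "2 : CentOS Linux (3.10.0-693.el7.x86_64) 7 (Core)",
    "3 : CentOS Linux (0-rescue-b6a299e6f2a74c57aa53e745d1fd3c69) 7 (Core)",
])

print(find_kernel(menu, "3.10.0-514.26.2.el7.x86_64"))  # None: not installed
print(find_kernel(menu, "3.10.0-693.el7.x86_64"))       # 2
```

The first result confirms that the 514 kernel is simply not among the installed entries. On CentOS 7, an index found this way is the kind of value that `grub2-set-default` accepts for choosing the default boot entry.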

So should I update to kernel v4? Which one, 4.4 or 4.15?

3.10.0-820 will come with 7.5 in a few months.
3.10 is not EOL; it’s maintained by Red Hat.
Follow the howto you have found to pick a new kernel from elrepo.
I’d prefer the -lt kernel (which, as of today, is 4.4).
You will lose the nDPI function, but you may rebuild it from sources (I know it works).

After I upgrade to 4.4, can I revert to 3.10.0-820?

I upgraded to 4.4.116-1.el7.elrepo.x86_64. I’m anxious to see if it solves the problem.
Indeed, the DPI function is not available. Do you have a tutorial on how I can rebuild it from the sources? I’m planning to use this feature.

Upgrading the kernel will leave you on your own, since it is a big departure from the upstream path.

The CentOS kernel is maintained, so you can’t compare a vanilla 3.10 kernel with the CentOS/RH one…

What would you do in my place? My server is unstable with the latest maintained CentOS kernel.