Tag Archives: troubleshooting

Grub console how to

I’m sure it happened to migrate a linux server, maybe in a slightly dirty way (rsync’ing) or had some issues with the boot loader.

And when you reach the point with this:

grub rescue>

…and you start to cry (or almost) 🙂

Well, here some steps that helped me to boot the server and restore grub.

Use

ls

to see the list of available partitions. Find the one where you know (or think) the kernel is installed. In my case it was

(hd0,msdos1)

, which is basically /dev/sda1

After that, use the following:

grub rescue > set root=(hd0,msdos1)
grub rescue > set prefix=(hd0,msdos1)/boot/grub
grub rescue > insmod normal
grub rescue > insmod linux
grub rescue > linux /vmlinuz root=/dev/sda1 ro
grub rescue > initrd /initrd.img
grub rescue > boot

With these commands, I have been able to boot into my OS.

After that, I re-installed grub:

update-grub
grub-install /dev/sda

NOTE: UUID could be a cause of failed boot too.
Under Debian/Ubuntu there is a file
/etc/default/grub
where you can disable the UUID format. This could generate some issues if you have swapped the disk so it might be good to check this config file and eventually enable
GRUB_DISABLE_LINUX_UUID=true 
and re run the
update-grub
. To remember as well, the UUID is set in
/etc/fstab
. You can replace that with /dev/sdXy accordingly as well.

I hope this will help someone else that, like me, got stuck in restoring a VM.

Sources:

Systemd – find what’s wrong with systemctl

True: all the last changes in Linux distro didn’t make me really really happy.
I still like to use init.d to start a process (it took me a while to get used to service yourservice status syntax) and so.

Anyway, the main big ones don’t seem to look back, and we need to get used to this 🙂

I have few raspberry PIs at home, and I’ve noticed that after a restart I was experiencing different weird behaviours. The main two:

stuck and not rebooting
receiving strange logrotate email alerts (e.g. /etc/cron.daily/logrotate:
gzip: stdin: file size changed while zipping)

I tried to ignore them, but when you issue a reboot from a remote place and it doesn’t reboot, you understand that you should start to check what’s going on, instead of just unplug-replug your PI.

And here the discovery: systemctl

This magic command was able to show me the processes with issues, and slowly find out what was wrong with logrotate or my reboot. Or, better, I have realised that after fixing what was marked as failed, I didn’t experience any weird behaviors.

So, here few steps that I’d like to share – to help maybe someone else in the future, myself included – as I tend to forget things if I don’t use them 🙂

To check if your system is healthy or not:

systemctl status

Output should return “running”. If you get “degraded”, well, there is definitely something wrong.

Use the following to check what has failed:

systemctl --failed

Now, investigate those specific processes. Try to analyse their status and logs or literally try to restart them to see live what is the error:

systemctl status <broken_service>

journalctl _PID=<PID_of_broken_service>

tail /var/log/<broken_service>

systemctl restart <broken_service>

After fixing all, I tried to reboot few times and after I was checking again the overall status to make sure it was “running”.

In my case, I had few issues with “systemd-modules-load.service”. This probably related to my dist-upgrade. Some old and no longer existing modules were still listed in /etc/modules and, of course, the service wasn’t able to load them, miserably failing.
I’ve tested each module using modprobe <module_name> and I’ve commented out the ones where failing. Restarted and voila`, status… running!

On another PI I had some issues with Apache, but I can’t remember how I fixed it. Still, the goal of this post is mostly make everyone aware that systemctl can give you some interesting info about the system and you can focus your energies on the failed services.

I admit in totally honesty that I have no much clue why after fixing these failed services, all issues disappeared. In fact, the reboot wasn’t affecting one PI with the same non-existing modules listed, but it was stopping another one during the boot. Again, I could probably troubleshoot further but I have a life to live as well 🙂

Sources:

Postfix – blacklist domain

Few notes about how to block a specific domain to send out emails

To help in cutting down the number of spam mails currently getting through a specific domain

# vi /etc/postfix/blacklisted_domains

# cat /etc/postfix/blacklisted_domains
mydomain.com	REJECT

# postmap /etc/postfix/blacklisted_domains
# postconf -e 'smtpd_recipient_restrictions = check_recipient_access hash:/etc/postfix/blacklisted_domains, permit'

Compromised Email troubleshooting notes

Here some notes about how to troubleshoot a server that got compromised by a php script.

Check email queue

Qmail -> qmHandle
Postfix -> pmHandle / postqueue

# qmHandle -s
Total messages: 7357
Messages with local recipients: 0
Messages with remote recipients: 7357
Messages with bounces: 0
Messages in preprocess: 0

Get some email IDs

# qmHandle -l | head
1348989 (16, 16/1348989)
Return-path: #@[]
From: [email protected]
To: [email protected]
Subject: failure notice
Date: 30 Jun 2015 07:42:59 +0100
Size: 5093 bytes
less
42240113 (15, 15/42240113)
Return-path: [email protected]

Check for X-PHP header in the mail message
Look for the UID and script that sent the message

# qmHandle -m1348989 | grep X-PHP
X-PHP-Originating-Script: 48:wp-content.php(1) : eval()'d code

Find the script and UID

# grep 48 /etc/passwd => this was Apache ==> this means that the code was injected via Apache

=> permissions issue??

# locate wp-content.php
/var/www/vhosts/example.com/wp-content.php

Move away the file(s) and chown 000
!! if the file starts with – , you need to user chown — 000 filename

Disable execution php following this how to

Delete all the messages containing that header

# qmHandle -h'X-PHP-Originating-Script: 48:wp-content.php'
Calling system script to terminate qmail...
Stopping : Looking for messages with headers matching X-PHP-Originating-Script: 48:wp-content.php
Message 1345933 slotted for deletion.
Message 42240608 slotted for deletion.
Message 1346796 slotted for deletion.
Message 42240391 slotted for deletion.
Message 42241954 slotted for deletion.
[...]
Deleted 113 messages from queue
Restarting qmail... Starting qmail: [ OK ]
done (hopefully).

Extra notes:

Check the queue:

postqueue -p

See the content of a message:

postcat -q <ID from postqueue output>

Check for “X-PHP-Originating-Script” header, which generally gives you the name of the script that generate the email

If they are sent to a specific domain, you can block some domains in Postfix following this guide

Which PHP script is eating my memory?

For mod_php, you can just add this to your Apache LogFormat:

%{mod_php_memory_usage}n

Might be best to add it right at the end, so as not to break any log parsing.
Credit: http://tech.superhappykittymeow.com/?p=220

For PHP-FPM (i.e. anyone with NginX or anyone with one of our optimised Magento setups.), put the following into your FPM pool config file, probably here:
– /etc/php-fpm.d/website.conf (RHEL/CentOS)
– /etc/php5-fpm/pools.d/website.conf (Ubuntu)

access.log = /var/log/php-fpm/domain.com-access.log
access.format = "%p %{HTTP_X_FORWARDED_FOR}e - %u %t \"%m %{REQUEST_URI}e\" %s %f %{mili}d %{kilo}M %C%% \"%{HTTP_USER_AGENT}e\""

Notes:
• Permissions matter. Check that the User and/or Group (of THIS fpm pool) can write to /var/log/php-fpm (or /php5-fpm, whatever)
• %{HTTP_X_FORWARDED_FOR}e is there because I was behind a Load Balancer, Varnish and/or other reverse proxy.
• %{REQUEST_URI}e\ is there because this CMS (like most, now), rewrite everything to index.php. I want to know the original request, not just the script name.
• %{kilo}M %C – These are the kickers: Memory usage and CPU PerformanceOptimized usage per request. Ker-pow. Pick your favourite awk/sort one-liner to weed out the heavy hitters.

You can log pretty much anything: any arbitrary header, (like REQUEST_URI), and you should find all the options in the comments under the default “www.conf” packaged config file.
Given that this starts with the PID, it should also help track down any segfaults you see in /var/log/messages / kern.log.
Oh, and while you’re at it, PHP-FPM has a slow log you can enable.
Oh, and don’t forget your old friend logrotate.

Counting the average PHP-FPM process memory usage

Looking at the ‘ps’ output isn’t always accurate because of shared memory. Here’s a one-liner to count up my FPM pool, where “www” is in name of the FPM pool:

for pid in $(ps aux | grep fpm | grep "pool www" | awk '{print $2}'); do pmap -d $pid | tail -1 ; done | sed 's/K//' | awk '{sum+=$4} END {print sum/NR/1024}’

Thanks to Dan Farmer for his original Apache memory script, which made me think to do this.

Divide and conquer

Do make sure that multiple websites on the same server are using their own FPM pools, that way it should be much easier to see what’s what.

But you can also separate by URL. I’ve done this with the M-word, but the theory should stand for any CMS where the tasks performed under the admin section will be more intensive than normal front-end user traffic.
The following should work with WordPress, assuming the path is /wp-admin . Magento users can (and should) change their admin path from the default, so watch out for that (usually configured in local.xml)

Or anything really – maybe that “importvideo” script is especially memory-intensive and you don’t want to allow it to bloat up all your processes.

1. Set up a separate PHP-FPM pool.
– give it a different name, different log file names, and listen on a different socket/port.
– Perhaps with a much higher memory_limit
– Perhaps with many fewer pm.max_children (ask customer how many humans actually use the “backend”). You might only need 5 or so.
– Perhaps it needs a longer max_execution_time …. you get the idea.

2. Send your “admin” traffic there.
Here’s how you might go about it. These are excerpts and ideas, rather than solid copy-paste config.

Nginx: something like this:

upstream backend {
server unix:/var/run/php5-domain.com.sock;
}
upstream backend-admin {
server unix:/var/run/php5-domain.com-admin.sock;
}

map "URL:$request_uri." $fcgi_pass {
default backend;
~URL:.*admin.* backend-admin;
}

… then in your ‘server{}’ bit, use the variable for fastcgi_pass:

fastcgi_pass $fcgi_pass;

Apache: something like this…

# Normal backend alias, corresponds with FastCGIExgternalServer 
Alias /php.fcgi /dev/shm/domain-php.fcgi
<Location ~ admin>
# Override Action for “admin” URLs
Action application/x-httpd-php /domain-admin.fcgi
</Location>
Alias /domain-admin.fcgi /dev/shm/domain-admin-php.fcgi

3. You can probably now LOWER the memory_limit for your “main” pool
– if the rest of the website doesn’t use much memory. Now bask in the memory you just saved for the whole application server.

4. Bonus: NewRelic app separation
Don’t let all that heavy Admin work interfere with your nice / renice appdex statistics – we know (and expect) your backend dashboard to be slower.
Put this in the FPM pool config:

php_value[newrelic.appname] = "www.domain.com Admin"

Credits: https://willparsons.tech/

What to do with a down Magento site

1. Application level logs – First place to look.

If you are seeing the very-default-looking Magento page saying “There has been an error processing your request”, then look in here:

ls -lart <DOCROOT>/var/report/ | tail

The stack trace will be in the latest file (there might be a lot), and should highlight what broke.
Maybe the error was in a database library, or a Redis library…see next step if that’s the case.

General errors, often non-fatal, are in <DOCROOT>/var/log/exception.log
Other module-specific logs will be in the same log/ directory, for example SagePay.

NB: check /tmp/magento/var/ .
If the directories in the DocumentRoot are not writable (or weren’t in the past), Magento will use /tmp/magento/var and you’ll find the logs/reports/cache in there.

2. Backend services – Magento fails hard if something is inaccessible

First, find the local.xml. It should be under <DOCROOT>/app/etc/local.xml or possibly a subdirectory like <DOCROOT>/store/app/etc/local.xml

From that, take note of the database credentials, the <session_save>, and the <cache><backend>. If there’s no <cache> section, then you are using filesystem so it won’t be memcache or redis.

– Can you connect to that database from this server? authenticate? or is it at max-connections?
– To test memcache, “telnet host 11211” and type “STATS“.
– To test Redis, “telnet host 6379” and type “INFO”.
You could also use:

redis-cli -s /tmp/redis.sock -a PasswordIfThereIsOne info

If you can’t connect to those from the web server, check that the relevant services are started, pay close attention to the port numbers, and make sure any firewalls allow the connection.
If the memcache/redis info shows evictions > 0, then it’s probably filled up at some point and restarting that service might get you out of the water.

ls -la /etc/init.d/mem* /etc/init.d/redis*

3. Check the normal places – sometimes it’s nothing to do with Magento!

– PHP-FPM logs – good place for PHP fatal errors. usually in var/log/php[5]-fpm/

– Apache or nginx logs
– Is Apache just at MaxClients?
– PHP-FPM max_children?

ps aux | grep fpm | grep -v root | awk '{print $11, $12, $13}' | sort | uniq -c

– Is your error really just a timeout, because the server’s too busy?
– Did OOM-killer break something?

grep oom /var/log/messages /var/log/kern.log

– Has a developer been caught out by apc.stat=0 (or opcache.validate_timestamp=0) ?

Credits: https://willparsons.tech/

Rackspace Cloud Driveclient not working

First of all, checks the logs: /var/log/driveclient.log

You might find 403 errors and lines that are showing that the agent can’t connect properly.

In this case, the first step is trying to re-register the backup agent:
3) Maybe the customer has changed the API key so try re-register the backup agent:

# /usr/local/bin/driveclient --configure
WARNING: Agent already configured. Overwrite? [Y/n]: Y
Username: My_Username
Password: My_APIKey

Desired Output:

Registration successful!
Bootstrap created at: /etc/driveclient/bootstrap.json

In case you get something like “ERROR: Registration failed: Could not authenticate user. Identity returned 401“, this means that you probably need to force a bit the registration, using the following command:

# driveclient -u USER_NAME -k API_KEY -t LON -l raxcloudserver -a lon.backup.api.rackspacecloud.com -c

Netcat – such a powerful ‘cat’!

I was just looking around info about netcat and telnet, trying to understand a bit more. Well… in few words: no point to install telnet if you have netcat! 🙂 Netcat is perfect for scripting, ’cause it’s non-interactive, UDP/TCP capable, can be a listener as well… very powerful tool. Here some example.

How to check if your httpd is up and running:

~ $ nc -zv localhost 80
Connection to localhost 80 port [tcp/http] succeeded!

…and it closes gracefully 😉

How to check port-range ports:

~ $ nc -zv localhost 20-25
nc: connect to localhost port 20 (tcp) failed: Connection refused
Connection to localhost 21 port [tcp/ftp] succeeded!
Connection to localhost 22 port [tcp/ssh] succeeded!
nc: connect to localhost port 23 (tcp) failed: Connection refused
nc: connect to localhost port 24 (tcp) failed: Connection refused
nc: connect to localhost port 25 (tcp) failed: Connection refused

… or a list of ports:

$ nc -zv localhost 20 22 80 443
nc: connect to localhost port 20 (tcp) failed: Connection refused
Connection to localhost 22 port [tcp/ssh] succeeded!
Connection to localhost 80 port [tcp/http] succeeded!
Connection to localhost 443 port [tcp/https] succeeded!

NOTE: If you want to grep or play with the “output” of the command, you need to use 2>&1
For example:
nc -zv localhost 1-1024 <strong>2>&1</strong> | grep succeeded

How to check the service that’s running on that port:

(From man) Alternatively, it might be useful to know which server software is running, and which versions. This information is often contained within the greeting banners. In order to retrieve these, it is necessary to first make a connection, and then break the connection when the banner has been retrieved. This can be accomplished by specifying a small timeout with the -w flag, or perhaps by issuing a “QUIT” command to the server:

$ echo "QUIT" | nc host.example.com 20-30
SSH-1.99-OpenSSH_3.6.1p2
Protocol mismatch.
220 host.example.com IMS SMTP Receiver Version 0.84 Ready

In some cases, it’s handy to add -q 1 at the end, if nc hangs (I’ve noticed this in some cases) Like this:

$ echo "QUIT" | nc host.example.com 20-30 <strong>-q 1</strong>

Or how to send/receive a file:

On the receiver side:

$ nc -l 1234 > /tmp/file_to_receive

On the sender side:

$ cat file_to_send | nc receiver_ip_or_fqdn 1234

$ nc receiver_ip_or_fqdn 1234 < file_to_send

There are plenty of things that you can do. These are just simple examples… enjoy! 🙂

How to test virtualhosts – command line

curl -H'host: myvirtualserver.localdomain'

Example

curl -H'host: virtualdomain01.com' http://192.168.218.144/

My notepad

Chris' IT notes…