2020-04-13

Simple and Boring Offline Log Analysis

You should analyze your logs, but you might not need a full-blown elasticsearch cluster for it. In this post I show you my current way of analyzing the access logs of this website. Perhaps you can make use of some of the techniques.


The current state of logging is weird: there are cheap or even free (as in free beer) online services that let you send them your logs and analyze them as a service. Since this requires a huge amount of trust in the provider not to sell your logs or hold them for ransom to force you onto higher-priced tiers, I am averse to those services for my use case. You could start a huge elasticsearch cluster with logstash and kibana, which is exceptionally nice and powerful but will eat all your resources and your time keeping it alive (if you don’t have an IT department to manage it for you). Or you could tell your webservers to dump all their logs to a file and never look at them again. None of these options is perfect, so I describe my extremely low-effort way of analyzing logs offline.


The easy way: GoAccess

There is a tool called GoAccess that is pure magic: you give it a logfile and, after a bit of processing, it presents you with a nice CLI or even web dashboard. If you are satisfied with the provided statistics, e.g. requests per resource, processing time distribution, etc., you are done. In my case, however, I want a bit more insight.

Heavier analytics

So first off, the ELK stack is great if you have someone to run it for you. If you just want to analyze some logs from time to time and do not care about them being always available, you can spin up an on-demand ELK stack, read in your logfile, and throw away the cluster afterwards. That way, one does not have to care about nasty things like index rotation, running out of disk space, node failure, and whatever else tends to blow up a cluster.

Unfortunately elasticsearch does not really support what I want to analyze: how requests correlate over time. For example: given a visitor landing on the page, when do they request another resource (if at all)? If they do, how long do they take between requests (a client requesting each and every resource within 10 seconds might not be a human reader)? Do people run into 404 errors and are never seen again? Is there anyone visiting on more than one day who is not a bot? Is someone using the tags?

To answer all those questions, the logs eventually end up (temporarily) in a relational database. If that sounds distasteful at first, bear with me: the logs are shipped there in bulk, so there is no need to worry about unnecessary transactions blocking logging adapters and the like. Please do not dump your application logs directly into a relational database.
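
As a teaser for the kind of question this enables, here is a minimal sketch (it assumes the access_logs table created further down in this post) that uses a window function to compute the time between consecutive requests of the same client:

-- gap between consecutive requests of the same client ip
-- (assumes the access_logs table defined later)
SELECT
  clientip,
  acc_time,
  acc_time - LAG(acc_time) OVER (
    PARTITION BY clientip ORDER BY acc_time
  ) AS gap_to_previous_request
FROM access_logs
ORDER BY clientip, acc_time;

A client with consistently tiny gaps is probably not a human reader.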

But how do the logs get into the database?

The Log ‘Pipeline’

The log processing ‘pipeline’ is as follows:

  • The webservers write their access-logs in the default format to a file.
  • Once rotated, the now closed file is downloaded to a central host.
  • A custom script cleans and parses the logs and dumps them in a postgres database.
  • Now one can use SQL to query the logs.
  • After a retention time, the logs are deleted (see the sketch below).

Why use such a cumbersome, old-fashioned way? I don’t (currently) need logs to be updated within seconds, so batching them daily is fine for me. This removes all the nasty details of shipping them to a centralized server, buffering and what-not.
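
The retention step from the list above is equally unspectacular: since everything ends up in one table, old rows can simply be deleted. A minimal sketch, assuming the access_logs table used later in this post and a retention of 90 days (the interval is an assumption, pick whatever suits you):

-- drop everything older than the retention time (here: 90 days)
DELETE FROM access_logs WHERE acc_time < now() - interval '90 days';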

So why a custom script? The currently available tools turned out to be unfit for this job: rsyslogd permanently spat out errors when using postgres, complaining about some weird missing parameters that are documented nowhere. Its internal parsing language also seems a bit unusual to me, but that is personal taste. Logstash works great with elastic tech, but its postgres support is no longer actively maintained. Fluentd did not work with the postgres module either. So after five hours of trying to configure some existing tool, I chose to hack together a script in 94 lines of Rust, and it just works fine.

Log analysis with postgres

A word of caution: please don’t push the logs of your running app directly into a relational database; that is not what it is made for. In my case I bulk-import them from a file, and that happens only once every week or month.

Use the following steps to have a database containing your logs:

We need a user and a database; refer to the manual on how to create them.
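
If you just want something to copy and adapt, a minimal sketch run as a PostgreSQL superuser could look like this (the role and database names match the psql call below; the password is a placeholder):

-- run once as a superuser; replace the password
CREATE ROLE logger WITH LOGIN PASSWORD 'change-me';
CREATE DATABASE logs OWNER logger;

Then log in with the credentials of the logger user and create a new table containing your logs: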

psql -U logger -d logs -h 127.0.0.1
CREATE TABLE access_logs (
  acc_time timestamp,
  clientip inet,
  host varchar(128),
  usr varchar(128),
  verb varchar(128),
  res varchar(512),
  code integer,
  bytes integer,
  referrer varchar(512),
  agent varchar(512)
);

Now dump the logs into that table. If everything else fails, use CSV import; it worked surprisingly well in my testing environment. Don’t forget to sanitize the logs first: you don’t want your database deleted by a malicious user-agent string.
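
Assuming the cleaned-up logs sit in a CSV file whose columns match the table definition above (the file name here is just a placeholder), the import boils down to a single psql \copy:

\copy access_logs FROM 'access_logs_clean.csv' WITH (FORMAT csv)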

Now we can extract the first bits of information, starting with the easy-to-find bots:

-- create the table for our bots
CREATE TABLE bot_ips(clientip inet, num_seen integer);

-- we insert all ips that ever used an agent containing something suspicious
-- further, the count of requests is noted
INSERT INTO bot_ips (
  SELECT clientip, COUNT(clientip)
  FROM access_logs
  WHERE
      agent LIKE '%bot%'
  OR  agent LIKE '%Bot%'
  OR  agent LIKE '%crawler%'
  OR  agent LIKE '%metrics%'
  OR  res LIKE '%robots%'
  GROUP BY clientip
);

This will of course only find the nice guys: if a bot sets a different user-agent, it will fly under the radar.
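
As a quick usage example for the new table, the share of requests coming from these obvious bots can be computed like this (a rough sketch, counting requests rather than bytes):

-- percentage of requests coming from ips in bot_ips
SELECT
  COUNT(*) FILTER (WHERE clientip IN (SELECT clientip FROM bot_ips)) * 100.0
    / COUNT(*) AS bot_request_share
FROM access_logs;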

Finding bad guys

We can also filter out bad actors: since there is no wordpress or anything php-related running on my servers, the following list of suspicious patterns was built up over time. I know this approach might not scale well, but I only run it once a week, so who cares (at the moment, finding the offending IPs in about 2.5M log lines takes around 20 seconds, which is time I am willing to invest once per week).


create table naughty_strings(pattern varchar(512));
-- a hand-picked list of resources that are only requested with malicious intent
insert into naughty_strings values('%212%212%');
insert into naughty_strings values('%999999%');
insert into naughty_strings values('%unhex%');
insert into naughty_strings values('%name_const%');
insert into naughty_strings values('%wp-login%');
insert into naughty_strings values('%.php');
insert into naughty_strings values('%.env');
insert into naughty_strings values('%wp-includes%');
insert into naughty_strings values('%/PHP/%');
insert into naughty_strings values('%.ckfinder%');
insert into naughty_strings values('%/.git/%');
insert into naughty_strings values('%/.hg/%');
insert into naughty_strings values('%/.svn/%');
insert into naughty_strings values('%wordpress%');
insert into naughty_strings values('%license.txt');
insert into naughty_strings values('%wp-json%');
insert into naughty_strings values('%myadmin%');
insert into naughty_strings values('%/mysql/');
insert into naughty_strings values('%/pma/%');
insert into naughty_strings values('%/phpMyAdmin/%');
insert into naughty_strings values('%/wp-content/%');

To filter out all naughty requests one can combine JOIN and LIKE (thanks stackoverflow):

SELECT res
FROM  access_logs AS logs
INNER JOIN (
  SELECT pattern as p FROM naughty_strings
) st ON logs.res LIKE st.p;

If someone knows a more efficient way to match a set of patterns against a table, please let me know!
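
One alternative formulation (not benchmarked here, so no promises about it being faster) pushes the patterns into a LIKE ANY predicate:

-- same result as the join above, expressed with LIKE ANY
SELECT res
FROM access_logs
WHERE res LIKE ANY (SELECT pattern FROM naughty_strings);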

To save the offending IPs, they can be added to an extra table:

-- create a table containing all bad ips

CREATE TABLE bad_ips(clientip inet);

INSERT INTO bad_ips (
  SELECT distinct(logs.clientip) 
  FROM  access_logs AS logs
  INNER JOIN (
    SELECT pattern as p FROM naughty_strings
  ) st ON logs.res LIKE st.p
);

Filtering out the requests from these IPs removes about 10% of requests in my case:

SELECT COUNT(res) FROM access_logs AS logs
WHERE logs.clientip NOT IN (
  SELECT *
  FROM  bad_ips
);

Time based queries

A user shared one of my blog posts via Slack (I appreciate this), so Slack requested the site to create a thumbnail. Don’t worry, I can’t tell who posted the link; I only analyzed the Slack bot’s behaviour. Given the link and the timestamp, we can now have a look at the data to find out whether anything went wrong:

select * from access_logs as logs
where logs.acc_time > TIMESTAMP '2020-02-28 12:10:46' - interval '1 hour'
AND logs.acc_time < TIMESTAMP '2020-02-28 12:10:46' + interval '1 hour'
AND logs.res LIKE '%pps%'; -- part of the post title

In fact, Slack did request the apple-touch-icon.png resource, which I had not created initially, probably leading to a placeholder image being used.
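
Missing resources like that icon are easy to spot with a list of the most frequent 404s, for example:

-- the most frequently requested resources that do not exist
SELECT res, COUNT(*) AS misses
FROM access_logs
WHERE code = 404
GROUP BY res
ORDER BY misses DESC
LIMIT 20;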

Since further SQL might be boring, I spare you the rest of the queries and instead present you with a few results.

Some findings

Combing through the logs, I found the following things:

  • Slack requests /apple-touch-icon.png; that thumbnail did not exist, so I created it. The same was true for the icon requested by Twitter.
  • There are lots of feed readers subscribing, so I should probably pay more attention to the post description that is visible in the feed. Therefore I subscribed to my own RSS feed, without being too narcissistic.
  • People use outdated browsers to read. Whoever is out there reading with Firefox 57 (November 2017), please consider an update.
  • Over the last month, about 5000 IPs tried to find wordpress installations on the server, luckily causing only about 7% of traffic.
  • There is an HTTP status code 499?
  • Many feed readers use unencrypted HTTP.
  • People normally don’t click on tags.
  • People don’t click around much at all.

I did not notice that my post was linked on readrust, thanks for that!

  timestamp         |   ip     | verb |           res          | code |  bytes | referrer                                      | agent
--------------------+----------+------+------------------------+------+--------+-----------------------------------------------+----------------
2020-02-27 00:00:00 | x.x.x.x  | GET  | /posts/tech/tokio_pps/ | 200  |  25394 | https://readrust.net/web-and-network-services | Mozilla/5.0 ...

Conclusion

For now, I am completely satisfied with my makeshift log analysis solution. If your use case is simple enough, consider saving yourself the hassle of using all the fancy tools and use an old-fashioned database.

