Counting Requests to Nginx

LH Fong
Published in ESTL Lab Notes
Sep 20, 2018 · 4 min read

I was recently asked how much traffic Coursemology receives. We don’t monitor this figure live, but I figured it should be possible to work it out from the Nginx access logs.

Access Log Format

On Ubuntu 16, the Nginx access log is located at /var/log/nginx/access.log by default. Logrotate is configured to rotate the log once per day, so there were a few days of logs available, with the older ones compressed.
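
The rotation itself is driven by the nginx package’s logrotate rule in /etc/logrotate.d/nginx. On a stock install it looks something like the sketch below (exact contents vary between releases, so treat this as illustrative):

/var/log/nginx/*.log {
        daily
        missingok
        rotate 14
        compress
        delaycompress
        notifempty
        create 0640 www-data adm
        sharedscripts
        postrotate
                invoke-rc.d nginx rotate >/dev/null 2>&1
        endscript
}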

Here is a sample line from the log file.

xx.xxx.xx.xx - - [20/Sep/2018:06:25:00 +0800] "GET /users/sign_in HTTP/2.0" 200 5205 "-" "some browser and system information"

The main thing I wanted to check for was the format of the timestamp. It is logged to the second, which makes figuring out the number of requests per second more straightforward.
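
For reference, the line above matches Nginx’s predefined combined log format, in which $time_local is the per-second timestamp field:

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';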

Processing access.log

A script in any programming language could be written to read the log file, parse out the timestamp and count the number of occurrences for each time.
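
As a sketch, such a script could be as small as a single awk command that tallies the timestamp column into an array (just an illustration of the idea, not what I ended up running):

sudo cat access.log | awk '{count[$4]++} END {for (t in count) print count[t], t}' | sort -rn | head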

However, this is not an obscure problem so other people must have tried to figure out the same thing. Sure enough, a couple of Google searches led to this site on parsing access.log and error.log with Linux commands. This is even better since no custom code is required.

Counting Requests per Second

The example on the site sorts by the number of times each response code shows up. It can be adapted to do exactly the same thing, but with the timestamp column instead. Here is the first pass:

sudo cat access.log | awk '{print $4}' | uniq -c | sort -rn | head

sudo is needed to read the logs as a normal user does not have the necessary privileges.

I’ve always found awk syntax obscure, but since the site used {print $9} for the response code, it’s a reasonable guess that the number refers to the column position. From there, it’s a matter of trying numbers until the timestamp comes up. It turns out that {print $4} only gives the timestamp up to the first space (complete with the leading [), but that’s good enough since the timezone is always the same for me.
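
If the leading [ bothers you, awk’s substr function can trim it off; this optional variant (not part of the original pipeline) prints the timestamp without the bracket:

sudo cat access.log | awk '{print substr($4, 2)}' | uniq -c | sort -rn | head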

uniq -c is where the main magic happens. uniq collapses adjacent repeated lines into one, and the -c flag prefixes each surviving line with the number of occurrences. This works without sorting first only because the log is already in chronological order, so identical timestamps sit on adjacent lines.

sort -rn then sorts the output from uniq in reverse order numerically, putting the timestamps with the most number of requests at the top. This works since uniq -c has placed the count in front.

head then truncates the output to its first ten lines, which are the busiest timestamps in the log file.
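
To see how these last three commands cooperate, here is a toy run on some made-up timestamps (not real log data):

printf '10:00:01\n10:00:01\n10:00:02\n10:00:02\n10:00:02\n10:00:03\n' | uniq -c | sort -rn | head
      3 10:00:02
      2 10:00:01
      1 10:00:03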

Run against the real access log, the output looks something like this (figures have been modified):

     45 [20/Sep/2018:16:38:05
     38 [20/Sep/2018:17:38:55
     34 [20/Sep/2018:16:38:06
     34 [20/Sep/2018:16:36:52
     31 [20/Sep/2018:16:38:18
     31 [20/Sep/2018:14:38:24
     30 [20/Sep/2018:16:36:54
     30 [20/Sep/2018:16:36:53
     30 [20/Sep/2018:16:36:31
     30 [20/Sep/2018:14:15:06
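
Two related one-liners help put those per-second peaks in context (my own additions, same idea): the first counts the total requests in the file, the second counts how many distinct seconds saw any traffic at all.

sudo cat access.log | wc -l
sudo cat access.log | awk '{print $4}' | uniq | wc -l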

Processing Compressed Log files

For the rotated, compressed log files, use zcat instead of cat.

sudo zcat access.log.6.gz | awk '{print $4}' | uniq -c | sort -rn | head
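
To get one combined count across the current log and all the rotated ones, zcat -f passes uncompressed files through unchanged, so a single glob covers everything (a variation on the same pipeline; adjust the path if your shell cannot expand the glob without root):

sudo zcat -f /var/log/nginx/access.log* | awk '{print $4}' | uniq -c | sort -rn | head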

Filtering out Hacking Attempts

Going through my log files, I noticed a few timestamps with far more requests than the rest of the file.

We can investigate those in more detail with grep:

sudo cat access.log | grep '<timestamp>'

These turned out to be hacking attempts looking for php files.

103.84.x.x - - [19/Sep/2018:04:43:13 +0800] "GET /help.php HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
103.84.x.x - - [19/Sep/2018:04:43:13 +0800] "GET /java.php HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
103.84.x.x - - [19/Sep/2018:04:43:13 +0800] "GET /_query.php HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
103.84.x.x - - [19/Sep/2018:04:43:13 +0800] "GET /test.php HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
103.84.x.x - - [19/Sep/2018:04:43:13 +0800] "GET /db_cts.php HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
103.84.x.x - - [19/Sep/2018:04:43:13 +0800] "GET /db_pma.php HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
103.84.x.x - - [19/Sep/2018:04:43:13 +0800] "GET /logon.php HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
103.84.x.x - - [19/Sep/2018:04:43:13 +0800] "GET /help-e.php HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
103.84.x.x - - [19/Sep/2018:04:43:13 +0800] "GET /license.php HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"
103.84.x.x - - [19/Sep/2018:04:43:13 +0800] "GET /log.php HTTP/1.1" 301 194 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"

As I’m not running a PHP app, these can be filtered out:

sudo cat access.log | grep -vE '(GET|POST) .*\.php' | awk '{print $4}' | uniq -c | sort -rn | head

grep -v inverts the match, so lines that do NOT match GET or POST requests for .php files are kept for further processing in the pipeline (-E enables extended regular expressions, which is what makes the (GET|POST) alternation work).
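
The same counting trick works on other columns too. Column 1 is the client IP address, so this small variant of mine shows which addresses make the most requests; note the extra sort before uniq -c, since the same IP is not guaranteed to appear on adjacent lines:

sudo cat access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head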

With the Nginx log files and some clever chaining of built-in Linux commands, it turned out to be relatively easy to get the required information without having to write a custom program!
