Querying Apache logs from the Linux command line

October 30, 2013

The problem: you need to consult your Apache logs in a programmatic fashion. For instance, you are looking for a particular string, and perhaps want to know more about its occurrences: on what days did it occur the most? At what times of day? Is there an IP address or user agent that correlates in any unusual way? And so on.

The solution? Well, there are probably many, including using PHP, Perl, Python or other scripting languages to load the logs and process the text into more advanced data structures. For the purposes of this article, however, I'll be taking a purely command line approach, utilising tools you'll find on basically any Linux web server.

For those who are more accustomed to web development languages and patterns, the world of shell scripting can seem a bit awkward and foreign. Granted, the readability of most significantly sized scripts written for a shell like Bash can be lacking in parts. However, the power of the shell is probably vastly underrated by most "normal" developers, and it doesn't require full mastery to be an effective tool. In this article, I will be sticking to scripting on the command line, which means utilising the humble concept of 'piping' to shift data between commands.

If you're not at all familiar with piping when it comes to Bash, then let me offer a very simplistic explanation: by piping, you are sending the output of one command to the input of another. In reality the workings of the pipe are a bit more complex than that, but with this knowledge alone you can form enough of an understanding to figure out, for instance, what's going on here:

echo 'hello world' | tr ' ' '-'

Which of course comes back with hello-world. In any case, this article isn't about introducing piping, so if this still isn't making sense, perhaps a bit of Googling is in order.
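
It's also worth noting that pipes chain happily beyond two commands, which is exactly what the rest of this article relies on. A trivial sketch:

echo 'hello world' | tr ' ' '\n' | wc -l

Here 'tr' splits the two words onto separate lines and 'wc -l' counts those lines, so the result should be 2.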

Due to the nature of Bash piping, with its focus on passing raw text, and the bevy of text-based tools available in your typical Linux installation, querying Apache logs at the command line is actually quite a natural fit. To get started, here is a typical entry in an Apache access log, taken from this site (with the IP address replaced by a dummy):

10.12.13.14 - - [26/Oct/2013:10:31:04 +1100] "GET /api/posts?skip=0&take=15 HTTP/1.1" 200 118519 "http://www.nathandavison.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0"

So, just text; stuff the command line is built to play with. Going off this example, we can now quite quickly find out how many times this URI was accessed:

cat access.log | grep 'GET /api/posts?skip=0&take=15 ' | wc -l

The response will be a number representing the number of lines in the log file (access.log) that make it through the first pipe - that is, the lines grep deems a match for the pattern (in this case, a simple string). While this is a rather basic example, clearly even here we've obtained what could be handy information with very minimal effort.
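
As an aside, 'grep' can do the counting itself via its -c option, so the following should give the same number without needing 'wc' (and, strictly speaking, without 'cat' either, since 'grep' accepts a filename directly):

grep -c 'GET /api/posts?skip=0&take=15 ' access.log

I'll stick with the explicit pipe chains in this article, since they make each step easier to follow.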

If you go back to 'the problem' listed at the start of this article, I listed some criteria that were a bit more complex than this, so from here on out, things will get a bit more detailed. In a recent real world example, I wanted to see Google's AJAX indexing at work, which involves the GoogleBot appending _escaped_fragment_= to the query string in its requests, to signal to the web server that it wants the static version of the AJAX-fulfilled URL being requested. I wanted to find out on which dates these requests occurred, and also the number of these requests on each day. I also wanted this data in CSV format so I could port it across platforms easily. These requirements form a fairly complex query on the Apache logs, but one that the command line can definitely fulfil.

First, I wanted to see what the text I would be working with looked like:

cat access.log | grep '_escaped_fragment_' | tail -10

This gives me the 10 most recent entries in my access log that contain the string _escaped_fragment_, returning results like this:

66.249.74.183 - - [25/Oct/2013:19:08:24 +1100] "GET /article/15/knowing-what-element-was-targeted-in-backbonejs-view-events?_escaped_fragment_= HTTP/1.1" 301 339 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

From here, I can use this result as my reference for how to extract data from the log text. Since I want date related data, it is the timestamp that I'm most interested in now, given that any result coming through should already match my criteria of containing _escaped_fragment_ (and hence be a request I'm interested in).

The following command will output all the relevant log entries, showing only their timestamp:

cat access.log | grep '_escaped_fragment_' | sed 's/.*\[\(.*\)\].*/\1/'

This is using 'sed' to manipulate the text coming through the pipe so it only shows the timestamp. It does this by searching for a regular expression pattern that matches the entire line, capturing the timestamp portion, and then replacing the match (the entire line) with the captured timestamp. If you're not familiar with 'sed' (or, for that matter, 'grep' or 'cat' or 'cut' or 'tail' or 'uniq' etc), I'd recommend doing some research to gain an understanding.
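
A quick way to sanity check a 'sed' expression like this is to feed it a single known line with 'echo' and see what comes out the other end; for example, using the dummy entry from earlier (trimmed a little for brevity):

echo '10.12.13.14 - - [26/Oct/2013:10:31:04 +1100] "GET /api/posts?skip=0&take=15 HTTP/1.1" 200 118519' | sed 's/.*\[\(.*\)\].*/\1/'

This should print just 26/Oct/2013:10:31:04 +1100, confirming the capture and replace is doing what we expect before pointing it at the whole log.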

This 'sed' command could use some refining so it is a bit more particular about how it matches the timestamp (just in case square brackets occur elsewhere in the log text), and I will also cut down the output so that only the actual dd/Mon/yyyy date is captured, and not the time.

cat access.log | grep '_escaped_fragment_' | sed 's/.*\[\([0-9]\{2\}\/[a-zA-Z]\{3\}\/[0-9]\{4\}\):.*\].*/\1/' 

So far so good, but we still need to convert this list of dates into CSV data, where each date appears once with a count of its occurrences rather than being repeated over and over. Like everything else so far, this can be done on the command line:

cat access.log | grep '_escaped_fragment_' | sed 's/.*\[\([0-9]\{2\}\/[a-zA-Z]\{3\}\/[0-9]\{4\}\):.*\].*/\1/' | sort | uniq -c 

Two new commands have been added to the chain of pipes. Firstly, we are sorting the results with 'sort'. This is important because 'uniq' only condenses repeated lines that are adjacent, so we have to make sure identical dates are grouped together before it sees them. With the 'uniq' call, we're also adding the '-c' option, which tells 'uniq' to prefix each line it outputs with a count of how many times that line occurred.
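
If the need for 'sort' seems abstract, a tiny illustration (with 'printf' standing in for the grep/sed output) shows the difference:

printf '26/Oct/2013\n27/Oct/2013\n26/Oct/2013\n' | uniq -c

printf '26/Oct/2013\n27/Oct/2013\n26/Oct/2013\n' | sort | uniq -c

The first leaves the two 26/Oct/2013 lines as separate counts of 1, because they aren't adjacent; the second groups them first, so 'uniq -c' correctly reports 2. Anyway, back to the real log data.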

Running this command now will output a somewhat awkwardly formatted bulk of text, with spaces to separate counts from dates. We're close, but we'll have to deploy some more 'sed' commands and a 'cut' command to get our CSV data.

cat access.log | grep '_escaped_fragment_' | sed 's/.*\[\([0-9]\{2\}\/[a-zA-Z]\{3\}\/[0-9]\{4\}\):.*\].*/\1/' | sort | uniq -c | sed 's/ \{1,\}/,/g' | sed 's/\(.*\)/,\1/' | sed 's/,\{1,\}/,/g' | cut -d, -f2,3 

As you can see, we've added three new 'sed' commands and a 'cut' command. The first 'sed' replaces one or more spaces with a comma, effectively converting those awkward spaces from the 'uniq' command's output into commas. The next two 'sed' commands are perhaps not strictly necessary, but they first ensure each line starts with a comma, and then collapse any run of consecutive commas down to one. This is done because, due to the output format of 'uniq -c', it's possible a line won't start with a space, and hence won't gain a leading comma from the earlier 'sed' command, breaking the CSV format. Finally, 'cut' is used to output only the two "fields" we want: the number of occurrences and the date.

This rather meaty (but still manageable) command achieves the objective, which was to query the Apache logs and return, in a nice universal CSV format, the dates and number of occurrences of requests containing the string '_escaped_fragment_'. For that extra bit of perfection, here is how you might make sure the date field comes before the occurrence field (yep, you guessed it, another sed!):

cat access.log | grep '_escaped_fragment_' | sed 's/.*\[\([0-9]\{2\}\/[a-zA-Z]\{3\}\/[0-9]\{4\}\):.*\].*/\1/' | sort | uniq -c | sed 's/ \{1,\}/,/g' | sed 's/\(.*\)/,\1/' | sed 's/,\{1,\}/,/' | cut -d, -f2,3 | sed 's/\(.*\),\(.*\)/\2,\1/' 
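
As a side note, if you don't mind introducing one more tool, 'awk' can collapse the comma juggling and the field swap into a single step, since it splits on whitespace by default. A rough equivalent of the above, offered here purely as an alternative sketch:

cat access.log | grep '_escaped_fragment_' | sed 's/.*\[\([0-9]\{2\}\/[a-zA-Z]\{3\}\/[0-9]\{4\}\):.*\].*/\1/' | sort | uniq -c | awk '{print $2 "," $1}'

Which chain you prefer is largely a matter of taste; I'll stick with the sed/cut version for the rest of this article.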

Finally, now that we have the output we want, we can write it to a file, like so:

cat access.log | grep '_escaped_fragment_' | sed 's/.*\[\([0-9]\{2\}\/[a-zA-Z]\{3\}\/[0-9]\{4\}\):.*\].*/\1/' | sort | uniq -c | sed 's/ \{1,\}/,/g' | sed 's/\(.*\)/,\1/' | sed 's/,\{1,\}/,/' | cut -d, -f2,3 | sed 's/\(.*\),\(.*\)/\2,\1/' > ajaxcrawls.csv 
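
If you'd rather see the results on screen while also saving them, swapping the redirection for 'tee' (which writes its input both to standard output and to the named file) works too:

cat access.log | grep '_escaped_fragment_' | sed 's/.*\[\([0-9]\{2\}\/[a-zA-Z]\{3\}\/[0-9]\{4\}\):.*\].*/\1/' | sort | uniq -c | sed 's/ \{1,\}/,/g' | sed 's/\(.*\)/,\1/' | sed 's/,\{1,\}/,/' | cut -d, -f2,3 | sed 's/\(.*\),\(.*\)/\2,\1/' | tee ajaxcrawls.csv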

And that's it, job done.

Again, all this could have been achieved in multiple ways. I'm sure there are powerful tools out there that take Apache logs and produce all manner of nice interfaces for admins to extract data with, and something like what was outlined in this article could just as easily be achieved in a full-blown scripting language. However, sometimes it's nice to know that the typical Linux command line can go a long way towards providing what you want, and it is definitely a skill that will reward you with results, particularly in situations where obtaining varied data quickly is key. And of course, all this isn't just relevant to Apache - these techniques can apply to any form of text file querying.