Everyone loves logs, right?
Logs are long, complex, full of useless information, and it takes ages to find the one error message you need to fix a problem. So if you're working with over 100 servers and receiving over 200GB of logs a day, how do you dig through them all to find the real information inside?
We wanted a tool to centralise our applications' logs and minimise the time it took to find a single error in a vast ocean of log messages. It also needed to be quick – I mean 'real-time' quick – because if you're in the middle of a huge issue, sitting and waiting for logs to be processed before you can find the problem is not a good position to be in. The last requirement was that it needed to handle A LOT of logs! At one point we clocked one of our applications writing out around 100 lines a second during peak times; multiplied across over 100 servers, that throughput is immense, and the tool needed to keep up.
Then we found the ELK stack: three different open source applications which, used together, form a viable solution to our sorrows. The stack is made up of ElasticSearch, Logstash and Kibana, each with a different role, but together they work harmoniously to:
- take our logs off of the application servers
- translate them into formats that can be read by anyone
- store them in a centralised store
- display them on a friendly web UI with lots of colours and graphs
So what's first? The first job was to create the centralised store where we would keep all the logs, index them and make them available for search – a job for ElasticSearch! Built on a library called Apache Lucene, ElasticSearch is able to process and index our logs at incredible speed, keeping up with our (nearly) real-time requirement of a maximum ~5 second delay.
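Lucene's speed comes largely from an inverted index: each term maps to the documents that contain it, so a search becomes a lookup rather than a scan through every log line. Here's a toy sketch of the idea (the log lines and index structure are purely illustrative, not Lucene's actual format):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return doc ids containing every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

logs = [
    "INFO user login succeeded",
    "ERROR payment gateway timeout",
    "ERROR user login failed",
]
index = build_inverted_index(logs)
print(search(index, "error login"))  # {2}
```

The real thing adds tokenisation, scoring and distribution across nodes, but the core trick is the same: pay the indexing cost up front so searches are fast.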
Next comes transporting the logs off of the application servers into ElasticSearch, which is where Logstash comes into play. Logstash uses a simple configuration file to specify a) where your logs are stored, b) how to translate your logs (primarily with a Grok filter) and c) how and where to output them – in our case, to ElasticSearch.
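Grok patterns are essentially named regular expressions that pull structured fields out of free-text log lines. As a rough illustration of what that filter stage does (the log format and field names below are invented, not our actual config):

```python
import re

# A grok pattern like "%{TIMESTAMP} %{LOGLEVEL} %{MESSAGE}" boils down
# to a regex with named capture groups (hypothetical log format).
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)

def parse_log_line(line):
    """Turn a raw log line into a dict of fields, like a grok filter would."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

event = parse_log_line("2014-06-01 12:00:05 ERROR Connection refused")
print(event)
# {'timestamp': '2014-06-01 12:00:05', 'level': 'ERROR', 'message': 'Connection refused'}
```

Once every line is broken into fields like this, the logs stop being walls of text and become data you can filter and aggregate on.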
Lastly, how do you get at your logs once they're in ElasticSearch? If you're clever enough and have enough time on your hands, you can write your own interface that uses the ElasticSearch API to fetch the logs and parse them from their JSON format into something you prefer. Or, if you're like us and don't have time for that, you can use Kibana.
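If you do go the do-it-yourself route, the ElasticSearch search API returns JSON where the matching documents sit under `hits.hits[]._source`. A small sketch of pulling the log events back out of a response (the response body here is a trimmed-down example, not a real capture):

```python
import json

# Trimmed-down example of an ElasticSearch search response body.
response_body = json.loads("""
{
  "hits": {
    "total": 2,
    "hits": [
      {"_source": {"level": "ERROR", "message": "Connection refused"}},
      {"_source": {"level": "ERROR", "message": "Timeout talking to gateway"}}
    ]
  }
}
""")

def extract_events(response):
    """Pull the original log events out of a search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]

for event in extract_events(response_body):
    print(event["level"], "-", event["message"])
```

It's not hard, but multiply it by every query, filter and graph you'd want, and you can see why we let Kibana do it instead.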
Kibana is great at talking to ElasticSearch: it translates your search queries into ElasticSearch API calls and shows you the response in a very easy-to-read format. On top of that, you can create funky little graphs and widgets to represent your logs, such as "errors per minute" graphs or "most common recurring errors", to help you really see into your logs without spending hours trawling through incomprehensible data.
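Under the hood, an "errors per minute" graph is just a count of error events bucketed by timestamp, which Kibana asks ElasticSearch to compute. A toy version of the same aggregation (the events and timestamps are invented for illustration):

```python
from collections import Counter

# Invented sample of parsed log events.
events = [
    {"timestamp": "2014-06-01 12:00:05", "level": "ERROR"},
    {"timestamp": "2014-06-01 12:00:40", "level": "ERROR"},
    {"timestamp": "2014-06-01 12:00:59", "level": "INFO"},
    {"timestamp": "2014-06-01 12:01:10", "level": "ERROR"},
]

def errors_per_minute(events):
    """Count ERROR events, bucketed by minute (timestamp truncated to HH:MM)."""
    return Counter(
        e["timestamp"][:16]  # "YYYY-MM-DD HH:MM"
        for e in events
        if e["level"] == "ERROR"
    )

print(errors_per_minute(events))
# Counter({'2014-06-01 12:00': 2, '2014-06-01 12:01': 1})
```

Feed that into a bar chart and you have the kind of at-a-glance view that makes a spike in errors jump out immediately.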
Once we got the ELK stack up and running, we were able to see our logs in (nearly) real time, which helps us triage issues quickly and efficiently. It also had the added benefit of giving better visibility of our logs to developers, who previously had no access to any Production environments. The centralised location and Kibana dashboards made looking through logs a pleasure, and in a lot of cases people began proactively looking through the logs to see what could be fixed or improved before anything caused our customers any errors.
So, what’s next…
Now the challenge is to deploy the entire ELK stack into the cloud, fully resilient, fully scalable and, most importantly, fully automated!