Structuring unstructured data with Grok

If you are using the Elastic (ELK) stack and want to parse custom application logs with Logstash before they reach Elasticsearch, then this post is for you.


The ELK stack is an abbreviation for three open source projects: Elasticsearch, Logstash, and Kibana. Together they form a log management platform.

  • Elasticsearch is a search and analytics system.
  • Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a “stash” such as Elasticsearch.
  • Kibana allows users to visualize data using charts and graphs in Elasticsearch.

Beats came along later as a lightweight data shipper. The introduction of Beats turned the ELK Stack into the Elastic Stack, but that's beside the point.

This article is about Grok, a feature in Logstash that can parse your logs before they are indexed. For our purposes, I will only talk about processing data on its way from Logstash to Elasticsearch.


Grok is a filter inside Logstash that is used to parse unstructured data into something structured and queryable. It sits on top of regular expressions (regex) and uses text patterns to match strings in log files.

As we will see in the following sections, using Grok is important when it comes to managing logs efficiently.

Without Grok, your log data is unstructured


Without Grok, when logs are sent from Logstash to Elasticsearch and rendered in Kibana, they appear only in the message field.

Querying for meaningful information is difficult in this situation, since all of the log data is stored under a single key. It would be better if the log entries were organized into fields.

Unstructured data from logs

localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0 

If you look closely at the raw data, you can see that it actually consists of distinct parts, each separated by a space.

More experienced developers can probably guess what each part means, and that this is a log message from an API call. Each item is broken down below.

A structured view of our data

  • localhost == environment
  • GET == method
  • /v2/applink/5c2f4bb3e9fda1234edc64d == url
  • 400 == response_status
  • 46ms == response_time
  • 5bc6e716b5d6cb35fc9687c0 == user_id

As the structured view shows, there is an underlying order to the unstructured log. The next step is to process the raw data programmatically. That's where Grok shines.
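To make that mapping concrete, here is a minimal Python sketch (purely illustrative; it is not part of the Logstash pipeline) that splits the raw line on whitespace and pairs each part with the field names listed above:

```python
# Split the raw log line on whitespace and pair each part
# with the field names identified above.
raw = "localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0"
field_names = ["environment", "method", "url",
               "response_status", "response_time", "user_id"]

entry = dict(zip(field_names, raw.split()))
print(entry["method"])         # GET
print(entry["response_time"])  # 46ms
```

Of course, naive splitting breaks down as soon as a field can itself contain a space, which is one reason real pattern matching is preferable.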

Grok Patterns

Built-in Grok Patterns

Logstash ships with over 100 built-in patterns for structuring unstructured data. You should definitely take advantage of them when possible for common log formats such as apache, linux, haproxy, aws, and so on.
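As an example of what the built-in patterns buy you, a filter for standard Apache access logs can lean entirely on the stock COMBINEDAPACHELOG pattern instead of anything hand-written (a sketch, assuming Apache-format input):

```
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
```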

However, what happens when you have custom logs, as in the example above? Then you have to build your own Grok pattern.

Custom Grok Patterns

To build my own Grok pattern, I used the Grok Debugger and the Grok Patterns reference.

Note that Grok patterns use the syntax %{SYNTAX:SEMANTIC}, where SYNTAX is the name of the pattern that matches the text and SEMANTIC is the field name to store it under, e.g. %{NUMBER:response_status}.

The first thing I tried was the Discover tab in the Grok Debugger. I thought it would be great if this tool could generate a Grok pattern automatically, but it wasn't very helpful, as it found only two matches.


Using those matches as a starting point, I began building my own pattern in the Grok Debugger, using the pattern syntax found on Elastic's GitHub page.


After experimenting with different patterns, I was finally able to structure the log data the way I wanted.


Link to the Grok debugger


localhost GET /v2/applink/5c2f4bb3e9fda1234edc64d 400 46ms 5bc6e716b5d6cb35fc9687c0 


%{WORD:environment} %{WORD:method} %{URIPATH:url} %{NUMBER:response_status} %{WORD:response_time} %{USERNAME:user_id} 

The resulting output

{ "environment": [ [ "localhost" ] ], "method": [ [ "GET" ] ], "url": [ [ "/v2/applink/5c2f4bb3e9fda1234edc64d" ] ], "response_status": [ [ "400" ] ], "BASE10NUM": [ [ "400" ] ], "response_time": [ [ "46ms" ] ], "user_id": [ [ "5bc6e716b5d6cb35fc9687c0" ] ] } 

With the Grok pattern in hand and the data mapped, the last step is to add it to Logstash.

Updating the logstash.conf configuration file

On the server where you installed the ELK stack, go to the Logstash configuration:

sudo vi /etc/logstash/conf.d/logstash.conf 

Paste in the changes:

input {
  file {
    path => "/your_logs/*.log"
  }
}

filter {
  grok {
    match => { "message" => "%{WORD:environment} %{WORD:method} %{URIPATH:url} %{NUMBER:response_status} %{WORD:response_time} %{USERNAME:user_id}" }
  }
}

output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
  }
}

After saving the changes, restart Logstash and check its status to make sure it is still working.

sudo service logstash restart
sudo service logstash status 

Finally, to make sure the changes take effect, be sure to refresh the Logstash index pattern in Kibana!


With Grok, your log data is structured!

As a result, Grok automatically maps log data to named fields in Elasticsearch. This makes logs easier to manage and information faster to find. Instead of digging through log files to debug, you can simply filter on what you are looking for, such as the environment or url.

Give Grok expressions a try! If you have another way to do this, or if you run into problems with the examples above, leave a comment below and let me know.

Thanks for reading - and please follow me here on Medium for more interesting articles on software engineering!

PS Link to the source
