Tech Talk: Searching in a large text field with Elasticsearch stored fields (Part 1)

Elasticsearch has become increasingly popular over the years, helping clients explore and analyze their data differently with search. More recently, it has been featured in the Gartner quadrant. With Elasticsearch, it’s incredibly easy to spin up a cluster and get to work. However, if you don’t configure your cluster properly, you’ll get into trouble sooner or later.

In this article, we’ll focus on tweaking the default Elasticsearch behavior of storing original fields. We’ll also pay attention to the potential performance benefits and drawbacks that can emerge.

What we tried to do

In a previous article about the Function App, I wrote about extracting and processing text from PDF documents. Now, I will share how we made all the computation results available for search, and how we tried to optimize performance along the way.

Our dataset is composed of JSON documents combining text and attributes extracted from PDF documents. The text is, on average, 200K characters long. The user experience must match that of a well-known search engine like Google: when a user types in a query, they have to receive an answer promptly. A result that arrives after one minute doesn’t make any sense in our use cases.

Our dataset will amount to 32 TB of indexed data. We also know that Elasticsearch scales out much better than it scales up, and that it can easily handle the data and the workload. It will be a matter of architecture and the number of nodes.

To start, we have a cluster of 21 nodes as a ballpark estimate. Some optimizations and configurations are worth mentioning and experimenting with, e.g., stored fields and highlighting. We’ll focus on the stored fields configuration in this first part.

The stored field in Elasticsearch

In Elasticsearch, a stored field is a field whose original value can be retrieved in a query result. The original document is stored in a field called _source, which is returned with the results. In most cases, this is how you get the original document from your search. By default, the _source contains all the fields, including the large ones, unless you explicitly exclude or include fields.

Elasticsearch will read the _source from disk every time you run a query, parse it as JSON, and return it. Reading a stored field that contains extensive text can be a performance issue both for the CPU (parsing a large JSON structure) and for disk I/O (reading a lot of data). Moreover, reading an unnecessarily large amount of data from disk will push useful data out of the file system cache, making it less effective. Depending on the size of the large text field and the use cases, it can be a good idea to tweak the default behavior to reach better performance.
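As a point of comparison, the query-time way to "explicitly exclude/include fields" is source filtering in the search request. The sketch below only builds the request body as a plain Python dictionary. The helper name build_search_body is hypothetical, and the field names (content, title, author) are assumed to match the mapping used later in this article.

```python
import json


def build_search_body(query_text, returned_fields):
    """Build a search body that returns only selected _source fields.

    Hypothetical helper: the field names are assumptions for illustration.
    """
    return {
        # Full-text query against the large text field.
        "query": {"match": {"content": query_text}},
        # Source filtering: only these fields appear in each hit.
        "_source": {"includes": list(returned_fields)},
    }


body = build_search_body("quarterly report", ["title", "author"])
print(json.dumps(body, indent=2))
```

Keep in mind that this mainly trims what is sent back to the client. As far as we understand, the stored _source may still have to be read and parsed before it is filtered, which is why the mapping-level options below are worth considering.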

Tweaking the _source in Elasticsearch

Elasticsearch lets you exclude fields from the _source. It also lets you store a specific field on its own, separately from the _source. This way, you’ll be able to return valuable information like the “title” and other metadata attached to the document without having to read and parse the extensive text.

To store your field separately, you have to configure your index mapping:

  • Exclude it from the _source.

  • Specify store: true to store it separately.

{
  "mappings": {
    "_source": {
      "excludes": ["content"]
    },
    "properties": {
      "content": {
        "type": "text",
        "store": true
      },
      "author": {
        "type": "keyword"
      },
      "title": {
        "type": "keyword"
      }
    }
  }
}

Store fields separately
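With a mapping like this one, the large text can be requested on demand through the stored_fields option of a search request, while the lighter metadata still comes from the _source. The following is a minimal sketch under that assumption. The helper names are hypothetical, and sample_hit is a hand-written structure shaped like a search hit, not real cluster output.

```python
def build_stored_fields_search(query_text, stored_fields, source_fields):
    """Build a search body that asks for stored fields and a filtered _source."""
    return {
        "query": {"match": {"content": query_text}},
        # Fields mapped with "store": true, fetched separately from _source.
        "stored_fields": list(stored_fields),
        # Metadata fields still read from the (now smaller) _source.
        "_source": list(source_fields),
    }


def first_stored_value(hit, field):
    """Stored fields come back under "fields" as arrays; return the first value."""
    values = hit.get("fields", {}).get(field, [])
    return values[0] if values else None


# Hand-written example hit (an assumption for illustration only).
sample_hit = {
    "_id": "1",
    "_source": {"title": "Annual report", "author": "jdoe"},
    "fields": {"content": ["full extracted text of the PDF"]},
}

search_body = build_stored_fields_search("merger", ["content"], ["title", "author"])
print(search_body)
print(first_stored_value(sample_hit, "content"))
```

Leaving content out of stored_fields for the common "list of results" query, and requesting it only when a user opens a document, is one way to fit the rare edge cases discussed further down.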

Although this solution can bring a significant performance improvement, it also has severe drawbacks you need to know about.

As explained in the official documentation, excluding a field from the _source will also disable some useful features:

  • Update, update_by_query, and reindex APIs.

  • Highlighting on fields that are not stored. Highlighting search results requires the field content to be stored somehow, either in the _source or as a separately stored field.

  • Upgrading an index to a new major version. At some point, you won’t be able to upgrade further.

  • Ability to debug queries by viewing the original document.

  • Any potential future features requiring the original _source containing all the fields.
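Before accepting these drawbacks, it can help to list which fields a mapping actually removes from the _source and whether they remain retrievable. The sketch below is a hypothetical audit helper, not an Elasticsearch API. It only looks at top-level properties and approximates wildcard patterns in excludes with Python's fnmatchcase, which is an assumption on our side.

```python
from fnmatch import fnmatchcase


def audit_source_excludes(mappings):
    """Report top-level fields excluded from _source and how they can be read back."""
    excludes = mappings.get("_source", {}).get("excludes", [])
    report = {}
    for name, config in mappings.get("properties", {}).items():
        if any(fnmatchcase(name, pattern) for pattern in excludes):
            # Excluded fields are only retrievable if stored separately.
            if config.get("store"):
                report[name] = "stored separately"
            else:
                report[name] = "not retrievable"
    return report


# The "mappings" section shown earlier in this article.
mappings = {
    "_source": {"excludes": ["content"]},
    "properties": {
        "content": {"type": "text", "store": True},
        "author": {"type": "keyword"},
        "title": {"type": "keyword"},
    },
}

print(audit_source_excludes(mappings))
```

A field reported as "not retrievable" is searchable only: it cannot be highlighted, returned, or re-indexed from the cluster itself.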

What we recommend in Elasticsearch

It is now obvious that this kind of performance improvement comes at a significant cost. A few scenarios come to mind:

  • If you use the large text exclusively to build the inverted index for full-text search, it makes sense to exclude it from the _source. You will potentially lower your I/O and CPU workload, as well as your disk usage.

  • If you have rare edge cases where you must retrieve the large text, it makes sense to store it separately from the _source field.

In our use case (a large field of 200 KB), we didn’t see any noticeable improvement from storing the large text field separately that would justify such drawbacks. We believe the file system cache made the difference.

As stated in this article on making Elasticsearch perform well, it makes sense to explicitly exclude the field from the _source when the large field (100 MB) is big enough for the performance improvement to outweigh the drawbacks. In this very particular case, we strongly recommend keeping the data in a separate data store so that you can still:

  • Update documents by manually deleting and re-indexing them.

  • Upgrade indices to a new major version.

When upgrading, you’ll need to start over with an empty cluster and re-index all the documents manually. In some cases, this could be unacceptable from a business point of view, and you’ll then have to work with two separate clusters to minimize downtime. You could also work with a single cluster if it is large enough to index your data twice.
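To give an idea of what manual re-indexing from a separate data store involves, here is a minimal sketch that turns documents into a payload for the bulk API, which expects newline-delimited JSON: one action line followed by one document line, with a final newline. The function name, the index name documents-v2, and the sample documents are all hypothetical. Sending the payload to the cluster is left out on purpose.

```python
import json


def to_bulk_ndjson(index_name, documents):
    """Serialize (id, source) pairs into a bulk API payload (NDJSON)."""
    lines = []
    for doc_id, source in documents:
        # Action line: index this document under the given id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the full document, including the large text field.
        lines.append(json.dumps(source))
    # The bulk body must end with a newline; an empty input yields an empty body.
    return "\n".join(lines) + "\n" if lines else ""


# Documents as they might come back from the separate data store (made up).
docs_from_store = [
    ("1", {"title": "Annual report", "author": "jdoe", "content": "first text"}),
    ("2", {"title": "Audit memo", "author": "asmith", "content": "second text"}),
]

payload = to_bulk_ndjson("documents-v2", docs_from_store)
print(payload)
```

In practice, the documents would be streamed from the data store in batches rather than held in one list, so that each bulk request stays at a reasonable size.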

Conclusion

We initially believed our dataset would include a large text field. We quickly realized that our field wasn’t that large, and that Elasticsearch is capable of handling a much bigger text field than we thought. The idea of storing the large text field separately may seem intriguing at first, but it doesn’t provide any meaningful performance benefit for our dataset. We still firmly believe it could increase performance in some particular edge cases.

All in all, excluding fields from the _source must be a carefully considered decision. In most cases, we do not recommend tweaking the _source unless a field is large enough to justify the drawbacks. To find out more on this subject, I invite you to read this discussion.
