Update: When we published this article in March 2024, Elasticsearch did not yet support Apache Arrow streaming format. This is possible now, see "From ES|QL to native Pandas dataframes in Python" for more details.
The Elasticsearch Query Language (ES|QL) provides a powerful way to filter, transform, and analyze data stored in Elasticsearch. Designed to be easy to learn and use, it is a perfect fit for data scientists familiar with Pandas and other dataframe-based libraries. Indeed, ES|QL queries produce tables with named columns, which is the definition of dataframes! This blog explains how to export ES|QL queries as Pandas dataframes in Python.
ES|QL to Pandas dataframes in Python
Importing test data
First, let's import some test data. We will be using the employees sample data and mappings. The easiest way to load this dataset is to run these two Elasticsearch API requests in the Kibana Console.
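If you would rather stay in Python, the same data can be indexed through the client's bulk API. Here is a minimal sketch of how the newline-delimited bulk payload is built; the two documents below are illustrative stand-ins, not the real dataset.

```python
import json

# Two illustrative employee documents (the real dataset has 100).
employees = [
    {"emp_no": 10001, "languages": 2, "still_hired": True},
    {"emp_no": 10002, "languages": 5, "still_hired": True},
]

# The _bulk API expects newline-delimited JSON: an action line
# followed by the document itself, for every document.
def build_bulk_body(index, docs):
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = build_bulk_body("employees", employees)
print(body)
```

The payload could then be sent with the Python client, e.g. `client.bulk(operations=body)`; for larger datasets the `helpers.bulk` helper is the more idiomatic choice.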
Converting dataset to a Pandas DataFrame object
OK, with that out of the way, let's convert the full employees dataset to a Pandas DataFrame object using the ES|QL CSV export:
from io import StringIO

from elasticsearch import Elasticsearch
import pandas as pd

client = Elasticsearch(
    "https://[host].elastic-cloud.com",
    api_key="...",
)

response = client.esql.query(
    query="FROM employees | LIMIT 500",
    format="csv",
)
df = pd.read_csv(StringIO(response.body))
print(df)
Even though this dataset only contains 100 records, we use a LIMIT command to avoid ES|QL warning us about potentially missing records. This prints the following dataframe:
avg_worked_seconds ... salary_change.long still_hired
0 268728049 ... 1 True
1 328922887 ... [-7, 11] True
2 200296405 ... [12, 14] False
3 311267831 ... [0, 1, 3, 13] True
4 244294991 ... [-2, 13] True
.. ... ... ... ...
95 204381503 ... NaN False
96 206258084 ... -1 False
97 272392146 ... [-2, 4, 8] False
98 377713748 ... [-8, -3, 10, 14] True
99 223910853 ... [-7, 13] True
This means you can now analyze the data with Pandas. But you can also continue massaging the data using ES|QL, which is particularly useful when queries return more than 10,000 rows, the current maximum number of rows that ES|QL queries can return.
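One wrinkle when analyzing the CSV export with Pandas is that multi-valued ES|QL fields such as salary_change.long arrive as strings like "[-7, 11]". Here is a minimal sketch of normalizing such a column, using a made-up CSV sample shaped like the output above; `to_list` is a hypothetical helper, not part of any client.

```python
import ast
import math
from io import StringIO

import pandas as pd

# A made-up sample shaped like the ES|QL CSV export above.
csv_data = """avg_worked_seconds,salary_change.long,still_hired
268728049,1,true
328922887,"[-7, 11]",true
204381503,,false
"""

df = pd.read_csv(StringIO(csv_data))

# Normalize every cell to a Python list: missing values become [],
# scalars become one-element lists, and "[...]" strings are parsed.
def to_list(cell):
    if isinstance(cell, float) and math.isnan(cell):
        return []
    if isinstance(cell, str) and cell.startswith("["):
        return ast.literal_eval(cell)
    return [int(cell)]

changes = df["salary_change.long"].map(to_list)
print(changes.tolist())  # [[1], [-7, 11], []]
```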
Analyzing the data with Pandas
In the next example, we count how many employees speak a given language using STATS ... BY (not unlike GROUP BY in SQL), and then sort the result by the languages column using SORT:
response = client.esql.query(
    query="""
    FROM employees
    | STATS count = COUNT(emp_no) BY languages
    | SORT languages
    | LIMIT 500
    """,
    format="csv",
)
df = pd.read_csv(
    StringIO(response.body),
    dtype={"count": "Int64", "languages": "Int64"},
)
print(df)
Note that we've used the dtype parameter of pd.read_csv() here, which is useful when the type inferred by Pandas is not enough. The above code prints the following:
count languages
0 15 1
1 19 2
2 17 3
3 18 4
4 21 5
21 employees speak 5 languages, wow!
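For comparison, the same aggregation could be done client-side with Pandas once the rows are in a dataframe. Here is a sketch on an illustrative slice of the data; the point of STATS ... BY is precisely that you don't have to ship every row to compute this.

```python
import pandas as pd

# An illustrative slice of the employees data.
df = pd.DataFrame(
    {
        "emp_no": [10001, 10002, 10003, 10004, 10005],
        "languages": [2, 5, 2, 1, 5],
    }
)

# STATS count = COUNT(emp_no) BY languages | SORT languages,
# expressed with Pandas:
counts = (
    df.groupby("languages")["emp_no"]
    .count()
    .rename("count")
    .reset_index()
    .sort_values("languages")
)
print(counts)
```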
Finally, suppose that the end users of your code control the minimum number of languages spoken. You could format the query directly in Python, but doing so would allow an attacker to perform an ES|QL injection! Instead, use the built-in parameters support of the ES|QL REST API:
response = client.esql.query(
    query="""
    FROM employees
    | STATS count = COUNT(emp_no) BY languages
    | WHERE languages >= (?)
    | SORT languages
    | LIMIT 500
    """,
    format="csv",
    params=[3],
)
df = pd.read_csv(
    StringIO(response.body),
    dtype={"count": "Int64", "languages": "Int64"},
)
print(df)
which prints the following:
count languages
0 17 3
1 18 4
2 21 5
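To make the injection risk concrete, here is a hypothetical sketch of what goes wrong with plain string formatting: an attacker-controlled "number" smuggles an extra command into the query.

```python
# Building the query with str.format instead of params.
template = """
FROM employees
| STATS count = COUNT(emp_no) BY languages
| WHERE languages >= {}
| SORT languages
"""

user_input = "1 | KEEP languages"  # attacker-controlled "number"
query = template.format(user_input)
print(query)  # the query now contains an extra KEEP command
```

With params=[...], the value travels outside the query string and can never be parsed as ES|QL syntax.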
Conclusion
As you can see, ES|QL and Pandas play nicely together. However, CSV is not the ideal format: it requires explicit type declarations and doesn't handle some of ES|QL's more elaborate results well, such as nested arrays and objects. For this reason, we are working on adding native support for Apache Arrow dataframes to ES|QL, which will make all of this transparent and bring significant performance improvements.
Additional resources
If you want to learn more about ES|QL, the ES|QL documentation is the best place to start. You can also check out this other Python example using Boston Celtics data. To learn more about the Python Elasticsearch client itself, refer to the documentation, ask a question on Discuss with the language-clients tag, or open a new issue if you find a bug or have a feature request. Thank you!