Access a Presto Database with Python and SQLAlchemy
At Twilio, I maintain a blog dashboard that allows my colleagues and I to perform analysis on traffic, conversions, and other metrics for articles published on this blog. I have written in the past about some of the database benchmarking techniques I’ve implemented for this project, and this generated interesting discussions on Hacker News. Today I will be discussing a topic that is a lot less controversial.
The charts and reports generated by this dashboard combine data from several sources that include web traffic, blog article metadata, and user information. The latter comes from a large internal company database hosted in a Presto cluster.
Presto is an open-source distributed query engine for big data created by Facebook. It allows users to query many different data sources, such as Hadoop, Kafka, MySQL, MongoDB, and many more using its own SQL implementation. This effectively abstracts the developer from query language differences among sources.
In this article I’m going to tell you how I enabled the Python back end of my dashboard application to communicate with Presto using the familiar SQLAlchemy library.
Python requirements
There are two Presto client libraries for Python, the official Presto-Python client and PyHive. At the time I’m writing this article, only PyHive has SQLAlchemy support, so that made it an easy choice for me, since I already use this library in my project.
If you are going to attempt to connect to your own Presto server, go ahead and install SQLAlchemy and PyHive in your virtual environment as follows:
Create a connection
In SQLAlchemy, the first step in making a connection to a database is to create an engine for it. The syntax to create an engine is shown below:
The database URL that is passed as an argument to the create_engine()
function includes all the connection's parameters. For a Presto database, the URL has the following format:
This URL above has a lot of placeholders for information that is going to be specific to your own Presto instance. Let’s review these placeholders:
username
andpassword
are the credentials that you will use to access the cluster.hostname
andport
specify the server and port number on which Presto is listening for connections.catalog
andschema
define the Presto database you intend to use. The catalog is associated with the data source. Theschema
has different meanings depending on the source, but is in general a method of referencing a collection of tables as a group.protocol
can be set tohttp
orhttps
, depending on the requirements of your Presto server.
To keep my scripts clean of sensitive information I prefer to configure the parameters of the connection in environment variables, which are then imported by the Python script. Below you can see an example get_presto_engine()
function that returns the SQLAlchemy engine instance:
Note how I take advantage of Python’s f-strings to use the URL template from above directly in the code.
Access to database tables
Once you have the engine instance, you can create a MetaData
object and add tables to it, which would allow you to start building queries.
To create table object that represent the actual tables in the database you have two options:
- You can manually define
Table
objects withColumn
instances that match the Presto tables you are interested in querying. - You can have SQLAlchemy automatically “reflect” these tables for you.
The following example manually defines a pageviews
table with five columns, then queries the database for the 10 most recent page views from it:
Be aware that this approach will only work when the columns defined in the Table
instances actually match those in the Presto database. An interesting detail is that you do not need to define every column in the SQLAlchemy table, it is sufficient to define only the columns that you need for your queries.
For tables with lots of columns it may be tedious to manually replicate every column, so using table reflection can be more efficient. This is where SQLAlchemy automatically instantiates table objects based on information retrieved from the database. The example above will largely be the same under table reflection, the only change would be in how the Table
object is defined:
With this method SQLAlchemy uses the table name and the engine to retrieve the schema of the table, and then creates a corresponding table object and adds it to the metadata
instance.
While table reflection sounds like a fantastic solution, I have had some issues with it. The pyhive
package appears to not recognize certain column types reported by Presto, so it could not properly reflect some columns.
Integration with the SQLAlchemy ORM
If you’ve got this far, the next logical step is to see if Presto can be queried via the SQLAlchemy ORM.
And this actually worked as well. Instead of creating MetaData
and Table
instances, a declarative base and a model class can be defined:
With this approach, rows from the table are represented by instances from the PageView
class. The class includes the __tablename__
attribute with a reference to the Presto table name. Database queries are issued through the Session
object, which automatically maps between table rows and PageView
objects.
You may have noticed that in the Table
example I did not define any primary keys for the table, because my intention was to only read from this table. When using the ORM, SQLAlchemy requires a primary key, so I had to define it to avoid errors.
If you prefer to use table reflection, that also works with the ORM. Here is the PageView
model class from the example above, defined with reflection:
Conclusion
And there you have it, now you know how to access your Presto cluster using the familiar SQLAlchemy library. I hope this helps you keep your big data needs Pythonic as much as it helped me!
Miguel Grinberg is a Principal Software Engineer for Technical Content at Twilio. Reach out to him at mgrinberg [at] twilio [dot] com if you have a cool project you’d like to share on this blog!
Related Posts
Related Resources
Twilio Docs
From APIs to SDKs to sample apps
API reference documentation, SDKs, helper libraries, quickstarts, and tutorials for your language and platform.
Resource Center
The latest ebooks, industry reports, and webinars
Learn from customer engagement experts to improve your own communication.
Ahoy
Twilio's developer community hub
Best practices, code samples, and inspiration to build communications and digital engagement experiences.