Airport Servers and the DuckDB Catalog
The Airport DuckDB extension is designed to provide a seamless user experience by integrating data served via Apache Arrow Flight directly into DuckDB. The primary goal is to make remote Arrow Flights appear and behave like native DuckDB tables and functions.
Evolution of Integration
Initially, the Airport DuckDB extension operated separately from the DuckDB catalog. Discovering and consuming data required using specific table-returning functions, namely airport_flights
and airport_take_flight
. This approach lacked the transparency of treating remote data as local database objects.
To address this, the extension now integrates directly with the DuckDB catalog. This enhancement allows Arrow Flights to be represented as standard tables and functions within the database schema, significantly streamlining data access for users.
How the Airport Extension Simulates a Catalog
This integration works by leveraging the Airport extension’s ability to interpret the metadata of Arrow Flights provided by the server, effectively creating a catalog layer on top of Arrow Flight, which itself does not provide one.
The Airport extension relies on additional metadata attached to the flight information. This metadata is stored in the app_metadata
field of the FlightInfo
structure. This metadata signals to Airport whether a specific flight should be treated as a database table, a scalar function, or a table-returning function. The metadata is a serialized C++ structure named using msgpack
named SerializedFlightAppMetadata
.
This is the structure of the metadata that is stored in the app_metadata
field of the FlightInfo structure.
struct SerializedFlightAppMetadata
{
; // set to "table_function" or "table" or "scalar_function"
string type
// The name of the schema where this item exists.
;
string schema
// The name of the catalog or database where this item exists.
;
string catalog
// The name of the item.
;
string name
// A custom comment for this item.
;
string comment
// This must be defined for a table returning function
// it is a serialized Arrow schema.
std::optional<string> input_schema;
// The name of the action passed to the Arrow Flight server
std::optional<string> action_name;
// This is the function description for table or scalar functions.
std::optional<string> description;
(
MSGPACK_DEFINE_MAP, schema,
type, name,
catalog, input_schema,
comment, description)
action_name};
For scalar functions, the input_schema
field must contain a serialized Arrow Schema describing the function’s input. The schema of the flight serves as the output schema of the scalar function, which must consist of exactly one element (but can be any Arrow data type).
The interpretation of a server’s flights’ metadata only happens after the Airport extension is asked to attach to a specific database on the Arrow Flight server via the ATTACH
command. An Airport database name is simply an identifier; a single Arrow Flight server process can host multiple distinct Airport Databases.
The Airport database acts as the top-level container, analogous to any other DuckDB attached database. Within it, schemas contain the database objects (tables and functions) that are provided by the Arrow flights from the server.
In essence, connecting to an Airport database allows the Airport DuckDB extension to populate the DuckDB catalog with representations of the remote Arrow Flights according to their associated metadata.
The following diagram shows an example of an in-memory DuckDB database and an attached Airport database provided by a remote server.
Arrow Flight and ListFlights
The Arrow Flight RPC interface includes a ListFlights
method for listing the available flights from a server. It seemed logical to use this method to integrate those flights with the DuckDB catalog. However, several limitations make ListFlights
unsuitable for this purpose:
Performance Overhead: An Arrow Flight server can provide thousands of flights. Serializing all flights in response to a
ListFlights
RPC can be CPU-intensive. Since serialization occurs at the gRPC level, the current implementation of Arrow does not support response caching when flights remain unchanged.Large Schema Sizes: Apache Arrow schemas can be large, especially when dealing with hundreds of tables. Compressing responses could help but would add additional CPU overhead to the Flight server.
The serialized size of an Apache Arrow schema can be quite large, especially when dealing with hundreds of schemas. The table below illustrates the size of an Apache Arrow schema as the number of int64
fields increases.
Number of Int64 Fields | Serialized Size of Arrow Schema (bytes) | Serialized Size of Arrow Schema with ZStandard Compression level=20 (bytes) |
---|---|---|
0 | 56 | 46 |
1 | 128 | 71 |
2 | 176 | 90 |
3 | 224 | 98 |
4 | 272 | 106 |
5 | 320 | 114 |
6 | 368 | 119 |
7 | 416 | 127 |
8 | 464 | 135 |
9 | 512 | 151 |
10 | 560 | 155 |
- Lack of Caching and CDN Support: If a portion of a Arrow server’s flight list rarely changes, there is no built-in way to leverage an external service (e.g., a CDN) to cache and distribute this data nor indicate that a client can cache the data. Using a CDN could offload bandwidth and CPU usage from the Flight server.
Databases often contain hundreds or even thousands of tables, each with its own schema. An Arrow Flight server may spend significant time serializing these schemas for every client request, even though they change infrequently.
To optimize performance, a server can serialize and compress schema descriptions once and provide clients with a reference to the stored representation via a URL (e.g., accessible with curl). This URL could be hosted on a CDN to further improve efficiency.
If an Airport-provided list_schemas
schema or database includes a SHA256 checksum and its contents are not provided inline with the response, it is assumed that the content can be be cacheable indefinitely. Instead of repeatedly downloading the schema information, DuckDB caches it in the .duckdb
home directory under the airport_cache
direction, using the SHA256 checksum as the cache key. Before attempting to retrieve the schema from a URL, DuckDB first checks this cache. If the schema exists on disk and its hash matches, the cached version is used.
- Missing Hierarchical Structure: DuckDB organizes objects within databases, schemas, and then tables/functions.
ListFlights
returns a flat list of flights, making it difficult to represent this hierarchy.
Alternative Approach: list_schemas
Instead of using Arrow Flight’s ListFlights
RPC for catalog integration, the Airport extension introduces a separate RPC, list_schemas
, invoked via DoAction
. The key benefits of list_schemas
include:
Flexible Response Delivery: Responses can be provided inline or via an external URL, validated and cached based off of a SHA256 checksum. This enables CDN support and client-side caching.
Comprehensive Database Representation:
list_schemas
returns either the entire database’s structure. Each schema contained in the database can either be returned inline or via an external URL. This allows schemas that are immutable to be efficiently cached, and fixed schemas the be served effectively.Efficient Handling of Dynamic Schemas: For schemas that change frequently, SHA256 checksums are optional. Without a SHA256 checksum the schema information will not be cached by DuckDB.
Retrieving Catalog and Schema Information
Database schema information is retrieved using an Arrow Flight RPC action called via DoAction
1 with the name list_schemas
.
Case Sensitivity
DuckDB table and identifier names are case-insensitive but the names are not transformed (or standardized to a particular case) before they are sent to the Arrow Flight server. As such a query like
SELECT FROM from FooBar;
Will look for a table named FooBar
.
It is the responsibility of the server to match the behavior of DuckDB.
Footnotes
From the Arrow Flight documentation:
Flight services can support an arbitrary number of simple actions in addition to the possible ListFlights, GetFlightInfo, DoGet, DoPut operations that are potentially available. DoAction allows a flight client to do a specific action against a flight service. An action includes opaque request and response objects that are specific to the type action being undertaken.