Table Column Statistics
DuckDB’s tables can optionally supply statistics that assist with efficient query execution. What these statistics consist of vary based on the data type of the column. Tables provided via the Airport extension can provide these statistics if the Arrow Flight server chooses to do so.
Arrow Flight Server Implementation Notes
For an Airport based table to provide statitics, the Arrow schema of the table should contain a metadata key called can_produce_statistics
with a non-empty string value. If the table has opted-in to providing statistics an Arrow Flight DoAction
RPC with be made to the server with an action name of column_statistics
.
The action will be provided a parameter with this msgpack
encoded message.
struct GetFlightColumnStatistics
{
std::string flight_descriptor;
std::string column_name;
std::string type;
(flight_descriptor, column_name, type)
MSGPACK_DEFINE_MAP};
The flight_descriptor
field is the Arrow Flight serialized FlightDescriptor structure. The type
field is the DuckDB data type name, i.e. VARCHAR
, TIMESTAMP WITH TIME ZONE
.
The response to the the DoAction
call is the GetFlightColumnStatisticsResult
structure.
struct GetFlightColumnStatisticsResult
{
//! Whether or not the segment can contain NULL values
bool has_null;
//! Whether or not the segment can contain values that are not null
bool has_no_null;
// estimate that one may have even if distinct_stats==nullptr
idx_t distinct_count;
//! Numeric and String stats
;
GetFlightColumnStatisticsNumericStatsData numeric_stats;
GetFlightColumnStatisticsStringData string_stats
(has_null, has_no_null, distinct_count, numeric_stats, string_stats)
MSGPACK_DEFINE_MAP};
This in turn references a few additional structures. The first is for strings.
struct GetFlightColumnStatisticsStringData
{
std::string min;
std::string max;
(min, max)
MSGPACK_DEFINE_MAP};
Returning statistics for numeric types is a bit more complicated since there are different levels of precision and msgpack
doesn’t support 128-bit integers. So they’ve been split into a high and low 64-bit integer.
struct GetFlightColumnStatisticsNumericStatsData
{
//! Whether or not the value has a max value
bool has_min;
//! Whether or not the segment has a min value
bool has_max;
//! The minimum value of the segment
;
GetFlightColumnStatisticsNumericValue min//! The maximum value of the segment
;
GetFlightColumnStatisticsNumericValue max
(has_min, has_max, min, max)
MSGPACK_DEFINE_MAP};
struct GetFlightColumnStatisticsNumericValue
{
bool boolean;
int8_t tinyint;
int16_t smallint;
int32_t integer;
int64_t bigint;
uint8_t utinyint;
uint16_t usmallint;
uint32_t uinteger;
uint64_t ubigint;
uint64_t hugeint_high;
uint64_t hugeint_low;
float float_;
double double_;
(
MSGPACK_DEFINE_MAP, tinyint,
boolean, integer,
smallint, utinyint,
bigint, uinteger,
usmallint,
ubigint, hugeint_low,
hugeint_highfloat_, double_)
};
By providing column level statistics DuckDB can do a better job with planning and query optimization.
There is an experimental schema defined by Apache Arrow for statistics. In the future that schema may be adopted rather than the msgpack
schema.