column_statistics Action

The get_flight_column_statitics action is used to provide DuckDB with statitics about a field in a table.These statitics allow DuckDB to execute queries more efficienty.An Arrow Flight server can optionally choose to implement this action.

Input Parameters

There is a single msgpack serialized parameter passed to the action.

struct GetFlightColumnStatistics
{
  std::string flight_descriptor;
  std::string column_name;
  std::string type;

  MSGPACK_DEFINE_MAP(flight_descriptor, column_name, type)
};

The flight_descriptor field is the Arrow Flight serialized FlightDescriptor structure.The type field is the DuckDB data type name, i.e. VARCHAR, TIMESTAMP WITH TIME ZONE.

Output Results

The response is a msgpack encoded GetFlightColumnStatisticsResult structure.

struct GetFlightColumnStatisticsResult
{
  //! Whether or not the segment can contain NULL values
  bool has_null;
  //! Whether or not the segment can contain values that are not null
  bool has_no_null;
  // estimate that one may have even if distinct_stats==nullptr
  idx_t distinct_count;

  //! Numeric and String stats
  GetFlightColumnStatisticsNumericStatsData numeric_stats;
  GetFlightColumnStatisticsStringData string_stats;

  MSGPACK_DEFINE_MAP(has_null, has_no_null, distinct_count, numeric_stats, string_stats)
};

This in turn references a few additional structures. The first is for strings.

struct GetFlightColumnStatisticsStringData
{
  std::string min;
  std::string max;
  MSGPACK_DEFINE_MAP(min, max)
};

Returning statistics for numeric types is a bit more complicated since there are different levels of precision and msgpack doesn’t support 128-bit integers. So they’ve been split into a high and low 64-bit integer.

struct GetFlightColumnStatisticsNumericStatsData
{
  //! Whether or not the value has a max value
  bool has_min;
  //! Whether or not the segment has a min value
  bool has_max;
  //! The minimum value of the segment
  GetFlightColumnStatisticsNumericValue min;
  //! The maximum value of the segment
  GetFlightColumnStatisticsNumericValue max;

  MSGPACK_DEFINE_MAP(has_min, has_max, min, max)
};

struct GetFlightColumnStatisticsNumericValue
{
  bool boolean;
  int8_t tinyint;
  int16_t smallint;
  int32_t integer;
  int64_t bigint;
  uint8_t utinyint;
  uint16_t usmallint;
  uint32_t uinteger;
  uint64_t ubigint;

  uint64_t hugeint_high;
  uint64_t hugeint_low;

  float float_;
  double double_;

  MSGPACK_DEFINE_MAP(
      boolean, tinyint,
      smallint, integer,
      bigint, utinyint,
      usmallint, uinteger,
      ubigint,
      hugeint_high, hugeint_low,
      float_, double_)
};
Note

There is an experimental schema defined by Apache Arrow for statistics. In the future that schema may be adopted rather than the msgpack schema.