Modifying primary_key/sort_key for an existing table that has materialized views


I have a table stats_view_source that is described as follows:

CREATE TABLE stats_view_source
(
    `timestamp` DateTime,
    `user_ip` String,
    `user_country` Nullable(String),
    `user_language` Nullable(String),
    `user_agent` Nullable(String),
    `entity_type` String,
    `entity_id` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/.../{uuid}/{shard}', '{replica}')
PRIMARY KEY (entity_type, entity_id)
ORDER BY (entity_type, entity_id, timestamp)
SETTINGS index_granularity = 8192

It has multiple materialized views attached to it that aggregate daily, monthly and yearly stats into different tables.

I discovered too late that I need user_ip to be part of the sort key, so that multiple rows with the same sort key values in the same second do not get deduplicated. I attempted to rectify this by modifying the sort key to (entity_type, entity_id, user_ip, timestamp); however, after investigating the errors from that query, I am now aware that this only works with newly added columns, and to add an existing column to the sort key I have to recreate the table (https://github.com/ClickHouse/ClickHouse/issues/13835).
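For reference, this is roughly the statement I tried (reconstructed from my description above, not a verbatim copy); ClickHouse rejects it because user_ip is an existing column rather than one added in the same ALTER:

```sql
-- Fails: MODIFY ORDER BY may only append columns that are
-- newly added in the same ALTER, not pre-existing ones.
ALTER TABLE stats_view_source
    MODIFY ORDER BY (entity_type, entity_id, user_ip, timestamp);
```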

My concern is that the tables already hold a few hundred thousand rows, aggregated through multiple materialized views, so I cannot start from a clean slate.
Since I am forced to recreate the table, I am looking for the best way to accomplish this.

  • I need to create a new temporary table:
CREATE TABLE IF NOT EXISTS stats_view_source_tmp
(
    `timestamp` DateTime,
    `user_ip` String,
    `user_country` Nullable(String),
    `user_language` Nullable(String),
    `user_agent` Nullable(String),
    `entity_type` String,
    `entity_id` String
)
ENGINE = ReplicatedMergeTree()
PRIMARY KEY (entity_type, entity_id, user_ip)
ORDER BY (entity_type, entity_id, user_ip, timestamp)
SETTINGS index_granularity = 8192;
  • The existing data needs to be copied from the old table to the temporary one without re-triggering the materialized views; otherwise I am worried I might duplicate data in the _daily, _monthly and _yearly tables. Also, at under a million rows the copy should be easy enough without any special commands, I hope:
INSERT INTO stats_view_source_tmp SELECT * FROM stats_view_source;
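To reassure myself that the copy neither lost rows nor fired the views, I would compare row counts before and after; a minimal sketch (stats_view_source_daily is an assumed name for one of my aggregate target tables, following the _daily naming above):

```sql
-- Both counts should match once the copy completes.
SELECT count() FROM stats_view_source;
SELECT count() FROM stats_view_source_tmp;

-- The aggregate table should be unchanged, since the views
-- are attached to stats_view_source, not to the _tmp table.
SELECT count() FROM stats_view_source_daily;
```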
  • swap the old table with the temporary one (EXCHANGE TABLES requires the Atomic database engine):
EXCHANGE TABLES stats_view_source_tmp AND stats_view_source;
  • The materialized views should then automatically trigger from the new table and be disconnected from the old one.
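As a final cleanup step (my assumption of how this would end, not something I have run yet): after the swap, the old table lives under the _tmp name, and once everything is verified it could be dropped:

```sql
-- After EXCHANGE TABLES, stats_view_source_tmp holds the OLD data.
-- Drop it only once the new table and its views are verified.
DROP TABLE stats_view_source_tmp;
```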

What are my options?
Is my suggested path correct, or are there better ways to accomplish this?
