The problem that needed to be solved:

There is a SQL Server database of about one billion records that can no longer cope with preparing reports for certain queries: a single request could take one, two, or even three hours to run.

It was decided to transfer all the data stored in SQL Server to a ClickHouse database.

 

After some searching, the following Python script was written. It uses the pandas library.

We run it on the server where ClickHouse is installed.

 

 

# libraries for ClickHouse and pandas
from sqlalchemy import create_engine
import pandas as pd

# engine for ClickHouse. Let's connect
uri = 'clickhouse+native://IP address of ClickHouse server:9000/Database name'
engine_click = create_engine(uri)

# engine for MSSQL
server = 'IP address of the SQL server'
database = 'Database'
driver = 'driver=/opt/microsoft/msodbcsql18/lib64/libmsodbcsql-18.2.so.1.1'
user = 'username to connect to the SQL database'
password = 'password'
engine_ms = create_engine(f'mssql+pyodbc://{user}:{password}@{server}/{database}?{driver}&Encrypt=yes&TrustServerCertificate=yes')

# Query for MSSQL. Select everything from the table
sql_str = '''SELECT * FROM Database.TableName'''

# Read the result in portions (chunks) of one hundred thousand records each
df = pd.read_sql_query(sql_str, engine_ms.connect(), chunksize=100000)

for chunk in df:
    # For each chunk we can change data types, clean up the data, filter, and so on, before sending it to ClickHouse
    chunk['LogonType'] = chunk['LogonType'].fillna(0)
    chunk['LogonType'] = chunk['LogonType'].astype('int')
    # write the portion to ClickHouse
    chunk.to_sql('AccessEvents', engine_click, if_exists='append', index=False)
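One thing to check before the first run: pandas' to_sql will try to create the target table if it does not exist, and the DDL it generates may not include the table engine that ClickHouse normally requires. A safer approach is to create the table up front. Below is a minimal sketch of such a DDL step; the column list and the MergeTree ORDER BY key are assumptions for illustration, not the schema of the original table.

from sqlalchemy import create_engine, text

# reuse the ClickHouse connection string from the script above
engine_click = create_engine('clickhouse+native://IP address of ClickHouse server:9000/Database name')

# Hypothetical schema: adjust the columns and the ORDER BY key to match the source table
ddl = text('''
CREATE TABLE IF NOT EXISTS AccessEvents
(
    EventTime DateTime,
    UserName  String,
    LogonType Int32
)
ENGINE = MergeTree
ORDER BY EventTime
''')

with engine_click.connect() as conn:
    conn.execute(ddl)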

 

Copying one billion records in our example took several hours.
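After the copy finishes, it is worth running a quick sanity check that both sides hold the same number of rows. A minimal sketch that reuses the engine_ms and engine_click engines from the script above (the table names are the same placeholders):

import pandas as pd

# row count on the MSSQL side
src = pd.read_sql_query('SELECT COUNT(*) AS cnt FROM Database.TableName', engine_ms.connect())

# row count on the ClickHouse side
dst = pd.read_sql_query('SELECT count() AS cnt FROM AccessEvents', engine_click.connect())

print('MSSQL rows:     ', int(src['cnt'][0]))
print('ClickHouse rows:', int(dst['cnt'][0]))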
