The problem that needed to be solved:
There is a SQL Server instance with a database of about one billion records, and it can no longer cope with building reports for certain queries: a single query could run for one, two, or even three hours.
It was decided to move all the data stored in SQL Server into a ClickHouse database.
After some research, the following Python script was put together; it relies on the pandas library.
We run it on the server where ClickHouse is installed.
# libraries for ClickHouse
from sqlalchemy import create_engine
import pandas as pd
# engine for ClickHouse; connect to it
uri = 'clickhouse+native://IP address of ClickHouse server:9000/Database name'
engine_click = create_engine(uri)
# engine for MSSQL
server = 'IP address of the SQL server'
database = 'Database'
driver='driver=/opt/microsoft/msodbcsql18/lib64/libmsodbcsql-18.2.so.1.1'
user = 'username to connect to the SQL database'
password = 'password to connect to the SQL database'
engine_ms = create_engine(f'mssql+pyodbc://{user}:{password}@{server}/{database}?{driver}&Encrypt=yes&TrustServerCertificate=yes')
# Query for MSSQL. Selecting everything from the table
sql_str = '''SELECT * FROM Database.TableName'''
# With chunksize set, read_sql_query returns an iterator over DataFrame chunks of one hundred thousand records each
df = pd.read_sql_query(sql_str, engine_ms.connect(), chunksize=100000)
for chunk in df:
    # For each chunk we can change data types, transform values, filter rows, and so on, before sending it to ClickHouse
    chunk['LogonType'] = chunk['LogonType'].fillna(0)
    chunk['LogonType'] = chunk['LogonType'].astype('int')
    # write the portion to ClickHouse
    chunk.to_sql('AccessEvents', engine_click, if_exists='append', index=False)
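One caveat: to_sql will try to create the target table if it does not exist, but it has no obvious way to pick a ClickHouse table engine such as MergeTree, so it is usually safer to create AccessEvents up front. Below is a minimal sketch of such a DDL run through the same SQLAlchemy engine; every column except LogonType, as well as the ORDER BY key, is a hypothetical placeholder to be replaced with the real layout of the source table.
from sqlalchemy import create_engine, text
# same connection string as in the script above
engine_click = create_engine('clickhouse+native://IP address of ClickHouse server:9000/Database name')
# hypothetical column set and sorting key; adjust to match the MSSQL table
ddl = text('''
CREATE TABLE IF NOT EXISTS AccessEvents
(
    EventTime DateTime,
    UserName  String,
    LogonType Int32
)
ENGINE = MergeTree
ORDER BY EventTime
''')
with engine_click.connect() as conn:
    conn.execute(ddl)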
Copying one billion records in our example took several hours.
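Once the copy finishes, it makes sense to check that the row counts match on both sides. A small sketch, assuming the engine_ms and engine_click objects and the table names from the script above:
# compare the number of rows in the source and target tables
ms_count = pd.read_sql_query('SELECT COUNT(*) AS cnt FROM Database.TableName', engine_ms.connect())
ch_count = pd.read_sql_query('SELECT count() AS cnt FROM AccessEvents', engine_click.connect())
print('MSSQL rows:     ', int(ms_count['cnt'].iloc[0]))
print('ClickHouse rows:', int(ch_count['cnt'].iloc[0]))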