Migrating PostgreSQL Data Using Pandas
Bhupin Baral
January 3, 2022
Pandas is the most popular data wrangling package, widely used in data science, data analysis, and machine learning tasks. It is an open source package that works with the other data science modules in the Python ecosystem and is available in any Python distribution.
Installation
Open your terminal and run the following command; it will automatically download and install the required packages on your system. You will also need a PostgreSQL driver such as psycopg2, since SQLAlchemy uses it by default for postgresql:// connection strings.
pip install pandas pyyaml sqlalchemy psycopg2-binary
Make a .yaml credential file in your directory
touch cred.yaml
Put all your database credentials in that cred.yaml file. You need to add two different credentials: one to read data from the source database, and another for the destination database where the data is stored after migration.
Database_Credential_1: {
"DB_USER": "db_user_name",
"DB_NAME": "database_name",
"DB_HOST": "db_host_name",
"DB_PASSWORD": "db_user_password"
}
Database_Credential_2: {
"DB_USER": "db_user_name",
"DB_NAME": "database_name",
"DB_HOST": "db_host_name",
"DB_PASSWORD": "db_user_password"
}
Create a .py file, import the required libraries as follows, and read the cred.yaml file.
.
├── cred.yaml
└── data_migration.py
#data_migration.py
import yaml
import pandas as pd
import sqlalchemy

#Read cred.yaml file
with open("cred.yaml", "r") as credential:
    cl = yaml.safe_load(credential)
After reading the credentials, save them into two different dictionaries.
source_dict = cl['Database_Credential_1']
destination_dict = cl['Database_Credential_2']
Create the connection string using an f-string (f-strings provide a way to embed expressions inside string literals with minimal syntax). Write a function that builds the connection string from a credential dictionary and returns an engine, so you can establish connections to both the source and destination databases. SQLAlchemy is the Python package that creates the engine used to connect to a SQL database.
#data_migration.py
def make_engine(credential: dict):
    #Connection string format: postgresql://user:password@host/dbname
    connection_string = (
        f"postgresql://{credential['DB_USER']}:"
        f"{credential['DB_PASSWORD']}@{credential['DB_HOST']}/"
        f"{credential['DB_NAME']}"
    )
    return sqlalchemy.create_engine(connection_string)
Create the connection engines by passing the source and destination dictionaries.
source_engine = make_engine(source_dict)
destination_engine = make_engine(destination_dict)
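As an optional sanity check (a minimal sketch, not part of the migration itself), you can confirm that both engines can reach their databases before moving any data:
#Optional: verify both connections work before migrating
with source_engine.connect():
    pass
with destination_engine.connect():
    pass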
Read the table from the source database using pandas and the connection engine, and store it in a dataframe. You need to provide the name of the table you want to read from the database.
data_frame = pd.read_sql_table('table_name', source_engine)
Identify the primary key and set it as the index of the dataframe.
data_frame = data_frame.set_index("primary_key")
Change the dataframe as per your need: drop or add columns, visualize, manipulate data, and perform many other tasks using the pandas dataframe, then send it back into the database using the to_sql function. If the table name already exists in the destination database, pass if_exists='replace' to the function.
data_frame.to_sql('table_name', destination_engine, if_exists='replace')
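For instance, a hypothetical transformation before writing back might drop one column and add another; old_column, new_column, and default_value below are placeholder names, not part of your table:
#Hypothetical example; column names here are placeholders
data_frame = data_frame.drop(columns=['old_column'])
data_frame['new_column'] = 'default_value'
data_frame.to_sql('table_name', destination_engine, if_exists='replace')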
This migration process takes less time and is more efficient because we do not need to write any schema to add or delete attributes of the pre-existing table in the database. If the data is too large, the chunksize parameter can be used while reading; it creates a generator object, which makes it easy to read large datasets using a small amount of RAM.
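Here is a minimal sketch of a chunked migration, assuming the same engines and placeholder table name as above: the first chunk replaces the destination table, and every later chunk is appended.
#Read the source table in chunks to limit memory usage
chunks = pd.read_sql_table('table_name', source_engine, chunksize=10000)
for i, chunk in enumerate(chunks):
    chunk = chunk.set_index("primary_key")
    #Replace the table with the first chunk, append the rest
    mode = 'replace' if i == 0 else 'append'
    chunk.to_sql('table_name', destination_engine, if_exists=mode)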
After completing all the code, you can run the script, and it will migrate your table from one database to another, with or without changes.
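With both files in the same directory, run the script from your terminal:
python data_migration.py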
Thank you
Bhupin Baral