Hive MetaStore migration from Embedded Postgres to MySQL

Hive MetaStore migration from Embedded Postgres to MySQL

Recently I was dealing with a case where customer wanted to migrate from Cloudera Manager’s Embedded PostgreSQL to MySQL due to the fact that the Embedded PostgreSQL provided by Cloudera Manager is not recommended for production use, instead, using external databases like MySQL, PostgreSQL or Oracle etc is highly recommended. The Embedded PostgreSQL should only be considered to be used by development or testing clusters. For more information, please refer to Cloudera’s documentation regarding Cloudera Manager and Managed Service Datastores. So, this article is focusing on how to migrate from Embedded PostgreSQL to MySQL, due to some incompatibilities between database schema for Hive under MySQL and PostgreSQL (see HIVE-7018 for details, at the time of writing HIVE-7018 is not available in any CDH releases). Assuming that you are using Cloudera Manager, please follow the steps below:
  1. Stop cluster so that no changes will happen to the cluster during the migration
  2. Backup the old PostgreSQL database, including schema and data
  3. Create a new database and user in MySQL, with proper permissions:
    CREATE DATABASE metastore DEFAULT CHARACTER SET utf8;
    GRANT ALL ON metastore.* TO 'hive'@'%' IDENTIFIED BY 'hive_password';
    
  4. Update the database connection details in Cloudera Manager > Hive > Configuration page to use the new database
  5. Go to Hive > “Actions” menu on the top right corner and click on “Create Hive MetaStore Database Tables”
  6. Export the PostgreSQL DB without schema definition, i.e., data only, using the following command:
    pg_dump -U  --column-inserts --format p -h  -p  -f hive-in.sql -a 
    
    the “-a” for “data only” option
  7. Download a php program from http://www.lightbox.ca/pg2mysql/pg2mysql-1.9.tar.bz2. This program can help us to do the SQL conversion from PostgreSQL to MySQL
  8. Install php-cli on your machine. If you use CentOS, simply run:
    sudo yum install php-cli
    
  9. Run the following command to generate the hive-out.sql file (tested on PHP5.3)
    php pg2mysql-1.9/pg2mysql_cli.php hive-in.sql hive-out.sql InnoDB
    
  10. Open the hive-out.sql file in any editor and add:
    SET FOREIGN_KEY_CHECKS=0;
    
    to the beginning of the file and:
    SET FOREIGN_KEY_CHECKS=1;
    
    to the end of the file, so that no foreign keys will be checked during data import.
  11. Update the following three tables to UPPER CASE, so change from:
    INSERT INTO next_compaction_queue_id (ncq_next) VALUES (1);
    INSERT INTO next_lock_id (nl_next) VALUES (1);
    INSERT INTO next_txn_id (ntxn_next) VALUES (1);
    
    to:
    INSERT INTO NEXT_COMPACTION_QUEUE_ID (ncq_next) VALUES (1);
    INSERT INTO NEXT_LOCK_ID (nl_next) VALUES (1);
    INSERT INTO NEXT_TXN_ID (ntxn_next) VALUES (1);
    
    because those tables are in the UPPER case in MySQL version.
  12. Remove the following line in the hive-out.sql file:
    INSERT INTO VERSION (VER_ID, SCHEMA_VERSION, VERSION_COMMENT) VALUES (1, '1.1.0', 'Hive release version 1.1.0');
    
    As we have used Cloudera Manager to generate the schema for us in MySQL at step 5 and this table has been populated automatically.
  13. Put hive-out.sql file on the MySQL server host
  14. Log into MySQL and then run command:
    USE metastore;
    SOURCE /path/to/hive-out.sql;
    
  15. Now ready to “Start” Hive service and test out if everything is OK.

Leave a Reply

Your email address will not be published.

My new Snowflake Blog is now live. I will not be updating this blog anymore but will continue with new contents in the Snowflake world!