De-Identifying or Randomizing production data in MySQL database

There is a necessity in several Web application projects to pull valid data from production and test against the same in Pre-prod environments

Fortunately or Unfortunately there are compliance issues associated with using production data.

For example in the medical domain you have HIPAA compliance, in aerospace, banking and other domains you have GDPR which prohibit disclosing production environment user details , what is appropriately called PHI (Personal Health Information) to unauthorized users.

Therefore it becomes necessary to de-identify or randomize production data before using it in a non-production environment.

Presented below are some queries and functions which will help a developer to :

1. Export data from a production database into a dump file

2. Import either the entire data or a subset of the tables into a pre-prod database.

3. Change user first names and last name to random anglical names.

4. Change key patient attributes like social security number, mobile phone number, email id , address,zipcode to valid but random value.

5. To verify the de-identified pre-prod database schema for correctness

All the queries are presented with reference to MySQL database. But the same can be replicated for other database like MS SqlServer or Postgresql

Assumptions :

my production database schema name is : batman_prod

user id is : joker

password is : joker#007

database hostname : darkKnight

my production database schema name is : batman_uat

user id is : nolan

password is : nolan#007

database hostname : punkRock

developer test email id :

test.developer.google@gmail.com

Export Data

mysqldump --routines -u joker -pjoker#007 -h darkKnight batman_prod > /path/batman_full_proddbdump-date +%F_%T.sql
mysqldump --routines -u joker -pjoker#007 -h darkKnight batman_prod BILL_TABLE BOB_TABLE JERRY_TABLE DAVID_TABLE > /path/batman_part_proddbdump-date +%F_%T.sql

Import data

mysql -h punkRock -P 3306 -u nolan -pnolan#007 batman_uat < batman_part_proddbdump-2019-08-25_00_00_01.sql

De-identify

update HOSPITAL_USER set first_name = ELT(FLOOR(1 + (RAND() * (100-1))), "James","Mary","John","Patricia","Robert","Linda","Michael","Barbara","William","Elizabeth","David","Jennifer","Richard","Maria","Charles","Susan","Joseph","Margaret","Thomas","Dorothy","Christopher","Lisa","Daniel","Nancy","Paul","Karen","Mark","Betty","Donald","Helen","George","Sandra","Kenneth","Donna","Steven","Carol","Edward","Ruth","Brian","Sharon","Ronald","Michelle","Anthony","Laura","Kevin","Sarah","Jason","Kimberly","Matthew","Deborah","Gary","Jessica","Timothy","Shirley","Jose","Cynthia","Larry","Angela","Jeffrey","Melissa","Frank","Brenda","Scott","Amy","Eric","Anna","Stephen","Rebecca","Andrew","Virginia","Raymond","Kathleen","Gregory","Pamela","Joshua","Martha","Jerry","Debra","Dennis","Amanda","Walter","Stephanie","Patrick","Carolyn","Peter","Christine","Harold","Marie","Douglas","Janet","Henry","Catherine","Carl","Frances","Arthur","Ann","Ryan","Joyce","Roger","Diane");
update HOSPITAL_USER set last_name = ELT(FLOOR(1 + (RAND() * (100-1))), "Smith","Johnson","Williams","Jones","Brown","Davis","Miller","Wilson","Moore","Taylor","Anderson","Thomas","Jackson","White","Harris","Martin","Thompson","Garcia","Martinez","Robinson","Clark","Rodriguez","Lewis","Lee","Walker","Hall","Allen","Young","Hernandez","King","Wright","Lopez","Hill","Scott","Green","Adams","Baker","Gonzalez","Nelson","Carter","Mitchell","Perez","Roberts","Turner","Phillips","Campbell","Parker","Evans","Edwards","Collins","Stewart","Sanchez","Morris","Rogers","Reed","Cook","Morgan","Bell","Murphy","Bailey","Rivera","Cooper","Richardson","Cox","Howard","Ward","Torres","Peterson","Gray","Ramirez","James","Watson","Brooks","Kelly","Sanders","Price","Bennett","Wood","Barnes","Ross","Henderson","Coleman","Jenkins","Perry","Powell","Long","Patterson","Hughes","Flores","Washington","Butler","Simmons","Foster","Gonzales","Bryant","Alexander","Russell","Griffin","Diaz","Hayes");
update batman_uat.HOSPITAL_PATIENT_DETAILS set ssn_id = str_random('(d{3})-d{2}-d{4}');
update batman_uat.HOSPITAL_PATIENT_DETAILS set email = CONCAT('test.developer.google+', first_name ,last_name,'@gmail.com');

Here we use the unique feature available in gmail where you can append any string to a valid email id test.developer.google with the sign + and generate N number of unique email ids. That way a single inbox can be used to test send/receive of emails on behalf of several user ids.

update batman_uat.HOSPITAL_PATIENT_DETAILS set address = concat(str_random('(d{5}) '), ELT(FLOOR(1 + (RAND() * (20-1))),"Second","Third","First","Fourth","Park","Fifth","Main","Sixth","Oak","Seventh","Pine","Maple","Cedar","Eighth","Elm","View","Washington","Ninth","Lake","Hill" ),str_random(' [street|lane|road|park] '))

Here I used the 20 most commonly found street names in the US to randomize the address.

update batman_uat.HOSPITAL_PATIENT_DETAILS set zipcode = str_random('(d{5})');
update batman_uat.HOSPITAL_PATIENT_DETAILS set dob = str_random_date('1920-01-01','2018-12-31','%Y-%m-%d');
update batman_uat.HOSPITAL_PATIENT_DETAILS set mobile_phone = str_random('(d{3})-d{3}-d{4}');

Below are some queries to check the resultant pre-prod schema

Find total number of rows by each table in all tables :

SELECT table_name, table_rows

FROM INFORMATION_SCHEMA.TABLES

WHERE TABLE_SCHEMA = 'batman_uat';

Find tables matching a given column name:

SELECT DISTINCT TABLE_NAME, COLUMN_NAME

FROM INFORMATION_SCHEMA.COLUMNS

WHERE COLUMN_NAME LIKE 'hospital_patient_first_name'

References:

fn_str_random.sql and fn_str_random_date.sql : MySQL Function code below. Thanks to the author !

SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='STRICT_TRANS_TABLES';

DELIMITER //
DROP FUNCTION IF EXISTS str_random;
//

CREATE FUNCTION str_random(p_pattern VARCHAR(200))
    RETURNS VARCHAR(2000)
    NO SQL
    BEGIN
    /**
    * String function. Returns a random string based on a mask
    * <br>
    * %author Ronald Speelman
    * %version 2.3
    * Example usage:
    * SELECT str_random('dddd CC') AS DutchZipCode;
    * SELECT str_random('d{4} C{2}') AS DutchZipCode;
    * SELECT str_random('*{5}*(4)') AS password;
    * select str_random('Cccc(4)') as name;
    * SELECT str_random('#X{6}') AS htmlColorCode;
    * See more complex examples and a description on www.moinne.com/blog/ronald
    *
    * %param p_pattern String: the pattern describing the random values
    *                          MASKS:
    *                          c returns lower-case character [a-z]
    *                          C returns upper-case character [A-Z]
    *                          A returns either upper or lower-case character [a-z A-Z]
    *                          d returns a digit [0-9]
    *                          D returns a digit without a zero [1-9]
    *                          b returns a bit [0-1]
    *                          X returns hexadecimal character [0-F]
    *                          * returns characters, decimals and special characters [a-z A-Z 0-9 !?-_@$#]
    *                          DIRECTIVES
    *                          "text"      : text is taken literally
    *                          {nn}        : repeat the last mask nn times
    *                          (nn)        : repeat random, but max nn times
    *                          [item|item] : pick a random item from this list, items are separated by a pipe symbol
    *                          All other characters are taken literally
    * %return String
    */

    DECLARE v_iter              SMALLINT DEFAULT 1;
    DECLARE v_char              VARCHAR(1) DEFAULT '';
    DECLARE v_next_char         VARCHAR(1) DEFAULT '';
    DECLARE v_list              VARCHAR(200) DEFAULT '';
    DECLARE v_text              VARCHAR(200) DEFAULT '';
    DECLARE v_result            VARCHAR(2000) DEFAULT '';
    DECLARE v_count             SMALLINT DEFAULT 0;
    DECLARE v_jump_characters   TINYINT DEFAULT 0;
    DECLARE v_end_position      SMALLINT DEFAULT 0;
    DECLARE v_list_count        TINYINT DEFAULT 0;
    DECLARE v_random_item       TINYINT DEFAULT 0;

    WHILE v_iter <= CHAR_LENGTH(p_pattern) DO

        SET v_char := BINARY SUBSTRING(p_pattern,v_iter,1);
        SET v_next_char := BINARY SUBSTRING(p_pattern,(v_iter + 1),1);

        -- check if text is a fixed text
        IF (v_char = '"') THEN
            -- get the text
            SET v_end_position := LOCATE('"', p_pattern, v_iter + 1);
            SET v_text := SUBSTRING(p_pattern,v_iter + 1,(v_end_position - v_iter) - 1);
            -- add the text to the result
            SET v_result := CONCAT(v_result, v_text);
            SET v_iter := v_iter + CHAR_LENGTH(v_text) + 2;
        -- if character has a count specified: repeat it
        ELSEIF (v_next_char = '{') OR (v_next_char = '(') THEN
            -- find out what the count is (max 999)...
            IF (SUBSTRING(p_pattern,(v_iter + 3),1) = '}') OR
               (SUBSTRING(p_pattern,(v_iter + 3),1) = ')') THEN
                SET v_count := SUBSTRING(p_pattern,(v_iter + 2),1);
                SET v_jump_characters := 4;
            ELSEIF (SUBSTRING(p_pattern,(v_iter + 4),1) = '}') OR
                   (SUBSTRING(p_pattern,(v_iter + 4),1) = ')')THEN
                SET v_count := SUBSTRING(p_pattern,(v_iter + 2),2);
                SET v_jump_characters := 5;
            ELSEIF (SUBSTRING(p_pattern,(v_iter + 5),1) = '}') OR
                   (SUBSTRING(p_pattern,(v_iter + 5),1) = ')')THEN
                SET v_count := SUBSTRING(p_pattern,(v_iter + 2),3);
                SET v_jump_characters := 6;
            ELSE
                SET v_count := 0;
                SET v_jump_characters := 3;
            END IF;
            -- if random count: make it random with a max of count
            IF (v_next_char = '(') THEN
                SET v_count := FLOOR((RAND() * v_count));
            END IF;
            -- repeat the characters
            WHILE v_count > 0 DO
                SET v_result := CONCAT(v_result,str_random_character(v_char));
                SET v_count := v_count - 1;
            END WHILE;
            SET v_iter := v_iter + v_jump_characters;
        -- check if there is a list in the pattern
        ELSEIF (v_char = '[') THEN
            -- get the list
            SET v_end_position := LOCATE(']', p_pattern, v_iter + 1);
            SET v_list := SUBSTRING(p_pattern,v_iter + 1,(v_end_position - v_iter) - 1);
            -- find out how many items are in the list, items are seperated by a pipe
            SET v_list_count := (LENGTH(v_list) - LENGTH(REPLACE(v_list, '|', '')) + 1);
            -- pick a random item from the list
            SET v_random_item := FLOOR(1 + (RAND() * v_list_count));
            -- add the item from the list
            SET v_result := CONCAT(v_result,
                                   REPLACE(SUBSTRING(SUBSTRING_INDEX(v_list, '|' ,v_random_item),
                                           CHAR_LENGTH(SUBSTRING_INDEX(v_list,'|', v_random_item -1)) + 1),
                                           '|', '')
                                  );
            SET v_iter := v_iter + CHAR_LENGTH(v_list) + 2;
        -- no directives: just get a random character
        ELSE
            SET v_result := CONCAT(v_result, str_random_character(v_char));
            SET v_iter := v_iter + 1;
        END IF;

   END WHILE;

   RETURN v_result;
END;
//
DELIMITER ;

SET SQL_MODE=@OLD_SQL_MODE;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='STRICT_TRANS_TABLES';

DELIMITER //
DROP FUNCTION IF EXISTS str_random_date;
//

CREATE FUNCTION str_random_date(p_date_start VARCHAR(20)
                               ,p_date_end VARCHAR(20)
                               ,p_format VARCHAR(20)
                                 )
    RETURNS VARCHAR(50) 
    NO SQL
    BEGIN
    /**
    * String function. Returns a random date string
    * <br>
    * %author Ronald Speelman
    * %version 1.0
    * Example usage:
    * SELECT str_random_date('1980-01-01','2012-01-01','%Y-%m-%d') AS MysqlDate;
    * See more complex examples and a description on www.moinne.com/blog/ronald
    *
    * %param p_date_start   string: the start date/ time
    * %param p_date_end     string: the end date/ time
    * %param p_format       string: the format of the returned date/time 
    * %return String
    */

    DECLARE v_format VARCHAR(20) DEFAULT '%Y-%m-%d';
    DECLARE v_rand_secs BIGINT DEFAULT 0;

    SET v_format := COALESCE(p_format, v_format);
    SET v_rand_secs  := FLOOR(0 + (RAND() * (86400 * (DATEDIFF(p_date_end , p_date_start)))));
    RETURN DATE_FORMAT(DATE_ADD(p_date_start , INTERVAL  v_rand_secs SECOND),v_format);
END;
//
DELIMITER ;

SET SQL_MODE=@OLD_SQL_MODE;

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: