There is a necessity in several Web application projects to pull valid data from production and test against the same in Pre-prod environments
Fortunately or Unfortunately there are compliance issues associated with using production data.
For example in the medical domain you have HIPAA compliance, in aerospace, banking and other domains you have GDPR which prohibit disclosing production environment user details , what is appropriately called PHI (Personal Health Information) to unauthorized users.
Therefore it becomes necessary to de-identify or randomize production data before using it in a non-production environment.
Presented below are some queries and functions which will help a developer to :
1. Export data from a production database into a dump file
2. Import either the entire data or a subset of the tables into a pre-prod database.
3. Change user first names and last name to random anglical names.
4. Change key patient attributes like social security number, mobile phone number, email id , address,zipcode to valid but random value.
5. To verify the de-identified pre-prod database schema for correctness
All the queries are presented with reference to MySQL database. But the same can be replicated for other database like MS SqlServer or Postgresql
Assumptions :
my production database schema name is : batman_prod
user id is : joker
password is : joker#007
database hostname : darkKnight
my production database schema name is : batman_uat
user id is : nolan
password is : nolan#007
database hostname : punkRock
developer test email id :
test.developer.google@gmail.com
Export Data
mysqldump --routines -u joker -pjoker#007 -h darkKnight batman_prod > /path/batman_full_proddbdump-date +%F_%T.sql
mysqldump --routines -u joker -pjoker#007 -h darkKnight batman_prod BILL_TABLE BOB_TABLE JERRY_TABLE DAVID_TABLE > /path/batman_part_proddbdump-date +%F_%T.sql
Import data
mysql -h punkRock -P 3306 -u nolan -pnolan#007 batman_uat < batman_part_proddbdump-2019-08-25_00_00_01.sql
De-identify
update HOSPITAL_USER set first_name = ELT(FLOOR(1 + (RAND() * (100-1))), "James","Mary","John","Patricia","Robert","Linda","Michael","Barbara","William","Elizabeth","David","Jennifer","Richard","Maria","Charles","Susan","Joseph","Margaret","Thomas","Dorothy","Christopher","Lisa","Daniel","Nancy","Paul","Karen","Mark","Betty","Donald","Helen","George","Sandra","Kenneth","Donna","Steven","Carol","Edward","Ruth","Brian","Sharon","Ronald","Michelle","Anthony","Laura","Kevin","Sarah","Jason","Kimberly","Matthew","Deborah","Gary","Jessica","Timothy","Shirley","Jose","Cynthia","Larry","Angela","Jeffrey","Melissa","Frank","Brenda","Scott","Amy","Eric","Anna","Stephen","Rebecca","Andrew","Virginia","Raymond","Kathleen","Gregory","Pamela","Joshua","Martha","Jerry","Debra","Dennis","Amanda","Walter","Stephanie","Patrick","Carolyn","Peter","Christine","Harold","Marie","Douglas","Janet","Henry","Catherine","Carl","Frances","Arthur","Ann","Ryan","Joyce","Roger","Diane");
update HOSPITAL_USER set last_name = ELT(FLOOR(1 + (RAND() * (100-1))), "Smith","Johnson","Williams","Jones","Brown","Davis","Miller","Wilson","Moore","Taylor","Anderson","Thomas","Jackson","White","Harris","Martin","Thompson","Garcia","Martinez","Robinson","Clark","Rodriguez","Lewis","Lee","Walker","Hall","Allen","Young","Hernandez","King","Wright","Lopez","Hill","Scott","Green","Adams","Baker","Gonzalez","Nelson","Carter","Mitchell","Perez","Roberts","Turner","Phillips","Campbell","Parker","Evans","Edwards","Collins","Stewart","Sanchez","Morris","Rogers","Reed","Cook","Morgan","Bell","Murphy","Bailey","Rivera","Cooper","Richardson","Cox","Howard","Ward","Torres","Peterson","Gray","Ramirez","James","Watson","Brooks","Kelly","Sanders","Price","Bennett","Wood","Barnes","Ross","Henderson","Coleman","Jenkins","Perry","Powell","Long","Patterson","Hughes","Flores","Washington","Butler","Simmons","Foster","Gonzales","Bryant","Alexander","Russell","Griffin","Diaz","Hayes");
update batman_uat.HOSPITAL_PATIENT_DETAILS set ssn_id = str_random('(d{3})-d{2}-d{4}');
update batman_uat.HOSPITAL_PATIENT_DETAILS set email = CONCAT('test.developer.google+', first_name ,last_name,'@gmail.com');
Here we use the unique feature available in gmail where you can append any string to a valid email id test.developer.google with the sign + and generate N number of unique email ids. That way a single inbox can be used to test send/receive of emails on behalf of several user ids.
update batman_uat.HOSPITAL_PATIENT_DETAILS set address = concat(str_random('(d{5}) '), ELT(FLOOR(1 + (RAND() * (20-1))),"Second","Third","First","Fourth","Park","Fifth","Main","Sixth","Oak","Seventh","Pine","Maple","Cedar","Eighth","Elm","View","Washington","Ninth","Lake","Hill" ),str_random(' [street|lane|road|park] '))
Here I used the 20 most commonly found street names in the US to randomize the address.
update batman_uat.HOSPITAL_PATIENT_DETAILS set zipcode = str_random('(d{5})');
update batman_uat.HOSPITAL_PATIENT_DETAILS set dob = str_random_date('1920-01-01','2018-12-31','%Y-%m-%d');
update batman_uat.HOSPITAL_PATIENT_DETAILS set mobile_phone = str_random('(d{3})-d{3}-d{4}');
Below are some queries to check the resultant pre-prod schema
Find total number of rows by each table in all tables :
SELECT table_name, table_rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'batman_uat';
Find tables matching a given column name:
SELECT DISTINCT TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE 'hospital_patient_first_name'
References:
fn_str_random.sql and fn_str_random_date.sql : MySQL Function code below. Thanks to the author !
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='STRICT_TRANS_TABLES';
DELIMITER //
DROP FUNCTION IF EXISTS str_random;
//
CREATE FUNCTION str_random(p_pattern VARCHAR(200))
RETURNS VARCHAR(2000)
NO SQL
BEGIN
/**
* String function. Returns a random string based on a mask
* <br>
* %author Ronald Speelman
* %version 2.3
* Example usage:
* SELECT str_random('dddd CC') AS DutchZipCode;
* SELECT str_random('d{4} C{2}') AS DutchZipCode;
* SELECT str_random('*{5}*(4)') AS password;
* select str_random('Cccc(4)') as name;
* SELECT str_random('#X{6}') AS htmlColorCode;
* See more complex examples and a description on www.moinne.com/blog/ronald
*
* %param p_pattern String: the pattern describing the random values
* MASKS:
* c returns lower-case character [a-z]
* C returns upper-case character [A-Z]
* A returns either upper or lower-case character [a-z A-Z]
* d returns a digit [0-9]
* D returns a digit without a zero [1-9]
* b returns a bit [0-1]
* X returns hexadecimal character [0-F]
* * returns characters, decimals and special characters [a-z A-Z 0-9 !?-_@$#]
* DIRECTIVES
* "text" : text is taken literally
* {nn} : repeat the last mask nn times
* (nn) : repeat random, but max nn times
* [item|item] : pick a random item from this list, items are separated by a pipe symbol
* All other characters are taken literally
* %return String
*/
DECLARE v_iter SMALLINT DEFAULT 1;
DECLARE v_char VARCHAR(1) DEFAULT '';
DECLARE v_next_char VARCHAR(1) DEFAULT '';
DECLARE v_list VARCHAR(200) DEFAULT '';
DECLARE v_text VARCHAR(200) DEFAULT '';
DECLARE v_result VARCHAR(2000) DEFAULT '';
DECLARE v_count SMALLINT DEFAULT 0;
DECLARE v_jump_characters TINYINT DEFAULT 0;
DECLARE v_end_position SMALLINT DEFAULT 0;
DECLARE v_list_count TINYINT DEFAULT 0;
DECLARE v_random_item TINYINT DEFAULT 0;
WHILE v_iter <= CHAR_LENGTH(p_pattern) DO
SET v_char := BINARY SUBSTRING(p_pattern,v_iter,1);
SET v_next_char := BINARY SUBSTRING(p_pattern,(v_iter + 1),1);
-- check if text is a fixed text
IF (v_char = '"') THEN
-- get the text
SET v_end_position := LOCATE('"', p_pattern, v_iter + 1);
SET v_text := SUBSTRING(p_pattern,v_iter + 1,(v_end_position - v_iter) - 1);
-- add the text to the result
SET v_result := CONCAT(v_result, v_text);
SET v_iter := v_iter + CHAR_LENGTH(v_text) + 2;
-- if character has a count specified: repeat it
ELSEIF (v_next_char = '{') OR (v_next_char = '(') THEN
-- find out what the count is (max 999)...
IF (SUBSTRING(p_pattern,(v_iter + 3),1) = '}') OR
(SUBSTRING(p_pattern,(v_iter + 3),1) = ')') THEN
SET v_count := SUBSTRING(p_pattern,(v_iter + 2),1);
SET v_jump_characters := 4;
ELSEIF (SUBSTRING(p_pattern,(v_iter + 4),1) = '}') OR
(SUBSTRING(p_pattern,(v_iter + 4),1) = ')')THEN
SET v_count := SUBSTRING(p_pattern,(v_iter + 2),2);
SET v_jump_characters := 5;
ELSEIF (SUBSTRING(p_pattern,(v_iter + 5),1) = '}') OR
(SUBSTRING(p_pattern,(v_iter + 5),1) = ')')THEN
SET v_count := SUBSTRING(p_pattern,(v_iter + 2),3);
SET v_jump_characters := 6;
ELSE
SET v_count := 0;
SET v_jump_characters := 3;
END IF;
-- if random count: make it random with a max of count
IF (v_next_char = '(') THEN
SET v_count := FLOOR((RAND() * v_count));
END IF;
-- repeat the characters
WHILE v_count > 0 DO
SET v_result := CONCAT(v_result,str_random_character(v_char));
SET v_count := v_count - 1;
END WHILE;
SET v_iter := v_iter + v_jump_characters;
-- check if there is a list in the pattern
ELSEIF (v_char = '[') THEN
-- get the list
SET v_end_position := LOCATE(']', p_pattern, v_iter + 1);
SET v_list := SUBSTRING(p_pattern,v_iter + 1,(v_end_position - v_iter) - 1);
-- find out how many items are in the list, items are seperated by a pipe
SET v_list_count := (LENGTH(v_list) - LENGTH(REPLACE(v_list, '|', '')) + 1);
-- pick a random item from the list
SET v_random_item := FLOOR(1 + (RAND() * v_list_count));
-- add the item from the list
SET v_result := CONCAT(v_result,
REPLACE(SUBSTRING(SUBSTRING_INDEX(v_list, '|' ,v_random_item),
CHAR_LENGTH(SUBSTRING_INDEX(v_list,'|', v_random_item -1)) + 1),
'|', '')
);
SET v_iter := v_iter + CHAR_LENGTH(v_list) + 2;
-- no directives: just get a random character
ELSE
SET v_result := CONCAT(v_result, str_random_character(v_char));
SET v_iter := v_iter + 1;
END IF;
END WHILE;
RETURN v_result;
END;
//
DELIMITER ;
SET SQL_MODE=@OLD_SQL_MODE;
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='STRICT_TRANS_TABLES';
DELIMITER //
DROP FUNCTION IF EXISTS str_random_date;
//
CREATE FUNCTION str_random_date(p_date_start VARCHAR(20)
,p_date_end VARCHAR(20)
,p_format VARCHAR(20)
)
RETURNS VARCHAR(50)
NO SQL
BEGIN
/**
* String function. Returns a random date string
* <br>
* %author Ronald Speelman
* %version 1.0
* Example usage:
* SELECT str_random_date('1980-01-01','2012-01-01','%Y-%m-%d') AS MysqlDate;
* See more complex examples and a description on www.moinne.com/blog/ronald
*
* %param p_date_start string: the start date/ time
* %param p_date_end string: the end date/ time
* %param p_format string: the format of the returned date/time
* %return String
*/
DECLARE v_format VARCHAR(20) DEFAULT '%Y-%m-%d';
DECLARE v_rand_secs BIGINT DEFAULT 0;
SET v_format := COALESCE(p_format, v_format);
SET v_rand_secs := FLOOR(0 + (RAND() * (86400 * (DATEDIFF(p_date_end , p_date_start)))));
RETURN DATE_FORMAT(DATE_ADD(p_date_start , INTERVAL v_rand_secs SECOND),v_format);
END;
//
DELIMITER ;
SET SQL_MODE=@OLD_SQL_MODE;