Strategyfiles
March 18, 2024 · View on GitHub
Strategyfiles are configurations containing the "What" and the "How" of the anonymization process for a database.
pynonymizer can parse a strategyfile into a DatabaseStrategy, which will be used as a set of instructions during an anonymization run.
Format
Strategyfiles can be written in json or yaml, as the basic structure is the same. As long as your file has a well-formed file + extension, pynonymizer should be able to use the appropriate parser.
The below examples will be given in yaml, although the underlying structures will be the same in both languages.
tables:
accounts:
columns:
password: ( '' )
current_sign_in_ip: ipv4_public
last_sign_in_ip: ipv4_public
created_at: ( NOW() )
username: user_name
email:
type: fake_update
fake_type: company_email
where: username != 'admin'
transactions: truncate
other_important_table: truncate
secrets: delete
scripts:
before:
- DELETE FROM config where name = 'secret';
after:
- select * from config;
Philosophy
Wherever there are multiple options for a key, such as a table or update_column strategy, the "verbose" syntax will be a dict, containing a type key that specifies the type of configuration.
tables:
table_example:
type: table_type
# [...]
In certain circumstances, the type key can be determined automatically by the content. For example, when a column key is set to unique_login, the type is automatically detected to be the unique_login, rather than a fake_update type with fake_type: unique_login.
When the type is automatically determined, this is called the "compact" syntax.
The compact syntax has significantly fewer options but allows you to build many common configurations quickly and easily.
Syntax
Key: tables
Tables is a top-level key, containing a either a subkey for each table_name:table_strategypair, or a list of verbose-style objects.
The latter syntax is useful if you want to specify multiple strategies for the same table in different schemas or different selective options.
# `tables` key in compact and list type syntax.
# The two examples below are functionally equivalent
tables:
table1: truncate
table2:
columns: [...]
# List-of-objects verbose syntax
tables:
- table_name: table1
type: truncate
- table_name: table2
type: update_column
columns: [...]
Each table has an individual strategy.
truncateupdate_columnsdelete
Table Strategy: truncate
Wipe the entire table, preferably using a TRUNCATE statement, which normally does not respect table constraints.
table_name:
type: truncate
# Compact Syntax:
table_name: truncate
Table Strategy: delete
Wipe the entire table, preferably using a DELETE statement, which normally does respect table constraints.
table_name:
type: delete
# Compact Syntax:
table_name: delete
Table Strategy: update_columns
Overwrite columns in the table with specific or randomized values.
table_name:
type: update_columns
columns:
column_name1: ( '' )
#[...]
# Compact Syntax: `type` can be omitted if `columns` key is present
table_name:
columns:
column_name1: ( '' )
#[...]
#Columns can also be given in a list of verbose objects, if multiple column strategies for the same column are required.
table_name:
columns:
- column_name: column_name1
type: ( '' )
where: cake OR death
- column_name: column_name1
type: ( '' )
where
column_name:
type: ( '' )
where: username = 'barry'
All update_column types support a where key, that can be used to add conditions to updates. This is only available in the verbose syntax, and must be a string of predicate(s) that will be appended to the query. Pynonymizer does not check syntax for sql, so you must ensure that what you've written works in your database's language.
Update statements will be grouped on the content of the where key and executed together, i.e. if you have 2 unique WHERE statements, pynonymizer will execute 3 update statements: 1 with 1st where clause, 1 with 2nd where clause, 3rd with no where clause.
Column Strategy: unique_login
Replaces a column with a unique value that vaguely resembles a username.
column_name:
type: unique_login
# Compact Syntax: string "unique_login"
column_name: unique_login
Column Strategy: unique_email
Replace a column with a unique value that vaguely resembles an email address.
column_name:
type: unique_email
# Compact Syntax: string "unique_email"
column_name: unique_email
Column Strategy: literal
Replace a column with the value specified.
This will not perform escaping for your language, so you must ensure that it is correct and executable by your provider.
column_name:
type: literal
value: RAND()
# Compact Syntax: add parentheses around your value
column_name: ( RAND() )
Column Strategy: fake_update
Replace a column with a value from a generated data pool (non-unique).
column_name:
type: fake_update
fake_type: file_path
# Optional, provide arguments to the faker generator:
# For example: fake.file_path(depth=1, category=None, extension=None)
fake_args:
depth: 1
# Optional, define the SQL type in the database.
# The value will not be escaped, it will be a raw SQL type so be careful.
sql_type: CHAR(255)
# Compact Syntax: specify any supported faker method string, excluding the keywords `unique_login`, `unique_email`
column_name: file_path
You can specify a fake update with any default Faker provider.
While accessible, some generators may not be supported by your provider if they use more complex types or coercion is not properly described. This is an ongoing process to make sure that generators are annotated according to the types they return.
Some generators take arguments which can be specified with the fake_args key (as a dictionary of kwargs).
To get you started, here's a list of some useful generators.
first_namelast_namenameuser_nameemailcompany_emailphone_numbercompanybscatch_phrasejobcitystreet_addresspostcodeuriipv4_privateipv4_publicfile_namefile_pathparagraphprefixrandom_intdate_of_birthfuture_datepast_datefuture_datetimepast_datetimedateuser_agentdomain_namemac_addressisbn13paragraphwordcredit_card_number
Custom generators can be managed using the providers key.
Key: Providers
Providers is a list of fully qualified import paths for faker providers that this strategy
depends on. Examples might be providers you've created or from community providers in the python path e.g. faker_airtravel.AirtravelProvider.
You should specify the full import path of the provider, as would be passed to the
Faker#add_provider method. Pynonymizer will import and manage the rest.
You can then use the new data types as you would any other generator in your strategyfile.
Pynonymizer CLI will automatically include the current working directory in the path for importing custom provider modules.
providers:
- faker_airtravel.AirtravelProvider
- some_local_module.MySpecialProvider
tables:
table_name:
columns:
column_name: my_generator_type
Key: Locale
Some Faker providers support localization. Use this key to localize the generators during the
anonymization process. This option is identical to the cli option -l/--fake-locale (now deprecated)
You can see localized providers here https://faker.readthedocs.io/en/master/locales.html
locale: en_GB
Key: Scripts
Scripts is a top-level key describing SQL statements to be run during the anonymization process. These scripts can also return results that will be entered into the logs.
This might help you to do something specific that pynonymizer may not have a feature for (yet?), or to provider a report on something before you dump/drop/etc.
Scripts are given in the scripts top level key. There are two subkeys, before and after. In both cases, the required input format
is a list of strings to be run as individual scripts.
scripts:
after:
- DELETE FROM students;
- [...]
before:
- DELETE FROM config where name = 'secret';
You should recieve the output from each script in the log. The exact format will depend on the database provider. See the script except below:
[...]
creating seed table with 3 columns
Inserting seed data
Inserting seed data: 100%|██████████████████| 150/150 [00:05<00:00, 25.98rows/s]
Running before script 0 "SELECT COUNT(1) FROM `accounts`;"
COUNT(1)
1817
Running before script 1 "SELECT SLEEP(1);"
SLEEP(1)
0
Anonymizing 2 tables
Truncating transactions: 100%|████████████████████| 2/2 [00:00<00:00, 2.66it/s]
Running after script 0: "SELECT COUNT(1) FROM `accounts`;"
COUNT(1)
1817
[...]
before
Before takes place just before the anonymization starts, after any preparation by the database provider (e.g. seed table, etc)
after
After takes place directly after the anonymization.