Create Fictional Customer Datasets with Python & Random Key

Randomkey.io
8 min read · Apr 28, 2020


Names, locations, IDs, credit card numbers and more data are a couple of clicks away.

By the end of this article you will have generated a csv file with personal data for the region of your choice, with elements of your choice (names, addresses, social security numbers, credit card numbers, etc.), and the number of records that you need.

Typical requirements dictate that mock data should be realistic and, in some cases, regional: names should resemble names; social security numbers and credit card numbers must follow the standardised conventions of their issuing bodies; locations must be real places, not imaginary ones.

Random Key was created to do exactly that: the API publishes a variety of endpoints that can be used to generate personal data. The produced data has a realistic feel to it because it is context-aware, while being completely fictional.

We will use Python to get data from Random Key and save it to a csv file.

Convincingly real customer data produced by Random Key

This article runs the code in Jupyter, but the same steps can be executed using Python’s command line or another IDE. If you wish to use Jupyter, install it via pip or conda or, if you don’t have Python on your system, download the Anaconda Data Science bundle that includes both (and more cool things). It’s also possible to achieve the same result with a REST client, like Postman, or using any other language that supports REST calls.

Once ready, start up your environment and follow this guide. If you would rather get going on your own, skip ahead to the last section and download the final notebook as an HTML document or a notebook file.

#1 Prelude: Prepare the environment & register with Random Key

Import the libraries we’ll use to complete the quest: requests for sending and receiving REST requests and responses, json for processing the responses, random for random shuffling of data, and csv for saving data in that format.

import json
import requests
import random
import csv

Register with Random Key, if you haven’t done so already. Substitute the email placeholder with your email address:

url = "https://random.api.randomkey.io/v1/register"

# Provide your email address as the value
payload = "{\"email\": \"youremail@domain.com\"}"
headers = {
    'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data = payload)
response = json.loads(response.text)['body']
print(response)

The response will thank you for registering. Your authentication key will be waiting for you in your inbox.

Check your inbox for the authentication key

Save the key in the token variable so that we can reference it from the requests we are about to engineer:

# replace the token value with your authentication key
token = "51299d7b8f6fde"
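
Hard-coding the key is fine for a throwaway notebook, but it is easy to leak once the notebook is shared. A minimal sketch of an alternative, assuming you have exported the key in an environment variable beforehand; the variable name RANDOMKEY_TOKEN and the helper load_token are made up for this example and are not part of Random Key:

```python
import os

def load_token(var_name="RANDOMKEY_TOKEN"):
    """Return the key stored in an environment variable (the name is hypothetical)."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {var_name} environment variable to your authentication key")
    return value
```

With the variable set, token = load_token() can stand in for the hard-coded assignment.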

Upon registration, Random Key grants you 10,000 free requests. Now that the setup is done, the real fun can start.

#2 Create a Sample Request

Let’s commence by creating a request to the First Name endpoint. The request’s body needs to carry the gender of the person (f for female, m for male), the region you want to localise your data in (fr for France, de for Germany, uk for Great Britain, and us for the United States of America), and the number of records to return.
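
The body can also be assembled from a dictionary with json.dumps, which avoids hand-escaped quotes. This is a sketch only: build_name_payload is a hypothetical helper, the region codes are the four listed above, the upper bound is the per-request limit quoted later in this article, and the lower bound of 1 is an assumption.

```python
import json

REGIONS = {"fr", "de", "uk", "us"}  # region codes listed in this article
MAX_RECORDS = 10_000                # per-request ceiling quoted in this article

def build_name_payload(region, gender, records):
    """Serialise the request body; this helper is hypothetical, not part of Random Key."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region!r}")
    # the lower bound of 1 is an assumption
    if not 1 <= int(records) <= MAX_RECORDS:
        raise ValueError("records must be between 1 and 10,000")
    return json.dumps({"region": region, "gender": gender, "records": int(records)})

print(build_name_payload("fr", "f", 1))
```

The returned string can be passed as the data argument in the same way as the hand-written payload.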

The following snippet returns a single record for a French female first name.

url = "https://random.api.randomkey.io/v1/name/first"

payload = "{\"region\" : \"fr\", \"gender\" : \"f\", \"records\" : 1}"
headers = {
    'auth': token,
    'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data = payload)
response = json.loads(response.text)['name']
print(response)

The name is returned within a list.

The awesome news is that Random Key supports batch requests — it can return up to 10,000 records per request. Let’s give it a go in the next section.

#3 Create a Batch Request

Increase the number of records to 10 and observe the result:

url = "https://random.api.randomkey.io/v1/name/first"

payload = "{\"region\" : \"fr\", \"gender\" : \"f\", \"records\" : 10}"
headers = {
    'auth': token,
    'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data = payload)
response = json.loads(response.text)['name']
print(response)

Just as we asked for, 10 names are returned.
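
If you ever need more records than a single request allows, the total has to be split across several calls. A small sketch of that arithmetic, assuming the 10,000-record ceiling quoted above; batch_sizes is a hypothetical helper:

```python
def batch_sizes(total, limit=10_000):
    """Split a total record count into per-request sizes no larger than limit."""
    if total < 0 or limit < 1:
        raise ValueError("total must be >= 0 and limit >= 1")
    full, rest = divmod(total, limit)
    return [limit] * full + ([rest] if rest else [])

print(batch_sizes(25_000))  # [10000, 10000, 5000]
```

Each size in the list would become the records value of one request.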

Time to take our tests further and wrap the requests into re-usable functions.

#4 Create Function Wrappers

Rather than re-running the snippets and adjusting the body contents for every single endpoint, let’s save them as functions. The configurable bits can be passed as arguments.

The function returning a first name can be specified as:

def gen_fname(name_type,gender,region,records):
    # name_type selects the endpoint: "first" or "last"
    url = "https://random.api.randomkey.io/v1/name/" + name_type

    payload = "{\"region\" : \"" + region + "\", \"gender\" : \"" + gender + "\", \"records\" : " + str(records) + "}"
    headers = {
        'auth': token,
        'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data = payload)
    response = json.loads(response.text)['name']
    return response

Sample run of the function:

The function returning a surname will look like this:

def gen_lname(name_type,gender,region,records):
    # same request as gen_fname; the endpoint is chosen by name_type
    url = "https://random.api.randomkey.io/v1/name/" + name_type

    payload = "{\"region\" : \"" + region + "\", \"gender\" : \"" + gender + "\", \"records\" : " + str(records) + "}"
    headers = {
        'auth': token,
        'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data = payload)
    response = json.loads(response.text)['name']
    return response
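
The two name functions are identical apart from their names, so they could share a single helper. The sketch below is one possible shape, not the article's method: call_endpoint, gen_name, and the post parameter are hypothetical, while the URL, the auth header, and the name key come from the snippets above. The post parameter exists only so that the helper can be exercised without touching the network.

```python
import json

BASE_URL = "https://random.api.randomkey.io/v1"

def call_endpoint(path, body, result_key, token, post=None):
    """POST a JSON body to an endpoint and return one key of the decoded response."""
    if post is None:
        import requests  # imported lazily so the sketch can be tested offline
        post = requests.post
    response = post(
        BASE_URL + path,
        headers={"auth": token, "Content-Type": "application/json"},
        data=json.dumps(body),
    )
    return json.loads(response.text)[result_key]

def gen_name(name_type, gender, region, records, token, post=None):
    """Fetch first or last names; name_type is "first" or "last"."""
    body = {"region": region, "gender": gender, "records": int(records)}
    return call_endpoint("/name/" + name_type, body, "name", token, post)
```

Unlike the functions above, this version takes the token as an argument instead of reading a global variable, which makes it easier to reuse outside the notebook.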

The function returning French locations will look like this:

def gen_location(region,records):
    # returns a list of location records for the given region
    url = "https://random.api.randomkey.io/v1/location"

    payload = "{\"region\": \"" + region + "\", \"records\" : " + str(records) + "}"
    headers = {
        'auth': token,
        'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data = payload)
    response = json.loads(response.text)['location']
    return response

Calling the location function will produce the city and postcode of a place, plus an admin element. The admin element is returned as NA for all French locations; for Germany and the US it holds the state, and for the UK it holds the country.

The French social security number (numéro de sécurité sociale) function can be abstracted as:

def gen_nss(gender,records):
    # returns a list of [nss number, date of birth] records
    url = "https://random.api.randomkey.io/v1/id/nss"

    payload = "{\"gender\": \"" + gender + "\", \"records\" : " + str(records) + "}"
    headers = {
        'auth': token,
        'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data = payload)
    response = json.loads(response.text)['id']
    return response

The numbers are gender-dependent, which is why gender is in the parameter list. The function returns two attributes per record: the number, and a randomly generated date of birth based on the month and year embedded in the NSS.
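
For context, a French NSS begins with one digit for sex (1 for male, 2 for female), followed by two digits for the birth year and two for the birth month. The sketch below reads just those five digits. The helper parse_nss_prefix is hypothetical, the sample number is made up, and the exact string format that Random Key returns (with or without spaces) is an assumption.

```python
def parse_nss_prefix(nss):
    """Read gender, birth year and birth month from the first five digits of an NSS."""
    digits = "".join(ch for ch in nss if not ch.isspace())
    if len(digits) < 5 or not digits[:5].isdigit():
        raise ValueError("expected an NSS starting with five digits")
    gender = {"1": "m", "2": "f"}.get(digits[0], "unknown")
    return {"gender": gender, "year": digits[1:3], "month": digits[3:5]}

# made-up number used only for illustration, not Random Key output
print(parse_nss_prefix("2 84 05 75 123 456 72"))
```

A check like this can confirm that the gender passed to the endpoint and the returned date of birth agree with the number itself.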

Some social security numbers and other ID cards do not carry as much information about an individual: in those instances, the date of birth can be created by calling Random Key’s Date endpoint.

Finally, the function generating credit card numbers will be saved as:

def gen_ccn(records):
    # returns a list of credit card records
    url = "https://random.api.randomkey.io/v1/ccn"

    payload = "{\"records\" : " + str(records) + "}"
    headers = {
        'auth': token,
        'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data = payload)
    response = json.loads(response.text)['ccn']
    return response

In response, the API will return 3 elements per record: the CCN number, the card’s vendor, and the number of digits.
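
One widely used convention for payment card numbers is the Luhn check digit. This article does not state whether Random Key's numbers pass it, so treat the sketch below as a way to find out rather than a guarantee; luhn_valid is a hypothetical helper, and the number in the example is a commonly used test value, not Random Key output.

```python
def luhn_valid(number):
    """Return True if the digits in the string pass the Luhn check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # walk from the rightmost digit, doubling every second one
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))
```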

After such preparation, we can dive straight into generating a multi-row customer dataset.

#5 Create a Multi-Record Dataset

Decide how many records you need: in this demo we will produce a 200-record set, equally split between female and male accounts. Whatever you produce will be subtracted from your free Random Key quota. If you decide to go with our scenario, this will “cost” you 1,000 requests.
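
The figure follows from the five endpoints the scenario calls for each record. A quick sketch of the arithmetic, assuming the quota is charged per record returned, as the figure above implies:

```python
# the five endpoints used in this scenario
endpoints = ["name/first", "name/last", "location", "id/nss", "ccn"]
records_per_gender = 100
genders = 2

# assumption: one quota unit per record returned by each endpoint
cost = len(endpoints) * records_per_gender * genders
print(cost)  # 1000
```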

Run the following snippet to produce 100 female records, structured as first name, last name, gender, date of birth, social security number, city, postcode, credit card number:

# generate first names - female
fnames = gen_fname("first","f","fr","100")

# generate last names - female
lnames = gen_lname("last","0","fr","100")

# shuffle the names - they arrive in alphabetical order
random.shuffle(fnames)
random.shuffle(lnames)

# generate locations
locations = gen_location("fr","100")

# extract city & zip from the list of locations
cities = []
pcs = []
for i in locations:
    city = i[0]
    pc = i[2]
    cities.append(city)
    pcs.append(pc)

# generate nss numbers
ids = gen_nss("f","100")

# extract nss number and its date of birth from the list of nss numbers
nss_ids = []
dobs = []
for i in ids:
    nss_id = i[0]
    dob = i[1]
    nss_ids.append(nss_id)
    dobs.append(dob)

# generate ccn numbers and keep only the number from each record
ccns = gen_ccn("100")
ccn_nums = []
for i in ccns:
    ccn_num = i[0]
    ccn_nums.append(ccn_num)

Combine all parts to form a comma-delimited dataset:

french_customers_f = []
# range(100) covers all 100 records (indexes 0 to 99)
for i in range(100):
    french_customer_f = ",".join((fnames[i],lnames[i],"female",dobs[i],nss_ids[i],cities[i],pcs[i],ccn_nums[i]))
    french_customers_f.append(french_customer_f)

Return the first 10 rows of the set:

french_customers_f[0:10]

Et voilà! Half of our work is done. Now let’s create the male part of the set by adjusting the initial script:

# generate first names - male
fnames = gen_fname("first","m","fr","100")

# generate last names - male
lnames = gen_lname("last","0","fr","100")

# shuffle the names - they arrive in alphabetical order
random.shuffle(fnames)
random.shuffle(lnames)

# generate locations
locations = gen_location("fr","100")

# extract city & zip from the list of locations
cities = []
pcs = []
for i in locations:
    city = i[0]
    pc = i[2]
    cities.append(city)
    pcs.append(pc)

# generate nss numbers
ids = gen_nss("m","100")

# extract nss number and its date of birth from the list of nss numbers
nss_ids = []
dobs = []
for i in ids:
    nss_id = i[0]
    dob = i[1]
    nss_ids.append(nss_id)
    dobs.append(dob)

# generate ccn numbers and keep only the number from each record
ccns = gen_ccn("100")
ccn_nums = []
for i in ccns:
    ccn_num = i[0]
    ccn_nums.append(ccn_num)

Merge the records in one list:

french_customers_m = []
# range(100) covers all 100 records (indexes 0 to 99)
for i in range(100):
    french_customer_m = ",".join((fnames[i],lnames[i],"male",dobs[i],nss_ids[i],cities[i],pcs[i],ccn_nums[i]))
    french_customers_m.append(french_customer_m)

Join the female and male customer sets:

french_customers = french_customers_f + french_customers_m
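
The two combining loops differ only in the gender label, so they could share one helper. A sketch under that assumption; build_rows is hypothetical, the sample values are made up, and the field order matches the records built above:

```python
def build_rows(gender_label, fnames, lnames, dobs, nss_ids, cities, pcs, ccn_nums):
    """Return one comma-joined record per customer; stops at the shortest input list."""
    columns = zip(fnames, lnames, dobs, nss_ids, cities, pcs, ccn_nums)
    return [
        ",".join((fname, lname, gender_label, dob, nss_id, city, pc, ccn))
        for fname, lname, dob, nss_id, city, pc, ccn in columns
    ]

# made-up values used only to show the call
rows = build_rows("female", ["Anne"], ["Martin"], ["1984-05-12"],
                  ["284057512345672"], ["Lyon"], ["69003"], ["4111111111111111"])
print(rows[0])
```

Because zip stops at the shortest list, a short response from one endpoint yields fewer rows instead of an IndexError.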

Shuffle the records to mix the gender distribution:

random.shuffle(french_customers)
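
If you need the same order on every run, for example to compare two exports, a seeded generator makes the shuffle repeatable. A minimal sketch; the seed value 42 is arbitrary and the short list stands in for french_customers:

```python
import random

records = ["a", "b", "c", "d", "e"]  # stand-in for french_customers

first = records[:]
random.Random(42).shuffle(first)   # private generator seeded with 42

second = records[:]
random.Random(42).shuffle(second)  # same seed, same order

print(first == second)  # True
```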

Print the first 10 records:

french_customers[0:10]

The rows are mixed in random order.

Save the result in a csv file:

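# newline="" lets the csv module control line endings; utf-8 keeps accented characters intact
with open("french_customers.csv","w",newline="",encoding="utf-8") as file:
    wr = csv.writer(file)
    # each record is a comma-joined string, so split it back into columns
    wr.writerows(record.split(",") for record in french_customers)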
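
A header row makes the file easier to load elsewhere. The sketch below writes to an in-memory buffer and reads the result back to check the round trip. The column names are assumptions that mirror the record structure used in this article, to_csv_text is a hypothetical helper, and the sample record is made up.

```python
import csv
import io

# column names are assumptions that mirror the record structure used in this article
HEADER = ["first_name", "last_name", "gender", "date_of_birth",
          "nss", "city", "postcode", "credit_card_number"]

def to_csv_text(records):
    """Turn comma-joined record strings into CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(HEADER)
    for record in records:
        writer.writerow(record.split(","))
    return buffer.getvalue()

# made-up record used only for illustration
sample = ["Anne,Martin,female,1984-05-12,284057512345672,Lyon,69003,4111111111111111"]
parsed = list(csv.DictReader(io.StringIO(to_csv_text(sample))))
print(parsed[0]["city"])
```

Splitting on commas assumes that no field contains a comma itself; if a city name could include one, keep the fields as lists from the start instead of joining them.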

The dataset is ready to use! Be adventurous and adjust the number of records returned, include other regions, other endpoints, or change the gender ratio to mimic your production data better. For the endpoint description, refer to the Random Key documentation.

We hope you enjoyed the tutorial! You can download it as a Jupyter Notebook or an HTML document. If you feel the API could be improved, give us a shout!


We are the team behind Randomkey, a developer’s toolkit for data privacy.
