Fraud detection and investigation presents one of the most popular use cases for graph databases, especially in the financial services industry. But for those not employed directly by a bank or insurance firm, it can be hard to study or experiment with realistic data. If it’s not obvious, a lack of publicly available datasets is a real problem for academics looking to develop machine learning or heuristic approaches to fraud detection.
Lopez-Rojas, Elmir, and Axelsson1 published PaySim, an approach that uses an agent-based model and some anonymized, aggregate transactional data from a real mobile money network operator to create synthetic financial datasets that academics and hackers can use to explore ways of detecting fraudulent behavior.
Check out their initial dataset posted to Kaggle: https://www.kaggle.com/ntnu-testimon/paysim1
There’ve already been some good write-ups exploring the output of PaySim, both in terms of the sample dataset posted to Kaggle circa 3 years ago and possible ML-based approaches to fraud detection like those of Arjun Joshua2. Most recently, Sara Robinson3 published an example using TensorFlow and Google’s Cloud AI Platform to build a predictive model.
But what’s the one thing all the ML-based approaches have in common? They all exploit, and in doing so illustrate, a critical shortcoming in PaySim: its overly simplistic modeling of a single type of fraud.
Let’s see if we can improve PaySim and find new ways to identify fraud using graphs with Neo4j, shall we?
This is the first post of a few (maybe 3?) that will explore my experimentation and research taking the open-source PaySim project, improving upon it, and integrating it with Neo4j to implement a fraud analytics platform.
To understand PaySim, we need to understand a little about what it was built to model, specifically a mobile money network.
Mobile money takes different forms, but in the case of PaySim it involves both Banks and participating Merchants. Merchants can take mobile payments via the network (for goods/services) as well as perform the function of putting money into the network (e.g. “topping up” an account).
If it sounds a lot like Apple Pay, it’s because mobile payment services are effectively a type of mobile money.
The mobile money network used by the PaySim authors comes from an undisclosed African country, which leads me to believe it’s of the sort similar to M-Pesa.
From the M-Pesa Wikipedia page:
M-Pesa is a branchless banking service; M-Pesa customers can deposit and withdraw money from a network of agents that includes airtime resellers and retail outlets acting as banking agents.
So consider it something like Apple Pay, but where you can also make deposits via participating merchants.
If PaySim models financial transactions, what does it look like and how does it work?
Let’s jump a bit ahead and talk about what PaySim produces with the help of a graph visualization and then dive into the core components of the simulation: Agents and Transactions.
PaySim is a multi-agent simulation that steps through time. During each step, the agents are allowed to act in ways that can change themselves and the rest of the simulation state. If this sounds confusing at first, it helps to know that PaySim functions with a single core axiom:
Clients perform zero or many Transactions at each step in time, exchanging money with other agents in the network, specifically Banks, Merchants, and other Clients.
Let’s look a bit closer at both the types of Agents and the types of Transactions that PaySim simulates.
Agents are the key actors, meaning they can perform actions in the simulation. There are three (3) primary agent types and a few subtypes as well.
Clients model the end users in the mobile money network, effectively mapping to unique accounts that, in theory, are controlled by real people. Since Clients model people, and we’re concerned about modeling fraud, it follows that not all people in our simulation behave the same way. (Surprise, surprise!)
Merchants model the vendors or businesses that participate in the network through interactions with Clients.
Banks are pretty inert in PaySim, acting only as a target for Debit transactions. They appear to play a relatively limited role in PaySim, probably because they aren’t a critical component of the mobile money network PaySim models. (Consider, for example, that some mobile money networks exist in a market because their constituents are “under banked.”)
The only role Banks play is to facilitate Debit transactions, which seem to be a debit against a Client’s balance in the network, as if they’re transferring money back into their actual bank account.
Transactions form the cornerstone of PaySim in that they’re the only real way a Client can interact with other agents. In fact, Clients are the only agents that perform transactions.
While in the real world a financial transaction could be initiated by banks, merchants, etc., PaySim focuses entirely on the behavior of the Clients.
What can a Client do each turn in the simulation? They have a choice of five (5) possible transactions:
Transaction | Description |
---|---|
CashIn | A Client moves money into the network via a Merchant |
CashOut | A Client moves money out of the network via a Merchant |
Debit | A Client moves money into a Bank |
Transfer | A Client sends money to another Client |
Payment | A Client exchanges money for something from a Merchant |
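The table maps naturally onto a small enum. Here’s a minimal sketch of that idea; the names `TransactionKind` and `AgentKind` are made up for illustration and aren’t taken from the PaySim code base. The mapping just restates the table.

```java
// Illustrative sketch only: these names are hypothetical, not PaySim's classes.
class TransactionKindDemo {
    public static void main(String[] args) {
        for (TransactionKind kind : TransactionKind.values()) {
            System.out.println(kind + " -> " + kind.counterparty());
        }
    }
}

/** The three primary agent types described in the post. */
enum AgentKind { CLIENT, MERCHANT, BANK }

/** Each transaction type paired with the kind of agent on the other side. */
enum TransactionKind {
    CASH_IN(AgentKind.MERCHANT),
    CASH_OUT(AgentKind.MERCHANT),
    DEBIT(AgentKind.BANK),
    TRANSFER(AgentKind.CLIENT),
    PAYMENT(AgentKind.MERCHANT);

    private final AgentKind counterparty;

    TransactionKind(AgentKind counterparty) {
        this.counterparty = counterparty;
    }

    /** The agent type a Client deals with for this kind of transaction. */
    AgentKind counterparty() {
        return counterparty;
    }
}
```

Encoding the counterparty on the enum keeps the "who is on the other end" question in one place instead of scattering it across the simulation logic.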
Depending on the type of transaction, certain rules apply:
Every transaction must have a second agent of a supported type, dependent on the type of transaction.
Only Transfers between clients require proper double-entry bookkeeping where there’s a zero-sum. (Corollary: the simulation’s money supply can be increased/decreased via Merchants and Banks.)
Transfer amounts must fall under a global transfer limit set in the simulation parameters prior to simulation start. Larger transfers must be broken into multiple transactions.
The last thing to note about PaySim (and then you’ll be a PaySim expert!) is that the simulation runs in discrete steps. At every “step”, each agent (in some deterministic order) gets an opportunity to act.
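To make the transfer limit rule concrete, here’s a minimal sketch of breaking a large transfer into several smaller ones. `TransferSplitter` is a hypothetical helper, not PaySim code, and it assumes an amount exactly equal to the limit is still allowed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; PaySim's own splitting logic may differ.
class TransferSplitter {
    /** Splits amount into chunks that are each no larger than limit. */
    static List<Double> split(double amount, double limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        List<Double> chunks = new ArrayList<>();
        double remaining = amount;
        while (remaining > 0) {
            // Assumption: a chunk equal to the limit is permitted.
            double chunk = Math.min(remaining, limit);
            chunks.add(chunk);
            remaining -= chunk;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 2500 with a limit of 1000 becomes three transfers.
        System.out.println(split(2500.0, 1000.0)); // prints [1000.0, 1000.0, 500.0]
    }
}
```

A real implementation would likely track money in integer minor units rather than `double`, but the shape of the loop stays the same.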
In the case of PaySim:
From a code perspective, each agent in the simulation needs to implement a simple `sim.engine.Steppable` interface5 that the simulation will call at each step while providing a reference to the overall simulation state itself:
```java
/*
  Copyright 2006 by Sean Luke and George Mason University
  Licensed under the Academic Free License version 3.0
  See the file "LICENSE" for more information
*/
package sim.engine;

/** Something that can be stepped */
public interface Steppable extends java.io.Serializable
{
    public void step(SimState state);
}
```
In PaySim, all the clients implement `Steppable` and provide their own logic for how they’ll behave.
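To see how the pieces fit together, here’s a toy version of the step loop. It is a sketch only: `MiniSteppable` and `MiniState` are hypothetical stand-ins for MASON’s `Steppable` and `SimState` so the example runs without MASON on the classpath, and `ToyClient` is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Stand-in types so the sketch runs without MASON; all names here are hypothetical.
class StepLoopDemo {
    /** Runs a toy simulation and returns the "transactions" it produced. */
    static List<String> run(long seed, int steps) {
        MiniState state = new MiniState(seed);
        List<MiniSteppable> agents = new ArrayList<>();
        agents.add(new ToyClient("C1"));
        agents.add(new ToyClient("C2"));
        for (int step = 0; step < steps; step++) {
            state.step = step;
            // Every agent gets one turn per step, always in the same order.
            for (MiniSteppable agent : agents) {
                agent.step(state);
            }
        }
        return state.log;
    }

    public static void main(String[] args) {
        run(42L, 3).forEach(System.out::println);
    }
}

/** Plays the role of MASON's Steppable. */
interface MiniSteppable {
    void step(MiniState state);
}

/** Plays the role of the shared simulation state handed to each agent. */
class MiniState {
    final Random rng;
    final List<String> log = new ArrayList<>();
    int step;

    MiniState(long seed) {
        this.rng = new Random(seed);
    }
}

/** A client that performs zero or many (here, at most two) transactions per step. */
class ToyClient implements MiniSteppable {
    private final String id;

    ToyClient(String id) {
        this.id = id;
    }

    @Override
    public void step(MiniState state) {
        int count = state.rng.nextInt(3); // 0, 1, or 2
        for (int i = 0; i < count; i++) {
            state.log.add("step " + state.step + ": " + id + " transacts");
        }
    }
}
```

Because the random number generator is seeded and the agents always act in the same order, running with the same seed replays the same sequence of events.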
You can run PaySim as-is, out of the box, and generate synthetic data, so why not just use it now to explore fraud and build our graph? Well…it presents a few challenges:
PaySim expects to write out simulation results as CSV files. While Neo4j natively supports loading csv6, loading the transactions on the fly would open a lot more possibilities like simulating real-time detection and action.
Transactions in PaySim contain only bare bones data, with some critical aspects left to be inferred.
PaySim never explicitly documents all the actors in a simulation run, leaving you to infer their details from the raw transaction output. (In the code, however, it does keep track of all agents.)
Since PaySim is open source, I’ve forked the original and all the changes we’ll be walking through will be part of my PaySim 2.1.7
Before we dive in, the changes we want to make fall into two categories:
PaySim is provided as a Java application built upon the MASON agent simulation framework8, a mature and proven kitchen-sink multi-agent simulation platform. However, the way PaySim was implemented by the authors makes it challenging to build upon and expand.
Here I’ll provide a high level overview of code improvements in my fork of PaySim available at https://github.com/voutilad/paysim.
If you’re not interested in some of the lower-level code changes, jump ahead to Enhancing PaySim’s Fraudsters.
First up is fixing PaySim’s desire to only output to the file system. There are two primary improvements I made to make PaySim embeddable as a library:
Abstracted out the base simulation logic from the orchestration, so the original PaySim can still be run writing out to disk, but developers can provide alternative implementations that do whatever they want.
Implemented an iterating version of PaySim, allowing an application embedding PaySim to drive the simulation at its own pace and consume data on the fly.
The original PaySim logic is preserved, but the front-end is now choosable by the developer or end-user. For example, to run something analogous to the original PaySim project, you can run the `main()` method in the `OriginalPaySim` class and it will write out all the expected output files to disk.
If instead you want to drive the simulation using an implementation of a Java `Iterator<org.paysim.base.Transaction>`, use the `IteratingPaySim` class and consume transactions sequentially. A worker thread drives the simulation in the background while data flows via a buffered implementation of a `java.util.ArrayDeque`9. (The nitty gritty details are beyond the scope of this post at the moment.)
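For a feel of the general pattern (a worker thread producing, an iterator consuming), here’s a minimal sketch. It is not the `IteratingPaySim` implementation: it swaps in a bounded `java.util.concurrent.ArrayBlockingQueue` instead of an `ArrayDeque`-based buffer, uses plain strings in place of `Transaction` objects, and the class name `IteratingSimSketch` is hypothetical.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: strings stand in for org.paysim.base.Transaction objects.
class IteratingSimSketch implements Iterator<String> {
    // Unique sentinel object marking the end of the simulation (compared by identity).
    private static final String DONE = new String("DONE");

    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(8);
    private String pending; // next item fetched from the buffer, if any

    IteratingSimSketch(int steps) {
        Thread worker = new Thread(() -> {
            try {
                for (int step = 0; step < steps; step++) {
                    buffer.put("tx-at-step-" + step); // blocks while the buffer is full
                }
                buffer.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    @Override
    public boolean hasNext() {
        if (pending == null) {
            try {
                pending = buffer.take(); // blocks until the worker produces something
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                pending = DONE;
            }
        }
        return pending != DONE;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String tx = pending;
        pending = null;
        return tx;
    }

    public static void main(String[] args) {
        Iterator<String> sim = new IteratingSimSketch(20);
        while (sim.hasNext()) {
            System.out.println(sim.next());
        }
    }
}
```

The bounded buffer gives back-pressure for free: if the consumer is slow, the worker simply waits, so the simulation never races far ahead of whatever is loading the data.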
This part is a relatively simple change. To keep compatibility with the original PaySim logic, I’ve kept the `Transaction` implementation relatively the same, with the key exception of adding in details about the actor “types” on the sending and receiving end.
Since all actors derive from the `org.paysim.actors.SuperActor` base class, they all implement some getter for a `SuperActor.Type` value (an enum).
By tracking the `SuperActor.Type` on the `Transaction`:
We don’t have to keep references to the actors and they can ultimately be garbage collected by the JVM if we destroy the simulation.
More importantly, we can always know what type of actors the transaction pertains to, allowing us to accurately look up specific instances either in PaySim’s tracking of Clients/Merchants/Banks or in our resulting database.
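Here’s a small sketch of why carrying the type helps. Everything in it is hypothetical (`Tx`, `ActorType`, and the registry are illustrative, not PaySim’s classes): the transaction stores only ids and types, and the type decides which collection to search.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: class and field names are illustrative, not PaySim's.
class TypedLookupDemo {
    enum ActorType { CLIENT, MERCHANT, BANK }

    /** A transaction that stores ids and types rather than references to actor objects. */
    static final class Tx {
        final String origId;
        final ActorType origType;
        final String destId;
        final ActorType destType;
        final double amount;

        Tx(String origId, ActorType origType, String destId, ActorType destType, double amount) {
            this.origId = origId;
            this.origType = origType;
            this.destId = destId;
            this.destType = destType;
            this.amount = amount;
        }
    }

    /** One id-to-name registry per actor type, standing in for the simulation or a database. */
    static final Map<ActorType, Map<String, String>> REGISTRY = new EnumMap<>(ActorType.class);

    static void register(ActorType type, String id, String name) {
        REGISTRY.computeIfAbsent(type, t -> new HashMap<>()).put(id, name);
    }

    /** The type on the transaction tells us which registry to search. */
    static String destinationName(Tx tx) {
        Map<String, String> byId = REGISTRY.get(tx.destType);
        return byId == null ? null : byId.get(tx.destId);
    }

    public static void main(String[] args) {
        register(ActorType.CLIENT, "42", "Alice");
        register(ActorType.MERCHANT, "42", "Corner Store");
        Tx payment = new Tx("7", ActorType.CLIENT, "42", ActorType.MERCHANT, 12.50);
        System.out.println(destinationName(payment)); // prints "Corner Store"
    }
}
```

In a graph database the same idea shows up as picking the right node label before matching on an id.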
I made various touchups and tweaks that are too in-the-weeds for this blog post, so if you’re interested make sure to check out the project’s README for some more details. Some items of note:
changes to `static` members, allowing multiple configurations of PaySim to be loaded
changes to the use of `System.out` for logging
With the foundation improved, we can now work on shoring up the logic for our fraudsters. Let’s first look at how the original PaySim fraudsters behave and then get into the changes for 1st and 3rd Party implementations.
PaySim originally only models what looks to be a form of 3rd-party fraud:
CashOut
A real-world example of this might be an attacker breaching someone’s mobile money account via credential skimming/theft or phishing. Once the Fraudster has access to the payment card, they can cash out by buying gift cards or prepaid cards that can in turn either be used or sold to convert to actual cash.
Can we make it a tad more realistic?
Fraudsters try to completely drain a Victim’s account, performing Transfers up to the network “transfer limit” set by the model parameters.
A PaySim Fraudster picks a Victim from the simulation universe at random.
With the above in mind, let’s first talk about turning our generic PaySim fraudster into a 3rd Party Fraudster.
We’ll enhance our 3rd-party Fraudsters to incorporate a few new behaviors, bringing them closer to realistic behavior:
Like the original PaySim, we’ll keep the idea that a 3rd-party Fraudster creates a Mule account as a means of cashing out of the network.
For logic changes, let’s keep it simple while accounting for some key events:
Test fraud probability like in original PaySim. If test fails, abort actions for this simulation step.
If there are no victims OR we pass a probability check for picking a new victim, we enter New Victim mode:
Otherwise, pick an existing Victim at random and try a “Transfer” of some percentage of the Client balance to a Mule.
See ThirdPartyFraudster.java in the code base for implementation details.
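The decision flow described above can be sketched roughly as follows. This is not a reproduction of `ThirdPartyFraudster.java`: the class name, the probabilities, and the string results are all hypothetical, and since the details of New Victim mode aren’t spelled out here, the sketch simply records the newly chosen victim.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the decision flow; it does not reproduce ThirdPartyFraudster.java.
class ThirdPartyFlowSketch {
    private final double fraudProbability;     // chance of acting at all in a step
    private final double newVictimProbability; // chance of targeting someone new
    private final List<String> victims = new ArrayList<>();

    ThirdPartyFlowSketch(double fraudProbability, double newVictimProbability) {
        this.fraudProbability = fraudProbability;
        this.newVictimProbability = newVictimProbability;
    }

    /** Returns a short description of what the fraudster does this step. */
    String step(Random rng, List<String> clients) {
        // 1. Test fraud probability; on failure, do nothing this step.
        if (rng.nextDouble() >= fraudProbability) {
            return "idle";
        }
        // 2. No victims yet, or the dice say to find a new one: New Victim mode.
        //    (Assumption: the sketch only records the victim.)
        if (victims.isEmpty() || rng.nextDouble() < newVictimProbability) {
            String victim = clients.get(rng.nextInt(clients.size()));
            if (!victims.contains(victim)) {
                victims.add(victim);
            }
            return "new-victim " + victim;
        }
        // 3. Otherwise revisit an existing victim and transfer money to a Mule.
        String victim = victims.get(rng.nextInt(victims.size()));
        return "transfer-from " + victim;
    }

    public static void main(String[] args) {
        ThirdPartyFlowSketch fraudster = new ThirdPartyFlowSketch(0.5, 0.3);
        Random rng = new Random(1L);
        List<String> clients = Arrays.asList("C1", "C2", "C3");
        for (int i = 0; i < 10; i++) {
            System.out.println(fraudster.step(rng, clients));
        }
    }
}
```

Note that the very first successful fraud attempt always lands in New Victim mode, because there is nobody to revisit yet.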
First Party Fraud typically entails misrepresenting oneself in order to establish a line of credit with no intent to fulfill any debts. (See the definition in Open Risk Manual.)
A more interesting form of fraud is synthetic identity fraud where instead of using their own identifying information, fraudsters mix real with fake identifiers in order to slip past fraud checks when opening accounts or getting credit lines.
Should be easy to add to PaySim, but PaySim doesn’t have any form of identities!
First, we’ll have to bend our definition of the payment network being modeled by PaySim and assume some of it involves lines of credit.
Next, adding identities is pretty easy, but requires a bit of an overhaul across the agent (actor) codebase: we ultimately need all Clients, whether Fraudsters, Mules, or regular, to have some identifiable details that are generally unique.
What should it look like in the end? From a graph perspective, there’s a pretty trivial way to incorporate identities with Clients: relate each Client to an instance of an Identity.
From the PaySim code perspective, it gets a bit trickier, and easily can turn into a bike shedding exercise. Here’s where I ended up:
All `SuperActor` instances (our base actor class) are `Identifiable`.
`Identifiable` means you have an “Id” and a “Name” (both Strings) as attributes.
An `Identity` effectively is a container for the different identity attributes (name, id, etc.) and there are multiple implementations:
`BankIdentity` and `MerchantIdentity` both only have an “Id” and a “Name”.
`ClientIdentity` is more representative of a “person”, having not only a “Name” and “Id”, but others like “email”, “ssn”, and “phone” numbers.
An `IdentityFactory` provides a deterministic means of producing “random” identities as needed.
Constructors for actors get overhauled to optionally take a reference to an `Identity` implementation OR will generate one if not provided.
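As a rough illustration of that design, here’s a stripped-down sketch. The `*Sketch` names are hypothetical and deliberately differ from the real classes in `org.paysim.identity`; the attribute formats and the seeded `java.util.Random` are assumptions chosen to show what “deterministic yet random-looking” can mean.

```java
import java.util.Random;

// Hypothetical sketch: the *Sketch names are illustrative, not the classes in org.paysim.identity.
class IdentityDemo {
    public static void main(String[] args) {
        IdentityFactorySketch factory = new IdentityFactorySketch(2020L);
        System.out.println(factory.nextClient());
        System.out.println(factory.nextClient());
    }
}

/** Anything with an id and a name. */
interface IdentifiableSketch {
    String getId();
    String getName();
}

/** A person-like identity with extra attributes beyond id and name. */
class ClientIdentitySketch implements IdentifiableSketch {
    final String id;
    final String name;
    final String email;
    final String ssn;
    final String phone;

    ClientIdentitySketch(String id, String name, String email, String ssn, String phone) {
        this.id = id;
        this.name = name;
        this.email = email;
        this.ssn = ssn;
        this.phone = phone;
    }

    @Override public String getId() { return id; }
    @Override public String getName() { return name; }

    @Override
    public String toString() {
        return id + " | " + name + " | " + email + " | " + ssn + " | " + phone;
    }
}

/** Seeded factory: the same seed always yields the same sequence of identities. */
class IdentityFactorySketch {
    private final Random rng;
    private int counter = 0;

    IdentityFactorySketch(long seed) {
        this.rng = new Random(seed);
    }

    ClientIdentitySketch nextClient() {
        int n = ++counter;
        // Made-up formats; real generators would be more careful about realism.
        String ssn = String.format("%03d-%02d-%04d",
                rng.nextInt(1000), rng.nextInt(100), rng.nextInt(10000));
        String phone = String.format("555-%04d", rng.nextInt(10000));
        return new ClientIdentitySketch("C" + n, "Client " + n,
                "client" + n + "@example.com", ssn, phone);
    }
}
```

Seeding the factory is what keeps simulation runs repeatable: two runs with the same seed hand out the same identities in the same order.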
PHEW! If you want to look at the code mess, the org.paysim.identity package contains most of the additional code. Also check out some commits like 78b1cfb and f7b174a to see how things were changed.
Now that we have an identity component to our actors, let’s put together a new fraudster.
Taking a cue from the security breach and identity theft stories in the headlines, let’s pretend our fraudster acquired some number of viable identities (names, SSNs, and phone numbers). When we create a 1st-party fraudster, we can generate a handful of identities and give them to the fraudster.
For committing the fraud, we’ll start with a pretty trivial implementation:
From a Java implementation standpoint11, it’s pretty short and sweet:
```java
@Override
public void step(SimState state) {
    PaySimState paysim = (PaySimState) state;
    final int step = (int) state.schedule.getSteps();
    if (paysim.getRNG().nextDouble() < parameters.fraudProbability) {
        ClientIdentity fauxIdentity = composeNewIdentity(paysim);
        Mule m = new Mule(paysim, fauxIdentity);
        Transaction drain = m.handleTransfer(cashoutMule, step, m.balance);
        fauxAccounts.add(m);
        paysim.addClient(m);
        paysim.onTransactions(Arrays.asList(drain));
    }
}
```
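The `step` method calls `composeNewIdentity` without showing what it does. One plausible reading of “synthetic identity” is mixing attributes drawn from different stolen identities, and the sketch below illustrates only that idea. It is an assumption, not the code from `FirstPartyFraudster.java`, and the names `ComposeIdentitySketch` and `Id` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: not the composeNewIdentity from FirstPartyFraudster.java.
class ComposeIdentitySketch {
    /** A minimal identity with the three attributes the fraudster is said to hold. */
    static final class Id {
        final String name;
        final String ssn;
        final String phone;

        Id(String name, String ssn, String phone) {
            this.name = name;
            this.ssn = ssn;
            this.phone = phone;
        }

        @Override
        public String toString() {
            return name + " / " + ssn + " / " + phone;
        }
    }

    /** Builds a new identity by drawing each attribute independently from the stolen pool. */
    static Id compose(List<Id> stolen, Random rng) {
        String name = stolen.get(rng.nextInt(stolen.size())).name;
        String ssn = stolen.get(rng.nextInt(stolen.size())).ssn;
        String phone = stolen.get(rng.nextInt(stolen.size())).phone;
        return new Id(name, ssn, phone);
    }

    public static void main(String[] args) {
        List<Id> stolen = Arrays.asList(
                new Id("Ann Example", "111-11-1111", "555-0001"),
                new Id("Bob Example", "222-22-2222", "555-0002"),
                new Id("Cy Example", "333-33-3333", "555-0003"));
        Random rng = new Random(11L);
        for (int i = 0; i < 3; i++) {
            System.out.println(compose(stolen, rng));
        }
    }
}
```

Because every fake account reuses pieces of the same small pool, the accounts end up sharing identifiers, which is exactly the kind of pattern a graph makes easy to spot.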
At this point, we’ve got a revamped, new version of PaySim that can be run standalone or embedded. We’ve also got an understanding of our data model and how we plan on adapting it to our graph model, laying the foundation. Our data model is also slightly different.
You’ll notice that unlike what we started with, it now provides identifiers (e.g. `Phone`, `Email`, `SSN`) for each Client account (which may or may not be a Mule).
Other enhancements in PaySim 2.1 not visible in the data model:
To me this feels like an improvement. Let’s now put it to work and simulate some fraud!
In my next post, we’ll look at how to configure and run a PaySim simulation while simultaneously bulk loading the transaction output into a live Neo4j instance. We’ll cover:
A final post (TBA) will dive into how to analyze the data from both a visual perspective as well as an algorithmic approach using Neo4j’s Graph Algorithms library.
Until next time! 👋
PaySim: A Financial Mobile Money Simulator For Fraud Detection
See Arjun’s Kaggle notebook here: https://www.kaggle.com/arjunjoshua/predicting-fraud-in-financial-payment-services
Sara is a Developer Advocate for Google Cloud. You can find her blog at https://sararobinson.dev/
This is due to PaySim using aggregate data to drive the simulation and the data provided (by the original authors) only covers 30 days. Modifying this data will allow PaySim to produce different outcomes of differing lengths.
https://github.com/voutilad/mason/blob/728bdc43f35dd52c06ffce99a704f3191c2fcfa4/mason/src/main/java/sim/engine/Steppable.java
As such, PaySim is provided under the GPLv3 and my fork is available at https://github.com/voutilad/PaySim.
See the MASON project’s home page: https://cs.gmu.edu/~eclab/projects/mason/
https://docs.oracle.com/javase/8/docs/api/java/util/ArrayDeque.html
See Stripe’s docs on how they define “card testing” https://stripe.com/docs/card-testing
https://github.com/voutilad/PaySim/blob/3cfb56d0d52e45157f387144e8a4d0be7bcb7850/src/main/java/org/paysim/actors/FirstPartyFraudster.java#L44