Tag Archives: Music

A Possible Algorithm for Detecting Malicious Users

As requested, a soundtrack has been attached for your multisensory enjoyment: Algorhythm

A common situation in today’s tech world: you are a large tech company with a vast amount of user data. Some of those users are bots, scammers, or otherwise unsavory individuals. However, you often only know this once it is too late; another user reports him, someone gets scammed/cheated, someone gets malware, etc. As a gatekeeper/monitor of user interactions, how can you preempt such a situation?  The following is an example of how vector similarity might be used to give some indications, across a variety of metrics, that a user is a high-risk individual. I will use the term “scammer” for reductive purposes, but it is interchangeable with “predator”, “cheater”, “troublemaker”, or any other outlying user you wish to detect prior to that user’s perpetration of an activity that would reflect poorly on your site as a whole.

I should note that this is surely a concept that has been studied significantly by people interested in machine learning, security, etc; this is not a scientific paper, merely a bit of self-edification.

The Algorithm

Initially, there are no known scammers. We define a scammer as someone who has successfully perpetrated v, a violation of your site’s rules. It is important that v is unique and well-defined. Define a set of users U who exist on the site across a large time interval T. At the end of this time interval, some users will have committed scams and some will not. Ideally, the “corruption ratio”, or the ratio of scammers:nonscammers within the set should be similar to that of your site as a whole. Split the set U into two evenly divided sets U1{u10,u11…u1n} and U2{u20,u21…u2n}. Give U1 and U2 roughly equal corruption ratios.  Each user unm is defined uniquely in each time interval, unless he was discovered to be a scammer within a previous time interval.  In other words, after a user is discovered to be a scammer, his state is no longer considered relevant.

For now we will focus on U1. Define a series of equal time intervals across T{t0, t1…tn}. Define the set of users within U1 who were identified as scammers by the end of T as S1{s10,s11…s1n}.  Define the set of scammers who were discovered within ti as si.  Define a set of dimensions D{d0, d1…dm}, each of which is a value that has a likelihood of correlation with scamming.  A few potential values for a user dimension include:

Site with a social networking component:

  • amount of difficult-to-forge information
  • number of photos uploaded with the user’s face
  • strength of connectedness of relationships with known scammers
  • strength of connectedness of relationships with known nonscammers
  • age disparity between the user and the average person they contact (particularly relevant when tracking down MySpace predators)

Site with a commerce component

  • mean/median/mode/standard deviation transaction size
  • amount of positive/negative feedback from high-frequency/low-frequency users

General

  • rate of login/logout
  • variance of login location
  • various quantifications of legal enforcement within that user’s location

Define a set of prototype vectors, each having the m dimensions defined by D, P1{p10, p11…p1k} where p1i is the prototype vector composed from all the vectors within s1i.  In each case, p1i represents the prototypical scammer from U1 who was discovered within the time interval ti.

Note that this approach does not use weightings on any of these dimensions; how they could be used to achieve increased granularity is beyond the scope of this article.

Now we move our focus back to U2 and divide it using time intervals parallel to those used when evaluating U1.  Define the set of users within U2 who were identified as scammers by the end of T as S2{s20,s21…s2n}.  Define a set of prototype vectors for U2, P2{p20,p21…p2k}.

As a reminder, P1 and P2 are sets of prototypical scammers.  Define the set of values A{a0,a1…an} to be the vector similarity between P1 and P2 at time an.  an is also defined as the MOSNDS (measure of similarity necessary to determine similarity) over the time interval tn.  This is calculated to be a metric as to what a satisfactory degree of similarity it would take between, for example, prototype vector p1n and user u2k to predict whether u2k is a probable scammer.

Is this valid?