Tag Archives: Computer Science

Stock Similarities

Stock Similarities is a tool I wrote for comparing equities using cosine similarity. The source code can be found on GitHub.


Upon starting the program, the user is presented with the following:

restrict : limits the parsed metrics to a stricter set than default
ld <ticker> : loads all information about a stock into memory
ld <sector> : load tech, pharm, food, or finance
ld all : loads several NASDAQ stocks from various sectors
list : list all loaded companies
print_vect <ticker> : print the formatted stock vector for a ticker which has been loaded into memory
print_atts <ticker> : print all raw attributes of a stock which is in memory
sim <ticker> <ticker> : print the cosine similarity of two vectors
vis : enter visualization mode
sr : perform SageRank
q : quit the system

A standard series of commands can be found here.  It was generated from an older version of the code.  Several key lines are:

measure_similarity MSFT AAPL
measure_similarity AAPL AMZN
measure_similarity MSFT AMZN

The output is code that can be copied into a Processing file to get the following visualization:

The lines of output suggest that, of the three companies, AAPL and AMZN are the most disparate.  As a result, AAPL and AMZN are connected by the hypotenuse (the longest line).  The other meaningful component of the visualization is the radius of each circle, which is dictated by the price/earnings ratio.


Stock data is pulled from Yahoo finance, formatted, parsed, and mapped to vectors.  After this process, a stock can be summarized by a vector such as AAPL -> {contracts traded yesterday = 1000000000, last traded price = 520, short ratio = .5 …}.  Vectors are compared using cosine similarity.

	public static double cosineSimilarity(AttributeVector v1, AttributeVector v2) {
		return dotProduct(v1, v2) / (v1.magnitude() * v2.magnitude());
	}
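The helpers `dotProduct` and `magnitude` aren't shown above. Here is a minimal, self-contained sketch of how they might look, assuming `AttributeVector` is backed by a plain double array; the real class on GitHub may differ.

```java
public class AttributeVector {
    private final double[] values;

    public AttributeVector(double... values) {
        this.values = values;
    }

    // Sum of pairwise products of the two vectors' components.
    public static double dotProduct(AttributeVector v1, AttributeVector v2) {
        double sum = 0;
        for (int i = 0; i < v1.values.length; i++) {
            sum += v1.values[i] * v2.values[i];
        }
        return sum;
    }

    // Euclidean length: sqrt of the vector dotted with itself.
    public double magnitude() {
        return Math.sqrt(dotProduct(this, this));
    }

    public static double cosineSimilarity(AttributeVector v1, AttributeVector v2) {
        return dotProduct(v1, v2) / (v1.magnitude() * v2.magnitude());
    }
}
```

Parallel vectors score 1.0 and orthogonal vectors score 0.0, which is what makes the value usable as a similarity ratio.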

This creates a 1-to-1 similarity ratio for each pair of stocks.  GraphFactory turns these relationships into edge lengths, so that the stocks form a fully connected graph.

The nodes can each be printed in order of ranked importance.  A node’s importance is the sum of the weights of its incoming edges.
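That ranking step can be sketched as follows, treating the fully connected similarity graph as a symmetric matrix. The class and method names here are mine, purely illustrative, not the project's actual `GraphFactory` code.

```java
import java.util.HashMap;
import java.util.Map;

public class NodeRank {
    // A node's importance is the sum of the similarity weights of the
    // edges connecting it to every other node in the graph.
    public static Map<String, Double> importance(String[] tickers, double[][] similarity) {
        Map<String, Double> score = new HashMap<>();
        for (int i = 0; i < tickers.length; i++) {
            double sum = 0;
            for (int j = 0; j < tickers.length; j++) {
                if (i != j) {
                    sum += similarity[i][j];
                }
            }
            score.put(tickers[i], sum);
        }
        return score;
    }
}
```

Sorting the resulting map by value descending gives the ranked list.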

Locate Great Spy Observe Recorders With Great Purpose

Spam bots are getting so sophisticated that I want to approve the comment even though I know it’s spam.

MoonStocks Works Better Than I Thought

While staring at MoonStocks today, I noticed that my algorithm for converting the dominant frequency of a stock’s song into that stock’s price was working better than I thought. The predictable patterns this process generates are hard to see, because stock prices update every 100 ms and each individual conversion from dominant-frequency measurement to price carries some variance.  Over some number of iterations of a song -> price-series conversion, though, every point in time converges on a price.
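The convergence claim can be illustrated with a toy model. Everything below is hypothetical (a linear frequency-to-price mapping with Gaussian noise, not the actual MoonStocks code): each conversion is noisy, but averaging many conversions settles on a stable price for a given point in the song.

```java
import java.util.Random;

public class FrequencyToPrice {
    static final Random RNG = new Random(42);

    // Assumed mapping: price proportional to dominant frequency,
    // plus per-conversion Gaussian noise.
    public static double noisyPrice(double dominantFreqHz) {
        return dominantFreqHz / 4.0 + RNG.nextGaussian();
    }

    // Averaging many noisy conversions converges on the underlying price.
    public static double convergedPrice(double dominantFreqHz, int iterations) {
        double sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += noisyPrice(dominantFreqHz);
        }
        return sum / iterations;
    }
}
```

With a 440 Hz dominant frequency, individual conversions bounce around 110, but the average over many iterations lands very close to it.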

The Lunar Market Goes Live

Today marks the beta release of The Lunar Market, a game I worked on with Josh Stewart and Pong Tam.

The objective is to accrue money by trading stocks as a robot. The game includes music written by me, and the fluctuations in the music drive the fluctuations in stock prices.

You can download and install the apk by navigating to the following link while on an Android device. You may have to change your settings to allow installation from unknown sources, but if I give you viruses, you know where to find me.

Download The Lunar Market!


Musical Complexity Bloat

Once again, I’ve gone from an undermixed arrangement…
…to an overmixed arrangement…
…saving nothing intermittently along the way. If I were to try to clean things up, I would probably begin by clearing the entire mixer of all effects, and the subsequent process of remixing would take hours.

I really should use version control in my production.


In software engineering, the advantage of keeping your software modular is well understood.  Treating individual classes or even entire projects as reusable components instead of context-sensitive, isolated systems allows you to build on them easily in the future.  In my efforts to understand the overlap between electronic music production and software design, I have tried to think in similar terms, with little success.  I’m currently making efforts at collaborative production with a friend who uses Ableton (I use FL Studio).  Instead of a granular approach of sending him a collection of MIDI files, perhaps somehow associated with the samples and VSTs with which they are instrumented, I simply exported them to mp3.  We are both of the view that songs are all essentially vectors to be operated upon; to take this a step further, a mixer track can be viewed as a vector of operators, each with a friendly UI.  In this view, an mp3 is not inferior by virtue of its lossy compression; it has just been scaled down, which can be accommodated by powerful EQing and multiband compression.

The Waffle Party

Rafflesia Perfume

synth pop for u

tetris invocation

Feliz Navidad

synth pop for u

spring mvc

A Possible Algorithm for Detecting Malicious Users

As requested, a soundtrack has been attached for your multisensory enjoyment: Algorhythm

A common situation in today’s tech world: you are a large tech company with a vast amount of user data. Some of those users are bots, scammers, or otherwise unsavory individuals, but you often only find out once it is too late; another user reports them, someone gets scammed or cheated, someone gets malware, and so on. As the gatekeeper and monitor of user interactions, how can you preempt such a situation?  The following is an example of how vector similarity might be used to give some indication, across a variety of metrics, that a user is high-risk. I will use the term “scammer” for simplicity, but it is interchangeable with “predator”, “cheater”, “troublemaker”, or any other outlying user you wish to detect prior to that user’s perpetration of an activity that would reflect poorly on your site as a whole.

I should note that this is surely a concept that has been studied extensively by people working in machine learning, security, and related fields; this is not a scientific paper, merely a bit of self-edification.

The Algorithm

Initially, there are no known scammers. We define a scammer as someone who has successfully perpetrated v, a violation of your site’s rules. It is important that v is unique and well-defined. Define a set of users U who exist on the site across a large time interval T. By the end of this interval, some users will have committed scams and some will not. Ideally, the “corruption ratio”, the ratio of scammers to nonscammers within the set, should be similar to that of your site as a whole. Split U into two evenly divided sets U1{u10,u11…u1n} and U2{u20,u21…u2n}, giving U1 and U2 roughly equal corruption ratios.  Each user unm is defined uniquely in each time interval, unless he was discovered to be a scammer within a previous time interval.  In other words, after a user is discovered to be a scammer, his state is no longer considered relevant.
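One simple way to produce the split with matched corruption ratios is to alternate assignment within each group (known scammers and nonscammers separately). A sketch under that assumption; the names and structure here are mine, purely illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class UserSplit {
    // Alternate scammers between U1 and U2, then alternate nonscammers,
    // so both halves end up with roughly equal corruption ratios.
    public static List<List<String>> split(List<String> scammers, List<String> nonscammers) {
        List<String> u1 = new ArrayList<>();
        List<String> u2 = new ArrayList<>();
        for (int i = 0; i < scammers.size(); i++) {
            (i % 2 == 0 ? u1 : u2).add(scammers.get(i));
        }
        for (int i = 0; i < nonscammers.size(); i++) {
            (i % 2 == 0 ? u1 : u2).add(nonscammers.get(i));
        }
        return List.of(u1, u2);
    }
}
```

This is essentially a stratified split, with "scammer by end of T" as the stratum.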

For now we will focus on U1. Define a series of equal time intervals across T{t0, t1…tn}. Define the set of users within U1 who were identified as scammers by the end of T as S1{s10,s11…s1n}.  Define the set of scammers who were discovered within ti as s1i.  Define a set of dimensions D{d0, d1…dm}, each of which is a value that has a likelihood of correlating with scamming.  A few potential values for a user dimension include:

Site with a social networking component:

  • amount of difficult-to-forge information
  • number of photos uploaded with the user’s face
  • strength of connectedness of relationships with known scammers
  • strength of connectedness of relationships with known nonscammers
  • age disparity between the user and the average person they contact (particularly relevant when tracking down MySpace predators)

Site with a commerce component:

  • mean/median/mode/standard deviation transaction size
  • amount of positive/negative feedback from high-frequency/low-frequency users


Any site:

  • rate of login/logout
  • variance of login location
  • various quantifications of legal enforcement within that user’s location

Define a set of prototype vectors, each having the m dimensions defined by D, P1{p10, p11…p1k} where p1i is the prototype vector composed from all the vectors within s1i.  In each case, p1i represents the prototypical scammer from U1 who was discovered within the time interval ti.
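A natural way to compose p1i from the vectors in s1i is a per-dimension mean, so the prototype sits at the centroid of that interval's discovered scammers. A small sketch under that assumption (the averaging choice and the names are mine):

```java
public class Prototype {
    // Compose a prototype vector as the per-dimension mean of the
    // scammer vectors discovered within one time interval.
    public static double[] prototype(double[][] scammerVectors) {
        int m = scammerVectors[0].length;
        double[] p = new double[m];
        for (double[] v : scammerVectors) {
            for (int d = 0; d < m; d++) {
                p[d] += v[d];
            }
        }
        for (int d = 0; d < m; d++) {
            p[d] /= scammerVectors.length;
        }
        return p;
    }
}
```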

Note that this approach does not use weightings on any of these dimensions; how they could be used to achieve increased granularity is beyond the scope of this article.

Now we move our focus back to U2 and divide it using time intervals parallel to those used when evaluating U1.  Define the set of users within U2 who were identified as scammers by the end of T as S2{s20,s21…s2n}.  Define a set of prototype vectors for U2, P2{p20,p21…p2k}.

As a reminder, P1 and P2 are sets of prototypical scammers.  Define the set of values A{a0,a1…an}, where an is the vector similarity between p1n and p2n over the time interval tn.  an is also defined as the MOSNDS (measure of similarity necessary to determine similarity) for tn: a threshold for how similar, for example, prototype vector p1n and a user u2k would have to be before predicting that u2k is a probable scammer.
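Putting that last step together as a sketch, using cosine similarity as the vector-similarity measure (which matches the Stock Similarities tool above); the class and method names are mine:

```java
public class Mosnds {
    // Standard cosine similarity between two equal-length vectors.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // a_i (the MOSNDS) is the similarity between the two prototype
    // vectors; a user is flagged when their similarity to a prototype
    // meets or exceeds that bar.
    public static boolean probableScammer(double[] user, double[] p1, double[] p2) {
        double mosnds = cosine(p1, p2);
        return cosine(user, p1) >= mosnds;
    }
}
```

The intuition: if two independently built prototypes agree with each other to degree a_i, then a user who resembles a prototype at least that strongly is as scammer-like as the prototypes are to each other.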

Is this valid?


As much as I enjoyed CS371p, my most valuable gains from it were not really from learning object-oriented best practices.  What the class helped me most with was the idea of creating coding standards and then enforcing them upon myself strictly.  This has more to do with what the professor wanted to impart than what the material necessitated, which makes me feel quite lucky to go to UT.  Writing UML and reading papers that discussed CS industry and culture further supplemented the “practical” feeling of the class.

I’m glad there was a points-based enforcement of pair programming.  I have found it easy to make acquaintances in CS, and pair-programming has turned several acquaintances into friends.  Granted, this is not the stated goal of having us pair program together, and certainly not the only reward, but even if I were to fail this class and dejectedly quit CS altogether, I would retain those friendships.

Without the information gained in OOP, I likely would not have gotten a job this summer.  Many of the questions asked throughout the twenty-something interviews I had were answered with information from the class, and several questions I missed covered things I would have learned in class had I been paying closer attention.

To summarize, great class.