In 2017, the Edinburgh Festival Fringe was host to 3,398 shows selling over 2.5 million tickets, numbers that are increasing year on year. With this abundance of shows it can be difficult to find something that one wants to see. I describe here how I used data to create an application that will find shows similar to shows an individual has seen before and enjoyed.
The data comes from the Edinburgh Festival Fringe API. The data set used in this demo contains more than 3,500 shows, each described by more than 80 variables. I first selected the variables I felt indicated similarity between shows and then created distance matrices based on these variables.
Distance matrices are best explained with an example. A distance matrix can be used to represent the geographical distance between a list of locations, for example:
This is a distance matrix showing the distance between Edinburgh, Cardiff and Glasgow. Each city has a distance of zero between itself, and the other values in the matrix are the distance as the crow flies between the corresponding locations. Note that the matrix is symmetric (i.e. the distance between A and B is identical to the distance between B and A).
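As a rough sketch of how such a matrix could be computed, the snippet below applies the haversine (great-circle) formula to approximate city-centre coordinates. The coordinates and the Earth radius are assumptions chosen for illustration, so the resulting distances are approximate and may differ slightly from any published figures.

```python
import math

# Approximate city-centre coordinates (latitude, longitude) in degrees.
# These are illustrative assumptions, not values taken from the article.
CITIES = {
    "Edinburgh": (55.953, -3.188),
    "Cardiff": (51.482, -3.179),
    "Glasgow": (55.864, -4.252),
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle ("as the crow flies") distance between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


names = list(CITIES)
matrix = [[haversine_km(CITIES[r], CITIES[c]) for c in names] for r in names]

# Print one row per city; the diagonal is zero and the matrix is symmetric.
for name, row in zip(names, matrix):
    print(f"{name:<10}", "  ".join(f"{d:7.1f}" for d in row))
```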
Distance matrices can also represent more general similarities. Below is an example of a distance matrix representing the similarity between five shows, expressed as a value between 0 and 1. As with the cities, the smaller values represent shows that are closer together, i.e. more similar.
Given such a matrix, we can find the show most similar to a given show by examining its column and finding the row that contains the smallest value greater than zero (the zero on the diagonal is the show's distance to itself). For example, the show most similar to show 3 is show 4:
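The lookup can be sketched in a few lines of Python. The matrix values below are invented for illustration and are not the figures from the article; they are merely chosen so that show 4 is closest to show 3. The sketch skips the show's own diagonal entry rather than testing for "greater than zero", which gives the same answer as long as no two distinct shows have a distance of exactly zero.

```python
# Hypothetical distances between five shows (values invented for illustration).
# Row/column i corresponds to show i + 1.
distances = [
    [0.00, 0.62, 0.81, 0.74, 0.35],
    [0.62, 0.00, 0.57, 0.66, 0.90],
    [0.81, 0.57, 0.00, 0.12, 0.48],
    [0.74, 0.66, 0.12, 0.00, 0.53],
    [0.35, 0.90, 0.48, 0.53, 0.00],
]


def most_similar(matrix, show):
    """Return the index of the show closest to `show`, ignoring the show itself."""
    column = [row[show] for row in matrix]
    candidates = [i for i in range(len(column)) if i != show]
    return min(candidates, key=lambda i: column[i])


# Show 3 lives at index 2; add 1 to turn the result back into a show number.
print(most_similar(distances, 2) + 1)  # prints 4
```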
Similarity Measures for Fringe Events
While choosing variables that would indicate show similarity, I found that the relevant variables fell into two categories: (1) the content of the show, consisting of genre, sub-genre, description, and age rating; and (2) the production side of the show, consisting of artist type (whether it was a professional or amateur company), number of performers, venue capacity, and duration. From these eight variables I created nine distance matrices (two different text analyses on the descriptions, and one for each of the other seven variables).
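The article does not say how each matrix was calculated, so the following is only a sketch of three common choices: a 0/1 mismatch for categorical variables such as genre, a range-scaled absolute difference for numeric variables such as duration, and a Jaccard word-overlap distance as a stand-in for one of the text analyses. The show records and field names are hypothetical and do not reflect the Fringe API schema.

```python
# Three made-up shows; field names are hypothetical, not the Fringe API schema.
shows = [
    {"genre": "Comedy", "duration": 60, "description": "a stand-up hour about family life"},
    {"genre": "Comedy", "duration": 50, "description": "stand-up about work and family"},
    {"genre": "Theatre", "duration": 90, "description": "a drama set in wartime Edinburgh"},
]


def categorical_distance(a, b):
    """0 if the two categories match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def numeric_distances(values):
    """Absolute differences scaled to [0, 1] by the range of the values."""
    spread = (max(values) - min(values)) or 1.0  # avoid dividing by zero
    return [[abs(x - y) / spread for y in values] for x in values]


def jaccard_distance(text_a, text_b):
    """1 minus the share of distinct words the two texts have in common."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return 1.0 - len(words_a & words_b) / len(union) if union else 0.0


def pairwise(items, distance):
    """Build a square matrix by applying `distance` to every pair of items."""
    return [[distance(a, b) for b in items] for a in items]


genre_matrix = pairwise([s["genre"] for s in shows], categorical_distance)
duration_matrix = numeric_distances([s["duration"] for s in shows])
text_matrix = pairwise([s["description"] for s in shows], jaccard_distance)
```

Keeping every matrix on the same 0 to 1 scale means that no variable dominates purely because of its units when the matrices are later combined.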
The next step was to combine the nine distance matrices into a single measure of similarity. I wanted some variables to have more influence on the final matrix, so I decided to take a weighted sum of the matrices and to fine-tune the weights by hand. To do this, I used multi-dimensional scaling (MDS) to estimate (x,y) coordinates for each show and then plotted each show on a graph. The closer two shows are on the graph, the more similar they are. I then considered each variable in turn and coloured the shows by the categories of that variable. The figure below shows the MDS plot (top-left) and the same plot with shows coloured by genre (top-right), venue capacity (bottom-left), and age rating (bottom-right).
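A minimal sketch of this step, assuming NumPy is available: the matrices are combined as a weighted sum, and classical (Torgerson) MDS turns the result into (x,y) coordinates. The article does not state which MDS variant or which weights were used, so the variant, the example matrices, and the weights below are all assumptions made for illustration.

```python
import numpy as np


def combine(matrices, weights):
    """Weighted sum of equally sized distance matrices, rescaled by total weight."""
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack([np.asarray(m, dtype=float) for m in matrices])
    return np.tensordot(weights, stacked, axes=1) / weights.sum()


def classical_mds(distances, dims=2):
    """Estimate coordinates whose Euclidean distances approximate `distances`."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (d ** 2) @ centring  # double-centred squared distances
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    top = np.argsort(eigenvalues)[::-1][:dims]  # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(eigenvalues[top], 0.0, None))
    return eigenvectors[:, top] * scale


# Two hypothetical 4-show matrices and hand-picked weights (all values invented).
genre = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]])
age = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
combined = combine([genre, age], weights=[3, 1])
coords = classical_mds(combined)  # one (x, y) row per show, ready to plot
```

Plotting `coords` and colouring the points by one variable at a time gives the kind of view described above; changing the weights and re-plotting shows how strongly each variable shapes the layout.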
The graphs show that shows with the same genre, and shows with the same age rating, cluster together, whereas shows with similar venue capacity do not form clear clusters. From this exploration I created the following hierarchy for the variables and weighted the matrices according to this hierarchy.
Once I had decided on a final similarity matrix between shows I integrated it into an application (app) allowing users to search for a show they have enjoyed and be presented with the four shows that are most similar to it. I also included customisable filters for values of variables, such as age rating, to enable users to narrow search results. The user interface for the app can be seen below:
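The core lookup behind such an app might look like the sketch below: rank every other show by its distance to the chosen one, drop those that fail the user's filter, and return the first four. The function name, the `keep` filter parameter, the show records, and the distances are all hypothetical and are not taken from the app itself.

```python
def recommend(distances, shows, chosen, k=4, keep=lambda show: True):
    """Return the indices of the `k` shows closest to `chosen` that pass `keep`."""
    candidates = [
        i for i, show in enumerate(shows)
        if i != chosen and keep(show)
    ]
    candidates.sort(key=lambda i: distances[chosen][i])
    return candidates[:k]


# Hypothetical catalogue (titles and age ratings invented for illustration).
shows = [
    {"title": "A", "age_rating": 12},
    {"title": "B", "age_rating": 18},
    {"title": "C", "age_rating": 12},
    {"title": "D", "age_rating": 16},
    {"title": "E", "age_rating": 8},
    {"title": "F", "age_rating": 14},
]
# Only the row for show 0 is filled in, since that is the only show queried here.
distances = [[0.0, 0.1, 0.4, 0.2, 0.5, 0.3]]

top_four = recommend(distances, shows, chosen=0)
family_friendly = recommend(distances, shows, chosen=0,
                            keep=lambda show: show["age_rating"] <= 14)
```

Applying the filter before ranking means the user still receives up to four results even when the closest shows are filtered out.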
This article has been an exploration into using Fringe data to build a simple proof-of-concept search application. While simple, the underlying algorithm could be combined with next year’s Fringe data and integrated into an application for next year’s Fringe visitors. Over time the underlying algorithm could be iterated and improved according to the needs of the users. For example, location information or user history could be used to better personalise the recommendations.