In preparation for the new Star Wars movie premiere, Richard and Roman, the Data Scientists at The Data Lab, have analysed several hundred characters from the Star Wars films and associated series to determine which language each name is most likely to have come from.
A list of over 500 names was taken from Wikipedia, and an n-gram model was applied to each one. The n-gram model, a technique from natural language processing (a branch of artificial intelligence), first splits the name into a sequence of single, double, and triple character strings. For example, the name Luke decomposes into the strings l, u, k, e, lu, uk, ke, luk, and uke. Using a piece of software called textcat (short for text categorisation), the frequencies of the resulting strings are compared with those found in dozens of language corpora. From this the software is able to calculate the probability of a given name coming from each of the languages. The most likely language is noted for each character name.
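The decomposition and comparison steps can be illustrated with a minimal Python sketch. It is not the textcat software and does not reproduce how textcat scores languages. The function names (`char_ngrams`, `build_profile`, `best_language`), the scoring rule, and the tiny word lists are all assumptions made up for illustration; a real analysis would use large corpora for each language.

```python
from collections import Counter


def char_ngrams(name, max_n=3):
    """Return every character n-gram of length 1..max_n, shortest first."""
    text = name.lower()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


def build_profile(words, max_n=3):
    """Map each n-gram seen in `words` to its relative frequency."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word, max_n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def best_language(name, profiles, max_n=3):
    """Pick the language whose profile gives the name's n-grams the most weight.

    The score is a simple sum of relative frequencies (an assumed,
    simplified rule, not the measure used by textcat).
    """
    grams = char_ngrams(name, max_n)
    scores = {
        language: sum(profile.get(gram, 0.0) for gram in grams)
        for language, profile in profiles.items()
    }
    return max(scores, key=scores.get)


# Invented toy word lists standing in for real language corpora.
toy_profiles = {
    "toy_language_a": build_profile(["luke", "lake", "duke"]),
    "toy_language_b": build_profile(["zzz"]),
}

print(char_ngrams("Luke"))
print(best_language("Luke", toy_profiles))
```

With only three n-gram lengths and a single short name, very few strings feed into the score, so a small change in the reference word lists can easily change the winning language.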
The authors are keen to point out that this exercise was done for fun and that the results are not meant to be taken too seriously. The technique is only really applicable to larger bodies of text, since a single name yields just a handful of character strings to compare, and it is typically used to categorise written works by, for example, similarity, author or subject matter. The research did throw up some interesting conclusions, however.
The names span a huge number of different languages, from the readily familiar to the rather more obscure. Middle Frisian, for example, was spoken around the Netherlands, Germany and southern Denmark in the 17th and 18th centuries, whilst Tagalog is a modern-day language from the Philippines.
There appears to be a connection between the names of the Hutt characters and Scottish. In addition to Jabba the Hutt, each of Borvo the Hutt, Gardulla the Hutt, Mama the Hutt, Rotta the Hutt, Ziro the Hutt, and Zorba the Hutt maps to Scottish, as does Sy Snootles, the lead vocalist in Jabba’s house band in Episode VI – Return of the Jedi.
The full list of names can be found here.