Technical blog from our Data Science Team

Read the latest technical blogs from our data science team.

Winning Pub Quizzes with pandas

Author: Richard Carter

I was in my local pub quiz recently when as usual we were faced with the anagram speed round. The quiz-master Dr Paul reads out a collection of words with a clue to a mystery person’s identity, and the first team to correctly shout out the answer wins a spot prize. It’s often a box of Tunnock’s Teacakes – so well worth the effort!

We were told on this particular occasion that the jumbled letters ‘OR ANAL MR BOND’ (!) were an anagram of a famous actor. As the words are being read out the well-worn technique is to write them in a “cloud” on the page, and then try to decipher the correct name. This can often take some time, but on this night myself and a couple of other teams almost instantaneously shouted “Orlando Bloom!”. One of the easiest anagrams we’d had, I thought. But I was wrong. It wasn’t Orlando Bloom, and so we had to get to thinking again. It turned out that the correct name we were after was old-Godfather himself, Marlon Brando.

I had to double-check that Orlando Bloom didn’t fit the original character set – that indeed, Orlando Bloom and Marlon Brando aren’t anagrams of each other. They’re close, but they’re different. This got me to wondering whether there are any famous actors who’s names are anagrams of each other, and to investigate this I took to Python.

The Internet Movie Database (or IMDb for short) is an encyclopedia of all things film and TV. On the page http://www.imdb.com/list/ls058011111/ I found a list of the “Top 1000 Actors and Actresses” created by user hagennelson. This forms the basis for the following research.

Since the list spreads over four pages we first need to download the elements and concatenate them into a single pandas data frame.

import pandas as pd

frame_list = []

for start in [1, 251, 501, 751]:
    URL = "http://www.imdb.com/list/ls058011111/?start=" + str(start) + "&view=compact&sort=listorian:asc"
    dfs = pd.read_html(URL, match="Name", header=0, index_col=0)
    frame_list.append(dfs[0])

df = pd.concat(frame_list)
df.index = df.index.astype(int)

The next task is to create a new column containing the alphabetically sorted characters of the actors’ names:

df['Chars'] = df['Name'].apply(lambda x: ''.join(sorted(x.lower().strip().replace(" ", ""))))

We can see the effect of this in the top five rows of the data frame using df.head().

One quick way to check for anagrams now is just to look for duplicates in the Chars column. If we have 1000 unique character strings then we know that we have no anagrams at all. In fact it turns out that we have only 997 unique character strings, so it looks like we have three pairs of anagrams.

One quick way to check for anagrams now is just to look for duplicates in the Chars column. If we have 1000 unique character strings then we know that we have no anagrams at all. In fact it turns out that we have only 997 unique character strings, so it looks like we have three pairs of anagrams.
997
              Name         Chars
#                               
292  Neve Campbell  abceeellmnpv
846  Neve Campbell  abceeellmnpv
              Name         Chars
#                               
773  Paul Scofield  acdefillopsu
949  Paul Scofield  acdefillopsu
               Name          Chars
#                                 
708  Patrick Wilson  aciiklnoprstw
954  Patrick Wilson  aciiklnoprstw

Unfortunately what we actually find above is that three of the actors appear erroneously on the list twice each! The actual list of 997 actors do not contain any self-anagrams whatsoever.

Given that we weren’t able to find anagrams, and that the downloaded list contains the actors birthdates, I thought it would be fun to see whether any of the famous thespians share the same date of birth. This turns out to be far more fruitful.

First we need to rid ourselves of the duplicated actors discovered above.

df = df.drop_duplicates(subset='Name')

Similar to above we now wish to find out which, if any, birth dates are duplicated within our dataframe.

df_duplicates = df[df.duplicated(subset='Birth Date')][['Name', 'Birth Date']]

for bd in df_duplicates['Birth Date']:
    print df[df['Birth Date']==bd][['Name', 'Birth Date']].to_csv(header=False)
195,Julianne Moore,3 December 1960
199,Daryl Hannah,3 December 1960

238,Lena Headey,3 October 1973
292,Neve Campbell,3 October 1973

397,Anna Paquin,24 July 1982
493,Elisabeth Moss,24 July 1982

145,Glenda Jackson,9 May 1936
571,Albert Finney,9 May 1936

494,Adrianne Palicki,6 May 1983
587,Gabourey Sidibe,6 May 1983

523,Tom Wilkinson,5 February 1948
668,Christopher Guest,5 February 1948

248,Gillian Anderson,9 August 1968
673,Eric Bana,9 August 1968

379,Ellen Page,21 February 1987
779,Ashley Greene,21 February 1987

786,Thandie Newton,6 November 1972
827,Rebecca Romijn,6 November 1972

747,Keri Russell,23 March 1976
836,Michelle Monaghan,23 March 1976

753,Djimon Hounsou,24 April 1964
871,Cedric the Entertainer,24 April 1964

77,Angelina Jolie,4 June 1975
920,Russell Brand,4 June 1975

629,Catalina Sandino Moreno,19 April 1981
932,Hayden Christensen,19 April 1981

425,Chris Pine,26 August 1980
946,Macaulay Culkin,26 August 1980

913,Justin Bartha,21 July 1978
953,Josh Hartnett,21 July 1978

828,Michelle Rodriguez,12 July 1978
964,Topher Grace,12 July 1978

858,John Leguizamo,22 July 1964
970,David Spade,22 July 1964

And above we get a set of 17 pairs of actors from our original list who were born on exactly the same day. To pick out just three pairs we have:

3 December 1960: Julianne Moore and Daryl Hannah.

21 February 1987: Ellen Page and Ashley Greene.

4 June 1975: Angelina Jolie and Russell Brand.

You just never know — this could all come in useful at a pub quiz one day!

Share this technical blog post:

Share on facebook
Facebook
Share on twitter
Twitter
Share on linkedin
LinkedIn
Share on email
Email

Comments

We’d love to hear your views on this article.

Comments will appear after they have been approved by the Data Science Team.

Leave a comment

Comments will appear after they have been  approved by the Data Science Team.

Search technical blog posts by author:

More Technical Blog Posts: