Guest blog by Colin Gillespie, Data Scientist at Jumping Rivers
Colin was due to speak at DataTech20, which was unfortunately cancelled due to COVID-19. We’re delighted that he has put together this brilliant blog on digital security for us.
How not to do security
Digital security is everywhere. Unfortunately, bad digital security is also everywhere. Take Disney+, the godsend to all parents during this lockdown. When initially released in the USA, it was plagued with security issues. Essentially, hackers used username and password combinations that had already been compromised elsewhere to access user accounts. This is commonly known as credential stuffing. Not a particularly sophisticated attack, but simple and clearly effective.
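Credential stuffing is straightforward to defend against on the user side: never reuse passwords, and check whether a password has already appeared in a breach. As a rough sketch (assuming the httr and digest packages are installed), the Have I Been Pwned k-anonymity API lets you do this without the full password ever leaving your machine:

```r
# Sketch of a credential-stuffing defence: check a candidate password
# against the Have I Been Pwned k-anonymity API. Only the first five
# characters of the SHA-1 hash are ever sent over the network.
library(httr)
library(digest)

pwned_count <- function(password) {
  hash <- toupper(digest(password, algo = "sha1", serialize = FALSE))
  prefix <- substr(hash, 1, 5)
  suffix <- substr(hash, 6, 40)
  res <- GET(paste0("https://api.pwnedpasswords.com/range/", prefix))
  stop_for_status(res)
  lines <- strsplit(content(res, "text", encoding = "UTF-8"), "\r\n")[[1]]
  # The API returns "SUFFIX:COUNT" lines for every hash sharing our prefix
  matches <- grep(paste0("^", suffix, ":"), lines, value = TRUE)
  if (length(matches) == 0) 0L else as.integer(sub(".*:", "", matches[1]))
}
```

A non-zero count means the password has appeared in a known breach and should never be used again.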
Last week I purchased a Roku, a device that connects your television to the numerous online streaming services, including Disney+. As someone who takes security seriously, I made my Disney+ account secure with a unique thirty-five character password (I use a password manager). However, to access Disney+ on the Roku I had to enter this password using a tiny remote control, via a terrible interface. And just when I thought things couldn't get worse, after the first twenty characters Roku stopped displaying the entered password!
Poor design makes security hard
It's this sort of poor design that makes security hard. Users try to do the right thing, but are ushered into insecure practices. Of course, if my account were compromised, it would be "my" fault.
Amazon, by comparison, asks you to log in to their site with a laptop or phone and enter a short code on the TV. Easy, simple and secure. Users are nudged into doing the right thing.
The difference between data scientists and other members of an organisation is that their job typically involves pulling together a variety of sensitive data sets and then reporting the results to relevant stakeholders, sometimes via a public-facing website. Coupled with this is the need to work at the forefront of technology. Not a good mix from a security perspective.
Companies and organisations are often, and rightly so, worried about security around their data assets. However, their solution is often to place barriers and obstacles that impede data scientists from doing their jobs.
At Jumping Rivers we provide online and onsite training in R and Python. Previously, we would send the client a list of the R/Python packages required for training. At many organisations, installing software is painful: it requires multiple forms, emails, signatures, and time. Because this process was so painful, we noticed that many organisations minimised the pain by simply installing all 12,000 CRAN R packages! That way they would avoid future form filling: the logic was that a single request to IT now would save time later. There was also a bit of passing the buck: "IT said it was OK, so it's no longer my responsibility". While users should take some responsibility for this "solution", IT should also take much of the blame. When you actively hinder people's ability to do their job, the obvious consequence is that they try to circumvent the barriers.
At Jumping Rivers we now avoid this situation by providing a cloud solution for clients that requires no set-up and no bad security practices.
While this is an extreme example, variants of it are replicated at many organisations: hire incredibly clever data scientists, stop them accessing the correct tools, but still expect them to do their work. The result is workarounds: personal laptops used for work, data carried on USB sticks, and logging on to unsecured wifi to download zip files. At one course I gave, the class ran over to Starbucks at lunchtime to install software, because the office wifi blocked all zip files.
What could go wrong?
Bioconductor is a suite of R packages used for analysing genomic data sets: the sort of data that is very expensive to collect and almost certainly sensitive. The organisations that use Bioconductor include pharmaceutical companies, government departments, and universities; basically, anyone involved in serious medical research. Most organisations looking at vaccines for COVID-19 will be using Bioconductor.
To install Bioconductor, you ran a single command in R. This downloaded and ran an R script from the Bioconductor website. A few things to note about this code:
- It uses https. This is good. But you can still have a secure conversation with the devil!
- Running this code involves complete trust in the Bioconductor team. You are giving the author of this script all of your user rights. Any files you can access, they can access. Any program you can run or install, they can run or install.
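For context, and to the best of my recollection, the historical command was the biocLite bootstrap; Bioconductor has since moved installation to the CRAN-hosted BiocManager package:

```r
# Historical approach (no longer recommended): source a remote script,
# handing its author the ability to run arbitrary code as you
source("https://bioconductor.org/biocLite.R")
biocLite()

# Current approach: install BiocManager from CRAN, then use it
install.packages("BiocManager")
BiocManager::install()
```

The new route matters because packages from CRAN go through the normal package-installation machinery, rather than executing an arbitrary script fetched from a typed-in URL.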
The first point, around https, is well known. It's not that https is insecure; rather, it is often mistaken to mean that the content is secure. Indeed, Barclays got into trouble with the ASA in 2018 for making this very claim. So https is a necessary, but not sufficient, condition for security.
The second point, around trust, at first glance seems odd. Of course we trust Bioconductor; we are using their software after all. However, this assumes you actually install the software from Bioconductor! A few months ago, I purchased thirteen domains, all misspellings of the name "bioconductor", for a cost of around £100. For example, I had:
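Misspellings of this kind, dropped or transposed letters, are easy to enumerate mechanically. A hypothetical sketch of how an attacker (or a defender checking what to register pre-emptively) might generate candidate typo-domains:

```r
# Hypothetical sketch: enumerate single-character typos of a name
# (deletions and adjacent swaps), the kind typosquatters register
typos <- function(name) {
  chars <- strsplit(name, "")[[1]]
  n <- length(chars)
  # Drop each character in turn
  deletions <- vapply(seq_len(n),
                      function(i) paste(chars[-i], collapse = ""), "")
  # Swap each adjacent pair of characters
  swaps <- vapply(seq_len(n - 1), function(i) {
    s <- chars
    s[c(i, i + 1)] <- s[c(i + 1, i)]
    paste(s, collapse = "")
  }, "")
  unique(c(deletions, swaps))
}

head(typos("bioconductor"))
# e.g. "ioconductor", "boconductor", ...
```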
I then monitored the web logs to see if anyone tried to download an R script from my domains. On average, I saw fifteen unique IP addresses per day, which included:
- hits from all major pharmaceutical companies;
- hits from the world's top ten universities;
- hits from major government departments.
Remember, whenever anyone runs my script, it is equivalent to opening their laptop and giving an attacker full access. Depending on how the attacker was feeling, they could delete everything on the laptop or, perhaps more nefariously, change a couple of values in an Excel spreadsheet, potentially causing millions of pounds worth of damage.
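One partial mitigation, whichever domain a script actually comes from, is to verify its checksum against a hash obtained through a separate trusted channel before sourcing it. A sketch, assuming the digest package; the expected hash below is a placeholder:

```r
library(digest)

# Placeholder: the real hash must come via a trusted, separate channel
expected <- "PASTE-KNOWN-SHA256-HASH-HERE"

# Download the script to disk instead of sourcing it directly
download.file("https://bioconductor.org/biocLite.R",
              destfile = "biocLite.R")

# Hash the file and compare before executing anything
actual <- digest("biocLite.R", algo = "sha256", file = TRUE)
if (!identical(actual, expected)) {
  stop("Checksum mismatch: refusing to source the script")
}
source("biocLite.R")
```

This doesn't help if you fetched both the script and the hash from the same typosquatted site, which is why the hash must travel by a different route.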
Where does the fault lie for this potential security breach? I would argue that it is not just the users' fault, as mistakes happen. Fault lies partly with Bioconductor for encouraging this installation method (it has since been changed), and partly with the organisations that use Bioconductor. Why was the software not installed securely site-wide, removing the need for anyone to install it themselves?
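Site-wide installation usually means administrators pointing every R session at vetted internal mirrors, so individual users never need to type installation URLs at all. A hypothetical site-level configuration sketch; the host names are placeholders:

```r
# Rprofile.site (hypothetical): direct all package installs to
# internal, vetted mirrors -- the host names below are placeholders
local({
  options(repos = c(
    CRAN = "https://cran.mirror.example.com",
    BioCsoft = "https://bioc.mirror.example.com/packages/release/bioc"
  ))
})
```

With repositories pinned centrally, `install.packages()` just works for users, and a typo in a URL never enters the picture.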
Getting the Basics Right
As organisations become more data driven, this leads in turn to a proliferation of different technologies, such as Shiny dashboards, API endpoints, and Flask apps. This is normal and, by itself, not an issue. However, it is all too easy to let weak security practices creep in. For example, whenever we at Jumping Rivers engage with a company, we have a standard checklist of gotchas (and solutions):
- Who is in charge of updating packages, e.g. R or Python?
- How do you monitor old dashboards for potential security vulnerabilities?
- How are your API keys generated, stored and shared?
- Do you use and enforce two-factor authentication?
- How do you monitor your cloud computing resources?
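On the API-key point, one simple and widely applicable practice is to keep keys out of scripts and version control entirely, reading them from the environment instead. A minimal R sketch; the variable name is a placeholder:

```r
# Read an API key from an environment variable (e.g. set in .Renviron)
# rather than hardcoding it in a script that may end up in a git repo
api_key <- Sys.getenv("MY_SERVICE_API_KEY")

# Fail fast and loudly if the key is missing
if (!nzchar(api_key)) {
  stop("MY_SERVICE_API_KEY is not set; refusing to continue")
}
```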
Digital security around data science is still in its infancy. However, with a little thought, it is possible to apply standard security practices from other areas to tighten up the process.
Notes on the Bioconductor Study
- The Bioconductor installation process was changed around 18 months ago. When you discover a vulnerability, it is good practice to give the organisation reasonable time to fix it before publishing.
- My fake bioconductor domains never returned an R script; they simply gave a 404 (page not found) error. This allowed me to demonstrate the potential for a security vulnerability without actually compromising any organisations. The latter would have crossed the line into being illegal.