Backing up to USB with a Batch Script in Microsoft Windows

I have recently reentered the world of work and have been issued with a brand new computer that has Windows 10 on it. In years past, I developed an aversion to using Windows and was frustrated at each institution that pressed a computer into my hands with this proprietary operating system on it.

But with age, my zeal and my idealism have waned and I am much more comfortable using Windows, particularly now that it has an embedded Ubuntu subsystem (the Windows Subsystem for Linux).

I find, too, that once in a while there is something about Windows that I genuinely like. With an increasing number of training materials on my hard drive, I have become increasingly paranoid that I will suffer a hard drive failure and lose all of my materials. For the moment I have taken to backing up my data to a USB thumb drive. In this article I will show you my approach in the hope that it will provide you with some marginal value.

Fixing the USB Thumb Drive's Drive Letter

To make it easier to write our script, we can first ensure that each time we mount our USB thumb drive it will mount to the same logical drive letter. In my case, I chose the Z: drive, meaning that no other media will accidentally be mapped to the same drive letter.

To achieve this, first launch the Disk Management utility: press the Windows key, type diskmgmt.msc and choose the result that is listed. Insert your USB thumb drive and you will see it appear in the list of drives in Disk Management.

Right-click on the drive and choose “Change Drive Letter and Paths…”. Then change the drive letter to Z:. Each time you mount your drive in Windows from now on, it will mount to the drive letter Z.

Write the Script

Now that the drive is mounted predictably, a very simple batch script can be created. The purpose of this script is to back up the contents of our Documents folder to Z:\ and then unmount the USB drive so we can pull it out as soon as the script has finished. As you can see, the script gives an error message if Z: is not mounted. It uses ROBOCOPY to efficiently mirror the contents of the Documents directory to Z: and then unmounts the volume.
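
The original script isn't reproduced here; a minimal sketch along those lines, assuming the Documents folder lives at %USERPROFILE%\Documents and using mountvol to take the volume offline (the original may have unmounted the drive by another means), might look like this:

    @echo off
    REM Back up Documents to the USB thumb drive mounted at Z:
    IF NOT EXIST Z:\ (
        echo The Z: drive is not mounted. Insert the USB thumb drive and try again.
        pause
        exit /b 1
    )

    REM Mirror the Documents folder onto the USB drive
    ROBOCOPY "%USERPROFILE%\Documents" "Z:\Documents" /MIR

    REM Dismount the volume so it is safe to remove
    mountvol Z: /p

    echo Backup complete. You can now remove the USB thumb drive.
    pause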

Paste the contents of the above into a text editor and save it as a .bat file on your desktop. Whenever you want to back up, insert your USB stick and then double-click the script. When the drive has successfully unmounted, the script will tell you that you can pull out the USB thumb drive.

Posted in productivity

Unsubscribe from all your YouTube channels with one weird trick

Here’s a short one for you. I have wanted to clear my list of subscribed channels on YouTube for a long time. Unfortunately, it seems that in recent years there has been no automated way of doing this. If, like me, you are subscribed to a lot of channels on YouTube, this means that you need to click on each channel you are subscribed to and select the ‘Unsubscribe’ button.

In my case, it was quicker to write a short snippet of JavaScript that manipulates the DOM in order to achieve this. I’m sharing this here to save someone from repeating my efforts, as long as YouTube doesn’t change the layout of their subscriptions page any time soon.

To unsubscribe from all your YouTube channels, open Chrome and visit https://www.youtube.com/subscription_manager.

Open the console by pressing F12, then copy and paste the following code:
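
The original snippet isn't included here; the selectors below are placeholders rather than YouTube's real class names, but they illustrate the approach of clicking every unsubscribe button and then confirming the dialog:

    // placeholder selectors: inspect the subscription manager page and substitute
    // the real class names for the unsubscribe and confirmation buttons
    var unsubscribeButtons = document.querySelectorAll('.unsubscribe-button');

    unsubscribeButtons.forEach(function (button, i) {
      // stagger the clicks so each confirmation dialog has time to appear
      setTimeout(function () {
        button.click();
        var confirmButton = document.querySelector('.confirm-unsubscribe-button');
        if (confirmButton) {
          confirmButton.click();
        }
      }, i * 500);
    });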

 

Posted in Uncategorized

Calculate average ranking in R

Here is a short post describing how to calculate the average rank from a set of multiple rankings listed in different
data.frames in R. This is a fairly straightforward procedure; however, it took me more time than I anticipated to make it
work.

To begin with, let's create a set of data.frames and randomly assign rank values from 1 to 5 to the letters
A, B, C, D and E:
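
The original code isn't reproduced here; a minimal reconstruction of that setup (the object names and the use of sample are my own) might be:

    set.seed(42)
    items <- c("A", "B", "C", "D", "E")

    # five data.frames, each holding a random ranking of A to E
    rankings <- lapply(1:5, function(i) {
      data.frame(id = items, rank = sample(1:5))
    })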

For A, B, C, D and E, we can quite easily calculate the average ranks. To do this, we use the sapply command to create a
matrix of all the rankings from each data frame, with a column for each of the five sets of rankings and a row for each of A
through E:
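
Continuing the sketch:

    # one column per set of rankings, one row for each of A through E
    rank.matrix <- sapply(rankings, function(df) df[order(df$id), "rank"])
    rownames(rank.matrix) <- items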

Next, we calculate the mean for each of A, B, C, D and E using the built-in R function rowMeans:
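
In the sketch, that is simply:

    mean.ranks <- rowMeans(rank.matrix)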

Finally we use the order function to get the final rank values and convert the vector back into a data.frame which is of
the same format as the original rank data.frames:
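
In the sketch, that final step might look like this:

    # order the items by their mean rank and rebuild a data.frame in the
    # same shape as the original rankings
    average.ranking <- data.frame(id = items[order(mean.ranks)], rank = 1:5)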

I hope this was of use to someone, even if that person happens to be a forgetful future me. I'm more than certain that
there is an R function or an R package which will perform this for you, but it was nonetheless at most interesting and at
least fun to implement. If anyone has an alternative, more elegant solution, I would really appreciate hearing from you.
Happy hacking!

Posted in Uncategorized

Testing code in RMarkdown documents with knitr

Over the last few months, Literate Programming has proved to be a huge help to me in documenting my exploratory code in R. Writing Rmarkdown documents and building them with knitr not only provides me with a greater opportunity to clarify my code in plain English, it also allows me to rationalise why I did something in the first place.

While this is really useful, it has come at the expense of writing careful, well unit-tested code. For instance, last week I discovered that a relatively simple function that I wrote to take the average values from multiple data frames was completely wrong.

As such, I wanted to find a way that lets me continue to write Rmarkdown while also testing my code directly using a common unit testing framework like testthat.

Here is one solution to that problem: if we isolate our key functions from our Rmarkdown document and place them in a separate R file, we can test them with testthat and include them in our Rmarkdown document by using knitr’s read_chunk function.

Prime Numbers

As an example, let’s create a document which shows our function for finding out if a number is prime or not:
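
The document itself isn't reproduced here; a sketch of what it might contain (the chunk names come from the text below, while the is.prime implementation is my own illustration):

    ---
    title: "Prime numbers"
    output: html_document
    ---

    ```{r is-prime-function}
    is.prime <- function(n) {
      if (n < 2) return(FALSE)
      if (n == 2) return(TRUE)
      if (n %% 2 == 0) return(FALSE)
      limit <- floor(sqrt(n))
      if (limit < 3) return(TRUE)
      for (i in seq(3, limit, by = 2)) {
        if (n %% i == 0) return(FALSE)
      }
      TRUE
    }
    ```

    ```{r test-is-prime}
    is.prime(1000)
    ```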

This Rmarkdown document lists the function and tests to see whether 1,000 is prime or not. After rendering the HTML document, I am relieved to see is.prime does indeed yield FALSE for is.prime(1000). If I wanted to introduce more tests, I could simply add more lines to the test-is-prime chunk and, if they passed, comment that chunk out of the file.

This isn’t ideal for a number of reasons, one being that I’m not using a testing framework which would allow me to automatically check whether I had broken my code.

I solved this by moving the is-prime-function chunk into a separate R file called is_prime.R:
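
A sketch of is_prime.R, using the '## ----' labelling convention described below (the function body is the same illustrative implementation as above):

    ## ---- is-prime-function
    is.prime <- function(n) {
      if (n < 2) return(FALSE)
      if (n == 2) return(TRUE)
      if (n %% 2 == 0) return(FALSE)
      limit <- floor(sqrt(n))
      if (limit < 3) return(TRUE)
      for (i in seq(3, limit, by = 2)) {
        if (n %% i == 0) return(FALSE)
      }
      TRUE
    }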

Knitr interprets '## ----' at the start of a line as a label for a chunk of code that can be reused later. The chunk of code associated with this label runs until the end of the file or until another label is encountered.

To include this code, as before, I just have to change a few small things:
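
Again sketching, the modified Rmarkdown document might look like this (the contents of the setup chunk are an assumption):

    ```{r setup, include=FALSE}
    library(knitr)
    read_chunk('is_prime.R')
    ```

    ```{r is-prime-function}
    ```

    ```{r test-is-prime}
    is.prime(1000)
    ```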

You will note the extra line in the setup chunk, which calls knitr’s read_chunk function with the path to the newly created R file. To include the is.prime function in the document again, an empty chunk with the same name as the label in is_prime.R has to be created. When I use knitr to create the document, knitr will inject the is.prime function into that empty chunk and is.prime(1000) will execute successfully.

Now, testing is.prime with testthat is relatively easy by just creating test/test_is_prime.R and writing a few test cases:
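
A sketch of test/test_is_prime.R (the particular test cases are my own):

    library(testthat)
    source('is_prime.R')

    test_that('is.prime identifies primes and non-primes', {
      expect_true(is.prime(2))
      expect_true(is.prime(13))
      expect_false(is.prime(1))
      expect_false(is.prime(1000))
    })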

And to run our tests in RStudio, we just have to type this into the console:
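
Assuming the tests live in a directory called test, something like:

    testthat::test_dir('test')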

It’s fairly simple, clean and common sense. An added bonus is that I can now inject is_prime into any other Rmarkdown document by following the same method.

The accompanying source code for this blog can be found at https://github.com/hiraethus/How-to-unit-test-RMarkdown.

Posted in programming, r, rstudio, testing

Using Packrat with Bioconductor in RStudio

As an R programmer, you may not be familiar with the development processes involved in programming Java. Those of you who have written some production Java code may have found that the barrier to entry can seem quite high: there are so many tools you need to grok in order to reach a basic level of proficiency, particularly if you are thrown in at the deep end on a mature code base. Furthermore, even more complexity will soon be added to the Java developer’s toolbox with the additions presented in Java JDK 9, such as the module system that has burst forth from Project Jigsaw.

With this said, one practice that has matured in Java software development is dependency management. While in years past Java libraries (packages) would be committed to repositories along with a project’s source code, it is now common practice to define a pom.xml file which lists all libraries and their versions. When a developer clones a copy of the source code, she runs Maven (for example ‘mvn package’) to build the project and simultaneously download any library dependencies to her machine that are not already present. This means that all developers who build their project using Maven will be using the same versions of libraries while testing their code.

Packrat provides this very same level of convenience to R programmers. In particular, packrat works by creating a subfolder in your R project which stores a file listing all the packages and versions used in your project, as well as a repository of packages that is used privately by your project.

As this blog addresses using R with Bioconductor, I will discuss what I do in order to set up a project with packrat using RStudio. You will note that I mix the graphical user interface with a handful of packrat commands in the console, which I find to be the most useful approach.

The easiest way to set up RStudio to use Packrat is, when creating a new project, to ensure you choose the ‘Use packrat with this project’ option.

Ensure you choose 'Use packrat with this project'

Now we have an R project with its own package library inside the packrat subfolder. If you are using a version control system like Git, it’s tempting to commit the whole packrat directory along with the project to your repository. The consequence of this is that you are potentially committing binary files, making the repository much larger for others to download if they want to run your code.

To avoid committing R packages to git, packrat provides a function that modifies your .gitignore file. Run this inside your RStudio console:
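
I won't guess the exact call the original post showed; one way packrat exposes this is through set_opts, whose vcs.ignore options add the relevant packrat directories to .gitignore:

    # instruct packrat to keep its private library and package sources out of git
    packrat::set_opts(vcs.ignore.lib = TRUE, vcs.ignore.src = TRUE)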

You can now commit everything to your git repository as an initial commit.

You should now be able to install packages using install.packages to retrieve them from CRAN. After a package is installed, it is saved in your project’s private packrat library. However, in order to update packrat’s list of packages (which is recorded in the packrat/packrat.lock file), you should run packrat::snapshot() after each package you install in order to avoid any surprises later on.
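
For example (dplyr here is just an illustrative package):

    install.packages("dplyr")  # installed into the project's private packrat library
    packrat::snapshot()        # record the change in packrat/packrat.lock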

Finally, one issue I had with using packrat was how to install packages from Bioconductor. In my experience, the easiest way to do this is to set the available repositories interactively from the console.
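
The function in question is, presumably, R's built-in setRepositories():

    setRepositories()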

This presents you with a text-based prompt from which you can select repositories:

Select all BioC repositories and then you can simply install all required Bioconductor packages using install.packages. Packrat will keep track of the version of Bioconductor currently being used.

Having done all this, when someone wants to use your code elsewhere, they need only clone your project and load it into RStudio. RStudio will automatically restore any missing packages into the project by downloading them from the relevant repositories.

Packrat is by no means perfect. For instance, on Windows packrat will endeavour to download binary packages, since Windows typically lacks a toolchain for compiling C/C++ code; some packages in Bioconductor are only available as source and, as such, packrat is unable to find them.

I really appreciate the work done to make packrat work with R. I’m sure it will become increasingly important in the future for making R code more stable and predictable by keeping R packages consistent across all of the computers using a particular R project.

Posted in packrat, r, rstudio, statistics

Bioconductor Tip: Use affycoretools to get Gene Symbols for your ExpressionSet

For whatever reason, following on from my despair with normalizing gene expression data earlier in the week, my most recent challenge has been to take a Bioconductor ExpressionSet of gene expression data, measured using an Affymetrix GeneChip® Human Transcriptome Array 2.0, and label each row with its corresponding gene symbol rather than its probe ID.

I have seen a lot of code samples that suggest variations on a theme of using the biomaRt package or querying a SQL database of annotation data directly: with the former I gave up trying; with the latter, I ran away to hide, having only recently interacted with a SQL database through Java’s JPA abstraction layer.

It turns out to be very easy to do this using the affycoretools package by James W MacDonald which contains ‘various wrapper functions that have been written to streamline the more common analyses that a core Biostatistician might see.’

As you can see below, you can very easily extract a vector of gene symbols for each of your probe IDs and assign them as the rownames of your gene expression data.frame.
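
The original snippet isn't shown here; a rough sketch of the idea, assuming the HTA 2.0 annotation package hta20transcriptcluster.db and affycoretools' annotateEset function (check the columns of fData(eset) for the exact names), might look like this:

    library(affycoretools)
    library(hta20transcriptcluster.db)

    # eset is your existing ExpressionSet; annotate it with gene-level information
    eset <- annotateEset(eset, hta20transcriptcluster.db)

    # pull the gene symbols out of the featureData and use them as rownames
    expr.df <- as.data.frame(exprs(eset))
    symbols <- as.character(fData(eset)$SYMBOL)

    # fall back to the probe ID where no symbol exists, and de-duplicate
    symbols[is.na(symbols)] <- rownames(expr.df)[is.na(symbols)]
    rownames(expr.df) <- make.unique(symbols)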

I hope this will save you the trouble of finding this gem of a package.

Posted in bioconductor, microarray, productivity, programming, r

Be pragmatic about your choice of laptop in Bioinformatics

Recently I have been familiarising myself with analysing microarray data in R. Statistics and Analysis for Microarrays Using R and Bioconductor by Sorin Draghici is proving to be indispensable in guiding me through retrieving microarray data from the Gene Expression Omnibus (GEO), performing typical quality control on samples and normalizing expression data over multiple samples.

As an example, I wanted to examine the gene expression profiles from GSE76250, which is a comprehensive set of 165 Triple-Negative Breast Cancer samples. In order to perform the quality control on this dataset as detailed in the book, I needed to download the Affymetrix .CEL files and then load them into R as an AffyBatch object:
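
The original snippet isn't reproduced here; the gist of it, assuming the .CEL files have been downloaded and unpacked into a local directory (the path is illustrative), is something like:

    library(affy)

    # read every .CEL file in the directory into a single AffyBatch object
    raw.data <- ReadAffy(celfile.path = "GSE76250_CEL")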

The raw.data AffyBatch object representing these data, when loaded into R, takes over 4 gigabytes of memory. When you then perform normalization on the data using rma(raw.data) (Robust Multi-Array Average), this creates an ExpressionSet that effectively doubles the memory footprint.

This is where I came a bit unstuck. My laptop is a 13-inch Asus Zenbook UX303A which comes with (what I thought to be) a whopping 11 gigabytes of RAM. This meant that after loading and transforming all the data on my laptop, I had effectively maxed out my RAM. The obvious answer would be to upgrade my RAM. Unfortunately, due to the small form factor of my laptop, I only have one accessible RAM slot, meaning my options are limited.

So, I have concluded that I have three other options to resolve this issue.

  1. Firstly, I could buy a machine that has significantly more memory capacity at the expense of portability. Ideally, I don’t want to do this because it is the most expensive approach to tackling the problem.
  2. Another option would be to rent a Virtual Private Server (VPS) with lots of RAM and to install RStudio Webserver on it. I’m quite tempted by the idea of this but I don’t like the idea of my programming environment being exposed to the internet. Having said this, the data I am analysing is not sensitive data and, any code that I write could be safely committed to a private Bitbucket or Github repository.
  3. Or, I could invest the time in approaching the above problem in a less naive way! This would mean reading the documentation for the various R and Bioconductor packages to uncover a more memory-efficient method or, it could mean scoping my data tactically so that, for instance, the AffyBatch object is garbage collected, freeing up memory once I no longer need it (see the sketch after this list).
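
As a sketch of that third option: once the normalized ExpressionSet exists, the raw AffyBatch can be discarded so that R can reclaim the memory.

    eset <- rma(raw.data)

    # the multi-gigabyte AffyBatch is no longer needed once the ExpressionSet exists
    rm(raw.data)
    gc()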

In any case, I have learned to be reluctant to follow the final path unless it is absolutely necessary. I don’t particularly want to risk obfuscating my code by adding extra layers of indirection while, at the same time, leaving myself open to making more mistakes by making my code more convoluted.

The moral of the story is not to buy a laptop for its form factor if your plan is to do some real work on it. Go for the clunkier option that has more than one RAM slot.

Either that or I could Download More Ram.

Posted in productivity, programming, r

Converting nginx access logs to tsv using bash

To my humble satisfaction, Gwasanaethau Cymru (Services Wales) was
launched a mere week and a half ago. It is my first genuine effort to
write a publicly accessible web application that I intend to actively
maintain so that I can grow my Java development skills.

I have an nginx web server sitting in front of my instance of Tomcat
and I’ve noticed myself becoming increasingly fascinated by the
access.log found in /var/log/nginx.

I find myself looking daily to see how many people are visiting. What
I am most interested in finding out is whether people are visiting my
page multiple times. As the number of visitors is very modest at the
moment, it would still be realistic for me to load this data into
Microsoft Excel (or LibreOffice Calc in my case) and use a pivot table
for this purpose.

Unfortunately the data isn’t in a format that Excel could read in
nicely:
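
The original excerpt isn't reproduced here; a typical line in nginx's
default combined log format looks roughly like this (the values are
illustrative):

    203.0.113.42 - - [15/Mar/2017:10:12:33 +0000] "GET / HTTP/1.1" 200 3941 "-" "Mozilla/5.0 (X11; Linux x86_64)"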

Of course, I could separate the fields using a whitespace character;
however, I am only interested in extracting the IP address and the
full timestamp. After some effort I found that I could use a
combination of grep and xargs to achieve this:
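
A reconstruction of the sort of pipeline described (the regular
expressions are my own guess; this variant matches the date and time
but not the timezone offset, so that each record is exactly two
whitespace-separated tokens):

    # rotated, gzipped logs would need zcat rather than cat
    cat /var/log/nginx/access.log* \
      | grep -o \
          -e '^[0-9]\{1,3\}\(\.[0-9]\{1,3\}\)\{3\}' \
          -e '[0-9]\{2\}/[A-Za-z]\{3\}/[0-9]\{4\}:[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}' \
      | xargs -n 2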

We use cat to concatenate all of our access.log files together. When
we pipe this into grep, we use -e twice to match both the IP address
field and the timestamp field. Unfortunately, each matching
IP/timestamp combination identified by grep is printed over two
lines. We can rectify this by simply piping to xargs, which allows us
to group our input into multiples of n, in this case 2.

The output looks like this:
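
With the illustrative values from earlier, something like:

    203.0.113.42 15/Mar/2017:10:12:33
    203.0.113.42 15/Mar/2017:11:02:07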

There are certainly cleaner ways of achieving this, but I think using
grep and xargs in this way gives you the flexibility to match a number
of arbitrary patterns in a file such as access.log. This could also be
used to strip unwanted data out of very large log files so that their
size is more manageable for use in an application like Excel.

Posted in bash, logs, nginx

Using Vagrant to test Apache Spark applications

Apache Spark is fast becoming the established platform for developing
big data applications both in batch processing and, more recently,
processing real-time data with the use of Spark streaming.

For me, Apache Spark really shines in that it allows you to write
applications to run on a Yarn Hadoop cluster and there is little to no
paradigm shift for developers coming from a functional background.

Conveniently, Spark does have a standalone mode in which it can be run
locally. This can be great for local validation of your code but I
felt that having a YARN cluster running HDFS would enable me to make
my code consistent between development and production environments.

There is a project, Apache Bigtop, which provides a means to deploy
Apache Hadoop/Spark to virtual machines or Docker containers. This is
definitely the avenue I would like to go down in the future but, for
now, I wanted to get an idea of the components of a YARN cluster as
well as come up with a lighter-weight solution myself.

I therefore set about developing a very simple Vagrantfile with a
number of bash scripts to set up two machines:

  1. hadoop

    An Ubuntu virtual machine to act as a pseudo-single-node YARN
    cluster running HDFS and Hadoop. The scripts and configuration for
    this drew much influence from the official Apache Hadoop
    documentation as well as an indispensable tutorial posted by Sain
    Technology Solutions. Thanks for making it easy for me!

  2. spark

    Another Ubuntu virtual machine that simply has Apache Spark
    installed on it.

To get started, git clone
https://github.com/hiraethus/vagrant-apache-spark, cd into the
repository and type vagrant up.
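
That is:

    git clone https://github.com/hiraethus/vagrant-apache-spark
    cd vagrant-apache-spark
    vagrant up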

If all is well, you should be able to ssh into the spark instance and
run the spark interactive REPL:
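
For example (assuming spark-shell is already on the PATH inside the
VM):

    vagrant ssh spark
    spark-shell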

You should also be able to upload files to the hadoop machine and
subsequently to HDFS for use in your spark application:
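
Something along these lines, assuming a file dropped into Vagrant's
shared /vagrant folder (the file name here is illustrative):

    vagrant ssh hadoop

    # inside the hadoop VM: copy a file from the shared folder into HDFS
    hdfs dfs -mkdir -p /data
    hdfs dfs -put /vagrant/my-input.txt /data/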

While this is by no means a best practice set of deployment scripts,
it proves to be useful for basic smoke tests before attempting to
interact with a real cluster which may be inaccessible during
development. For me, the real utility will be where I integrate a web
application with a spark application that may need to read data from
HDFS independently of Apache Spark.

I would love to hear of any other solutions people have to testing
Spark applications locally that depend on a YARN cluster.

Posted in apache, hadoop, spark, vagrant, yarn

R XML Package

I’ve spent a number of years programming in Java so, during my MSc in
Bioinformatics, it took me a while to become acquainted with the nuances and
the idioms of writing code in R. The language's quirks have been discussed
extensively elsewhere, nowhere better than in John Cook's lecture R: The
Good, The Bad and The Ugly. While at first I was frustrated with the
language, I am starting to become fond of it, not least because of the
increasingly rich tooling (such as RStudio) as well as the packaging
system. While unrelated to the field of Bioinformatics, I have started to
write some sample R code for pleasure, and because of the brevity of the
code that I can write. I have been working
towards creating a Shiny web app that can visualise exercise data that is
stored in an XML format that is validated against an XML schema. You can see
the code at http://github.com/hiraethus/workout.tracker. For this I have been
using the XML package available from CRAN (kindly authored and maintained by
Duncan Temple Lang), which contains a really useful function,
xmlToDataFrame, which will take an XML document with a fairly flat
structure containing repeated elements and create a data.frame from them.
As an example, the following XML:
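
    <!-- reconstructed from the table below; the root element name is a guess -->
    <foobars>
      <foobar>
        <foo>12</foo>
        <bar>2.1</bar>
        <baz>First</baz>
      </foobar>
      <foobar>
        <foo>16</foo>
        <bar>1.1</bar>
        <baz>Not first</baz>
      </foobar>
      <foobar>
        <foo>20</foo>
        <bar>3.3</bar>
        <baz>Last</baz>
      </foobar>
    </foobars>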

would be rendered as a data.frame of the form

    foo bar baz
    12  2.1 First
    16  1.1 Not first
    20  3.3 Last

Each of these columns will be interpreted as strings of characters. The
colClasses argument of the xmlToDataFrame function allows the classes to be
specified as a vector, for instance c("integer", "numeric", "character").
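
For instance (the file name is illustrative):

    library(XML)

    doc <- xmlParse("foobars.xml")
    df  <- xmlToDataFrame(doc, colClasses = c("integer", "numeric", "character"))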

This is great! Unfortunately, each of the foo, bar and baz elements must be
present in at least one of the foobar elements. If we were to assume that
this XML document could optionally have a foobaz element of a logical
(Boolean) type and we specified our colClasses vector as
c("integer", "numeric", "character", "logical"), then if foobaz were not
present in our document, xmlToDataFrame would fail.

The only solution I have come up with to overcome this is to use
xmlToDataFrame without the colClasses argument and then replace each column
with another column coerced to the specified type after it has been read in
from the XML document. I currently do this in workout.tracker:
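
The actual code lives in the workout.tracker repository; a minimal sketch
of the idea, using the foo/bar/foobaz example from above, is:

    df <- xmlToDataFrame(doc)

    # coerce each column to its intended type after it has been read in
    df$foo <- as.integer(as.character(df$foo))
    df$bar <- as.numeric(as.character(df$bar))
    if ("foobaz" %in% colnames(df)) {
      df$foobaz <- as.logical(as.character(df$foobaz))
    }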

I am more than happy with the time savings the XML package has provided me in
converting my XML document into a data.frame in R. My solution for providing
types to the columns of my data frame, while probably very inefficient, is
ample for the few hundred entries I will have (or not, depending on how well
I keep to my fitness regime).

In the future I will reimplement this application in the Gosu programming
language to show how its type loader system can use an XSD to statically
generate objects directly from the XML document.

Posted in programming, r, statistics