If CLC and Galaxy aren’t the answer, what is?

Earlier in the day, I said this:

I don’t think Galaxy/CLC is the solution for biologists with genomic data.. These tools simply allow for sloppy and uninformed analyses.

— Matt MacManes (@PeroMHC) August 30, 2014

Now, this is something that I have struggled with for the last few years. Increasingly, people have been trying to ride the genomics bandwagon, because there is a lot of power in these techniques – power to uncover interesting biological phenomena at an unprecedented depth. I get that, and of course, this was one of the reasons I myself got into genomics: I could see the promise of these techniques. The timing was just right for me. I was finishing up my PhD and intentionally sought out a postdoc where I could go from naive bioinformatician to expert. This tooling-up phase was difficult. Learning to read and write code, understand sequence analysis, etc. is hard, and I was lucky to have been able to focus on this for a few years.

I get that other researchers may not have the time to do what I did – the demands of teaching, mentoring, and writing grants are large. I understand. What I don’t believe is the solution, however, is to plug into CLC, or naively into Galaxy. These tools lure people in with their ease; in fact, ease of use is what they advertise. Load your data, click a button and poof – out comes an assembly. Anyone who has ever done an assembly the right way will know that every single assembly requires some tinkering. No two are alike, and blindly running defaults will invariably result in a suboptimal assembly (or whatever analysis you’re doing).

Are there people doing really good analyses on Galaxy? Yes, absolutely (note I don’t think the same can be said for CLC, given the black-box nature of the software). Do I think there are people doing sloppy analyses using command-line tools? OH YEAH, and obviously I’ve done some of them myself. It’s just that the platform, I believe, enables abuse. I see it in student work and in published work, all over the place.

So, if we believe that not everybody has time to be a bioinformatician, and CLC/Galaxy are not great options, what is the solution?

I think there are two potentially good solutions.

Collaborate with a bioinformatician. We are not so rare these days as to make this infeasible. Find somebody, and work with them to ensure that a proper analysis is being done.
Train yourself. Honestly, if you’re an academic, I’m going to assume you got to that position because of your hard work, intellect, drive, luck, etc. In other words, you can train yourself to be a competent sequence analyst. Do a Software Carpentry class, or take an intensive summertime class (see here and here just to name a few). Really, nowadays there are a whole bunch of these types of workshops available, and their availability will only grow over time. If you are a grad student, postdoc, or faculty member, and you want to invest a bunch of money in collecting sequence data, why not also invest a bunch of time and money in doing appropriate analyses?

More thoughts on this later – but this is all for now.
