Why data-driven science is more than just a buzzword

Tara Murphy, University of Sydney

Forget looking through a telescope at the stars. An astronomer today is more likely to be online: digitally scheduling observations, running them remotely on a telescope in the desert, and downloading the results for analysis.

For many astronomers the first step in doing science is exploring this data computationally. It may sound like a buzzword, but data-driven science is part of a profound shift in fields like astronomy.

A 2015 report by the Australian Academy of Science found that among more than 500 professional astronomers in Australia, around one quarter of their research effort was now computational in nature. Yet many high school and university science, technology and engineering subjects still treat the necessary skills as second-class citizens.

Referring both to the modelling of the world through simulations and to the exploration of observational data, computation is central not only to astronomy but also to a range of sciences, including bioinformatics, computational linguistics and particle physics.

To prepare the next generation, we must develop new teaching methods that recognise data-driven and computational approaches as some of the primary tools of contemporary research.

The era of big data in science

The great empiricists of the 17th century believed that if we used our senses to collect as much data as possible, we would ultimately understand our world.

Although empirical science has a long history, there are some key differences between a traditional approach and the data-driven science we do today.

The change that has perhaps had the most impact is the sheer amount of data that computers can now collect. This has enabled a change in philosophy: data can be gathered to serve many projects rather than just one, and the way we explore and mine data allows us to “plan for serendipity”.

Cleo Loi describes her discovery of plasma tubes in the Earth’s ionosphere.

Take the search for new types of astronomical phenomena. Large data sets can yield unexpected results: some modern examples are the discovery of fast radio bursts by astronomer Duncan Lorimer and the discovery of plasma tubes in the Earth’s ionosphere by a former undergraduate student of mine, Cleo Loi. Both of these depended on mining of archival data sets that had been designed for a different purpose.

Many scientists now work collaboratively to design experiments that can serve many projects at once and test different hypotheses. For example, the book outlining the science case for the future Square Kilometre Array Telescope, to be built in South Africa and Australia, has 135 chapters contributed by 1,200 authors.

Our education system needs to change, too

Classic images of science include Albert Einstein writing down the equations of relativity, or Marie Curie discovering radium in her laboratory.

A page from Albert Einstein’s Zurich Notebook.

Our understanding of how science works is often formed in high school, where we learn about theory and experiment. We picture these twin pillars working together, with experimental scientists testing theories, and theorists developing new ways to explain empirical results.

Computation, however, is rarely mentioned, and so many key skills are left undeveloped.

To design unbiased experiments and select robust samples, for example, scientists need excellent statistical skills. But often this part of maths takes a back seat in university degrees. To ensure our data-driven experiments and explorations are rigorous, scientists need to know more than just high school statistics.

Marie Curie in her chemistry laboratory at the Radium Institute in France, April 1921.

In fact, to solve problems in this era, scientists also need to develop computational thinking. It’s not just about coding, although that’s a good start. They need to think creatively about algorithms, and how to manage and mine data using sophisticated techniques such as machine learning.

Naively applying simple algorithms to massive data sets doesn’t work, even with the power of a 10,000-core supercomputer. Switching to more sophisticated techniques from computer science, such as the kd-tree algorithm for matching astronomical objects, can speed up software by orders of magnitude.
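To make the speed-up concrete, here is a minimal sketch of k-d tree cross-matching, the kind of technique the article alludes to. It is not the author’s own code: it uses SciPy’s cKDTree on an assumed catalogue format of (RA, Dec) pairs in degrees, converting sky positions to unit vectors so that the tree’s Euclidean queries correspond to angular separations. A brute-force match compares every object in one catalogue to every object in the other; the tree query does the same job in roughly O(n log m) time.

```python
import numpy as np
from scipy.spatial import cKDTree

def crossmatch(cat1, cat2, max_sep_deg):
    """Match each (ra, dec) object in cat1 to its nearest neighbour
    in cat2 via a k-d tree. Returns (index_into_cat2, separation_deg);
    objects with no match within max_sep_deg get index -1.
    """
    def to_cartesian(radec):
        # Convert (ra, dec) in degrees to unit vectors on the sphere.
        ra = np.radians(radec[:, 0])
        dec = np.radians(radec[:, 1])
        return np.column_stack([np.cos(dec) * np.cos(ra),
                                np.cos(dec) * np.sin(ra),
                                np.sin(dec)])

    xyz1 = to_cartesian(np.asarray(cat1, dtype=float))
    xyz2 = to_cartesian(np.asarray(cat2, dtype=float))

    # One tree build plus n queries replaces an n * m brute-force scan.
    dist, idx = cKDTree(xyz2).query(xyz1)

    # Convert 3D chord length back to angular separation in degrees.
    sep = np.degrees(2 * np.arcsin(np.clip(dist / 2, 0, 1)))
    idx[sep > max_sep_deg] = -1
    return idx, sep
```

In practice astronomers would reach for a library routine such as Astropy’s catalogue matching, which wraps the same idea; the sketch above just shows why the tree structure, not raw compute, delivers the orders-of-magnitude gain.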

Some steps are being taken in the right direction. Many universities are introducing courses and degrees in data science, incorporating statistics and computer science combined with science or business. For example, I recently launched an online course on data-driven astronomy, which aims to teach skills like data management and machine learning in the context of astronomy.

In schools the new Australian Curriculum in Digital Technologies makes coding and computational thinking part of the syllabus from Year 2. This will develop vital skills, but the next step is to integrate modern approaches directly into science classrooms.

Computation has been an important part of science for more than half a century, and the data explosion is making it even more central. By teaching computational thinking as part of science, we can ensure our students are prepared to make the next round of great discoveries.

Tara Murphy, Associate Professor and ARC Future Fellow, University of Sydney

This article was originally published on The Conversation. Read the original article.

Bartering for science: using mobile apps to get research data

Olivia Walch, University of Michigan

There’s a transaction that happens every time you load a website, send an email, or click “like” on a friend’s post: You get something you want in exchange for some data about your actions and interests. Entire business models depend on the premise that the data we generate in this way have value, and massive databases have been assembled with this in mind.

Can we harness data collection of this kind for research? So far, companies have been in the vanguard of this type of work, with academics lagging behind. We know analysis of large datasets will transform social sciences, but wearables and other sensors could expand biological understanding too. Some firms, such as Twitter, have released data to academics, and many cool projects have emerged as a result, from predicting flu outbreaks to training computer models of language. But so far, researchers haven’t had much control over what data are available for analysis.

Even when the collected data do align well with a researcher’s interests, most companies aren’t open enough to be truly useful. Jawbone, for instance, recently released a survey of sleep habits from college students around the United States on its blog, but didn’t disclose the algorithm it used to measure sleep. It’s understandable: There’s not much business upside to opening their methods to potential competitors. But it does mean that the data do not join the scholarly sleep literature, and don’t help direct the future of research in the field.

What if researchers got directly involved, providing users with something they want and getting data targeted for their exact research questions in return? In 2014, my Ph.D. advisor Daniel Forger and I tried exactly this. What we learned could be used by researchers in many areas, benefiting the public and scholarly study alike.

Offering incentives

I wrote an app that provides travelers with schedules of when to seek and avoid light to help them get over jet lag as quickly as possible. The schedules were computed using a mathematical model of the circadian clock and a kind of mathematics called “optimal control theory.” To return the favor for the free app, users could opt in to anonymously submit their sleep history and light exposure during their trip back to us, delivering analyzable data.

A screen from the Entrain app.
Entrain app, Author provided

About 155,000 users have downloaded our app Entrain to date. Of those, more than 11,000 – seven percent – have sent us data in return. That level of return is a testament to the appeal of what we offered, despite having almost no budget.

Walking around campus here at the University of Michigan, I often see flyers offering to pay me US$10 for taking a survey. Our app is a high-tech version of the same idea, turbo-charged for efficiency: We get lots of data for free, the app itself advertises the schedules our paper describes and we survey a broader audience than just college undergrads. Our potential research pool was limited to smartphone users, but smartphones penetrate into more income brackets and demographics than you might initially expect.

How can other researchers get at mobile data like this? Finding something to exchange for the data is a great first step. This can include educational materials, or information about how a user compares to other survey respondents (for example, the pioneering Munich Chronotype Questionnaire), or individualized theoretical predictions built from mathematical models, like what we did with Entrain.

As a mathematician, I’m particularly partial to the last one: The optimal schedules for reducing jet lag are a neat result, but the techniques used in computing them aren’t specific to any one application. There’s a whole corpus of mathematical models of biology that could be translated to mobile forms to provide compelling reasons for people to give up their data, like modeling how sleep debt builds up over weeks or how your metabolism adjusts to diet.

The future of data collection

Building the app to collect the data is a major hurdle. Making the app myself was a fun exercise, but a graduate student’s home brew can’t keep up with professional app designers. With funding, researchers can hire companies to develop an app for them.

A plan from the Entrain app to help a user minimize jet lag.
Entrain app, Author provided

That said, it was incredibly freeing to be able to release the app without needing a grant to back it, and new tools are making it increasingly easy to build an app on your own. Since our app came out, for instance, Apple has released ResearchKit, which makes it easier for researchers to get signed waivers from app users and to collect data from participants.

Having help with informed consent solves a problem researchers have that for-profit companies don’t: ensuring the people who are the data sources know what information we’re using and for what purposes. We solved that problem in Entrain by requiring people to opt in to sending us their information, and anonymizing the information the app sent. As tools like ResearchKit continue to develop, it will get easier and easier for researchers to steer their own data collection.

Mobile is the future of this kind of data collection. Apps are personal in ways websites aren’t: They’re more closely tied to our identities and can access more private data. With wearables and other new forms of technology connecting to them, our phones are becoming increasingly accurate proxies for ourselves. If researchers can find the right ways to tap into this information and encourage users to share data, they can collect exactly the data their research requires – and lots of it, to boot.

Olivia Walch, Ph.D. Candidate in Applied and Interdisciplinary Mathematics, University of Michigan

This article was originally published on The Conversation. Read the original article.