NeuroOn analysis - results and discussion
Posted on Mon 19 December 2016 in misc • 18 min read
Updated twice, last on 21 December 2016. Notes below.
tl;dr - So, does it work?!
NeuroOn, the self-proclaimed "world's first smart sleep mask", isn't a medical grade device, but it's much better than a coin toss.
Its total accuracy in detecting sleep stages is 65%.
One of the biggest problems with NeuroOn is that, when used as an alarm clock, almost every third time (31.6%) it will choose the worst possible moment for waking, causing a lack of energy and grogginess1 after awakening.
Comparing NeuroOn's sleep stage results to a professional polysomnography2 scored by a human expert:
When polysomnography detects a sleep phase suitable for waking up, NeuroOn agrees in 73.8% of the cases. In the remaining 26.2% it isn't a big deal, since it will just wait until the next good opportunity to wake you up.
When polysomnography says not to wake the user up, NeuroOn agrees in 68.4% of the cases. In the remaining 31.6% - nearly a third of all cases - it could try to wake the user, resulting in grogginess and a lack of energy right after awakening1. This is a big deal and may defeat the purpose of NeuroOn's alarm clock altogether.
And for the people who would like to understand what actually happened here...
Long awaited results
This post is part of a series3 dedicated to NeuroOn sleep mask and its scientific viability. The full code and all signal samples are available on Github4.
Months after the first estimates I would like to present the results of the NeuroOn experiment in a form accessible to laymen. The analysis took nearly four months due to my stubbornness in ensuring that it's scientifically sound, completely open and reproducible on different machines. After roughly two months of dedicating every weekend to it, I realized that my own understanding of mathematics, statistical analysis and the SciPy / NumPy stack fell short of this challenge and asked Ryszard Cetnarski5 for help. Together we have been able to create a coherent Jupyter Notebook and open it for peer review on the Internet4. In November 2016 we presented a scientific poster6 at the Aspects of Neurosciences7 conference in Warsaw. Until the publication of this blogpost the only feedback we gathered concerned the small sample size and graph descriptions. If you have any additional suggestions, please send them to me8.
I'd like to thank Intelclinic9 for providing us with a test unit of NeuroOn, Ryszard Cetnarski5 for his tremendous help, as well as everybody who contributed to this analysis (in random order): Michal Kawalec, Adam Golinski, Bartosz Krol, Jaroslaw Hirniak, Karolina Stosio, Karol Benq Siek, Dawid Laszuk, Lorenzo Braschi and Piotr Migdal.
What did the experiment measure?
Our initial experimental hypothesis - "Signal gathered by the NeuroOn mask is of a good enough quality to detect a sleep stage in real time, given processing power of an average smartphone" proved to be too complex for a simple analysis, and while we could infer a lot from the signal's quality (discussed later), we were forced to change it.
Our final hypothesis was "NeuroOn achieves medical-grade results in sleep staging as compared to a human expert working on a PSG signal", as claimed by IntelClinic10 (even though they backed out of that at some point11).
EEG-based polysomnographs are the best medical- and scientific-grade devices for analyzing sleep and sleep stages, used widely in hospitals and research. We used the Aura PSG12, which is used in clinical trials in Poland, and we treat it as our single source of truth, being as close to the original brain signals as possible.
We analyzed signal from A1-F1 electrodes of the EEG (detailed description and electrode placement is available in the analysis blogpost3), pre-cleaned by the Aura PSG amplifier12 and sleep staging performed by a human technician.
It is worth noting that there are only two nights (roughly 16 hours) of recordings, captured on a healthy13, 25-year-old caucasian male14. To achieve any significant results we should conduct experiments on a more varied population (n > 30) for more than 14 nights, including people with known sleep disorders.
For NeuroOn we used the signal gathered by the three electrodes (single differential channel) on the device, pressed to the patient's head so firmly they left marks the next day15. The sleep staging was performed by an offline (not real-time) algorithm executed on an external machine afterwards. The software used to do it was provided to us by Intelclinic on 08.03.2016 under the condition that we would not try to reverse engineer16 it, to which we agreed.
We do not have any information about algorithm implementations on mobile devices used with end-user NeuroOn masks or their possible limitations.
NeuroOn's time delay
First, we assumed that NeuroOn's and PSG's signals correlate and compared them. It turns out that the devices' clocks were desynchronized, with NeuroOn's running roughly 160.5 seconds late and having a slowly growing delay over the course of the 8-hour recording. For the second night the device's clock was 160.7 seconds late. Both of these results were acquired using cross correlation between the signals as discussed in a Jupyter Notebook17.
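The idea of finding a clock offset with cross correlation can be sketched as follows. This is a minimal illustration and not the notebook's code: the function name `estimate_delay_seconds`, the shared sampling rate `fs` and the use of plain `np.correlate` are assumptions made for the example.

```python
import numpy as np


def estimate_delay_seconds(reference, delayed, fs):
    """Estimate how many seconds `delayed` lags behind `reference`.

    Both signals are assumed (for this sketch) to be sampled at the same
    rate `fs`, in Hz. A positive result means `delayed` runs late.
    """
    reference = np.asarray(reference, dtype=float)
    delayed = np.asarray(delayed, dtype=float)

    # Standardize both signals so that amplitude differences between
    # the two devices do not dominate the correlation.
    ref = (reference - reference.mean()) / reference.std()
    dly = (delayed - delayed.mean()) / delayed.std()

    # Full cross correlation: one value for every possible lag.
    corr = np.correlate(dly, ref, mode="full")
    lags = np.arange(-(len(ref) - 1), len(dly))

    # The lag with the highest correlation is the delay estimate.
    return lags[np.argmax(corr)] / fs
```

`np.correlate` is quadratic in the signal length, so for full-night recordings an FFT-based correlation (for example `scipy.signal.correlate` with `method="fft"`) would be a more practical choice.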
After finding the delays from both nights we assumed that the hypnograms - sleep staging graphs from both devices - correlate and decided to analyze their time shift. It turns out that in addition to 160 seconds of signal delay, the NeuroOn hypnogram had an additional 90 seconds of delay in detecting a sleep phase. This hypnogram was acquired by running Intelclinic's algorithms offline, using the developer's scripts - we currently have no data on real-time delays on mobile devices, as intended for end users.
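One plausible way to look for a time shift between two hypnograms is to slide one against the other and keep the shift with the highest agreement. The sketch below is an assumption about how such a search could look, not a description of the notebook's method; `best_hypnogram_shift` and `max_shift` are hypothetical names, and both hypnograms are assumed to use the same epoch length.

```python
import numpy as np


def best_hypnogram_shift(psg, neuroon, max_shift):
    """Return (shift, agreement) for the shift that best aligns the hypnograms.

    `shift` is counted in epochs: NeuroOn's label at position i + shift
    is compared with the PSG label at position i.
    """
    psg = np.asarray(psg)
    neuroon = np.asarray(neuroon)

    best_shift, best_agreement = 0, -1.0
    for shift in range(max_shift + 1):
        # Only compare the part where both hypnograms overlap.
        n = min(len(psg), len(neuroon) - shift)
        if n <= 0:
            break
        agreement = np.mean(psg[:n] == neuroon[shift:shift + n])
        if agreement > best_agreement:
            best_shift, best_agreement = shift, agreement
    return best_shift, best_agreement
```

Multiplying the returned shift by the epoch length in seconds gives the delay in seconds.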
Total accuracy in detecting sleep phases
With the clock synchronization no longer an issue, we could start comparing sleep staging between the two sources. The Jupyter Notebook18 is a good read for anyone interested in the code itself.
Since EEG-based polysomnography2 and human-conducted sleep staging are, at the moment of writing, both the academic and the industrial standard, we assumed that PSG sleep stages are our single source of truth to which we compared NeuroOn's hypnograms.
We used Cohen's kappa coefficient analysis19. Heatmaps represent confusion matrices20 normalized by rows ("Given a sleep stage detected by PSG, what was the probability of NeuroOn detecting it as...?") and joint probability matrices21 which can give insight into the frequency of respective sleep phases.
First night:

              precision    recall  f1-score   support
rem                0.70      0.60      0.64      4033
N1                 0.00      0.00      0.00      2190
N2                 0.57      0.91      0.70     10050
N3                 0.62      0.50      0.56      6690
wake               0.28      0.01      0.03      2238
avg / total        0.53      0.60      0.53     25201

accuracy: 0.60
And the second night:
              precision    recall  f1-score   support
rem                0.80      0.85      0.82      6030
N1                 0.00      0.00      0.00      1181
N2                 0.70      0.70      0.70     10640
N3                 0.64      0.76      0.70      6750
wake               0.31      0.14      0.19       600
avg / total        0.67      0.70      0.68     25201

accuracy: 0.70
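The two kinds of matrices mentioned above (row-normalized confusion matrix and joint probability matrix) can be built from a pair of hypnograms roughly like this. It is a minimal sketch: the label order in `STAGES` and the function names are assumptions made for the example, not names taken from the notebook.

```python
import numpy as np

# Assumed label set, in the same order as the result tables.
STAGES = ["rem", "N1", "N2", "N3", "wake"]


def confusion_counts(psg, neuroon, labels=STAGES):
    """Count epochs: rows are PSG stages, columns are NeuroOn stages."""
    index = {label: i for i, label in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)), dtype=int)
    for true_stage, predicted_stage in zip(psg, neuroon):
        counts[index[true_stage], index[predicted_stage]] += 1
    return counts


def normalize_rows(counts):
    """P(NeuroOn stage | PSG stage): every non-empty row sums to 1."""
    row_sums = counts.sum(axis=1, keepdims=True)
    result = np.zeros(counts.shape, dtype=float)
    # Rows for stages that never occurred in PSG stay at zero.
    return np.divide(counts, row_sums, out=result, where=row_sums > 0)


def joint_probability(counts):
    """P(PSG stage, NeuroOn stage): the whole matrix sums to 1."""
    return counts / counts.sum()
```

The row-normalized matrix answers "given this PSG stage, what did NeuroOn say?", while the joint matrix also shows how often each stage occurred at all.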
Interestingly, NeuroOn's staging algorithm never detected the N1 sleep stage, which lowered its total score.
Accuracy22 describes all sleep stages detected by NeuroOn compared to those detected by a human in the PSG signal. The average value of NeuroOn's accuracy from both nights is 0.65, putting it far below any requirements for medical usage. That doesn't have to disqualify NeuroOn from personal use, however.
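For readers who want to see how accuracy and Cohen's kappa relate, here is a short sketch that computes both from a confusion matrix of raw counts. The function names are made up for this example, and the notebook may well compute these values differently (for instance with a library function).

```python
import numpy as np


def accuracy(counts):
    """Share of epochs where both raters agree (the matrix diagonal)."""
    counts = np.asarray(counts, dtype=float)
    return np.trace(counts) / counts.sum()


def cohens_kappa(counts):
    """Agreement corrected for the agreement expected by chance."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    observed = np.trace(counts) / total
    # Chance agreement: product of the two raters' marginal frequencies,
    # summed over all stages.
    expected = np.sum(counts.sum(axis=1) * counts.sum(axis=0)) / total ** 2
    return (observed - expected) / (1.0 - expected)


# Mean accuracy over both nights, using the values reported above.
mean_accuracy = np.mean([0.60, 0.70])
```

Kappa is stricter than accuracy: a classifier that always answers with the most common stage can reach a decent accuracy, but its kappa will be zero.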
To illustrate how different NeuroOn's results are from a purely random "coin toss", here's a bootstrapped23 analysis from the second night:
If NeuroOn's sleep staging were truly random, its score would fall much closer to the scores of the randomly permuted sleep stages.
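The comparison against randomly permuted sleep scores can be illustrated with a simple permutation-style resampling. This is a sketch under assumptions: the exact resampling procedure used in the notebook may differ, and `permutation_accuracies`, `n_permutations` and `seed` are names chosen for this example.

```python
import numpy as np


def permutation_accuracies(psg, neuroon, n_permutations=1000, seed=0):
    """Compare the real accuracy with accuracies of shuffled hypnograms.

    Shuffling NeuroOn's labels keeps how often each stage was reported,
    but destroys any relation to the PSG stage at the same moment.
    """
    rng = np.random.default_rng(seed)
    psg = np.asarray(psg)
    neuroon = np.asarray(neuroon)

    observed = np.mean(psg == neuroon)
    shuffled = np.array([
        np.mean(psg == rng.permutation(neuroon))
        for _ in range(n_permutations)
    ])
    return observed, shuffled
```

If the observed accuracy lies far outside the range of the shuffled accuracies, the staging carries real information about the PSG stages.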
Specific tests and current promises
After an initial campaign that marketed NeuroOn as a medical-grade device10, allowing tracking of multiple sleep scores and helping with polyphasic sleep24, the company has backed off from its promises11, replacing them with something much more manageable. Maybe it doesn't need overall accuracy to deliver them?
Most users may be interested in these two questions:
- Will NeuroOn wake me up when it has an opportunity to?
- Will NeuroOn not wake me up when it shouldn't?
Based on Tassi, P., & Muzet, A. (2000). Sleep inertia. Sleep Medicine Reviews, 4(4), 341–353.1 we can select WAKE, N1 and N2 sleep stages as those allowing a wake up call, and N3 and REM as those during which NeuroOn's user should not be disturbed.
We aggregated the results of these sleep stages, allowing NeuroOn to misidentify stages within families - WAKE/N1/N2 and N3/REM, since the errors shouldn't be detectable by an end-user.
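The aggregation into two families can be sketched as below. The label spellings, the `WAKE_OK` set and the function names are assumptions for illustration; only the grouping itself (WAKE/N1/N2 versus N3/REM) comes from the text.

```python
import numpy as np

# Stages during which a wake up call is considered acceptable.
WAKE_OK = {"wake", "N1", "N2"}


def to_family(stages):
    """True where the stage allows waking, False for N3 and REM."""
    return np.array([stage in WAKE_OK for stage in stages])


def family_agreement(psg, neuroon):
    """Return (agreement on wake-ok epochs, agreement on do-not-wake epochs).

    Both values are conditioned on the PSG family, which corresponds to
    the diagonal of a row-normalized 2x2 confusion matrix.
    """
    psg_ok = to_family(psg)
    neuroon_ok = to_family(neuroon)
    agree_wake_ok = np.mean(neuroon_ok[psg_ok])
    agree_do_not_wake = np.mean(~neuroon_ok[~psg_ok])
    return agree_wake_ok, agree_do_not_wake
```

Subtracting each value from 1 gives the share of epochs where NeuroOn put the stage in the wrong family.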
Normalized and aggregated results from the first night:
There are two indicators of specific significance:
- There is a 26.2% chance that NeuroOn will not detect a stage which allows an easy wake up. This isn't harmful to an end-user, since only consecutive misidentification of several stages might cause the alarm clock to go off too late.
- There is, however, a 31.6% chance of misidentifying a stage which doesn't allow an easy wake up in a healthy person. This may be the single disqualifying feature of NeuroOn. If a person is woken up in N3 or REM (which NeuroOn interprets as N1, N2 or WAKE), they will suffer from sleep inertia and grogginess.
This means that using NeuroOn's alarm clock - in perfect conditions, keeping it well pressed against one's forehead and while not having any sleep disorders - may result in a very unpleasant awakening nearly 1/3 of the time.
Since lucid dream induction25 is quite complex and still discussed by many researchers, we don't feel that discussing its application in NeuroOn's app26 is within the scope of this analysis. What we can assess is NeuroOn's ability to detect REM sleep - roughly 72.3% of PSG-detected REM stages are detected as REM by NeuroOn (mean from both nights).
Beyond sleep staging - NeuroOn's signal quality
Our initial goal was not only to analyze NeuroOn's staging quality, but also its signal, gathered by just 3 dry electrodes on the forehead. Is it possible to create a real-time sleep staging algorithm based on that signal?
Answering that question fully would require us to build a perfect and much more advanced version of NeuroOn, de-facto taking on IntelClinic's role in developing the device. We could conduct a much simpler analysis instead, looking for well known EEG indicators within the signal.
EEG waves, defined as respective frequency bands of the EEG signal, differ between sleep stages and are one of the most important indicators used in polysomnography. Slow waves up to roughly 3Hz are called Delta Waves and are used for discriminating deep non-REM sleep phases27. It is reasonable to assume that NeuroOn's staging algorithms use these indicators to create their own hypnograms.
With that knowledge we resorted to spectral analysis, studying delta power in NeuroOn's single-channel signal. Full analysis with more details, code and signal samples can be found in our Jupyter Notebook28.
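A simple way to measure delta power in a single-channel epoch is to look at the share of spectral power that falls into the delta band. The sketch below uses a plain windowed FFT; the 0.5 Hz lower bound, the function name `band_power` and its parameters are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np


def band_power(epoch, fs, low_hz=0.5, high_hz=3.0):
    """Fraction of an epoch's spectral power between low_hz and high_hz.

    `epoch` is a 1-D array of samples and `fs` its sampling rate in Hz.
    The default band limits are an assumed delta band.
    """
    epoch = np.asarray(epoch, dtype=float)
    # Remove the DC offset and taper the edges to limit spectral leakage.
    windowed = (epoch - epoch.mean()) * np.hanning(len(epoch))

    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(epoch), d=1.0 / fs)

    total = spectrum.sum()
    if total == 0:
        return 0.0  # flat signal, no power anywhere
    in_band = (freqs >= low_hz) & (freqs <= high_hz)
    return spectrum[in_band].sum() / total
```

Computing this value for every epoch of a night and grouping the results by PSG stage gives the kind of per-stage delta power distribution discussed here.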
Data from the first night:
We examined how NeuroOn-recorded delta wave amplitude differs between (PSG-defined) N2 and N3 sleep stages. The results indicate that it is possible to differentiate between those two phases with approximately 75% accuracy based on a box-plot distribution.
The delta band powers are similarly distinct in both NeuroOn and PSG, which may imply that the gathered signal can be used for advanced and precise sleep staging - maybe even more precise than NeuroOn's current one27. This invalidates the initial assumption I approached NeuroOn with - that it's impossible to gather a signal of good enough quality to reliably discern sleep stages from just 3 electrodes.
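To give an intuition for how two overlapping distributions translate into a separation accuracy, here is a toy sketch that splits N2 from N3 epochs with a single threshold placed halfway between the two medians. This is an illustrative assumption and not necessarily how the 75% figure was obtained; `threshold_accuracy` is a hypothetical name, and the sketch relies on N3 having the higher delta power.

```python
import numpy as np


def threshold_accuracy(delta_n2, delta_n3):
    """Split N2 from N3 epochs by one delta power threshold.

    Returns (threshold, accuracy). Epochs below the threshold are
    labeled N2, all others N3.
    """
    delta_n2 = np.asarray(delta_n2, dtype=float)
    delta_n3 = np.asarray(delta_n3, dtype=float)

    # Halfway between the medians, the center lines of a box plot.
    threshold = (np.median(delta_n2) + np.median(delta_n3)) / 2.0

    correct = np.sum(delta_n2 < threshold) + np.sum(delta_n3 >= threshold)
    return threshold, correct / (len(delta_n2) + len(delta_n3))
```

The more the two boxes overlap, the closer this accuracy drops towards 50%.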
Contrast with current claims
The initial Kickstarter campaign30 was full of unfounded claims, neuro-buzzwords29 and outright misinformation31. The team even promoted NeuroOn with "Wanna sleep 2 hours/day ASK ME :)" t-shirts32. After years in development and Facebook battles with skeptics IntelClinic was forced to back off from many of them.
NeuroOn's Final Press Release reads:
Inteliclinic is a Polish startup whose Neuroon crowdfunding campaign on Kickstarter was a spectacular success. Initially, the project aimed at creating a device that would analyze users’ sleeping patterns and provide tips to people who want to sleep polyphasically (take a few shorter naps instead of a single nightly sleep episode.) After over a year of consultations with leading authorities on sleep medicine, including Christopher Drake, PhD, Director of Sleep Research, Henry Ford Hospital and former Chairman of the Board, National Sleep Foundation, Project Neuroon grew beyond sleep monitoring to include pulse tracking and light therapy. The core functionality of the device is nearly medical-grade sleep measurements and helping people who work shifts, suffer jet lags or have problems falling asleep. The device does not support polyphasic sleep.
I wouldn't say it was growing beyond, but rather realizing that the previous promises made by IntelClinic were completely unfounded in contemporary scientific knowledge. Without spending significant amounts of money on research33 the company wasn't able to deliver, so the startup pivoted and changed scope.
After backing off from "medical grade" device, "near-medical grade" might mean virtually anything.
Intelclinic did register two patents: "System for polyphasic sleep management, method of its operation, device for sleep analysis, method of current sleep phase classification and use of the system and the device in polyphasic sleep management" 34 and "System, apparatus and method for treating sleep disorder symptoms" 35. While both of them contain a technical overview of the mask's workings, I am not aware of any whitepapers showcasing the feasibility of the respective functions33.
Evaluating the effectiveness of its light therapy or jet lag adjustment would require a separate experiment, which should be conducted (and released together with all the data) by IntelClinic itself in order to prove that they work.
Where I was wrong, where I was right
Over two years ago36 I wrote:
The NeuroOn sleep mask cannot work exactly as advertised - it cannot utilize a proper EEG signal. While it can detect a REM phase in sleep very roughly, it's very far from reliable sleep analysis. The majority of the population isn't able to achieve polyphasic sleep, since their brains aren't capable of that. A similar thing goes with lucid dreaming. NeuroOn at its best would be not too useful of a gadget to be put away after several uses.
WRONG: Looking at the spectral analysis of NeuroOn's signal above it seems that I have been wrong in saying that it's impossible to reliably discern sleep stages from 3 electrodes located on the forehead. It looks like it should be possible, but requires much more research than IntelClinic has put in NeuroOn.
RIGHT: It holds true that NeuroOn can conduct only a rough (not medical-grade) analysis of the sleep phases.
RIGHT: Polyphasic Sleep isn't supported by NeuroOn, as confirmed by the IntelClinic.
?: We don't know much about lucid dreaming yet, but the research required to change that might be quite costly.
Addressing possible replies
NeuroOn staging software you used was several months old. Here's a new version, and look, now the accuracy is better than 95%!
I agree that the software I was sent by IntelClinic was several months old at the time of analysis, but from my understanding it was the version which eventually landed in the consumer units.
Providing any version issued after we published our initial experiment description and sources will not give us any significant results, since it could have been tweaked to match exactly our signals.
The only way to prove that NeuroOn's algorithms have gotten better is to conduct a new experiment at a third party's lab (like the Sleep Disorders Center at the Institute of Psychiatry and Neurology in Warsaw37) and release the hypnograms immediately afterwards.
It'd be a good idea to test it on a patient suffering from sleep disorders, or anyone other than a 25-year-old caucasian male. Preferably several people.
Our code is completely open and should work with newly acquired signals.
Alarm clock isn't the main functionality now, we're using only REM anyway!
Addressing lucid dreaming and light therapy is beyond the scope of this experiment.
Startup marketing vs research-based development
(this is my personal opinion)
Winding up several months of research, tweaking the code, trying to make sense of the data, wondering if every method is statistically sound - I can say I'm happy I was able to do it. No one paid me - quite the opposite, I rented the hospital lab and PSG with my own money - yet still, it was worth it.
I'd like everyone to form their own opinion on NeuroOn by reading this pretty detailed analysis. If you don't trust it, feel free to re-check all my computations in the Jupyter Notebook4.
Personally I consider NeuroOn to be a failed project, not researched enough from the start, running mostly on daring marketing promising the impossible.
Real innovation requires research. It's tedious, takes much more time than the startup community promises. But it's honest - and it's the only way that yields any results.
I view startups similar to Intelclinic as deeply harmful for everyone - customers don't get what they pay for, investors are being misguided about what they support, researchers see their work being abused for the sake of a marketing campaign, and finally the society is being manipulated to see some kind of progress and hope in all that.
At the same time as NeuroOn, another neuro-device was put on Kickstarter - OpenBCI38. It's a small open hardware EEG amplifier which allows conducting experiments much more cheaply than with university equipment. It didn't promise to make everyone's life better and it wasn't marketed as well as IntelClinic's product. Despite earning much less money, OpenBCI delivered a device fulfilling all their promises.
When it comes to real progress and innovation, I'm much more inclined to believe researchers, hackers and makers showing open whitepapers and working prototypes first.
In my original blog post I wrongly assumed that NeuroOn's sleep stage detection accuracy may be compared to actigraphy sleep/non-sleep accuracy, which is a much simpler indicator of sleep state. Its current limits are far below any device using EEG signals39. These paragraphs are now removed.
I also clarified that sleep inertia is perceptible right after awakening and doesn't necessarily last all day (even though it might affect a person's mood).
Added IntelClinic's patents to footnotes.
Previous version of this blogpost may be found on my Github40.
Sleep Inertia on Wikipedia or from a publication: Sleep inertia is a transitional state of lowered arousal occurring immediately after awakening from sleep and producing a temporary decrement in subsequent performance. Many factors are involved in the characteristics of sleep inertia. The duration of prior sleep can influence the severity of subsequent sleep inertia. Although most studies have focused on sleep inertia after short naps, its effects can be shown after a normal 8-h sleep period. One of the most critical factors is the sleep stage prior to awakening. Abrupt awakening during a slow wave sleep (SWS) episode produces more sleep inertia than awakening in stage 1 or 2, REM sleep being intermediate. - Tassi, P., & Muzet, A. (2000). Sleep inertia. Sleep Medicine Reviews, 4(4), 341–353. ↩↩↩
Polysomnography is the most accurate scientific sleep study available without giving a person a brain implant - more on Wikipedia ↩↩
Jupyter Notebook and all the signal files are available on Github ↩↩↩
"Open-science: validation of neuro-startups" scientific poster is available on my blog ↩
alxd (at) alxd (dot) org ↩
"The application will also allow its users to access and setup the many features we have introduced so far, such as advanced sleep analytics, heart monitoring, intelligent alarm clock, jet lag, and alertness management, all with medical-grade accuracy." source and to a lesser extent source, emphasis mine ↩↩
The core functionality of the device is nearly medical-grade sleep measurements and helping people who work shifts, suffer jet lags or have problems falling asleep. The device does not support polyphasic sleep. source ↩↩
Full examination of my sleep patterns by a certified medical expert is available in Polish from the analysis blogpost3. ↩
It's a running joke in neuroscience, since a lot of experimental subjects are just campus students - usually caucasian males, which leads to ignoring biodiversity. Sex Bias in Neuroscience and Biomedical Research, Annaliese K. Beery and Irving Zucker and Androcentrism on Wikipedia ↩
The electrodes were pressed to my head so firmly they left visible marks the next day (photo) ↩
From Wikipedia: Reverse engineering, also called back engineering, is the processes of extracting knowledge or design information from anything man-made and re-producing it or re-producing anything based on the extracted information. The process often involves disassembling something (a mechanical device, electronic component, computer program, or biological, chemical, or organic matter) and analyzing its components and workings in detail. ↩
Cross-correlation computed here with various correlation tests available here ↩
Hypnogram comparison description, code and generated graphs are available on Github ↩
Confusion matrix is defined on Wikipedia: "Each column of the matrix represents the instances in a predicted class while each row represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabelling one as another)." The sleep stages are normalized in rows, displaying probability of detecting respective sleep phases by NeuroOn (columns) given that the polysomnograph has detected a given (row) one. ↩
Joint probability distribution on Wikipedia: given at least two random variables X, Y, ..., that are defined on a probability space, the joint probability distribution for X, Y, ... is a probability distribution that gives the probability that each of X, Y, ... falls in any particular range or discrete set of values specified for that variable. ↩
Accuracy is a measure of (True Positives + True Negatives) / (Positives + Negatives), as defined in Evaluation of binary classifiers ↩
As Wikipedia defines it, bootstrapping can refer to any test or metric that relies on random sampling with replacement. Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates. This technique allows estimation of the sampling distribution of almost any statistic using random sampling methods. Generally, it falls in the broader class of resampling methods. ↩
Initial NeuroOn campaign promised customers polyphasic sleep functionalities on Kickstarter ↩
In lucid dreams the dreamer is aware of dreaming and often able to influence the ongoing dream content. Lucid dreaming is a learnable skill and a variety of techniques is suggested for lucid dreaming induction. This systematic review evaluated the evidence for the effectiveness of induction techniques. A comprehensive literature search was carried out in biomedical databases and specific resources. Thirty-five studies were included in the analysis (11 sleep laboratory and 24 field studies), of which 26 employed cognitive techniques, 11 external stimulation and one drug application. The methodological quality of the included studies was relatively low. None of the induction techniques were verified to induce lucid dreams reliably and consistently, although some of them look promising. On the basis of the reviewed studies, a taxonomy of lucid dream induction methods is presented. Several methodological issues are discussed and further directions for future studies are proposed. - Induction of lucid dreams: A systematic review of evidence - Tadas Stumbrys, Daniel Erlacher, Melanie Schädlich, Michael Schredl ↩
Delta Power in EEG - Aeschbach, D., & Borbely, A. a. (1993). All-night dynamics of the human sleep EEG. Journal of Sleep Research., Mukai, J., Uchida, S., Miyazaki, S., Nishihara, K., & Honda, Y. (2003). Spectral analysis of all-night human sleep EEG in narcoleptic patients and normal subjects. J Sleep Res, 12(1), 63–71. ↩↩
Spectral Analysis Jupyter Notebook can be found on our Github ↩
Widespread usage of non-scientific neurological claims by many startups is a well known problem, well described by NeuroCritic and other science journalists. ↩
NeuroOn Kickstarter campaign made many unfounded claims (link) ↩
Apparently Da Vinci, Tesla, Churchill and even Napoleon used polyphasic sleep to rest. It allowed them to fully regenerate, reducing sleep time to 6.5 hours or sometimes just 2 hours. And those guys got things done! NeuroOn's Kickstarter vs Polyphasic Sleep: Facts and Myths by dr Piotr Wozniak ↩
T-Shirts worn by IntelClinic team at WebSummit Dublin 2013 (photo) ↩
If you are aware of any whitepapers or peer-reviewed scientific papers by IntelClinic regarding NeuroOn project, please contact me, and I will officially apologize for my previous criticism. IntelClinic mentions their patents on their homepage and in multiple interviews with startup portals. ↩↩
System for polyphasic sleep management, method of its operation, device for sleep analysis, method of current sleep phase classification and use of the system and the device in polyphasic sleep management on Google Patents, filed 2015-01-26 ↩
System, apparatus and method for treating sleep disorder symptoms on Google Patents, filed 2015-01-05 ↩
Original blogpost: NeuroOn: The Emperor is Naked! (only in Polish) on my blog ↩
Sleep Disorders Center at the Institute of Psychiatry and Neurology in Warsaw homepage ↩
The original OpenBCI Kickstarter campaign from 2013 - link and their homepage ↩
Purpose of the investigation was to evaluate the differences of movement density during the sleep stages and waking. 22 diurnally active, healthy, male volunteers of mean age 30.7 (+/-Standard deviation +/- 3.3) years and a Body-Mass-Index 23.6 +/- 3.3 kg/m2 participated in the study. All subjects were recorded in the sleep lab via cardiorespiratory polysomnography and wrist actigraphy (Ambulatory Monitoring, Ardsley, USA) worn on the non-dominant hand, for two consecutive nights. The activity data, consisting of the number of zero crossings (NZC) were recorded in 1-minute periods. Sleep stages were scored visually according to standard criteria. EEG- and actigraphy data were converted to the same data format (European Feature Files). Attaching the actimetry data to the sleep stages was calculated mean NZC for every sleep stage and Wake. In spite of high differences in total individual NZC we observed that most NZC occurred during Wake. NREM 1 movement density was significantly higher in 19 recordings (86%) than in any other sleep stage. In 18 cases (82%) lowest movement density was found in NREM 3/4 with significant difference to all other sleep stages. Within 50% of the recordings were found decreasing activity in the following sequence of stages: Wake > NREM 1 > REM > NREM 2 > NREM 3/4 However, in all other cases there was a varying pattern of activity. Conclusion: Although there is some correlation between motor activity and sleep stages, the predictive value of actimetry data analysis in the assessment of sleep structure appeared to be limited mainly by individual movement density, especially during REM and NREM 2. Actigraphy: methodological limits for evaluation of sleep stages and sleep structure of healthy probands - Conradt R, Brandenburg U, Ploch T, Peter JH. ↩
All files used to create this webpage - including raw blogpost texts in Markdown - are stored on my Github ↩