Monday, December 5, 2016

Supervised vs Unsupervised Data Science

After I finished my Ph.D., I had the opportunity to work for a startup company located in Reno, NV, called LoadIQ.  My initial postdoctoral work with them was in energy and non-intrusive load monitoring (NILM).  Of course, this field was unfamiliar to me at the time -- I was a software engineer with little to no exposure to anything electrical.  Luckily, I was able to dive into the field and found that I absorbed the material very rapidly.  I thought then what I had always thought -- I was first and foremost an engineer, talented at finding out how things work.  Writing code for the energy domain was no different: I had to build the necessary understanding, and then I was able to interface with the domain through my code.

Granted, there are still many things about energy that I do not know.  My experience with the field is very hands-on, and I often find myself stumbling over new and old buzzwords.  I have an understanding, but it is my own, with my own terminology and syntax that relate to my own coding style and the systems I engineered during my time at LoadIQ.  One way to overcome this is to study as much as I can by reading research papers and looking at other companies in the energy software domain - seeing how others talk about energy and then reconciling their words with my own.

I later wrote a paper for the NILM Workshop back in March of 2016.  My initial drafts certainly lacked the proper phrasing for many of my concepts, but together with my supervisor, I was able to polish them into a good submission that was then accepted by the workshop's peer-review board.  I had written papers for journals and conferences before -- this one was quite a bit less hassle, as it was accepted on its first review, and I am proud of that accomplishment.

One other way to become more engrossed in the energy domain is to work with other companies.  Indeed, just this week, I have the opportunity to interview with another company.  While researching the company, I saw right away a dozen buzzwords in their phraseology that were strange and new to me -- DERs, DROMS, VPP, and more.  I realized then that working with other companies is the perfect way to expand my knowledge and experience in the energy domain.  I am, and always will be, a software engineer first, proficient at writing code, especially Python.  But new things do excite me, and I take pleasure in being able to work with them and apply my engineering expertise to them.

One thing has always worried me: I was given the title of Chief Data Scientist.  I was the only data scientist at LoadIQ at the time, but if we were to hire and expand, I would certainly be in a leadership and management role over others.  I viewed my title as one in which I was given control and leadership of how we pursued our everlasting quest for further algorithmic improvements and quality assurance.  I'm proud of that, but what worries me is what others think when they see the phrase "data scientist", because as I hear it, there seem to be many different styles of data science.  Was I actually a data scientist, or was I just a software-systems engineer?

However, the more I think about it, the less confused I feel.  I believe I really am a data scientist, just not the kind most people might think of when they hear that job title.  To be specific: I believe there are really only two kinds of data science, supervised and unsupervised.  The more common "supervised" data science involves studying training data and building models that can successfully predict missing class labels (of data in the future or past).  A clear example is estimating the amount of energy a building might use for a particular hour of a day in a year.  By studying past energy usage data for that building, we can build a very solid model that takes advantage of features such as temperature, day of week, hour of the day, etc., and predicts, with less than a 5% error margin, what the energy usage might be.
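To make the supervised idea concrete, here is a minimal sketch in Python.  It is only an illustration under stated assumptions: the "model" is simply the average past usage for each (day of week, hour) pair, temperature is left out to keep it short, and the function names and the tiny data set are made up for this example rather than taken from any real building.

```python
from collections import defaultdict
from statistics import mean


def fit_hourly_profile(history):
    """Learn the mean usage for each (day_of_week, hour) pair.

    history: iterable of (day_of_week, hour, kwh) tuples (hypothetical format).
    """
    buckets = defaultdict(list)
    for day_of_week, hour, kwh in history:
        buckets[(day_of_week, hour)].append(kwh)
    return {key: mean(values) for key, values in buckets.items()}


def predict(profile, day_of_week, hour):
    """Predict usage by looking up the learned mean for that time slot."""
    return profile[(day_of_week, hour)]


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error, as a percentage of the actual values."""
    return 100.0 * mean(abs(a - p) / a for a, p in zip(actual, predicted))


# Made-up training data: (day_of_week, hour, kWh), for illustration only.
history = [(0, 9, 10.0), (0, 9, 12.0), (0, 14, 20.0), (0, 14, 22.0)]
profile = fit_hourly_profile(history)

# Made-up "held-out" readings for the same time slots.
actual = [10.0, 20.0]
predicted = [predict(profile, 0, 9), predict(profile, 0, 14)]
error_percent = mean_absolute_percentage_error(actual, predicted)
print(f"error: {error_percent:.1f}%")
```

The key point is that the known past usage values act as the "truth" the model learns from, and the error can be measured directly against held-out readings.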

Back to LoadIQ: as a data scientist there, my primary task was a little different.  Due to the nature of the product we were serving, there was no prior training data to learn from.  Any model built to deliver our predictions was entirely devoid of the "truth" of the class labels, and we were forced to fall back on internal consistency metrics to make sure our results were sound enough to provide quality assurance.  Moreover, the models we built were not typical of data science -- we didn't use Random Forests or Neural Networks or any other mathematical model such as Linear Regression.  Instead, we had to be much more clever and devise very innovative algorithms, which may, in time, come to be well known in our field of NILM.
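As a purely hypothetical illustration of what an internal consistency metric can look like when there are no true labels, the Python sketch below checks how much of a whole-building meter reading is left unexplained by the per-appliance estimates.  This is an assumption made for the sake of example, not a description of the algorithms or metrics used at LoadIQ; the function name, the data layout, and the numbers are all invented.

```python
def unexplained_fraction(total_kw, appliance_kw):
    """Fraction of the metered total not accounted for by appliance estimates.

    total_kw: list of whole-building readings, one per time step.
    appliance_kw: dict mapping an appliance name to its estimated readings,
                  each the same length as total_kw (hypothetical layout).
    """
    residual = 0.0
    metered = 0.0
    for i, total in enumerate(total_kw):
        estimated = sum(series[i] for series in appliance_kw.values())
        residual += abs(total - estimated)
        metered += total
    # Avoid dividing by zero when the meter recorded no usage at all.
    return residual / metered if metered else 0.0


# Invented readings, for illustration only.
total = [2.0, 3.0, 5.0]
estimates = {"hvac": [1.0, 2.0, 3.0], "fridge": [1.0, 1.0, 1.0]}
score = unexplained_fraction(total, estimates)
print(f"unexplained: {score:.0%}")
```

A low score does not prove the individual appliance estimates are right, but a high score is a clear sign that something is wrong, which is the kind of signal available when the "truth" is unknown.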

As I explore my career, I remain primarily interested in "data science", whether it be unsupervised or supervised.  This, I realize, is a greatly nuanced field that crosses over into Machine Learning.  One very interesting and simple method is Decision Trees, which fit the typical model-building approach of supervised data science with training data.  If there's one thing you should understand when you study decision trees, it's what is meant by "entropy and information gain".  I end this blog post by leaving you with a superb answer on this topic, written by user 'Amro' in response to a StackOverflow question, located at http://stackoverflow.com/questions/1859554/what-is-entropy-and-information-gain.
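For readers who want something concrete before following that link, here is a minimal Python sketch of the two quantities, using the standard definitions: entropy measures how mixed a set of class labels is, and information gain is the drop in entropy after splitting the set.  The function names and the toy labels are chosen for this example only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy labels: an evenly mixed set, and a split that separates it perfectly.
parent = ["on", "on", "off", "off"]
perfect_split = [["on", "on"], ["off", "off"]]
useless_split = [["on", "off"], ["on", "off"]]

print(entropy(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A decision tree picks, at each node, the split with the highest information gain: the perfect split above removes all uncertainty about the label, while the useless split removes none.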