What They Don't Teach You About Data Science in School: Nick Benavides
Nick Benavides, BS MS&E ’19, MS CS ’20, hosted an insightful data science session, covering how different data job roles compare and the skills you need that aren’t taught in school.
- Nick has an impressive pedigree: he interned at Thumbtack as a product analyst and in a data science/machine learning role at Vectra. Since graduation, he has worked in fraud detection at NS8 and Sift, and is currently at SentiLink, a company working on identity verification for the financial services industry.
- Data roles discussed: product analyst, data scientist, ML engineer (see details below).
- What they don’t teach you in school: root cause analysis, 80/20 stats, data quality (see details below).
- The interview process for the industry can be all over the map. Sometimes it will look like a consulting interview, other times like a software engineering interview. There might be a case-study component, SQL tests, or a take-home assignment to analyze a data set.
- The best way to grow in this career is to have business impact – build something that will drive revenue or cost savings.
Details from the Discussion:
- Differences between the three data roles:
- Product analyst – responsibilities are more retrospective: analyzing behavior patterns to improve core product metrics, interpreting A/B test results to make decisions, reporting on metrics, creating dashboards, and product planning (e.g., sizing estimates). Product analysts don’t write much production code.
- Data scientist – typically holds a master’s in a relevant field. Improves the performance of production models: error analysis, prototyping, validating and retraining models, monitoring performance, evaluating new datasets, and exploring new data products. Writes code for insights and prototyping – first in notebooks, then in production. Spends most time on analysis, less on coding.
- ML engineer – bachelor’s in a related field, or a master’s in a related subject if the bachelor’s is unrelated. Focused on the engineering side: building automated pipelines for model training, evaluation, and feature extraction, plus prototyping new features or models. Bridges the gap between data science and software engineering, mostly writing production code – more time on coding, less on analysis. You’ll often see this role at deep learning companies.
- Data scientists and ML engineers share many of the same skills; what differs is how they spend their time.
- What they don’t teach you in school:
- Root cause analysis – knowing how to do this systematically got Nick promoted; most people can’t get to clean answers here. RCA answers the question “why did metric X go up/down by Y% over Z time period?” Break the metric down: how is it trending over time? E.g., sales declined 20% in the last month – you need to explain what changed. Point estimates are easy to report, but distributions give deeper insights: look at percentiles as well as raw numbers. Mix shift often drives metric changes.
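The mix-shift point above can be made concrete. A minimal sketch (not from the talk; the segment names and numbers are illustrative) of splitting a metric change into a “rate effect” (per-segment rates moved) and a “mix effect” (the volume mix moved):

```python
def decompose(before, after):
    """Each dict maps segment -> (share_of_volume, metric_rate).

    The overall metric is sum(share * rate). The period-over-period
    change splits into:
      rate effect: rates moved, holding the old mix fixed
      mix effect:  the mix moved, evaluated at the new rates
    """
    rate_effect = sum(before[s][0] * (after[s][1] - before[s][1]) for s in before)
    mix_effect = sum((after[s][0] - before[s][0]) * after[s][1] for s in before)
    return rate_effect, mix_effect

# Illustrative example: conversion rate by channel, month over month.
before = {"organic": (0.7, 0.10), "paid": (0.3, 0.04)}
after  = {"organic": (0.5, 0.10), "paid": (0.5, 0.04)}

rate_fx, mix_fx = decompose(before, after)
# Per-segment rates are unchanged here, so the entire decline is mix shift.
```

The two effects always sum to the total metric change, which makes this a quick first check before digging into individual segments.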
- 80/20 stats – there’s no single clear approach. Think statistically, but likely not with the same rigor as in an academic setting, and validate results empirically: a good-enough answer wins when things are fast moving. Techniques mentioned: queue blending with isotonic regression (like a one-variable regression), and feature distribution monitoring.
- Data quality – Nick didn’t fully appreciate this until he was working! In school, datasets are pretty manicured. In industry, especially with data from customers, a lot can go wrong: missing values, test/nonsense values, duplicate IDs, date issues (mixed formats, local time vs. UTC), gaps, varying taxonomies, etc. Main takeaway: assume nothing, and always do a quality check before jumping into analysis. External APIs bring their own issues: shifting distributions, new values your models weren’t trained on, and changes in data coverage. Many companies underestimate this area, so building monitoring can be a quick and easy win for someone new to the company to make an impact and show initiative and value.
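A minimal sketch of the “quality check before analysis” habit (field names like `id` and `created_at` are illustrative, not from the talk), covering three of the failure modes listed above: missing values, duplicate IDs, and mixed date formats:

```python
from collections import Counter
from datetime import datetime

def quality_report(rows, id_field="id", date_field="created_at"):
    """Scan a list of record dicts and flag common data-quality issues."""
    report = {"missing": Counter(), "duplicate_ids": [], "bad_dates": []}
    seen = set()
    for row in rows:
        # Count missing / placeholder values per field.
        for key, value in row.items():
            if value in (None, "", "N/A"):
                report["missing"][key] += 1
        # Flag repeated IDs.
        rid = row.get(id_field)
        if rid in seen:
            report["duplicate_ids"].append(rid)
        seen.add(rid)
        # Enforce one canonical date format.
        raw = row.get(date_field)
        if raw:
            try:
                datetime.strptime(raw, "%Y-%m-%d")
            except (ValueError, TypeError):
                report["bad_dates"].append(raw)
    return report

rows = [
    {"id": 1, "email": "a@x.com", "created_at": "2020-01-05"},
    {"id": 1, "email": "", "created_at": "01/06/2020"},  # dup ID, missing email, mixed date format
    {"id": 2, "email": "b@x.com", "created_at": "2020-01-07"},
]
report = quality_report(rows)
```

Running checks like this on every new customer dataset, and alerting when the counts move, is exactly the kind of low-effort monitoring win described above.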