I just completed the second of two finals to end the first semester of Berkeley's MIDS program, a new data science program created by the School of Information at UC Berkeley. It was disappointingly easy and expensive ($13k per semester for 5 semesters, for an online program). The level of comprehension required to do well was about that of a Coursera course. And this is not to say that Coursera is easy; it isn't if you really dig in. There is a higher level of accountability that comes with a structured program, but the incremental learning that came with the structure didn't make the degree worth it. I'm dropping the program today.

A Few (Huge) Caveats

  1. Data Science is a new field in that it combines old fields in a new way. All the content is out there; figuring out how to put it together is the challenge.
  2. This is a new program for Berkeley, so they're still figuring it out.
  3. I was a part-time student. It's difficult to enjoy what you're doing without fully dedicating yourself to it. I've heard the same to be true of other part-time programs like Executive MBAs.
  4. For the past five years, I have been in academic settings and tech companies where there was plenty of technical wisdom, mentorship, and stretch projects.
  5. Both tech companies I worked for were in growth phases at the time and thus didn't sponsor formal degree programs. The economics would've been different if I were at, say, an Amazon, Boeing, or Microsoft.
  6. I have a great idea of what I need to learn to be where I want to be in 10 years. Doing this was less a matter of exploring than of learning what I need in the fastest way possible. I still thoroughly believe in the value of most undergraduate and graduate programs. This one just wasn't right for me.

Why I Joined the Program
The traditionalist in me thought a degree program was the way to go in developing and signaling a technical skill set to potential investors and employers. Taking an honest look at the people I've worked with in the tech space, it's clear this signal matters less for high-functioning individuals in tech than it does in other fields.

What I'm Doing Instead
When we complete a project or pursuit, we're often at a loss for what to do with our newfound free time. I chose to leave the program because I knew my free time could be better used if managed at my own pace. At work, I've taken on two additional projects. At home, I'm taking a refresher on Linear Algebra through Khan Academy while spending two months on Logistic Regression through Coursera. By self-pacing, I'm getting through about 1.5x the academic content I previously was.

Recommendations for Data People
Sign up for some online classes, get a pile of books, schedule two hours into every weeknight, and sit at an empty desk working through them. Don't leave the desk. Here are the resources to look into, in order of importance, whether you're learning this for the first time or as a refresher:

  • SQL: If you can't get data, you can't analyze data.  Whether you retrieve data from a SQL database or a Hadoop cluster with a SQL-language layer on top of it, this is where you start.  http://sqlschool.modeanalytics.com/ is a great interactive learning interface.  O'Reilly's SQL Cookbook is a masterpiece that traverses all levels of SQL proficiency.  (A minimal sketch of the core query pattern follows this list.)
  • Full-Stack Data Science: Coursera offers a full-stack online curriculum on a continuous basis for a reasonable price.  This DOES NOT teach you SQL.  If you're in SF or NYC, you can attend General Assembly's pricier in-person full-stack curriculum.  Either gives you a cursory introduction to data storage, retrieval, prep, light analysis, and deeper predictive and inferential analysis.
  • Python: Codecademy or Udemy will teach you the basics.  Python plays two roles in the skill stack: 1) conducting ad-hoc statistical analysis as you would with R, and 2) doing everything else.  Python is important for the "everything else."  You might use it to get data from APIs, scrape, write ETL jobs, refresh data in your warehouse, or retrain models (see the second sketch after this list).  This is the piece of the skill stack that moves you from being a Static Data Scientist (one who works with data in a manual fashion) to a Live Data Scientist (one who has automated many of the processes contributing to data science output, loosely defined).
  • Basic Statistics: Khan Academy Probability and Statistics.*
  • Linear Algebra and Multivariable Calculus: Go to a local college or Khan Academy to brush up on Multivariable Calculus and Linear Algebra.  These curricula have largely been the same for the past five decades.
  • MapReduce/Hadoop: Focus on this last.** There are so many technologies that enable SQL-like interfacing with Hadoop that knowing how to write a MapReduce job is, for the most part, unnecessary (see the last sketch after this list). Building real MapReduce pipelines is a behemoth of a task that might be the work of an early-stage-startup Data Scientist, but shouldn't be if you have a solid BI infrastructure team. This is why companies hire the rockstars we know as backend and data engineers.  Side note: if you ever meet one and aren't sure what their company does, thank them for their service to our country, regardless.
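
To make the SQL bullet concrete, here's a minimal sketch of the bread-and-butter pattern (filter, group, aggregate) using Python's built-in sqlite3 module so it runs anywhere; the events table and its columns are hypothetical stand-ins for whatever lives in your warehouse.

    import sqlite3

    # A throwaway in-memory database standing in for your warehouse.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (user_id INTEGER, event_type TEXT, value REAL);
        INSERT INTO events VALUES
            (1, 'purchase', 25.0),
            (1, 'purchase', 10.0),
            (2, 'purchase', 40.0),
            (2, 'refund',  -40.0);
    """)

    # The query shape you'll write hundreds of times: filter, group, aggregate.
    query = """
        SELECT user_id,
               COUNT(*)   AS purchases,
               SUM(value) AS revenue
        FROM events
        WHERE event_type = 'purchase'
        GROUP BY user_id
        ORDER BY revenue DESC;
    """
    for row in conn.execute(query):
        print(row)  # (2, 1, 40.0) then (1, 2, 35.0)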
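
And here's the kind of "everything else" work the Python bullet describes: a minimal sketch of a scheduled ETL job that pulls JSON from an API and refreshes a warehouse table. The URL, table name, and schema are hypothetical; swap in your own stack.

    import json
    import sqlite3
    from urllib.request import urlopen

    API_URL = "https://api.example.com/v1/signups?since=2015-05-01"  # hypothetical endpoint

    def refresh_signups(db_path="warehouse.db"):
        """One small ETL job: extract from an API, load into SQL."""
        with urlopen(API_URL) as resp:   # extract
            records = json.load(resp)    # assumed: a list of {"id", "email", "ts"} dicts
        conn = sqlite3.connect(db_path)  # load
        conn.execute(
            "CREATE TABLE IF NOT EXISTS signups (id INTEGER PRIMARY KEY, email TEXT, ts TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO signups VALUES (:id, :email, :ts)",
            records,
        )
        conn.commit()
        conn.close()

Schedule a handful of jobs like this under cron and you've crossed from Static to Live Data Scientist.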
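
Finally, to show why raw MapReduce sits last: the classic word count, written Hadoop-Streaming style below, takes a page of Python, while a Hive layer reduces the whole thing to SELECT word, COUNT(*) FROM words GROUP BY word. A sketch of the programming model, not a real pipeline:

    import sys
    from itertools import groupby

    def mapper(lines):
        # Emit a (word, 1) pair for every word in the input.
        for line in lines:
            for word in line.split():
                yield word, 1

    def reducer(pairs):
        # Hadoop would sort and shuffle between the phases; here we sort in memory.
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        for word, n in reducer(mapper(sys.stdin)):
            print(word, n, sep="\t")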

Cleaning: Plan to spend most of your time cleaning and transforming data in these languages/technologies.  The analysis is the fast and fun part.
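
A sketch of where those hours actually go, in pandas; the file and column names are hypothetical, but the steps (dedupe, fix types, handle nulls, derive features) repeat on every dataset:

    import pandas as pd

    df = pd.read_csv("signups.csv")                       # hypothetical export

    df = df.drop_duplicates(subset="id")                  # dedupe
    df["ts"] = pd.to_datetime(df["ts"], errors="coerce")  # fix types
    df = df.dropna(subset=["ts", "email"])                # handle nulls
    df["domain"] = df["email"].str.split("@").str[1]      # derive a feature

    df.to_csv("signups_clean.csv", index=False)           # now the fast, fun part: analysis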

Footnotes (5/25/2015)
* This unit in Khan Academy, specifically, is negligible. I don't, however, believe that Probability Theory has no place in Stats or ML.  Quite the opposite: you absolutely have to understand probability theory.  I just don't think that combinatorics (guessing the probability of red balls in an urn, cards in a deck, or the outcome of a die roll) is an essential step to understanding general probability theory.
** Unless you plan to be the sole Data Scientist at a pre-Series B company or are running your own team, I would not make this a point of focus. I say this because, if you need to get data, every major tech company I've talked to implements Hive or Pig to abstract away MapReduce. In the earlier days at Jawbone, we did have to write our own MapReduce ETL, but only before we hired a squad of Data Engineers to handle the heavy lifting of storing UP user behavior and weblogs. The same has been largely true at Optimizely: if I wanted to get my hands dirty with a project writing MapReduce jobs, I could, but my time is more efficiently spent doing my job rather than attempting to do someone else's.