The course runs from Sept 11 to Dec 8, 2017.
This 12-week course is the right blend of Business Analysis, Machine Learning, Data Engineering, and Software Development to build a Data Product. This course is well suited for:
- Developers who want to transition to a new role of a Data Scientist
- Entrepreneurs who want to launch new products covering IoT and analytics
- PhD students who want to transition to the business world
If it’s been a long time since you used any linear algebra, this is a good time for a refresher. Here are the relevant concepts on Metacademy. Each one gives a number of pointers, but the Khan Academy links are especially useful since they have auto-graded exercises you can use to check your understanding.
- dot product
- linear systems as matrices
- matrix multiplication
- matrix inverse
- matrix transpose
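If you'd rather check your understanding in code than on paper, the same concepts map directly onto NumPy (the numbers below are made up purely for illustration):

```python
import numpy as np

# Dot product of two vectors
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
print(u @ v)  # 1*4 + 2*5 + 3*6 = 32.0

# The linear system  2x + y = 5,  x + 3y = 10  written as Ax = b
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])
x = np.linalg.solve(A, b)  # solution: x = 1, y = 3

# Matrix multiplication, inverse, and transpose
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))  # True: A times its inverse is I
print(A.T)                                # transpose swaps rows and columns
```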
If you haven’t taken a programming class, you may need to spend some time learning the basics. You should watch Lectures 2, 3, 4, and 6 of MIT 6.001 on EdX. (Lectures 7 and 11 are also helpful.)
If you have programming experience but not in Python, read Learn X in Y Minutes for a concise summary of the language. You can probably pick up Python quickly if you are familiar with another general-purpose language (C, Java, Matlab, etc.). Read this tutorial on NumPy, the library we’ll use for array manipulation and linear algebra.
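As a taste of the array manipulation that NumPy tutorial covers, here is a minimal sketch of slicing, boolean masking, and broadcasting (the array contents are arbitrary):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # 3x4 array holding 0..11

# Slicing: rows 0-1, columns 1 onward
print(a[:2, 1:])        # [[1 2 3] [5 6 7]]

# Boolean masking: select only the even entries
print(a[a % 2 == 0])    # [ 0  2  4  6  8 10]

# Broadcasting: subtract each column's mean in one vectorized step
centered = a - a.mean(axis=0)
print(centered.mean(axis=0))  # [0. 0. 0. 0.]
```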
12-WEEK COURSE OUTLINE
- WEEK #1 - EXPLORATORY DATA ANALYSIS
- GIT/GITHUB, Database Programming
- Python Language Review, Regular Expressions
- Pandas - Descriptive Statistics, Split-Apply-Combine Techniques, Slicing and Dicing
- Plotting with Seaborn
- Web scraping
- Pipelines - transfer notebook code to small scripts
- WEEK #2 - MODELING FOR INFERENCE
- Frequentist Statistics: Hypothesis Testing and t-tests
- Bayesian Modeling: Probability Distributions, Conditional Probability Tables
- Online Experiments - A/B Testing & Bayesian A/B Testing, Multi-armed Bandits
- WEEK #3 - REGRESSION AND REGULARIZATION
- K-Nearest Neighbors, Generalized Linear Model techniques
- Regularization - Lasso & Ridge, Stochastic Gradient Descent
- Model Evaluation
- Bias-Variance Tradeoff, ROC/AUC Curves, Confusion Matrix
- Pre-processing: Standardization and Normalization, Data Imputation
- WEEK #4 - NATURAL LANGUAGE PROCESSING
- Decision Trees and Ensembling with Boosting & Bagging methods
- Cross-validation and Hyperparameter Tuning
- Bayes Theorem, Part-of-Speech Tagging
- Sentiment Analysis
- Topic Modeling, Word2Vec
- WEEK #5 - UNSUPERVISED LEARNING
- k-means, DBSCAN, Hierarchical Clustering
- Dimensionality Reduction using Principal Component Analysis (PCA)
- Feature Selection, Feature Transformation, Feature Engineering
- WEEK #6 - SCALING BIG DATA ANALYSIS
- ETL with Sqoop, Hive and Pig
- Python Functional Programming
- Spark RDD and Spark SQL
- Spark ML, and Spark Graph Frames
- Spark Clusters on AWS
- WEEK #7 - DATA VISUALIZATION AND DEEP LEARNING
- Tableau - Mapping, Calculations, and Dashboards
- d3.js, nvd3.js
- Deep Learning with TensorFlow
- WEEK #8 - ADVANCED TOPICS
- Data Products - Flask REST API, Flask Web Application Development
- Spark Streaming, Timeseries Analysis in the context of Internet of Things
- Collaborative Filtering
- Model Deployment
- Model Operations
- WEEK #9 - PROJECT CAPSTONE
- WEEK #10 - PROJECT CAPSTONE
- WEEK #11 - COURSE REVIEW & INTERVIEW PREP
- WEEK #12 - PROJECT PRESENTATIONS
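To make some of the Week 3 topics above concrete, here is a minimal scikit-learn sketch of the standardize, regularize, and evaluate workflow. The dataset is synthetic and every name and number is illustrative, not course material:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 200 samples, 5 features, a known linear signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# Standardize features, then fit a Ridge (L2-regularized) regression
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# 5-fold cross-validated R^2 gives an honest estimate of generalization
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())  # close to 1.0 on this nearly noise-free data
```

Standardizing inside the pipeline (rather than before the split) keeps the test folds from leaking into the scaler, which is exactly the kind of pre-processing hygiene Week 3 covers.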
Beyond the guest lectures, students also get to apply the concepts and algorithms using the latest industry tools. Some of the tools are listed below:
| Week | Tools |
|------|-------|
| 6 | Hortonworks HDP, PrestoDB |
| 8 | Dato, Azure Remote Monitoring |
ONLINE RESOURCES FOR GETTING STARTED WITH DATA SCIENCE AND MACHINE LEARNING
For someone trying to get started with ML, here is a resource pitched at just the right level of complexity. It introduces you to a lot of the essential Mathematics without going too deep, and is equivalent to the Applied ML course at Stanford.
Very briefly, here are the basic but widely useful ML algorithms that will help you solve a lot of problems.
- Regression: Single and Multiple Variables, Logistic Regression
- Overfitting and Underfitting issues: ‘Bias’ and ‘Variance’
- Simple clustering algorithms: K-Means
- Applying basic linear algebra: Principal Component Analysis
- Recommendation Systems and Large Scale Systems.
- Many people have gone on to become top contestants on Kaggle (a popular data science contest portal) after doing this course. These introductory algorithms can be extremely useful.
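To make the clustering and PCA items above concrete, here is a minimal scikit-learn sketch on synthetic data (the blob locations and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two well-separated Gaussian blobs in 5 dimensions
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, size=(50, 5))
blob_b = rng.normal(loc=8.0, size=(50, 5))
X = np.vstack([blob_a, blob_b])

# K-Means with k=2 should recover the two blobs
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# PCA projects the 5-D points onto their 2 main directions of variance
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (100, 2)
```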
Apart from this, we’d also recommend learning a bit about text processing: regular expressions, string functions, and language models. You might find them in the first few lectures and tutorials of this Natural Language Processing course.
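As a quick taste of that text-processing groundwork, here is a minimal sketch using Python’s built-in `re` module and string methods (the sample text is illustrative, and the email pattern is deliberately simplified, not RFC-complete):

```python
import re

text = "Contact us at info@example.com or sales@example.org by 2017-09-11."

# Pull out email addresses with a simple regular expression
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)  # ['info@example.com', 'sales@example.org']

# Basic string functions: normalize case and tokenize on whitespace
tokens = text.lower().split()
print(tokens[:3])  # ['contact', 'us', 'at']
```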
- We’d like to emphasize that a lot of the Mathematics involved doesn’t require much more than an introductory statistics course.
- For a quick and general introduction to Data Science, the course material from this Coursera course is great, and introduces R, Python, Map-Reduce and Data Visualization techniques.
- At a later stage, and for those looking for more academically challenging and abstract courses, you might be interested in the Learning from Data course from Caltech and the Probabilistic Graphical Models course from Stanford. The first course gives more theoretical insights into the foundations of Machine Learning and Statistical Learning Theory; the second is about mixing Data Structures with Statistics to build Bayesian Networks and Hidden Markov Models: powerful tools used in medical diagnostics, speech recognition engines, and the Kinect, which have proven to be significant improvements on traditional Machine Learning techniques.
Credit: Our motivation to deliver something in Dallas, home to many enterprise customers, comes from pioneers who came before us, like Zipfian Academy and Datasciencemasters.org. The difference in our course structure is that it has been blended to suit the needs of customers in the Midwest and Heartland states.