Introduction
The historical development of Machine Learning can be viewed as follows:
1980s: Business Rules
2000s: Finite State Machines
2010s: Behavior Trees
2020s: Machine Learning
The explanation is as follows:
A subset of AI, ML, can automatically learn from past experiences and improve itself without any manual intervention. ML categorizes new pieces of data by analyzing how the old ones were processed.
Probabilistic Machine Learning
Using a probabilistic approach in algorithms somehow gives better results, especially when there is insufficient data. The explanation is below. Since ML is not my field, I cannot comment, but I wanted to take a note for future use.
Ghahramani elaborates on these points in many great tutorials and in this non-specialist overview article from Nature (2015) on Probabilistic Machine Learning and Artificial Intelligence. Ghahramani's article emphasizes that probabilistic methods are crucial whenever you don't have enough data. He explains (section 7) that nonparametric Bayesian models can expand to match datasets of any size with a potentially infinite number of parameters. And he notes that many datasets that may seem enormous (millions of training examples) are in fact large collections of small datasets, where probabilistic methods remain crucial to handle the uncertainties stemming from insufficient data. A similar thesis grounds Part III of the renowned book Deep Learning, where Ian Goodfellow, Yoshua Bengio, and Aaron Courville argue that "Deep Learning Research" must become probabilistic in order to become more data efficient.

Because probabilistic models effectively "know what they don't know", they can help prevent terrible decisions based on unfounded extrapolations from insufficient data. As the questions we ask and the models we build become increasingly complex, the risks of insufficient data rise. And as the decisions we base upon our ML models become increasingly high-stake, the dangers associated with models that are confidently wrong (unable to pull back and say "hey, wait, I've never really seen inputs like this before") increase as well. Since both of those trends seem irreversible--ML growing in both popularity and importance--I expect probabilistic methods to become more and more widespread over time.

As long as our datasets remain small relative to the complexity of our questions and to the risks of giving bad answers, we should use probabilistic models that know their own limitations. The best probabilistic models have something analogous to our human capacity to recognize feelings of confusion and disorientation (registering huge or compounding uncertainties). They can effectively warn us when they are entering uncharted territory and thereby prevent us from making potentially catastrophic decisions when they are nearing or exceeding their limits.
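To make the "know what they don't know" idea concrete, here is a minimal sketch (my own illustration, not taken from the quoted article) using scikit-learn's GaussianProcessRegressor. The predictive standard deviation grows far away from the training data, so the model can flag unfamiliar inputs; the 0.5 warning threshold is an arbitrary choice for this toy example.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Tiny training set: the model only ever sees inputs in [0, 2]
X_train = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])
y_train = np.sin(X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
gp.fit(X_train, y_train)

# Query one familiar input and one far outside the training range
X_test = np.array([[1.0], [8.0]])
mean, std = gp.predict(X_test, return_std=True)

for x, m, s in zip(X_test.ravel(), mean, std):
    # Threshold is illustrative: a large std means "uncharted territory"
    flag = "WARNING: uncharted territory" if s > 0.5 else "ok"
    print(f"x={x:.1f}  prediction={m:.3f} +/- {s:.3f}  [{flag}]")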
Deep Learning
Moved to the Deep Learning post.
Machine Learning Algorithms
Machine Learning algorithms can roughly be divided into the following categories:

Machine Learning Algorithms
|
|---> supervised learning
|        |---> Naive Bayes Classifier
|        |---> Support Vector Machine
|        |---> Decision Tree
|        |---> Random Forest
|        |---> Regression
|        |---> Classification
|
|---> unsupervised learning
|        |---> Clustering
|        |---> Neural Networks
|        |---> Anomaly Detection
|
|---> reinforcement learning
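As a quick illustration of the supervised branch above, the sketch below fits four of the listed algorithms on scikit-learn's built-in iris dataset (the dataset and default settings are my own choices). All of them share the same fit/predict interface: learn from labeled examples, then predict labels for unseen data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Naive Bayes Classifier": GaussianNB(),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Every supervised learner is trained on labeled data, then evaluated
# on examples it has not seen before
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.2f}")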
How Do We Decide Whether to Use Machine Learning?
Some criteria: machine learning can be used if the rules, that is the answers, are not known, if the rules change very quickly, or if there is a very large amount of data.
Here are a few rules that you can use to classify a problem as a machine learning problem or otherwise:
- It is not easy to identify a finite set of rules based on which one can determine output related to numerical problems or classification problems.
- Although a finite set of rules can be identified, the fact that the rules change very fast makes it difficult to deploy solution changes in production.
- Whether the solution requires a large volume of data for testing/quality assurance (QA)
- Whether the solution improves with the improvement in a variety of data
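To make the first two criteria concrete, here is a toy contrast (the spam scenario and all data are hypothetical, chosen only for illustration): a hand-written business rule must be edited every time the rules change, while a model learns the mapping from labeled examples.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def spam_rule(text: str) -> bool:
    # Hand-written rule: brittle, must be updated as spam evolves
    return "free money" in text.lower()

# The ML alternative: no explicit rule, the mapping is learned from data
texts = ["free money now", "meeting at 10am", "claim free money", "lunch today?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (toy labels)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["win free money today"]))  # learned, not hand-coded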
Why Data Matters to Machine Learning
The explanation is as follows. The quality of the data is very important.
All machine learning relies on data. Generally speaking, the more data that you can provide your model, the better the model. Your ML model needs to have high-quality data, which must be related to the problem you aim to solve. So in addition to volume, data quality matters as well. Finding relationships within your data and exposing them in your model's training data can greatly improve its predictability.

Put candidly, high-quality data creates high-quality training features, producing a high-quality model that can more accurately generalize unseen data. As a result, understanding and explaining what your ML model means, and its behavior, is much easier.
The Overfitting Problem
Moved to the Overfitting Problem post.
1. Supervised Learning
Moved to the Supervised Learning post.
2. Unsupervised Learning
Moved to the Unsupervised Learning post.
3. Reinforcement Learning
Moved to the Reinforcement Learning post.
What Is Cross Validation?
The explanation is as follows:
In simplified terms you can think about it like this: suppose you are preparing pupils for an exam. You have three sets of exercises from previous exams: A, B, and C. But the exercises in the upcoming exam will be different. Nonetheless you want to test how well students will do on the unseen exam tests, when trained on similar tests from the past.

Here is how you can do it: you give exercises A and B to student one, and after he learns to solve them you test his ability on C. For another student you give exercises A and C and test how well she does on the remaining set B. And for the third student you give B and C and test on A. This way the average of the scores obtained on the unseen tests, by all students, will be the score you can reasonably assume those students will get in the upcoming exam.

However, if instead you show all your exercise sets A, B, and C to a student, then how will you be able to test how well he or she is prepared? If you give the student an exercise he or she saw in training, then the student might answer it perfectly from memory alone.

The same goes for classification methods. If you show them all the data, what data will you use to check how well "trained" they are? A simple, silly method that memorises everything would score 100% on such a testing strategy. But the same method might be completely lost when presented with an unseen data point.
Another explanation is as follows:
Cross-validation is generally used to find parameters or model structure to ensure it works well on new, unseen data. If you train your model using all the data A, B and C without doing the cross-validation, you risk overfitting and ending up with a model that performs well during training, but doesn't generalize to new data. By performing the training on two of the folds and testing its performance on the third, you can optimize for performance on the new data.
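A minimal sketch of the three-fold scheme described above (the dataset and model are my own choices for illustration): each fold plays the role of one of the exercise sets A, B, and C, and a fresh model is trained on two folds and tested on the held-out one.

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Each score is measured on a fold the model never saw during training;
# their average estimates how well the model generalizes to unseen data
print("fold scores:", scores)
print("estimated generalization accuracy:", scores.mean())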