By: Roberto Ibañez
Machine learning problems are traditionally divided into supervised and unsupervised learning. The difference between the two is that, in a supervised network, “labeled data” is given as training input, so the algorithm learns iteratively to solve a task whose correct answers are known. In an unsupervised network, by contrast, it is the algorithm itself that must find mathematical structure in the data. This approach is of great help when we work with large amounts of data, since high-quality labeled data is often difficult to obtain: the labeling process requires human labor.
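The contrast can be made concrete with a minimal sketch. It assumes a toy one-dimensional dataset invented for illustration, and the function names (`fit_supervised`, `predict`, `fit_unsupervised`) are hypothetical. The supervised version uses the labels to compute one centroid per class, while the unsupervised version (a simple k-means) only ever sees the raw values.

```python
# Toy 1-D measurements; the values and labels are made up for illustration.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["small", "small", "small", "large", "large", "large"]


def fit_supervised(xs, ys):
    """Supervised: use the known labels to compute one centroid per class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def fit_unsupervised(xs, k=2, iterations=10):
    """Unsupervised: k-means finds k group centers without seeing any label."""
    ordered = sorted(xs)
    # Deterministic start: spread the initial centers across the sorted data.
    step = max(k - 1, 1)
    centers = [ordered[i * (len(ordered) - 1) // step] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda i: abs(centers[i] - x))
            clusters[nearest].append(x)
        # Keep the previous center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers


class_centroids = fit_supervised(points, labels)
print(predict(class_centroids, 4.0))   # the label comes from the training data
print(fit_unsupervised(points))        # two centers, but no names attached
```

Note that the unsupervised result has no names for its groups: deciding what each group means is left to whoever inspects the output.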
An example of an unsupervised network is GPT-3, an artificial intelligence model trained with 295 TiB of uncompressed data, which allows it to imitate human language. Its developer, OpenAI, one of the pioneering companies in artificial intelligence, trained it on millions of texts available on the internet to produce a network that can express itself like a human being, with excellent results.
Nevertheless, this raises an almost philosophical question: in unsupervised networks, how do we know whether our network is performing the task well or poorly?
Let’s imagine that we give our network a large collection of novels. After observing some of them, the algorithm begins to group them into children’s novels, romance novels, crime novels, and so on. Suddenly, during training, the network encounters a judicial paper and classifies it as a crime novel. Someone needs to tell it that this is incorrect: we need metrics to assess whether a model works well or poorly. Under this view, the concept of an “unsupervised network” loses force, and perhaps a better term would be “self-supervised”, since the researchers “supervise” whether the assigned task is carried out correctly, although it is the model itself that adjusts its mathematical parameters to solve specific tasks.
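One simple way to put a number on the novels example is cluster purity: a reviewer labels a handful of documents, and we measure how many of them carry the majority label of the cluster they were placed in. The sketch below assumes six hypothetical documents and invented cluster assignments. Purity is only one of several possible metrics and is swapped in here for illustration.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Fraction of items that carry the majority label of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cid, []).append(label)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(true_labels)


# Hypothetical output of a clustering model over six documents.
cluster_ids = [0, 0, 0, 1, 1, 1]
# Labels assigned by a human reviewer; the last document is the judicial
# paper that the model placed in the same cluster as the crime novels.
true_labels = ["children", "children", "children", "crime", "crime", "judicial"]

score = purity(cluster_ids, true_labels)
print(f"purity = {score:.2f}")  # prints "purity = 0.83"
```

Five of the six documents match their cluster's majority label, so the misplaced judicial paper shows up as a purity below 1.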
Supervised and self-supervised Machine Learning are tools that address problems arising in all kinds of industries. In biology, the most widely explored models come from natural language processing (NLP); they allow researchers to predict biological properties from genomic or amino acid sequences. In the case of proteins, the amino acid sequence can be used to predict physicochemical properties relevant to common problems in molecular biology, such as solubility, expression of proteins in organisms, or toxicity.
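Before any model can learn from a protein, its sequence has to be turned into numbers. The sketch below shows two classic hand-crafted features, assuming sequences written in the standard one-letter amino acid code: the amino acid composition and the mean hydropathy (GRAVY) on the Kyte-Doolittle scale. The function names are hypothetical, and these are inputs a model might consume, not a trained predictor. Any link between a single GRAVY value and solubility is at best a rough heuristic.

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
# More positive values indicate more hydrophobic residues.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _validated(sequence):
    """Upper-case the sequence and reject empty or non-standard input."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KD)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return seq


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy over all residues of the sequence."""
    seq = _validated(sequence)
    return sum(KD[aa] for aa in seq) / len(seq)


def composition(sequence):
    """Fraction of each of the 20 standard amino acids, in alphabetical order."""
    seq = _validated(sequence)
    return [seq.count(aa) / len(seq) for aa in sorted(KD)]


# Hypothetical short peptide, used only to exercise the functions.
peptide = "MKTAYIAKQR"
features = [gravy(peptide)] + composition(peptide)
print(len(features))  # 21 numbers: GRAVY plus 20 composition fractions
```

A feature vector like this could then be fed to any supervised model trained on sequences with measured properties.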
AlphaFold, an artificial intelligence network that predicts protein structures from amino acid sequences, has recently sparked a revolution by predicting structures for nearly all the proteins in the UniProt database. This artificial intelligence opens a new field of exploration: we move from sequence-based problems to complex three-dimensional structures that give us more information and, therefore, characteristics that may remain unexplored today.
Therefore, Machine Learning can play a leading role in helping us understand the hidden molecular properties of proteins.
Links of information: