Lecture by Jan Drugowitsch at Harvard University. My personal takeaways from auditing the presented content.
Course overview at https://klab.tch.harvard.edu/academia/classes/BAI/bai.html
Biological and Artificial Intelligence
A naive view would hold that the brain has a state that is changed via a function of that state and its input to produce behaviour. However, the complexity of the brain makes this function intractable. It is more tractable to make smart hypotheses about how these functions could be structured, as a proxy for the real behaviour.
Ideal observer modelling
The main axiom is that information is uncertain. Typical approaches include Boltzmann machines (stochastic Hopfield networks), Bayesian networks, statistical learning (support vector machines), variational Bayes, and MCMC. Deep learning did not include uncertainty initially, but more recent work does.
To understand the environment, the brain needs an understanding of uncertainty. If we can understand how the brain represents and uses uncertainty, we can improve AI algorithms.
Based on Bayesian decision theory, uncertainty is handled by having a prior on the state of the world P(s_w); an observation with sensory evidence P(e_s | s_w) then yields a posterior P(s_w | e_s). These probabilities are distributions over the possible states.
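A minimal sketch of this update for a discrete set of world states; the prior and likelihood values below are made up purely for illustration:

```python
import numpy as np

# Bayesian update over discrete world states s_w.
# prior = P(s_w), likelihood = P(e_s | s_w) for one observed piece of evidence e_s.
prior = np.array([0.7, 0.2, 0.1])        # hypothetical prior over three states
likelihood = np.array([0.1, 0.6, 0.3])   # hypothetical likelihood of the evidence

posterior = prior * likelihood
posterior /= posterior.sum()             # normalise to get P(s_w | e_s)
print(posterior)
```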
A typical application in the brain is combining uncertain evidence from multiple sources, such as audio and visual information. Each source provides an estimate of the value together with an uncertainty estimate. The combined estimate is usually sharper and is weighted towards the more certain source.
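A minimal sketch of this cue combination under Gaussian assumptions; the numbers are hypothetical, and each cue is weighted by the inverse of its variance so the fused estimate is pulled towards the more reliable cue:

```python
# Optimal cue combination for two Gaussian estimates (e.g. auditory and visual).
# The fused variance is smaller than either individual variance.
def combine(mu_a, var_a, mu_v, var_v):
    w_a, w_v = 1.0 / var_a, 1.0 / var_v
    mu = (w_a * mu_a + w_v * mu_v) / (w_a + w_v)
    var = 1.0 / (w_a + w_v)
    return mu, var

# The visual cue is more certain here, so the result lies closer to 12.
print(combine(mu_a=10.0, var_a=4.0, mu_v=12.0, var_v=1.0))
```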
Priors allow us to explain several optical illusions (Weiss, Simoncelli & Adelson, 2002). We seem to have a prior favouring slow speeds, meaning that the barber pole illusion is caused by our preference for the slower upward velocity over the sideways velocity.
Lecture by Tomer Ullman at Harvard University. My personal takeaways from auditing the presented content.
Course overview at https://klab.tch.harvard.edu/academia/classes/BAI/bai.html
Biological and Artificial Intelligence
The development of intuitive physics and intuitive psychology
Turing proposed that an AI could be developed very much like a human: from an empty notebook, or child, to a developed adult. However, early development research in psychology and cognitive science has shown that the notebook may not be "empty". There is some core knowledge that seems to be either innate to or very early developing in humans, though the notion is contested and still under active research.
Evolutionarily, it may make sense to kickstart a "new" being with some innate knowledge to give it a head start instead of having to acquire the knowledge on its own. Core knowledge is limited to a few domains. In core physics knowledge, infants have expectations about objects, among others permanence, cohesion, solidity, smooth paths, and contact causality. There is not much more than these principles. At the moment, the limit on these core knowledge expectations seems to lie more in how we probe infants' knowledge than in what they actually know; research is actively conducted to find a lower limit.
Side note: For preverbal infants, surprise is measured by the time they spend looking at something, but this can be confounded with looking at things or people they are attached to (like parents).
Alternatives to Core Physics?
Physical Reasoning Systems
There could be physical reasoning systems (Luo & Baillargeon, 2005) in which visually observable features are evaluated to decide which physical outcome will happen. For infants, it appears that this reasoning system is refined over development. Lerer et al. (2016) trained a feedforward deep network to evaluate whether a stack of blocks is stable, but the system did not generalize.
Since 2010 a cognitive revolution of sorts has happened in neural network architectures: decoders, LSTMs, memory, and attention have become "off the shelf" components. Using these systems (Piloto et al., 2018), it is possible to generalize better (51% success at classifying surprise).
Mental Game Engine Proposal
Maybe the human brain works like a game engine that emulates physics to approximate reality (Battaglia et al., 2013, in PNAS). A minimal example consists of a model, test stimuli, and data. This is an ongoing area of research. A model of physics understanding at 4 months, consisting of approximate objects, dynamics, priors, re-sampling, and memory (Smith et al., 2019), is used to predict the next state, which is then compared to the real next state. In this context, surprise can be defined as the difference between the prediction and the outcome.
Core Psychology
Mental planning engine proposal
There are also expectations about agents. Infants have ideas about agents' goals, actions, and planning.
Takeaway
There are many possible routes for models:
Human brains could be the way to model intelligence.
Intelligence can be modelled in another way and need not be human-like.
A general/universal function approximator may eventually converge to human behaviour/ability.
A general/universal function approximator actually represents human behaviour/ability.
The problem is that any input-output mapping can be represented with a look-up table and thus involve no intelligence. Many models may eventually end up in "look-up-table land", where they don't learn an actual model but only a simple look-up. These models can be useful for solving some tasks, but they show no common sense and fail easily under variation.
At this time, reinforcement learning is only a solution insofar as evolution seems to have "used" it to produce human behaviour. But how that worked, and what conditions are needed to make it work, are still unknown, and thus reinforcement learning is not yet the solution for obtaining human-like behaviour.
Lecture by Cengiz Pehlevan at Harvard University. My personal takeaways from auditing the presented content.
Course overview at https://klab.tch.harvard.edu/academia/classes/BAI/bai.html
Biological and Artificial Intelligence
Inductive bias of neural networks
A brain can be understood as a network with about 10^11 neurons (nodes) and 10^14 synapses (parameters). Geoffrey Hinton cleverly observed that "The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than (supervised, C.P.) data." But the biologist Anthony Zador argues that animal behaviour is not the result of algorithms (supervised or unsupervised) but is encoded in the genome: when born, an animal's structured brain connectivity enables it to learn very rapidly.
Deep learning is modeled on brain functions. While we cannot (yet) answer why brains don't overfit, we can maybe understand why modern deep learning networks with up to 10^11 parameters don't overfit even when they have orders of magnitude more parameters than data. Double descent implies that beyond the interpolation threshold, that is one parameter per data point, the test error starts to fall again.
Using the simplest possible architecture, we analyze how to map x -> y(x) with two hidden layers and 100 units per layer. We obtain about 10,000 parameters, which turns out to be heavily overparameterized. Any curve between two data points would be possible, but something close to a straight line is produced. It is as if the neural network applied Occam's Razor. It seems that neural networks are strongly biased towards simple functions.
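A rough sketch of this toy setup (two hidden layers of 100 units, only a couple of training points); the exact training details from the lecture are not known to me, so this only illustrates the qualitative claim that the fitted curve between the points tends to be close to a straight line:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Heavily overparameterized network fitted to two points.
x_train = np.array([[-1.0], [1.0]])
y_train = np.array([-1.0, 1.0])

net = MLPRegressor(hidden_layer_sizes=(100, 100), activation="relu",
                   solver="lbfgs", alpha=0.0, max_iter=5000, random_state=0)
net.fit(x_train, y_train)

x_test = np.linspace(-1, 1, 11).reshape(-1, 1)
print(np.round(net.predict(x_test), 2))   # typically close to a straight line
```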
What are the inductive biases of overparametrized neural networks?
What makes a function easy/hard to approximate? How many samples do you need?
Can we have a theory that actually applies to real data?
What are the signatures of inductive biases in natural population codes?
Goal of network training
Cost function for training: min over theta of (1/2) * sum_{mu=1}^{P} ( f(x^mu; theta) - f_T(x^mu) )^2
However, with more unknowns than equations, we end up with a hyperplane of possible solutions. Gradient descent lands somewhere on that hyperplane and therefore introduces a bias towards a specific point (depending on the random initialization). To understand the bias, we need to understand the function space (what can the network express?), the loss function (how do we define a good match?) and the learning algorithm (how do we update?).
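A sketch of the linear analogue of this bias: for an underdetermined least-squares problem, gradient descent initialised at zero converges to the minimum-norm point on the zero-loss hyperplane; with a random non-zero initialisation the limit shifts with the initialisation, which is the bias referred to above.

```python
import numpy as np

# Underdetermined linear system: 5 equations, 20 unknowns.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)

theta = np.zeros(20)          # zero initialisation
lr = 0.01
for _ in range(20000):
    grad = X.T @ (X @ theta - y)
    theta -= lr * grad

theta_min_norm = np.linalg.pinv(X) @ y     # minimum-norm solution
print(np.allclose(theta, theta_min_norm, atol=1e-4))   # True
```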
My own thought that I need to check: a neural network projects such a hyperplane into the output space, so the simplest/closest projection is probably a line or approximately a line. The answer is that this holds only for linear regression and special setups. The theta space is too complex; only in weight space can we argue for linearization.
Can we simplify this to solve this hard problem?
Looking at an infinitely wide network, we can study the function space. Through the neural tangent kernel we can see that most of the time we produce something close to a line. The wider the network, the easier it is to fit the points from the random initialization with only minimal parameter changes. A Taylor expansion shows that wide networks linearize around their initialization under the gradient flow of the loss.
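A small sketch of the empirical neural tangent kernel for a one-hidden-layer ReLU network; the architecture, width, and inputs are my own toy choices, not the lecture's:

```python
import numpy as np

# Empirical NTK: K_ij = <grad_theta f(x_i), grad_theta f(x_j)> at initialisation.
# For very wide networks this kernel barely changes during training, which is
# the sense in which wide networks linearise around their initialisation.
def init_params(d_in, width, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(width, d_in))
    w2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=width)
    return W1, w2

def grad_f(x, W1, w2):
    pre = W1 @ x
    h = np.maximum(pre, 0.0)             # hidden activations
    mask = (pre > 0).astype(float)       # ReLU derivative
    dW1 = np.outer(w2 * mask, x)         # df/dW1
    dw2 = h                              # df/dw2
    return np.concatenate([dW1.ravel(), dw2])

X = np.random.default_rng(1).normal(size=(4, 3))   # four toy inputs
W1, w2 = init_params(d_in=3, width=5000)
J = np.stack([grad_f(x, W1, w2) for x in X])
K = J @ J.T                                        # empirical NTK, shape (4, 4)
print(np.round(K, 2))
```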
Kernel Regression
The goal is to learn a function f: X -> R from a finite number of observations. The function f lives in a reproducing kernel Hilbert space (RKHS), essentially a special kind of smooth function space with an inner product. The regression then minimizes the quadratic loss plus lambda times the RKHS norm to penalize complex functions. In the RKHS there is a unique solution, and in the ridgeless limit it matches the zero-training-error solution of infinite-width neural networks. Kernel regression is easier to study than neural networks and can shed light on how they work. The eigenfunctions of the kernel form an orthonormal basis (analogous to eigenvectors) of the RKHS.
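A minimal kernel (ridge) regression sketch with an RBF kernel, where the regulariser lambda plays the role of the norm penalty described above; the data is synthetic:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)

lam = 1e-3
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # representer-theorem weights

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = rbf_kernel(X_test, X) @ alpha                 # predictions
print(np.round(f_test, 2))
```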
The functions that a neural network can express in the infinite-width scenario are part of an RKHS. Taking the space of these functions, we can say that a neural network of infinite width just combines these eigenfunctions with particular weights.
Own thought: the eigenfunctions are fixed and the weights can be learned like in a perceptron. Can we stack layers of such eigenfunctions and get something new, or is that just linearizable as well? The answer is that it linearizes again!
Application to real datasets
Image data sets can be analysed with kernel PCA: take the kernel eigenvalues from KPCA and project the target values onto the kernel eigenbasis to obtain target weights for the one-hot encoding.
Applying the generalization error based on the eigenfunctions, we can look at the relative error of the network's weights compared to the eigenfunction-space weights and obtain a spectral bias that tells us which eigenfunctions are primarily selected. The larger the spectral bias, the more the network is likely to rely on that particular eigenfunction to produce the output.
We can use KPCA to understand how the data is split and how many eigenfunctions (KPCA principal components) we need to discriminate between outputs.
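A sketch of this KPCA view on synthetic data: eigendecompose the centred kernel matrix and project one-hot targets onto the kernel eigenbasis; the size of each projection indicates how much that eigendirection is needed to represent the target function. The data and kernel choice here are my own toy assumptions.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.3, size=(30, 2)), rng.normal(1, 0.3, size=(30, 2))])
Y = np.zeros((60, 2)); Y[:30, 0] = 1; Y[30:, 1] = 1     # one-hot labels, two classes

K = rbf_kernel(X, X)
n = len(X)
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H                                          # centred kernel matrix
eigvals, eigvecs = np.linalg.eigh(Kc)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

target_weights = eigvecs.T @ Y                          # targets in the kernel eigenbasis
print(np.round(eigvals[:5], 2))
print(np.round(target_weights[:5], 2))
```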
Lecture by Richard Born at Harvard University. My personal takeaways from auditing the presented content.
Course overview at https://klab.tch.harvard.edu/academia/classes/BAI/bai.html
Biological and Artificial Intelligence
Warren Weaver headed the natural sciences division of the Rockefeller Foundation in the 1950s; he said that the future of engineering is to understand the tricks that nature has come up with over the millennia.
Anatomy of visual pathways
The visual system spans large areas of the brain, and damage to many parts of the brain causes malfunctions of the visual system.
The world is mirrored on the retina. About one million axons leave the retina. Vision is also connected to the brainstem to orient the head in space, a semi-automatic system for paying attention. It is also connected to the circadian rhythm to manage the sleep cycle through brightness.
The important area in primates is V1, the striate cortex (area 17). A lesion here makes humans blind. The brain has no visual understanding per se; it only produces action potentials that are interpreted as visual. In monkeys, there are more than 30 visual areas, roughly grouped into two streams: the ventral stream (lower) is concerned with the "what" (object recognition) and the dorsal stream (upper) with the "where" (spatial perception). Retinotopic representations are aligned with retinal space, but object recognition ought to be object-centred; how the brain converts this to world coordinates is still an open question. Mishkin showed in 1983 that monkeys taught to associate food with a specific object or with a specific location solve the task at chance level if they have lesions in the respective brain area.
Receptive Fields
A receptive field is the region of the sensory epithelium that can influence a given neuron's firing rate. Hubel and Wiesel showed that neurons in the lateral geniculate nucleus (LGN) respond to light with a centre-surround organisation. Hartline showed that surround suppression helps to locate points of interest: the brain is interested in points in visual space where the derivative is non-zero. Brains locate contrast (space), colour contrast (wavelength), transience (time), motion (space & time), and combinations of space & colour.
Hierarchical receptive fields
There is a hierarchical elaboration of receptive fields. Hubel & Wiesel also measured signals in the primary visual cortex and found that neurons encode the orientation of an edge, with a stronger off-response on one side and no response to diffuse illumination. Essentially, we can think of these neurons as filters or convolutions (a simplification). The brain does this in parallel, in contrast to a computer. Horace Barlow noted that the brain focuses on suspicious coincidences (e.g. unusual changes).
We go from the LGN (centre-surround) to simple cells (orientation) to complex cells (contrast-invariant across an area, via pooling/softmax). In the 1950s, the psychologist Attneave found the 17 points of maximal curvature in an image of a cat, connected them with lines, and produced an abstract representation that was still recognizable as a cat.
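A sketch of the "neurons as filters" simplification: a centre-surround (difference-of-Gaussians, LGN-like) filter and an oriented Gabor (simple-cell-like) filter applied to a toy image containing a vertical edge; the filter parameters and image are my own illustrative choices:

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian2d(size, sigma):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def gabor(size, sigma, theta, wavelength):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    return np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2)) * np.cos(2 * np.pi * xr / wavelength)

image = np.zeros((32, 32))
image[:, 16:] = 1.0                                    # vertical luminance edge

dog = gaussian2d(9, 1.0) - gaussian2d(9, 2.0)          # centre-surround filter (sums to ~0)
simple_cell = gabor(9, 2.0, theta=0.0, wavelength=6)   # vertical-orientation filter

print(np.abs(convolve2d(image, dog, mode="valid")).max())          # responds near the edge
print(np.abs(convolve2d(image, simple_cell, mode="valid")).max())  # responds to the orientation
```

Note that both filters give (near) zero response to a uniform region, matching the "no response to diffuse illumination" observation above.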
Convolutional Neural Networks
An engineered alternation of selectivity (convolution) and generalization (pooling) led to success early on in vision research, but then came deep networks. Deep networks, however, also apply convolutions, rectification (ReLU), pooling, and finally normalization; learning these operations rather than hand-engineering them improved performance.
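A minimal sketch of that canonical building block in PyTorch (a generic block, not the specific architecture discussed in the lecture; channel counts and kernel sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Convolution (selectivity), ReLU (rectification), pooling (generalization), normalization.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.BatchNorm2d(16),
)

x = torch.randn(1, 3, 32, 32)   # one toy RGB image
print(block(x).shape)           # torch.Size([1, 16, 16, 16])
```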
Yamins et al. (2014) showed that AlexNet has some non-trivial similarity with the ventral stream visual areas of monkey brains.
What is missing?
When noise is added to images, CNNs fail quickly at around 20% noise, whereas human performance degrades gracefully with the level of noise. Even worse, CNNs can learn solutions for a specific kind of noise but end up failing if the noise changes.
In Ponce et al. (2019), random codes are fed to a generative neural network to synthesize images. A neuron's response is used as the objective function to rank the synthesized images, and a genetic algorithm is applied to find the codes that maximally drive the neuron.
PredNet makes predictions on video streams with unsupervised learning.
Tootell and Born showed in 1990 that the visual cortex data is still very retinotopic, but area MT is organized in hypercolumns that detect motion direction rather than spatial connectivity.
Neurons near each other seem to like doing the same thing. Brains are not just look-up tables but have a semantic structure in their spatial organisation.
Lecture by Gabriel Kreiman at Harvard University. My personal takeaways from auditing the presented content.
Course overview at https://klab.tch.harvard.edu/academia/classes/BAI/bai.html
Biological and Artificial Intelligence
Eventually, we will have ultraintelligent machines, that is machines that are more intelligent than humans.
Going back to 1950, the Turing test was the first test of whether a machine is indistinguishable from a human. However, the test is limited to language interaction. Turing tests can be adapted to other modalities such as vision: the adapted Turing test for vision would allow asking any question about an image.
Intelligence is the greatest problem in science. Tomaso Poggio postulates that if we can understand the brain, we can understand intelligence. Consequently, we might become more intelligent ourselves and intractable problems might be resolved.
What can’t deep convolutional networks do?
An adversarial attack, adding a small amount of noise, can immediately change the predicted outcome of a deep convolutional network. Such a network cannot pass the visual Turing test because a human would not be fooled by this simple manipulation.
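The lecture did not specify which attack was shown; as one well-known example, here is a sketch of the fast gradient sign method (FGSM) on a toy, untrained classifier, purely to illustrate how a tiny perturbation is constructed from the input gradient:

```python
import torch
import torch.nn as nn

def fgsm(model, x, label, epsilon=0.01):
    """Perturb x by epsilon in the direction that increases the loss."""
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), label)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()

# Toy model and input purely for demonstration.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.randn(1, 3, 32, 32)
label = torch.tensor([3])
x_adv = fgsm(model, x, label)
print((x_adv - x).abs().max())   # perturbation bounded by epsilon
```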
Action recognition often relies on everything in the image except the action itself, exploiting biases in the data collection process. Again, the visual Turing test would fail where a human would succeed.
The most powerful computational devices on Earth
The human brain is still the most powerful device because it generalises and can solve unknown tasks.
The Biophysics of computation (Neuroscience 101)
A neuron has dendrites that collect information, an internal function in the cell body that sums the dendritic inputs, and an activation function that leads to an axon that outputs a value. Axons then connect to other neurons' dendrites via synapses.
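A cartoon of that description as code, with arbitrary example values: inputs stand in for the dendrites, the weighted sum for integration in the cell body, and a sigmoid for the activation function.

```python
import numpy as np

def neuron(inputs, weights, bias):
    summed = np.dot(weights, inputs) + bias    # summation in the cell body
    return 1.0 / (1.0 + np.exp(-summed))       # activation function (sigmoid)

print(neuron(inputs=np.array([0.5, 1.0, -0.3]),
             weights=np.array([0.8, -0.2, 1.5]),
             bias=0.1))
```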
Studying animals is critical not only for behaviour but also to understand underlying mechanisms. David Hubel and Torsten Wiesel placed electrodes in the back of the brain and found neurons that reacted to the orientation of objects. Many computer scientists used their cartoons of brain function (and others) to design neural networks for AI. Indeed, nearly every type of neural network has an analogue in biology.
Disruptive Neuroscience
Circuit diagrams that show the full connectivity in a cubic micron of brain matter.
Listening to a concert of many neurons: we are now able to record from many neurons (tens of thousands) simultaneously over prolonged periods of time.
Causally interfering with neural activity: we are able to turn specific neurons on and off through their ion channels with light triggers (optogenetics).
Together, today we have a better understanding of the connectivity of neurons, the activity of neurons and even the manipulation of neurons. In the long-run we may be able to understand biological intelligence.
Tangents to the topic
Consciousness is a matter of major debate. If a machine passes the Turing test, does it have a consciousness? Christof Koch argues that consciousness and intelligence are separate phenomena.
Humans already form attachments to simple machines like Tamagotchis. The Atlas robot from Boston Dynamics is trained by being pushed over; its human-like appearance raises the question whether cruel behaviour towards machines is ethical.
The perils of AI are:
Redistribution of jobs
Unlikely terminator-like scenarios
Military applications
To err is algorithmic (just like humans)
Biases in training data (note that humans have biases too or create them for the machine -> garbage in / garbage out)
Lack of understanding (we still don’t understand how humans make decisions either)
Social, mental, and political consequences of rapid changes in labour force
Rapid growth, faster than development of regulations
But robots playing football are still years if not decades away from human-like behaviour.
Another point is comprehending humour. Humour is based on higher-level abstractions of content, so a system requires access to knowledge and must reason about the contents shown to infer the humorous component (e.g. a picture of Abraham Lincoln with an iPhone).
From Twitter: https://twitter.com/BuckWoodyMSFT/status/1242557978215145473
P.S.: Note from Lecture 2: Gabriel Kreiman believes that the next revolutions in machine learning will be based on something that we can learn from biological intelligence.
All talks are summarised in my words which may not accurately represent the authors’ opinion. The focus is on aspects I found interesting. Please refer to the authors’ work for more details.
Grow circles around the data points and generate a graph by connecting points whenever another point meets a circle; this produces a filtration of simplicial complexes whose persistent homology can be computed (e.g. adding edges may change the homology). Persistence barcodes and persistence diagrams encode the same information produced by this process.
Measures are nicer to work with than sets of points for statistical purposes. If the persistence diagram D is a random variable, then E[D] is a deterministic measure on R². Persistence images reveal E[D] and are more interpretable than persistence diagrams, which may be too crowded for visual inspection with a large sample.
Persistence can be used as an additional feature on a dataset. For example, a random sample from the data set can be taken and the persistence diagram/image can be computed and compared between random samples giving us an idea of the stability of the homology.
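A sketch of that pipeline, assuming the `ripser` package is installed: sample a noisy circle, build the Vietoris-Rips filtration, and read off the persistence diagrams; the loop should show up as one long-lived 1-dimensional feature.

```python
import numpy as np
from ripser import ripser   # assumes the `ripser` package is available

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=100)
points = np.c_[np.cos(angles), np.sin(angles)] + 0.05 * rng.normal(size=(100, 2))

diagrams = ripser(points, maxdim=1)["dgms"]
print(diagrams[0].shape)    # H0 features (connected components)
print(diagrams[1])          # H1 features (loops): birth/death pairs
```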
To randomly sample from a density f_0 there are generally two approaches: parametric and non-parametric methods. A density f is log-concave if log f is concave; the super-level sets need to be convex. Univariate examples include the normal, the logistic, and more. The class is closed under marginalisation, conditioning, convolution, and linear transformations.
With an unbounded likelihood, the estimated density surface can become spiky; restricting to log-concave densities addresses this.
Part 1: When we are learning, we find a good fit (of weights) for the data. What kind of functions can be approximated by a neural net? Essentially all, but the question is how large the network has to be to approximate f to within error e. The question should rather be: what class of functions can be approximated by low-norm neural nets? Another question: given a bounded number of units, what norm is required to approximate f to within any error e? The cost of the weights is taken as the parameter. This results in linear splines. A neural net with infinite width and one hidden layer corresponds to solving a Green's function problem.
Part 2: How does depth influence this? Deep learning should be considered with infinite width and implemented with a finite approximation. Deep learning focuses on searching a parameter space that maps into a richer function space.
All talks are summarised in my words which may not accurately represent the authors’ opinion. The focus is on aspects I found interesting. Please refer to the authors’ work for more details.
Learning directed acyclic graphs (DAGs) can traditionally be done in two ways: conditional independence testing and score-based search. The latter poses a local search problem without a clear answer. More recently the problem has been posed as a continuous (global) optimisation for undirected graphs.
The loss function is a log-likelihood of the data, and we need to find the most appropriate W such that X = XW + E. They provide a new M-estimator.
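The talk's own M-estimator is not reproduced here; as a hedged sketch of the continuous formulation for the linear model X = XW + E, here is a NOTEARS-style score (least-squares residual plus an acyclicity penalty), which may differ from the estimator presented:

```python
import numpy as np
from scipy.linalg import expm

def score(W, X, lam=0.1):
    """Least-squares fit of X = XW plus a smooth acyclicity penalty on W."""
    n, d = X.shape
    residual = X - X @ W
    least_squares = 0.5 / n * np.sum(residual ** 2)
    acyclicity = np.trace(expm(W * W)) - d       # zero iff W encodes a DAG
    return least_squares + lam * acyclicity

X = np.random.default_rng(0).normal(size=(100, 3))
W = np.zeros((3, 3))                             # empty graph as a starting point
print(score(W, X))
```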
The talk focuses on how to align data when the inter-subject variation is large but consistent and the intra-subject variation can be mapped. Parallel transport aims to align the inter-subject values on a symmetric positive definite (SPD) embedding in n-dimensional space. SPD matrices are embedded on a hyperboloid, and all computations can be performed in closed form.
With data from multiple subjects and multiple sessions, it does not matter whether one first aligns the sessions or the subjects, which only works for parallel transport and not for identity transformations.
Session 3 – Persistence framework for data analysis
Persistence diagrams can be used to describe complexity. The features are simpler but persistent properties of the underlying object. Viewing a geometric object through a filtration perspective produces a summary; a filtration is a growing sequence of spaces. The times at which features get created and destroyed can be mapped onto a persistence diagram with birth time on the x-axis and death time on the y-axis.
The bottleneck distance is a matching between two persistence diagrams such that each feature is matched with the shortest distance. Features may be matched to the diagonal (capturing noise) if they are too close to it. More complex approaches include persistence images, which transform the diagram into a kernel density estimate.
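A small sketch of the bottleneck distance between two hand-made diagrams of birth/death pairs, assuming the `persim` package; the short-lived feature near the diagonal gets absorbed as noise:

```python
import numpy as np
from persim import bottleneck   # assumes the `persim` package is available

dgm1 = np.array([[0.0, 1.0], [0.2, 0.3]])   # one prominent and one near-diagonal feature
dgm2 = np.array([[0.0, 0.9]])               # a single prominent feature
print(bottleneck(dgm1, dgm2))
```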
The weight function should be application dependent and thus can be learned instead of pre-assigned. We can just take the difference between two persistence images as a weighted kernel for persistence images (WLPI).
For graphs, the following metrics can be used for persistence: the discrete Ricci curvature captures the local curvature on the manifold, and the Jaccard index compares which neighbours nodes have in common, which is good for noisy networks.
In general, a descriptive function must be found for the domain and may even encode meaningful knowledge on how the object behaves. High weights would describe the more distinct features.
A d-dimensional point cloud can be converted to a graph representation using a kernel that connects nearby points (with a fall-off or discontinuity). As the number of nodes n goes to infinity, the kernel bandwidth should shrink to 0.
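A sketch of such a construction with a truncated Gaussian kernel; the bandwidth and cut-off values below are arbitrary illustrative choices:

```python
import numpy as np

def kernel_graph(points, h, cutoff):
    """Weighted adjacency matrix from a Gaussian kernel, truncated at `cutoff`."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    W = np.exp(-d ** 2 / (2 * h ** 2))
    W[d > cutoff] = 0.0          # the discontinuity: disconnect far-apart points
    np.fill_diagonal(W, 0.0)     # no self-loops
    return W

points = np.random.default_rng(0).uniform(size=(50, 2))
W = kernel_graph(points, h=0.1, cutoff=0.3)
print((W > 0).sum() // 2, "edges")
```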
The kernel bandwidth is critical. The take-away is that instead of using single labeled data points, the labels should be extended beyond the kernel bandwidth. A single labeled point can produce spikes, because the minimiser obtains smaller values for a flat surface with a single spike than for an appropriate surface.
Structured machine learning is not structure learning. It refers to learning functional dependencies between arbitrary output and input data. Classical approaches include likelihood estimation models (struct-svm, conditional random fields, but limited guarantees) and surrogate approaches (strong theoretical guarantees but ad hoc and specific).
Applying empirical risk minimisation (ERM) from statistical learning, we can expect the mean of the empirical data to be close to the mean of the class. However, it is hard to pick a class. The inner risk (decomposing into the marginal probability) reduces the class size. Making a strong assumption, the structured encoding loss function (SELF) requires a Hilbert space and two maps such that the loss function can be represented as an inner product. Using a linear loss function helps. For an arbitrary space Y (it need not be linear) the SELF gives enough structure to proceed. This enlarges the scope of structured learning to inner risk minimisation (IRM).
There is a function psi hidden in the loss function that encodes and decodes between Y and the Hilbert space. The steps are: encode Y in H, learn a map from X to H, and decode from H to Y. In linear estimation with least squares, the encoding/decoding disappears and the output space Y is not needed during computation.
Today is the first day of my stay at the Institute for Pure and Applied Mathematics (IPAM) at the University of California, Los Angeles (UCLA). Over the coming weeks I will try to discuss interesting talks here at the long course Geometry and Learning from Data in 3D and Beyond. Stay tuned for the first workshop on Geometry of Big Data.
I recently came across Women Also Know Stuff. I think it is a great initiative that helps to slowly combat systemic and structural inequality. They point to many female scientists in most social sciences, and I wondered whether I could find a similar program in computer science. The answer was no, because apparently we first need to get women into computer science. I would still love to see #WomenAlsoKnowComputerScience on Twitter, alas the search results are empty. It is not that I don't know great female computer scientists, but maybe they lack exposure, which makes it all the harder to convince women to join the field.
What I thought could help would be greater exposure in scientific citations. I will need to go a bit off-topic to explain my thinking, but bear with me. Citations produce scale-free networks (Klemm & Eguiluz, 2002).
Comparison of a random network and a scale-free network. The scale-free network shows super-connected nodes (hubs) in grey. Taken from Wikipedia.
That means that a few super-connected nodes (so-called hubs) take up almost all the citations. In general, if we as scientists need a citation to underline a concept, we are much more likely to end up citing such a super-connected node. What that means is that highly cited scientists get cited even more, and less cited scientists remain less cited, even if their science was better. Network effects (or economies of scale) ensure that it is not necessarily the best science that is cited the most, but usually the science preserving the status quo (Wang, Veugelers, & Stephan, 2017). But the effect is even stronger than that. The big names (not only the citations) dominate the field to such an extent that alternative explanations favored by other scientists are locked out of the discussion until such a star departs from the field (Azoulay, Fons-Rosen, & Zivin, 2015).
So where does that leave us with citing female scientists? They are at a triple-disadvantage:
They have been structurally excluded from the discipline
They (usually) don’t have a big name so their citation counts don’t increase
As there are no role models young women may not take up the field
However, and this is what I would like to stress most, it is not the quality of their research. Now, if citations are usually not awarded for merit only but mainly due to structural reasons, why not use them to start shifting the scales today, so that some day in the future women are equally represented in this field (and in many others), as the statistical distribution of people would predict.
The Manifesto to Cite 50/50
Making a citation to underline a concept does not require us to only cite that one citation that we always use. We can vary whom we cite and we can choose to cite female scientists as well.
Citing a female scientist does not cost us anything in our career but it may help build those careers that eventually will bring equality.
Citing a female scientist when we only have male scientists at our hand makes us critically reflect our own field and possibly help us to engage with research more deeply to find female scientists.
We probably won’t reach a 50/50 quota any time soon in our citation lists but maybe we can start climbing towards it. I admit I am not there yet and I haven’t done this for any publication I produced yet, but I am of a mind to change this. Maybe you would like to contribute as well? Change is hard and so my first goal is to have at around 50% of publications having a female co-author (though first author would be preferable). I am sure I will fail miserably to reach that goal in the next few publications I make. But yesterday I sat down and tried to find a few women in the field that I could cite and it was surprising how relevant their research was and shocking how I barely heard of any of them (except those who despite the odds managed to become a big name of their own). I think that in the long-term this practice will also make me a better and more engaged scholar that (at least sometimes) manages to look beyond the in-group in which my work is circulating.
Computer Science and more
Now I know I specifically focused on computer science but probably such an attempt should not be confined to one discipline. It should be a truly interdisciplinary endeavor.
Azoulay, P., Fons-Rosen, C., & Zivin, J. S. G. (2015). Does science advance one funeral at a time? National Bureau of Economic Research.
Klemm, K., & Eguiluz, V. M. (2002). Highly clustered scale-free networks. Physical Review E. APS.
Wang, J., Veugelers, R., & Stephan, P. (2017). Bias against novelty in science: A cautionary tale for users of bibliometric indicators. Research Policy. Elsevier.
Today I write you as part of a mini-series on my stay at the Chicago Forum on Global Cities (CFGC). I have been kindly sponsored by ETH Zurich and the Chicago Forum to participate in the event. I am currently sitting in my train to Zurich airport and I am looking forward to 3 days of intensive discussions on the future of global cities. You will also find a post about this event on the ETH Ambassadors Blog and ETH Global Facebook and you may look out for some tweets.
I hope for many interesting meetings and conversations at the Forum, especially about my main topics of interest Big Data in Smart Cities – for which I have a short policy brief with me designed in the Argumentation and Science Communication Course of the ISTP – as well as ways to design better cities based on Big Data and knowledge of human (navigation) behaviour – the topic of my soon to start PhD.