Amazing connection of dots. This is what philosophy looks like in the 21st century.
I feel like there's an interesting tie-in with "The Bitter Lesson" here ( http://www.incompleteideas.net/IncIdeas/BitterLesson.html ). My amateur summary is that many early AI researchers tried to model AIs after human domain knowledge of different areas. Ultimately, applying raw computation at massive scale produced better AIs, even though the approach was very "blunt-force".
I feel like there are parallels between the "engine"/domain-expert approach and the "camera"/computation-centric approach. The computation-centric approach doesn't try to anthropomorphize information, and as a result reflects our information back to us in all of its weirdness and complexity. And it reflects exactly the kind of weirdness that no team of engineers and experts could or would ever try to embed in an AI. To use your framing, we "discovered" something with this approach, which I doubt would have happened with the domain-expert approach.
"Computer Science is not a Science, it is more like magic as you will see in this course. It is neither about Computers anymore than biology is about microscopes or Petri dishes and Physics is about particle accelerators" - Harold Abelson, Structure and Interpretation Of Computer Programs
The "Rocks we tricked into thinking with lightning" statement shows that indeed we are in magical territory. Abelson/Sussman were not messing around when they put the image of a wizard as the cover image for their course in 1986. https://groups.csail.mit.edu/mac/classes/6.001/abelson-sussman-lectures/wizard.jpg
Excellent write-up--one of the best since I began reading your work with the "manufactured normalcy field" text circa 2011 or so (which I still use as a text in my Grad Theory II course, during a unit on the role of discourse in post-Foucault art theory).
Point 1: For better or worse, the claim towards the end about the importance of care aligns in interesting ways with Heidegger's thought re: care.
Point 2: More substantially, it would be interesting to consider how (or whether) the ideas described here relate to the Tononi/Koch idea of integrated information theory, which attempts to offer measures of "intelligence" or experience or consciousness among entities ranging from rocks on up to humans and beyond, via a measure of intra-systemic complexity and entanglement among parts. Most recent iteration here: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011465. May or may not be interesting to see what applies and what doesn't
All that aside, thanks much for the greatly inspiring insights over the years!
The “caring field” notion of humans is already applicable in many contexts, both good and bad. Interesting to see it added at such a fundamental level
very interesting thoughts. When talking about data or information we are talking about something that is inherently dependent on a model (i.e., the map, not the territory). For example, information is defined by two things: the number of states in a space and the number of symbols. But the space is an abstraction. For example, imagine I say that humans can have one of two states: they can either be right-side-up (R) or up-side-down (U). This is an overly simplistic model of humans, but it might be useful in some circumstances. Then if I have 5 humans they can make a sequence, like RUURR. Then we can measure the information of RUURR. But in order to do that I had to make a vast oversimplification about humans.
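A minimal numeric sketch of that RUURR arithmetic (my illustration, not the commenter's), assuming Python and the toy two-state model above. The point it tries to show is that the number you get depends on further modeling choices, e.g. whether the two orientations are assumed equally likely or weighted by their observed frequencies:

```python
from math import log2
from collections import Counter

# Toy two-state model from above: R = right-side-up, U = up-side-down.
seq = "RUURR"

# If both states are assumed equally likely, each symbol carries log2(2) = 1 bit.
bits_uniform = len(seq) * log2(2)  # 5.0 bits

# Using the empirical frequencies in the sequence instead (Shannon entropy),
# the per-symbol figure drops because R is more common than U here.
counts = Counter(seq)
probs = [c / len(seq) for c in counts.values()]
entropy_per_symbol = -sum(p * log2(p) for p in probs)  # ~0.971 bits/symbol
bits_empirical = len(seq) * entropy_per_symbol          # ~4.85 bits

print(bits_uniform, bits_empirical)
```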
It is a good insight to say that an AI model can be viewed as a property of data. It is not the data itself, but something interesting to say about the data, say, like the number of primes in a set of numbers. But data is just a set of information symbols, and information symbols don't have meaning outside of models. Of course, like numbers, information can be abstracted and reasoned about independent of the model. But at some point we must ask: what is the model underlying this data? And importantly, what abstractions/simplifications make the model what it is?
I feel people often overlook this part and just say that our models represent matter and space, etc. But there are choices we have to make when choosing which map to use, and the map is never equivalent to the territory. There is nothing in the universe that is so simple that there are not infinitely many ways to model it. So we exist as this package of models that 'work' in some sense...that provide us fitness. We take an input from the world, simplify it, generate one or more inputs for our package of models (i.e., information...sorry, I know I am using "model" ambiguously), make a prediction...and then either kill or get killed because of our prediction. Since the models are (when it comes to a capital T truth perspective) rather arbitrary abstractions, they just get selected based on fitness. But there is a lot of interesting math when it comes to fitness of models.
"Some commentators assert that some AI-generated works should receive copyright protection, arguing that AI programs are like other tools that human beings have used to create copyrighted works. For example, the Supreme Court has held since the 1884 case Burrow-Giles Lithographic Co. v. Sarony that photographs can be entitled to copyright protection where the photographer makes decisions regarding creative elements such as composition, arrangement, and lighting. Generative AI programs might be seen as a new tool analogous to the camera, as Kashtanova argued." https://crsreports.congress.gov/product/pdf/LSB/LSB10922
You're definitely onto something!
Well said. Buried in/through your Top 10 is evolutionary theory (e.g. its computational incarnations), which seems to provide the pseudo-teleology in your "We care therefore we are"--we are (made of) things that care because those that didn't, "aren't" (i.e. were selected against by the 2nd law of thermodynamics). In an adjacency, the "suspiciously simple grinding and compressing processes" should admit that it wasn't GOFAI that got us GPT, and it wasn't just the piles of data; it was also Ilya's bet on predictive coding. To your point about The Data, the scaling 'laws' (e.g. Chinchilla) that are emerging from LxM don't 'really' depend on transformers (see the GPT-4 paper's note that LSTMs could work too, just 10x less efficiently) or attention (see Princeton's Mamba model, or Chris Re's equivalent Based model replacing 'attention' with Structured State Space models). Those thinking similar data-focused thoughts would include Ludwig Schmidt's DataComp project (see his recent Simons Institute talk on LLMs), which sets up the right objective for this 'lens' on data: given a fixed large model and fixed training budget for a large training pile, how do you select the pile subset that yields the greatest downstream performance across tasks (generalization), including ICL, presuming only next-token prediction during training? Predictive coding in rocks or meat bags under evolutionary pressure (real or simulated) can turn information into seemingly intelligent behavior--operating at the timescale of that same evolutionary pressure.
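To make that data-selection objective concrete, here is a hypothetical sketch (my paraphrase, not DataComp's actual code or API): the model family and token budget are held fixed, only the subset-selection rule varies, and candidates are scored by average downstream performance. The names `select_best_subset`, `train`, and `evaluate` are illustrative placeholders supplied by the caller:

```python
from typing import Callable, Iterable, Sequence, Tuple

def select_best_subset(
    pile: Sequence[str],
    candidate_filters: Iterable[Callable[[str], bool]],
    train: Callable[[Sequence[str]], object],   # fixed-budget next-token training
    evaluate: Callable[[object], float],        # mean score across downstream tasks, incl. ICL probes
) -> Tuple[Sequence[str], float]:
    """Return the pile subset (and its score) that maximizes downstream performance,
    holding the model family and training budget fixed."""
    best_subset, best_score = pile, float("-inf")
    for keep in candidate_filters:              # e.g. dedup rules, quality classifiers
        subset = [doc for doc in pile if keep(doc)]
        score = evaluate(train(subset))
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```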
This is excellent. Started me thinking about the importance of good taste. In the future AGIs will distinguish themselves with how well they select their training data. Particularly tasteful human data selectors will be prized.
I love it that you brought up Pterry’s trolls :)
“Synthography”
Great piece! If modern AI can achieve the same performance with ever fewer parameters, is it still a good tape measure?