simmol t1_jdjq815 wrote

I think for this to be truly effective, the LLM would need a huge amount of computer screen images in its training set, and I'm not sure that was done for GPT-4's pre-trained model. But once it's trained on every kind of screen layout one can think of, it would probably work much like a self-driving-car algorithm, navigating the interface based on what it sees.

But this kind of multimodality would mostly be useful when a person is actually sitting at the computer working side by side with the AI, right? If you want to eliminate the human from the loop, I'm not sure this is an efficient way to train the LLM: screen images are what help a human navigate a computer, and they aren't necessarily optimal input for the LLM.

1

MyPetGoat t1_jdk8icb wrote

You'd need the model to be running all the time, observing what you're doing on the computer. Could be done.
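A minimal sketch of what that loop might look like, assuming Pillow for screen capture; `describe_screen` is a hypothetical stand-in for a call to some vision-capable model:

```python
import time
from PIL import ImageGrab  # pip install pillow

def describe_screen(image):
    # Hypothetical placeholder: in practice this would send the frame
    # to a multimodal model API and return its description.
    raise NotImplementedError

while True:
    frame = ImageGrab.grab()       # capture the full screen
    frame.thumbnail((1024, 1024))  # downscale to keep per-frame cost down
    describe_screen(frame)         # let the model observe what the user is doing
    time.sleep(5)                  # poll every few seconds, not continuously
```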

1

simmol t1_jdkd4pf wrote

Seems quite inefficient, though. Can't GPT just access the HTML or other code behind a website and work with the text instead of an image?
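That's roughly the idea: fetch the page and give the model text instead of pixels. A rough sketch, assuming the `requests` and `beautifulsoup4` libraries:

```python
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

def page_as_text(url: str) -> str:
    """Fetch a page and reduce it to plain text an LLM can read directly."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content markup
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

# The resulting text can go into the model's prompt in place of a screenshot.
print(page_as_text("https://example.com")[:500])
```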

1