This is a detailed reading of Google’s paper “Deep Neural Networks for YouTube Recommendations”.
So Google, fxxk you.
After implementing grid search myself, then having a horrible time writing pyplot code to visualize the results, I finally decided to find an existing tool to do the hyperparameter (HP) tuning for me.
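As an illustration only (the excerpt does not say which tool was chosen), here is a minimal sketch of handing HP tuning to an existing library, assuming Optuna; the search space and the fake objective are placeholders for a real train/validate run.

```python
import optuna

def objective(trial):
    # Hypothetical search space; replace with your model's hyperparameters.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 4)
    # Stand-in for a real training run that returns a validation loss.
    return (lr - 1e-3) ** 2 + 0.01 * n_layers

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```

A nice side effect of such tools is that they ship their own result visualizations, so the pyplot pain largely goes away.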
Before this, see 2024/06/17 Conducting Multi-Round Conversation with Transformers for why we need a cache at all. But attention has three matrices: query, key, and value. Why do we cache only the past keys and values? What about the past queries?
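A minimal sketch of the answer, using made-up names rather than any library’s API: at decode step t, the output for the new token is computed from its own query q_t attending over all past keys and values, while the queries of earlier tokens are never used again, so there is nothing to gain from caching them.

```python
import torch

d = 8
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []          # grows by one entry per generated token

def decode_step(x_t):
    """x_t: (1, d) embedding of the newest token only."""
    q_t = x_t @ W_q                # query of the NEW token: used once, never cached
    k_cache.append(x_t @ W_k)      # keys and values ARE reused by every future step
    v_cache.append(x_t @ W_v)
    K = torch.cat(k_cache, dim=0)  # (t, d)
    V = torch.cat(v_cache, dim=0)  # (t, d)
    attn = torch.softmax(q_t @ K.T / d ** 0.5, dim=-1)  # (1, t)
    return attn @ V                # output for the new token only

for _ in range(5):                 # toy autoregressive loop
    out = decode_step(torch.randn(1, d))
```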
I was using LLaVA to ask how many characters there are in an image. For higher accuracy, I decided to employ Chain of Thought (CoT), but struggled to implement it: CoT is conducted through a multi-round conversation, which is easy in a graphical chat interface, but how is it done internally in code?
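A minimal sketch of the multi-round mechanics, assuming the Hugging Face transformers chat-template API: the model is stateless, so each round you re-serialize the whole message history (including the model’s own previous answer) and generate again. The model name and prompts are placeholders; the actual LLaVA setup additionally needs an image processor.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B-Instruct"   # placeholder chat model, not the LLaVA pipeline
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def ask(messages):
    # Serialize the full history every round; the model itself keeps no state.
    ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    out = model.generate(ids, max_new_tokens=128)
    return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)

# Round 1: elicit the intermediate reasoning step.
messages = [{"role": "user", "content": "List every character you can identify, one per line."}]
messages.append({"role": "assistant", "content": ask(messages)})

# Round 2: append the model's own answer, then ask for the final count.
messages.append({"role": "user", "content": "Now count them and reply with a single number."})
print(ask(messages))
```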
One day before Google I/O, OpenAI held its Spring Update event, introducing the end-to-end multimodal model GPT-4o.
CLIP investigates whether the success of task-agnostic, web-scale pre-training in NLP can be transferred to another domain: computer vision (CV).
Loss Scaling / Gradient Scaling was mentioned in Mixed-Precision Training as one of the three techniques, but there are many points to be careful about in practice.
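As one common realization of the technique (not necessarily the exact setup discussed in the post), here is a minimal sketch of dynamic loss scaling with PyTorch’s torch.cuda.amp.GradScaler; the toy model and data are only there to make the loop runnable, and a CUDA device is assumed for fp16 autocast.

```python
import torch

model = torch.nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()                    # dynamic loss scaling

for _ in range(10):
    x = torch.randn(32, 16, device="cuda")
    y = torch.randn(32, 1, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()    # multiply the loss by the scale so tiny fp16 grads don't underflow
    scaler.step(optimizer)           # unscales grads first; skips the step if any grad is inf/nan
    scaler.update()                  # grows the scale after successes, shrinks it after an overflow
```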