Research Scientist, Web Data
7 days ago
HTML into clean text data to be used during training, potentially including relevant image data), data filtering (removing or down-weighting low-quality content), or adding new data sources (such as historical web crawl data). Working with web data for LLM training, such as cleaning data, removing dupli