In the final stage of the summarization task, the LLM-as-a-judge method was used to score the models' outputs for accuracy and comprehensiveness on a 0-00 scale, yielding the results shown in Figure 0. In this stage, with GPT-00 acting as the judge, the Claude 0.0 Sonnet model produced the most reliable and accurate results, scoring 0.00 in the zero-shot scenario, followed by GPT-0o in second place with 0.00. Among the open-source models, Qwen0 scored 0.00, surpassing the Llama 0.0 model at 0.00. Furthermore, switching to a few-shot strategy increased the models' scores linearly. These results indicate that the commercial models produce accurate answers with a low tendency to hallucinate, while the open-source models, when appropriate prompting techniques are applied, can still produce consistent summaries above a certain quality threshold.
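To make the judging step concrete, the following is a minimal sketch of an LLM-as-a-judge scoring call, assuming the OpenAI Python SDK; the judge model identifier, the rubric wording, and the JSON output format are illustrative assumptions rather than the exact configuration used in this study.

```python
# Minimal LLM-as-a-judge sketch (assumption: OpenAI Python SDK, not the study's exact setup).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the candidate summary of the
source document for accuracy and comprehensiveness, and reply with JSON only:
{{"accuracy": <number>, "comprehensiveness": <number>}}

Source document:
{document}

Candidate summary:
{summary}"""

def judge_summary(document: str, summary: str, judge_model: str) -> dict:
    """Score one model-generated summary with the judge model."""
    response = client.chat.completions.create(
        model=judge_model,  # hypothetical placeholder for the judge model used in the study
        temperature=0,      # deterministic judging for reproducibility
        response_format={"type": "json_object"},  # request parseable JSON output
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(document=document, summary=summary),
        }],
    )
    return json.loads(response.choices[0].message.content)
```

Averaging such per-summary scores over the test set would then yield aggregate figures of the kind reported above.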