Computational Linguistics,
Journal Year:
2023,
Volume and Issue:
50(1), P. 237 - 291
Published: Dec. 12, 2023
Abstract
Large language models (LLMs) are capable of successfully performing many language processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify and explain social phenomena like persuasiveness and political ideology, then LLMs could augment the computational social science (CSS) pipeline in important ways. This work provides a road map for using LLMs as CSS tools. Towards this end, we contribute a set of prompting best practices and an extensive evaluation pipeline to measure the zero-shot performance of 13 language models on 25 representative English CSS benchmarks. On taxonomic labeling tasks (classification), LLMs fail to outperform the best fine-tuned models but still achieve fair levels of agreement with humans. On free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowdworkers’ gold references. We conclude that the performance of today’s LLMs can augment the CSS research pipeline in two ways: (1) serving as zero-shot data annotators on human annotation teams, and (2) bootstrapping challenging creative generation tasks (e.g., explaining the underlying attributes of a text). In summary, LLMs are poised to meaningfully participate in social science analysis in partnership with humans.
Nature,
Journal Year:
2023,
Volume and Issue:
620(7972), P. 172 - 180
Published: July 12, 2023
Abstract
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
IEEE/CAA Journal of Automatica Sinica,
Journal Year:
2023,
Volume and Issue:
10(5), P. 1122 - 1136
Published: May 1, 2023
ChatGPT, an artificial intelligence generated content (AIGC) model developed by OpenAI, has attracted world-wide attention for its capability of dealing with challenging language understanding and generation tasks in the form of conversations. This paper briefly provides an overview of the history, status quo and potential future development of ChatGPT, helping to provide an entry point to think about ChatGPT. Specifically, from the limited open-access resources, we conclude the core techniques of ChatGPT, mainly including large-scale language models, in-context learning, reinforcement learning from human feedback and the key technical steps for developing ChatGPT. We further analyze the pros and cons of ChatGPT and rethink the duality of ChatGPT in various fields. Although it has been widely acknowledged that ChatGPT brings plenty of opportunities for various fields, mankind should still treat and use ChatGPT properly to avoid the potential threat, e.g., academic integrity and safety challenge. Finally, we discuss several open problems as the potential development of ChatGPT.
ACM Transactions on Intelligent Systems and Technology,
Journal Year:
2024,
Volume and Issue:
15(3), P. 1 - 45
Published: Jan. 23, 2024
Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Journal Year:
2022,
Volume and Issue:
unknown
Published: June 1, 2022
With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning (a recent trend in NLP) to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively-tuned manual prompts. In our study we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector). Compared to CoOp's static prompts, our dynamic prompts adapt to each instance and are thus less sensitive to class shift. Extensive experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset; and yields stronger domain generalization performance as well. Code is available at https://github.com/KaiyangZhou/CoOp.
arXiv (Cornell University),
Journal Year:
2023,
Volume and Issue:
unknown
Published: Jan. 1, 2023
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
arXiv (Cornell University),
Journal Year:
2021,
Volume and Issue:
unknown
Published: Jan. 1, 2021
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero and all prompts are available at https://github.com/bigscience-workshop/promptsource.
We study the staggered introduction of a generative AI-based conversational assistant using data from 5,179 customer support agents. Access to the tool increases productivity, as measured by issues resolved per hour, by 14 percent on average, with the greatest impact on novice and low-skilled workers, and minimal impact on experienced and highly skilled workers. We provide suggestive evidence that the AI model disseminates the potentially tacit knowledge of more able workers and helps newer workers move down the experience curve. In addition, we show that AI assistance improves customer sentiment, reduces requests for managerial intervention, and improves employee retention.
Journal of University Teaching and Learning Practice,
Journal Year:
2023,
Volume and Issue:
20(2)
Published: Jan. 1, 2023
This paper explores the academic integrity considerations of students’ use of Artificial Intelligence (AI) tools using Large Language Models (LLMs) such as ChatGPT in formal assessments. We examine the evolution of these tools, and highlight the potential ways that LLMs can support the education of students in digital writing and beyond, including the teaching of writing and composition, the possibilities of co-creation between humans and AI, supporting EFL learners, and improving Automated Writing Evaluations (AWE). We describe and demonstrate the potential that these tools have in creating original, coherent text that can avoid detection by existing technological methods of detection and trained academic staff alike, demonstrating a major academic integrity concern related to the use of these tools by students. Analysing the various issues related to academic integrity that LLMs raise for both Higher Education Institutions (HEIs) and students, we conclude that it is not the student use of any AI tools that defines whether plagiarism or a breach of academic integrity has occurred, but whether any use is made clear by the student. Deciding whether any particular use of LLMs by students can be defined as academic misconduct is determined by the academic integrity policies of any given HEI, which must be updated to consider how these tools will be used in future educational environments.