Using hash instead of timestamps to check for changes.


Using hash instead of timestamps to check for changes.

Glen Stark
Greetings.

I hope that this is the correct forum for this question.  As a quick
search on Google will verify, there's quite a bit of interest in being
able to have Make use a hash to check if a new build is required, as
opposed to a timestamp.

I searched the mailing lists regarding this issue, and found a rather
old patch suggestion, but have not been able to determine if a decision
was made regarding whether to implement this feature.  Nor have I been
able to find relevant information on the Make project website, so I'm
asking you Make maintainers if you see this as a desirable
functionality, or at least worthy of discussion.  I can say that it
would benefit my colleagues and me enormously, and I'm willing to attempt
the implementation if you agree it would be a feature worth adding.

So to start with:

Is this planned?  Has the idea already been rejected, and if so could
you point me to the discussion so I can inform myself?

If it is planned, or you agree it's worth doing, how can I help?  I'm
willing to write the code if someone is willing to help me find my way into
the code a little.  Until now I've been only a user of Make, not a maintainer, and
would need some tips about how to fit the functionality into the overall
design of Make.  Someone to bounce ideas off, and direct questions to
would be wonderful.  If someone else is working on it already, I'd like
to help however I can -- testing, debugging, etc.

Thank you for your time.

Glen Stark




_______________________________________________
Bug-make mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/bug-make

Re: Using hash instead of timestamps to check for changes.

Yukimasa Sugizaki
Hi.

Note that I am not a GNU Make developer.

GNU Make does not currently implement update checking based on differences in file content.
(I suspect it won't be implemented, because timestamp-based update checking is Make's normal behavior.)

If you want to check for updates that way,
you can do so with GNU Make's $(shell) function, like this:


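all: prog

# At parse time, touch the flag file unless src.c is identical to the copy
# saved by the last build.  cmp -s prints nothing, so $(shell) expands to
# an empty string.
$(shell cmp -s src.c.prev src.c 2>/dev/null || touch src.c.flg)

prog: src.c.flg
	$(CC) src.c $(OUTPUT_OPTION)
	cp src.c src.c.prev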


Regards.

On 2015/03/27 22:42, Glen Stark <[hidden email]> wrote:


Re: Using hash instead of timestamps to check for changes.

Paul Smith-20
In reply to this post by Glen Stark
On Fri, 2015-03-27 at 14:42 +0100, Glen Stark wrote:
> Is this planned?  Has the idea already been rejected, and if so could
> you point me to the discussion so I can inform myself?

There is no formal planning around it right now, and it's not at the top
of my TODO list for GNU make.

> If it is planned, or you agree it's worth doing, how can I help?  I'm
> willing to write the code if someone is willing to help me find my way into
> the code a little.  Until now I've been only a user of Make, not a maintainer, and
> would need some tips about how to fit the functionality into the overall
> design of Make.  Someone to bounce ideas off, and direct questions to
> would be wonderful.  If someone else is working on it already, I'd like
> to help however I can -- testing, debugging, etc.

I'm not aware of anyone working on it.  It sounds like a simple thing,
but actually there are a lot of issues that need to be considered before
any implementation can be started.  The important thing to remember is
that currently make is completely stateless... or rather, it uses the
filesystem to maintain its state (in the form of modification times).
Any change to a method of determining "out-of-date-ness" such as a hash
of the file content means introducing a separate state that make has to
maintain: this adds a lot of complexity and corner cases to work
through.

Before anyone can consider writing code of this magnitude, they should
familiarize themselves with the FSF's requirements for contributing to
the GNU project; you'll need to assign copyright to the FSF for the work
contributed to GNU make, which involves some legal paperwork on your
part.  If your employer has rights to your work (which most do, at least
in the U.S., even if you don't do the work on the job), your employer
will have to agree as well.

On the technical side, there are various things to consider:
      * What form will the extra state be kept in?  One file per
        directory?  One file per target?  Something else?
      * If we use one file per target things are simpler, although that
        adds up to a LOT of files in bigger builds and some platforms
        might have problems.
      * If we use one file per directory, there are lots of issues:
              * When is the file written?  Every time a target is
                updated?  Once at the end of the build?
              * How will make handle the state file if it's killed in
                the middle of a build?
              * How will make handle missing/corrupted state files?
                Will it fall back on modification times, or just rebuild
                everything?
              * How do we handle recursion, where multiple instances of
                make could be running in the same directory?
      * We need to consider platform-specific issues; for example on
        UNIX systems a cheap/fast method of keeping per-file metadata
        might be to make a symbolic link containing the data, but that
        won't work on Windows or VMS, etc.
      * What type of extra state will we use?  My suspicion is that
        md5sum is not the best.  We don't really need it: we want
        fingerprinting not a cryptographic hash.  We don't even need to
        do de-dup so we won't run into the birthday paradox: we only
        want to know if the file has changed since the last time we saw
        it.  Probably a straightforward, well-distributed hash like
        xxhash would be sufficient.  If you combine both mod time AND
        the hash that's pretty definitive; you can probably get away
        with a 32-bit hash.
      * What are the performance implications?  You're committing to
        having make read the entire content of every single file
        involved in the build into memory, just to decide what to
        update!  That's definitely going to hurt: a simple "nothing to
        do" build will suffer a big performance penalty.  In fact, in a
        way the fewer jobs make needs to run the slower it will be,
        since it will have to check the hash of every target where the
        mod time doesn't give an answer.  Maybe the hashing could be
        done per-block instead of on the entire file so you could fail
        faster, or something.  But now you're storing more state per
        target (multiple hashes per target).
      * Do we really need to hash the file?  Maybe simply expanding the
        current checking is sufficient.  For example, if in addition to
        mod time we also considered the size of the file (and maybe
        other things maintained by the filesystem like inode, for tools
        which don't just overwrite the same file) we could increase our
        accuracy WITHOUT resorting to a separate state file.  Is that
        good enough?
      * What if people want to define their own "out-of-date-ness" test?
        Maybe someone wants to integrate with inotify, or they want to
        check the preprocessor output so that files are not considered
        changed just because a comment changes, or something.
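
The per-block idea in the list above can be sketched in POSIX shell. This is only an illustration under stated assumptions: `cksum` (a 32-bit CRC) stands in for a fast fingerprint such as xxhash, and `BLOCK_SIZE`, `block_sums` and `changed_blockwise` are hypothetical names, not anything make provides.

```sh
# Hypothetical block size in bytes; a real tool would tune this.
BLOCK_SIZE=${BLOCK_SIZE:-65536}

# block_sums FILE: print one "<crc> <length>" line per BLOCK_SIZE block.
block_sums() {
    size=$(wc -c < "$1")
    size=$((size))
    i=0
    while [ $((i * BLOCK_SIZE)) -lt "$size" ]; do
        dd if="$1" bs="$BLOCK_SIZE" skip="$i" count=1 2>/dev/null | cksum
        i=$((i + 1))
    done
}

# changed_blockwise FILE SUMS: exit 0 if FILE differs from the recorded
# SUMS, 1 if it matches.  Reading stops at the first block that differs.
changed_blockwise() {
    i=0
    while IFS= read -r recorded; do
        current=$(dd if="$1" bs="$BLOCK_SIZE" skip="$i" count=1 2>/dev/null | cksum)
        [ "$current" = "$recorded" ] || return 0
        i=$((i + 1))
    done < "$2"
    # Every recorded block matched; FILE changed only if it has grown.
    extra=$(dd if="$1" bs="$BLOCK_SIZE" skip="$i" count=1 2>/dev/null | wc -c)
    [ "$((extra))" -gt 0 ]
}
```

Note the trade-off the list already hints at: the early exit only helps for files that did change. An unchanged file still has to be read in full, and the state per file grows to one line per block.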



Re: Using hash instead of timestamps to check for changes.

Paul Smith-20
On Fri, 2015-03-27 at 11:45 -0400, Paul Smith wrote:
>       * Do we really need to hash the file?  Maybe simply expanding the
>         current checking is sufficient.  For example, if in addition to
>         mod time we also considered the size of the file (and maybe
>         other things maintained by the filesystem like inode, for tools
>         which don't just overwrite the same file) we could increase our
>         accuracy WITHOUT resorting to a separate state file.  Is that
>         good enough?

Actually I typed faster than my brain: we still need a state file of
course to compare sizes.  But at least it's still based on filesystem
metadata and doesn't require make to hash the contents of every file in
the build.



Re: Using hash instead of timestamps to check for changes.

Glen Stark

Hello Paul

Sorry to take so long to reply.  I wanted to think your input over, and
I've had a pretty heavy load lately.

Signing over the copyright, and any other legal steps won't be a
problem.  My company has no rights to work I do in my own time.  I'm
mainly worried about the technical issues, and finding the time to do
the work.  Until now I've been pretty happy to let Make run in the
background, and haven't put a lot of thought into how it works.
Obviously that will have to change.

I'd like to thank you for your thoughtful response.  I'm gratified that
you took the time to engage in a technical analysis, and start the ball
rolling on the design discussion.  The points you raised merit thought
and discussion.

After reading over your mail a couple of times, I realized that I hadn't
thought things through very well.  In fact, rather than saying "hash
instead of time", I should have said "optional additional hash check
when timestamp has changed".  I think this fixes all the performance
concerns, and opens the door to adding additional checks (like the
is-a-comment-only check), which I think is an exciting idea.

Here are my additional thoughts:


1 Maintaining the state.
========================

  Your point about Make not maintaining any external state beyond what
  the filesystem tracks is well made.  I'm reluctant to add the extra
  complexity of tracking extra state, and it's clear to me that this
  will likely be the source of some "Oh, I hadn't thought of that"
  moments.  But in this case I think the benefit is worth the cost.


2 Adding additional "is-changed" checks.
========================================

  You asked "what if people want to define their own "out-of-date-ness"
  test?".  I found that a really exciting idea.  As I thought about
  this, I realized that what I really want is not to replace Make's current
  behavior, but to add an additional check to the existing timestamp
  check.

  My thinking is that the timestamp is in fact an overly conservative
  test.  We never have the case that the timestamp indicates something
  *has not* been changed when in fact it has (i.e. we always build if
  something has changed), but we do have the issue that builds are
  performed unnecessarily, causing an undue performance penalty -- the
  cost of building the target and its dependants.  Thus we get a big
  build-time win whenever the additional test takes less time than
  building the target and its dependants.

  I think it's very important that Make remain reliable from the point
  of view that if something *should* be built, it *will* be built.
  Unnecessarily rebuilding something is less of a failure than failing to
  rebuild something which should be.

  So I propose modifying Make to accept a tool to perform additional
  checks, the first being a hash checker.  Any additional checkers
  should have the property that while they may return a false positive,
  they never return a false negative (they never incorrectly say no,
  nothing important was changed).

  We need only specify the interface of that tool, and people can write
  tools which satisfy their needs -- I'm interested in exploring the
  hash tool first, but might be interested in making further such
  'plugins', and projects with special needs could specify their own.
  Very exciting.

  As I see it, like this, the project becomes a way of simplifying the
  syntax of Yukimasa Sugizaki's suggestion, and officially supporting
  that workflow.

  My off-the-cuff suggestion for the interface of the external tool
  would be a simple executable, returning 0 if no rebuild is needed, 1 if
  one is needed, and perhaps other number(s) for error cases.  This
  strikes me as having several advantages -- the biggest being the
  flexibility it offers Make users.  For the case where users want to
  apply multiple additional criteria requiring state, this could be
  done in a single file.

  The only downside I see is the performance cost of starting and
  terminating the executable, but I'm assuming this will be small in
  comparison to the file-access operations, and non-existent compared
  to the cost of unnecessary builds.  I guess the relevant benchmark will
  be the increase in clean build time, which I imagine will be negligible
  for most real cases.
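
One way the proposed exit-code convention could compose across several checkers is sketched below. This is an assumption-laden illustration, not a design anyone in the thread agreed on: `run_checkers` is a hypothetical name, and the rule that any single exit status of 0 may veto the rebuild is inferred from the stated property that a checker never wrongly reports "nothing changed". Errors are simply treated as "rebuild".

```sh
# run_checkers TARGET [CHECKER...]
# Meant to be called only after the timestamp test has said "out of date".
# Each CHECKER is run as "CHECKER TARGET" and follows the proposed
# convention: exit 0 = no rebuild needed, 1 = rebuild needed, other = error.
# Returns 0 (skip the rebuild) as soon as one checker exits 0; otherwise
# returns 1, so an error or a missing checker can never suppress a build.
run_checkers() {
    target=$1
    shift
    for checker in "$@"; do
        if "$checker" "$target"; then
            return 0    # a checker vouches that nothing material changed
        fi
    done
    return 1            # nobody vetoed: keep the timestamp's verdict
}
```

Listing cheap checkers first keeps the common case fast, since the loop stops at the first veto.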


3 One file per target
=====================

  - The issues you raised regarding one-file-per-directory are tricky
    and would significantly slow development.  I especially think the
    concurrency issues would be nice to avoid, at least in a first
    iteration.
  - One file per target would mean an approximately factor-of-2 increase
    in the number of build targets.  Not beautiful, but only systems which
    are already approaching their limits would be affected.  These
    systems could continue using the default Make (timestamp-based)
    behavior.
  - This somehow seems more consistent with Make's current behavior to
    me, which in turn seems lower risk.
  - I don't have any better ideas.
  - Projects or teams for which 2n build targets is impractical can use
    the default, timestamp-only behavior.


4 What kind of state?
=====================

  Based on the performance and reliability of Git, I'm inclined to
  suggest using SHA-1, stored on a one-file-per-target basis.  To start
  with I think making it a text file is reasonable.  I'm unfamiliar with
  xxhash, but I'm open to trying anything.  With the right
  implementation it should be trivial to evaluate a few possibilities.
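
A minimal sketch of the text-file approach, assuming `sha1sum` (GNU coreutils) or `shasum` is installed. The `.sha1` suffix and the function names are hypothetical, and for simplicity the sketch keeps one record per tracked file rather than settling what "per target" should contain.

```sh
# sha1_of FILE: print FILE's SHA-1 as 40 hex digits.
sha1_of() {
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum < "$1" | cut -d' ' -f1
    else
        shasum -a 1 < "$1" | cut -d' ' -f1
    fi
}

# record_hash FILE: store FILE's current hash in the text file FILE.sha1.
record_hash() {
    sha1_of "$1" > "$1.sha1"
}

# hash_unchanged FILE: exit 0 if FILE still matches its recorded hash,
# 1 if it differs or no usable record exists (erring towards a rebuild).
hash_unchanged() {
    [ -f "$1.sha1" ] && [ "$(sha1_of "$1")" = "$(cat "$1.sha1")" ]
}
```

Because only `sha1_of` knows which hash is in use, swapping in another fingerprint to compare candidates means changing that one function.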


5 Performance implications
==========================

  As mentioned earlier, if we change the goal from replacing the
  time-stamp to supplementing the time-stamp, I think a lot of the
  performance implications fall away.  The 'nothing-to-do' build will
  remain unchanged.

  The worst-case scenario, I think, is a full build where no hashes
  have yet been written.  As long as hash generation and file saving are
  negligible compared to build time, that should be no problem.  In the
  use cases I deal with on a daily basis (building big ugly C++ files),
  this will be easily satisfied.  If you can think of some good
  test cases where this might not be satisfied, let me know, and I'll
  run some benchmarks.  Again though, if we keep the timestamp as the
  default, projects can decide based on their circumstances whether the
  tradeoff is worthwhile.

  Per-block hashing sounds like a good idea as a later optimization, if
  we or someone else determine it would be valuable.  To start with I
  would keep it simple.
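
The "supplement, don't replace" ordering described in this section can be sketched as follows. The names are hypothetical, and the checker is assumed to follow the exit-code convention proposed earlier in this message (exit 0 when nothing material changed). The point of the sketch is that the extra check costs nothing when timestamps already say the target is up to date.

```sh
# timestamp_stale TARGET PREREQ: the classic test; exit 0 if TARGET is
# missing or older than PREREQ.
timestamp_stale() {
    [ ! -e "$1" ] || [ -n "$(find "$2" -newer "$1")" ]
}

# needs_rebuild TARGET PREREQ CHECKER: exit 0 if TARGET should be rebuilt.
# CHECKER is run as "CHECKER PREREQ", and only when timestamps disagree.
needs_rebuild() {
    timestamp_stale "$1" "$2" || return 1   # up to date by mtime: no extra cost
    if "$3" "$2"; then
        return 1                            # checker vetoes the rebuild
    fi
    return 0
}
```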


6 Next steps
============

  My tentative suggestion, depending on your next feedback, is to do
  something like the following:

  - Determine a syntax for makefiles to specify which additional checks
    (and perhaps in what order) should be performed.  I think this should
    be easy to use for one additional test, but open to adding
    additional tests later.  It should be easy for Makefile generators
    (like autotools and cmake) to take advantage of.  I could see using
    an environment variable, but I could also imagine being able to
    steer the behavior on a Makefile-by-Makefile or target-by-target
    basis.  I ask for input from the experts here.
  - Hash out in rough strokes how the call would be made -- my ad-hoc
    approach would be a separate executable with an integer return value
    indicating needs-rebuild, doesn't-need, or error, but again I ask
    for input from the experts.

  If that sounds reasonable, I should probably start poking around the
  Make codebase so I can get started at some point.

  Again, many thanks for your time,

  Glen Stark

On Fri, 2015-03-27 at 11:48 -0400, Paul Smith wrote:

> On Fri, 2015-03-27 at 11:45 -0400, Paul Smith wrote:
> >       * Do we really need to hash the file?  Maybe simply expanding the
> >         current checking is sufficient.  For example, if in addition to
> >         mod time we also considered the size of the file (and maybe
> >         other things maintained by the filesystem like inode, for tools
> >         which don't just overwrite the same file) we could increase our
> >         accuracy WITHOUT resorting to a separate state file.  Is that
> >         good enough?
>
> Actually I typed faster than my brain: we still need a state file of
> course to compare sizes.  But at least it's still based on filesystem
> metadata and doesn't require make to hash the contents of every file in
> the build.
>




Re: Using hash instead of timestamps to check for changes.

Edward Welbourne-2
> After reading over your mail a couple of times, I realized that I hadn't
> thought things through very well.  In fact, rather than saying "hash
> instead of time", I should have said "optional additional hash check
> when timestamp has changed".

Even so, I'm unclear about why "hash" is the thing you want here.  You
anticipate saving lots of time on builds, presumably when immaterial
changes get ignored, or when the only change is to a timestamp.  (The
latter could be fixed by touch -t if it's really important.)  The
situations I've seen where that felt like it might happen have been
where some intermediate files often don't change in response to changes
in the files from which they're generated, much as a change to a comment
doesn't change the result of compiling code.

Some colleagues wrote tools with the superficially nice behaviour that,
when about to write a file, they would check to see whether it was
changed from what's already on disk; if it was unchanged, they would not
overwrite the file.  This saved regenerating files dependent on the
output file; but had the drawback that the file would stay out of date
relative to those on which *it* depended, so got remade every time we
ran make (once an irrelevant change had happened upstream).

The problem with any "is this change material" check, to evade doing
downstream build steps, is that you have to do the check on every make
run, once there is a maybe-material change present that it's saving you
from responding to.  You can use a timestamp check as a cheap pre-test
to that (file hasn't changed since last time, so can't contain a
material change) but once it *has* saved you doing some downstream work,
you are doing some checking that you must repeat each time make runs.
Something depends on something that's newer, somewhere in your
dependency tree, forcing make to re-run some of your rules, albeit these
work out that they should do a no-op.

My ideal solution to this would be to have an extra timestamp as part of
the file-system's meta-data: "up to date at" as distinct from "created"
and "modified".  (To make it generic, rather than make-specific, I'd
probably call it "validated" or some such.)  If we had this, make could
compare it, on each generated file, with "modified" on its
prerequisites; a file is out of date if a prerequisite has been modified
since it was up to date.  When regenerating a file, we could then see
whether it has changed; if it hasn't, we leave "modified" alone and
update "up to date" to the present; otherwise, we over-write the file
and change both.  I think this would do most of what I suspect you
really want.  However, file-systems don't have an extra time-stamp for
us to use in this way, so we can't do this.

Of course, we could abuse the existing time-stamps to achieve this; I
find "created" an almost useless datum - many tools create a new file to
replace the old one when "modifying", renaming the new one on success,
so the old version's creation time is forgotten and "create" is mostly
synonymous with "modify".  If we could assume that of all tools, we
could then use "creation" time as "modified" in the above and use
"modified" time as the "up to date at" time and all would swim nicely.
However, I suppose some tools really do modify files in place, so would
leave "created" unchanged while revising "modified"; so I doubt this
scheme would fly (and it *is* an abuse of the defined time-stamps).  I
dare say others can think of other problems with it.

I spent a few hours trying to work out how to fake this up with a
secondary file whose "modified" time-stamp serves as "up-to-date" for
the primary it represents.  It might contain a hash or other meta-data
as you describe.  For files fully under make's control (generated files)
this looks feasible - albeit there's a mess of details to sort out -
without needing to regenerate the secondary on every make run; it just
gets generated when make creates the primary (or when make finds it has
mysteriously vanished since last run).  However, source files get
randomly hacked about by users and version control systems, so would
still need their secondaries reevaluated at least whenever the source is
newer than its secondary - as discussed above.
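
The secondary-file scheme described above can be sketched in shell. The `.validated` suffix and the function names are hypothetical; the sidecar's own "modified" time plays the role of the "up to date at" stamp, so the primary file's timestamp is never touched by the bookkeeping.

```sh
# stamp_of FILE: name of the sidecar whose mtime means "FILE was up to
# date at this time".
stamp_of() {
    printf '%s.validated\n' "$1"
}

# is_stale TARGET PREREQ...: exit 0 if TARGET needs to be looked at again,
# i.e. the target or its stamp is missing, or some PREREQ was modified
# after the stamp was last refreshed.
is_stale() {
    target=$1
    shift
    stamp=$(stamp_of "$target")
    [ -e "$target" ] && [ -e "$stamp" ] || return 0
    for prereq in "$@"; do
        [ -z "$(find "$prereq" -newer "$stamp")" ] || return 0
    done
    return 1
}

# mark_validated TARGET: record that TARGET is up to date as of now,
# without changing TARGET's own "modified" time.
mark_validated() {
    touch "$(stamp_of "$1")"
}
```

A rule would call `mark_validated` after regenerating a target whether or not the content changed. That is what stops the "remade on every run" loop for generated files, while hand-edited sources still need rechecking whenever they are newer than their stamp.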

As long as (primary) files can be modified without the meta-data you
want being updated in parallel (as a file-system time-stamp would be), I
think you are doomed to having to regenerate your meta-data more often
than you anticipate, which I suspect shall eat up all the hoped-for
benefits of saving some build steps when they're redundant.

        Eddy.


RE: Using hash instead of timestamps to check for changes.

Martin Dorey-3
> I spent a few hours trying to work out how to fake this up with a
> secondary file whose "modified" time-stamp serves as "up-to-date" for
> the primary it represents.

I imagine we're not alone, but perhaps an existence proof would have some value: we have generic makefile code that provides this service for parts of our build system that generate source code.  Instead of writing rules, the application makefile author writes variables and the generic code $(eval)s up the rules.  Each rule's command gets run in a subdirectory that's uniquely named for each run, from where any changed results are atomically renamed over the targets they replace.  This gives us automatic safety against unsynchronized concurrent builds and protection against fail-stops where make is unable to remove the half-overwritten target.  The code's spattered around a long-feeped build system, hacked for portability and legacy cases and rife with excessively long bash commands, but it's not actually that large and it certainly didn't involve any changes to make.  While the cursory documentation says "Higher order code, though, is always more difficult to debug", I don't think that's been much of a problem.  It keeps the application makefiles simple, clear and compliant with Dorey's first rule of writing makefiles that contain rules: "don't".

-----Original Message-----
From: bug-make-bounces+martin.dorey=[hidden email] [mailto:bug-make-bounces+martin.dorey=[hidden email]] On Behalf Of Edward Welbourne
Sent: Thursday, April 02, 2015 10:49
To: [hidden email]
Cc: [hidden email]
Subject: Re: Using hash instead of timestamps to check for changes.


Re: Using hash instead of timestamps to check for changes.

Paul Smith-20
In reply to this post by Glen Stark
On Thu, 2015-04-02 at 13:20 +0200, Glen Stark wrote:

>   You asked "what if people want to define their own "out-of-date-ness"
>   test?".  I found that a really exciting idea.  As I thought about
>   this, I realized that what I really want is not to replace Make's current
>   behavior, but to add an additional check to the existing timestamp
>   check.
>
>    My thinking is that the timestamp is in fact an overly conservative
>   test.  We never have the case that the timestamp indicates something
>   *has not* been changed when in fact it has (i.e. we always build if
>   something has changed),

That's interesting, because in my experience the main reason people are
upset about timestamps these days is the exact opposite: with the
increase in capabilities of systems, in particular larger build servers,
it is possible to have situations where targets are updated too quickly
to reliably determine out-of-date-ness based solely on timestamps.
Filesystems which support sub-second modified time stamping mitigate the
issue somewhat, but not completely, and not all users can use these
filesystems.

At the same time, it's rare that I (at least) modify the timestamp on a
file unless I've changed it.  Sure, sometimes it might happen (mostly by
accident) but this is rare enough to not be a big problem.  And as you
point out, this is annoying in that it could result in extra rebuilds,
but it's safe: it's much more significant to have the problem that make
decides NOT to rebuild things which DO need to be rebuilt.

For targets which OFTEN have timestamps incorrectly updated (say, for
example, autogenerated files which end up not changing) there are
well-defined methods for dealing with this, already used by autoconf,
etc.: they just generate the file to a temporary location, compare it,
and only replace the target if it's really different.
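
The generate, compare, and conditionally replace approach described above could be sketched roughly as follows.  This is only a minimal Python illustration, not what autoconf itself does internally; the helper name `replace_if_changed` and the file name in the usage note are made up for the example.

```python
import filecmp
import os
import tempfile


def replace_if_changed(target: str, new_content: bytes) -> bool:
    """Write new_content to target only if it differs from what is there.

    Returns True if the target was (re)written, False if it was left
    alone and therefore kept its old timestamp.
    """
    directory = os.path.dirname(os.path.abspath(target))
    # Generate into a temporary file next to the target so the final
    # rename stays on the same filesystem.  (mkstemp creates the file
    # with restrictive permissions; a real tool would also fix the mode.)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_content)
        # shallow=False forces a byte-by-byte comparison of the contents.
        if os.path.exists(target) and filecmp.cmp(tmp_path, target, shallow=False):
            return False  # identical: keep the old file and its mtime
        os.replace(tmp_path, target)
        tmp_path = None  # the temporary file has become the target
        return True
    finally:
        # Clean up the temporary file if it was not moved into place.
        if tmp_path is not None and os.path.exists(tmp_path):
            os.remove(tmp_path)
```

A generator rule would call something like `replace_if_changed("config.h", generated_bytes)`, so that anything depending on the file sees a new timestamp only when the content really changed.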

Possibly your environment has a higher-than-normal incidence of this,
for some reason, but maybe thinking about ways to address that situation
might be simpler?

I'm not saying that alternative methods of "file changed" detection are
not interesting to me, but it's a big, big problem to address in a
holistic way.

>   So I propose modifying Make to accept a tool to perform additional
>   checks, the first being a hash checker.  Any additional checkers
>   should have the property that while they may return a false positive,
>   they never return a false negative (they never incorrectly say no,
>   nothing important was changed).

I don't agree with this.  You are looking at this in only one direction:
how to avoid builds when timestamps indicate they should happen but
other, specialized results would show that the build is not needed.

But in fact we already know that our current timestamp model is
insufficient in the opposite direction: how to know that a build is
needed, even though a timestamp says it's not.  Any new support should
make it possible to help with that problem as well, which is in a way
much more serious.

>   My off-the-cuff suggestion for the interface of the external tool
>   would be a simple executable, returning 0 if no rebuild is needed, 1 if
>   one is needed, and perhaps another number(s) for error cases.

You haven't specified the INPUT to this tool.  What does "a rebuild is
needed" mean?  Are you suggesting that make would invoke this tool with
targets and prerequisites and ask the tool to decide whether the targets
are out of date?  Or are you suggesting that the tool would take one
file as an argument and determine whether that file has been updated
since the last time make was run?
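
To make the second reading concrete, here is a rough Python sketch of a checker that takes one file and reports whether it differs from what was recorded last time, using the 0 / 1 / other exit-code convention proposed above.  Everything specific in it is an assumption made for illustration: the `.sum` sidecar file, the choice of CRC-32, and the function names.  Nothing like this exists in make today.

```python
import zlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """CRC-32 of the file contents as 8 hex digits (an arbitrary choice)."""
    crc = 0
    with path.open("rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc:08x}"


def check(path: Path) -> int:
    """Return 0 if unchanged since last recorded, 1 if changed, 2 on error."""
    record = path.with_name(path.name + ".sum")  # hypothetical sidecar file
    try:
        current = file_checksum(path)
    except OSError:
        return 2
    previous = record.read_text().strip() if record.exists() else None
    if previous == current:
        return 0
    # Simplification: the new checksum is recorded immediately.  A real
    # design would probably record it only after a successful rebuild.
    record.write_text(current + "\n")
    return 1
```

A command-line wrapper would pass the result of `check(Path(argv[1]))` to `sys.exit`.  Note that this sketch needs one extra record file for every file it is asked about.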

>   The only downside I see is the performance cost of starting and
>   terminating the executable, but I'm assuming this will be small in
>   comparison to the file-access operations, and non-existent compared
>   to the cost of unnecessary builds.  I guess the relevant benchmark will
>   be increase in clean build time, which I imagine will be negligible for
>   most real cases.

Another option is to take advantage of the loadable object and/or Guile
support capabilities in newer versions of make.  Or some combination.

>   - One file per target would mean approximately factor 2 increase in
>     the number of build targets.  Not beautiful, but only systems which
>     are already approaching their limits would be affected.  These
>     systems could continue using the default Make (timestamp based)
>     behavior.

Well, it's not clear what you are defining as a "target" here.  Remember
that for your model to work it must keep records for not just the files
people typically think of as targets (.o files, libraries, etc.) but
also every prerequisite: so every .c, .h, etc. file.  That's basically
doubling the number of files in a built version of your source tree.

> 4 What kind of state?
> =====================
>
>   Based on the performance and reliability of GIT, I'm inclined to
>   suggest using SHA1 stored in a one-file-per-target basis.  To start
>   with I think making it a text file is reasonable.  I'm unfamiliar with
>   xxhash, but I'm open to trying anything.  With the right
>   implementation it should be trivial to evaluate a few possibilities.

I recommend against a cryptographically secure algorithm like SHA.
First, it's slow (comparatively speaking).  Second, its output is large,
per file.  And finally, it's just not needed.  Git has excellent reasons
for wanting this, but none of them apply in this situation.  A simple,
well-distributed hashing function will be significantly faster and the
resulting value much smaller, and it will be just as reliable for what
you want, which is just to know if the file is different than it was
before.
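
As a small illustration of the size difference, the Python sketch below compares a SHA-1 digest with a CRC-32 checksum from the standard library.  CRC-32 is used here only as a readily available stand-in for "a simple hashing function"; it is not a recommendation of that particular function, and the sample inputs are made up.

```python
import hashlib
import zlib

old = b"int main(void) { return 0; }\n"
new = b"int main(void) { return 1; }\n"  # same length, one byte differs

# Cryptographic hash: 20 bytes (40 hex digits) to store per file.
sha1_digest = hashlib.sha1(old).digest()

# Non-cryptographic checksum: a single unsigned 32-bit integer per file.
crc_old = zlib.crc32(old)
crc_new = zlib.crc32(new)


def changed(a: bytes, b: bytes) -> bool:
    """Report whether two versions differ, judged by checksum alone."""
    return zlib.crc32(a) != zlib.crc32(b)


print(len(sha1_digest), "bytes for SHA-1")
print(crc_old.bit_length() <= 32, "CRC-32 fits in 32 bits")
print(changed(old, new))
```

Both are enough to notice that a file is different from before; the checksum simply carries none of the tamper resistance that Git relies on.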


Finally, Eddy Welbourne's followup has this critical observation:

On Thu, 2015-04-02 at 17:48 +0000, Edward Welbourne wrote:
> The problem with any "is this change material" check, to evade doing
> downstream build steps, is that you have to do the check on every make
> run, once there is a maybe-material change present that it's saving you
> from responding to.  You can use a timestamp check as a cheap pre-test
> to that (file hasn't changed since last time, so can't contain a
> material change) but once it *has* saved you doing some downstream work,
> you are doing some checking that you must repeat each time make runs.

This is an excellent point and needs to be considered.  Suppose that
computing the hash takes 1/10th the time of doing the compile.  That
means that after 10 builds of your system the cumulative time of those
builds is actually LARGER than if you'd just bitten the bullet and
rebuilt it the first time.
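
The arithmetic can be worked through in a few lines of Python.  This is a sketch under the assumptions stated above (the hash costs 1/10th of a compile and is recomputed on every make run while the unchanged-but-touched file is present); the function names are invented for the example.  Under those assumptions the repeated checks break even with a single rebuild at the tenth run and cost more from the eleventh run on.

```python
import math
from fractions import Fraction


def cumulative_check_cost(runs: int, check_cost: Fraction) -> Fraction:
    """Total time spent re-checking over `runs` make runs.

    The unit is the time of one compile, so a result of 1 means the
    checks have cost as much as simply rebuilding once.
    """
    return runs * check_cost


def break_even_runs(check_cost: Fraction) -> int:
    """Smallest number of runs at which checking has cost at least one compile."""
    return math.ceil(1 / check_cost)


hash_cost = Fraction(1, 10)  # hash takes 1/10th of the compile time
for runs in (1, 5, 10, 11):
    print(runs, cumulative_check_cost(runs, hash_cost))
print("break-even after", break_even_runs(hash_cost), "runs")
```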



Re: Using hash instead of timestamps to check for changes.

Tim Murphy-4
On 4 April 2015 at 19:11, Paul Smith <[hidden email]> wrote:
>>    My thinking is that the timestamp is in fact an overly conservative
>>   test.  We never have the case that the timestamp indicates something
>>   *has not* been changed when in fact it has (i.e. we always build if
>>   something has changed),
>
> That's interesting, because in my experience the main reason people are
> upset about timestamps these days is the exact opposite: with the
> increase in capabilities of systems, in particular larger build servers,
> it is possible to have situations where targets are updated too quickly
> to reliably determine out-of-date-ness based solely on timestamps.

That may be a use case, but I've not experienced it more than once, IIRC.
What I have often experienced is things which are automatically generated
or copied and which trigger huge rebuilds for no ultimate purpose.

One obvious use for customisable "uptodateness" would be the trick used in
the Meson build system, where there's no need to relink some code when a
library changes unless the API of the library has changed.  This is
exceedingly powerful because linking is a slow activity and it reads a lot
of files.
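
The idea can be sketched in Python as below.  This is not Meson's actual implementation: it assumes a shared library, assumes the exported symbol names have already been extracted by some other tool, and treats "the API" as nothing more than that set of names (a real check would have to consider more than names).  The function name and the symbol lists are made up.

```python
def needs_relink(old_symbols, new_symbols) -> bool:
    """Relink only if the set of exported symbol names has changed.

    Order and duplicates are ignored; only the names themselves count.
    """
    return frozenset(old_symbols) != frozenset(new_symbols)


before = ["lib_init", "lib_run", "lib_close"]

# The library was rebuilt with changed function bodies but the same exports.
after_body_change = ["lib_close", "lib_init", "lib_run"]

# The library gained a new exported function.
after_api_change = before + ["lib_reset"]

print(needs_relink(before, after_body_change))
print(needs_relink(before, after_api_change))
```

With a timestamp-only rule, both rebuilds of the library would force every dependent program to be relinked.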

Regards,

Tim


--
You could help some brave and decent people to have access to uncensored news by making a donation at:

http://www.thezimbabwean.co.uk/friends/


Re: Using hash instead of timestamps to check for changes.

Eric Melski-2
On 04/04/2015 11:38 AM, Tim Murphy wrote:

>     >    My thinking is that the timestamp is in fact an overly conservative
>     >   test.  We never have the case that the timestamp indicates something
>     >   *has not* been changed when in fact it has (i.e. we always build if
>     >   something has changed),
>
>     That's interesting, because in my experience the main reason people are
>     upset about timestamps these days is the exact opposite: with the
>     increase in capabilities of systems, in particular larger build servers,
>     it is possible to have situations where targets are updated too quickly
>     to reliably determine out-of-date-ness based solely on timestamps.
>
>
> That may be a use case but I've not experienced it more than once, IIRC,
> but I've often  experienced things which are automatically generated or
> copied and which
> trigger huge rebuilds for no ultimate purpose.

This problem is relatively common when using an SCM system that
preserves *checkin* time on files rather than *checkout* time.
ClearCase does this in various configurations, and Perforce will if your
client spec has "modtime" set.  I'm sure other SCM systems can be setup
this way too.

In such environments, the timestamp on a file may not change *enough*
even when the content has changed.  For example, suppose the following:

1.  I have file foo.c with time A.
2.  My teammate modifies foo.c and checks in at time B.
3.  I build in my workspace, without sync'ing, generating foo.o with time C.
4.  I sync my workspace, updating foo.c but leaving it with time B.
5.  I try to rebuild but foo.o is not updated because C is later than B.

ClearCase's "clearmake" utility handles this, of course.  The problem
occurs frequently enough that we added a feature to Electric Make called
the "ledger" expressly to allow us to notice this type of out-of-dateness.
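
The five steps above can be replayed in a few lines of Python to show why the timestamp comparison misses the change while a recorded content checksum catches it.  This is only an illustration of the scenario, not a description of how clearmake or the Electric Make ledger work; the times, file contents, and function names are invented.

```python
import zlib

# Arbitrary stand-in times; only the ordering A < B < C matters.
A, B, C = 100, 200, 300


def stale_by_mtime(prereq_mtime: int, target_mtime: int) -> bool:
    """The usual rule: rebuild if the prerequisite is newer than the target."""
    return prereq_mtime > target_mtime


def stale_by_record(prereq_sum: int, recorded_sum: int) -> bool:
    """Alternative rule: rebuild if the prerequisite's content checksum
    differs from the one recorded when the target was last built."""
    return prereq_sum != recorded_sum


old_source = b"old contents of foo.c"  # my copy, with time A
new_source = b"new contents of foo.c"  # teammate's version, checked in at B

# Step 3: build from the old source; foo.o gets time C.
foo_o_mtime = C
recorded_sum = zlib.crc32(old_source)

# Step 4: sync; foo.c now has the new contents but carries checkin time B.
foo_c_mtime = B
foo_c_sum = zlib.crc32(new_source)

# Step 5: the timestamp rule sees B < C and skips the rebuild.
print("stale by mtime: ", stale_by_mtime(foo_c_mtime, foo_o_mtime))
print("stale by record:", stale_by_record(foo_c_sum, recorded_sum))
```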

Regards,

Eric Melski
Chief Architect
Electric Cloud, Inc.



Re: Using hash instead of timestamps to check for changes.

David Boyce-3
In reply to this post by Glen Stark
It should be noted that there are published methods for hash reliance
within the current syntax, e.g.
http://www.cmcrossroads.com/article/rebuilding-when-files-checksum-changes.

I haven’t ever tried this, in fact I haven’t even read the article in
detail, but you might want to play with it before undertaking to
enhance make.

-David


Re: Using hash instead of timestamps to check for changes.

Enrico Weigelt, metux IT consult
In reply to this post by Eric Melski-2
On 07.04.2015 00:17, Eric Melski wrote:

Hi,

> This problem is relatively common when using an SCM system that
> preserves *checkin* time on files rather than *checkout* time.

I'd consider that a misbehaviour of the SCM (IMHO, that's the reason
why Git does not track the mtime). From the filesystem perspective,
the mtime represents the time when the actual file was changed in the
filesystem. So, resetting the mtime from some SCM repo actually is
tricking the filesystem - pretty obvious that the mtime then isn't
reliable anymore.

> ClearCase
> does this in various configurations, and Perforce will if your client
> spec has "modtime" set.  I'm sure other SCM systems can be setup this
> way too.

The correct solution is to configure the SCM correctly (so it does not
artificially manipulate the mtime).


cu
--
Enrico Weigelt,
metux IT consulting
+49-151-27565287


Re: Using hash instead of timestamps to check for changes.

Tim Murphy-4


On 11 April 2015 at 16:38, Enrico Weigelt, metux IT consult <[hidden email]> wrote:
> On 07.04.2015 00:17, Eric Melski wrote:
>
>> ClearCase
>> does this in various configurations, and Perforce will if your client
>> spec has "modtime" set.  I'm sure other SCM systems can be setup this
>> way too.
>
> The correct solution is to configure the SCM correctly (so it does not
> artificially manipulate the mtime)


I always thought the correct solution was whatever you were able to do
that works most reliably.  You don't always get to tell the company how
to run its SCM system.

It's not like we are "build ingénues".  There's a lot of software out
there, and a lot of build problems some of us bump into, including things
one would never contemplate until one actually had that problem oneself.
After a lot of miserable experiences we come here to mention the things
we think would have helped us, got us out of some problem, or allowed us
to do a better job.  I can think of a range of issues which customised
handling of up-to-dateness would make much easier, both conceptually and
practically, not to mention the benefit of potentially being able to
apply it to existing builds without rewriting them.

Regards,

Timothy Murphy



--
You could help some brave and decent people to have access to uncensored news by making a donation at:

http://www.thezimbabwean.co.uk/friends/


Re: Using hash instead of timestamps to check for changes.

Eric Melski-2
In reply to this post by Enrico Weigelt, metux IT consult
On 04/11/2015 08:38 AM, Enrico Weigelt, metux IT consult wrote:

> On 07.04.2015 00:17, Eric Melski wrote:
>
> Hi,
>
>> This problem is relatively common when using an SCM system that
>> preserves *checkin* time on files rather than *checkout* time.
>
> I'd consider that a misbehaviour of the SCM (IMHO, that's the reason
> why Git does not track the mtime). From the filesystem perspective,
> the mtime represents the time when the actual file was changed in the
> filesystem. So, resetting the mtime from some SCM repo actually is
> tricking the filesystem - pretty obvious that the mtime then isn't
> reliable anymore.

It doesn't necessarily require "resetting" or "manipulating" the mtime
at all.  ClearCase has its own filesystem, MVFS, which simply behaves
the way I described.  In any case, it's not an utterly irrational
position to consider the last modification time of a file tracked by the
version control system to be the time that a new revision of the file
was created.  That *is* the last modification time, after all.

However, this is not the place for this type of philosophical debate.
The question at hand is whether or not make would benefit from a
non-timestamp-based notion of "up-to-dateness", and the answer seems to
be clearly yes, for this and other reasons mentioned elsewhere in this
thread.

Regards,

Eric Melski
Chief Architect
Electric Cloud, Inc.
http://blog.melski.net/



Re: Using hash instead of timestamps to check for changes.

Daniel Herring-2
In reply to this post by Glen Stark
Hi all,

There's been a lot of good discussion on this.  Extensibility, avoiding
unnecessary rebuilds, preventing missed rebuilds, ensuring checks don't
get re-run every time, etc.

Those are hard issues to get right.


Persistent storage of the non-timestamp checks is another issue that has
been mentioned.  Here are three ideas that might help.

http://en.wikipedia.org/wiki/Extended_file_attributes

http://git-scm.com/book/en/v2/Git-Internals-Git-Objects

https://github.com/apenwarr/redo#how-does-redo-store-dependencies



- Daniel
