GNU Make 4.2 Query


GNU Make 4.2 Query

nikhil jain
Hi GNU Team,

I have a query regarding the GMAKE code. I have been working on GMAKE for a
while and wanted to know if there is a way to find, at the very beginning,
the number of rules that will be run. I want to send this info to my server
so that it knows that make will run that many rules.

The server then tells the user what percentage of the make job is done.

Please let me know; this is an urgent requirement for me. Let me know about
the functions or variables where I can find how many targets have to be
executed.

Thanks a lot,

Nikhil Jain

Re: GNU Make 4.2 Query

Paul Smith-20
On Fri, 2019-05-31 at 08:27 +0530, nikhil jain wrote:
> I have a query regarding the GMAKE code. I have been working on GMAKE
> for a while and wanted to know if there is a way to find the number of
> rules to be run at the very beginning?

I already replied to your Savannah bug on this topic.  See:

https://www.mail-archive.com/bug-make@.../msg11420.html

If anyone else has some ideas or you want to discuss further we can do
that here.



Re: GNU Make 4.2 Query

Paul Smith-20
I'm adding back the mailing list.  Please always keep it on the CC list
at least (or just reply to the list directly: I am subscribed).

On Fri, 2019-05-31 at 21:16 +0530, nikhil jain wrote:

> Yes sure. Thanks for your response.
>
> I see there is an option, make -n, which lists the rules to build.
> As you said, it would need a huge change in the GMAKE code. I was
> actually surprised that there is no way to know how many rules are
> left. I was trying to build a server which would let the user know
> what percentage of their job is complete. I feel this is a basic
> feature from the end user's point of view. I have actually implemented
> remote-stub.c, which comes as a stub in the MAKE code. If there is a
> way to find how many rules are left, that would be really great.
> Can't we have a single command which just runs -n first and then
> starts building? It could save the total number of rules in a global
> variable which I could access from remote_setup(), which is called
> just once before make starts to execute jobs remotely.

I'm not interested in implementing this but if someone else wanted to
(along with appropriate copyright assignment, regression tests, etc.)
and the implementation was reasonable, I'd consider it.

Otherwise, it will have to be done through some external scripting and
the results provided to make by, for example, setting a variable on the
command line:

  make NUMRECIPES=$(make -n | count-recipes.sh)

You can query this variable in your code.  Writing count-recipes.sh
probably has to rely on a knowledge of the recipes being run, if you
want to count the number of targets being built.
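
A very rough sketch of such a script (count-recipes.sh is hypothetical,
and the assumption here is that the commands worth counting all invoke a
known set of tools) might be:

  #!/bin/sh
  # count-recipes.sh (hypothetical): count the lines of 'make -n' output
  # that invoke the tools we treat as one unit of work each.
  # The gcc/g++/ld pattern is only an illustration; adjust to your build.
  grep -cE '^[[:space:]]*(gcc|g[+][+]|ld)[[:space:]]'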

Note that make -n might sometimes actually run some rules: it will run
any rule which is marked as a recursive make.  It will also run any
rules needed to update included makefiles.  So, that has to be taken
into consideration.



Re: GNU Make 4.2 Query

nikhil jain
OK, thanks for your valuable suggestion.
Also, will you consider a remote-stub.c implementation, or have you left
it for users to write their own implementation for remotely executing a
rule?

There can be multiple implementations, like rsh, ssh, or TCP/IP, for
remotely executing a job. Which one do you prefer? I have done all 3 of
them, with excellent results.
The build is faster when I spread the rules across multiple machines in
the cloud rather than executing on a single machine.

In this, I do not care about the -j option, as I have kept 'make' open to
execute as many rules as it can (a small tweak in job.c around
job_slots_used and job_slots).

Anyway, thanks for your response. I will find a way to use -n (which is
also not quite what I want, since it still runs some rules, but still...).

If you decide to implement my feature in the near or later future, just
let me know. I will be glad.

Thank you so much.


Re: GNU Make 4.2 Query

nikhil jain
Hi again,

Does GMAKE have a retry option? If a command in a rule fails, is there
an option to retry it, or do I have to implement it?
Waiting for a response.

Thanks
Nikhil


Re: GNU Make 4.2 Query

Paul Smith-20
On Fri, 2019-08-30 at 16:37 +0530, nikhil jain wrote:
> Does GMAKE have a retry option? If a command in a rule fails, is
> there an option to retry it, or do I have to implement it?

GNU make has no built-in facility to retry failed commands.



Re: GNU Make 4.2 Query

nikhil jain
Ok, thanks.


Re: GNU Make 4.2 Query

Kaz Kylheku (gmake)
Hi Nikhil,

Try using a "macro":

# $(1) is retry count: integer constant in shell test syntax
# $(2) is command
#
define retry-times
i=0; while [ $$i -lt $(1) ]; do ($(2)) && exit 0; i=$$(( i + 1 )); done; exit 1
endef

target: prerequisite
	$(call retry-times,5,your command here)


Fixed retries version:

# $(1) is command
#
define retry-3-times
($(1)) && exit 0 || ($(1)) && exit 0 || ($(1)) && exit 0 || exit 1
endef

target: prerequisite
	$(call retry-3-times,your command here)

We insert the command in parentheses not just for syntactic hygiene but
to get it to run in a subshell. This way our command can contain  
subcommands like "exit 1" without terminating the retry logic.
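
As a hypothetical usage sketch (the target name and the command are
placeholders, not from your build):

  # Retry a known-flaky fetch up to 5 times before giving up.
  fetch-data:
  	$(call retry-times,5,rsync remote:/data/file .)

One caveat: $(call) splits its arguments on commas, so a command that
itself contains commas has to be hidden behind a variable first.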

Cheers ...


Re: GNU Make 4.2 Query

nikhil jain
Hi,

Thanks for your reply.

Actually, this is not possible in my case. The builds are run by the R&D
teams, who use a legacy GNUmakefile. I can't force them to change their
ways, and there are around 10k+ commands in the makefile.
I would rather implement the retry feature in GMAKE, as I did the remote
execution feature. Thanks for your concern.

Just a point - I think a retry feature is really needed in GMAKE; it
would be very useful for everybody.
Remote execution, I understand, can be implementation-specific, so you
left it as a stub. But retry SHOULD be part of GMAKE.

Let me know whenever this feature can be embedded in GMAKE.

make -e 2

Something like that, which retries a failed command 2 times. Or even
better -

make -e 2 -t 4

Retry 2 times with a 4-second delay between retries. It would help in
resolving NFS issues like -

all:
    mkdir <dir>
    cd <dir>

In my implementation these 2 commands will run on 2 different systems (as
I implemented the remote execution facility in GMAKE). All my execution
hosts are on NFS. Sometimes it takes time to sync (a second or two), so
the second command fails.

That's why I need a retry mechanism. Let me know if you plan to make it
part of some future release.

Thanks
Nikhil Jain

Re: GNU Make 4.2 Query

Kaz Kylheku (gmake)
On 2019-09-01 23:04, nikhil jain wrote:
> Hi,
>
> Thanks for your reply.
>
> Actually, this is not possible in my case. The builds are run by the
> R&D teams, who use a legacy GNUmakefile. I can't force them to change
> their ways, and there are around 10k+ commands in the makefile.
> I would rather implement the retry feature in GMAKE, as I did the
> remote execution feature. Thanks for your concern.

Oh, you want to put in a feature for GLOBALLY retrying every command
in every recipe?

That's just a non-starter; I can't imagine such a thing would ever
be upstreamed.

One reason I suspect it would not be upstreamed is that the behavior
can be achieved with an alternative shell interpreter, which can be
specified via the SHELL variable. Whatever you put into SHELL is the
interpreter used for all recipe commands. Look into it in the
documentation.

If your R&D team would allow you to add just one line to the legacy
GNUmakefile to assign the SHELL variable, you can point it at a shell
wrapper program which performs command retrying.
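
A minimal sketch of such a wrapper (untested; the name and the retry
count of 3 are arbitrary choices):

  #!/bin/sh
  # retry-sh: shell wrapper that retries a failing recipe line.
  # GNU make normally invokes the shell as: $(SHELL) -c '<recipe line>',
  # so we can simply re-run our arguments with the real shell.
  tries=3
  i=0
  while [ $i -lt $tries ]; do
      /bin/sh "$@" && exit 0
      i=$((i + 1))
  done
  exit 1

Point SHELL at the script and every recipe line gets the retry behavior.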


> Just a point - I think a retry feature is really needed in GMAKE;
> it would be very useful for everybody.
> Remote execution, I understand, can be implementation-specific, so
> you left it as a stub. But retry SHOULD be part of GMAKE.

Remote execution /per se/ should also be doable via SHELL. But how would
it work if the target and prerequisites are local objects? The purpose
of a recipe is to update a target when it's missing or older than at
least one of the prerequisites.

If you have a common namespace via NFS mounts or whatever, it can make
sense. Say that /path/to/foo/bar.c is visible on all hosts. You can
check locally whether /path/to/foo/bar.o needs updating, but run the
compiler on some remote host.

You will likely get "clock skew" issues, due to the remote host setting
the modification time stamp of the updated target using a clock that is
not exactly synchronized with the local machine that's actually
comparing timestamps.



Re: GNU Make 4.2 Query

nikhil jain
Hi,

Thanks for your reply.

1) Yes, retrying every command, but ONLY when it fails, which is less
than 1% of cases - really less than 0.5%. It is very, very rare.

The paths are visible to all the hosts and there are no clock issues.
Even if there are clock or path issues, I can debug that, run another
build, and remove that host from the list. That is something I can
check; it has nothing to do with the retry logic.

Sorry, R&D won't change their implementation. I cannot force them.

Neither targets nor prerequisites are local objects; they are available
on all the hosts. But, as I said, sometimes (very, very rarely) a task
can fail because the host is loaded and the NFS sync across hosts takes
more than the expected time.

I can't build a system keeping only the SUCCESS cases in mind.
The retry is needed even without the remote execution feature. Even on
the same host, if the build fails, then what? Run the build from
scratch? That is a waste of time and resources.


Re: GNU Make 4.2 Query

Paul Smith-20
In reply to this post by Kaz Kylheku (gmake)
On Sun, 2019-09-01 at 23:23 -0700, Kaz Kylheku (gmake) wrote:
> If your R&D team would allow you to add just one line to the
> legacy GNU Makefile to assign the SHELL variable, you can assign that
> to a shell wrapper program which performs command re-trying.

You don't have to add any lines to the makefile.  You can reset SHELL
on the command line, just like any other make variable:

    make SHELL=/my/special/sh

You can even override it only for specific targets using the --eval
command line option:

    make --eval 'somerule: SHELL := /my/special/sh'

Or, you can add '-f mymakefile.mk -f Makefile' options to the command
line to force reading of a personal makefile before the standard
makefile.
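
For example, a minimal personal makefile for this purpose (the filename
and shell path are placeholders, and this assumes the real Makefile does
not set SHELL itself) could be just:

  # mymakefile.mk: read first via 'make -f mymakefile.mk -f Makefile'.
  # It defines no targets, so the default goal still comes from Makefile.
  SHELL := /my/special/sh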

Clearly you can modify the command line, otherwise adding new options
to control a putative retry on error option would not be possible.

As for your NFS issue, another option would be to enable the .ONESHELL
feature available in newer versions of GNU make: that will ensure that
all lines in a recipe are invoked in a single shell, which means that
they should all be invoked on the same remote host.  This can also be
done from the command line, as above.  If your recipes are written well
it should Just Work.  If they aren't, and you can't fix them, then
obviously this solution won't work for you.
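
Since --eval text is parsed as makefile syntax, enabling it from the
command line should be as simple as (sketched, not tested here):

  make --eval '.ONESHELL:'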

Regarding changes to set re-invocation on failure, at this time I don't
believe it's something I'd be willing to add to GNU make directly,
especially not an option that simply retries every failed job.  This is
almost never useful (why would you want to retry a compile, or link, or
similar?  It will always just fail again, take longer, and generate
confusing duplicate output--at best).

The right answer for this problem is to modify the makefile to properly
retry those specific rules which need it.

I commiserate with you that your environment is static and you're not
permitted to modify it, however adding new specialized capabilities to
GNU make so that makefiles don't have to be modified isn't a design
philosophy I want to adopt.



Re: GNU Make 4.2 Query

nikhil jain
Something in your message sounds interesting.

What is this .ONESHELL?

If I have -

all:
    mkdir dir
    cd dir

then currently, in my remote execution design, these 2 commands execute
on different hosts.

So, will .ONESHELL make these 2 commands run on the same host?

Please reply as soon as you can. This would solve a few of my problems
and make my builds faster.

Thanks
Waiting for reply.


Re: GNU Make 4.2 Query

Paul Smith-20
On Mon, 2019-09-02 at 20:48 +0530, nikhil jain wrote:
> So currently, in my remote execution design, these 2 commands execute
> on different hosts.
>
> So, will .ONESHELL make these 2 commands run on the same host?

Well, I don't know how your remote execution design works.

However, .ONESHELL tells make to invoke the entire recipe in a single
shell invocation so I assume that will cause them all to be invoked on
a single remote host, yes.

See:

https://www.gnu.org/software/make/manual/html_node/One-Shell.html


Note, though, that this can break recipes if they are written in such a
way that they expect each line to be invoked in a separate shell.  For
one simple example:

  foo:
          cd foo && echo in foo
          cd bar && echo in bar

will work very differently with and without .ONESHELL.

More discussion is in the manual, above.

If your makefiles have these assumptions built into their recipes, and
you can't change them, you may not be able to take advantage of this.



Re: GNU Make 4.2 Query

nikhil jain
The remote exec design is quite simple: I just filled in remote-stub.c
with enough code to execute the commands remotely and get the status back.


Re: GNU Make 4.2 Query

David Boyce-3
In reply to this post by Paul Smith-20
I'm not going to address the remote execution topic since it sounds like
you already have the solution and are not looking for help. However, I do
have fairly extensive experience with the NFS/retry area so will try to
contribute there.

First, I don't think what Paul says:

> As for your NFS issue, another option would be to enable the .ONESHELL
> feature available in newer versions of GNU make: that will ensure that
> all lines in a recipe are invoked in a single shell, which means that
> they should all be invoked on the same remote host.

is sufficient. Consider the typical case of compiling foo.c to foo.o and
linking it into foo.exe. Typically, and correctly, those actions would be
in two separate recipes, which in a distributed-build scenario could run
on different hosts, so the linker may not find the .o file from a previous
recipe. Here .ONESHELL cannot help since they're different recipes.

In my day job we use a product from IBM called LSF (Load Sharing
F-something,
https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_welcome.html)
which exists to distribute jobs over a server farm (typically using NFS)
according to various metrics like load and free memory and so on. Part of
the LSF package is a program called lsmake
(https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_command_ref/lsmake.1.html)
which under the covers is a version of GNU make with enhancements to
enable remote/distributed recipes, and which also adds the
retry-with-delay feature Nikhil requested. Since GNU make is GPL, IBM is
required to make its package of enhancements available under GPL as well.
Much of it is not of direct interest to the open source community because
it's all about communicating with IBM's proprietary daemons, but their
retry logic could probably be taken directly from the patch. At the very
least, if retries were to be added to GNU make per se, it would be nice
if the flags were compatible with lsmake.

However, my personal belief is that retries are God's way of telling us to
think harder and better. Retrying (and worse, delay-and-retry) is a form of
defeatism which I call "sleep and hope". Computers are deterministic,
there's always a root cause which can usually be found and addressed with
sufficient analysis, etc. Granted there are cases where you understand the
problem but can't address it for administrative/permissions/business
reasons but that can't be known until the problem is understood.

NFS caching is the root cause of unreliable distributed builds, as you've
already described, but most or all of these issues can be addressed with a
less blunt instrument than sleep-and-retry. Even LSF engineers threw up
their hands and did retries but what we did here was take their patch,
which at last check was still targeted to 3.81, and while porting it to 4.1
added some of the cache-flushing strategies detailed below. This has solved
most if not all of our NFS sync problems. Caveat: most of our people still
use the LSF retry logic in addition, because they're not as absolutist as I
am and just want to get their jobs done (go figure), which makes it harder
to determine what percentage of problems are solved by cache flushing vs
retries but I'm pretty sure flushing has resolved the great majority of
problems.

One problem with sleep-and-hope is that there's no amount of time
guaranteed to be enough so you're just driving the incidence rate down, not
fixing it.

Since we were already working with a hacked version of GNU make we found it
most convenient to implement flushing directly in the make program but it
can also be done within recipes. In fact we have 3 different
implementations of the same NFS cache flushing logic:

1. Directly within our enhanced version of lsmake.
2. In a standalone binary called "nfsflush".
3. In a Python script called nfsflush.py.

The Python script is a lab for trying out new strategies but it's too slow
for production use. The binary is a faster version of the same techniques
for direct use in recipes, and that same C code is linked directly into
lsmake as well. Here's the usage message of our Python script:

$ nfsflush.py --help
usage: nfsflush.py [-h] [-f] [-l] [-r] [-t] [-u] [-V] path [path ...]

positional arguments:
  path             directory paths to flush

optional arguments:
  -h, --help       show this help message and exit
  -f, --fsync      Use fsync() on named files
  -l, --lock       Lock and unlock the named files
  -r, --recursive  flush recursively
  -t, --touch      additional flush action - touch and remove a temp file
  -u, --upflush    flush parent directories back to the root
  -V, --verbose    increment verbosity level

Flush the NFS filehandle caches of NFS directories.

Newly created files are sometimes unavailable via NFS for a period
of time due to filehandle caching, leading to apparent race problems.
See http://tss.iki.fi/nfs-coding-howto.html#fhcache for details.

This script forces a flush using techniques mentioned in the URL. It
can optionally do so recursively.

This always does an opendir/closedir sequence on each directory
visited, as described in the URL, because that's cheap and safe and
often sufficient. Other strategies, such as creating and removing a
temp file, are optional.

EXAMPLES:
    nfsflush.py /nfs/path/...

The most important thing is to read the URL given above and/or to google
for similar resources, of which there are many. While I'm not an NFS guru
myself, the summary of my understanding is that NFS caches all sorts of
things (metadata like atime/mtime, directory updates, etc.) with varying
degrees of aggression according to NFS vendor and internal configuration.
We've seen substantial variation between NAS providers such as NetApp,
EMC, etc., so much depends on whose NFS server you're using. However, the
NFS spec _requires_ that caches be flushed on a write operation, so all
implementations will do this.

Bottom line, the most common failure case is as mentioned above: foo.o is
compiled on host A and immediately linked on host B. The close() system
call following the final write() of foo.o on host A will cause its data to
be flushed. Similarly I *believe* the directory write (assuming foo.o is
newly created and not just updated) will cause the filehandle cache to be
flushed. Thus, after these two write ops (directory and file) the NFS
server will know about the new foo.o as soon as it's created.

The problem typically arises on host B because no write operation has taken
place there after foo.o was created on A so no one has told it to update
its caches and as a result it doesn't know foo.o exists and the link fails
with ENOENT. All the flushing techniques in the script above are attempts
to address this. One takeaway from all this is that even if you do retries,
a "dumb" retry is immeasurably enhanced by adding a flush. In other words
the most efficient retry formula in a distributed build scenario would be:

<recipe> || { flush; <recipe>; }

This never flushes a cache unless the first attempt fails. It presumes that
NFS implementors and admins know what they're doing and thus caching helps
with performance so it's not done unless needed. This is what we built into
our variant of lsmake. However, the same can also be done in the shell.
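
As a sketch of that shell variant (the wrapper name is made up, and using
'ls -f' as a stand-in for the opendir/closedir flush is an assumption
about your NFS client):

  #!/bin/sh
  # flush-retry-sh: run a recipe line; on failure, force a directory
  # read (the cheap opendir/closedir-style flush described below) and
  # try once more.  Make invokes this as: $(SHELL) -c '<recipe line>'.
  if /bin/sh "$@"; then
      exit 0
  fi
  ls -f . > /dev/null 2>&1    # read the directory entry to flush caches
  exec /bin/sh "$@"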

Details about implemented cache flushing techniques: the filehandle cache
is the biggest source of problems in distributed builds and the simplest
solution for it seems to be opening and reading the directory entry. Thus
our script and its parallel C implementation always do that. We've also
seen cases where forcing a directory write operation is required which the
-t, --touch option does. Sometimes you can't easily enumerate all
directories involved (vpath etc) so the recurse-downward (-r) and recurse
upward (-u) flags may be helpful though they (especially -u) may also be
overkill. The -f and -l options were added based on advice found on the net
but have not been shown to be helpful in our environment.

Some techniques may be of limited utility because they require write
and/or ownership privileges. For instance, I've seen statements that
umounts, even failed umounts, will force flushes. Thus a command like
"cd <dir> && umount $(pwd)" would have to fail, since the mount is busy,
but would flush as a side effect. However, I believe this requires root
privileges, so it is not helpful in the normal case.

In summary: although I don't believe in retries, if they're going to be
used I think they should be implemented in a shell wrapper program which
could be passed to make as SHELL=<wrapper> and the wrapper should use
flushing in addition to, or instead of, retries. We didn't do it that way
but I think our nfsflush program could just as well have been implemented
as say "nfsshell" such that "nfsshell [other-options] -c <recipe>" would
run the recipe along with added flushing and retrying options. I agree with
Paul that I see no reason to implement any of these features, retry and/or
flush, directly in make.

David


Re: GNU Make 4.2 Query

nikhil jain
Thanks for the detailed information.

I will see if I can use a shell wrapper program as you mentioned.

I used LSF a lot, for about 5 years, and I still use it: bsub, bjobs,
bkill, lim, sbatchd, mbatchd, etc. It is easy to understand and use.

lsmake - I do not want to use IBM's proprietary stuff.

Thanks for your suggestions.

Nikhil


Re: GNU Make 4.2 Query

David Boyce-3
I did not suggest using lsmake, I simply mentioned that we use it.

On Mon, Sep 2, 2019 at 11:04 AM nikhil jain <[hidden email]> wrote:

> Thanks for detailed information.
>
> I will see if I can use shell wrapper program as mentioned by you.
>
> I had used LSF a lot like for 5 years. I still use it.  bsub, bjobs.
> bkill, lim, sbatchd, mbatchd etc. it is easy to understand and use
>
> lsmake - I do not want to use IBM's proprietary stuff.
>
> Thanks for your suggestions.
>
> Nikhil
>
> On Mon, Sep 2, 2019 at 11:10 PM David Boyce <[hidden email]>
> wrote:
>
>> I'm not going to address the remote execution topic since it sounds like
>> you already have the solution and are not looking for help. However, I do
>> have fairly extensive experience with the NFS/retry area so will try to
>> contribute there.
>>
>> First, I don't think what Paul says:
>>
>> > As for your NFS issue, another option would be to enable the .ONESHELL
>> > feature available in newer versions of GNU make: that will ensure that
>> > all lines in a recipe are invoked in a single shell, which means that
>> > they should all be invoked on the same remote host.
>>
>> Is sufficient. Consider the typical case of compiling foo.c to foo.o and
>> linking it into foo.exe. Typically, and correctly, those actions would be
>> in two separate recipes which in a distributed-build scenario could run on
>> different hosts so the linker may not find the .o file from a previous
>> recipe. Here .ONSHELL cannot help since they're different recipes.
>>
>> In my day job we use a product from IBM called LSF (Load Sharing
>> F-something,
>> https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_welcome.html)
>> which exists to distribute jobs over a server farm (typically using NFS)
>> according to various metrics like load and free memory and so on. Part of
>> the LSF package is a program called lsmake (
>> https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_command_ref/lsmake.1.html)
>> which under the covers is a version of GNU make with enhancement to enable
>> remote/distributed recipes and also adds retry-with-delay feature Nikhil
>> requested). Since GNU make is GPL, IBM is required to make its package of
>> enhancements available under GPL as well. Much of it is not of direct
>> interest to the open source community because it's all about communicating
>> with IBM's proprietary daemons but their retry logic could probably be
>> taken directly from the patch. At the very least, if retries were to be
>> added to GNU make per se it would be nice if the flags were compatible with
>> lsmake.
>>
>> However, my personal belief is that retries are God's way of telling us
>> to think harder and better. Retrying (and worse, delay-and-retry) is a form
>> of defeatism which I call "sleep and hope". Computers are deterministic,
>> there's always a root cause which can usually be found and addressed with
>> sufficient analysis, etc. Granted there are cases where you understand the
>> problem but can't address it for administrative/permissions/business
>> reasons but that can't be known until the problem is understood.
>>
>> NFS caching is the root cause of unreliable distributed builds, as you've
>> already described, but most or all of these issues can be addressed with a
>> less blunt instrument than sleep-and-retry. Even the LSF engineers threw up
>> their hands and did retries, but what we did here was take their patch,
>> which at last check still targeted 3.81, and, while porting it to 4.1,
>> added some of the cache-flushing strategies detailed below. This has solved
>> most if not all of our NFS sync problems. Caveat: most of our people still
>> use the LSF retry logic in addition, because they're not as absolutist as I
>> am and just want to get their jobs done (go figure), which makes it harder
>> to determine what percentage of problems are solved by cache flushing vs.
>> retries, but I'm pretty sure flushing has resolved the great majority of
>> them.
>>
>> One problem with sleep-and-hope is that there's no amount of time
>> guaranteed to be enough, so you're just driving the incidence rate down,
>> not fixing the problem.
>>
>> Since we were already working with a hacked version of GNU make we found
>> it most convenient to implement flushing directly in the make program but
>> it can also be done within recipes. In fact we have 3 different
>> implementations of the same NFS cache flushing logic:
>>
>> 1. Directly within our enhanced version of lsmake.
>> 2. In a standalone binary called "nfsflush".
>> 3. In a Python script called nfsflush.py.
>>
>> The Python script is a lab for trying out new strategies but it's too
>> slow for production use. The binary is a faster version of the same
>> techniques for direct use in recipes, and that same C code is linked
>> directly into lsmake as well. Here's the usage message of our Python script:
>>
>> $ nfsflush.py --help
>> usage: nfsflush.py [-h] [-f] [-l] [-r] [-t] [-u] [-V] path [path ...]
>>
>> positional arguments:
>>   path             directory paths to flush
>>
>> optional arguments:
>>   -h, --help       show this help message and exit
>>   -f, --fsync      Use fsync() on named files
>>   -l, --lock       Lock and unlock the named files
>>   -r, --recursive  flush recursively
>>   -t, --touch      additional flush action - touch and remove a temp file
>>   -u, --upflush    flush parent directories back to the root
>>   -V, --verbose    increment verbosity level
>>
>> Flush the NFS filehandle caches of NFS directories.
>>
>> Newly created files are sometimes unavailable via NFS for a period
>> of time due to filehandle caching, leading to apparent race problems.
>> See http://tss.iki.fi/nfs-coding-howto.html#fhcache for details.
>>
>> This script forces a flush using techniques mentioned in the URL. It
>> can optionally do so recursively.
>>
>> This always does an opendir/closedir sequence on each directory
>> visited, as described in the URL, because that's cheap and safe and
>> often sufficient. Other strategies, such as creating and removing a
>> temp file, are optional.
>>
>> EXAMPLES:
>>     nfsflush.py /nfs/path/...
>>
>> The most important thing is to read the URL given above and/or to google
>> for similar resources, of which there are many. While I'm not an NFS guru
>> myself, the summary of my understanding is that NFS caches all sorts of
>> things (metadata like atime/mtime, directory updates, etc.) with varying
>> degrees of aggressiveness according to NFS vendor and internal
>> configuration. We've seen substantial variation between NAS providers such
>> as NetApp, EMC, etc., so much depends on whose NFS server you're using.
>> However, the NFS spec _requires_ that caches be flushed on a write
>> operation, so all implementations will do this.
>>
>> Bottom line, the most common failure case is as mentioned above: foo.o is
>> compiled on host A and immediately linked on host B. The close() system
>> call following the final write() of foo.o on host A will cause its data to
>> be flushed. Similarly I *believe* the directory write (assuming foo.o is
>> newly created and not just updated) will cause the filehandle cache to be
>> flushed. Thus, after these two write ops (directory and file) the NFS
>> server will know about the new foo.o as soon as it's created.
>>
>> The problem typically arises on host B: because no write operation has
>> taken place there after foo.o was created on A, nothing has told B to
>> update its caches, so it doesn't know foo.o exists and the link fails with
>> ENOENT. All the flushing techniques in the script above are attempts to
>> address this. One takeaway from all this is that even if you do retries, a
>> "dumb" retry is immeasurably enhanced by adding a flush. In other words,
>> the most efficient retry formula in a distributed-build scenario would be:
>>
>> <recipe> || { flush; <recipe>; }
>>
>> This never flushes a cache unless the first attempt fails. It presumes
>> that NFS implementors and admins know what they're doing and that caching
>> helps performance, so flushing is not done unless needed. This is what we
>> built into our variant of lsmake. However, the same can also be done in
>> the shell, as sketched below.
>>
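>> As a minimal sketch of that formula in portable shell (assuming a
>> hypothetical "nfsflush" command is on PATH; the names are illustrative,
>> not our production code):
>>
>>     # retry_with_flush CMD...: run CMD; if it fails, flush the NFS
>>     # caches of the current directory and try CMD exactly once more.
>>     retry_with_flush() {
>>         "$@" && return 0   # first attempt: no flush, no overhead
>>         nfsflush .         # flush only after a failure
>>         "$@"               # second and final attempt; its status is ours
>>     }
>>
>>     retry_with_flush cc -o foo.exe foo.o
>>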
>> Details about the implemented cache-flushing techniques: the filehandle
>> cache is the biggest source of problems in distributed builds, and the
>> simplest solution for it seems to be opening and reading the directory.
>> Thus our script and its parallel C implementation always do that. We've
>> also seen cases where forcing a directory write operation is required,
>> which the -t/--touch option does. Sometimes you can't easily enumerate all
>> directories involved (vpath etc.), so the recurse-downward (-r) and
>> recurse-upward (-u) flags may be helpful, though they (especially -u) may
>> also be overkill. The -f and -l options were added based on advice found
>> on the net but have not been shown to be helpful in our environment. A
>> plain-shell approximation of the core techniques is sketched below.
>>
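>> In plain shell the always-on readdir step and the -t/--touch step are
>> easy to approximate (a sketch; the touch step assumes you have write
>> permission in the directory):
>>
>>     dir=${1:-.}
>>     # opendir/readdir/closedir equivalent: cheap, safe, often sufficient
>>     ls -a "$dir" > /dev/null
>>     # forced directory write (like -t/--touch): create and remove a temp file
>>     tmp="$dir/.nfsflush.$$"
>>     : > "$tmp" && rm -f "$tmp"
>>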
>> Some techniques may be of limited utility because they require write
>> and/or ownership privileges. For instance, I've seen statements that
>> umounts, even failed umounts, will force flushes. Thus a command like "cd
>> <dir> && umount $(pwd)" would have to fail, since the mount is busy, but
>> would flush caches as a side effect. However, I believe this requires root
>> privileges, so it is not helpful in the normal case.
>>
>> In summary: although I don't believe in retries, if they're going to be
>> used, I think they should be implemented in a shell wrapper program that
>> could be passed to make as SHELL=<wrapper>, and the wrapper should use
>> flushing in addition to, or instead of, retries. We didn't do it that way,
>> but I think our nfsflush program could just as well have been implemented
>> as, say, "nfsshell", such that "nfsshell [other-options] -c <recipe>"
>> would run the recipe with the added flushing and retry options. I agree
>> with Paul: I see no reason to implement any of these features, retry
>> and/or flush, directly in make.
>>
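>> A rough sketch of what such a wrapper could look like (the nfsshell name
>> and the nfsflush call are assumptions, and real code would want option
>> parsing and a configurable retry count; make invokes "$(SHELL) -c <recipe>"
>> by default, so the wrapper only has to honor that calling convention):
>>
>>     #!/bin/sh
>>     # nfsshell: drop-in SHELL replacement that adds flush-and-retry.
>>     # Usage from make:  make SHELL=/path/to/nfsshell
>>     /bin/sh "$@" && exit 0   # first attempt, normal shell semantics
>>     nfsflush .               # flush NFS caches only on failure
>>     exec /bin/sh "$@"        # one retry; its exit status becomes ours
>>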
>> David
>>
>> On Mon, Sep 2, 2019 at 6:05 AM Paul Smith <[hidden email]> wrote:
>>
>>> On Sun, 2019-09-01 at 23:23 -0700, Kaz Kylheku (gmake) wrote:
>>> > If your R&D team would allow you to add just one line to the
>>> > legacy GNU Makefile to assign the SHELL variable, you can assign that
>>> > to a shell wrapper program which performs command re-trying.
>>>
>>> You don't have to add any lines to the makefile.  You can reset SHELL
>>> on the command line, just like any other make variable:
>>>
>>>     make SHELL=/my/special/sh
>>>
>>> You can even override it only for specific targets using the --eval
>>> command line option:
>>>
>>>     make --eval 'somerule: SHELL := /my/special/sh'
>>>
>>> Or, you can add '-f mymakefile.mk -f Makefile' options to the command
>>> line to force reading of a personal makefile before the standard
>>> makefile.
>>>
>>> Clearly you can modify the command line; otherwise, adding new options
>>> to control a putative retry-on-error feature would not be possible.
>>>
>>> As for your NFS issue, another option would be to enable the .ONESHELL
>>> feature available in newer versions of GNU make: that will ensure that
>>> all lines in a recipe are invoked in a single shell, which means that
>>> they should all be invoked on the same remote host.  This can also be
>>> done from the command line, as above.  If your recipes are written well
>>> it should Just Work.  If they aren't, and you can't fix them, then
>>> obviously this solution won't work for you.
>>>
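>>> For example, to turn it on from the command line (assuming GNU make
>>> 3.82 or later, where both .ONESHELL and --eval were introduced):
>>>
>>>     make --eval '.ONESHELL:'
>>>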
>>> Regarding changes to set re-invocation on failure, at this time I don't
>>> believe it's something I'd be willing to add to GNU make directly,
>>> especially not an option that simply retries every failed job.  This is
>>> almost never useful (why would you want to retry a compile, or link, or
>>> similar?  It will always just fail again, take longer, and generate
>>> confusing duplicate output--at best).
>>>
>>> The right answer for this problem is to modify the makefile to properly
>>> retry those specific rules which need it.
>>>
>>> I commiserate with you that your environment is static and you're not
>>> permitted to modify it; however, adding new specialized capabilities to
>>> GNU make so that makefiles don't have to be modified isn't a design
>>> philosophy I want to adopt.
>>>

Re: GNU Make 4.2 Query

nikhil jain
haha OK.

If I were you, I would have built the lsmake functionality into GMAKE
rather than pay IBM, lol.

Anyway, have a good day. :)

On Mon, Sep 2, 2019 at 11:45 PM David Boyce <[hidden email]> wrote:

> I did not suggest using lsmake, I simply mentioned that we use it.

Re: GNU Make 4.2 Query

nikhil jain
Hi,

I have a query. Sorry to bother you again.

Can you please tell me which function is called last when I send SIGINT
to a running make, or press Ctrl+C? I want to add some logic there.
Please help; this is urgent.

Thanks in advance.

Nikhil

On Mon, 2 Sep 2019, 23:49 nikhil jain, <[hidden email]> wrote:

> haha OK.
>
> If I were you, I would have built the lsmake functionality into GMAKE
> rather than pay IBM, lol.
>
> Anyway, have a good day. :)