Monday, February 20, 2012

Oracle 9i -> SQL Server 2005: Snapshot agent aborts suddenly uninterrupted

Hi - I hope some of you can help me with this problem.

I'm trying to run the snapshot agent for 77 published tables, but the agent stops suddenly uninterrupted. It stops in the middle of taking a snapshot of a table containing 81,719,531 rows. The error message is pasted below.

Message
2006-06-14 05:33:33.53 [89%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_21_21" (57200000 total rows copied)
2006-06-14 05:34:12.57 [89%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_21_21" (57300000 total rows copied)
2006-06-14 05:34:51.23 [89%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_21_21" (57400000 total rows copied)
2006-06-14 05:35:29.99 [89%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_21_21" (57500000 total rows copied)
2006-06-14 05:36:09.82 [89%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_21_21" (57600000 total rows copied)
2006-06-14 05:36:49.38 [89%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_21_21" (57700000 total rows copied)
2006-06-14 05:37:28.56 [89%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_21_21" (57800000 total rows copied)
2006-06-14 05:37:35.36 [89%] The replication agent was shut down in response to an abort request.

Does anyone know what is causing this error and how I can possibly solve it?

Best Regards,
JB Plum

Was SQLServerAgent stopped by any chance?

-Raymond

|||

If this problem reproduces consistently without any attempts to manually shutdown the agent (ctrl-c, logoff, job or SQLServerAgent shutdown etc), it would appear that there is a very bad memory corruption somewhere and we would very much like to track it down. It would be really helpful to us if you can come up with an isolated repro (the rows that are causing trouble) and post it here or contact our product support engineers who will collect the relevant information. Are you running 64bit version of SQL Server by any chance?

-Raymond

|||

The SQLServerAgent was not by any chance stopped. I've produced the error twice.

JB Plum

|||

As mentioned in another post, there have been no attempts to manually shut down the agent.

As a matter of fact, I actually am running the 64bit version of SQL Server - is it of any importance?

Finally; I'm going to do some other work the next few days on the rest of the published tables, and then I'll try to take a snapshot of only the one table that seem's to cause the error - If you wish I can get back to you with the result of that test?

Best regards,

JB Plum

|||

To be honest, the situation would make more sense to me if the snapshot agent crashes on you instead of printing out the abort message which comes from a rather deliberate code path. An interesting thing to try would be to run the snapshot agent from the command line and see if you can still reproduce the problem. The reason why 64bit may be important is that we have different code paths for communiticating the "shutdown handle" between SQLServerAgent and the snapshot agent due to different handle sizes on different architectures. Since a bug in that area was fixed for SP1, I would not be surprised if there are more issues lurking in that area.

-Raymond

|||

Before going off to widely exotic theories, I should probably ask if there are any funny messages in the sqlserveragent log. It is also possible that a timeout is triggered somewhere given the relatively large interval between different bcp messages (which you can reduce by specifying a smaller bcp batch size for the snapshot agent)

-Raymond

|||

Ok - I'll try to run the snapshot agent from the command line and get back to you with the result in a few days.

If I get you right, you are saying, that If I can run the snapshot agent from the command line without errors, opposite to the error I'm getting by running the agent from SQL Server Management Studio, then there might be a bug related to the 64bit version of SQL Server?

//JB Plum

|||

So I'll try to specify another bcp batch size for the snapshot agent also. Could any other parameters be relevant also? For instance the commit batch size?

JB Plum

|||

Reducing the bcp batch size will be sufficient, thanks for your help.

-Raymond

|||

Ok, now I've runned the snapshot agent with a BcpBatchSize of 10000 rows and I stille get the error (see below).

Message
2006-06-16 18:50:10.27 [30%] Bulk copied 10000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72860000 total rows copied)
2006-06-16 18:50:12.41 [30%] Bulk copied 10000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72870000 total rows copied)
2006-06-16 18:50:14.87 [30%] Bulk copied 10000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72880000 total rows copied)
2006-06-16 18:50:16.96 [30%] Bulk copied 10000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72890000 total rows copied)
2006-06-16 18:50:19.37 [30%] Bulk copied 10000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72900000 total rows copied)
2006-06-16 18:50:21.93 [30%] Bulk copied 10000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72910000 total rows copied)
2006-06-16 18:50:24.55 [30%] Bulk copied 10000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72920000 total rows copied)
2006-06-16 18:50:26.15 [30%] The replication agent was shut down in response to an abort request.

Do you have any other idea in this area of what could cause the error ?

Now I'll run the snapshot agent from the command line to see if this produces the same error as running the snapshot agent from the SQL Server Management Studio. Then I'll get back to you again, if you haven't any other ideas?

Best regards,

JB Plum

|||

JB, if you can run the snapshot agent to completion from the command line, chances are good that the problem is somewhere in the interfacing code between SQLServerAgent and the snapshot agent (which may only happen on 64bit platform). In any case, I can't really say anything for certain given that I can only rely on your description of the problem on these forum postings. I may sound like I am nagging, but 1) have you checked the SQLServerAgent log and see if there were anything unusual that happened when the snapshot agent abort? 2) Are you running a clustered instance of SQLServer[Agent]?

-Raymond

|||

Ok, Raymond - here's our current situation.

I've now runned the snapshot agent from the command line without any difference. I used the "-OutputVerboseLevel 2" parameter - below is a sample of the output from the errorfile.

2006-06-19 00:31:35.58 [30%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72000000 total rows copied)
2006-06-19 00:31:58.76 [30%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72100000 total rows copied)
2006-06-19 00:32:23.00 [30%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72200000 total rows copied)
2006-06-19 00:32:46.59 [30%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72300000 total rows copied)
2006-06-19 00:33:10.37 [30%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72400000 total rows copied)
2006-06-19 00:33:34.34 [30%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72500000 total rows copied)
2006-06-19 00:33:58.22 [30%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72600000 total rows copied)
2006-06-19 00:34:22.22 [30%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72700000 total rows copied)
2006-06-19 00:34:45.37 [30%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72800000 total rows copied)
2006-06-19 00:35:08.34 [30%] Bulk copied 100000 rows from "ORAMSSQLREPL"."HREPL_ARTICLE_77_77" (72900000 total rows copied)
2006-06-19 00:35:16.11 [30%] The replication agent was shut down in response to an abort request.

Yes - I have checked the SQL Server Agent Log and nothing unsual happens when the SQL Server Agent aborts.

No - I'm not running a clustered instance of SQL Server

Now I'm going to start the snapshot agent with the "-BcpBatchSize 1" parameter. Then I will have a look at the first row not copied to see if the data in that row could have something to do with the error.

How can I else get this error solved?

//JB Plum

|||

This is as strange as I have feared, the only way we can track down the root cause is to use the debugger. Are you running the agent through a terminal service\remote desktop session? From your previous outputs, it doesn't look like the problem is deterministic either, but if possible, it would be really helpful if you can isolate the set of rows causing the problem. Are there anything interesting in the system event logs?

-Raymond

|||

JB, if you are interested, sent a private email to rmak@.microsoft.com so I can send you a few bootleg binaries that you can try out in an attempt to isolate the problem.

-Raymond

No comments:

Post a Comment